Equality Constrained Optimization: Daniel P. Robinson
Daniel P. Robinson
Department of Applied Mathematics and Statistics
Johns Hopkins University
Outline
1 Introduction
Comments
- we have conflicting goals:
  - minimize the objective function f(x)
  - satisfy the constraints
- the methods we discuss may be extended to handle inequality constraints
- interior-point methods may be better for inequality constrained problems when the problem is large-scale and you only need to solve a single "one-off" problem
minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0    (2)

minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0    (3)
Figure: trajectory of quadratic penalty minimizers x(10⁻¹), x(10⁻²), x(10⁻³), x(10⁻⁴) approaching x∗ on the constraint c(x) = 0.
Algorithm 1 Basic quadratic penalty algorithm
1: Input µ0 > 0 and x0.
2: Set k ← 0 and x_0^S ← x0.
3: loop
4:   With initial guess x_k^S, compute the "approximate" minimizer xk as
       xk = argmin_{x ∈ Rⁿ} Φ(x; µk)
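The loop above can be sketched in a few lines. The following is an illustrative Python rendering, not the lecture's code: the inner "approximate" minimization is delegated to scipy's BFGS, and the toy problem, tolerances, and shrink factor are all assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import minimize

def quadratic_penalty(f, c, x0, mu0=1.0, shrink=0.1, tol=1e-5, max_outer=12):
    """Basic quadratic penalty loop (cf. Algorithm 1) for one scalar constraint:
    minimize Phi(x; mu) = f(x) + c(x)^2 / (2*mu), warm-starting each inner solve
    from the previous minimizer and then shrinking mu."""
    x, mu = np.asarray(x0, dtype=float), mu0
    for _ in range(max_outer):
        phi = lambda x, mu=mu: f(x) + c(x) ** 2 / (2.0 * mu)
        x = minimize(phi, x, method="BFGS").x  # the "approximate" inner minimization
        if abs(c(x)) < tol:
            break
        mu *= shrink
    return x

# toy problem (an assumption of this sketch, not from the slides):
# minimize x1 + x2 subject to x1^2 + x2^2 - 2 = 0, whose solution is (-1, -1)
f = lambda x: x[0] + x[1]
c = lambda x: x[0] ** 2 + x[1] ** 2 - 2.0
x_approx = quadratic_penalty(f, c, np.array([2.0, 1.0]))
```

Note how the constraint violation shrinks roughly in proportion to µ, consistent with the trajectory of minimizers x(µ) approaching x∗ as µ → 0.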
Lagrangian Hessian:
  H(x, π(x)) = H(x) − ∑_{i=1}^{m} πᵢ(x) ∇²xx cᵢ(x)
Theorem 2.1 (Convergence of the quadratic penalty method, Algorithm 1)
and J(x∗ ) has full rank, then x∗ satisfies the first-order necessary optimality conditions
for the problem
minimize_{x ∈ Rⁿ} f(x) subject to c(x) = 0
Comment
the “approximate” in Algorithm 1 may now be replaced by finding an xk such that
Figure: Plot of the quadratic penalty function Φ(x; µ) associated with f(x) = x + 1 and c(x) = (1/3)x³ − (3/2)x² + 2x for the penalty value µ = 10⁻². The minimizer on the "right" does not approximate any minimizer of the associated equality constrained problem.
We cannot avoid this (an unfortunate fact about all known penalty functions).
We can help mitigate (not prevent) this issue by using the approximate minimizer of Φ(x; µ) as an initial guess for minimizing Φ(x; µ̄) with 0 < µ̄ < µ.
Practical issue 2: ill-conditioned Newton systems for small µ
The Newton correction ∆x from x for finding a zero of ∇Φ(x; µ) is given by
  ∇²xx Φ(x; µ) ∆x = −∇x Φ(x; µ)    (6)
Limiting derivatives of Φ for small µ (roughly):
  ∇x Φ(x; µ) = g(x) − J(x)ᵀπ(x)    [moderate]
  ∇²xx Φ(x; µ) = H(x, π(x)) + (1/µ) J(x)ᵀJ(x) ≈ (1/µ) J(x)ᵀJ(x)    [moderate + large ≈ large]
Roughly speaking, and assuming that J(x∗) has full row rank, we have
  m eigenvalues ≈ λᵢ( J(xk)ᵀJ(xk) ) / µk
  n − m eigenvalues ≈ λᵢ( Z(xk)ᵀ H(x∗, y∗) Z(xk) )
where Z(xk) is an orthonormal basis for null(J(xk)), which implies that
  cond( ∇²xx Φ(xk; µk) ) = O(1/µk)
Concerns:
- we may not be able to solve (6) accurately.
- it may be difficult to minimize Φ(x; µ).
- this was blamed for the failure of early quadratic penalty methods.
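This O(1/µ) growth is easy to observe numerically. A small sketch, where the toy Jacobian and the choice of the identity for the "moderate" Lagrangian Hessian are both assumptions of this illustration:

```python
import numpy as np

# Assumed toy constraint c(x) = x1^2 + x2^2 - 2, so J(x) = [2*x1, 2*x2];
# the "moderate" Lagrangian Hessian H is taken to be the identity.
x = np.array([1.0, 1.0])
J = 2.0 * x.reshape(1, -1)

def penalty_hessian_cond(mu):
    # Hphi = H + (1/mu) * J^T J with H = I; its condition number grows like 1/mu
    return np.linalg.cond(np.eye(2) + (1.0 / mu) * (J.T @ J))

for mu in [1.0, 1e-2, 1e-4]:
    print(f"mu = {mu:g}: cond = {penalty_hessian_cond(mu):.3g}")
```

Each factor-of-100 reduction in µ inflates the condition number by roughly the same factor, which is exactly the concern about solving (6) accurately.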
Figure: Contours of the quadratic penalty function Φ(x; µ) for µ = 100, µ = 1, µ = 0.1, and µ = 0.01.
- the unconstrained minimizer of the quadratic penalty function seems to approach the solution of the equality constrained problem (3) as µ → 0.
- contours become "closer" as µ → 0.
- from an arbitrary initial guess, Φ(x; µ) becomes "difficult" to minimize as µ → 0.
- we have to follow the "valley" corresponding to the constraints.
Practical issue 3: minimizing Φ(x; µ) becomes more difficult as µ → 0
Under suitable assumptions on a constrained minimizer x∗ , there exists a trajectory
parameterized by µ of unconstrained minimizers of Φ(x; µ) that converge to x∗ .
Proof: It follows from the first-order optimality conditions for problem NEP that F(x∗, y∗, 0) = 0, where
  F(x, y, µ) := [ g(x) − J(x)ᵀy ]
                [ c(x) + µy     ]
The derivative matrix of F with respect to (x, y) is given by
  F′(x, y, µ) = [ H(x, y)   −J(x)ᵀ ]
                [ J(x)        µI   ]
Scaling the last m columns by −1, using the assumptions of this theorem, and Lemmas 5.4 and 5.5 from Lecture 01, we may conclude that F′(x∗, y∗, 0) is nonsingular; therefore, we may apply the Implicit Function Theorem.
It follows from the Implicit Function Theorem that there exists a unique differentiable
vector function (x(µ), y(µ)) that satisfies
To prove that x(µ) is a minimizer of Φ(x; µ) for µ > 0 sufficiently small, we must prove that
  ∇²xx Φ(x; µ) ≡ H(x(µ), y(µ)) + (1/µ) J(x(µ))ᵀJ(x(µ)) ≻ 0.
However, this result immediately follows from continuity assumptions, (9), and Debreu's result given as Lemma 5.5 of Lecture 01.
Practical issue 3: minimizing Φ(x; µ) becomes more difficult as µ → 0
Recall: we do not want to follow the constraint “valley” to reach the solution!
Taking limits as µ → 0, and using lim_{µ→0} c(x(µ)) = 0 and g(x∗) = J(x∗)ᵀy∗, gives
  0 = g(x∗) + J(x∗)ᵀJ(x∗)p = J(x∗)ᵀy∗ + J(x∗)ᵀJ(x∗)p = J(x∗)ᵀ( y∗ + J(x∗)p ).
Since J(x∗) has full row rank, J(x∗)ᵀ has full column rank, so that
  J(x∗)p = −y∗.
Therefore, if y∗ ≠ 0, we may conclude that J(x∗)p ≠ 0, i.e., that the tangent vector p to the trajectory of minimizers is not tangent to the constraint manifold.
Main point: the trajectory of minimizers does not approach x∗ along the "constraint valley" provided J(x∗) has full row rank and y∗ ≠ 0.
where µ > 0, which is a perturbation of the optimality conditions for problem NEP (1), which are given by
  g(x) − J(x)ᵀy = 0 and c(x) = 0.
A primal-dual path-following based method tracks the roots by using Newton’s Method
on (10) in the primal-dual variables (x, y). This leads to the primal-dual Newton system
Primal:
  ( H(x, π(x)) + (1/µ) J(x)ᵀJ(x) ) ∆xP = −( g(x) − J(x)ᵀπ(x) )

Primal-dual:
  ( H(x, y) + (1/µ) J(x)ᵀJ(x) ) ∆xPD = −( g(x) − J(x)ᵀπ(x) )

where
  π(x) = −c(x)/µ
Idea: shift the constraints so that the quadratic penalty function passes through the
minimizer x∗ for some positive value of µ (perhaps all µ sufficiently small).
Figure: the shifted constraint c(x) − s = 0 versus the original constraint c(x) = 0; the shifted problem is
  minimize f(x) subject to c(x) − s = 0
and its trajectory of penalty minimizers x(µ) passes through x∗.
Proof: Homework!
Suppose that
  ‖∇x Φ(xk; yk, µk)‖₂ ≤ εk for some {εk} satisfying lim_{k→∞} εk = 0,
and that xk converges to x∗ for which J(x∗) is full rank. It then follows that there exists y∗ such that
  lim_{k→∞} y_k^F = y∗ and g(x∗) = J(x∗)ᵀy∗.
Proof: The convergence {y_k^F} → y∗ := J⁺(x∗)g(x∗), for which g(x∗) = J(x∗)ᵀy∗, is exactly as for Theorem 2.1. Moreover, the definition of y_k^F implies that
It follows from this inequality and the assumptions of this theorem that c(x∗ ) = 0.
Therefore, (x∗ , y∗ ) satisfies the first-order optimality conditions for problem (11).
Contours of the augmented Lagrangian function
Figure: Contours of the augmented Lagrangian function for minimize x₁² + x₂² subject to x₁ + x₂² = 1, with fixed µ = 1, for multiplier estimates y = 0.5, y = 0.9, y = 0.99, and y = 1 = y∗.
Algorithm 3 A basic augmented Lagrangian algorithm
1: Input µ0 > 0, x0, and y0.
2: Set k ← 0 and x_0^S ← x0.
3: loop
4:   Choose suitable εk > 0 and ηk > 0 such that {εk} → 0 and {ηk} → 0.
5:   With initial guess x_k^S, compute xk such that ‖∇Φ(xk; µk, yk)‖ ≤ min(10⁻¹, εk).
6:   Set y_k^F ← yk − c(xk)/µk.
7:   if ‖c(xk)‖ ≤ ηk then
8:     Set yk+1 ← y_k^F and µk+1 ← µk.
9:   else
10:    Set yk+1 ← yk and µk+1 ← min(0.1µk, µk^(3/2)).
11:  end if
12:  Set x_{k+1}^S ← xk.
13:  Set k ← k + 1.
14: end loop
15: return approximate solution x_k^S
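Algorithm 3 can be sketched as follows. This is an illustrative Python rendering, not the lecture's code: the inner solver, the concrete tolerance schedule, and the starting multiplier are assumptions of this sketch; the toy problem is the one from the contour figures.

```python
import numpy as np
from scipy.optimize import minimize

def augmented_lagrangian(f, c, x0, y0=0.0, mu0=1.0, max_outer=25, tol=1e-6):
    """Sketch of Algorithm 3 for one scalar constraint: minimize
    Phi(x; mu, y) = f(x) - y*c(x) + c(x)^2/(2*mu); update y <- y - c(x)/mu
    when the feasibility target eta is met, otherwise shrink mu."""
    x, y, mu, eta = np.asarray(x0, dtype=float), y0, mu0, 0.5
    for _ in range(max_outer):
        phi = lambda x, mu=mu, y=y: f(x) - y * c(x) + c(x) ** 2 / (2.0 * mu)
        x = minimize(phi, x, method="BFGS").x
        if abs(c(x)) <= eta:          # good progress: first-order multiplier update
            y = y - c(x) / mu
            eta *= 0.5
        else:                          # poor progress: tighten the penalty
            mu = min(0.1 * mu, mu ** 1.5)
        if abs(c(x)) < tol:
            break
    return x, y

# toy problem from the contour figures: min x1^2 + x2^2 s.t. x1 + x2^2 = 1,
# with solution x* = (1/2, 1/sqrt(2)) and y* = 1; we start from y0 = 0.9
f = lambda x: x[0] ** 2 + x[1] ** 2
c = lambda x: x[0] + x[1] ** 2 - 1.0
x_sol, y_sol = augmented_lagrangian(f, c, np.array([0.5, 0.5]), y0=0.9)
```

Unlike the quadratic penalty method, the multiplier updates let µ stay bounded away from zero while y_k → y∗ = 1, matching the contour figures above.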
Comments
- Penalty methods did not aim directly for the solution.
- The trial steps are Newton-related steps based on models of the penalty function, and they only converge to a solution of problem (12) as certain parameter(s) defining the penalty function converge to desirable value(s).
- Sequential quadratic programming (SQP) methods aim directly for the solution using Newton's Method!
Newton’s Method based on what function?
The Newton correction (x⁺, y⁺) = (x, y) + (s, w) for this nonlinear system (linear in y) may be written in several equivalent forms:

asymmetric:
  [ H(x, y)  −J(x)ᵀ ] [ s ]     [ g(x) − J(x)ᵀy ]
  [ J(x)        0   ] [ w ] = − [ c(x)          ]

symmetric:
  [ H(x, y)   J(x)ᵀ ] [  s ]     [ g(x) − J(x)ᵀy ]
  [ J(x)        0   ] [ −w ] = − [ c(x)          ]

asymmetric (with y⁺ = y + w):
  [ H(x, y)  −J(x)ᵀ ] [ s  ]     [ g(x) ]
  [ J(x)        0   ] [ y⁺ ] = − [ c(x) ]
Solve the above linear KKT system using previously discussed methods:
- full system: symmetric (indefinite) LBLᵀ factorization of K
- Schur-complement: symmetric factorization of B and the Schur complement J(x)B⁻¹J(x)ᵀ
- null-space: Cholesky factorization of the reduced Hessian ZᵀBZ and asymmetric factorization of J(x)Y
- iterative methods: GMRES(k), MINRES, CG within null(J(x)), etc.
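For a concrete sketch, the third (asymmetric) form can be assembled and solved densely with numpy. The toy QP data below are assumptions of this example, chosen so the answer is easy to check by hand:

```python
import numpy as np

def sqp_step(H, J, g, c):
    """One pure SQP (Newton-KKT) step in the form with y_plus = y + w:
    solve [[H, -J^T], [J, 0]] [s; y_plus] = -[g; c] for the step s and the
    new multiplier estimate y_plus."""
    n, m = H.shape[0], J.shape[0]
    K = np.block([[H, -J.T], [J, np.zeros((m, m))]])
    rhs = -np.concatenate([g, c])
    sol = np.linalg.solve(K, rhs)
    return sol[:n], sol[n:]

# toy equality constrained QP: minimize 0.5*||x||^2 subject to x1 + x2 = 1,
# linearized at x = 0 (so g = 0 and c = -1); for a QP one Newton-KKT step is exact
s, y_plus = sqp_step(H=np.eye(2), J=np.array([[1.0, 1.0]]),
                     g=np.zeros(2), c=np.array([-1.0]))
# s = (0.5, 0.5) and y_plus = 0.5
```

In practice one would replace `np.linalg.solve` with one of the factorizations or iterative methods listed above; the dense solve is only for illustration.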
An alternative interpretation
Consider the quadratic program (QP) given by
  minimize_{s ∈ Rⁿ} f(x) + g(x)ᵀs + ½ sᵀBs subject to c(x) + J(x)s = 0    (13)
Strengths
- simple to define/implement
- fast (when it works)
  - quadratically convergent with Bk = H(xk, yk) (under standard assumptions)
  - superlinearly convergent with a good Bk ≈ H(xk, yk) (under standard assumptions)
  - we do not actually need Bk → H(xk, yk), but rather that they converge on a suitable subspace

Weaknesses
- the subproblem may not be feasible, or may be unbounded when it is feasible
- equivalent to Newton's Method for finding a zero of F, so not globally convergent
- how to choose Bk?
- what if the QP subproblem
    minimize_{s ∈ Rⁿ} f(xk) + g(xk)ᵀs + ½ sᵀBk s subject to J(xk)s = −c(xk)
  is infeasible or unbounded?
pure SQP is equivalent to Newton's Method for finding a first-order optimal point.
- Question: how do we globalize this iteration?
- Answer: linesearch/trust-region techniques with appropriate step acceptance strategies
Linesearch SQP methods
Basic idea:
Let the direction
Theorem 4.2
Suppose that both f and c are in C², and that x∗ is an isolated local minimizer of the equality constrained problem (1) with corresponding Lagrange multipliers y∗. Then x∗ is also an isolated local minimizer of Φ(x; ρ) provided that
  ρ > ‖y∗‖_D
Comments
Theorem 4.2 justifies the use of Φ(x; ρ) as a merit function.
Based on this result, one might suspect that the pure SQP step is a descent
direction provided ρ is sufficiently large. It is!!
Theorem 4.3 (pure SQP direction is a descent direction for the exact penalty function)
If Bk ≻ 0, (sk, yk+1) is the SQP search direction and its associated multiplier estimate for problem (1) at xk, and xk is not a first-order KKT point, then sk is a descent direction for the non-differentiable exact penalty function Φ(x; ρk) at xk whenever
  ρk ≥ ‖yk+1‖_D

Proof: Taylor's theorem applied to f and c, together with ck + Jk sk = 0, implies (for small α) that
  Φ(xk + αsk; ρk) − Φ(xk; ρk) = α skᵀgk + ρk( ‖ck + αJk sk‖ − ‖ck‖ ) + O(α²)
                              = α skᵀgk + ρk( ‖(1 − α)ck‖ − ‖ck‖ ) + O(α²)
                              = α( skᵀgk − ρk‖ck‖ ) + O(α²),
which combined with (14), the positive definiteness of Bk, the Hölder inequality, sk ≠ 0 since xk is not a KKT point, and the bound on ρk, imply that
  Φ(xk + αsk; ρk) − Φ(xk; ρk) = −α( skᵀBk sk + ckᵀyk+1 + ρk‖ck‖ ) + O(α²)
Figure: The ℓ₁ non-differentiable exact penalty function Φ(x; ρ) = f(x) + ρ‖c(x)‖₁ with ρ = 1, f(x) = 2(x₁² + x₂² − 1) − x₁, and c(x) = x₁² + x₂² − 1. The solution is x∗ = (1, 0) with y∗ = 3/2.
Maratos effect: the merit function may prevent acceptance of the pure SQP step
arbitrarily close to x∗ and lead to slow (not Newton-like) convergence!
Comments
- We can choose Bk+1 ≻ 0 via a quasi-Newton approximation. To ensure that Bk+1 is positive definite, one may use Powell's damping technique (see Procedure 18.2 in Nocedal and Wright, 2006).
- D_{pk}Φ(xk; ρk) is the directional derivative of Φ(x; ρk) at xk, and it is equal to
    D_{pk}Φ(xk; ρk) = g(xk)ᵀsk − ρk‖ck‖₁.
Models of Φ
A piecewise linear model of Φ(xk + s; ρk) is given by
  ℓΦ(s; xk, ρk) := fk + gkᵀs + ρk‖ck + Jk s‖₁
and the associated predicted change in Φ given by ℓΦ for the step s is defined by
  ∆ℓΦ(s; xk, ρk) := ℓΦ(0; xk, ρk) − ℓΦ(s; xk, ρk)
                  = −gkᵀs + ρk( ‖ck‖₁ − ‖ck + Jk s‖₁ )
Observation
We may merely require the computed αk > 0 to satisfy the weaker condition
  Φ(xk + αk sk; ρk) ≤ Φ(xk; ρk) − η αk ∆qΦ(sk; xk, ρk),
which makes sense since, if xk is not a KKT point, then (homework)
  D_{pk}Φ(xk; ρk) = gkᵀsk − ρk‖ck‖₁
                  ≤ gkᵀsk + ρk( ‖ck + Jk sk‖₁ − ‖ck‖₁ ) = −∆ℓΦ(sk; xk, ρk)
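The resulting backtracking loop is short. A sketch, where the Armijo constant η, the shrink factor, and the sample merit profile are all assumptions of this illustration:

```python
def backtracking_merit_linesearch(phi, pred_decrease, eta=1e-4, shrink=0.5,
                                  max_iter=30):
    """Backtracking on the merit function: accept the first alpha satisfying
    phi(alpha) <= phi(0) - eta * alpha * pred_decrease, where pred_decrease > 0
    is the model's predicted decrease for the step s_k."""
    alpha, phi0 = 1.0, phi(0.0)
    for _ in range(max_iter):
        if phi(alpha) <= phi0 - eta * alpha * pred_decrease:
            return alpha
        alpha *= shrink
    return alpha

# hypothetical merit profile along a good direction: decreases to 0 at alpha = 1
phi = lambda a: (1.0 - a) ** 2
alpha = backtracking_merit_linesearch(phi, pred_decrease=2.0)  # accepts alpha = 1.0
```

Because the sufficient-decrease test starts at α = 1, a unit step is accepted whenever the merit function behaves like its model, which is what the Maratos effect discussed above can spoil.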
  sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖ ≤ δk
Strengths
- subproblem (15) is always consistent
- when ρ and δk are large enough, sk is equivalent to the pure SQP direction. Why?
- the subproblem (15) is equivalent to the (smooth) quadratic program
Composite-step methods
  sk = nk + tk
where
- the normal step nk moves towards feasibility of the linearized constraints (within the trust region):
    ‖ck + Jk nk‖ < ‖ck‖
  (the model objective may get worse)
- the tangential step tk reduces the model of the objective function (within the trust region) without sacrificing the improvement in feasibility predicted by nk.

Figure: Diagram illustrating the normal and tangential steps. Points on the dotted line are all potential tangential steps.
Normal step:
Relax the constraints
  Jk s = −ck and ‖s‖ ≤ δk
so that we find nk that satisfies
Tangential step:
Find
  tk ≈ argmin_t (gk + Bk nk)ᵀt + ½ tᵀBk t subject to Jk t = 0, ‖nk + t‖ ≤ δk
Snags:
- choice of σk
- incompatible constraints
Approach 2: constraint reduction (Byrd and Omojokun)
Normal step:
Replace
  Jk s = −ck and ‖s‖ ≤ δk
by
  nk ≈ argmin_n ‖ck + Jk n‖ subject to ‖n‖ ≤ δk
Tangential step:
Same as with Vardi.
Comments:
- use an iterative linear conjugate gradient method to (approximately) solve both subproblems (use Cauchy points in both cases)
- globally convergent using the ℓ₂ merit function
- the basis of a successful algorithm used in the package KNITRO
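A dense-linear-algebra sketch of one composite step in this spirit: least squares for the normal step and an SVD null-space basis for the tangential step, in place of the CG iterations mentioned above. The 0.8δ normal-step cap, the crude trust-region clipping, and the toy data are all simplifications assumed here.

```python
import numpy as np

def composite_step(B, J, g, c, delta):
    """One composite step s = n + t: the normal step n reduces ||c + J n||
    within a shrunken trust region; the tangential step t reduces the model
    within null(J), keeping ||n + t|| <= delta."""
    # normal step: minimum-norm least-squares solution of J n = -c, capped
    n = np.linalg.lstsq(J, -c, rcond=None)[0]
    if np.linalg.norm(n) > 0.8 * delta:
        n *= 0.8 * delta / np.linalg.norm(n)
    # tangential step: reduced Newton step in a null-space basis Z of J
    _, _, Vt = np.linalg.svd(J)
    Z = Vt[J.shape[0]:].T                      # columns span null(J)
    rhs = -Z.T @ (g + B @ n)
    t_red = np.linalg.solve(Z.T @ B @ Z, rhs)  # assumes reduced Hessian is PD
    t = Z @ t_red
    if np.linalg.norm(n + t) > delta:          # crude trust-region safeguard
        t *= (delta - np.linalg.norm(n)) / np.linalg.norm(t)
    return n, t

# toy data (assumed): B = I, g = (1, 0), one constraint with J = [1, 1], c = [-1]
B, g = np.eye(2), np.array([1.0, 0.0])
J, c = np.array([[1.0, 1.0]]), np.array([-1.0])
n, t = composite_step(B, J, g, c, delta=2.0)
```

By construction Jt = 0, so the tangential step cannot undo the feasibility gain ‖c + Jn‖ < ‖c‖ achieved by the normal step.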
Normal step:
Replace
  Jk s = −ck and ‖s‖ ≤ δk
by finding σk ∈ [0, ‖ck‖] large enough so that there is a feasible nk for the constraints
Tangential step:
Compute tk as
Snags:
- choice of σk
- the tangential step subproblem is difficult to solve because of "hard" constraints
Filter method by Fletcher and Leyffer
Motivation
the simple trust-region SQP subproblem
  sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖ ≤ δk

  v(x) = ‖c(x)‖
measures the constraint violation, and we use the notation fℓ = f(xℓ) and vℓ = v(xℓ).
Figure: Diagram depicting a filter with four entries, where v(x) = ‖c(x)‖. A new point xk is "acceptable" to the filter if its associated pair (v(xk), f(xk)) lies below and to the left of the dashed staircase-like region.
Basic idea behind filter methods
computing a trial step sk
- if the simple SQP trust-region subproblem constraints given by
    Jk s = −ck and ‖s‖∞ ≤ δk    (16)
  are compatible, then define
    sk = argmin_{s ∈ Rⁿ} fk + gkᵀs + ½ sᵀBk s subject to Jk s = −ck, ‖s‖∞ ≤ δk
- if the constraints given by (16) are not compatible, find sk such that
    v(xk + sk) "<" min(vk, vi) for all i in the filter
step acceptability
- if xk + sk is "acceptable" to the filter, set xk+1 = xk + sk, possibly increase δk, add (v(xk+1), f(xk+1)) to the filter, and then remove (prune) any dominated entries from the filter
- otherwise, reduce δk and try again

In practice
- many additional details are needed
- far more complicated than this!
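The acceptability and pruning logic at the heart of a filter can be sketched in a few lines. The strict-inequality test below omits the small "envelope" margins that practical filter methods add, and the sample entries are values assumed for illustration:

```python
def is_acceptable(filter_entries, v_new, f_new):
    """A trial pair (v, f) is acceptable if no filter entry dominates it,
    i.e. versus every entry it improves either feasibility v or objective f."""
    return all(v_new < v or f_new < f for (v, f) in filter_entries)

def add_to_filter(filter_entries, v_new, f_new):
    """Add the accepted pair and prune (remove) every entry it dominates."""
    kept = [(v, f) for (v, f) in filter_entries
            if not (v_new <= v and f_new <= f)]
    return kept + [(v_new, f_new)]

# a filter with two entries, stored as (v, f) pairs
filt = [(1.0, 5.0), (0.5, 7.0)]
```

For example, the pair (1.2, 6.0) is rejected because the entry (1.0, 5.0) dominates it: it is both less feasible and has a larger objective value.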