Solving Underdetermined Nonlinear Equations by Newton-Like Method


Noname manuscript No.

(will be inserted by the editor)

Solving underdetermined nonlinear equations by Newton-like method

Boris Polyak · Andrey Tremba


arXiv:1703.07810v1 [math.OC] 22 Mar 2017

Received: date / Accepted: date

Abstract The Newton method is one of the most powerful tools for finding solutions of nonlinear equations. In its classical form it applies to systems of n equations in n variables, but it can be modified for underdetermined equations (with m < n, m being the number of equations). This paper addresses theorems on the solvability of such equations, as well as conditions for the convergence, and the rate of convergence, of Newton-like methods. The results are applied to systems of quadratic equations, one-dimensional equations and inequalities.
Keywords nonlinear equations · quadratic equations · existence of solution ·
Newton method · underdetermined equations · feasibility

1 Introduction

Consider the nonlinear equation

g(x) = y,   (1)

written via a vector function g : R^n → R^m. There is a huge body of literature on the solvability of such equations and on numerical methods for their solution, see e.g. the classical monographs [15, 3]. One of the most powerful methods is the Newton method, going back to such giants as Newton, Cauchy and Fourier. The general form of the method is due to Kantorovich [6, 7].

This work was supported by Russian Science Foundation (project 16-11-10015).

B. Polyak
Institute for Control Sciences, Profsoyuznaya 65, 117997 Moscow, Russia;
Skolkovo Institute of Science and Technology, Skolkovo Innovation Center Building 3, 143026
Moscow, Russia, E-mail: boris@ipu.ru
A. Tremba
Institute for Control Sciences, Profsoyuznaya 65, 117997 Moscow, Russia, E-mail:
atremba@ipu.ru

On the history and references see [8, 23, 19]. The basic version of the Newton method for (1) is applicable when g(x) is differentiable and g'(x) is invertible (this implies m = n):

x^{k+1} = x^k − g'(x^k)^{-1}(g(x^k) − y).   (2)

The method converges under some natural conditions; moreover, it can be used for obtaining existence theorems for the solution (see the references cited above). Unfortunately the Newton method converges only locally: it requires a good initial approximation x^0 (a so-called "hot start"). The convergence conditions can be relaxed for the damped Newton method

x^{k+1} = x^k − α g'(x^k)^{-1}(g(x^k) − y)

with 0 < α < 1.
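As an illustration only (not part of the original paper), here is a minimal Python sketch of the classical and damped Newton iterations for the square case m = n; the functions `g`, `dg`, the example data and the tolerances are placeholder assumptions.

```python
import numpy as np

def damped_newton(g, dg, y, x0, alpha=1.0, tol=1e-10, max_iter=100):
    """Damped Newton iteration x_{k+1} = x_k - alpha * g'(x_k)^{-1} (g(x_k) - y);
    alpha = 1 gives the classical method (2), 0 < alpha < 1 the damped variant."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = g(x) - y
        if np.linalg.norm(r) < tol:
            break
        x = x - alpha * np.linalg.solve(dg(x), r)   # requires an invertible Jacobian (m = n)
    return x

# Hypothetical 2x2 example, purely for illustration.
g = lambda x: np.array([x[0]**2 + x[1], np.sin(x[1]) + x[0]])
dg = lambda x: np.array([[2 * x[0], 1.0], [1.0, np.cos(x[1])]])
x_sol = damped_newton(g, dg, y=np.array([1.0, 0.5]), x0=np.array([1.0, 0.0]), alpha=0.5)
```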
However, the case of underdetermined equations (m < n) has attracted much less attention. The pioneering result is due to Graves [5], in the more general setting of Banach spaces. Graves' theorem for the finite-dimensional case claims that if the condition

||g(x_a) − g(x_b) − A(x_a − x_b)|| ≤ C ||x_a − x_b||

holds in a ball B of radius ρ for a matrix A with minimal singular value µ > C > 0, then a solution of equation (1) exists provided ||y|| is small enough, namely ||y|| ≤ ρ(µ − C), and it can be found via a version of the modified Newton method, where each iteration requires the solution of a linear equation with matrix A; see [4, 12] for details.
In explicit form, the Newton method for m ≠ n was written down by Ben-Israel [1]:

x^{k+1} = x^k − g'(x^k)^†(g(x^k) − y),

where A^† stands for the Moore-Penrose pseudoinverse of A. However, the results in [1] are mostly oriented toward overdetermined systems, and the assumptions of the theorems in [1] are hard to check.
In the paper [16], results on the solvability of nonlinear equations in Banach spaces and on the application of Newton-like methods were formulated in a different form. One of the results from [16], adapted to our notation and to the finite-dimensional case, claims that if g(0) = 0, g'(x) exists and is Lipschitz on B, and the estimate ||g'(x)^T h|| ≥ µ||h||, µ > 0, ∀h holds on B, then equation (1) has a solution x* provided ||y|| < µρ. Another result deals with the convergence of the Newton method; however, the method is not provided in explicit form.
The main contribution of the present paper is the analysis of a novel version of the Newton method for solving the underdetermined equation (1) with m < n. It has the form

x^{k+1} = x^k − α_k z^k,  k = 0, 1, . . . ,
z^k = arg min_z {||z|| : g'(x^k) z = y − g(x^k)}.   (3)

If m = n and g'(x^k)^{-1} exists, the method coincides with the classical Newton method for α_k = 1 and with the damped Newton method for α_k = α < 1. Starting at x^0 = 0, the latter method converges to x* under some additional constraints on ||y||, L, ρ, µ, α (see Theorem 1 below for rigorous conditions). Notice that the norms in R^n, R^m can be chosen arbitrarily, and they imply different forms of the Newton method (3) and various conditions on solvability and convergence.
The first goal of the present paper is to give explicit expressions for the method (3) for various norms and to provide simple, easily checkable conditions for its convergence. This also yields existence theorems: what is the feasible set Y such that y ∈ Y implies solvability of (1)?
The second goal is to develop constructive algorithms for choosing the step-sizes α_k to achieve fast and as global as possible convergence. We suggest different strategies for constructing algorithms and study their properties.
We also examine some special cases of the nonlinear equations. One of them is the quadratic case, when all components of g are quadratic functions:

g_i(x) = (1/2)(A_i x, x) + (b_i, x),  A_i = A_i^T,  b_i ∈ R^n.   (4)

In this case we specialize the above results and design algorithms to check feasibility of a vector y ∈ R^m. The next important case is the scalar one, i.e. m = 1. We specialize the general results for scalar equations and inequalities; the arising algorithms have much in common with unconstrained minimization methods. Finally, we discuss nonlinear equations having some special structure; in that case the convergence results can be strongly enhanced.
A few words on comparison with known results. The paper which contains the results closest to ours is [14]. Nesterov addresses the same problem (1), and his method (in our notation) has the form

x^{k+1} = x^k − z^k,  k = 0, 1, . . . ,
z^k = arg min_z {||g(x^k) − y + g'(x^k) z|| + M ||z||^2},

where M is a scalar parameter to be adjusted at each iteration. Nesterov's assumptions are close to ours, and his results on the solvability of equations and on the convergence of the method are similar. The main difference is the method itself; it is not clear how to solve the auxiliary optimization problem in Nesterov's method, while finding z^k in our method can be implemented in explicit form.
Other papers on underdetermined equations mentioned above either do not specify the technique for solving the linearized auxiliary equation, or restrict the analysis to the Euclidean norm and the pure Newton step-size α_k = 1, see e.g. [16, 20, 21, 9].
The rest of the paper is organized as follows. In the next section we recall a few notions and results and discuss explicit (or half-explicit) solutions of the optimization problem in (3). Next (Section 3) we prove simple solvability conditions for (1). In Section 4 we propose a few variants of the general Newton algorithm (3) and estimate their convergence rate. Some particular cases (scalar equations and inequalities, quadratic equations, problems with special structure) are treated in Section 5. Results of numerical simulation are exhibited in Section 6. Conclusions finalize the paper (Section 7).

2 Preliminaries

First of all, let us specify the subproblem of finding the vector z^k in (3) for different norms of x ∈ R^n.

1. For ||x|| = ||x||_1 the vector z^k is a solution of the LP problem

   min{||z||_1 : g'(x^k) z = y − g(x^k)};

   its dual is an LP problem with one scalar constraint:

   min{||g'(x^k)^T h||_∞ : (y − g(x^k), h) = 1}.

2. For ||x|| = ||x||_∞ the vector z^k is a solution of the LP problem

   min{||z||_∞ : g'(x^k) z = y − g(x^k)};

   its dual is also an LP problem with one scalar constraint:

   min{||g'(x^k)^T h||_1 : (y − g(x^k), h) = 1}.

3. For ||x|| = ||x||_2 the vector z^k can be written explicitly:

   z^k = g'(x^k)^†(g(x^k) − y),

   where for m ≤ n the Moore-Penrose pseudoinverse of a matrix A with full row rank is A^† = A^T(A A^T)^{−1}.

Thus in these (most important) cases algorithm (3) can be implemented efficiently. Note also that the solutions of the first two problems may be non-unique.
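A minimal sketch of how z^k could be computed in practice for the ℓ_2 and ℓ_1 cases (the ℓ_∞ case is an analogous LP). The function name, the `scipy.optimize.linprog` reformulation with z = p − q, p, q ≥ 0, and the solver choice are illustrative assumptions, not prescriptions of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def newton_direction(J, r, norm="l2"):
    """Least-norm solution z of J z = r, where J = g'(x^k) and r = y - g(x^k)."""
    m, n = J.shape
    if norm == "l2":
        # Minimum 2-norm solution, equivalent to z = J^T (J J^T)^{-1} r for full row rank.
        return np.linalg.lstsq(J, r, rcond=None)[0]
    if norm == "l1":
        # min ||z||_1 s.t. J z = r, via the split z = p - q with p, q >= 0.
        c = np.ones(2 * n)
        res = linprog(c, A_eq=np.hstack([J, -J]), b_eq=r, bounds=(0, None))
        return res.x[:n] - res.x[n:]
    raise ValueError("unsupported norm")
```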
Note that the linear constraint

A z = b,  b ∈ R^m,  z ∈ R^n,   (5)

in the primal optimization problems above describes either an affine subspace or the empty set. The classical result below (which goes back to Banach, see [7, 12, 14]) guarantees solvability of the linear equation (5) and gives estimates of its solutions. We suppose that the spaces R^n, R^m are equipped with some norms; the dual norms are denoted ||·||_* (for a linear functional c, associated with a vector of the same dimension, ||c||_* = sup_{x: ||x||=1} (c, x)). Operator norms are subordinate to the vector norms, e.g. for A : X → Y we have ||A x||_Y ≤ ||A||_{X,Y} ||x||_X. In most cases we do not specify the vector norms; the dual and operator norms are obvious from the context.

Lemma 1 If A ∈ R^{m×n} satisfies the condition

||A^T h||_* ≥ µ_0 ||h||_*,  µ_0 > 0,   (6)

for all h ∈ R^m, then equation (5) has a solution for all b ∈ R^m, and all solutions of the optimization problem

ẑ = arg min{||z|| : A z = b}

have bounded norms: ||ẑ|| ≤ ||b||/µ_0.

The Lemma claims that the matrix A has full row rank equal to m provided (6) holds. This is another way of saying that the mapping A : R^n → R^m is onto, i.e. it covers the whole image space. In the case of Euclidean norms the parameter µ_0 is the smallest singular value of the matrix, µ_0 = σ_m(A) (we denote the singular values of a matrix in R^{m×n} in decreasing order as σ_1 ≥ σ_2 ≥ . . . ≥ σ_m).
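A quick numerical illustration of Lemma 1 in the Euclidean case, with randomly generated data (a sketch, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 7
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

mu0 = np.linalg.svd(A, compute_uv=False)[-1]      # smallest singular value sigma_m(A)
z_hat = np.linalg.lstsq(A, b, rcond=None)[0]      # minimum 2-norm solution of A z = b

assert np.linalg.norm(z_hat) <= np.linalg.norm(b) / mu0 + 1e-12   # bound of Lemma 1
```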
Finally, we introduce the sums of double-exponential functions H_k : [0, 1) → R_+,

H_k(δ) = Σ_{ℓ=k}^{∞} δ^{(2^ℓ)}

(cf. [16]), and the inverse function of the first of them, ∆(·) : R_+ → [0, 1), such that ∆(H_0(δ)) ≡ δ, δ ∈ [0, 1). All functions H_k(δ) are monotonically increasing in δ. In the results below we also use the specific constant

c = H_0(1/2) ≈ 0.8164,  so that ∆(c) = 1/2.

The following upper and lower approximations are useful for simplifying the resulting expressions in the theorems:

H_k(δ) ≤ δ^{(2^k)}/(1 − δ^{(2^k)}) = 1/(δ^{−(2^k)} − 1),  ∆(H) ≥ H/(1 + H).   (7)
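A small numerical sketch of H_k, ∆ and the constant c (the truncation of the infinite sum and the bisection inversion are implementation assumptions):

```python
import numpy as np

def H(k, delta, terms=60):
    """H_k(delta) = sum_{l >= k} delta^(2^l), truncated after `terms` terms."""
    return sum(delta ** (2.0 ** l) for l in range(k, k + terms))

def Delta(h, tol=1e-12):
    """Inverse of H_0 on [0, 1): find delta with H_0(delta) = h by bisection."""
    lo, hi = 0.0, 1.0 - 1e-9
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if H(0, mid) < h else (lo, mid)
    return 0.5 * (lo + hi)

c = H(0, 0.5)                    # approximately 0.8164
assert abs(Delta(c) - 0.5) < 1e-6
```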

3 Solvability of underdetermined equations

Below, the problem of solvability of equation (1) is addressed. We apply algorithm (3) with a small α and prove that the iterations converge and the limit point is a solution. This technique follows the idea from [16]. Recall that R^n, R^m are equipped with some norms; the dual norms are denoted ||·||_*.

Assumptions.
A. g(0) = 0, g(x) is differentiable in the ball B = {x ∈ R^n : ||x|| ≤ ρ} and g'(x) satisfies the Lipschitz condition in B:

||g'(x_a) − g'(x_b)|| ≤ L ||x_a − x_b||.

B. The following inequality holds for all x ∈ B and some µ > 0:

||g'(x)^T h||_* ≥ µ ||h||_*,  ∀h ∈ R^m.

C. ||y|| < µρ.

Theorem 1 If conditions A, B, C hold, then there exists a solution x* of (1), and ||x*|| ≤ ||y||/µ.

Proof We apply algorithm (3) with α > 0 small enough and x^0 = 0. The algorithm is well defined: condition B and Lemma 1 imply the existence of solutions z^k provided that x^k ∈ B; this is true for k = 0 and will be validated recurrently. The standard formula

g(x + z) = g(x) + ∫_0^1 g'(x + tz) z dt,

combined with condition A, provides for x = x^k, z = −α z^k and u_k = ||g(x^k) − y|| the recurrent relation

u_{k+1} ≤ |1 − α| u_k + (L α²/2) ||z^k||².

Now condition B and Lemma 1 transform this estimate into

u_{k+1} ≤ |1 − α| u_k + (L α²/(2µ²)) u_k².

Choose α = ε (2µ²/(L u_0))(1 − u_0/(µρ)) with a small ε < 1 satisfying 0 < α < 1; this is possible due to condition C. From the above inequality we get u_{k+1} ≤ u_k (1 − α + α ε (u_k/u_0)(1 − u_0/(µρ))). For k = 0 this implies u_1 < u_0 and, recurrently, u_{k+1} < u_k. We also get u_{k+1} ≤ q u_k with q = 1 − α + α ε (1 − u_0/(µρ)) < 1. Thus u_k ≤ q^k u_0 and u_k → 0 as k → ∞.

On the other hand, since α ≤ 1, we have ||x^{k+1} − x^k|| = α ||z^k|| ≤ ||g(x^k) − y||/µ = u_k/µ ≤ q^k u_0/µ. Hence for any k, s and for k → ∞

||x^{k+s} − x^k|| ≤ Σ_{i=k}^{k+s−1} ||x^{i+1} − x^i|| ≤ q^k u_0/((1 − q) µ) → 0.

This means that x^k is a Cauchy sequence and converges to some point x*(ε). We had g(x^k) → y; thus continuity implies g(x*(ε)) = y. Moreover, for all iterations we get ||x^k − x^0|| = ||x^k|| ≤ Σ_{j=0}^{k−1} ||x^{j+1} − x^j|| ≤ α u_0/(µ(1 − q)) < ρ. Hence all iterates x^k remain in the ball B and our reasoning was correct. Finally, taking ε → 0 we have ||x^k|| ≤ u_0/µ, and the limit point x*(ε) → x*, which is a solution as well, with ||x*|| ≤ u_0/µ. ⊓⊔

Corollary 1 If ρ = ∞ (that is, conditions A, B hold on the entire space R^n), then equation (1) has a solution for an arbitrary right-hand side y.

It is worth noting that if we apply the pure Newton method (i.e. take α_k ≡ 1), the conditions for its convergence are more restrictive: we need ||y|| ≤ 2µ²/L, that is, we can guarantee only local convergence even for ρ = ∞.

Corollary 2 If m = n and Condition B is replaced with ||g'(x)^{-1}|| ≤ 1/µ, x ∈ B, then the statement of Theorem 1 holds true.

In this case our method (3) reduces to the classical Newton method (2).

For m < n equation (1) in general has many solutions; we have constructed just one of them, and we cannot guarantee that this solution is the closest to the initial point x^0 = 0.

Among the assumptions of Theorem 1, the main difficulty is Assumption B. In many examples L can be estimated relatively easily; this is not the case for µ. It is much simpler to check its analog at the single point x = 0. Let us modify the solvability result in this way, with the assumptions

B′. The following inequality holds for some µ_0 > 0:

||g'(0)^T h||_* ≥ µ_0 ||h||_*,  ∀h ∈ R^m.

C′. ||y|| < µ_0²/(4L),  µ_0/(2L) ≤ ρ.

Theorem 2 If conditions A, B′, C′ hold, then there exists a solution x* of (1), and ||x*|| ≤ 2||y||/µ_0.

Proof For any linear operator A and its dual A* (coinciding with A^T in real vector spaces) one has ||A|| = ||A*||_*, where the dual operator norm is induced by the dual norms [7]. Then condition A yields the same Lipschitz constant for the transposed derivative operator g'(·)^T:

||(g'(x_a) − g'(x_b))^T||_* = ||g'(x_a) − g'(x_b)|| ≤ L ||x_a − x_b||,  x_a, x_b ∈ B.

Denote r = µ_0/(2L); due to the second part of Condition C′, the ball B_r = {x : ||x|| ≤ r} lies within B. Condition A evidently holds in B_r. Also in B_r we have ||g'(x)^T h||_* ≥ ||g'(0)^T h||_* − ||(g'(x) − g'(0))^T h||_* ≥ (µ_0 − L||x||)||h||_* ≥ (µ_0 − Lr)||h||_* ≥ (µ_0/2)||h||_*, thus condition B holds with µ = µ_0/2. Condition C is also satisfied on this ball, due to the first part of Condition C′, as ||y|| < µ_0²/(4L) = rµ, and Theorem 1 holds (with r instead of ρ in its statement). ⊓⊔

It is worth noting that the result of Theorem 2 is always local, even for ρ = ∞ (compare with Corollary 1). From the proof it also follows that the second condition µ_0/(2L) ≤ ρ in Assumption C′ is non-restrictive.

Corollary 3 Assumption C′ can be replaced with ||y|| ≤ (µ_0 − L r*) r*, resulting in the theorem's statement ||x*|| ≤ r*, where r* = min{ρ, µ_0/(2L)}.

4 Main algorithms

For our purposes it is more convenient to write the main equation not in the form (1) (where we assumed g(0) = 0, x^0 = 0) but as

P(x) = 0,  P : R^n → R^m.   (8)

Then, via a trivial change of variables, the Newton method becomes

z^k = arg min {||z|| : P'(x^k) z = P(x^k)},
x^{k+1} = x^k − α_k z^k,  k = 0, 1, . . . ,

with x^0 not necessarily equal to 0. Conditions on y ensuring solvability of (1) are essentially transformed into conditions on P(x^0) guaranteeing a solution of (8), and vice versa.
In the previous section we proved solvability of the equation by using the algorithm with a constant α_k ≡ α > 0; choosing α smaller, we obtained a larger solvability domain. However, in this section our goal is different: to reach the fastest convergence to a solution. For this purpose, different strategies for the design of step-sizes are needed. The basic policy is as follows. First, we rewrite the assumptions in the new notation.

A′′. P(x) is differentiable in the ball B = {x ∈ R^n : ||x − x^0|| ≤ ρ} and P'(x) satisfies the Lipschitz condition in B:

||P'(x_a) − P'(x_b)|| ≤ L ||x_a − x_b||.

B′′. The following inequality holds for all x ∈ B and some µ > 0:

||P'(x)^T h||_* ≥ µ ||h||_*,  ∀h ∈ R^m.

If A′′, B′′ hold true, we have the same recurrent inequalities for u_k = ||P(x^k)||:

u_{k+1} ≤ |1 − α_k| u_k + (L α_k²/2) ||z^k||²,   (9)

u_{k+1} ≤ |1 − α_k| u_k + (L α_k²/(2µ²)) u_k²,   (10)

the second one being just a continuation of the first one based on the estimate ||z^k|| ≤ u_k/µ (compare the calculations in the proof of Theorem 1). Now we can minimize the right-hand sides of these inequalities over α_k; it is natural to expect that such a choice of step-size implies the fastest convergence of u_k to zero and thus the fastest convergence of the iterates x^k to a solution. If one applies such a policy based on inequality (10), the optimal α depends on µ, L (Algorithm 1 below). Their values are hard to estimate in most applications, so the method would be hard to implement. Fortunately, we can modify the algorithm using parameter adjustment (Algorithm 2). On the other hand, the same policy based on (9) requires just the value of L, which is commonly available (Algorithm 3).

Thus we arrive at an algorithm which we call the Newton method, while in fact it blends the pure Newton method with the damped Newton method with a special rule for damping. In some respects it resembles the Newton method for minimization of self-concordant functions [13].

4.1 Newton method with known constants


If both constants L and µ are known, then α_k = min{1, µ²/(L ||P(x^k)||)} is taken as the minimizer of the right-hand side of (10).

Algorithm 1 (Basic Newton method)

z^k = arg min {||z|| : P'(x^k) z = P(x^k)},

x^{k+1} = x^k − min{1, µ²/(L ||P(x^k)||)} z^k,  k ≥ 0.   (11)

The algorithm is well defined, since ||P(x^k)|| = 0 means that a solution is already found (formally z^k = 0, α_k = 1 thereafter). We remind the reader that any vector norm in R^n can be used in the calculation of z^k, any vector norm in R^m can be used for ||P(x^k)||, and the constants L, µ must comply with these norms.

The update step in (11) can be written in a less compact but more illustrative form:

x^{k+1} = x^k − (µ²/(L ||P(x^k)||)) z^k,  if ||P(x^k)|| ≥ µ²/L (Stage 1 step),
x^{k+1} = x^k − z^k,  otherwise (Stage 2 step).

The latter case is a pure Newton step, while the former is a damped Newton step. The calculation of the direction z^k is the same in both stages. The result on convergence and rate of convergence is given below. We use upper (⌈·⌉) and lower (⌊·⌋) rounding to an integer; the function ∆(·) and the constant c ≈ 0.8164 were introduced at the end of Section 2.
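A compact sketch of Algorithm 1 in the Euclidean setting (the inputs `P`, `dP`, `mu`, `L` and the stopping rule are assumptions of this illustration):

```python
import numpy as np

def basic_newton(P, dP, x0, mu, L, tol=1e-10, max_iter=200):
    """Algorithm 1: x_{k+1} = x_k - min(1, mu^2 / (L ||P(x_k)||)) z_k,
    where z_k is the least 2-norm solution of P'(x_k) z = P(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = P(x)
        norm_r = np.linalg.norm(r)
        if norm_r < tol:
            break
        z = np.linalg.lstsq(dP(x), r, rcond=None)[0]    # minimum-norm direction
        alpha = min(1.0, mu**2 / (L * norm_r))          # < 1: Stage 1 step, = 1: Stage 2 step
        x = x - alpha * z
    return x
```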
Theorem 3 Suppose that Assumptions A′′, B′′ hold and

||P(x^0)|| ≤ (µ²/L) · { 2∆(Lρ/(2µ)),            if Lρ/(2µ) ≤ c,   (12a)
                        1 + (1/2)⌊Lρ/µ − 2c⌋,    if Lρ/(2µ) > c;   (12b)

then the sequence {x^k} generated by Algorithm 1 converges to a solution x*: P(x*) = 0.

The function ||P(x^k)|| is monotonically decreasing, and there are no more than

k_max = max{0, ⌈(2L/µ²) ||P(x^0)||⌉ − 2}   (13)

iterations at Stage 1. At the k-th step the following estimates for the rate of convergence hold:

||P(x^k)|| ≤ ||P(x^0)|| − (µ²/(2L)) k,        k < k_max,   (14a)
||x^k − x*|| ≤ (µ/L)(k_max − k + 2c),         k < k_max,   (14b)
||P(x^k)|| ≤ (2µ²/L) 2^{−(2^{(k−k_max)})},     k ≥ k_max,   (14c)
||x^k − x*|| ≤ (2µ/L) H_{k−k_max}(1/2),        k ≥ k_max.   (14d)

Proof Assume that all x^k ∈ B, k ≥ 0. Below we state a condition ensuring this assumption. Using u_k = ||P(x^k)|| and denoting β = µ²/L, we rewrite (10) with a generic step-size α (the relation is very similar to the Newton-method analysis in [16], with γ = |1 − α|, λ = α/µ) as

u_{k+1} ≤ |1 − α| u_k + (1/(2β)) α² u_k².

Its optimum over α is at α_k = β/u_k < 1 if β/u_k < 1 (i.e. u_k > β), and at α_k = 1 otherwise.

During Stage 1 of damped Newton steps (α_k < 1) the target functional monotonically decreases as

u_{k+1} ≤ u_k − β/2.   (15)

There are at most k_max = max{0, ⌈2u_0/β⌉ − 2} iterations in this phase, say k̄ of them, resulting in u_{k̄} ≤ β. As soon as u_k reaches the threshold β, the algorithm switches to Stage 2 (pure Newton steps). Then the recurrent relation (15) becomes

u_{k+1} ≤ (1/(2β)) u_k² = 2β (u_k/(2β))²,  k ≥ k̄,

so we can write

u_{k̄+ℓ} ≤ 2β (u_{k̄}/(2β))^{(2^ℓ)} ≤ 2β (1/2)^{(2^ℓ)},  ℓ ≥ 0.

In the second phase ||x^{i+1} − x^i|| = ||z^i|| ≤ u_i/µ due to Lemma 1, and for ℓ_2 ≥ ℓ_1 ≥ 0 we have

||x^{k̄+ℓ_2} − x^{k̄+ℓ_1}|| ≤ Σ_{i=ℓ_1}^{ℓ_2−1} ||x^{k̄+i+1} − x^{k̄+i}|| ≤ (2β/µ) [H_{ℓ_1}(u_{k̄}/(2β)) − H_{ℓ_2}(u_{k̄}/(2β))].   (16)

The sequence {x^k} is a Cauchy sequence because H_j(u_{k̄}/(2β)) ≤ H_j(1/2) → 0 as j → ∞. It converges to a point x* with ||P(x*)|| = lim_{k→∞} u_k = 0, due to the continuity of P, and

||x^{k̄+ℓ} − x*|| ≤ (2β/µ) H_ℓ(u_{k̄}/(2β)) ≤ (2β/µ) H_ℓ(1/2),  ℓ ≥ 0.   (17)

Next we estimate the distance from the points x^k in Stage 1 to the limit solution point x*. The one-step distance is bounded by a constant, ||x^{k+1} − x^k|| ≤ (α_k/µ) u_k = β/µ, k < k̄, and

||x^k − x*|| ≤ ||x^{k̄} − x*|| + Σ_{i=k}^{k̄−1} ||x^{i+1} − x^i|| ≤ (β/µ)(k̄ − k + 2c),  k < k̄.   (18)

Note that this formula also coincides with the upper bound (17) at k = k̄. The exact number k̄ of steps in the first phase is not known, but we can replace it with the upper bound k_max in all estimates (15)–(18). Substituting β = µ²/L back, we arrive at the bounds (14) of Theorem 3.

Finally, we are to ensure our primal assumption that the algorithm-generated points x^k lie within B. This is guaranteed by one of two conditions, depending on whether the Algorithm starts with Stage 1 or Stage 2 steps.

In the first case ||P(x^0)|| ≥ µ²/L, and ||x^0 − x^k|| can be bounded similarly to (16):

||x^0 − x^k|| ≤ Σ_{i=0}^{k−1} ||x^{i+1} − x^i|| ≤ Σ_{i=0}^{k̄−1} ||x^{i+1} − x^i|| + Σ_{i=k̄}^{∞} ||x^{i+1} − x^i|| ≤ (µ/L)(k̄ + 2c) ≤ (µ/L)(k_max + 2c) = (µ/L)(⌈(2L/µ²)||P(x^0)||⌉ − 2 + 2c).

It is sufficient to satisfy ||P(x^0)|| ≤ (µ²/L)(1 + (1/2)⌊Lρ/µ − 2c⌋) to guarantee ||x^0 − x^k|| ≤ ρ. As we assumed ||P(x^0)|| ≥ µ²/L in this case, we conclude that this condition may hold only if Lρ/µ ≥ 2c. This results in (12b).

In the second case we have ||P(x^0)|| < µ²/L, and the algorithm makes pure Newton steps with α_k ≡ 1 from the beginning. Then k̄ = 0, and from (16) it follows that

||x^k − x^0|| ≤ (2µ/L)[H_0(L||P(x^0)||/(2µ²)) − H_k(L||P(x^0)||/(2µ²))] ≤ (2µ/L) H_0(L||P(x^0)||/(2µ²)),  k ≥ 0.

The inequality ||x^0 − x^k|| ≤ ρ, k ≥ 0, is satisfied if

||P(x^0)|| ≤ (2µ²/L) ∆(Lρ/(2µ)).

In order to have ||P(x^0)|| < µ²/L, the argument value Lρ/(2µ) must be less than c. This results in condition (12a). Altogether, (12) ensures x^k ∈ B, k ≥ 0, and the bounds (14). ⊓⊔
The result on the rate of convergence means, roughly speaking, that after no more than k_max iterations one has very fast (quadratic) convergence. For good initial approximations k_max = 0, and pure Newton steps are performed from the very start.

If we use the approximation bounds (7), then condition (12a) can be replaced with the simpler one

||P(x^0)|| ≤ (2µ²/L)(1 + 2µ/(Lρ))^{−1},  if Lρ/(2µ) ≤ c;   (19)

also (14d) can be roughly estimated as

||x^k − x*|| ≤ (2µ/L) · 1/(2^{(2^{k−k_max})} − 1),  k ≥ k_max,

or by the even simpler bound ||x^k − x*|| ≤ 2.32 · 2^{−2^{(k−k_max−1)}} µ/L.

Corollary 4 If ρ = ∞ (that is, conditions A′′, B′′ hold on the entire space R^n), then Algorithm 1 converges to a solution of (8) for any x^0 ∈ R^n.

4.2 Adaptive Newton method

Algorithm 1 explicitly uses two constants µ and L, but both enter the algorithm through the single parameter β = µ²/L. There is a simple modification which adaptively adjusts an estimate of this parameter. The input of the algorithm is an initial point x^0, an approximation β_0 and a parameter 0 < q < 1.

Algorithm 2 (Adaptive Newton method)

1. Calculate
   z^k = arg min {||z|| : P'(x^k) z = P(x^k)},
   α_k = min{1, β_k/||P(x^k)||},
   u_{k+1} = ||P(x^k − α_k z^k)||.
2. If either
   α_k < 1 and u_{k+1} < (1 − α_k/2) u_k,
   or
   α_k = 1 and u_{k+1} < (1/2) u_k
   holds, then go to Step 4. Otherwise
3. apply the update rule β_k ← q β_k and return to Step 1 without increasing the counter.
4. Take
   x^{k+1} = x^k − α_k z^k,
   set β_{k+1} = β_k, increase the counter k ← k + 1, and go to Step 1.
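A sketch of Algorithm 2 for Euclidean norms; the acceptance test and the update β_k ← qβ_k follow the listing above, while the inputs `P`, `dP`, `beta0`, `q`, the tolerances and the safeguard on the inner loop are assumptions of this illustration.

```python
import numpy as np

def adaptive_newton(P, dP, x0, beta0, q=0.5, tol=1e-10, max_iter=200):
    """Algorithm 2: Algorithm 1 with an adaptively adjusted estimate beta of mu^2/L."""
    x, beta = np.asarray(x0, dtype=float), float(beta0)
    u = np.linalg.norm(P(x))
    for _ in range(max_iter):
        if u < tol:
            break
        z = np.linalg.lstsq(dP(x), P(x), rcond=None)[0]
        for _ in range(60):                    # Steps 1-3: shrink beta until the test passes
            alpha = min(1.0, beta / u)
            u_new = np.linalg.norm(P(x - alpha * z))
            if (alpha < 1 and u_new < (1 - alpha / 2) * u) or (alpha == 1 and u_new < 0.5 * u):
                break
            beta *= q
        x, u = x - alpha * z, u_new            # Step 4: accept the step, keep the current beta
    return x
```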

Properties of Algorithm 2 are similar to those of Algorithm 1. We omit the formal proof of convergence; it follows the lines of the proof of Theorem 3, using the following properties:
– Algorithm 2 makes real steps at Step 4 and some number of fictitious steps resulting in the update rule of Step 3;
– β_k is a non-increasing sequence;
– if β_k < β, then Step 3 no longer occurs and β_k does not decrease anymore. This means that there are at most k̂ = max{0, ⌈log_{1/q}(β_0/β)⌉} check steps. The minimal possible value of β_k is then β_min = q^{k̂} β_0, and the number of Stage 1 steps is limited by k̂_max = max{0, ⌈2||P(x^0)||/β_min⌉ − 2} as well;
– if Step 4 is made with β_k > β due to the validity of a condition in Step 2, then ||P(x^{k+1})|| decreases more than at the corresponding step with the "optimal" step-size α_k = min{1, β/||P(x^k)||}.

Let us mention two other versions of the adaptive Newton method. The first one uses an increasing update (e.g. β_{k+1} = q_2 β_k with q_2 > 1) at the end of Step 4, thus adapting the constant to the current x^k. Other decrease policies can also be applied to β_k in Step 3.

The second option is to use line search or Armijo-like rules for choosing the step-size α_k to minimize the objective function ||P(x^k − α z^k)|| directly. A rigorous validation of these algorithms can be provided.

4.3 Method for L known

As we mentioned at the beginning of the section, we can use the better approximation (9) instead of (10). This results in an algorithm using the Lipschitz constant only, and it differs in the update step-size.

Algorithm 3 (L-Newton method)

z^k = arg min {||z|| : P'(x^k) z = P(x^k)},

x^{k+1} = x^k − min{1, ||P(x^k)||/(L ||z^k||²)} z^k,  k ≥ 0.

The algorithm is well defined, since due to Assumption B′′ the condition ||z^k|| = 0 holds only if P(x^k) = 0, i.e. the solution was found at the previous step. Then z^k = 0, α_k = 0 and x^{k+1} = x^k thereafter.
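For comparison, a sketch of a single step of Algorithm 3 in the Euclidean setting (only the Lipschitz constant `L` is needed; illustrative only):

```python
import numpy as np

def l_newton_step(P, dP, x, L):
    """One step of Algorithm 3: alpha_k = min(1, ||P(x_k)|| / (L ||z_k||^2))."""
    r = P(x)
    z = np.linalg.lstsq(dP(x), r, rcond=None)[0]
    nz = np.linalg.norm(z)
    if nz == 0.0:                  # P(x) = 0: a solution has already been found
        return x
    alpha = min(1.0, np.linalg.norm(r) / (L * nz**2))
    return x - alpha * z
```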
Theorem 3 is valid for Algorithm 3 with two exceptions. First, condition (12b) should be replaced with the condition

||P(x^0)|| ≤ (µ²/(2L)) ⌊(−1 + √(25 + 16(Lρ/µ − 2c)))/2⌋,  if Lρ/(2µ) > c.   (20)

Second, the upper bound (14b) should be replaced with

||x^k − x*|| ≤ (µ/L)[(k_max − k)(k_max − k + 5)/4 + 2c],  k < k_max.   (21)

An analog of Corollary 4 on everywhere convergence is valid for Algorithm 3 as well.

The proof follows the same lines as that of Theorem 3 and is omitted for brevity. The only notable difference of Algorithm 3 is the (possible) interlacing of Stage 1 and Stage 2 steps, which may lead to a larger step-size ||x^{k+1} − x^k|| ≤ u_k/µ for k ≤ k_max (cf. the formula prior to (18)). These larger steps result in the bound (21) on the distance, and eventually lead to the more conservative condition (20). Surprisingly enough, in practice the algorithm (and its adaptive variant) sometimes converges faster than Algorithm 1, because the direction-wise (along z^k) Lipschitz constant is less than or equal to the uniform Lipschitz constant of Assumption A′′, and the convergence rate can be better.

The idea of an adaptive algorithm with estimates L_k works for Algorithm 3 as well, including its modifications with increasing L_k and/or line search over α.

4.4 Pure Newton method

For comparison, we specify the convergence property of the pure Newton method (α_k = 1).

Theorem 4 Let conditions A′′, B′′ hold. If δ = (L/(2µ²)) ||P(x^0)|| < 1 and (2µ/L) H_0(δ) ≤ ρ, then the pure Newton method converges to a solution x* of (8), and

||P(x^k)|| ≤ (2µ²/L) δ^{(2^k)},  ||x^k − x*|| ≤ (2µ/L) H_k(δ).

This coincides with Corollary 1 of [16], proven in the Banach space setting (a misprint in [16] is corrected here). In the m = n case the result is a minor extension of Mysovskikh's theorem [7].

4.5 µ in a single point known

Most of the theorems above require the quite strict Assumption B′′ or similar ones. In fact, usually we can estimate µ at one point, say x^0, but not on the entire ball B. Initially, Kantorovich-type results on the Newton method were proved for such a "single-point condition" setup, and we obtain such results as a corollary of the more general case.

Let µ_0 satisfy condition (6) with A = P'(x^0). Following the same reasoning as in the proof of Theorem 2, we can estimate the constants µ on balls B_r of variable radius r:

µ(r) ≥ µ_0 − rL.

We added the dependence on r to indicate that for arbitrary r ∈ [0, µ_0/L) there is a corresponding constant µ > 0.

If an assumption of type A or A′′ holds on B_ρ, then Theorem 3 and its extensions can be rewritten in terms of µ(r) and r instead of ρ. This is done similarly to Corollary 3. Optimization over the interval r ∈ [0, µ_0/L) ∩ [0, ρ] gives the maximal allowed range of ||P(x^0)||. The simplest choice is r = µ_0/(2L); then we can take µ = µ_0/2. We omit the obvious versions of the Algorithms and convergence results for this case. Two examples of such theorems are given in Subsection 5.4, for the case where A′′, B′′ hold everywhere. We found that in this case both Algorithm 1 and Algorithm 3 are subject to the same upper bound and, in fact, start directly with Stage 2, i.e. with pure Newton steps.

5 Special Cases

In this section we outline a few important cases in more detail, namely solving equations with special structure, solving scalar equations or inequalities, and solvability of quadratic equations.

5.1 Structured problems

The problem is to solve P(x) = y, where P(x)_i = ϕ(c_i^T x), c_i ∈ R^n, i = 1, . . . , m. Here ϕ(t) is a twice differentiable scalar function with

|ϕ'(t)| ≥ µ_ϕ > 0,  |ϕ''(t)| ≤ M,  ∀t.

It is not hard to see that Assumptions A′′, B′′ hold on the entire space R^n and Algorithm 1 converges, with Theorem 3 and Corollary 4 providing the rate of convergence. The rate of convergence depends on estimates for L, µ, which can be written as functions of µ_ϕ, M and the singular values of the matrix C with rows c_i. However, the special structure of the problem allows us to get much sharper results.

Indeed, P'(x) = D(x) C with D(x) = diag(ϕ'(c_i^T x)), and the main recurrent inequality (10) can be shown to become

u_{k+1} ≤ (1 − α) u_k + γ α² u_k²/2,  γ = M/µ_ϕ².

Hence u_{k+1} ≤ u_k − 1/(2γ) at Stage 1, and this inequality does not depend on C! As a result, we get estimates for the rate of convergence which are the same for ill-conditioned and well-conditioned matrices C.

This example is just an illustrative one (an explicit solution of the problem can be found easily), but it emphasizes the role of special structure in the equations to be solved.
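For such structured problems the factorization P'(x) = D(x)C can be used directly when forming the linearized subproblem; a small sketch with an illustrative choice of ϕ (the particular ϕ below is an assumption, chosen only so that |ϕ'| is bounded away from zero and ϕ'' is bounded):

```python
import numpy as np

def structured_jacobian(C, x, dphi):
    """Jacobian of P(x)_i = phi(c_i^T x): P'(x) = diag(phi'(C x)) @ C."""
    return np.diag(dphi(C @ x)) @ C

# Illustrative phi: phi'(t) = 1 + 0.25 cos(t) lies in [0.75, 1.25], |phi''| <= 0.25.
phi  = lambda t: t + 0.25 * np.sin(t)
dphi = lambda t: 1.0 + 0.25 * np.cos(t)
```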

5.2 One-dimensional case

Suppose we solve one equation in n variables:

f(x) = 0,  f : R^n → R.

Here 0 is not assumed to be a minimal value of f, so this is not a minimization problem! Nevertheless, our algorithms will resemble some minimization methods. This case has some specific features compared with arbitrary m. For instance, the calculation of z^k may be done explicitly. The norm in the image space is the absolute value |·|, and ℓ_p norms in the pre-image space R^n, p ∈ {1, 2, ∞}, can be considered. Then

z^k = (f(x^k)/∇f(x^k)_i) e_i,  i = arg max_j |∇f(x^k)_j|,  in the case of the ℓ_1 norm,

z^k = (f(x^k)/||∇f(x^k)||_2²) ∇f(x^k),  in the case of the Euclidean norm,

z^k = (f(x^k)/||∇f(x^k)||_1) sign(∇f(x^k)),  in the case of the ℓ_∞ norm,

where e_j = (0, . . . , 0, 1, 0, . . . , 0)^T is the j-th coordinate vector, and sign(·) is the coordinate-wise sign function, sign : R^n → {−1, 1}^n.

The constants µ (and µ_0) are also calculated explicitly via the conjugate (dual) vector norm: µ = min_{x∈B} ||∇f(x)||_*, µ_0 = ||∇f(x^0)||_*. For any norms ||z^k|| = |f(x^k)|/||∇f(x^k)||_*, and a damped Newton step is performed iff ||∇f(x^k)||_*² < L|f(x^k)|; otherwise a pure Newton step is made.

If we choose the ℓ_1 norm, the method becomes a coordinate-wise one. Thus, if we start with x^0 = 0 and perform a few steps (e.g. we are in the domain of attraction of the pure Newton algorithm), we arrive at a sparse solution of the equation.
In the Euclidean case a Stage 1 step (damped Newton) of Algorithm 1 is

x^{k+1} = x^k − (1/L) sign(f(x^k)) ∇f(x^k),

which is exactly a gradient minimization step for the function |f(x)|. A Stage 2 (pure Newton) step is

x^{k+1} = x^k − (f(x^k)/||∇f(x^k)||_2²) ∇f(x^k).

This resembles the well-known subgradient method for minimization of convex functions. However, in our case we do not assume any convexity properties, and the direction may be either the gradient or the anti-gradient, in contrast with minimization methods!
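A sketch of these explicit scalar-case directions (the function name and interface are illustrative):

```python
import numpy as np

def scalar_direction(f_val, grad, norm="l2"):
    """Direction z^k with (grad, z) = f_val and minimal chosen norm of z."""
    if norm == "l2":
        return (f_val / np.dot(grad, grad)) * grad
    if norm == "l1":                       # coordinate step along the largest |gradient| entry
        i = np.argmax(np.abs(grad))
        z = np.zeros_like(grad)
        z[i] = f_val / grad[i]
        return z
    if norm == "linf":
        return (f_val / np.linalg.norm(grad, 1)) * np.sign(grad)
    raise ValueError("unsupported norm")
```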

5.3 Solving an inequality

The one-dimensional inequality

f(x) ≤ 0,  f : R^n → R,

can be efficiently solved as well. Denote the set of points where the inequality is violated by S = {x : f(x) > 0}. Suppose that µ = min_{x∈S} ||∇f(x)||_* > 0 and L is a Lipschitz constant for ∇f(x) on S. Then Algorithm 1 can be applied while x^k ∈ S, with the only change that if f(x^k) ≤ 0, then we have obtained a solution and the algorithm stops. Thus we arrive at the following method (for the ℓ_2 norm):

1. If f(x^k) ≤ 0, then stop: a solution is found;
2. If ||∇f(x^k)||_2² < L f(x^k), then x^{k+1} = x^k − (1/L) ∇f(x^k);
3. Otherwise x^{k+1} = x^k − (f(x^k)/||∇f(x^k)||_2²) ∇f(x^k); increase k and return to Step 1.
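A direct sketch of this three-step method (the Lipschitz constant `L`, the starting point and the iteration cap are assumed inputs):

```python
import numpy as np

def solve_inequality(f, grad_f, x0, L, max_iter=500):
    """Find x with f(x) <= 0 by damped / pure Newton steps on the scalar residual."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if fx <= 0:                                  # Step 1: inequality satisfied, stop
            return x
        g = grad_f(x)
        if np.dot(g, g) < L * fx:                    # Step 2: damped (Stage 1) step
            x = x - g / L
        else:                                        # Step 3: pure Newton (Stage 2) step
            x = x - (fx / np.dot(g, g)) * g
    return x
```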

Again, Stage 2 is similar to a well-known method for solving convex inequalities; the main differences are the necessity of Stage 1 and the lack of a convexity assumption. The method converges globally under the assumptions formulated above.

Note that solving the inequality f(x) − f* − ε ≤ 0 is equivalent to minimization of a function f with known minimal value f* to a desired accuracy ε, again without a convexity assumption. We recall that the convergence rate of the algorithms is quadratic.

5.4 Quadratic equations

We proceed to a specific nonlinear equation, namely the quadratic one. Then the function g(x) may be written componentwise as (4), with gradients

∇g_i(x) = A_i x + b_i ∈ R^n,  i = 1, . . . , m.

Obviously g(0) = 0; the question is the solvability of g(x) = y. There are some results on the construction of the entire set of feasible points Y = {y : g(x) = y} = g(R^n), including its convexity, see e.g. [17]. We focus on local solvability, trying to derive the largest ball inscribed in Y.
The derivative matrix g'(x) is formed row-wise as

g'(x) = [∇g_1(x)^T; . . . ; ∇g_m(x)^T] = [x^T A_1 + b_1^T; . . . ; x^T A_m + b_m^T] ∈ R^{m×n}.

One has g'(0) = H, H being the m × n matrix with rows b_i^T. We suppose H has rank m (recall that m ≤ n); then its smallest singular value σ_m(H) > 0 serves as µ_0.

The derivative g'(x) is linear in x, meaning it has a uniform Lipschitz constant L on R^n, thus assumption A holds everywhere. There are several estimates for the Lipschitz constant, for example (for the ℓ_2 norm)

L ≤ L_1 = √(λ_max(Σ_{i=1}^m A_i^T A_i))

from [18], where λ_max denotes the maximal eigenvalue of a matrix. Other estimates can be obtained via an elaborate convex semidefinite optimization problem (SDP), cf. [22] for details. We obtain the following consequence of Theorem 2 for Euclidean norms.

Theorem 5 Suppose that the matrix H has rank m, and µ_0 > 0 is its smallest singular value. For the quadratic function g, the equation g(x) = y has a solution x*(y) for all

||y|| < µ_0²/(4L),   (22)

with ||x*(y)|| ≤ µ_0/(2L).
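A small numerical sketch of the quantities entering Theorem 5 for randomly generated data (σ_m(H) via the SVD and the estimate L_1 for the Lipschitz constant; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 6
A = [0.5 * (M + M.T) for M in rng.standard_normal((m, n, n))]   # symmetric A_i
H = rng.standard_normal((m, n))                                  # rows b_i, i.e. g'(0)

mu0 = np.linalg.svd(H, compute_uv=False)[-1]                     # sigma_m(H)
L1 = np.sqrt(np.linalg.eigvalsh(sum(Ai.T @ Ai for Ai in A)).max())

radius = mu0**2 / (4 * L1)   # every y with ||y|| < radius is feasible by Theorem 5
print(radius)
```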

Quadratic equations play a significant role in power system analysis, because power flow equations are quadratic, see [10, 11]. It is of interest to compare our estimate (22) with some known results on the solvability of power flow equations [2, 24].

As for Algorithms 1–3, we can evaluate the bounds explicitly, in the manner of Theorem 2 and Corollary 3.
Theorem 6 Suppose that the matrix H has rank m, and µ_0 > 0 is its smallest singular value. Then Algorithms 1 and 3 converge to a solution of (1) if

||y|| ≤ s_1 µ_0²/L,  s_1 ≈ 0.1877178,

with ||x*(y)|| ≤ t_1 µ_0/L, t_1 ≈ 0.40100511, where the constants s_1 and t_1 are the maximum and the maximizer of the function S(t) = 2(1 − t)² ∆(t/(2(1 − t))), t ∈ [0, 1/2].

Proof As ρ = ∞, Theorem 3 is valid for any r ≤ µ_0/L with µ(r) = µ_0 − Lr. We are to find a value of r with the maximal allowed bound (12) on ||P(x^0)||.

First assume that r ≤ (2c/(2c + 1)) µ_0/L ≈ 0.62 µ_0/L. Then the function S(t) is a representation of the upper bound (12a), normalized by the multiplier µ_0²/L and written via the variable t = (L/µ_0) r. The function is unimodal with maximum s_1 = S(t_1). A direct check of the second case r ∈ ((2c/(2c + 1)) µ_0/L, µ_0/L] and the corresponding bounds (12b), (20), with the substitution µ ← µ(r), ρ ← r, reveals a bound on ||P(x^0)|| smaller than s_1 µ_0²/L.

Optimization over the approximation (19) instead of (12a) results in the smaller constant s_2 = 5√5 − 11 ≈ 0.18034.

We compared all the constants and noticed that numerically the value s_1 is slightly greater than 3/4 of the maximal range (22) of Theorem 5. As a result, we propose to estimate solvability of (1) in the case of quadratic functions by the inequality

||y|| ≤ (3/16) µ_0²/L.

5.5 Solving systems of inequalities

We have discussed above how Newton-like methods can be applied for solving
one inequality. Below we address some tricks to convert systems of inequalities
into systems of equations.

First, if one seeks a solution of a system of inequalities

g_i(x) ≤ 0,  i = 1, . . . , m,  x ∈ R^ℓ,

with m ≤ 2n, then by introducing slack variables the problem is reduced to the solution of the underdetermined system of equations

g_i(x) + x_{ℓ+i}² = 0,  i = 1, . . . , m,  x ∈ R^n,  n = ℓ + m.

Similarly, finding a feasible point for the linear constraints x ≥ 0, Ax = b, x ∈ R^n, b ∈ R^m can be transformed into the underdetermined system

Σ_{j=1}^n A_{i,j} z_j² = b_i,  i = 1, . . . , m,  z ∈ R^n.

The efficiency of such reductions is unclear a priori and should be checked by intensive numerical study.
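A sketch of the first (slack-variable) reduction, building the augmented map and its Jacobian so that the algorithms above can be applied; `g` and `dg` are assumed callables for the original inequalities.

```python
import numpy as np

def augmented_system(g, dg, ell):
    """Turn g_i(x) <= 0, x in R^ell, into P(v) = 0 with v = (x, s) and P_i(v) = g_i(x) + s_i^2."""
    def P(v):
        x, s = v[:ell], v[ell:]
        return g(x) + s**2
    def dP(v):
        x, s = v[:ell], v[ell:]
        return np.hstack([dg(x), 2.0 * np.diag(s)])   # m x (ell + m) Jacobian
    return P, dP
```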

6 Numerical tests

At present we have numerous results of numerical simulation for various test problems. We plan to present them in a separate publication. Here we restrict ourselves to a single example to demonstrate how the methods work for medium-size problems (n = 60, m = 21). The equations have the special structure of Section 5.1:

P_i(x) = φ((c_i, x) − b_i) − y_i,  x ∈ R^n,  y ∈ R^m,

where

φ(t) = t/(1 + e^{−|t|}),  φ'(t) = (1 + (1 + |t|)e^{−|t|})/(1 + e^{−|t|})².

The matrix C with rows c_i and the vectors b, y were generated randomly. For the function φ(t) we have µ_φ = min_t φ'(t) ≥ 0.5 and M = max_t |φ''(t)| ≤ 2. Thus, if we do not pay attention to the special structure of the problem, we have µ ≥ 0.5 σ_min(C), L ≤ 2 σ_max(C), and the convergence can be slow because the matrix C can be ill-conditioned. On the other hand, if we take the structure into account and replace µ²/L (= 0.0031 in the example) in Algorithm 3 (see Subsection 4.3) with µ_φ²/M = 0.125, much faster convergence is achieved. The simulation results in Figure 1 confirm this conclusion. Most iterations are generally spent in the first phase of the Algorithm, and their number is close to the theoretical bound N ≈ k_max(β), cf. (13).

We also tested the adaptive Algorithm 2 from the same initial point, and it performed better for a bigger initial β_0 = 5, as shown in Figure 2.

Fig. 1 ||P(x^k)|| of Algorithm 1 using constants µ, L (left) and constants µ_ϕ, M (right).

Fig. 2 ||P(x^k)|| of Algorithm 2 using initial approximation β_0 = 5.

7 Conclusions and future research

New solvability conditions for underdetermined equations (with a wider solvability set) are proposed. The algorithms for finding a solution are easy to implement; they combine weaker assumptions on the initial approximation with a fast convergence rate. No convexity assumptions are required. The algorithms have great flexibility in using prior information, various norms and problem structure. It is worth mentioning that we do not try to convert the problem into an optimization one. The combination of damped and pure Newton steps is a contribution to solving classical n = m problems as well.
There are numerous directions for future research.
1. We suppose that the auxiliary optimization problem for finding the direction z^k is solved exactly. Of course, an approximate solution of the sub-problem would suffice.
2. The algorithms provide a solution of the initial problem which is not specified a priori. Sometimes we are interested in the solution closest to x^0, i.e. min_{P(x)=0} ||x − x^0||. An algorithm for this purpose is of interest.
3. A more general theory of structured problems (Section 5.1) is needed.
4. It is not obvious how to introduce regularization techniques into the algorithms.

Acknowledgements Authors thank Yuri Nesterov for helpful discussions and references.

References

1. Ben-Israel, A.: A Newton-Raphson method for the solution of systems of equations. J. Mathematical Analysis and Applications. 15, 243–252 (1966)
2. Bolognani, S., Zampieri, S.: On the existence and linear approximation of the power flow
solution in power distribution networks. IEEE Transactions on Power Systems. 31(1),
163–172 (2016)
3. Dennis, J.E., Schnabel, R.B.: Numerical Methods for Unconstrained Optimization and
Nonlinear Equations. SIAM, Philadelphia (1996)
4. Dontchev, A.L.: The Graves theorem revisited. J. of Convex Analysis. 3(1), 45–53 (1996)
5. Graves, L.M.: Some mapping theorems. Duke Math. J. 17, 111–114 (1950)
6. Kantorovich, L.V.: The method of successive approximations for functional analysis.
Acta. Math. 71, 63–97 (1939)
7. Kantorovich, L.V., Akilov, G.P.: Functional Analysis. 2nd ed. Pergamon Press, Oxford
(1982)
8. Kelley, C.T.: Solving Nonlinear Equations with Newton’s Method. SIAM, Philadelphia
(2003)
9. Levin, Y., Ben-Israel, A.: A Newton method for systems of m equations in n variables.
Nonlinear Analysis. 47, 1961–1971 (2001)
10. Low, S.H.: Convex relaxation of optimal power flow, part I: Formulations and equiva-
lence. IEEE Trans. on Control of Network Systems. 1(1), 15–27 (2014). Part II: Exact-
ness. ibid. 1(2), 177–189 (2014)
11. Machowski, J., Bialek, J., Bumby, J.: Power System Dynamics. Stability and Control.
2nd ed. John Wiley & Sons Ltd. (2012)
12. Magaril-Il’yaev, G.G., Tikhomirov, V.M.: Newton’s method, differential equations and
the Lagrangian principle for necessary extremum conditions. Proc. Steklov Inst. Math.
262, 149–169 (2008)
13. Nesterov, Yu., Nemirovskii, A.: Interior-point Polynomial Algorithms in Convex Pro-
gramming. SIAM, Philadelphia (1994)
14. Nesterov, Yu.: Modified Gauss-Newton scheme with worst case guarantees for global
performance. Optimization Methods and Software. 22(3), 469–483 (2007)
15. Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several
Variables. SIAM, Philadelphia (2000)
16. Polyak, B.T.: Gradient methods for solving equations and inequalities. USSR Compu-
tational Mathematics and Mathematical Phys. 4(6), 17–32 (1964)
17. Polyak, B.T.: Quadratic transformations and their use in optimization. J. of Optimiza-
tion Theory and Applications. 99(3), 553-583 (1998)
18. Polyak, B.T.: Convexity of nonlinear image of a small ball with applications to opti-
mization. Set-Valued Analysis. 9, 159–168 (2001)
19. Polyak, B.T.: Newton-Kantorovich method and its global convergence. J. Mathematical
Sciences. 133(4), 1513–1523 (2006)
20. Prusinska, A., Tret’yakov, A.A.: On the existence of solutions to nonlinear equations
involving singular mappings with non-zero p-kernel. Set Valued Analysis. 19, 399–416
(2011)
21. Walker, H.F.: Newton-like methods for underdetermined systems. In: Allgower, E.L.,
Georg K. (eds.) Computational Solution of Nonlinear Systems of Equations, Lecture
Notes in Applied Mathematics. 26, pp. 679–699, AMS, Providence, RI (1990)
22. Xia, Y.: On local convexity of quadratic transformations. J. of the Operations Research
Society of China. 8(2), 341–350 (2014)
23. Yamamoto, T.: Historical developments in convergence analysis for Newton’s and
Newton-like methods. J. Computational Appl. Math. 124, 1–23 (2000)

24. Yu, S., Nguyen, H.D., Turitsyn, K.S.: Simple certificate of solvability of power flow
equations for distribution systems. Power & Energy Society General Meeting, IEEE
(2015)
