Nonlinear Optimization: Overview of Methods. The Newton Method With Line Search
© 2007 Niclas Börlin, CS, Umeå University
Most deterministic methods for unconstrained optimization have the following features:
- They are iterative, i.e. they start with an initial guess x_0 of the variables and try to find better points {x_k}, k = 1, 2, ....
- They are descent methods, i.e. at each iteration k, f(x_{k+1}) < f(x_k) is (at least) required.
- At each iteration k, the nonlinear objective function f is replaced by a simpler model function m_k that approximates f around x_k. The next iterate x_{k+1} = x_k + p is sought as the minimizer of m_k.
The model function m_k is usually defined to be a quadratic function of the form

m_k(x_k + p) = f_k + p^T ∇f_k + (1/2) p^T B_k p,

where f_k = f(x_k), ∇f_k = ∇f(x_k), and B_k is a matrix, usually a positive definite approximation of the Hessian ∇²f(x_k). If B_k is positive definite, a minimizer of m_k may be found by solving ∇_p m_k(x_k + p) = 0 for p. If the minimizer of m_k does not produce a better point, the step p is modified to produce a point x_{k+1} = x_k + p that is better. The modifications come in two major flavours: line search and trust-region.
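As a concrete illustration, here is a minimal Python sketch (not from the original slides) of minimizing the quadratic model by solving B_k p = −∇f_k; the numeric values of ∇f_k and B_k are assumed purely for the example.

```python
import numpy as np

def model_minimizer(grad_fk, Bk):
    """Step p minimizing m_k(x_k + p) = f_k + p^T grad_fk + 0.5 p^T Bk p.

    Setting the gradient of the model to zero gives Bk p = -grad_fk,
    which has a unique solution when Bk is positive definite.
    """
    return np.linalg.solve(Bk, -grad_fk)

# Assumed values for illustration only.
grad_fk = np.array([2.0, -1.0])
Bk = np.array([[4.0, 1.0],
               [1.0, 3.0]])          # symmetric positive definite
p = model_minimizer(grad_fk, Bk)     # step to the model minimizer x_k + p
print(p)
```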
Line search
In the line search strategy, the algorithm chooses a search direction p_k and tries to solve the following one-dimensional minimization problem:

min_{α > 0} f(x_k + α p_k).

In theory we would like optimal step lengths, but in practice it is more efficient to test trial step lengths until we find one that gives us a good enough point.
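For the special case of a quadratic objective, the optimal step length is available in closed form, which the following sketch illustrates; the matrix A, vector b, and starting point are assumed values for illustration only. For a general nonlinear f no such formula exists, which is why trial step lengths are used in practice.

```python
import numpy as np

def exact_step_quadratic(A, b, x, p):
    """Exact minimizer of phi(alpha) = f(x + alpha p) for f(x) = 0.5 x^T A x - b^T x."""
    g = A @ x - b                     # gradient of the quadratic at x
    return -(p @ g) / (p @ A @ p)     # alpha* = -p^T g / (p^T A p)

A = np.array([[3.0, 0.5], [0.5, 2.0]])
b = np.array([1.0, 1.0])
x = np.array([2.0, -1.0])
p = -(A @ x - b)                      # steepest-descent direction
alpha = exact_step_quadratic(A, b, x, p)
print(alpha, x + alpha * p)
```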
Trust-region
In the trust-region strategy, the algorithm defines a region of trust around x_k where the current model function m_k is trusted. The region of trust is usually defined as

‖p‖₂ ≤ Δ,

where the scalar Δ is called the trust-region radius. A candidate step p is found by approximately solving the following subproblem:

min_p m_k(x_k + p)  s.t.  ‖p‖₂ ≤ Δ.

If the candidate step does not produce a good enough new point, we shrink the trust-region radius and re-solve the subproblem.
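The slides do not prescribe a particular subproblem solver; as one possible sketch, the Cauchy point minimizes the model along the steepest-descent direction within the trust region:

```python
import numpy as np

def cauchy_point(grad_fk, Bk, delta):
    """Approximate trust-region step: model minimizer along -grad_fk with ||p|| <= delta."""
    gnorm = np.linalg.norm(grad_fk)
    ps = -(delta / gnorm) * grad_fk          # step to the trust-region boundary
    gBg = grad_fk @ Bk @ grad_fk
    if gBg <= 0.0:
        tau = 1.0                            # model decreases all the way to the boundary
    else:
        tau = min(gnorm**3 / (delta * gBg), 1.0)
    return tau * ps                          # candidate step with ||p|| <= delta

# If f(x_k + p) does not decrease enough compared to the model prediction,
# the radius delta is shrunk (e.g. halved) and the subproblem is re-solved.
```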
In the line search strategy, the direction is chosen first, followed by the distance. In the trust-region strategy, the maximum distance is chosen first, followed by the direction.
Convergence rate
In order to compare different iterative methods, we need an efficiency measure. Since we do not know the number of iterations in advance, the computational complexity measure used by direct methods cannot be used. Instead, the concept of a convergence rate is defined.

Assume we have a sequence {x_k} that converges to a solution x*. Define the sequence of errors as e_k = x_k − x* and note that lim_{k→∞} e_k = 0. We say that the sequence {x_k} converges to x* with rate r and rate constant C if

lim_{k→∞} ‖e_{k+1}‖ / ‖e_k‖^r = C

and C < ∞.
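As a numerical illustration (with an assumed error sequence), the ratios ‖e_{k+1}‖ / ‖e_k‖^r reveal the convergence rate:

```python
import numpy as np

# Assumed tail of an error sequence ||e_k|| for illustration.
err = np.array([1e-1, 1e-2, 1e-4, 1e-8, 1e-16])

for r in (1, 2):
    ratios = err[1:] / err[:-1] ** r
    print(f"r = {r}: ratios -> {ratios}")
# For r = 1 the ratios tend to 0 (super-linear); for r = 2 they stay near the
# finite constant C = 1, indicating quadratic convergence.
```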
We speak of:
- linear convergence, for r = 1 and 0 < C < 1;
- quadratic convergence, for r = 2;
- super-linear convergence, for r = 1 and C = 0.

For r = 1, C = 0.1 and ‖e_0‖ = 1, the norm of the error sequence becomes 1, 10⁻¹, 10⁻², ..., reaching 10⁻⁷ after 7 iterations. Thus the constant C is of major importance for a method with linear convergence.
For r = 2, C = 3 and ‖e_0‖ = 0.1, the sequence becomes 0.1, 0.03, 0.0027, ..., i.e. it converges despite C > 1. For quadratic convergence, the constant C is thus of lesser importance. Instead it is important that the initial approximation is close enough to the solution, i.e. that ‖e_0‖ is small.

Local vs. global convergence

A method is called locally convergent if it produces a sequence converging toward a minimizer x* provided a close enough starting approximation. A method is called globally convergent if it produces a sequence converging toward a minimizer x* from any starting approximation. Note that global convergence does not imply convergence toward a global minimizer.
Globalization strategies

The line search and trust-region methods are sometimes called globalization strategies, since they modify a core method (typically only locally convergent) to become globally convergent. There are two efficiency requirements on any globalization strategy:
- Far from the solution, they should stop the method from going out of control.
- Close to the solution, where the core method is efficient, they should interfere as little as possible.

Descent directions

Consider the Taylor expansion of the objective function along a search direction p,

f(x_k + αp) = f(x_k) + α p^T ∇f_k + (1/2) α² p^T ∇²f(x_k + tp) p,

for some t ∈ (0, α). Any direction p such that p^T ∇f_k < 0 will produce a reduction of the objective function for a short enough step. A direction p such that p^T ∇f_k < 0 is called a descent direction.
Since cos θ = −p^T ∇f_k / (‖p‖ ‖∇f_k‖), where θ is the angle between the search direction and the negative gradient, descent directions lie in the same half-plane as the negative gradient. The search direction corresponding to the negative gradient, p = −∇f_k, is called the direction of steepest descent.
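A small sketch of checking the descent condition and the angle to the negative gradient, with assumed values for p and ∇f_k:

```python
import numpy as np

grad_fk = np.array([1.0, 2.0])        # assumed gradient at x_k
p = np.array([-1.0, -1.0])            # assumed trial direction

is_descent = p @ grad_fk < 0          # True: p is a descent direction
cos_theta = -(p @ grad_fk) / (np.linalg.norm(p) * np.linalg.norm(grad_fk))
steepest = -grad_fk                   # direction of steepest descent
print(is_descent, cos_theta)
```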
Line search
Each iteration of a line search method computes a search direction p_k and then decides how far to move along that direction. The next iterate is given by x_{k+1} = x_k + α_k p_k, where the step length α_k > 0 is chosen along p_k.

We will require p_k to be a descent direction. This ensures that the objective function decreases, f(x_k + α_k p_k) < f(x_k), for some small enough α_k > 0. Define φ(α) = f(x_k + α p_k).

Ideally we would like to find the global minimizer of φ in every iteration. This is called an exact line search. However, it is possible to construct inexact line search methods that produce an adequate reduction of f at minimal cost. Inexact line search methods construct a number of candidate values for α and stop when certain conditions are satisfied.
Mathematically, the descent condition f(x_k + α p_k) < f(x_k) is not enough to guarantee convergence. Instead, the sufficient decrease condition is formulated from the linear Taylor approximation of φ(α), i.e. φ(α) ≈ φ(0) + α φ'(0), or

f(x_k + α p_k) ≈ f(x_k) + α ∇f_k^T p_k.

The sufficient decrease condition states that the new point must produce at least a fraction 0 < c_1 < 1 of the decrease predicted by the Taylor approximation, i.e.

f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k.
Backtracking
The sufficient decrease condition alone is not enough to guarantee convergence, since it is satisfied for arbitrarily small values of α. The sufficient decrease condition therefore has to be combined with a strategy that favours large step lengths over small ones. A simple such strategy is called backtracking: accept the first element of the sequence

1, 1/2, 1/4, ..., 2^{-i}, ...

that satisfies the sufficient decrease condition. Such a step length always exists, and large step lengths are tested before small ones, so the step length will not be too small. This technique works well for Newton-type algorithms.
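A minimal sketch of backtracking against the sufficient decrease condition; the callables f and grad_f and the constant c1 = 1e-4 are assumptions for the example:

```python
import numpy as np

def backtracking(f, grad_f, xk, pk, c1=1e-4):
    """Halve the step length until the sufficient decrease condition holds."""
    fk = f(xk)
    slope = grad_f(xk) @ pk            # directional derivative phi'(0), < 0 for descent
    alpha = 1.0
    while f(xk + alpha * pk) > fk + c1 * alpha * slope:
        alpha /= 2.0                   # try 1, 1/2, 1/4, ...
    return alpha
```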
The Curvature Condition

Another way to rule out unacceptably short steps is to (approximately) solve φ'(α) = 0, which is relaxed to the condition

|φ'(α_k)| ≤ c_2 |φ'(0)|,

where c_2 is a constant with c_1 < c_2 < 1. Since φ'(α) = p_k^T ∇f(x_k + α p_k), we get

|p_k^T ∇f(x_k + α_k p_k)| ≤ c_2 |p_k^T ∇f(x_k)|.
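A small sketch of testing the curvature condition for a trial step length; grad_f is an assumed user-supplied gradient and c2 = 0.9 is a typical value for Newton-type methods:

```python
import numpy as np

def curvature_ok(grad_f, xk, pk, alpha_k, c2=0.9):
    """Check |phi'(alpha_k)| <= c2 * |phi'(0)| along the direction pk."""
    dphi_0 = grad_f(xk) @ pk                      # phi'(0)
    dphi_a = grad_f(xk + alpha_k * pk) @ pk       # phi'(alpha_k)
    return abs(dphi_a) <= c2 * abs(dphi_0)
```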
The Wolfe Conditions

The sufficient decrease condition and the curvature condition together, with constants 0 < c_1 < c_2 < 1, are collectively called the strong Wolfe conditions. Step length methods that use the Wolfe conditions are more complicated than backtracking. Several popular implementations of nonlinear optimization routines are based on the Wolfe conditions, notably the BFGS quasi-Newton method.

The Newton-Raphson method in ℝ

Consider the nonlinear problem f(x) = 0, where f: ℝ → ℝ. The Newton-Raphson method for solving this problem is based on the linear Taylor approximation of f around x_k,

f(x_k + p) ≈ f(x_k) + p f'(x_k).

If f'(x_k) ≠ 0, we solve the linear equation f(x_k) + p f'(x_k) = 0 for p and get p = −f(x_k)/f'(x_k).
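A minimal sketch of the Newton-Raphson iteration in one variable; f and df are assumed user-supplied callables:

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=50):
    """Iterate x_{k+1} = x_k - f(x_k)/f'(x_k) until |f(x_k)| is small."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x -= fx / df(x)               # p = -f(x_k) / f'(x_k)
    return x

# Example: the square root of 2 as the root of f(x) = x^2 - 2.
print(newton_raphson(lambda x: x**2 - 2, lambda x: 2*x, 1.0))
```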
The Classical Newton minimization method in ℝⁿ

In order to use Newton's method to find a minimizer, we apply it to the first-order necessary condition on the function f,

∇f(x) = 0   (in one variable, f'(x) = 0).

This results in the Newton sequence

x_{k+1} = x_k − (∇²f(x_k))^{-1} ∇f(x_k)   (in one variable, x_{k+1} = x_k − f'(x_k)/f''(x_k)).

This is often written as x_{k+1} = x_k + p_k, where p_k is the solution of the Newton equation

∇²f(x_k) p_k = −∇f(x_k).

This formulation emphasizes that a linear system of equations is solved at each step, usually by other means than calculating an inverse.

Geometrical interpretation; the model function

The approximation of the nonlinear function ∇f(x) by the polynomial, linear in p,

∇f(x_k + p) ≈ ∇f(x_k) + ∇²f(x_k) p,

corresponds to approximating the nonlinear function f(x) by the quadratic (in p) Taylor expansion

m_k(x_k + p) = f_k + p^T ∇f_k + (1/2) p^T ∇²f(x_k) p,

i.e. B_k = ∇²f(x_k). Newton's method can thus be interpreted as follows: at each iteration k, f is approximated by the quadratic Taylor expansion m_k around x_k, and x_{k+1} is calculated as the minimizer of m_k.
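A minimal sketch of the classical Newton iteration for minimization, solving the Newton equation as a linear system at each step; grad_f and hess_f are assumed user-supplied callables:

```python
import numpy as np

def classical_newton(grad_f, hess_f, x0, tol=1e-8, max_iter=50):
    """Full-step Newton iteration: solve hess_f(x) p = -grad_f(x) each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        p = np.linalg.solve(hess_f(x), -g)   # Newton equation, no explicit inverse
        x = x + p                            # full step, no line search
    return x
```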
Properties of the Newton method

Disadvantages:
- It does not necessarily converge toward a minimizer.
- It may diverge if the starting approximation is too far from the solution.
- It will fail if ∇²f(x_k) is not invertible for some k.
- It requires second-order information, ∇²f(x_k).
Newton's method is rarely used in its classical formulation. However, many methods may be seen as approximations of Newton's method.
Ensuring a descent direction

If ∇²f_k is not positive definite, the Newton direction p^N may not be a descent direction. In that case we choose B_k as a positive definite approximation of ∇²f_k. Performed in a proper way, this modified algorithm will converge toward a minimizer. Furthermore, close to the solution the Hessian is usually positive definite, so the modification will typically only be performed far from the solution.
The positive definite approximation B_k of the Hessian may be found with minimal extra effort. The search direction p is calculated as the solution of ∇²f(x) p = −∇f(x). To solve this system, a factorization

∇²f(x) = L D L^T

may be used, where the diagonal elements of D are positive. If ∇²f(x) is not positive definite, at some point during the factorization a diagonal element will satisfy d_ii ≤ 0. In this case, the element may be replaced by a suitable positive entry. Finally, the modified factorization is used to calculate the search direction from

(L D L^T) p = −∇f(x).

The modified Newton algorithm with line search

At each iteration k: if ‖∇f(x_k)‖ < ε, stop. Otherwise, compute the modified LDL^T factorization of the Hessian, solve

(L D L^T) p_k^N = −∇f(x_k)

for the search direction p_k^N, perform a line search along p_k^N to obtain a step length α_k, and set x_{k+1} = x_k + α_k p_k^N.
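A sketch of a modified Newton method with a backtracking line search. Instead of the modified LDL^T factorization described above, this sketch uses a simpler stand-in: adding a multiple of the identity to the Hessian until a Cholesky factorization succeeds, which likewise produces a positive definite B_k. The callables f, grad_f and hess_f are assumed to be user-supplied:

```python
import numpy as np

def make_positive_definite(H, beta=1e-3):
    """Return H + tau*I with tau >= 0 chosen so the result is positive definite."""
    tau = 0.0 if np.all(np.diag(H) > 0) else beta
    while True:
        try:
            np.linalg.cholesky(H + tau * np.eye(H.shape[0]))
            return H + tau * np.eye(H.shape[0])
        except np.linalg.LinAlgError:
            tau = max(2 * tau, beta)         # increase the shift and retry

def modified_newton(f, grad_f, hess_f, x0, tol=1e-8, max_iter=100, c1=1e-4):
    """Modified Newton method with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:          # stopping test
            break
        B = make_positive_definite(hess_f(x))
        p = np.linalg.solve(B, -g)           # positive definite B gives a descent direction
        alpha, fx = 1.0, f(x)
        while f(x + alpha * p) > fx + c1 * alpha * (g @ p):
            alpha /= 2.0                     # backtracking on the sufficient decrease condition
        x = x + alpha * p
    return x
```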