Numerical Optimization Using Computer Experiments: NASA Contractor Report 201724


NASA Contractor Report 201724

ICASE Report No. 97-38

NUMERICAL OPTIMIZATION USING COMPUTER EXPERIMENTS

Michael W. Trosset
Virginia Torczon

NASA Contract No. NAS1-19480
August 1997

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center
Hampton, VA 23681-0001

Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, Virginia 23681-0001

Numerical Optimization Using Computer Experiments*

Michael W. Trosset
Adjunct Associate Professor
Department of Computational & Applied Mathematics
Rice University
Houston, TX

Virginia Torczon
Assistant Professor
Department of Computer Science
College of William & Mary
Williamsburg, VA

Abstract

Engineering design optimization often gives rise to problems in which expensive objective functions are minimized by derivative-free methods. We propose a method for solving such problems that synthesizes ideas from the numerical optimization and computer experiment literatures. Our approach relies on kriging known function values to construct a sequence of surrogate models of the objective function that are used to guide a grid search for a minimizer. Results from numerical experiments on a standard test problem are presented.

*This research was supported by the Air Force Office of Scientific Research (AFOSR) under Grant No. F49620-95-1-0210 and also in part by the NASA Contract Number NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA 23681-0001.

1 Introduction

We consider the problem of minimizing an objective function f : R^p -> R subject to bound constraints, i.e.

    minimize    f(x)
    subject to  x ∈ [a, b],                                           (1)

where a, x, b ∈ R^p and we write x ∈ [a, b] to denote a_i <= x_i <= b_i for i = 1, ..., p.

We are concerned with special cases of Problem (1) for which evaluation of the objective function involves performing one (or more) complicated, deterministic computer simulation(s). Many such problems arise as engineering design problems and are often distinguished by two troubling characteristics that preclude solution by traditional algorithms for bound-constrained optimization.

First, the output of a complicated computer simulation is usually affected by a great many approximation, rounding, and truncation errors. These errors are not stochastic (repeating the simulation will reproduce them), but their accumulation introduces high-frequency, low-amplitude distortions of the idealized objective that we would have liked to optimize. In consequence, optimization algorithms that compute or approximate (by finite differencing) derivatives of f often fail to exploit general trends in the objective function and become trapped in local minimizers created by high-frequency oscillations. In order to develop effective algorithms for such applications, we restrict attention to derivative-free methods for numerical optimization.

Second, complicated computer simulations are often expensive to perform. Frank (1995) suggested that one must address problems in which a typical function evaluation costs several hours of supercomputer time. For example, Booker (1996) and Booker et al. (1996) studied an aeroelastic and dynamic response simulation of a helicopter rotor blade for which a single function evaluation requires approximately six hours of cpu time on a Cray Y-MP. We formalize the notion that the objective function is expensive to evaluate by imposing an upper bound V on the number of evaluations of f that we are allowed to perform. The severity of this restriction will depend (in part) on the relation between V and p.

When attempting to minimize an objective function f that is too expensive for standard numerical optimization algorithms to succeed, it has long been a standard engineering practice, described by Barthelemy and Haftka (1993), to replace f with an inexpensive surrogate f^ and minimize f^ instead. (For example, one might evaluate f at V - 1 carefully selected sites, construct f^ from the resulting information, use a standard numerical optimization algorithm to minimize f^, and evaluate f at the candidate minimizer thus obtained.) This practice may also have the salutary effect of smoothing high-frequency oscillations in f. The rapidly growing literature on computer experiments offers new and potentially better ways of implementing this traditional practice. The prescription that seems to be gaining some currency in the engineering community was proposed by Welch and Sacks (1991); following current convention, we refer to it as DACE (Design and Analysis of Computer Experiments). Frank (1995) offered an optimizer's perspective on this methodology, suggested that the "minimalist approach" of minimizing a single f^ is not likely to yield satisfactory results, and proposed several sequential modeling strategies as alternatives. Booker (1996) studied several industrial applications of DACE and two alternative approaches.

It is not our purpose in this report to provide a thorough critique of DACE as a method for minimizing expensive objective functions. We regard DACE and traditional iterative methods for numerical optimization as occupying opposing ends of a spectrum. When V is large relative to p, say p = 2 and V = 500, then the expense of function evaluation is not an issue and we are content to rely on traditional iterative methods. When V is not large relative to p, say p = 2 and V = 5, then the expense of function evaluation is completely crippling and we are content to rely on DACE. (If V < p, then the methodologies that we consider are not appropriate.) In this report we are concerned with intermediate situations, and we borrow ideas from both the numerical optimization and the computer experiment literatures. We describe a sequential modeling strategy in which computer experiment models are used to guide a grid search for a minimizer. Our methods elaborate and extend an important special case of the general model management strategy proposed by Dennis and Torczon (1996) and developed by Serafini (1997). These efforts are part of a larger collaboration described by Booker et al. (1995) and Booker et al. (1996).

2 Pattern Search Methods
We require a method of solving Problem (1) that does not require derivative information. For unconstrained optimization, a popular derivative-free method is the simplex method proposed by Nelder and Mead (1965). This method is sometimes adapted for constrained optimization by means of a simple ad hoc device, viz. setting f(x) = ∞ when x ∉ [a, b]. Unfortunately, the Nelder-Mead simplex method is suspect even for unconstrained optimization. For example, McKinnon (1996) has constructed a family of strictly convex, differentiable objective functions on R^2 for which there exist starting points from which Nelder-Mead will fail to converge to a stationary point. Instead, we rely on a class of methods for which a convergence theory exists, the pattern search methods explicated by Torczon (1997) for the case of unconstrained optimization and extended by Lewis and Torczon (1996a) to the case of bound-constrained optimization.

Pattern search methods are iterative algorithms for numerical optimization. Such algorithms produce a sequence of points, {xk}, from an initial point, x0, provided by the user. To specify an algorithm, one must specify how it progresses from the current iterate, xc, to the subsequent iterate, x+. One of the distinguishing features of pattern search methods is that they restrict their search for x+ to a grid (more formally, a lattice) that contains xc. The grid is modified as optimization progresses, according to rules that ensure convergence to a stationary point. The essential logic of a pattern search is summarized in Figure 1.

The reader is advised that this description of pattern search methods differs from the presentation in Torczon (1997) and Lewis and Torczon (1996a, 1996b). For example, the choice of a starting point is usually not restricted and the initial grid is constructed so that it contains the starting point. More significantly, pattern search methods are usually specified by rules that prescribe where the algorithm is to search for the subsequent iterate, and the notion of an underlying grid is implicit in these rules. In this report, the grid is explicit and the search for a subsequent iterate is not restricted to a specific pattern of points. What should be appreciated is that the present description preserves all of the elements of pattern search methods required by their convergence theory. A detailed explication of the connection between these perspectives will be provided by Serafini (1997).

It is evident that the crucial elements of a pattern search algorithm are contained in the specification of 2(b) in Figure 1. The fundamental idea is to try to find a point on the current grid that strictly decreases the current value of the objective function. Any such point can be taken to be the subsequent iterate. If one fails to find such a point, then one replaces the current grid with a finer one and tries again.

Torczon (1997) described the search in 2(b)(i) as an exploratory moves algorithm. Here we distinguish two components of an exploratory moves algorithm: an oracle that produces a set of trial points on the current grid, and a core pattern of trial points on the grid at which the objective function must be evaluated before the algorithm is permitted to refine the grid. The convergence theory requires that the core pattern satisfy certain hypotheses; no hypotheses are placed on the oracle. Because the methods proposed in this report depend critically on the arbitrary nature of the oracle, we emphasize that any method whatsoever can be employed to produce points that potentially decrease the current value of the objective. We might perform an exhaustive search of the current grid or we might specify a complicated pattern of points at which to search. We might appeal to our prior knowledge of, or our intuition about, the objective function. It does not matter: the convergence theory for pattern search methods encompasses all such possibilities.

1. Specify the current grid. Select x0 from the current grid. Let xc = x0.
2. Do until convergence:
   (a) Let T(+) = ∅.
   (b) Do while T(+) = ∅:
       i. Search the current grid for a set of xt ∈ [a, b] at which f is then evaluated.
          Let T(+) denote the set of grid points xt ∈ [a, b] thus obtained for which f(xt) < f(xc).
       ii. Update the grid.
   (c) Choose x+ ∈ T(+).
   (d) Let xc = x+.

Figure 1: Pattern search methods for numerical optimization.

For the sake of clarity, we describe more fully a specific pattern search algorithm. First, we construct the grids on which the searches will be conducted. For n = 0, 1, 2, ..., we define lattices Γ(n) restricted to the feasible set [a, b] as follows: x ∈ Γ(n) if and only if for each i = 1, ..., p there exists j_i ∈ {0, 1, ..., 2^n} such that

    x_i = a_i + j_i (b_i - a_i) / 2^n.

Thus, Γ(0) comprises the vertices of [a, b] and Γ(n+1) is obtained from Γ(n) by halving the distance between adjacent grid points (see below). When we update the current grid, say Γ(n), in 2(b)(ii), we either retain Γ(n) or replace Γ(n) with Γ(n+1).

Next we specify a core pattern. Given xc ∈ [a, b], we say that xt ∈ [a, b] is adjacent to xc if and only if there exists k ∈ {1, ..., p} such that

    x_tk = x_ck ± (b_k - a_k) / 2^n

and x_ti = x_ci for i ≠ k. We take as the core pattern the set of grid points adjacent to the current iterate. (For example, if the current grid is the integer lattice on R^2 restricted to [a = (0, 0)', b = (8, 8)'], then the core pattern of (2, 0)' comprises (3, 0)', (2, 1)', and (1, 0)'.) We refine the grid, i.e. we replace Γ(n) with Γ(n+1), if and only if we have evaluated f at each grid point xt adjacent to xc and failed to find f(xt) < f(xc).

If f is continuously differentiable, then the theory developed by Lewis and Torczon (1996a) guarantees that the specified algorithm will produce a sequence {xk} that converges to a Karush-Kuhn-Tucker point of Problem (1). In practice, of course, the algorithm must terminate in a finite number of steps. Termination criteria for pattern search methods (indeed, for direct search methods in general) have not been studied extensively, but that does not concern us here. By definition, the assumption that Problem (1) is expensive means that we cannot afford enough evaluations of the objective function to terminate by the traditional criteria of numerical optimization. For the problems that we consider, the relevant termination criterion is whether or not we have exhausted the permitted number (V) of function evaluations.
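The lattice Γ(n) and the core pattern of adjacent grid points defined above can be sketched in Python. This is a minimal illustration, not the authors' code; the function names are our own, and the example reproduces the (2, 0)' core pattern given in the text:

```python
from itertools import product

def lattice(a, b, n):
    """All points of the grid Γ(n) on [a, b]: coordinate i takes the values
    a[i] + j * (b[i] - a[i]) / 2**n for j = 0, ..., 2**n."""
    axes = [[ai + j * (bi - ai) / 2 ** n for j in range(2 ** n + 1)]
            for ai, bi in zip(a, b)]
    return [tuple(pt) for pt in product(*axes)]

def core_pattern(xc, a, b, n):
    """Grid points adjacent to xc: one coordinate moved by +/-(b[k]-a[k])/2**n,
    all other coordinates held fixed, clipped to the feasible box [a, b]."""
    pts = []
    for k in range(len(xc)):
        step = (b[k] - a[k]) / 2 ** n
        for s in (1, -1):
            xt = list(xc)
            xt[k] += s * step
            if a[k] <= xt[k] <= b[k]:
                pts.append(tuple(xt))
    return pts

# Example from the text: the integer lattice on [0, 8] x [0, 8] is Γ(3),
# and the core pattern of (2, 0) is {(3, 0), (2, 1), (1, 0)}.
assert sorted(core_pattern((2, 0), a=(0, 0), b=(8, 8), n=3)) == [(1, 0), (2, 1), (3, 0)]
assert len(lattice((0, 0), (8, 8), n=1)) == 9   # Γ(1): 3 values per coordinate
```

Note how the boundary clip drops the infeasible neighbor (2, -1)', so a point on the boundary of [a, b] has fewer than 2p adjacent grid points.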
Derivative-free methods for numerical optimization can be quite profligate with respect to the number of function evaluations that they require. Because the number of function evaluations available to us is severely limited, we want to use these evaluations as efficiently as possible. On the bright side, the convergence theory for pattern search methods allows us to replace xc with any xt for which f(xt) < f(xc). Hence, no matter how comprehensive a search for trial points we may have envisioned, we can abort it as soon as we find a single trial point that satisfies f(xt) < f(xc). On the dark side, the oracle may require a great many function evaluations to produce even one xt for which f(xt) < f(xc). Furthermore, if the oracle is unsuccessful, then we cannot refine the current grid until after f has been evaluated at each grid point adjacent to xc, a process that may require as many as 2p additional function evaluations if xc is an interior grid point and f has not yet been evaluated at any of the grid points adjacent to xc.

The fundamental purpose of this report is to introduce methods that reduce the expense of the oracle. We do not attempt to reduce the number of points in the core pattern. (For unconstrained optimization, Lewis and Torczon (1996b) have introduced core patterns that contain only p + 1 points.)

Pattern search algorithms may intentionally use large numbers of function evaluations. For example, the oracle employed by Dennis and Torczon's (1991) parallel direct search intentionally casts a wide net, relying on parallel computation to defray the expense of evaluating the objective function at a great many grid points. We are concerned with problems for which huge numbers of function evaluations are not possible. Here, we want an oracle that proposes promising trial points using only a small number of function values. Our strategy for creating such an oracle will be to use previous function values to construct a current global model, f^c, of f, then use f^c to predict trial points xt at which f(xt) < f(xc). Thus, we will employ the strategy described in Section 1, not once to replace Problem (1), but repeatedly to guide us to a solution of it. We now turn to a description of the models that our oracles will exploit.
3 Computer Experiment Models

Suppose that we have observed y_i = f(x_i) for i = 1, ..., n. On the basis of this information, we want to construct a global model f^ of f. Such inexpensive surrogates for f will be used by the oracle in the pattern search algorithm to identify promising trial points at which to compute additional function values.

We assume that there is no uncertainty in the y_i = f(x_i), i.e. that no stochastic mechanism is involved in evaluating the objective function. It is then reasonable to require the surrogate function to interpolate these values, i.e. to require that f^(x_i) = f(x_i). Furthermore, we desire families of surrogate functions that are rich enough to model a variety of complicated objective functions. We are led to consider certain families of models that have been studied in the spatial statistics and computer experiment literatures.

The families of models that we consider are usually motivated by supposing that f is a realization of some (nice) stochastic process. For certain geostatistical applications, this supposition may be quite plausible. In the context of using computer simulations to facilitate the engineering of better product designs, its plausibility is less clear. The high-frequency, low-amplitude oscillations that we have described do resemble the realization of a stochastic process, but the general trends that are our primary concern do not. In any case, we regard the supposition of an underlying stochastic process as nothing more than a convenient fiction. The value of this fiction lies in its power to suggest plausible and useful ways of constructing models that interpolate the f(x_i). When we do invoke it, it should be appreciated that the practical importance of the fictional process lies more in the utility of the models to which it leads than in the optimality criteria (such as BLUP and MLE, defined below) that are formulated with respect to the fictional process; we will try to avoid invoking it excessively.

The requirement of interpolating the f(x_i) will immediately suggest to some readers the germane and well-known facts relating the kriging models described below to spline approximation: the two methodologies are formally equivalent, as explicated by Watson (1984) and others. The motivations of the approximation theorist and the geostatistician, however, are somewhat different: the goal of the former is to interpolate the f(x_i) as smoothly as possible, whereas the goal of the latter is to approximate f as accurately as possible. It is evident that the latter perspective is more relevant to our concerns. The remainder of this section briefly summarizes kriging in the context of computer experiments. See Sacks, Welch, Mitchell, and Wynn (1989), Booker (1994), and Koehler and Owen (1996) for comprehensive surveys of computer experiment methodology.
We begin by assuming that f is a realization of a stochastic process F indexed by the parameter set R^p. We assume that this process has known mean μ(x) = 0 and known continuous covariance function c(·, ·), and that every matrix of the form [c(x_i, x_j)] is symmetric and strictly positive definite. Let

    y = (f(x_1), ..., f(x_n))'

and for each x ∈ R^p define b(x) ∈ R^n to minimize E[y'b - F(x)]^2. Then f^(x) = y'b(x) is the best linear unbiased predictor (BLUP) of f(x), and it is well known that

    f^(x) = y' C^{-1} c(x),                                           (2)

where C is the symmetric positive definite n x n matrix [c(x_i, x_j)] and c(x) = (c(x_1, x), ..., c(x_n, x))'.

This is a simple example of kriging. Notice that kriging necessarily interpolates: since C^{-1} C = I and c(x_j) is column j of C,

    f^(x_j) = y' C^{-1} c(x_j) = y' e_j = y_j = f(x_j).

Thus far we have assumed that the stochastic process is known. We now suppose that F is a Gaussian process with unknown mean function μ(x) = a(x)'β and covariance function of the form c(s, t) = σ^2 r_θ(s, t). We assume that a : R^p -> R^q is a known vector of functions, that β ∈ R^q is an unknown vector, that σ^2 > 0 is an unknown scalar, and that r_θ(·, ·) is an unknown element of a known family of correlation functions. Let A denote the n x q matrix [a_j(x_i)], let R(θ) denote the symmetric n x n matrix [r_θ(x_i, x_j)], and let

    r(x; θ) = (r_θ(x_1, x), ..., r_θ(x_n, x))'.

Then, for β and θ fixed, the BLUP of f(x) is (cf. equation (2))

    f^(x) = a(x)'β + (y - Aβ)' R(θ)^{-1} r(x; θ).                     (3)

Thus, by varying β and θ, we define a family of interpolating functions from which we can select a specific f^ to model f.

Given a family of interpolating functions defined by (3), we require a sensible way of specifying (β, σ^2, θ) and thereby f^. For θ fixed, the maximum likelihood estimates (MLEs) of β and σ^2 have explicit formulas:

    β^(θ) = [A' R(θ)^{-1} A]^{-1} A' R(θ)^{-1} y

and

    σ^2(θ) = (1/n) (y - A β^(θ))' R(θ)^{-1} (y - A β^(θ)).

To compute θ^, the MLE of θ, it turns out that one must minimize

    n log σ^2(θ) + log det R(θ)                                       (4)

as a function of θ.

It is now evident that, in specifying a family of correlation functions, there is a potential tradeoff between the richness of the family of models defined by (3) and the ease of minimizing (4). The richer the family of models, the more difficult it may be to select a plausible member of it. Most of the previous research on computer experiments has been concerned with deriving a single model f^ that will be used as a permanent surrogate for f. Understandably, the authors have used rich families with rather complicated correlation functions for which θ is a vector of dimension p or greater. This makes minimizing (4) difficult, but θ need only be computed once. In contrast, we are concerned with deriving a sequence of models that will be used for the sole purpose of guiding our optimization of f. Hence, we are content to sacrifice some flexibility in (3) in order to simplify minimizing (4). In the numerical experiments that we report in Section 5, we used the 1-parameter isotropic correlation function defined by

    r_θ(s, t) = exp(-θ ||s - t||^2),                                  (5)

where ||·|| denotes the Euclidean norm on R^p.
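Equations (3)-(5) can be sketched numerically as follows. This is a minimal NumPy illustration under the same assumptions used in Section 5 (constant mean: q = 1, a(x) ≡ 1, and the Gaussian correlation (5)); the univariate minimization of (4) is done here by a simple scan over a θ grid rather than a polished univariate optimizer, and the tiny diagonal jitter is our own numerical safeguard:

```python
import numpy as np

def corr(X1, X2, theta):
    """Gaussian correlation (5): r_theta(s, t) = exp(-theta * ||s - t||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-theta * d2)

def fit_kriging(X, y, thetas):
    """Profile MLE: for fixed theta, beta_hat and sigma2_hat are explicit;
    theta is chosen to minimize n*log(sigma2_hat) + log det R, equation (4)."""
    n = len(y)
    a = np.ones(n)                               # constant mean: a(x) = 1
    best = None
    for th in thetas:
        R = corr(X, X, th) + 1e-10 * np.eye(n)   # tiny jitter for stability
        Ri = np.linalg.inv(R)
        beta = (a @ Ri @ y) / (a @ Ri @ a)       # beta_hat(theta)
        resid = y - beta * a
        sigma2 = (resid @ Ri @ resid) / n        # sigma2_hat(theta)
        nll = n * np.log(sigma2) + np.linalg.slogdet(R)[1]   # equation (4)
        if best is None or nll < best[0]:
            best = (nll, th, beta, Ri @ resid)
    _, th, beta, w = best
    # BLUP (3): fhat(x) = beta + r(x; theta)' R(theta)^{-1} (y - beta*1)
    return lambda x: beta + corr(np.atleast_2d(x), X, th)[0] @ w

# Nine well-separated design sites and a smooth test response.
X = np.array([[i, j] for i in (-1.5, 0.0, 1.5) for j in (-1.5, 0.0, 1.5)])
y = np.sin(X[:, 0]) + X[:, 1] ** 2
fhat = fit_kriging(X, y, thetas=np.logspace(-0.5, 1.0, 40))
# Kriging interpolates: the surrogate reproduces every observed value.
assert np.allclose([fhat(x) for x in X], y, atol=1e-5)
```

The final assertion checks the interpolation property derived above from C^{-1}C = I: the fitted surrogate reproduces f at every design site, whatever θ the profile likelihood selects.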

4 The Model-Assisted Grid Search

The optimization strategies developed in this report are all predicated on a simple idea, viz. that providing the oracle in Section 2 with an inexpensive surrogate model of the objective function will allow it to identify promising trial points more efficiently and thereby reduce the cost of optimization. The surrogate models will be constructed from known function values by kriging, as described in Section 3. In this section, we describe in greater detail the interplay between the pattern search algorithm and the kriging models.

We begin with an instructive analogy. For unconstrained minimization of smooth objective functions, the numerical algorithms of choice are the quasi-Newton methods. An elementary exposition of these methods was provided by Dennis and Schnabel (1983), who emphasized the following interpretation: a quasi-Newton algorithm constructs a quadratic model f^c of the objective function f at the current iterate xc and uses f^c to identify a trial point xt at which it is predicted that f(xt) < f(xc). For example, trust region methods obtain xt by (approximately) minimizing f^c subject to the constraint that xt be within a specified neighborhood of xc. The rationale for the constraint is that the model f^c, which is usually constructed by computing or approximating the second-order Taylor polynomial of f at xc, can only be trusted to approximate f locally.

1. Specify an initial grid, Γ0, on [a, b].
2. Perform an initial computer experiment:
   (a) Select initial design sites x1, ..., xN ∈ Γ0.
   (b) Compute f(x1), ..., f(xN).
   (c) Construct f^0 by kriging.
3. Let Γc = Γ0. Let f^c = f^0 and Evalc = {x1, ..., xN}. Let xc = argmin(f(x1), ..., f(xN)).
4. Do until convergence:
   (a) Do until Core(xc) ⊆ Evalc:
       i. Apply an optimization algorithm to f^c to obtain xt ∈ Γc \ Evalc.
       ii. Compute f(xt). Let Evalc = Evalc ∪ {xt}. Update f^c.
       iii. If f(xt) < f(xc), then let xc = xt.
   (b) Refine Γc.

Figure 2: Model-Assisted Grid Search (MAGS) for minimizing an expensive objective function.
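The logic of Figure 2 can be sketched as a compact, runnable skeleton. This is our own illustrative stand-in, not the authors' implementation: the surrogate here is crude inverse-distance interpolation rather than kriging, the "optimizer" applied to the surrogate is a scan of the unevaluated grid points, and the refinement cap `nmax` is an added safeguard:

```python
import itertools
import random

def mags(f, a, b, V, n0=2, N=5, nmax=6, seed=0):
    """Sketch of the Model-Assisted Grid Search of Figure 2 on the box [a, b],
    spending at most V evaluations of the expensive objective f."""
    rng = random.Random(seed)
    p = len(a)
    n = n0                                     # current refinement level

    def grid(n):                               # the lattice Γ(n) on [a, b]
        axes = [[a[i] + j * (b[i] - a[i]) / 2 ** n for j in range(2 ** n + 1)]
                for i in range(p)]
        return [tuple(x) for x in itertools.product(*axes)]

    def core(xc, n):                           # grid points adjacent to xc
        step = [(b[i] - a[i]) / 2 ** n for i in range(p)]
        return [tuple(xc[j] + (s * step[j] if j == k else 0.0)
                      for j in range(p))
                for k in range(p) for s in (1, -1)
                if a[k] <= xc[k] + s * step[k] <= b[k]]

    evals = {}                                 # cache of f values (Eval_c)
    for x in rng.sample(grid(n), N):           # step 2: initial design
        evals[x] = f(x)
    xc = min(evals, key=evals.get)

    def model(x):                              # stand-in surrogate of f
        num = den = 0.0
        for xi, yi in evals.items():
            w = 1.0 / (sum((u - v) ** 2 for u, v in zip(x, xi)) + 1e-12)
            num, den = num + w * yi, den + w
        return num / den

    while len(evals) < V:                      # loop 4: spend the budget V
        cand = [x for x in grid(n) if x not in evals]
        want_refine = not cand or all(x in evals for x in core(xc, n))
        if want_refine and n < nmax:
            n += 1                             # step 4(b): refine the grid
            continue
        if not cand:
            break
        xt = min(cand, key=model)              # step 4(a)(i): search model
        evals[xt] = f(xt)                      # step 4(a)(ii): expensive eval
        if evals[xt] < evals[xc]:              # step 4(a)(iii): improvement
            xc = xt
    return xc, evals[xc]

xbest, fbest = mags(lambda x: (x[0] - 0.3) ** 2 + (x[1] + 0.2) ** 2,
                    a=(-1.0, -1.0), b=(1.0, 1.0), V=30)
```

The skeleton preserves the structural points stressed in the text that follows: all function values are cached in `evals`, trial points are restricted to unevaluated grid points, and the grid may be refined only once the core pattern of xc has been evaluated without improvement.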

Trust region methods make effective use of simple local models of the objective function. Because we are concerned with situations in which evaluation of the objective function is prohibitively expensive, we are prepared to invest more resources in constructing and optimizing more complicated global models of the objective function.

Similarly, classical response surface methodology, from Box and Wilson (1951) to Myers and Montgomery (1995), constructs local linear or quadratic regression models of a stochastic objective function. Again, the purpose of these models is to guide the search for a minimizer or maximizer. Glad and Goldstein (1977) also exploited quadratic regression models for optimization, as did Elster and Neumaier (1995) to guide a grid search. Recently, nonparametric response surface methods have been proposed in which global models of more complicated objective functions are constructed. This work, e.g. Haaland (1996), is closely related to ours. Frank (1995) surveyed numerous issues that arise when using global models of expensive objective functions to facilitate optimization.

The remainder of this section develops a specific methodology for using global models to minimize expensive objective functions. The essential logic of the methods that we propose is summarized in Figure 2. For simplicity, we assume that the expense of all operations performed on the model(s) is negligible in comparison to the expense of function evaluation(s). Dennis and Torczon (1996) have proposed a general model management strategy that can be employed when it is necessary to balance these expenses. This model management strategy is currently being developed and extended by Serafini (1997) and is capable of accommodating our methods as a special case.

Techniques for designing the initial computer experiment in step 2 do not concern us here, except for two details. First, it is convenient to employ a technique that permits the initial design sites to be selected from the initial grid. Several such techniques were described by Koehler and Owen (1996). Second, we must specify N, the number of initial design sites. A great deal of numerical experimentation will be required before it becomes possible to suggest useful guidelines for the choice of N. Because we are concerned with problems for which sequential optimization is practicable, considerably less than the entire budget of function evaluations (V) will be invested in the initial design, i.e. we will choose N << V. At present, we prefer to choose N >= p + 1, thereby permitting at least a simplex design.

Because evaluation of the objective function is expensive, it seems sensible to cache the function values that have been computed. We denote the current set of points at which f has been evaluated by Evalc, which in step 3 we initialize to comprise the initial design sites. For any point x ∈ Γc, we denote by Core(x) the core pattern of points adjacent to x. Then, in loop 4(a), Core(xc) ⊆ Evalc is precisely the condition, required by the convergence theory for pattern search methods, that must be satisfied before the current grid can be refined. Notice that this condition must eventually obtain, i.e. loop 4(a) cannot repeat indefinitely, because Γc is finite and we only consider trial points xt ∈ Γc at which f has not yet been evaluated. Thus, the logic of loop 4 is to (a) search for points of improvement on the current grid until (b) we replace the current grid with a finer grid. Loop 4 repeats until we have exhausted our budget of objective function evaluations.

It is within loop 4(a) that we exploit the models obtained by kriging. Step 4(a)(i) produces a trial point xt at which f is evaluated in the hope that f(xt) < f(xc). As soon as f(xt) has been computed, we add xt to the set Evalc and update the current model f^c. It is not entirely clear how best to update f^c. One could completely refit the family of models to the new set of function values, but one might decline to re-estimate certain parameters if one believed that the value of updating them did not justify the expense.

The generality of the convergence theory for pattern search methods permits us to be rather vague in specifying step 4(a)(i). This ambiguity is desirable because it encompasses a great many possibilities deserving consideration. Thus, we can search for a local minimizer of f^c using whatever algorithm we prefer, e.g. quasi-Newton, steepest descent, pattern search, etc. We can start our search from whatever point we prefer, or we can use multiple starting points and pursue multiple searches before determining xt. We can even search for a global minimizer of f^c if we are so inclined. Furthermore, we can terminate our search whenever we please. (We envision searching until we near a local minimizer of f^c, but we can terminate sooner if the search becomes expensive. Trade-offs between the cost of evaluating the objective function and the cost of constructing and optimizing the models can be mediated by the model management strategy proposed by Dennis and Torczon (1996) and developed by Serafini (1997).) The only requirement is that we eventually return a trial point xt that belongs to the current grid and at which f has not yet been evaluated.

Most search strategies for a desirable xt will not restrict attention to the current grid. Because we require xt ∈ Γc, we suggest terminating the search when step lengths become appreciably smaller than the minimal distance between grid points and choosing xt to be a grid point that is near the terminal iterate of the search. There are various plausible ways of implementing this suggestion. We might prefer the nearest grid point, or we might prefer a nearby grid point at which the model predicts greater decrease. Because we require xt ∉ Evalc, we might not actually select the nominally preferred grid point. Similarly, if we are impatient to refine the grid, we might select xt ∈ Core(xc) even when a nonadjacent point is nominally preferred.
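The grid-snapping rule just described can be sketched as follows. The helper names are our own, and the fallback rule (nearest unevaluated adjacent grid point) is one of the several plausible implementations the text mentions, chosen here for concreteness:

```python
def snap_to_grid(z, a, b, n):
    """Round a continuous point z to the nearest point of the grid Γ(n)."""
    out = []
    for zi, ai, bi in zip(z, a, b):
        step = (bi - ai) / 2 ** n
        j = round((zi - ai) / step)
        j = max(0, min(2 ** n, j))          # stay inside [a, b]
        out.append(ai + j * step)
    return tuple(out)

def choose_trial(z, a, b, n, evaluated):
    """Trial point x_t near the terminal iterate z of the surrogate search:
    prefer the nearest grid point; if f is already known there, fall back to
    the nearest adjacent (core) grid point not yet evaluated."""
    xt = snap_to_grid(z, a, b, n)
    if xt not in evaluated:
        return xt
    nbrs = []
    for k in range(len(xt)):
        step = (b[k] - a[k]) / 2 ** n
        for s in (1, -1):
            cand = list(xt)
            cand[k] += s * step
            if a[k] <= cand[k] <= b[k] and tuple(cand) not in evaluated:
                nbrs.append(tuple(cand))
    if not nbrs:
        return None                         # caller must widen the search
    return min(nbrs, key=lambda x: sum((u - v) ** 2 for u, v in zip(x, z)))

# Unit box, grid spacing 1/4: z snaps to (0.25, 0.5); if that grid point has
# already been evaluated, the nearest unevaluated neighbor is returned.
z = (0.27, 0.49)
assert snap_to_grid(z, (0, 0), (1, 1), 2) == (0.25, 0.5)
assert choose_trial(z, (0, 0), (1, 1), 2, {(0.25, 0.5)}) == (0.5, 0.5)
```

Returning `None` when every adjacent grid point is already cached mirrors the situation in which the oracle has been exhausted near xc and the core-pattern check of loop 4(a) comes into play.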

Numerical

Experiments
some numerical experiments intended to illustrate the viability Satellite software. of the methods

We now describe proposed computer univariate

in Section 4. These experiments were performed with a 75MHz Pentium processor, using S-PLUS optimization (for parameter estimation)

on a Toshiba for Windows

100CS notebook In what follows, function opti-

was performed

using the S-PLUS

Minimizer
(0,-10) (-6,-4) (18,2) (12,8)

f(x) 3 30 84 840

Frequency 38 44 8 10

Table

1: Minima

of the

rescaled

Goldstein-Price

objective

function

on [-20, 20] [-20,

20] found

by 100 quasi-Newton

searches.

mize, an implementation of Brent's (1973) safeguarded finite-difference quasi-Newton methods for multivariate

polynomial interpolation procedure. When optimization were employed (typically for function developed nlminb, an implementation by Gay (1983, 1984).

minimizing models), they were performed using the S-PLUS of a trust-region method for bound-constrained optimization The objective function nomial of p = 2 variables employed by Dixon and

used in our experiments was a rescaled version of an eighth-order polyconstructed by Goldstein and Price (1971). One of the test problems Szeg5 (1978) in their study of global optimization was to minimize the

Goldstein-Price polynomial in the feasible region [-2, 2] [-2, 2]. We rescaled this problem so that the feasible region was [-20, 20] [-20, 20]. The Goldstein-Pricc polynomial is a complicated function that has four minimizers and is not easily modelled. Because it is a polynomial, the minimizers are easily located by a finite-difference quasi-Newton method. To estimate the relative sizes of their basins of attraction, we drew 100 points results from a uniform are summarized distribution on the feasible in Table 1. are a special region and started nlminb from each point. The

The models

that we employed

case of the family specified

by (3). We set q = 1, so specified by (5) and

that fl is a scalar parameter, and a(x) -_- 1. We used the correlation estimated the scalar parameter 0 by applying optimize to (4).

function

We implemented two strategies for derivative-free optimization with a small value of V, the permitted number of objective function evaluations.

First, we implemented a DACE strategy with V = 11, 16, 21, 26. For these procedures, we constructed a model, f̂, from N = V - 1 design sites obtained by Latin hypercube sampling. We then minimized f̂ by a finite-difference quasi-Newton method, starting from the design site with the smallest objective function value. The objective function was then evaluated at the minimizer of f̂ so obtained and the smallest of the V function values was recorded.

Second, we implemented the MAGS strategy described in Section 4 with V = 11, 16. For these procedures, it is first necessary to specify an initial grid. This involves a trade-off: coarse grids tend to safeguard against unlucky initial designs, whereas fine grids permit more function evaluations before it becomes necessary to evaluate f at a core pattern of points in order to refine the grid. Our choice of a family of grids was further complicated by the fact that the minimizers of our objective function occur on the integer lattice in R^2. To avoid using grids that contain the minimizers, we selected an initial grid of the form x_i = -20 + j_i π/2, for j_i = 0, 1, 2, .... Subsequent grids, which were rarely required, were obtained by halving the current distance between adjacent grid points. To obtain an initial design on the initial grid, we first obtained N = 5 points by Latin hypercube sampling, then moved each point to the nearest grid point. The initial model was constructed from this design.
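The grid construction and the snap-to-grid step can be sketched as follows. The irrational spacing π/2 keeps the integer-lattice minimizers off the grid; the Latin hypercube routine here is a minimal stand-in, not the sampler actually used in the experiments.

```python
import math
import random

SPACING = math.pi / 2  # irrational spacing keeps integer-lattice points off the grid

def snap_to_grid(x, lo=-20.0, spacing=SPACING):
    """Move one coordinate to the nearest grid point x_i = lo + j*spacing, j = 0, 1, 2, ..."""
    j = max(0, round((x - lo) / spacing))
    return lo + j * spacing

def initial_design(n=5, lo=-20.0, hi=20.0, seed=0):
    """Latin hypercube sample in [lo, hi]^2, each point snapped to the grid
    (a simple illustrative sampler, not the one used in the report)."""
    rng = random.Random(seed)
    cells = [list(range(n)), list(range(n))]
    for c in cells:
        rng.shuffle(c)                      # one stratum per coordinate per point
    width = (hi - lo) / n
    pts = []
    for i in range(n):
        p = [lo + (cells[k][i] + rng.random()) * width for k in range(2)]
        pts.append(tuple(snap_to_grid(v, lo) for v in p))
    return pts
```

Every coordinate of the resulting design is an exact multiple of π/2 above -20, so no design site can coincide with an integer-lattice minimizer.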

Percentile     DACE11    DACE16    DACE21    DACE26    MAGS11    MAGS16
Minimum          3.41      5.29      5.02      3.26      5.77      3.43
5%               9.60     11.99     10.64      9.89      5.77      5.77
10%             15.63     13.74     17.16     14.42      6.78      5.77
25%             37.26     30.83     26.04     24.75     28.98      6.78
50%            116.75     71.61     51.69     41.70     43.89     30.48
75%            367.95    122.35     99.96     86.33     87.78     64.55
90%            601.86    240.92    190.71    154.20    343.21    101.63
95%            951.55    423.13    230.89    281.84    529.02    135.22
Maximum       1027.52   1243.90    774.91    349.43   1617.61    885.32

Table 2: Percentiles of smallest values of the Goldstein-Price objective function found by six optimization strategies, each replicated 100 times.

To obtain a trial point, x_t, from the current iterate, x_c, a finite-difference quasi-Newton method was started from x_c to obtain a minimizer, x̂, of the current model, f̂_c. Let x̄ denote the grid point nearest x̂. If the objective function had not previously been evaluated at x̄, then we set x_t = x̄; otherwise, we set x_t equal to the point in Core(x̄) nearest x̂. Each time that a new function value was obtained, we re-estimated all three of the model parameters, (β, σ², θ). This process was repeated until V function evaluations had been performed, at which time the smallest function value was recorded.

Each of the six procedures described above was replicated 100 times. The results are summarized in Table 2, from which several interesting conclusions can be drawn.

First, we note that each strategy did indeed do better when permitted more function evaluations. Thus, DACE with V = 16 function evaluations usually outperformed DACE with V = 11 function evaluations, and MAGS with V = 16 function evaluations usually outperformed MAGS with V = 11 function evaluations. This result is not surprising.

Second, we note that MAGS typically found smaller function values with fewer function evaluations than did DACE. The crucial comparisons are between DACE and MAGS with V = 11 function evaluations and between DACE and MAGS with V = 16 function evaluations. In each case, typical performance decisively favors MAGS. For example, the median best function value found by MAGS11 is less than 40 percent of the median best value found by DACE11. Similarly, the median best function value found by MAGS16 is less than 45 percent of the median best value found by DACE16. The DACE experiments with V = 21, 26 were included to benchmark the performance of MAGS.
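Each row of Table 2 is an order statistic of the 100 recorded best values for one strategy. A summary like one of the table's columns can be computed directly; the input here would be the vector of 100 smallest function values from one set of replications.

```python
import numpy as np

def percentile_summary(best_values):
    """Summary rows as in Table 2: min, 5%, 10%, 25%, 50%, 75%, 90%, 95%, max."""
    q = [0, 5, 10, 25, 50, 75, 90, 95, 100]
    labels = ["Minimum", "5%", "10%", "25%", "50%", "75%", "90%", "95%", "Maximum"]
    return dict(zip(labels, np.percentile(best_values, q)))
```

Note that np.percentile interpolates linearly between order statistics by default, which is one of several conventions; the report does not say which definition of a percentile it used.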

Except for the extreme upper percentiles, the distribution of best function values found by MAGS with V = 11 is strikingly similar to the distribution of best function values found by DACE with V = 26 function evaluations. Thus, another way to state the case for MAGS is to observe that DACE typically required more than twice as many function evaluations to match the performance of MAGS11. Such observations confirm the central thesis of this report, that optimization is most efficiently accomplished by iterative methods. This is a familiar issue in statistics that is often raised when comparing Taguchi and response surface methods, e.g. in the panel discussion edited by Nair (1992) and in Trosset (1996). Our results strengthen the cases made by Frank (1995), Booker et al. (1995), Booker et al. (1996) and Booker (1996) for using sequential modeling strategies to optimize expensive functions.

We note that despite its generally superior performance, MAGS occasionally disappoints. For example, the lower 75 percent of the distributions of best values found by DACE with V = 26 and by MAGS with V = 11 are remarkably similar, but the former's upper 10 percent is markedly better than the latter's. MAGS with V = 16 found a smallest function value less than or equal to 142.70 on 97 occasions, but found smallest values of 688.75, 855.64, and 885.32 on the other three. This anomalous phenomenon seems to result from unfortunate initial designs. With only N = 5 design sites, Latin hypercube sampling on a grid occasionally results in transparently bad designs, e.g. five collinear sites, from which MAGS has difficulty recovering. To safeguard against this possibility, we recommend using more sophisticated procedures for determining the initial design.

6 Conclusions and Future Work

The results presented in Section 5 suggest that the potential value of the methods that we have proposed for optimizing expensive objective functions is considerable. Of course, comprehensive numerical experiments on a variety of objective functions in a variety of dimensions will be required to fully appreciate the advantages and disadvantages of these methods. We plan to undertake such investigations in future work. We conclude this report by describing some modifications of MAGS that we envision for these investigations.
We begin by noting that the design of the current model, f̂_c, is sequential, comprising the initial design sites x_1, ..., x_N and the trial sites x_t subsequently identified in step 4(a)(i) of Figure 2. At present, the initial design sites are selected according to the principles of experimental design without regard to optimization, whereas the subsequent trial sites are determined by an optimization algorithm without regard to the design of experiments. Because an optimization algorithm tends to cluster trial points in promising regions, the resulting sequence is rarely space-filling and is not likely to be a good experimental design. Moreover, especially during the early stages of optimization, greater gains are likely to come from improving the fit of f̂_c to f than from accurately identifying a minimizer of f̂_c. Our present implementation of MAGS is a greedy algorithm in the sense that it tries to minimize f̂_c without concern for the fit of the subsequent model, f̂_+.

The desirability of balancing the concerns of numerical optimization and the concerns of experimental design was recognized by Frank (1995), who proposed a dichotomous search strategy. Given an optimization criterion, e.g. the objective function, and a design criterion, one obtains some fraction of the new design sites using one criterion and the balance using the other. An implementation of this Balanced Local-Global Search (BLGS) strategy was described by Booker et al. (1995) and employed by Booker (1996) and Booker et al. (1996). In contrast to the dichotomous BLGS strategy, we envision modifying step 4(a)(i) as follows: apply an optimization algorithm to a merit function, φ_c, to obtain x_t ∈ Γ_c \ Eval_c. The merit function should be specified so as to balance the potentially competing goals of finding sites at which f̂_c is small and choosing good design sites. Notice that this modification in no way affects the convergence theory for MAGS; it is a purely empirical matter whether or not it improves the performance of the algorithm.

To illustrate what we have in mind, we again invoke the fiction that the objective function, f, is a realization of a stochastic process. Under the assumptions delineated in Section 3, the mean squared error of (3) for predicting f(x) is

\mathrm{MSE}[\hat{f}(x)] = \sigma^2 \left( 1 - \begin{bmatrix} a(x) \\ r(x;\theta) \end{bmatrix}' \begin{bmatrix} 0 & A' \\ A & R(\theta) \end{bmatrix}^{-1} \begin{bmatrix} a(x) \\ r(x;\theta) \end{bmatrix} \right).

As detailed by Sacks, Welch, Mitchell and Wynn (1989), this function plays a fundamental role in several approaches to optimal design. It is also the design criterion employed by Booker et al. (1995) and Booker (1996) in BLGS. Our idea is to construct a merit function that encourages the selection of trial points at which the mean squared error is large, i.e. at which we believe that less is known about the objective function. This strategy has a great deal in common with the Sequential Design for Optimization (SDO) algorithm for global optimization proposed by Cox and John (1996).

The mean squared error function can be estimated by replacing the parameters (β, σ², θ) with their MLEs, resulting in an expression that we denote by M̂SE[f̂(x)]. Then a natural family of merit functions comprises those of the form

    φ_c(x) = f̂_c(x) - ρ_c M̂SE[f̂_c(x)],
where Pc >_ O. We anticipate that a great deal of numerical determine useful choices of the Pc. We also note that our family of kriging models was derived stochastic process. In fact, it seems rather implausible that

experimentation under the

will be required of a stationary

to

the assumption same correlations

will obtain

throughout the feasible set, so that we likely will prefer different values of 0 for different regions of [a, b]. This poses a problem, because the ]c are global models. As we descend into the basin of a local minimizer, we would like to use a value of 0 that is suited to that basin, not a value of 0 that was determined An obvious by a global compromise. this difficulty is to implement the "zoom-in" process proposed by way of addressing

Frank (1995). To do so, we occasionally modify the feasible set, restricting it to a much smaller set in which a minimizer is deemed to lie. Then the current model, ]c, is determined by the design sites in, and restricted this is a delicate convergence However, that properly in domain to, the current feasible set, [ac, bc]. From a theoretical perspective, strategy that must be carefully implemented in order to preserve the guaranteed has thus far been inherited implemented, updating from the standard the feasible theory for pattern a useful search localization methods. of the set may permit feasible

kriging models. Assuming that

we are prepared

to replace

the current

set with a subset

of it, when guess grid.

should we do so? When step 4(a)(i) begins to produce trial points xt E Core(xc), that Xc is near a local minimizer and that the algorithm is preparing to refine As previously noted, this can be an expensive undertaking, requiring as many

one might the current

as 2p evaluations

of f. At this stage, an interactive user might very well instruct the algorithm to refine the grid without actually cvaluating f on Core(xc), reasoning that an asymptotic convergence theory may be of little relevance when one can afford only a small number of iterations. Alternatively, First, onc would one might interpret the same scenario as a signal to recalibrate replace the current feasible set with a smaller one. Second, instead the problem. of performing

2p function evaluations on Core(xc)--a set of points that is not a good design_ne would rcplace the current grid with a finer grid, design an experiment for the new grid, and proceed. At present, there is no theory to justify these tactics, but it seems quite plausible that they will result in practical improvements of an alre_ty-promising procedure.
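The merit-function idea described above can be sketched numerically for the constant-mean case (a(x) ≡ 1, so A is a column of ones). The Gaussian correlation form is an assumption, and the values of σ² and θ passed in are plug-ins standing in for MLEs; the block-matrix expression is the mean squared error formula given earlier in this section.

```python
import numpy as np

def kriging_mse(x, X, theta, sigma2):
    """MSE of the kriging predictor via the block formula, for a constant
    mean (a(x) = 1, A = column of ones) and an assumed Gaussian correlation."""
    n = len(X)

    def corr(U, V):
        d2 = ((U[:, None, :] - V[None, :, :])**2).sum(axis=2)
        return np.exp(-theta * d2)

    R = corr(X, X)
    A = np.ones((n, 1))
    M = np.block([[np.zeros((1, 1)), A.T],   # bordered system [[0, A'], [A, R]]
                  [A, R]])
    v = np.concatenate(([1.0], corr(np.atleast_2d(x), X).ravel()))
    return sigma2 * (1.0 - v @ np.linalg.solve(M, v))

def merit(x, fhat, X, theta, sigma2, rho):
    """phi_c(x) = fhat(x) - rho * MSE: favor sites that look small OR uncertain."""
    return fhat(x) - rho * kriging_mse(x, X, theta, sigma2)
```

Two properties make this a sensible design criterion: the MSE vanishes at every design site (nothing is left to learn there), and it exceeds σ² far from all design sites, so with ρ_c > 0 the merit function pulls trial points toward unexplored regions.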

Acknowledgments

This research evolved from an ongoing collaboration with Andrew Booker, Paul Frank and Greg Shubin (Boeing Company); Andrew Conn (IBM Corporation); and John Dennis and David Serafini (Rice University). Although we have been profoundly influenced by the ideas of our collaborators, some of the opinions expressed in this report may not be shared by all members of the collaboration.

References

Barthelemy, J.-F. M. and Haftka, R. T. (1993). Approximation concepts for optimum structural design: a review. Structural Optimization, 5:129-144.

Booker, A. J. (1994). DOE for computer output. Technical Report BCSTECH-94-052, Boeing Computer Services, Seattle, WA.

Booker, A. J. (1996). Case studies in design and analysis of computer experiments. In Proceedings of the Section on Physical and Engineering Sciences. American Statistical Association.

Booker, A. J., Conn, A. R., Dennis, J. E., Frank, P. D., Serafini, D., Torczon, V., and Trosset, M. (1996). Multi-level design optimization: Boeing/IBM/Rice Collaborative Project, 1996 Final Report. Technical Report ISSTECH-95-031, Boeing Information & Support Services, Seattle, WA.

Booker, A. J., Conn, A. R., Dennis, J. E., Frank, P. D., Trosset, M., and Torczon, V. (1995). Global modeling for optimization: Boeing/IBM/Rice Collaborative Project, 1995 Final Report. Technical Report ISSTECH-95-032, Boeing Information & Support Services, Seattle, WA.

Box, G. E. P. and Wilson, K. B. (1951). On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society, Series B, 13:1-45. Includes discussion.

Brent, R. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall, Englewood Cliffs, NJ.

Cox, D. D. and John, S. (1996). SDO: A statistical method for global optimization. In Alexandrov, N. and Hussaini, M. Y., editors, Multidisciplinary Design Optimization: State of the Art. SIAM, Philadelphia.

Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.

Dennis, J. E. and Torczon, V. (1991). Direct search methods on parallel machines. SIAM Journal on Optimization, 1:448-474.

Dennis, J. E. and Torczon, V. (1996). Managing approximation models in optimization. In Alexandrov, N. and Hussaini, M. Y., editors, Multidisciplinary Design Optimization: State of the Art. SIAM, Philadelphia.

Dixon, L. C. W. and Szegő, G. P. (1978). The global optimisation problem: An introduction. In Dixon, L. C. W. and Szegő, G. P., editors, Towards Global Optimisation 2, pages 1-15. Elsevier North-Holland, New York.

Elster, C. and Neumaier, A. (1995). A grid algorithm for bound constrained optimization of noisy functions. IMA Journal of Numerical Analysis, 15:585-608.

Frank, P. D. (1995). Global modeling for optimization. SIAG/OPT Views-and-News, (7):9-12.

Gay, D. M. (1983). Algorithm 611. Subroutines for unconstrained minimization using a model/trust-region approach. ACM Transactions on Mathematical Software, 9:503-524.

Gay, D. M. (1984). A trust region approach to linearly constrained optimization. In Lootsma, F. A., editor, Numerical Analysis. Proceedings, Dundee 1983, pages 171-189, Berlin. Springer.

Glad, T. and Goldstein, A. (1977). Optimization of functions whose values are subject to small errors. BIT, 17:160-169.

Goldstein, A. A. and Price, J. F. (1971). On descent from local minima. Mathematics of Computation, 25:569-574.

Haaland, P. D. (1996). Nonparametric response surface methods. In Abstracts: Summaries of Papers Presented at the 1996 Joint Statistical Meetings in Chicago, Illinois, page 324.

Koehler, J. R. and Owen, A. B. (1996). Computer experiments. In Ghosh, S. and Rao, C. R., editors, Handbook of Statistics, Volume 13, pages 261-308. Elsevier Science, New York.

Lewis, R. M. and Torczon, V. (1996a). Pattern search algorithms for bound constrained minimization. Technical Report 96-20, ICASE, NASA Langley Research Center, Hampton, VA. To appear in SIAM Journal on Optimization.

Lewis, R. M. and Torczon, V. (1996b). Rank ordering and positive bases in pattern search algorithms. Technical Report 96-71, ICASE, NASA Langley Research Center, Hampton, VA.

McKinnon, K. I. M. (1996). Convergence of the Nelder-Mead simplex method to a non-stationary point. Technical Report MS 96-006, Department of Mathematics & Statistics, University of Edinburgh, Edinburgh, Scotland.

Myers, R. H. and Montgomery, D. C. (1995). Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons, New York.

Nair, V. N. (1992). Taguchi's parameter design: A panel discussion. Technometrics, 34:127-161.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7:308-313.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4:409-435. Includes discussion.

Serafini, D. (1997). A Model Management Framework for Nonlinear Optimization of Computationally Expensive Functions. PhD thesis, Rice University, Houston, TX. In progress.

Torczon, V. (1997). On the convergence of pattern search methods. SIAM Journal on Optimization, 7:1-26.

Trosset, M. W. (1996). Taguchi and robust design. Technical Report 96-31, Department of Computational & Applied Mathematics, Rice University, Houston, TX.

Watson, G. S. (1984). Smoothing and interpolation by kriging and with splines. Mathematical Geology, 16:601-615.

Welch, W. J. and Sacks, J. (1991). A system for quality improvement via computer experiments. Communications in Statistics, Theory and Methods, 20:477-495.

Report Documentation Page (Standard Form 298)

Report type and date: Contractor Report, August 1997. Title: Numerical Optimization Using Computer Experiments. Authors: Michael W. Trosset and Virginia Torczon. Funding: Contract NAS1-19480; WU 505-90-52-01. Performing organization: Institute for Computer Applications in Science and Engineering, Mail Stop 403, NASA Langley Research Center, Hampton, VA 23681-0001; ICASE Report No. 97-38. Sponsoring agency: National Aeronautics and Space Administration, Langley Research Center, Hampton, VA 23681-0001; NASA CR-201724. Supplementary notes: Langley Technical Monitor: Dennis M. Bushnell. Final Report. Submitted to Technometrics. Distribution: Unclassified-Unlimited, Subject Category 64. Subject terms: design optimization, computer simulation, derivative-free optimization, pattern search, kriging, nonparametric response surface methodology. Number of pages: 16. Price code: A03. Security classification: Unclassified.
