NASA Contractor Report 201724
ICASE Report No. 97-38

NUMERICAL OPTIMIZATION USING COMPUTER EXPERIMENTS

Michael W. Trosset
Virginia Torczon

NASA Contract No. NAS1-19480

Institute for Computer Applications in Science and Engineering
NASA Langley Research Center, Hampton, VA
Operated by Universities Space Research Association

National Aeronautics and Space Administration
Langley Research Center
Hampton, VA 23681-0001
Numerical Optimization Using Computer Experiments*

Michael W. Trosset
Adjunct Professor, Department of Computational & Applied Mathematics
Rice University, Houston, TX

Virginia Torczon
Department of Computer Science
College of William & Mary, Williamsburg, VA
Abstract
Engineering design optimization often gives rise to problems in which expensive objective functions are minimized by derivative-free methods. We propose a method for solving such problems that synthesizes ideas from the numerical optimization and computer experiment literatures. Our approach relies on kriging known function values to construct a sequence of surrogates of the objective function that are used to guide a search for a minimizer. Results of numerical experiments on a standard test problem are presented.
*This research was supported by the Air Force Office of Scientific Research (AFOSR) and by the National Aeronautics and Space Administration under NASA Contract Number NAS1-19480 while the authors were in residence at the Institute for Computer Applications in Science and Engineering (ICASE), NASA Langley Research Center, Hampton, VA.
1 Introduction
We consider the problem of minimizing an objective function f : R^p --> R subject to bound constraints:

    minimize f(x)   subject to   x ∈ [a, b],                                   (1)

where x ∈ [a, b] denotes that ai ≤ xi ≤ bi for i = 1,..., p.
We are concerned with special cases of Problem (1) for which evaluation of the objective function involves performing one (or more) complicated, deterministic computer simulation(s). Many such problems arise as engineering design problems and are often distinguished by two troubling characteristics that preclude solution by traditional algorithms for bound-constrained optimization.

First, the output of a complicated computer simulation is usually affected by a great many approximation, rounding, and truncation errors. These errors are not stochastic--repeating the simulation reproduces them exactly--but their accumulation introduces high-frequency, low-amplitude oscillations. These oscillations are distortions of the smooth trends that we would have liked to optimize. In consequence, optimization algorithms that compute or approximate (by finite differencing) derivatives of f often fail to follow the general trends in the objective function and become trapped in local minimizers created by the oscillations. In order to develop effective algorithms for such applications, we turn to derivative-free methods for numerical optimization.

Second, complicated computer simulations are often expensive to perform. Frank (1995) suggested that one must address problems in which a typical function evaluation costs several hours of supercomputer time. For example, Booker (1996) and Booker et al. (1996) studied an aeroelastic and dynamic response simulation of a helicopter rotor blade for which a single function evaluation requires approximately six hours of cpu time on a Cray Y-MP. We formalize the notion that the objective function is expensive to evaluate by imposing an upper bound V on the number of evaluations of f that we are allowed to perform. The severity of this restriction will depend (in part) on the number of variables, p.
Given an objective function f that is too expensive for standard numerical optimization algorithms to succeed, it has long been a standard engineering practice, described by Barthelemy and Haftka (1993), to replace f with an inexpensive surrogate f̂ and minimize f̂ instead. (For example, one might evaluate f at V − 1 carefully selected sites, construct f̂ from the resulting information, use a standard numerical optimization algorithm to minimize f̂, and evaluate f at the candidate minimizer thus obtained.) This practice may also have the salutary effect of smoothing high-frequency oscillations in f. The rapidly growing literature on computer experiments offers new and potentially better ways of implementing this traditional practice. The prescription that seems to be gaining some currency in the engineering community was proposed by Welch and Sacks (1991); following current convention, we refer to it as DACE (Design and Analysis of Computer Experiments). Frank (1995) offered an optimizer's perspective on this methodology, suggested that the "minimalist approach" of minimizing a single f̂ is not likely to yield satisfactory results, and proposed several sequential modeling strategies as alternatives. Booker (1996) studied several industrial applications of DACE and two alternative approaches.
It is not our purpose in this report to provide a thorough critique of DACE as a method for minimizing expensive objective functions. We regard DACE and traditional numerical optimization as occupying opposing ends of a continuum. When V is large relative to p, say p = 2 and V = 500, then the expense of function evaluation is not an issue and we are content to rely on traditional iterative methods. When V is not large relative to p, say p = 2 and V = 5, then the expense of function evaluation is completely crippling and we are content to rely on DACE. (If V < p, then the methodologies that we consider are not appropriate.) In this report we are concerned with intermediate situations, and we borrow ideas from both the numerical optimization and computer experiment literatures. We describe a sequential modeling strategy in which surrogate models are used to guide a grid search for a minimizer. Our methods elaborate and extend an important special case of the general model management strategy proposed by Dennis and Torczon (1996) and developed by Serafini (1997). These efforts are part of a larger collaboration described by Booker et al. (1995) and Booker et al. (1996).
2 Pattern Search Methods
We require a method of solving Problem (1) that does not require derivative information. For many practitioners, the best-known derivative-free method is the simplex method proposed by Nelder and Mead (1965), which is sometimes adapted for constrained optimization by means of a simple ad hoc device, viz. setting f(x) = ∞ when x ∉ [a, b]. Unfortunately, the Nelder-Mead simplex method is suspect even for unconstrained optimization. For example, McKinnon (1996) has constructed a family of strictly convex, differentiable objective functions on R² for which there exist starting points from which Nelder-Mead will fail to converge to a stationary point. Instead, we rely on a class of methods for which a convergence theory exists: the pattern search algorithms explicated by Torczon (1997) for the case of unconstrained optimization and extended by Lewis and Torczon (1996a) to the case of bound-constrained optimization.

Pattern search methods are iterative algorithms for numerical optimization. Such algorithms produce a sequence of points, {xk}, from an initial point, x0, provided by the user. To specify an algorithm, one must specify how it progresses from the current iterate, xc, to the subsequent iterate, x+. One of the distinguishing features of pattern search methods is that they restrict their search for x+ to a grid (more formally, a lattice) that contains xc. The grid is modified as optimization progresses, according to rules that ensure convergence to a stationary point. The essential logic of a pattern search is summarized in Figure 1. The reader is advised that this description of pattern search methods differs from the presentation in Torczon (1997) and Lewis and Torczon (1996a, 1996b). For example, the choice of a starting point is usually not restricted and the initial grid is constructed so that it contains the starting point. More significantly, pattern search methods are usually specified by rules that prescribe where the algorithm is to search for the subsequent iterate, and the notion of an underlying grid is implicit in these rules. In this report, the grid is explicit and the search for a subsequent iterate is not restricted to a specific pattern of points. What should be appreciated is that the present description preserves all of the elements of pattern search methods required by their convergence theory. A detailed explication of the connection between these perspectives will be provided by Serafini (1997).
It is evident that the crucial elements of a pattern search algorithm are contained in the specification of 2(b) in Figure 1. The fundamental idea is to try to find a point on the current grid that strictly decreases the current value of the objective function. Any such point can be taken to be the subsequent iterate. If one fails to find such a point, then one replaces the current grid with a finer one and tries again. Torczon (1997) described the search in 2(b)(i) as an exploratory moves algorithm. Here we distinguish two components of an exploratory moves algorithm: an oracle that produces a set of trial points on the current grid, and a core pattern of trial points on the grid at which the objective function must be evaluated before the algorithm is permitted to refine the grid. The convergence theory requires that the core pattern satisfy certain hypotheses; no hypotheses are placed on the oracle.
1. Let xc = x0.
2. Do until termination:
   (a) ...
   (b) i. Search the current grid for a set of trial points xt ∈ [a, b] at which f is then evaluated. Let T(+) denote the set of grid points xt ∈ [a, b] thus obtained for which f(xt) < f(xc).
      ii. If no such point is found after f has been evaluated at the core pattern, refine the grid.
   (c) Otherwise, choose x+ ∈ T(+) and set xc = x+.

Figure 1: Pattern search methods for numerical optimization.
Because the methods proposed in this report depend critically on the arbitrary nature of the oracle, we emphasize that any method whatsoever can be employed to produce points that potentially decrease the current value of the objective. We might perform an exhaustive search of the current grid or we might specify a complicated pattern of points at which to search. We might appeal to our prior knowledge of, or our intuition about, the objective function. It does not matter--the convergence theory for pattern search methods encompasses all such possibilities.

For the sake of clarity, we describe more fully a specific pattern search algorithm. First, we construct the grids on which the searches will be conducted. For n = 0, 1, 2,..., we define the lattice Γ(n), restricted to the feasible set: x ∈ Γ(n) if and only if, for each i = 1,..., p, there exists ji ∈ {0, 1,..., 2^n} such that

    xi = ai + ji (bi − ai) / 2^n.

Thus, Γ(0) comprises the vertices of [a, b], and Γ(n+1) is obtained from Γ(n) by halving the distance between adjacent grid points (see below). When we update the current grid, say Γ(n), in 2(b)(ii), we either retain Γ(n) or replace Γ(n) with Γ(n+1).

Next we specify a core pattern. Given xc ∈ [a, b], we say that xt ∈ [a, b] is adjacent to xc if and only if there exists k ∈ {1,..., p} such that

    xtk = xck ± (bk − ak) / 2^n   and   xti = xci for i ≠ k.

We take as the core pattern the set of grid points adjacent to the current iterate. (For example, if the current grid is the integer lattice on R² restricted to [a = (0, 0)', b = (8, 8)'], then the core pattern of (2, 0)' comprises (3, 0)', (2, 1)', and (1, 0)'.) We refine the grid, i.e. we replace Γ(n) with Γ(n+1), if and only if we have evaluated f at each grid point xt adjacent to xc and failed to find f(xt) < f(xc).
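To make the grid and core-pattern constructions concrete, here is a small sketch in Python/NumPy. It is ours, not the report's; the function names are hypothetical, but the formulas are those given above, and the example reproduces the core pattern of (2, 0)' quoted in the text.

    import numpy as np

    def grid_coords_1d(a_i, b_i, n):
        # Coordinates of Gamma(n) along dimension i: a_i + j*(b_i - a_i)/2^n, j = 0,...,2^n.
        j = np.arange(2 ** n + 1)
        return a_i + j * (b_i - a_i) / 2 ** n

    def core_pattern(xc, a, b, n):
        # Grid points adjacent to xc: perturb one coordinate by +/- (b_k - a_k)/2^n,
        # hold the others fixed, and discard points that leave [a, b].
        h = (b - a) / 2 ** n
        pts = []
        for k in range(len(xc)):
            for sign in (+1.0, -1.0):
                xt = xc.astype(float).copy()
                xt[k] += sign * h[k]
                if a[k] <= xt[k] <= b[k]:
                    pts.append(xt)
        return pts

    # Example from the text: the integer lattice on [0, 8]^2 is Gamma(3); the core
    # pattern of (2, 0)' comprises (3, 0)', (2, 1)', and (1, 0)'.
    a, b = np.array([0.0, 0.0]), np.array([8.0, 8.0])
    print(core_pattern(np.array([2.0, 0.0]), a, b, n=3))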
If f is continuously differentiable, then the theory developed by Lewis and Torczon (1996a) guarantees that the specified algorithm will produce a sequence {xk} that converges to a Karush-Kuhn-Tucker point of Problem (1). In practice, of course, the algorithm must terminate in a finite number of steps. Termination criteria for pattern search methods--indeed, for direct search methods in general--have not been studied extensively, but the assumption that Problem (1) is expensive means that we cannot afford to evaluate the objective function often enough to terminate by the traditional criteria of numerical optimization. For the problems that we consider, the relevant termination criterion is whether or not we have exhausted the permitted number (V) of function evaluations.
Derivative-free methods for numerical optimization can be quite profligate with respect to the number of function evaluations that they require. Because the number of function evaluations available to us is severely limited, we want to use these evaluations as efficiently as possible. On the bright side, the convergence theory for pattern search methods allows us to replace xc with any xt for which f(xt) < f(xc). Hence, no matter how comprehensive a search for trial points we may have envisioned, we can abort it as soon as we find a single trial point that satisfies f(xt) < f(xc). On the dark side, the oracle may require a great many function evaluations to produce even one xt for which f(xt) < f(xc). Furthermore, if the oracle is unsuccessful, then we cannot refine the current grid until after f has been evaluated at each grid point adjacent to xc--a process that may require as many as 2p additional function evaluations if xc is an interior grid point and f has not yet been evaluated at any of the grid points adjacent to xc.

The fundamental purpose of this report is to introduce methods that reduce the number of function evaluations required by the oracle. We do not attempt to reduce the number of points in the core pattern; for unconstrained optimization, Lewis and Torczon (1996b) have introduced core patterns that require fewer points. Pattern search algorithms may intentionally use large numbers of function evaluations. For example, the oracle employed by Dennis's and Torczon's (1991) parallel direct search intentionally casts a wide net, relying on parallel computation to defray the expense of evaluating the objective function at a great many grid points. We are concerned with problems for which huge numbers of function evaluations are not possible. Here, we want an oracle that proposes promising trial points using only a small number of function values. Our strategy for creating such an oracle will be to use previous function values to construct a current global model, f̂c, of f, then use f̂c to predict trial points xt at which f(xt) < f(xc). Thus, we will employ the strategy described in Section 1, not once to replace Problem (1), but repeatedly to guide us to a solution of it. We now turn to a description of the models that our oracles will exploit.
3 Computer Experiment Models
Suppose that we have observed yi = f(xi) for i = 1,..., n. On the basis of this information, we want to construct a global model f̂ of f. Such inexpensive surrogates for f will be used by the oracle in the pattern search algorithm to identify promising trial points at which to compute additional function values.

We assume that there is no uncertainty in the yi = f(xi), i.e. that no stochastic mechanism is involved in evaluating the objective function. It is then reasonable to require the surrogate function to interpolate these values, i.e. to require that f̂(xi) = f(xi). Furthermore, we desire families of surrogate functions that are rich enough to model a variety of complicated objective functions. We are led to consider certain families of models that have been studied in the spatial statistics and computer experiment literatures. The families of models that we consider are usually motivated by supposing that f is the realization of some (nice) stochastic process. For certain geostatistical applications, this supposition is quite plausible. In the context of using computer simulations to facilitate product designs, its plausibility is less clear. The high-frequency, low-amplitude oscillations that we have described do resemble the realization of a stochastic process, but the general trends that are our primary concern do not.
The supposition of an underlying stochastic process is thus a convenient fiction. Its value lies in its power to suggest methods of constructing interpolating surrogates and in the utility of well-known facts about kriging; we will try to avoid invoking it beyond these purposes. Our requirement, however, is simply to approximate f as well as possible. In this spirit, we briefly describe kriging as it is used in the design and analysis of computer experiments, following Sacks, Welch, Mitchell and Wynn (1989).
We begin by assuming that f is the realization of a stochastic process with known (zero) mean and known covariance function c(s, t), assumed strictly positive definite. Let y = [f(x1),..., f(xn)]' denote the vector of observed function values, and for each x ∈ R^p define b(x) ∈ R^n to minimize E[y'b − F(x)]², where F denotes the underlying process. Then f̂(x) = y'b(x) is the best linear unbiased predictor of f(x), and it is well known that

    f̂(x) = y'C^{-1}c(x),                                                       (2)

where C is the symmetric positive definite n × n matrix [c(xi, xj)] and c(x) = [c(x1, x),..., c(xn, x)]'. This is a simple example of kriging. Notice that kriging necessarily interpolates: since c(xj) is column j of C and C^{-1}C = I, we have C^{-1}c(xj) = ej, so that f̂(xj) = y'C^{-1}c(xj) = y'ej = yj = f(xj).
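As a small numerical illustration of (2) and of the interpolation property (ours, not the report's; the covariance function and data below are hypothetical):

    import numpy as np

    def simple_kriging(X, y, cov):
        # Build the predictor (2): f_hat(x) = y' C^{-1} c(x) = c(x)' C^{-1} y (C is symmetric),
        # with C = [c(x_i, x_j)].
        C = np.array([[cov(s, t) for t in X] for s in X])
        w = np.linalg.solve(C, y)                           # C^{-1} y
        return lambda x: np.array([cov(xi, x) for xi in X]) @ w

    cov = lambda s, t: np.exp(-0.5 * np.sum((s - t) ** 2))  # a hypothetical covariance function
    X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
    y = np.array([1.0, -2.0, 0.5])
    fhat = simple_kriging(X, y, cov)
    print([float(fhat(xi)) for xi in X])                    # recovers y: kriging interpolates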
So far, we have assumed that the covariance function of the process is known. We now suppose that F is a stochastic process whose mean is an unknown linear combination of known functions: μ(x) = a(x)'β, where a : R^p --> R^q is a known vector of functions and β ∈ R^q is an unknown vector. We further suppose that σ² > 0 is an unknown variance and that the correlation structure is specified by a known family of correlation functions r(·, ·; θ) with an unknown parameter θ. Let A denote the n × q matrix [aj(xi)], let R(θ) denote the symmetric n × n matrix [r(xi, xj; θ)], and let r(x; θ) = [r(x1, x; θ),..., r(xn, x; θ)]'. Then, for β and θ fixed, the best linear unbiased predictor f̂(x) of f(x) is (cf. equation (2))

    f̂(x) = a(x)'β + (y − Aβ)'R(θ)^{-1} r(x; θ).                                 (3)
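A sketch of the predictor (3) for fixed β and θ; the regression functions a(x), the correlation family r, and the numerical values below are our own placeholders, not the report's:

    import numpy as np

    def kriging_predictor(x, X, y, beta, theta, a, r):
        # Equation (3): f_hat(x) = a(x)'beta + (y - A beta)' R(theta)^{-1} r(x; theta).
        A = np.array([a(xi) for xi in X])                        # n x q matrix [a_j(x_i)]
        R = np.array([[r(s, t, theta) for t in X] for s in X])   # n x n matrix [r(x_i, x_j; theta)]
        rx = np.array([r(xi, x, theta) for xi in X])             # n-vector r(x; theta)
        return a(x) @ beta + (y - A @ beta) @ np.linalg.solve(R, rx)

    a = lambda x: np.array([1.0])                                # constant mean: a(x) = 1, q = 1
    r = lambda s, t, th: np.exp(-th * np.sum((s - t) ** 2))      # an assumed isotropic correlation
    X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
    y = np.array([1.0, -2.0, 0.5])
    print(kriging_predictor(np.array([0.5, 0.5]), X, y, np.array([0.0]), 1.0, a, r))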
Thus, by varying β and θ, we define a family of interpolating functions; by selecting specific values of (β, σ², θ), we thereby select a specific f̂ to model f. Given such a family, the maximum likelihood estimates of β and σ² are available in explicit formulas once θ is fixed; to compute the maximum likelihood estimate of θ, it turns out that one must minimize

    n log σ̂²(θ) + log det R(θ)                                                  (4)

as a function of θ.

It is now evident that, in specifying a family of correlation functions, there is a potential trade-off between the richness of the family of models defined by (3) and the ease of minimizing (4). The richer the family, the more difficult it may be to select a plausible member of it. Most of the previous research on computer experiments has been concerned with deriving a single model f̂ as a surrogate for f. Understandably, the authors have used rich families with complicated correlation functions for which θ is a vector of dimension p or greater; maximizing the likelihood, i.e. minimizing (4), may then be difficult, but θ need only be computed once. In contrast, we are concerned with deriving a sequence of models that will be used for the sole purpose of guiding our optimization of f. Hence, we are content to sacrifice some flexibility in (3) in order to simplify minimizing (4). In the numerical experiments that we report in Section 5, we used a 1-parameter isotropic correlation function r_θ(s, t), which depends on s and t only through the Euclidean distance ||s − t||.
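The report's specific 1-parameter correlation function is not reproduced in this copy; the sketch below therefore assumes a Gaussian (squared-exponential) form purely for illustration. It shows how a scalar θ can be estimated by minimizing criterion (4), using the closed-form generalized least squares expressions for β̂(θ) and σ̂²(θ):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def criterion(theta, X, y, a, r):
        # Evaluate (4): n log sigma2_hat(theta) + log det R(theta), where beta_hat(theta)
        # and sigma2_hat(theta) are the closed-form MLEs for fixed theta.
        n = len(y)
        A = np.array([a(xi) for xi in X])
        R = np.array([[r(s, t, theta) for t in X] for s in X])
        Ri_A, Ri_y = np.linalg.solve(R, A), np.linalg.solve(R, y)
        beta_hat = np.linalg.solve(A.T @ Ri_A, A.T @ Ri_y)
        resid = y - A @ beta_hat
        sigma2_hat = resid @ np.linalg.solve(R, resid) / n
        return n * np.log(sigma2_hat) + np.linalg.slogdet(R)[1]

    a = lambda x: np.array([1.0])                                # a(x) = 1, as in Section 5
    r = lambda s, t, th: np.exp(-th * np.sum((s - t) ** 2))      # assumed correlation family
    X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0], [2.0, 3.0]])
    y = np.array([1.0, -2.0, 0.5, 0.0])
    best = minimize_scalar(criterion, bounds=(1e-3, 10.0), method="bounded", args=(X, y, a, r))
    print(best.x)    # the value of theta that minimizes (4)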
4 The Model-Assisted Grid Search

The optimization strategies developed in this report are all predicated on a simple idea, viz. that providing the oracle in Section 2 with an inexpensive surrogate model of the objective function will allow it to identify promising trial points more efficiently and thereby reduce the cost of optimization. The surrogate models will be constructed by kriging known function values, as described in Section 3. In this section, we describe in greater detail how these models are used to guide the pattern search.

We begin with an analogy. For the minimization of smooth objective functions, the numerical algorithms of choice are the quasi-Newton methods. An elementary exposition of these methods was provided by Dennis and Schnabel (1983), who emphasized the following interpretation: a quasi-Newton algorithm constructs a quadratic model f̂c of the objective function f at the current iterate xc and uses f̂c to identify a trial point xt at which it is predicted that f(xt) < f(xc). For example, trust region methods obtain xt by (approximately) minimizing f̂c subject to the constraint that xt be within a specified neighborhood of xc. The rationale for the constraint is that the model f̂c, which is usually constructed by computing or approximating the second-order Taylor polynomial of f at xc, can only be trusted to approximate f locally.
1. Choose an initial grid, Γc.
2. Select an initial design of sites x1,..., xN from Γc.
3. Evaluate f(x1),..., f(xN); initialize Evalc = {x1,..., xN} and construct the initial model f̂c.
4. Do until the budget of V function evaluations is exhausted:
   (a) Do until an improving point is found or Core(xc) ⊆ Evalc:
       i. Apply an optimization algorithm to f̂c to obtain a trial point xt ∈ Γc \ Evalc; evaluate f(xt); update Evalc and f̂c.
   (b) Refine the grid.

Figure 2: Model-Assisted Grid Search (MAGS) for minimizing an expensive objective function.
Trust region methods make effective use of simple local models of the objective function. Because we are concerned with situations in which evaluation of the objective function is prohibitively expensive, we are prepared to invest more resources in constructing and optimizing more complicated global models of the objective function. Similarly, classical response surface methodology, from Box and Wilson (1951) to Myers and Montgomery (1995), constructs local linear or quadratic regression models of a stochastic objective function. Again, the purpose of these models is to guide the search for a minimizer or maximizer. Glad and Goldstein (1977) also exploited quadratic regression models for optimization, as did Elster and Neumaier (1995) to guide a grid search. Recently, nonparametric response surface methods have been proposed in which global models of more complicated objective functions are constructed. This work, e.g. Haaland (1996), is closely related to ours. Frank (1995) surveyed numerous issues that arise when using global models of expensive objective functions to facilitate optimization.

The remainder of this section develops a specific methodology for using global models to minimize expensive objective functions. The essential logic of the methods that we propose is summarized in Figure 2. For simplicity, we assume that the expense of all operations performed on the models is negligible in comparison to the expense of function evaluation(s). Dennis and Torczon (1996) proposed a general model management strategy that can be employed when it is necessary to balance these expenses. This model management strategy is currently being developed and extended by Serafini (1997) and is capable of accommodating our methods as a special case.

Techniques for designing the initial computer experiment in step 2 do not concern us here, except for two details. First, it is convenient to employ a technique that permits the initial design sites to be selected from the initial grid. Several such techniques were described by Koehler and Owen (1996). Second, we must specify N, the number of initial design sites. A great deal of numerical experimentation will be required before it becomes possible to suggest useful guidelines for specifying N.
Because we are concerned with sequential optimization of functions for which the permitted number of evaluations is considerably smaller than is customary, we prefer to choose N small, permitting at least a simplex design. Because evaluation of the objective function is expensive, it seems sensible to cache the function values that have been computed. We denote the current set of points at which f has been evaluated by Evalc, which in step 3 we initialize to comprise the initial design sites. For any point x ∈ Γc, we denote by Core(x) the core pattern of points adjacent to x. Then, in loop 4(a), Core(xc) ⊆ Evalc is precisely the condition, required by the convergence theory for pattern search methods, that must be satisfied before the current grid can be refined. This condition must eventually obtain, i.e. loop 4(a) cannot repeat indefinitely, because Γc is finite and we only consider trial points xt ∈ Γc at which f has not yet been evaluated. Thus, the logic of loop 4 is to (a) search for points of improvement on the current grid until (b) we replace the current grid with a finer grid. Loop 4 repeats until we have exhausted our budget of objective function evaluations.
It is within loop 4(a) that we exploit the models obtained by kriging. Step 4(a)(i) produces a trial point xt at which f is evaluated in the hope that f(xt) < f(xc). As soon as f(xt) has been computed, we add xt to the set Evalc and update the current model f̂c. It is not entirely clear how best to update f̂c. One could completely refit the family of models to the new set of function values, but one might decline to re-estimate certain parameters if one believed that the value of updating them did not justify the expense.

The generality of the convergence theory for pattern search methods permits us to be rather vague in specifying step 4(a)(i). This ambiguity is desirable because it encompasses a great many possibilities deserving consideration. Thus, we can search for a local minimizer of f̂c using whatever algorithm we prefer, e.g. quasi-Newton, steepest descent, pattern search, etc. We can start our search from whatever point we prefer, or we can use multiple starting points and pursue multiple searches before determining xt. We can even search for a global minimizer of f̂c if we are so inclined. Furthermore, we can terminate our search whenever we please. (We envision searching until we near a local minimizer of f̂c, but we can terminate sooner if the search becomes expensive. Trade-offs between the cost of evaluating the objective function and the cost of constructing and optimizing the models can be mediated by the model management strategy proposed by Dennis and Torczon (1996) and developed by Serafini (1997).) The only requirement is that we eventually return a trial point xt that belongs to the current grid and at which f has not yet been evaluated.

Most search strategies for a desirable xt will not restrict attention to the current grid. Because we require xt ∈ Γc, we suggest terminating the search when step lengths become appreciably smaller than the minimal distance between grid points and choosing xt to be a grid point that is near the terminal iterate of the search. There are various plausible ways of implementing this suggestion. We might prefer the nearest grid point or we might prefer a nearby grid point at which the model predicts greater decrease. Because we require xt ∉ Evalc, we might not actually select the nominally preferred grid point. Similarly, if we are impatient to refine the grid, we might select xt ∈ Core(xc) even when a nonadjacent point is nominally preferred.
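The following sketch (ours; the report gives no code) paraphrases the logic of Figure 2. The helpers `initial_design`, `fit`, and `propose` are assumed to be supplied by the user: `propose` plays the role of the oracle of Section 2 and must return a grid point at which f has not yet been evaluated.

    import numpy as np

    def core_pattern_evaluated(xc, X, a, b, n):
        # True once f has been evaluated at every feasible grid point adjacent to xc.
        h = (b - a) / 2 ** n
        for k in range(len(xc)):
            for sign in (+1.0, -1.0):
                xt = np.array(xc, dtype=float)
                xt[k] += sign * h[k]
                if a[k] <= xt[k] <= b[k] and not any(np.allclose(xt, xe) for xe in X):
                    return False
        return True

    def mags(f, a, b, n0, initial_design, fit, propose, budget):
        n = n0
        X = list(initial_design(n))                 # steps 2-3: design sites on the initial grid
        y = [f(x) for x in X]
        k = int(np.argmin(y))
        xc, fc = X[k], y[k]
        model = fit(X, y)
        evals = len(y)
        while evals < budget:                       # loop 4
            improved = False
            while evals < budget:                   # loop 4(a): search the current grid
                xt = propose(model, xc, n)          # 4(a)(i): trial point in Gamma_c \ Eval_c
                ft = f(xt); evals += 1
                X.append(xt); y.append(ft)
                model = fit(X, y)                   # update the current surrogate
                if ft < fc:
                    xc, fc, improved = xt, ft, True
                    break
                if core_pattern_evaluated(xc, X, a, b, n):
                    break                           # core pattern exhausted without improvement
            if not improved:
                n += 1                              # 4(b): refine the grid
        return xc, fc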
5 Numerical Experiments
We now describe some numerical experiments intended to illustrate the viability of the methods described in Section 4. These experiments were performed on a 75MHz Pentium processor, using S-PLUS software. Satellite optimization (for parameter estimation) was performed with optimize, an implementation of Brent's (1973) safeguarded polynomial interpolation procedure. When finite-difference quasi-Newton methods for multivariate minimization were employed (typically for minimizing models), they were performed using the S-PLUS function nlminb, an implementation of a trust-region method for bound-constrained optimization developed by Gay (1983, 1984).

    Minimizer    f(x)    Frequency
    (0, -10)        3       38
    (-6, -4)       30       44
    (18, 2)        84        8
    (12, 8)       840       10

Table 1: Minima of the rescaled Goldstein-Price objective function on [-20, 20] × [-20, 20] found by 100 quasi-Newton searches.
used in our experiments was a rescaled version of an eighth-order polyconstructed by Goldstein and Price (1971). One of the test problems Szeg5 (1978) in their study of global optimization was to minimize the
Goldstein-Price polynomial in the feasible region [-2, 2] [-2, 2]. We rescaled this problem so that the feasible region was [-20, 20] [-20, 20]. The Goldstein-Pricc polynomial is a complicated function that has four minimizers and is not easily modelled. Because it is a polynomial, the minimizers are easily located by a finite-difference quasi-Newton method. To estimate the relative sizes of their basins of attraction, we drew 100 points results from a uniform are summarized distribution on the feasible in Table 1. are a special region and started nlminb from each point. The
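For reference, a sketch of the test function; the Goldstein-Price formula is standard, and the rescaling shown (evaluating the polynomial at x/10) is consistent with the minimizers and minimum values reported in Table 1:

    import numpy as np

    def goldstein_price(u, v):
        # Goldstein and Price (1971) polynomial, usually posed on [-2, 2] x [-2, 2].
        t1 = 1 + (u + v + 1) ** 2 * (19 - 14*u + 3*u**2 - 14*v + 6*u*v + 3*v**2)
        t2 = 30 + (2*u - 3*v) ** 2 * (18 - 32*u + 12*u**2 + 48*v - 36*u*v + 27*v**2)
        return t1 * t2

    def f(x):
        # Rescaled version on [-20, 20] x [-20, 20]; f((0, -10)) = 3, the global minimum.
        return goldstein_price(x[0] / 10.0, x[1] / 10.0)

    print(f(np.array([0.0, -10.0])))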
The models that we employed are a special case of (3) in which β is a scalar parameter and a(x) ≡ 1. We used the correlation function r_θ and estimated the scalar parameter θ by applying optimize to (4).
We implemented two strategies for derivative-free optimization with a small value of V, the permitted number of objective function evaluations. First, we implemented a DACE strategy with V = 11, 16, 21, 26. For these procedures, we constructed a model, f̂, from N = V − 1 design sites obtained by Latin hypercube sampling. We then minimized f̂ by a finite-difference quasi-Newton method, starting from the initial design site with the smallest objective function value. The objective function was then evaluated at the minimizer of f̂ so obtained, and the smallest of the V function values was recorded.

Second, we implemented the MAGS strategy described in Section 4 with V = 11, 16. For these procedures, it is first necessary to specify an initial grid. This involves a trade-off: coarse grids tend to safeguard against unlucky initial designs, whereas fine grids permit more function evaluations before it becomes necessary to evaluate f at a core pattern of points in order to refine the grid. Our choice of a family of grids was further complicated by the fact that the minimizers of our objective function occur on the integer lattice in R². To avoid using grids that contain the minimizers, we selected an initial grid of the form xi = −20 + ji π/2, for ji = 0, 1, 2,.... Subsequent grids, which were rarely required, were obtained by halving the current distance between adjacent grid points. To obtain an initial design on the initial grid, we first obtained N = 5 points by Latin hypercube sampling, then moved each point to the nearest grid point. The initial model was constructed from this design.
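A sketch of the initial-design construction just described, i.e. a Latin hypercube sample of N = 5 points snapped to the initial grid xi = -20 + ji*pi/2 (the Latin hypercube routine and random seed are our own illustration, not the report's S-PLUS code):

    import numpy as np

    rng = np.random.default_rng(0)

    def latin_hypercube(N, a, b, rng):
        # Basic Latin hypercube sample of N points in the box [a, b].
        p = len(a)
        strata = rng.permuted(np.tile(np.arange(N), (p, 1)), axis=1).T
        U = (strata + rng.random((N, p))) / N
        return a + U * (b - a)

    def snap_to_initial_grid(x):
        # Move a point to the nearest point of the grid x_i = -20 + j_i * pi / 2, j_i >= 0.
        h = np.pi / 2.0
        j = np.maximum(np.rint((x + 20.0) / h), 0)
        return -20.0 + j * h

    a, b = np.array([-20.0, -20.0]), np.array([20.0, 20.0])
    design = np.array([snap_to_initial_grid(x) for x in latin_hypercube(5, a, b, rng)])
    print(design)     # N = 5 initial design sites on the initial grid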
    Strategy     10%    20%    30%    40%     50%     60%     70%     80%      90%
    DACE11      3.41   9.60  15.63  37.26  116.75  367.95  601.86  951.55  1027.52
    DACE16      5.29  11.99  13.74  30.83   71.61  122.35  240.92  423.13  1243.90
    DACE21      5.02  10.64  17.16  26.04   51.69   99.96  190.71  230.89   774.91
    DACE26      3.26   9.89  14.42  24.75   41.70   86.33  154.20  281.84   349.43
    MAGS11      5.77   5.77   6.78  28.98   43.89   87.78  343.21  529.02  1617.61
    MAGS16      3.43   5.77   5.77   6.78   30.48   64.55  101.63  135.22   885.32

Table 2: Percentiles of the smallest values of the Goldstein-Price objective function found by six optimization strategies, each replicated 100 times.
To obtain a trial point, xt, from the current iterate, xc, a finite-difference quasi-Newton search was started from xc to obtain a minimizer, x̂, of the current model, f̂c. Let x̄ denote the grid point nearest x̂. If the objective function had not previously been evaluated at x̄, then we set xt = x̄; otherwise, we set xt equal to the point in Core(x̄) nearest x̂. Each time that a new function value was obtained, we re-estimated all three of the model parameters, (β, σ², θ). This process was repeated until V function evaluations had been performed, at which time the smallest function value was recorded.
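A minimal sketch of that trial-point rule (ours; the containers used to represent the grid, the set of evaluated points, and Core(·) are assumptions of this illustration):

    import numpy as np

    def select_trial_point(x_hat, grid_points, evaluated, core_of):
        # Snap the model minimizer x_hat to the nearest grid point x_bar; if f has already
        # been evaluated there, fall back to the point of Core(x_bar) nearest to x_hat.
        d = np.linalg.norm(grid_points - x_hat, axis=1)
        x_bar = grid_points[int(np.argmin(d))]
        if tuple(x_bar) not in evaluated:
            return x_bar
        core = np.array(core_of(x_bar))
        return core[int(np.argmin(np.linalg.norm(core - x_hat, axis=1)))]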
Each of the six procedures described above was replicated 100 times. The results are summarized in Table 2, from which several interesting conclusions can be drawn. First, we note that each strategy did indeed do better when permitted more function evaluations. Thus, DACE with V = 16 function evaluations usually outperformed DACE with V = 11 function evaluations, and MAGS with V = 16 function evaluations usually outperformed MAGS with V = 11 function evaluations. This result is not surprising. Second, we note that MAGS typically found smaller function values with fewer function evaluations than did DACE. The crucial comparisons are between DACE and MAGS with V = 11 function evaluations and between DACE and MAGS with V = 16 function evaluations. In each case, typical performance decisively favors MAGS. For example, the median best function value found by MAGS11 is less than 40 percent of the median best value found by DACE11. Similarly, the median best function value found by MAGS16 is less than 45 percent of the median best value found by DACE16. The DACE experiments with larger V serve to benchmark the performance of MAGS. Except for the extreme upper percentiles, the distribution of best function values found by MAGS with V = 11 is strikingly similar to the distribution of best function values found by DACE with V = 26 function evaluations. Thus, another way to state the case for MAGS is to observe that DACE typically required more than twice as many function evaluations to match the performance of MAGS11.

Such observations confirm the central thesis of this report, that optimization is most efficiently accomplished by iterative methods. This is a familiar issue in statistics that is often raised when comparing Taguchi and response surface methods, e.g. in the panel discussion edited by Nair (1992) and by Trosset (1996). Our results strengthen the cases made by Frank (1995), Booker et al. (1995), Booker et al. (1996) and Booker (1996) for using sequential modeling strategies to optimize expensive functions. We note that, despite its generally superior performance, MAGS occasionally disappoints. This phenomenon seems to result from unfortunate initial designs: with only N = 5 design sites, Latin hypercube sampling on a grid occasionally results in transparently bad designs from which MAGS has difficulty recovering. To safeguard against this possibility, one might consider using more sophisticated procedures for determining the initial design.
6 Conclusions and Future Work

The results presented in Section 5 suggest that the potential value of the methods that we have proposed for the numerical optimization of expensive objective functions is considerable. Of course, comprehensive numerical experimentation with a variety of objective functions in a variety of dimensions will be required to fully appreciate the advantages and disadvantages of these methods. We plan to undertake such investigations in future work. We conclude this report by describing some modifications of MAGS that we envision for these investigations.
We begin by noting that the design of the current model, f̂c, is sequential, comprising the initial design sites x1,..., xN and the trial sites xt subsequently identified in 4(a)(i) in Figure 2. At present, the initial design sites are selected according to the principles of experimental design without regard to optimization, whereas the subsequent trial sites are determined by an optimization algorithm without regard to the design of experiments. Because the sequence of trial points generated by an optimization algorithm tends to cluster in promising regions, this sequence is rarely space-filling and is not likely to be a good experimental design. Moreover, especially during the early stages of optimization, greater gains are likely to come from improving the fit of f̂c to f than from accurately identifying a minimizer of f̂c. Our present implementation of MAGS is a greedy algorithm in the sense that it tries to minimize f̂c without concern for the fit of the subsequent model, f̂+.

The desirability of balancing the concerns of numerical optimization and the concerns of experimental design was recognized by Frank (1995), who proposed a dichotomous search strategy. Given an optimization criterion, e.g. the objective function, and a design criterion, one obtains some fraction of new design sites using one criterion and the balance using the other. An implementation of this Balanced Local-Global Search (BLGS) strategy was described by Booker et al. (1995) and employed by Booker (1996) and Booker et al. (1996). In contrast to the dichotomous BLGS strategy, we envision modifying 4(a)(i) as follows: apply an optimization algorithm to a merit function, φc, to obtain xt. The merit function should be specified so as to balance the potentially competing goals of finding trial points at which the objective function is small and choosing good design sites. Notice that this modification in no way affects the convergence theory for MAGS--it is a purely empirical matter whether or not it improves the performance of the algorithm.
To illustrate what we have in mind, we again invoke the fiction that the objective function, f, is a realization of a stochastic process. Under the assumptions delineated in Section 3, the mean squared error of (3) for predicting f(x) is

    MSE[f̂(x)] = σ² ( 1 − [a(x)', r(x; θ)'] M(θ)^{-1} [a(x)', r(x; θ)']' ),

where M(θ) is the (q+n) × (q+n) matrix with blocks [0, A'; A, R(θ)]. As detailed by Sacks, Welch, Mitchell and Wynn (1989), this function plays a fundamental role in several approaches to optimal design. It is also the design criterion employed by Booker et al. (1995) and Booker (1996) in BLGS.
Our idea is to construct a merit function that encourages the selection of trial points at which the mean squared error is large, i.e. at which we believe that less is known about the objective function. This strategy has a great deal in common with the Sequential Design for Optimization (SDO) algorithm for global optimization proposed by Cox and John (1996). The mean squared error function can be estimated by replacing the parameters (β, σ², θ) with their MLEs, resulting in an expression that we denote by M̂SE[f̂(x)]. Then a natural family of merit functions comprises those of the form

    φc(x) = f̂c(x) − ρc M̂SE[f̂c(x)],

where ρc ≥ 0. We anticipate that a great deal of numerical experimentation will be required to determine useful choices of the ρc.
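A sketch of this merit function and of the estimated mean squared error that enters it (ours, not the report's; the regression functions, correlation family, and parameter values below are placeholders standing in for the MLEs described above):

    import numpy as np

    def mse_hat(x, X, a, r, theta, sigma2):
        # Estimated MSE of predictor (3), cf. Sacks, Welch, Mitchell and Wynn (1989):
        #   sigma2 * (1 - [a(x)', r(x;theta)'] M^{-1} [a(x); r(x;theta)]),
        # with M = [[0, A'], [A, R(theta)]].
        A = np.array([a(xi) for xi in X])
        R = np.array([[r(s, t, theta) for t in X] for s in X])
        ax = a(x)
        rx = np.array([r(xi, x, theta) for xi in X])
        q = A.shape[1]
        M = np.block([[np.zeros((q, q)), A.T], [A, R]])
        v = np.concatenate([ax, rx])
        return sigma2 * (1.0 - v @ np.linalg.solve(M, v))

    def merit(x, fhat, mse, rho):
        # phi_c(x) = fhat_c(x) - rho_c * MSE_hat[fhat_c(x)]: small predicted values and
        # large predictive uncertainty both make a trial point attractive.
        return fhat(x) - rho * mse(x)

    # Hypothetical ingredients for a quick check.
    a_fun = lambda x: np.array([1.0])
    r_fun = lambda s, t, th: np.exp(-th * np.sum((s - t) ** 2))
    X = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
    print(mse_hat(np.array([0.5, 0.5]), X, a_fun, r_fun, theta=1.0, sigma2=2.0))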
We also note that our family of kriging models was derived by supposing that f is the realization of a stochastic process with a single, global correlation structure. In fact, it seems rather implausible that such homogeneity will obtain throughout the feasible set, so that we likely will prefer different values of θ for different regions of [a, b]. This poses a problem, because the f̂c are global models. As we descend into the basin of a local minimizer, we would like to use a value of θ that is suited to that basin, not a value of θ that was determined by a global compromise. An obvious way of addressing this difficulty is to implement the "zoom-in" process proposed by Frank (1995). To do so, we occasionally modify the feasible set, restricting it to a much smaller set in which a minimizer is deemed to lie. Then the current model, f̂c, is determined by the design sites in, and restricted in domain to, the current feasible set, [ac, bc]. From a theoretical perspective, this is a delicate strategy that must be carefully implemented in order to preserve the convergence guarantees that have thus far been inherited from the standard theory for pattern search methods. Properly implemented, however, updating the feasible set may permit a useful localization of the models.

Suppose that we are prepared to replace the current feasible set with a smaller one: when should we do so? When step 4(a)(i) begins to produce trial points xt ∈ Core(xc), we suspect that xc is near a local minimizer and that the algorithm is preparing to refine the grid. As previously noted, this can be an expensive undertaking, requiring as many as 2p evaluations of f. At this stage, an interactive user might very well instruct the algorithm to refine the grid without actually evaluating f on Core(xc), reasoning that an asymptotic convergence theory may be of little relevance when one can afford only a small number of iterations. Alternatively, one might interpret the same scenario as a signal to recalibrate the problem. First, one would replace the current feasible set with a smaller one. Second, instead of performing 2p function evaluations on Core(xc)--a set of points that is not a good design--one would replace the current grid with a finer grid, design an experiment for the new grid, and proceed. At present, there is no theory to justify these tactics, but it seems quite plausible that they will result in practical improvements of an already-promising procedure.
Acknowledgments

This research evolved from an ongoing collaboration with Andrew Booker, Paul Frank, and Greg Shubin (Boeing Company); Andrew Conn (IBM); and John Dennis and David Serafini (Rice University). Although we have been profoundly influenced by the ideas of our collaborators, the opinions expressed in this report may not be shared by all members of the collaboration.
References
Barthelemy, J.-F. M. and Haftka, R. T. (1993). Approximation concepts for optimum structural design--a review. Structural Optimization, 5:129-144.

Booker, A. J. (1994). DOE for computer output. Technical Report BCSTECH-94-052, Boeing Computer Services, Seattle, WA.

Booker, A. J. (1996). Case studies in design and analysis of computer experiments. In Proceedings of the Section on Physical and Engineering Sciences. American Statistical Association.

Booker, A. J., Conn, A. R., Dennis, J. E., Frank, P. D., et al. (1995). Multi-level design optimization: Boeing/IBM/Rice Collaborative Project. Technical Report ISSTECH-95-031, Boeing Information & Support Services, Seattle, WA.

Booker, A. J., Conn, A. R., Dennis, J. E., Frank, P. D., et al. (1996). Global modeling for optimization: Boeing/IBM/Rice Collaborative Project, 1996 final report. Boeing Information & Support Services, Seattle, WA.

Box, G. E. P. and Wilson, K. B. (1951). On the experimental attainment of optimum conditions. Journal of the Royal Statistical Society, Series B, 13:1-45. Includes discussion.

Brent, R. (1973). Algorithms for Minimization Without Derivatives. Prentice-Hall, Englewood Cliffs, NJ.

Cox, D. D. and John, S. (1996). SDO: A statistical method for global optimization. In Alexandrov, N. and Hussaini, M. Y., editors, Multidisciplinary Design Optimization: State of the Art. SIAM, Philadelphia.

Dennis, J. E. and Schnabel, R. B. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall, Englewood Cliffs, NJ.

Dennis, J. E. and Torczon, V. (1991). Direct search methods on parallel machines. SIAM Journal on Optimization, 1:448-474.

Dennis, J. E. and Torczon, V. (1996). Managing approximation models in optimization. In Alexandrov, N. and Hussaini, M. Y., editors, Multidisciplinary Design Optimization: State of the Art. SIAM, Philadelphia.

Dixon, L. C. W. and Szegö, G. P. (1978). The global optimisation problem: An introduction. In Dixon, L. C. W. and Szegö, G. P., editors, Towards Global Optimisation 2, pages 1-15. Elsevier North-Holland, New York.

Elster, C. and Neumaier, A. (1995). A grid algorithm for bound constrained optimization of noisy functions. IMA Journal of Numerical Analysis, 15:585-608.

Frank, P. D. (1995). Global modeling for optimization. SIAG/OPT Views-and-News, (7):9-12.

Gay, D. M. (1983). Algorithm 611: Subroutines for unconstrained minimization using a model/trust-region approach. ACM Transactions on Mathematical Software, 9:503-524.

Gay, D. M. (1984). A trust region approach to linearly constrained optimization. In Lootsma, F. A., editor, Numerical Analysis: Proceedings, Dundee 1983, pages 171-189. Springer, Berlin.

Glad, T. and Goldstein, A. (1977). Optimization of functions whose values are subject to small errors. BIT, 17:160-169.

Goldstein, A. A. and Price, J. F. (1971). On descent from local minima. Mathematics of Computation, 25:569-574.

Koehler, J. R. and Owen, A. B. (1996). Computer experiments. In Ghosh, S. and Rao, C. R., editors, Handbook of Statistics, Volume 13, pages 261-308. Elsevier Science, New York.

Lewis, R. M. and Torczon, V. (1996a). Pattern search algorithms for bound constrained minimization. Technical Report 96-20, ICASE, NASA Langley Research Center, Hampton, VA. To appear in SIAM Journal on Optimization.

Lewis, R. M. and Torczon, V. (1996b). Rank ordering and positive bases in pattern search algorithms. Technical Report 96-71, ICASE, NASA Langley Research Center, Hampton, VA.

McKinnon, K. I. M. (1996). Convergence of the Nelder-Mead simplex method to a non-stationary point. Technical Report MS 96-006, Department of Mathematics & Statistics, University of Edinburgh, Edinburgh, Scotland.

Myers, R. H. and Montgomery, D. C. (1995). Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons, New York.

Nair, V. N. (1992). Taguchi's parameter design: A panel discussion. Technometrics, 34:127-161.

Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7:308-313.

Sacks, J., Welch, W. J., Mitchell, T. J., and Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4:409-435. Includes discussion.

Serafini, D. (1997). A Model Management Framework for Nonlinear Optimization of Computationally Expensive Functions. PhD thesis, Rice University, Houston, TX.

Torczon, V. (1997). On the convergence of pattern search algorithms. SIAM Journal on Optimization, 7:1-25.

Trosset, M. W. (1996). Taguchi and robust design. Technical Report 96-31, Department of Computational & Applied Mathematics, Rice University, Houston, TX.

Watson, G. S. (1984). Smoothing and interpolation by kriging and with splines. Mathematical Geology, 16:601-615.

Welch, W. J. and Sacks, J. (1991). A system for quality improvement via computer experiments. Communications in Statistics--Theory and Methods, 20:477-495.