Bayesian Optimization Meets Bayesian Optimal Stopping
tions which can be obtained by using a subset of the training data or by training the ML model for a small number of epochs. However, in each BO iteration, since the fidelity (e.g., number of epochs) is determined before function evaluation, it is not influenced by information that is typically available during the training process (e.g., validation accuracy after each epoch). In addition to BO, attempts have also been made to improve the epoch efficiency of other hyperparameter optimization algorithms: Some heuristic methods (Baker et al., 2017; Domhan et al., 2015; Klein et al., 2017) predict the final training outcome based on partially trained learning curves in order to identify hyperparameter settings that are predicted to under-perform and early-stop their model training. Hyperband (Li et al., 2017), which dynamically allocates the computational resource (e.g., training epochs) through random sampling and eliminates under-performing hyperparameter settings by successive halving, has been proposed and shown to perform well in practice. Both the learning curve prediction methods and Hyperband can be combined with BO to further improve the epoch efficiency (Domhan et al., 2015; Falkner et al., 2018; Klein et al., 2017), but their resulting performances are not theoretically guaranteed. Despite these recent advances, we still lack an epoch-efficient algorithm that can incorporate early stopping into BO (i.e., by exploiting information available during the training process) and yet offer a theoretical performance guarantee, the design of which is likely to require a principled decision-making mechanism for determining the optimal stopping time.

Optimal stopping is a classic research topic in statistics and operations research regarding sequential decision-making problems whose objective is to make the optimal stopping decision with a small number of observations (Ferguson, 2006). In Bayesian optimal stopping (BOS) or Bayesian sequential design, the decision between stopping vs. continuing is made to maximize the expected utility or, equivalently, minimize the expected loss (Powell & Ryzhov, 2012). BOS has found success in application domains such as finance (Longstaff & Schwartz, 2001), clinical design (Brockwell & Kadane, 2003; Müller et al., 2007; Wathen & Thall, 2008), and economics (Davis & Cairns, 2012). The capability of BOS in providing a principled optimal stopping mechanism makes it a prime candidate for introducing early stopping into BO in a theoretically sound and rigorous way.

This paper proposes to unify Bayesian optimization (specifically, GP-UCB) with Bayesian optimal stopping (BO-BOS) to boost the epoch efficiency of BO (Section 3). Intuitively, GP-UCB is acclaimed for being sample-efficient in the number of function evaluations while BOS can reduce the required number of epochs for each function evaluation. BO-BOS unifies the best of both worlds to yield an epoch-efficient hyperparameter optimization algorithm. Interestingly, in spite of the seemingly disparate optimization objectives of GP-UCB vs. BOS (respectively, objective function vs. expected loss), BO-BOS can preserve the trademark (asymptotic) no-regret performance of GP-UCB with our specified choice of BOS parameters that is amenable to an elegant interpretation in terms of the exploration-exploitation trade-off (Section 4). Though the focus of our work here is on epoch-efficient BO for hyperparameter tuning, we additionally evaluate the performance of BO-BOS empirically in two other interesting applications to demonstrate its generality: policy search for reinforcement learning, and joint hyperparameter tuning and feature selection (Section 5).

2. Background and Notations

2.1. Bayesian Optimization (BO) and GP-UCB

Consider the problem of sequentially maximizing an unknown objective function f : D → R representing the validation accuracy over a compact input domain D ⊆ R^d of different hyperparameter settings for training an ML model: In each iteration t = 1, . . . , T, an input query zt ≜ [xt, nt] of a hyperparameter setting (comprising N0 < nt ≤ N training epochs and a vector xt of the other hyperparameters) is selected for evaluating the validation accuracy f of the ML model to yield a noisy observed output (validation accuracy) yt ≜ f(zt) + ε with i.i.d. Gaussian noise ε ∼ N(0, σ²) and noise variance σ². Since every evaluation of f is costly (Section 1), our goal is to strategically select input queries to approach the global maximizer z∗ ≜ arg maxz∈D f(z) as rapidly as possible. This can be achieved by minimizing a standard BO objective such as the simple regret ST ≜ f(z∗) − maxt∈{1,...,T} f(zt). A BO algorithm is said to guarantee no regret asymptotically if it satisfies limT→∞ ST = 0, thus implying that it will eventually converge to the global maximum.

To guarantee no regret, the belief of f is modeled by a Gaussian process (GP). Let {f(z)}z∈D denote a GP, that is, every finite subset of {f(z)}z∈D follows a multivariate Gaussian distribution (Rasmussen & Williams, 2006). Then, a GP is fully specified by its prior mean µ(z) and covariance k(z, z′) for all z, z′ ∈ D, which are, respectively, assumed w.l.o.g. to be µ(z) = 0 and k(z, z′) ≤ 1 for notational simplicity. Given a column vector yT ≜ [yt]⊤t=1,...,T of noisy outputs observed from evaluating f at the selected input queries z1, . . . , zT ∈ D after T iterations, the GP posterior belief of f at some input z ∈ D is a Gaussian with the following posterior mean µT(z) and variance σ²T(z):

µT(z) ≜ kT(z)⊤(KT + σ²I)⁻¹yT ,
σ²T(z) ≜ k(z, z) − kT(z)⊤(KT + σ²I)⁻¹kT(z)    (1)

where KT ≜ [k(zt, zt′)]t,t′=1,...,T and kT(z) ≜ [k(zt, z)]⊤t=1,...,T. In each iteration t of BO, an input query zt ∈ D is selected to maximize the GP-UCB acquisition function (Srinivas et al., 2010) that trades off between observing an expected maximum (i.e., with a large GP posterior mean µt−1(z)) given the current GP belief of f (i.e., exploitation) vs. that of high predictive uncertainty (i.e., with a large GP posterior variance σ²t−1(z)) to improve the GP belief of f over D (i.e., exploration). That is, zt ≜ arg maxz∈D µt−1(z) + √βt σt−1(z) where the parameter βt > 0 is set to trade off between exploitation vs. exploration for guaranteeing no regret asymptotically with high probability.

2.2. Bayesian Optimal Stopping (BOS)

BOS provides a principled mechanism for making the Bayes-optimal stopping decision with a small number of observations. As shall be seen in Algorithm 1, in each iteration t of BO-BOS, BOS is used to early-stop model training under the selected input hyperparameters xt that will end up under-performing, hence reducing the required number of training epochs. In a BOS problem, the goal is to decide whether to stop and conclude either hypothesis/event θt = θt,1 or θt = θt,2 corresponding to terminal decision d1 or d2, or to gather one more observation via the continuation decision d0. Let yt,n′ be the noisy output (validation accuracy) observed in epoch n′ and yt,n ≜ [yt,n′]⊤n′=1,...,n be a vector of noisy outputs observed up till epoch n in iteration t. Recall from Section 2.1 that in iteration t, the ML model is trained using the selected input hyperparameter setting [xt, nt] to yield the noisy observed output (validation accuracy) yt. So, yt = yt,nt for t = 1, . . . , T. After each epoch n, the posterior belief of event θt is updated to Pr(θt|yt,n) which will be used to compute the expected losses of terminal decisions d1 and d2. Such a loss function l has to encode the cost of making a wrong decision. Define ρt,n(yt,n) as the minimum expected loss among all decisions in epoch n:

ρt,n(yt,n) ≜ min{ Eθt|yt,n[l(d1, θt)], Eθt|yt,n[l(d2, θt)], cd0 + Eyt,n+1|yt,n[ρt,n+1(yt,n+1)] }    (2)

for n = N0 + 1, . . . , N − 1 where the first two terms are the expected losses of terminal decisions d1 and d2, the last term sums the immediate cost cd0 and expected future loss of making the continuation decision d0 to continue model training in the next epoch n + 1 to yield the noisy observed output (validation accuracy) yt,n+1, and ρt,N(yt,N) ≜ min{ Eθt|yt,N[l(d1, θt)], Eθt|yt,N[l(d2, θt)] }. Since ρt,n depends on ρt,n+1, it naturally prompts the use of backward induction to solve the BOS problem (2) exactly, which is unfortunately intractable due to an uncountable set of possible observed outputs yt,n+1. This computational difficulty can be overcome using approximate backward induction techniques (Brockwell & Kadane, 2003; Müller et al., 2007) whose main ideas include using summary statistics to represent the posterior beliefs, discretizing the space of summary statistics, and approximating the expectation terms via sampling. Appendix A describes a commonly-used approximate backward induction algorithm (Müller et al., 2007).

Solving the BOS problem (2) yields a Bayes-optimal decision rule in each epoch n: Take the Bayes-optimal stopping decision if the expected loss of either terminal decision d1 or d2 is at most that of the continuation decision d0, that is, min{ Eθt|yt,n[l(d1, θt)], Eθt|yt,n[l(d2, θt)] } ≤ cd0 + Eyt,n+1|yt,n[ρt,n+1(yt,n+1)]. Otherwise, continue model training to yield the noisy observed output (validation accuracy) yt,n+1 and repeat this rule in the next epoch n + 1.

3. BO-BOS Algorithm

In this section, we will describe our proposed BO-BOS algorithm (Section 3.1) and define the loss function l in BOS (2) such that it can serve as an effective early-stopping mechanism in BO (Section 3.2). We focus on problem settings where the objective function f is bounded and monotonically increasing in n:

Assumption 1. (a) f(z) ∈ [0, 1] for all z ∈ D and (b) f([x, n]) ≤ f([x, n + 1]) for all x and n = 1, . . . , N − 1.

Assumption 1a is not restrictive since it applies to any bounded f with a proper transformation. Assumption 1b holds reasonably well in a number of important ML problems: (a) f represents the validation accuracy of an ML model and n denotes the number of training epochs or the number of selected features during feature selection, and (b) f represents the (discounted) cumulative rewards (assuming non-negative rewards) in reinforcement learning (RL) and n denotes the number of steps taken by the agent in the environment. Our experiments in Section 5 will demonstrate that BO-BOS outperforms the state-of-the-art hyperparameter optimization algorithms in these ML problems.

3.1. Algorithm Description

In each iteration t of BO-BOS (Algorithm 1), the input hyperparameters xt are selected to maximize the GP-UCB acquisition function with the input dimension of training epochs fixed at N (line 2). The ML model is trained using xt for N0 initial training epochs to yield the noisy observed outputs (validation accuracies) yt,N0 (line 3). After that, the BOS problem is solved (line 5) to obtain Bayes-optimal decision rules (see Sections 2.2 and 3.2). Then, in each epoch n > N0, model training continues under xt to yield the noisy observed output (validation accuracy) yt,n (line 8). If both of the following conditions are satisfied (line 9):

C1. BOS decision rule in epoch n outputs stopping decision;
C2. σt−1([xt, N]) ≤ κ σt−1([xt, n]),

then model training is early-stopped in epoch nt = n. Otherwise, the above procedure is repeated in epoch n + 1. If none of the training epochs n = N0 + 1, . . . , N − 1 satisfy both C1 and C2, then nt = N (i.e., no early stopping). Finally, the GP posterior belief (1) is updated with the selected input hyperparameter setting zt = [xt, nt] and the corresponding
noisy observed output (validation accuracy) yt = yt,nt (line 11). BO-BOS then proceeds to the next iteration t + 1.

To understand the rationale of our choices of C1 and C2, the BOS decision rule in C1 recommends the stopping decision to early-stop model training in epoch n if it concludes that model training under xt for N epochs will produce a validation accuracy not exceeding the currently found maximum in iterations 1, . . . , t − 1; this will be formally described in Section 3.2. On the other hand, C2 prefers to evaluate the validation accuracy f of the ML model with the input query [xt, n] of fewer training epochs n < N than [xt, N] if the uncertainty of the validation accuracy f([xt, N]) achieved by model training under xt for N epochs is not more than a factor of κ ≥ 1 of that of f([xt, n]) for n epochs; the degree of preference is controlled by the parameter κ. Thus, by satisfying both C1 and C2, C2 lends confidence to the resulting performance of model training under xt for N epochs that is concluded by C1 to be underwhelming. So, model training can be early-stopped in epoch nt = n. More importantly, both C1 and C2 are necessary for theoretically guaranteeing the no-regret performance of BO-BOS (Section 4).

Algorithm 1 BO-BOS
1: for t = 1, 2, . . . , T do
2:   xt ← arg maxx µt−1([x, N]) + √βt σt−1([x, N])
3:   Train model using xt for N0 epochs to yield yt,N0
4:   n ← N0
5:   Solve BOS problem (2) to obtain decision rules
6:   repeat
7:     n ← n + 1
8:     Continue model training using xt to yield yt,n
9:   until (n = N) ∨ (C1 ∧ C2)
10:  nt = n
11:  Update GP posterior belief (1) with zt = [xt, nt] and yt = yt,nt

3.2. BOS for Early Stopping in BO

Let the currently found maximum in iterations 1, . . . , t − 1 be denoted as y∗t−1 ≜ maxt′∈{1,...,t−1} yt′. In the context of early stopping in BO, BOS has to decide in each epoch n of iteration t whether model training under xt for N epochs will produce a validation accuracy not more than the currently found maximum (offset by a noise correction term ξt), i.e., f([xt, N]) ≤ y∗t−1 − ξt where y∗0 ≜ 0 and ξt is defined later in Theorem 1. To achieve this, the terminal decisions d1 and d2 and the continuation decision d0 in BOS are defined as follows: d1 stops and concludes that f([xt, N]) ≤ y∗t−1 − ξt, d2 stops and concludes that f([xt, N]) > y∗t−1 − ξt, and d0 continues model training for one more epoch. Then, the event θt (Section 2.2) becomes

θt = θt,1 if f([xt, N]) ≤ y∗t−1 − ξt, and θt = θt,2 otherwise.

We define l(d1, θt) and l(d2, θt) of the respective terminal decisions d1 and d2 as 0-K loss functions which are commonly used in clinical designs (Jiang et al., 2013; Lewis & Berry, 1994) due to their simplicity and interpretability:

l(d1, θt) ≜ K1 1θt=θt,2 and l(d2, θt) ≜ K2 1θt=θt,1    (3)

where the parameters K1 > 0 and K2 > 0 represent the costs of making the wrong terminal decisions d1 and d2, respectively. Since f([xt, N]) is not known in epoch n < N, the expected losses of terminal decisions d1 and d2 have to be evaluated instead:

Eθt|yt,n[l(d1, θt)] = K1 Pr(θt = θt,2|yt,n) ,
Eθt|yt,n[l(d2, θt)] = K2 Pr(θt = θt,1|yt,n) .    (4)

According to (4), if K1 (K2) is set to +∞, then Eθt|yt,n[l(d1, θt)] = +∞ (Eθt|yt,n[l(d2, θt)] = +∞). Consequently, terminal decision d1 (d2) is never recommended. The above definitions are plugged into (2) to derive the minimum expected loss ρt,n(yt,n) in epoch n = N0 + 1, . . . , N.

Our formulation of the BOS problem (2) for early stopping in BO can be solved using an adapted approximate backward induction algorithm: To account for Assumption 1b, a kernel with a prior bias towards exponentially decaying learning curves (Swersky et al., 2014) is used to fit a GP model to the validation errors 1 − yt,N0 of the ML model trained for N0 initial epochs. Samples are then drawn from the resulting GP posterior belief for forward simulation of sample paths from epochs N0 + 1 to N, which are used to estimate the Pr(θt|yt,n) and Pr(yt,n+1|yt,n) terms necessary for approximate backward induction. Following some applications of BOS (Jiang et al., 2013; Müller et al., 2007), the average validation error is used as the summary statistic. Our adapted approximate backward induction algorithm is explained in detail in Appendix B. Note that the use of the kernel favoring exponentially decaying learning curves in generating the forward simulation samples is critical for incorporating our prior knowledge about the behavior of learning curves, which gives BO-BOS an advantage over multi-fidelity BO algorithms which do not exploit this prior knowledge, thus contributing to the favorable performance of BO-BOS.

After our BOS problem is solved, Bayes-optimal decision rules are obtained and used by C1 in BO-BOS (Algorithm 1): Specifically, after model training to yield validation accuracy yt,n (line 8), the summary statistic is first updated to Σn′=1,...,n yt,n′/n and the BOS decision rule in epoch n recommends a corresponding optimal decision. If the recommended decision is d1 (i.e., stopping and concluding f([xt, N]) ≤ y∗t−1 − ξt), then model training under xt is early-stopped in epoch n (assuming that C2 is satisfied). Otherwise, model training continues under xt for one more epoch and the above procedure is repeated in epoch n + 1 until the last epoch n = N is reached. Note that terminal decision d2 (i.e., stopping and concluding f([xt, N]) > y∗t−1 − ξt) does not align with the BO objective of sequentially maximizing f. So, when the recommended decision is d2, there is no early stopping and model training continues under xt for one more epoch.

Theorem 2 below states how the BOS parameters should be chosen to make BO-BOS asymptotically no-regret:

Theorem 2. In Theorem 1, if K1,t is an increasing sequence such that K1,1 ≥ K2 + cd0 and that K1,t goes to +∞ in a finite number of BO iterations, then, with probability of at least 1 − δ − δ′ − δ″, E[ST] goes to 0 asymptotically.
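As an illustration of the condition in Theorem 2, the following is a minimal sketch of one cost schedule satisfying it; the functional form and the values of K2, cd0, and the cut-off iteration t_inf are our own illustrative assumptions, not choices prescribed above:

```python
import math

def k1_schedule(t, K2=50.0, c_d0=1.0, t_inf=40):
    """Hypothetical cost sequence K_{1,t} for Theorem 2: increasing in t,
    K_{1,1} >= K2 + c_d0, and reaching +infinity after a finite number
    (t_inf) of BO iterations."""
    base = K2 + c_d0  # ensures K_{1,1} = base >= K2 + c_d0
    if t >= t_inf:
        return math.inf
    return base * t_inf / (t_inf - t + 1)  # strictly increasing for t < t_inf
```

Since d1 is the early-stopping decision and, by (4), setting K1 = +∞ means d1 is never recommended, such a schedule eventually disables early stopping altogether, after which BO-BOS behaves like plain GP-UCB.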
Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp. 382–391, 2016.

Hoang, T. N., Hoang, Q. M., and Low, K. H. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pp. 3231–3238, 2018.

Hoang, T. N., Hoang, Q. M., Low, K. H., and How, J. P. Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI, 2019.

Jiang, F., Jack Lee, J., and Müller, P. A Bayesian decision-theoretic sequential response-adaptive randomization design. Statistics in Medicine, 32(12):1975–1994, 2013.

Kandasamy, K., Dasarathy, G., Oliva, J. B., Schneider, J., and Póczos, B. Gaussian process bandit optimisation with multi-fidelity evaluations. In Proc. NIPS, pp. 992–1000, 2016.

Kandasamy, K., Dasarathy, G., Schneider, J., and Póczos, B. Multi-fidelity Bayesian optimisation with continuous approximations. In Proc. ICML, pp. 1799–1808, 2017.

Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In Proc. ICLR, 2017.

Krizhevsky, A. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.

Lewis, R. J. and Berry, D. A. Group sequential clinical trials: A classical evaluation of Bayesian decision-theoretic designs. J. American Statistical Association, 89(428):1528–1534, 1994.

Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. JMLR, 18(1):6765–6816, 2017.

Ling, C. K., Low, K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pp. 1860–1866, 2016.

Lizotte, D. J., Wang, T., Bowling, M. H., and Schuurmans, D. Automatic gait optimization with Gaussian process regression. In Proc. IJCAI, pp. 944–949, 2007.

Longstaff, F. A. and Schwartz, E. S. Valuing American options by simulation: A simple least-squares approach. The Review of Financial Studies, 14(1):113–147, 2001.

Low, K. H., Dolan, J. M., and Khosla, P. Adaptive multi-robot wide-area exploration and mapping. In Proc. AAMAS, pp. 23–30, 2008.

Low, K. H., Dolan, J. M., and Khosla, P. Information-theoretic approach to efficient adaptive path planning for mobile robotic environmental sensing. In Proc. ICAPS, pp. 233–240, 2009.

Low, K. H., Dolan, J. M., and Khosla, P. Active Markov information-theoretic path planning for robotic environmental sensing. In Proc. AAMAS, pp. 753–760, 2011.

Low, K. H., Chen, J., Dolan, J. M., Chien, S., and Thompson, D. R. Decentralized active robotic exploration and mapping for probabilistic field classification in environmental sensing. In Proc. AAMAS, pp. 105–112, 2012.

Low, K. H., Chen, J., Hoang, T. N., Xu, N., and Jaillet, P. Recent advances in scaling up Gaussian process predictive models for large spatiotemporal data. In Ravela, S. and Sandu, A. (eds.), Dynamic Data-Driven Environmental Systems Science: First International Conference, DyDESS 2014, pp. 167–181. LNCS 8964, Springer International Publishing, 2014a.

Low, K. H., Xu, N., Chen, J., Lim, K. K., and Özgül, E. B. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pp. 499–503, 2014b.

Low, K. H., Yu, J., Chen, J., and Jaillet, P. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pp. 2821–2827, 2015.

Müller, P., Berry, D. A., Grieve, A. P., Smith, M., and Krams, M. Simulation-based sequential Bayesian design. J. Statistical Planning and Inference, 137(10):3140–3150, 2007.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In Proc. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

Ouyang, R. and Low, K. H. Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception. In Proc. AAAI, pp. 3876–3883, 2018.

Ouyang, R., Low, K. H., Chen, J., and Jaillet, P. Multi-robot active sensing of non-stationary Gaussian process-based environmental phenomena. In Proc. AAMAS, pp. 573–580, 2014.
for all epochs n, n′ = 1, . . . , N where φ is a probability measure over λ that is chosen to be a Gamma prior with parameters α and β. The above kernel (5) is used to fit a GP model to the validation errors 1 − yt,N0 of the ML model trained using xt for a fixed number N0 of initial epochs (e.g., N0 = 8 in all our experiments when N = 50), specifically, by computing the values of the parameters α and β in (5) via a Bayesian update (i.e., assuming that the validation errors follow the Gamma conjugate prior with respect to an exponential likelihood). Samples are then drawn from the resulting GP posterior belief for forward simulation of sample paths from epochs N0 + 1 to N, which are used to estimate the Pr(θt|yt,n) and Pr(yt,n+1|yt,n) terms necessary for approximate backward induction. Fig. 5 plots some of such sample paths and demonstrates that the GP kernel in (5) can characterize a monotonic learning curve (Assumption 1b) well.
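The forward-simulation step can be sketched as follows. Since the formula of kernel (5) appears earlier in the paper, we assume here the exponentially decaying learning-curve kernel of Swersky et al. (2014), k(n, n′) = β^α/(n + n′ + β)^α, and all hyperparameter values below are illustrative:

```python
import numpy as np

def decay_kernel(n1, n2, alpha=1.0, beta=1.0):
    """Exponentially decaying learning-curve kernel (Swersky et al., 2014):
    k(n, n') = beta^alpha / (n + n' + beta)^alpha for epoch indices n, n' >= 1."""
    return beta ** alpha / (n1 + n2 + beta) ** alpha

def simulate_paths(observed_errors, n_total, n_paths, alpha=1.0, beta=1.0,
                   noise_var=1e-4, seed=0):
    """Draw validation-error sample paths for epochs N0+1..N from the GP
    posterior conditioned on the N0 observed errors (a sketch of the
    forward-simulation step; hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    n0 = len(observed_errors)
    obs = np.arange(1, n0 + 1)
    fut = np.arange(n0 + 1, n_total + 1)
    K_oo = np.array([[decay_kernel(i, j, alpha, beta) for j in obs] for i in obs])
    K_fo = np.array([[decay_kernel(i, j, alpha, beta) for j in obs] for i in fut])
    K_ff = np.array([[decay_kernel(i, j, alpha, beta) for j in fut] for i in fut])
    A = K_oo + noise_var * np.eye(n0)
    mean = K_fo @ np.linalg.solve(A, np.asarray(observed_errors))
    cov = K_ff - K_fo @ np.linalg.solve(A, K_fo.T)
    # Jitter keeps the covariance numerically positive semi-definite.
    return rng.multivariate_normal(mean, cov + 1e-10 * np.eye(len(fut)), size=n_paths)
```

Each returned row is one simulated continuation of the learning curve, from which the Pr(θt|yt,n) and Pr(yt,n+1|yt,n) terms can be estimated by counting.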
Following the practices in related applications of BOS (Brockwell & Kadane, 2003; Jiang et al., 2013; Müller et al., 2007), the average validation error (or, equivalently, average validation accuracy) over epochs 1 to n is used as the summary statistic. Firstly, the entire space of summary statistics is partitioned into a number of discrete intervals in each epoch, which results in a two-dimensional domain with one axis being the number of epochs and the other axis representing the
Figure 5. Forward simulation of some sample paths drawn from a GP posterior belief based on the kernel in (5).
discretized intervals of the summary statistic (i.e., average validation error). Next, a forward simulation of a large number (i.e., 100,000 in all our experiments) of sample paths is performed using the GP kernel in (5), as described above. Each sample path corresponds to a curve in the 2-D domain. Starting from the last epoch N, for each interval, we consider all sample paths ending in this interval and use the proportion of such sample paths with a validation accuracy (from model training for N epochs) larger than the currently found maximum (offset by a noise correction term) to estimate the posterior probability Pr(θt = θt,2|yt,N) = Pr(f([xt, N]) > y∗t−1 − ξt|yt,N), which is in turn used to evaluate the expected losses of the terminal decisions d1 and d2 for this interval.⁴ The minimum of the expected losses between the two terminal decisions is the expected loss for this particular interval.
Next, the algorithm proceeds backwards from epoch n = N − 1 all the way to epoch n = N0 + 1. In each epoch n, the
expected loss of each terminal decision is evaluated in the same way as that in the last epoch N , as described above. The
expected loss of the continuation decision d0 is evaluated in the same way as that in Appendix A: For each sample path
passing through an interval in epoch n, the expected loss for the interval that it passes through in the next epoch n + 1
is recorded and an average of all the recorded expected losses in the next epoch n + 1 is summed with the cost cd0 of
observing the validation accuracy yt,n+1 to yield the expected loss of the continuation decision d0 for this particular interval.
Note that this step is equivalent to approximating the Eyt,n+1|yt,n[ρt,n+1(yt,n+1)] term in (2) via Monte Carlo sampling of
the posterior belief Pr(yt,n+1 |yt,n ). Following (2), the minimum of expected losses among all terminal and continuation
decisions is the expected loss for this particular interval and the corresponding decision is recorded as the optimal decision
to be recommended when the summary statistic falls into this particular interval. Then, the algorithm continues backwards
until epoch n = N0 + 1 is reached. We present in Algorithm 2 the pseudocode for the above-mentioned approximate
backward induction algorithm for ease of understanding.
After solving our BOS problem for early stopping in BO using the approximate backward induction algorithm described
above, Bayes-optimal decision rules are obtained for every pair of epoch and interval. Fig. 6 shows an example of optimal
decision rules obtained from solving an instance of our BOS problem where the white, yellow, and red regions correspond
to recommending optimal continuation decision d0 and terminal decisions d1 and d2 , respectively. In particular, after model
training under xt to yield the validation error 1 − yt,n in epoch n, the summary statistic is updated to the average validation
error over epochs 1 to n. The updated summary statistic falls into an interval with a corresponding optimal decision to be
recommended. For example, Fig. 6 shows that if the summary statistic falls into the yellow region in any epoch n, then the
optimal terminal decision d1 is recommended to early-stop model training under xt (assuming that C2 is satisfied). If the
summary statistic falls into any other region, then model training continues under xt for one more epoch and the above
procedure is repeated in epoch n + 1 until the last epoch n = N is reached. This procedure, together with C2, constitutes
lines 6 to 9 in Algorithm 1.
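The decision-rule lookup just described (map the running average validation error to its interval and read off the recommended decision) can be sketched as follows; the table layout and names are our own illustrative assumptions:

```python
def bos_recommend(decision_table, epoch, avg_val_error, m_intervals):
    """Return the Bayes-optimal decision ('d0' = continue, 'd1'/'d2' = stop)
    recommended for the interval of [0, 1] that the summary statistic
    (average validation error over epochs 1..n) falls into at this epoch.
    decision_table[epoch][m] is produced by approximate backward induction."""
    m = min(int(avg_val_error * m_intervals), m_intervals - 1)  # clamp 1.0 into last interval
    return decision_table[epoch][m]
```

For instance, with M = 10 intervals and a hypothetical table recommending d1 whenever the average error at epoch 5 is at least 0.8, an average error of 0.82 falls into interval 8 and triggers early stopping (subject to C2).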
⁴ In contrast to the approximate backward induction algorithm of Müller et al. (2007) (Appendix A), we employ a computationally cheaper way to approximate the expected losses of the terminal decisions for an interval.
Algorithm 2 Approximate Backward Induction Algorithm for Solving BOS Problem in BO-BOS
1: Partition the domain of summary statistics into M discrete intervals
2: Train the ML model using xt for N0 epochs
3: Generate a large number of forward simulation samples using kernel (5)
4: Let n = N
5: for m = 1, 2, . . . , M do
6:   Find all sample paths ending in interval m at epoch n, denoted as S
7:   Estimate Pr(θt = θt,2|yt,n) = Pr(f([xt, N]) > y∗t−1 − ξt|yt,n) by the proportion of S that end up (after N epochs) having larger validation accuracy than y∗t−1 − ξt
8:   Calculate the expected losses of the terminal decisions d1 and d2 using (4)
9:   Use the minimum of these two expected losses as the expected loss of epoch n and interval m
10: for n = N − 1, N − 2, . . . , N0 + 1 do
11:   for m = 1, 2, . . . , M do
12:     Find all sample paths passing through interval m at epoch n, denoted as S
13:     Estimate Pr(θt = θt,2|yt,n) = Pr(f([xt, N]) > y∗t−1 − ξt|yt,n) by the proportion of S that end up (after N epochs) having larger validation accuracy than y∗t−1 − ξt
14:     Calculate the expected losses of the terminal decisions d1 and d2 using (4)
15:     ld0,n+1,m = 0
16:     for each sample path s in S do
17:       ld0,n+1,m = ld0,n+1,m + the expected loss of the interval reached by s at epoch n + 1
18:     ld0,n+1,m = ld0,n+1,m / |S|
19:     Calculate the expected loss of the continuation decision d0 as: Eyt,n+1|yt,n[ρt,n+1(yt,n+1)] + cd0 = ld0,n+1,m + cd0
20:     Use the minimum of the expected losses among d1, d2, and d0 as the expected loss of epoch n and interval m (following (2)), and record the corresponding decision as the optimal decision
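For concreteness, Algorithm 2 can be sketched in Python under simplifying assumptions of our own: the summary statistic is binned into M equal-width intervals of [0, 1], `paths` holds S forward-simulated validation-error sample paths over epochs 1..N, and the losses are the 0-K losses of (3)-(4) with scalar K1, K2:

```python
import numpy as np

def backward_induction(paths, y_star_minus_xi, K1, K2, c_d0, n0, M):
    """Approximate backward induction (a sketch of Algorithm 2).
    paths: (S, N) array; column n-1 holds the validation error in epoch n.
    Returns expected losses and recommended decisions keyed by
    (0-based epoch index, interval); 'd0' = continue, 'd1'/'d2' = stop."""
    S, N = paths.shape
    stats = np.cumsum(paths, axis=1) / np.arange(1, N + 1)  # average validation error
    idx = np.minimum((stats * M).astype(int), M - 1)        # interval of each path per epoch
    exceeds = (1.0 - paths[:, -1]) > y_star_minus_xi        # final accuracy beats incumbent?

    loss, decision = {}, {}
    for n in range(N - 1, n0 - 1, -1):                      # epochs N down to N0 + 1
        for m in range(M):
            here = idx[:, n] == m
            if not here.any():
                continue                                    # unreachable interval
            p2 = exceeds[here].mean()                       # estimate of Pr(theta_t = theta_{t,2})
            cands = {'d1': K1 * p2, 'd2': K2 * (1.0 - p2)}  # expected 0-K losses (4)
            if n < N - 1:                                   # continuation loss, lines 15-19
                nxt = [loss[(n + 1, j)] for j in idx[here, n + 1]]
                cands['d0'] = c_d0 + float(np.mean(nxt))
            best = min(cands, key=cands.get)                # line 20: minimum expected loss
            loss[(n, m)], decision[(n, m)] = cands[best], best
    return loss, decision
```

Solving this once per BO iteration (line 5 of Algorithm 1) yields the table of recommended decisions that C1 consults as training proceeds.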
Figure 6. An example of optimal decision rules obtained from solving an instance of our BOS problem: White, yellow, and red regions correspond to recommending optimal continuation decision d0 and terminal decisions d1 and d2, respectively. The sample paths cannot reach the black regions due to the use of the GP kernel in (5) for characterizing a monotonic learning curve (Assumption 1b).
Note that in our algorithm, the BO iterations can be divided into two types: 1) t+ such that nt+ = N: those iterations that are not early-stopped; and 2) t− such that nt− < N: those that are early-stopped. For all t+, it follows from Equation 6 that rt = f(z∗) − max{f∗t−1, f(zt)} ≤ f(z∗) − f(zt) = f(z∗) − f([xt, nt]) = f(z∗) − f([xt, N]) ≜ rt+; for all t−, from Equation 6, we have that rt = f(z∗) − max{f∗t−1, f(zt)} ≤ f(z∗) − f∗t−1 ≜ rt−. In the following, we will focus on the analysis of the sum of all rt+ and all rt−: R′T ≜ Σt+ rt+ + Σt− rt−. As a result of the definition, R′T is an upper bound on the cumulative regret RT.

Next, note that for all t− such that nt− < N (i.e., when xt is early-stopped), rt− can be further bounded using Equation 7, which gives

R′T ≤ Σt=1,...,T (f(z∗) − f([xt, nt])) + Σt− (f([xt, N]) − f∗t−1) ≜ RT,1 + RT,2    (8)

in which the steps of Equation 8 make use of the definition of R′T and of Equation 7, followed by combining terms. The first term of Equation 8 is summed over all time steps, whereas the second term is only summed over those time steps that are early-stopped (nt < N). As mentioned earlier, in the sequel, we will attempt to
Here the expectation is taken with respect to the posterior probabilities used in the BOS problems, corresponding to those iterations that are early-stopped: $\prod_{t \in \{t' \mid t'=1,\ldots,T,\, n_{t'} < N\}} \Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t})$. Note that these probability distributions are independent across the early-stopped iterations; therefore, for each early-stopped iteration $t$, the expectations of both $r_{t,1}$ and $r_{t,2}$ are taken only over the corresponding distribution $\Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t})$, whereas for each iteration $t$ that is not early-stopped, $\mathbb{E}[r_{t,1}] = r_{t,1}$ (and $r_{t,2}$ is absent). In the next two sections, we will prove upper bounds on $\mathbb{E}[R_{T,1}]$ and $\mathbb{E}[R_{T,2}]$, respectively.
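The decomposition above can be sanity-checked numerically. The sketch below (a hypothetical toy objective, not from the paper's experiments) builds a monotone learning-curve objective and verifies that each instantaneous regret is bounded by $r_{t_+}$ for completed iterations and by $r_{t,1} + r_{t,2}$ for early-stopped ones:

```python
import random

random.seed(0)

# Toy objective f([x, n]) over 5 hyperparameter settings and N = 10 epochs,
# monotonically non-decreasing in n (Assumption 1) and bounded in [0, 1].
N = 10
xs = list(range(5))
f = {}
for x in xs:
    v = 0.0
    for n in range(1, N + 1):
        v = min(1.0, v + 0.1 * random.random())
        f[(x, n)] = v

f_star = max(f[(x, N)] for x in xs)  # f(z*): best final-epoch value

ok = []
best = 0.0  # incumbent f*_{t-1}
for t in range(20):
    x_t = random.choice(xs)
    n_t = N if random.random() < 0.5 else random.randint(1, N - 1)
    r_t = f_star - max(best, f[(x_t, n_t)])  # instantaneous regret (Eq. 6)
    if n_t == N:
        # not early-stopped: r_t <= r_{t+}
        ok.append(r_t <= f_star - f[(x_t, N)] + 1e-12)
    else:
        # early-stopped: r_t <= r_{t,1} + r_{t,2}
        r_t1 = f_star - f[(x_t, n_t)]
        r_t2 = f[(x_t, N)] - best
        ok.append(r_t <= r_t1 + r_t2 + 1e-12)
    best = max(best, f[(x_t, n_t)])
```

Both bounds rely only on the monotonicity of the learning curve in the epoch dimension, so every check passes regardless of the random draws.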
The proof of Lemma 1 makes use of standard Gaussian tail bounds and a number of union bounds, and is identical to that of Lemma 5.1 in (Srinivas et al., 2010). The next supporting lemma makes use of the Lipschitz continuity of $f$ to bound the difference between function values whose inputs differ only in the dimension corresponding to the number of training epochs.
Lemma 2. Suppose that Assumption 2 holds and let $\delta' \in (0, 1)$. Then, with probability $\ge 1 - \delta'$,
$$|f([x, N]) - f([x, n])| \le N b \sqrt{\log \frac{da}{\delta'}} \quad \forall x,\; n = 1, \ldots, N .$$
Proof. Let $z = [x, n]$ denote the input to the objective function $f$. Assumption 2, together with a union bound over $j = 1, \ldots, d$, implies that with probability $\ge 1 - da\, e^{-(L/b)^2}$,
$$|f(z) - f(z')| \le L \|z - z'\|_1 \quad \forall z, z' \in D .$$
Since $[x, N]$ and $[x, n]$ differ only in the dimension corresponding to the number of training epochs, we have that
$$|f([x, N]) - f([x, n])| \le L\, |N - n| \le L N .$$
Setting $L = b \sqrt{\log(da/\delta')}$, so that $da\, e^{-(L/b)^2} = \delta'$, completes the proof.
The next lemma bounds $\mathbb{E}[r_{t,1}]$ by the Gaussian process posterior standard deviation with some scaling constants.
Lemma 3. Let $\delta, \delta' \in (0, 1)$ and $\kappa \ge 1$ be the constant used in C2 in the BO-BOS algorithm. Then, at iteration $t$ of the BO-BOS algorithm, we have that, with probability $\ge 1 - \delta - \delta'$,
$$\mathbb{E}[r_{t,1}] \le 2\kappa \beta_t^{1/2} \sigma_{t-1}([x_t, n_t]) + N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N} .$$
in which (1) follows from Assumption 1, which states that, for each $x$, the function value is monotonically non-decreasing in the number of training epochs; this implies that at the (unknown) global maximum $z^*$, the dimension corresponding to the number of training epochs equals $N$, i.e., $z^* = [x^*, N]$.
$$\begin{aligned}
\mathbb{E}[r_{t,1}] &= \mathbb{E}[f(z^*) - f([x_t, n_t])] \stackrel{(1)}{\le} \mathbb{E}[\beta_t^{1/2} \sigma_{t-1}([x_t, N]) + \mu_{t-1}([x_t, N]) - f([x_t, n_t])] \\
&= \mathbb{E}[\beta_t^{1/2} \sigma_{t-1}([x_t, N]) + \mu_{t-1}([x_t, N]) - f([x_t, N]) + f([x_t, N]) - f([x_t, n_t])] \\
&\stackrel{(2)}{\le} \mathbb{E}[2\beta_t^{1/2} \sigma_{t-1}([x_t, N])] + \mathbb{E}[f([x_t, N]) - f([x_t, n_t])] \\
&\stackrel{(3)}{\le} \mathbb{E}[2\beta_t^{1/2} \sigma_{t-1}([x_t, N])] + N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N} \\
&\stackrel{(4)}{\le} 2\beta_t^{1/2} \sigma_{t-1}([x_t, N]) + N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N} \\
&\stackrel{(5)}{\le} 2\kappa \beta_t^{1/2} \sigma_{t-1}([x_t, n_t]) + N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N}
\end{aligned} \quad (11)$$
in which (1) follows from Equation 10, and (2) results from Lemma 1 and the linearity of the expectation operator. $\mathbb{1}_{n_t < N}$ in (3) is the indicator function, which takes the value 1 if the event $n_t < N$ is true and 0 otherwise. (3) is obtained by analyzing two cases separately: if $n_t = N$ ($x_t$ is not early-stopped), then $\mathbb{E}[f([x_t, N]) - f([x_t, n_t])] = 0$; if $n_t < N$ ($x_t$ is early-stopped), then $\mathbb{E}[f([x_t, N]) - f([x_t, n_t])] \le \mathbb{E}\big[N b \sqrt{\log(da/\delta')}\big] = N b \sqrt{\log(da/\delta')}$ with probability $\ge 1 - \delta'$ following Lemma 2. (4) is due to the fact that $\sigma_{t-1}([x_t, N])$ depends only on the observations up to step $t-1$ and does not depend on the probability $\Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t})$. (5) follows from the design of the algorithm; in particular, if $n_t < N$, then $\kappa \sigma_{t-1}([x_t, n_t]) \ge \sigma_{t-1}([x_t, N])$ is guaranteed by C2; otherwise, if $n_t = N$, then $\kappa \sigma_{t-1}([x_t, n_t]) \ge \sigma_{t-1}([x_t, n_t]) = \sigma_{t-1}([x_t, N])$ since $\kappa \ge 1$.
Subsequently, we can upper-bound $\mathbb{E}[R_{T,1}] = \sum_{t=1}^{T} \mathbb{E}[r_{t,1}]$ by extensions of Lemmas 5.3 and 5.4 from (Srinivas et al., 2010), which are presented here for completeness. The following lemma connects the information gain about the objective function with the posterior predictive variance; its proof is a straightforward extension of Lemma 5.3 of (Srinivas et al., 2010).
Lemma 4. Let $y_T$ be a set of observations of size $T$, and let $f_T$ be the corresponding function values. The information gain about $f_T$ from observing $y_T$ is
$$I(y_T; f_T) = \frac{1}{2} \sum_{t=1}^{T} \log\big[1 + \sigma^{-2} \sigma_{t-1}^2([x_t, n_t])\big] .$$
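Lemma 4 can be checked numerically: the sequential sum must agree with the closed form $I(y_T; f_T) = \frac{1}{2} \log\det(I + \sigma^{-2} K_T)$. The sketch below (a toy 1-D GP with a hypothetical Squared Exponential kernel and pure-Python linear algebra) verifies this identity:

```python
import math
import random

def rbf(a, b, ls=0.5):
    # Squared Exponential kernel on scalar inputs
    return math.exp(-((a - b) ** 2) / (2.0 * ls * ls))

def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def det(A):
    # determinant via Gaussian elimination, tracking row-swap sign
    n = len(A)
    M = [row[:] for row in A]
    d = 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if p != c:
            M[c], M[p] = M[p], M[c]
            d = -d
        d *= M[c][c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n):
                M[r][k] -= fac * M[c][k]
    return d

def post_var(z, Z, noise):
    # GP posterior variance at z given noisy observations at points Z
    if not Z:
        return rbf(z, z)
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(Z)]
         for i, a in enumerate(Z)]
    kz = [rbf(a, z) for a in Z]
    w = solve(K, kz)
    return rbf(z, z) - sum(wi * ki for wi, ki in zip(w, kz))

random.seed(1)
noise = 0.25  # sigma^2
Z = [random.uniform(0.0, 1.0) for _ in range(5)]

# left-hand side: sequential sum from Lemma 4
I_seq = sum(0.5 * math.log(1.0 + post_var(z, Z[:t], noise) / noise)
            for t, z in enumerate(Z))
# right-hand side: (1/2) log det(I + sigma^{-2} K_T)
B = [[(1.0 if i == j else 0.0) + rbf(Z[i], Z[j]) / noise for j in range(5)]
     for i in range(5)]
I_det = 0.5 * math.log(det(B))
```

The two quantities agree to numerical precision because the chain rule of entropy turns the log-determinant into the sum of per-step conditional variances.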
Next, we use the following lemma to bound the sum of the first term of the expected instantaneous regret given in Lemma 3.
Lemma 5. Let $\delta \in (0, 1)$, $C_1 \triangleq 8 / \log(1 + \sigma^{-2})$, $\beta_t \triangleq 2 \log(|D| t^2 \pi^2 / 6\delta)$, and $\gamma_T \triangleq \max_{A \subseteq D, |A| = T} I(y_A; f_A)$ be the maximum information gain after $T$ rounds. Then
$$\sum_{t=1}^{T} 2 \beta_t^{1/2} \sigma_{t-1}([x_t, n_t]) \le \sqrt{T C_1 \beta_T I(y_T; f_T)} \le \sqrt{T C_1 \beta_T \gamma_T} .$$
Proof.
$$\begin{aligned}
\big(2\beta_t^{1/2} \sigma_{t-1}([x_t, n_t])\big)^2 &= 4\beta_t \sigma_{t-1}^2([x_t, n_t]) \stackrel{(1)}{\le} 4\beta_T \sigma^2 \big[\sigma^{-2} \sigma_{t-1}^2([x_t, n_t])\big] \\
&\stackrel{(2)}{\le} 4\beta_T \sigma^2 \frac{\sigma^{-2}}{\log(1 + \sigma^{-2})} \log\big[1 + \sigma^{-2} \sigma_{t-1}^2([x_t, n_t])\big] \\
&= \beta_T \frac{8}{\log(1 + \sigma^{-2})} \frac{1}{2} \log\big[1 + \sigma^{-2} \sigma_{t-1}^2([x_t, n_t])\big]
\end{aligned} \quad (12)$$
Bayesian Optimization Meets Bayesian Optimal Stopping
in which (1) holds since $\beta_t$ is monotonically increasing in $t$; (2) results from the fact that $\sigma^{-2} x \le \frac{\sigma^{-2}}{\log(1 + \sigma^{-2})} \log[1 + \sigma^{-2} x]$ for $x \in (0, 1]$, whereas $0 < \sigma_{t-1}^2([x_t, n_t]) \le 1$. Next, summing over $t = 1, \ldots, T$, we get
$$\sum_{t=1}^{T} \big(2\beta_t^{1/2} \sigma_{t-1}([x_t, n_t])\big)^2 \le \beta_T \frac{8}{\log(1 + \sigma^{-2})} \frac{1}{2} \sum_{t=1}^{T} \log\big[1 + \sigma^{-2} \sigma_{t-1}^2([x_t, n_t])\big] \stackrel{(1)}{=} \beta_T \frac{8}{\log(1 + \sigma^{-2})} I(y_T; f_T) \stackrel{(2)}{\le} C_1 \beta_T \gamma_T \quad (13)$$
in which (1) results from Lemma 4, and (2) follows from the definitions of $C_1$ and $\gamma_T$. Next, making use of the Cauchy-Schwarz inequality, we get
$$\sum_{t=1}^{T} 2\beta_t^{1/2} \sigma_{t-1}([x_t, n_t]) \le \sqrt{T} \sqrt{\sum_{t=1}^{T} \big(2\beta_t^{1/2} \sigma_{t-1}([x_t, n_t])\big)^2} \le \sqrt{C_1 T \beta_T \gamma_T} \quad (14)$$
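The chain (12)-(14) holds for any posterior standard deviations in $(0, 1]$ and any non-decreasing $\beta_t$; the sketch below (hypothetical constants) exercises the final bound of Lemma 5, with $I(y_T; f_T)$ computed via Lemma 4:

```python
import math
import random

random.seed(2)
noise_inv = 10.0                       # sigma^{-2}
C1 = 8.0 / math.log(1.0 + noise_inv)   # as defined in Lemma 5

T = 50
# beta_t = 2 log(|D| t^2 pi^2 / (6 delta)): a non-decreasing sequence
betas = [2.0 * math.log(100 * (t + 1) ** 2 * math.pi ** 2 / (6 * 0.1))
         for t in range(T)]
# arbitrary posterior standard deviations in (0, 1]
sigmas = [random.uniform(0.01, 1.0) for _ in range(T)]

lhs = sum(2.0 * math.sqrt(b) * s for b, s in zip(betas, sigmas))
# information gain via Lemma 4 (a lower bound on gamma_T by definition)
info = 0.5 * sum(math.log(1.0 + noise_inv * s * s) for s in sigmas)
rhs = math.sqrt(T * C1 * betas[-1] * info)
```

Here `lhs <= rhs` holds for every random draw, since step (2) of Equation 12 only needs $\sigma_{t-1}^2 \le 1$ and step (1) only needs $\beta_t \le \beta_T$.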
Next, putting everything together, we get the following lemma on the upper bound on $\mathbb{E}[R_{T,1}]$.
Lemma 6. Suppose that Assumptions 1 and 2 hold. Let $\delta, \delta' \in (0, 1)$, $C_1 = 8/\log(1 + \sigma^{-2})$, $\beta_t = 2\log(|D| t^2 \pi^2 / 6\delta)$, $\kappa \ge 1$ be the constant used in C2 in the BO-BOS algorithm, and $\gamma_T = \max_{A \subseteq D, |A| = T} I(y_A; f_A)$. Let $\tau_T = \sum_{t=1}^{T} \mathbb{1}_{n_t < N}$ be the number of BO iterations in which early stopping happens from iterations 1 to $T$. Assume that $f$ is a sample from a GP, and $y(z) = f(z) + \epsilon$ $\forall z \in D$, in which $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Then, with probability $\ge 1 - \delta - \delta'$,
$$\mathbb{E}[R_{T,1}] = \sum_{t=1}^{T} \mathbb{E}[r_{t,1}] \le \kappa \sqrt{T C_1 \beta_T \gamma_T} + N b \sqrt{\log \frac{da}{\delta'}}\, \tau_T \quad \forall T \ge 1 .$$
Proof.
$$\begin{aligned}
\mathbb{E}[R_{T,1}] &\stackrel{(1)}{=} \sum_{t=1}^{T} \mathbb{E}[r_{t,1}] \stackrel{(2)}{\le} \sum_{t=1}^{T} \Big[ 2\kappa \beta_t^{1/2} \sigma_{t-1}([x_t, n_t]) + N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N} \Big] \\
&\stackrel{(3)}{\le} \kappa \sqrt{T C_1 \beta_T \gamma_T} + \sum_{t=1}^{T} N b \sqrt{\log \frac{da}{\delta'}}\, \mathbb{1}_{n_t < N} \\
&= \kappa \sqrt{T C_1 \beta_T \gamma_T} + N b \sqrt{\log \frac{da}{\delta'}} \sum_{t=1}^{T} \mathbb{1}_{n_t < N} \\
&= \kappa \sqrt{T C_1 \beta_T \gamma_T} + N b \sqrt{\log \frac{da}{\delta'}}\, \tau_T
\end{aligned} \quad (15)$$
in which (1) follows from the linearity of the expectation operator, (2) results from Lemma 3, and (3) follows from Lemma 5.
Lemma 7. With probability $\ge 1 - \delta''$, $y^*_{t-1} - f^*_{t-1} \le \xi_t$ $\forall t \ge 1$, in which
$$\xi_t = \sqrt{2\sigma^2 \log \frac{\pi^2 t^2 (t-1)}{6\delta''}} \quad \forall t > 1$$
and $\xi_1 = 0$.
Proof. The lemma trivially holds for $t = 1$. Assume we are at iteration $t > 1$ of the BO-BOS algorithm, and let $t' \in \{1, 2, \ldots, t-1\}$. Since $y_{t'} = f(z_{t'}) + \epsilon$, in which $\epsilon \sim \mathcal{N}(0, \sigma^2)$, we have that $y_{t'} \sim \mathcal{N}(f(z_{t'}), \sigma^2)$. Making use of the upper deviation inequality for the Gaussian distribution and the definition of $\xi_t$, we get
$$\Pr[y_{t'} \ge f(z_{t'}) + \xi_t] \le e^{-\xi_t^2 / (2\sigma^2)} = \frac{6\delta''}{\pi^2 t^2 (t-1)} \quad (16)$$
Denote by $A_t$ the event that $\{\exists\, t' \in \{1, 2, \ldots, t-1\} \text{ s.t. } y_{t'} \ge f(z_{t'}) + \xi_t\}$. Next, taking a union bound over the entire observation history $t' \in \{1, 2, \ldots, t-1\}$, we get
$$\Pr[A_t] \le \sum_{t'=1}^{t-1} \Pr[y_{t'} \ge f(z_{t'}) + \xi_t] \le (t-1) \frac{6\delta''}{\pi^2 t^2 (t-1)} = \frac{6\delta''}{\pi^2 t^2} \quad (17)$$
which implies that at iteration $t$, with probability $\ge 1 - \frac{6\delta''}{\pi^2 t^2}$, $y_{t'} - f(z_{t'}) < \xi_t$ $\forall t' \in \{1, 2, \ldots, t-1\}$, which further suggests that $y^*_{t-1} - f^*_{t-1} \le \xi_t$ at iteration $t$. Next, taking a union bound over $t \ge 1$, we get
$$\Pr[\exists\, t \ge 1 \text{ s.t. } A_t \text{ holds}] \le \sum_{t \ge 1} \Pr[A_t] \le \sum_{t \ge 1} \frac{6\delta''}{\pi^2 t^2} = \delta'' \quad (18)$$
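The Gaussian tail bound in Equation 16 can be checked by simulation; the following sketch (hypothetical values for $\sigma$, $\delta''$, and $t$) estimates $\Pr[y_{t'} \ge f(z_{t'}) + \xi_t]$ by Monte Carlo and compares it against $e^{-\xi_t^2/(2\sigma^2)} = \frac{6\delta''}{\pi^2 t^2 (t-1)}$:

```python
import math
import random

random.seed(3)
sigma = 0.5
delta2 = 0.1  # delta''
t = 5
xi = math.sqrt(2.0 * sigma ** 2
               * math.log(math.pi ** 2 * t ** 2 * (t - 1) / (6.0 * delta2)))

# the Chernoff bound equals the target failure probability by construction
bound = math.exp(-xi ** 2 / (2.0 * sigma ** 2))
assert abs(bound - 6.0 * delta2 / (math.pi ** 2 * t ** 2 * (t - 1))) < 1e-12

# Monte Carlo: y - f(z) ~ N(0, sigma^2), so count exceedances of xi
trials = 200_000
hits = sum(1 for _ in range(trials) if random.gauss(0.0, sigma) >= xi)
```

The empirical frequency `hits / trials` stays well below `bound`, since the Chernoff bound $\Pr[Z \ge u] \le e^{-u^2/2}$ is loose for the standard Gaussian tail.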
The next lemma shows that, with appropriate choices of the incumbent value, the posterior probability used in Bayesian optimal stopping is upper-bounded.
Lemma 8. If in iteration $t$ of the BO-BOS algorithm, the BOS algorithm is run with the incumbent value $y^*_{t-1} - \xi_t$ and the corresponding cost parameters $K_1$, $K_2$ and $c_{d_0}$, and the algorithm early-stops after $n_t < N$ epochs, then with probability $\ge 1 - \delta''$,
$$\Pr(f([x_t, N]) > f^*_{t-1} \mid y_{t,n_t}) \le \frac{K_2 + c_{d_0}}{K_1} \quad \forall t \ge 1 . \quad (19)$$
Proof. Recall that when running the Bayesian optimal stopping algorithm in iteration $t$ of BO-BOS, we only early-stop the experiment ($n_t < N$) when we can safely conclude that the currently evaluated hyperparameter setting $x_t$ will end up with a final validation accuracy no greater than the currently observed optimum offset by a noise correction term, $y^*_{t-1} - \xi_t$; i.e., when the expected loss of decision $d_1$ is the smallest among all decisions. Therefore, when the evaluation of $x_t$ is early-stopped after $n_t < N$ epochs, we can conclude that
$$\begin{aligned}
K_1 &\Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t}) \\
&\le \mathbb{E}_{y_{t,n_t+1} \mid y_{t,n_t}} \Big[ \min\big\{ K_1 \Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t+1}),\; K_2 \Pr(f([x_t, N]) \le y^*_{t-1} - \xi_t \mid y_{t,n_t+1}), \\
&\qquad\qquad \mathbb{E}_{y_{t,n_t+2} \mid y_{t,n_t+1}}[\rho_{t,n_t+2}(y_{t,n_t+2})] + c_{d_0} \big\} \Big] + c_{d_0} \\
&\le \mathbb{E}_{y_{t,n_t+1} \mid y_{t,n_t}} \big[K_2 \Pr(f([x_t, N]) \le y^*_{t-1} - \xi_t \mid y_{t,n_t+1})\big] + c_{d_0} \\
&\le K_2\, \mathbb{E}_{y_{t,n_t+1} \mid y_{t,n_t}} \big[\Pr(f([x_t, N]) \le y^*_{t-1} - \xi_t \mid y_{t,n_t+1})\big] + c_{d_0} \\
&\le K_2 + c_{d_0}
\end{aligned} \quad (20)$$
Equation 20, together with Lemma 7, implies that
$$\Pr(f([x_t, N]) > f^*_{t-1} \mid y_{t,n_t}) \le \Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t}) \le \frac{K_2 + c_{d_0}}{K_1} \quad (21)$$
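To make the argument in Equation 20 concrete, the following toy sketch (a hypothetical belief-update model, not the paper's GP-based BOS) runs the backward-induction recursion over the three decisions $d_0$, $d_1$, $d_2$ and checks that whenever early stopping ($d_1$) is optimal, the posterior probability is bounded by $(K_2 + c_{d_0})/K_1$, as in Lemma 8:

```python
K1, K2, c_d0 = 10.0, 1.0, 0.1  # hypothetical BOS cost parameters

def best_decision(p, n, N):
    """Return (decision, expected loss) at epoch n, where p is a toy belief
    Pr(final value beats the incumbent); the belief drifts by +/-0.1 per epoch."""
    loss_d1 = K1 * p          # early-stop: penalized if the run would have won
    loss_d2 = K2 * (1.0 - p)  # predict the run beats the incumbent
    if n == N:                # horizon: only terminal decisions remain
        return ('d1', loss_d1) if loss_d1 <= loss_d2 else ('d2', loss_d2)
    up = best_decision(min(p + 0.1, 1.0), n + 1, N)[1]
    dn = best_decision(max(p - 0.1, 0.0), n + 1, N)[1]
    cont = 0.5 * up + 0.5 * dn + c_d0  # continue (d0) and pay c_d0
    losses = {'d1': loss_d1, 'd2': loss_d2, 'd0': cont}
    d = min(losses, key=losses.get)
    return d, losses[d]

bound = (K2 + c_d0) / K1
stopped = [p for p in (i / 20.0 for i in range(21))
           if best_decision(p, 0, 5)[0] == 'd1']
```

Whenever $d_1$ is optimal, $K_1 p \le \mathbb{E}[\min\{\cdot\}] + c_{d_0} \le K_2 + c_{d_0}$, so every belief in `stopped` satisfies `p <= bound`, mirroring the lemma.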
Subsequently, we use the next lemma to upper-bound $\mathbb{E}[R_{T,2}]$ by the BOS cost parameters. We set $K_2$ and $c_{d_0}$ as constants, and use different values of $K_1$ in different iterations $t$ of the BO-BOS algorithm, represented by $K_{1,t}$.
Lemma 9. In iteration $t$ of the BO-BOS algorithm, define $K_{1,t} \triangleq \frac{K_2 + c_{d_0}}{\eta_t}$. Then, with probability $\ge 1 - \delta''$,
$$\mathbb{E}[R_{T,2}] \le \sum_{t=1}^{T} \eta_t \quad \forall T \ge 1 .$$
Proof. Recall that according to Assumption 1, the value of the objective function $f$ is bounded in the range $[0, 1]$. In iteration $t$, assume we early-stop the evaluation of $x_t$ after $n_t < N$ epochs; then
$$\mathbb{E}[f([x_t, N]) - f^*_{t-1} \mid y_{t,n_t}] \stackrel{(1)}{\le} \mathbb{E}[\mathbb{1}_{f([x_t, N]) - f^*_{t-1} > 0} \mid y_{t,n_t}] = \Pr(f([x_t, N]) > f^*_{t-1} \mid y_{t,n_t}) \quad (22)$$
Step (1) in Equation 22 holds because $x \le \mathbb{1}_{x > 0}$ $\forall x \in [-1, 1]$, substituting $x = f([x_t, N]) - f^*_{t-1}$. As a result, with probability $\ge 1 - \delta''$,
$$\mathbb{E}[R_{T,2}] \stackrel{(1)}{=} \sum_{t: n_t < N} \mathbb{E}[r_{t,2}] \stackrel{(2)}{=} \sum_{t: n_t < N} \mathbb{E}[f([x_t, N]) - f^*_{t-1} \mid y_{t,n_t}] \stackrel{(3)}{\le} \sum_{t: n_t < N} \Pr(f([x_t, N]) > f^*_{t-1} \mid y_{t,n_t}) \stackrel{(4)}{\le} \sum_{t: n_t < N} \frac{K_2 + c_{d_0}}{K_{1,t}} \le \sum_{t=1}^{T} \eta_t$$
in which (1) follows from the linearity of expectation, (2) holds because the expectation of $r_{t,2}$ is taken over $\Pr(f([x_t, N]) > y^*_{t-1} - \xi_t \mid y_{t,n_t})$, (3) results from Equation 22, and (4) follows from Lemma 8. This completes the proof.
The first term in the upper bound of $\mathbb{E}[S_T]$. Firstly, the first term in the upper bound matches the upper bound on the simple regret of the GP-UCB algorithm (Srinivas et al., 2010) (up to the constant $\kappa$). The maximum information gain, $\gamma_T$, has been analyzed for several commonly used GP kernels (Srinivas et al., 2010). For example, for the Squared Exponential kernel, $\gamma_T = O((\log T)^{d+1})$, whereas for the Matérn kernel with $\nu > 1$, $\gamma_T = O(T^{d(d+1)/(2\nu + d(d+1))} \log T)$. Plugging both expressions of $\gamma_T$ into Theorem 1, together with the expression of $\beta_T$ given in Theorem 1, shows that both kernels lead to sub-linear growth of the term $\sqrt{T C_1 \beta_T \gamma_T}$, which implies that the first term in the upper bound of $\mathbb{E}[S_T]$ asymptotically goes to 0.
The second term in the upper bound of $\mathbb{E}[S_T]$. Given that $K_{1,t}$ is an increasing sequence with $K_{1,1} \ge K_2 + c_{d_0}$, the series $\sum_{t=1}^{T} \eta_t = \sum_{t=1}^{T} \frac{K_2 + c_{d_0}}{K_{1,t}}$ grows sub-linearly, thus making the second term in the upper bound of $\mathbb{E}[S_T]$ given in Theorem 1, $\frac{\sum_{t=1}^{T} \eta_t}{T}$, asymptotically go to 0.
The third term in the upper bound of $\mathbb{E}[S_T]$. Next, suppose that $K_{1,t}$ becomes $+\infty$ for the first time at iteration $T_0$. Since $K_{1,t}$ is a non-decreasing sequence, $K_{1,t} = +\infty$ for all $t \ge T_0$. Therefore, for $t \ge T_0$, decision $d_1$ will never be taken and the algorithm will never early-stop; in other words, $n_t = N$ for all $t \ge T_0$. Therefore, we can conclude that $\tau_T \le T_0$ for all $T \ge 1$. As a result, the last term in the upper bound on $\mathbb{E}[S_T]$ in Theorem 1 can be upper-bounded by
$$N b \sqrt{\log \frac{da}{\delta'}}\, \frac{\tau_T}{T} \le \frac{T_0 N b \sqrt{\log \frac{da}{\delta'}}}{T} = O\Big(\frac{1}{T}\Big) \quad (25)$$
which asymptotically goes to 0 as $T$ goes to $+\infty$, because the numerator is a constant. Therefore, this term also asymptotically vanishes in the upper bound.
To summarize, if the BOS parameters are selected according to Theorem 2, we have that
$$\mathbb{E}[S_T] = O\Big( \frac{\sqrt{T \beta_T \gamma_T}}{T} + \frac{\sum_{t=1}^{T} \eta_t}{T} + \frac{1}{T} \Big) \quad (26)$$
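As an illustration (hypothetical constants, with all multiplicative factors set to 1), the sketch below evaluates the three terms of Equation 26 for the Squared Exponential kernel, using $\beta_T = O(\log T)$, $\gamma_T = O((\log T)^{d+1})$, and the example sequence $\eta_t = 1/\sqrt{t}$:

```python
import math

d = 3  # hypothetical input dimension

def term1(T):
    # sqrt(T * beta_T * gamma_T) / T for the SE kernel, constants dropped
    beta = math.log(T + 1)
    gamma = math.log(T + 1) ** (d + 1)
    return math.sqrt(T * beta * gamma) / T

def term2(T):
    # (sum_t eta_t) / T with eta_t = 1 / sqrt(t)
    return sum(1.0 / math.sqrt(t) for t in range(1, T + 1)) / T

Ts = (10 ** 4, 10 ** 5, 10 ** 6)
v1 = [term1(T) for T in Ts]
v2 = [term2(T) for T in Ts]
v3 = [1.0 / T for T in Ts]
```

For these (large) values of $T$, each of `v1`, `v2`, and `v3` is strictly decreasing, illustrating that all three terms of the bound vanish asymptotically; note that $\sqrt{(\log T)^{d+2}/T}$ is not monotone for small $T$, which is why large values are used.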
[Figure 8 plots: validation error v.s. training epochs (top), and validation error v.s. BO iterations (bottom).]
Figure 8. Illustration of the effectiveness of the early stopping decisions made during the BO-BOS algorithm.
[Figure 9 plot: validation error v.s. wall-clock time (hours); methods: GP-UCB, Hyperband, BOCA, LC Prediction, BO-BOS.]
Figure 9. Best-found validation error of CNN v.s. run-time using the CIFAR-10 dataset, with standard error (averaged over 30 random
initializations).
[Figure 10 plot: validation error v.s. wall-clock time (hours); methods: GP-UCB, Hyperband, BOCA, LC Prediction, BO-BOS.]
Figure 10. Best-found validation error of CNN v.s. run-time using the SVHN dataset, with standard error (averaged over 30 random
initializations).
As mentioned in the main text, the rewards are discounted in order to make the objective function, the discounted cumulative reward, resemble the learning curves of ML models, so that the BO-BOS algorithm can be naturally applied. This rationale is illustrated in Fig. 11, which plots some example un-discounted (γ = 1.0) and discounted (γ = 0.9) cumulative rewards. The figures indicate that, compared with un-discounted cumulative rewards, discounted cumulative rewards bear a significantly closer resemblance to the learning curves of ML models, thus supporting both the claim made in the main text motivating the use of discounted rewards and the experimental results shown in Fig. 3 (specifically, the poor performance of the curve corresponding to N = 50 and γ = 1.0). In addition to the results presented in Section 5.3.1 of the main text, we further present the results with standard errors in Fig. 12 to emphasize the significant performance advantage of BO-BOS over GP-UCB. To avoid clutter, we only present the results with error bars for GP-UCB with γ = 1.0 and BO-BOS with N = 50 and γ = 0.9, which are the best-performing settings for GP-UCB and BO-BOS respectively.
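The effect described above can be reproduced in a few lines of Python (the reward sequence is a hypothetical stand-in for an RL episode): discounting makes the cumulative-reward curve saturate like an ML learning curve, while the un-discounted curve keeps growing:

```python
import math

# hypothetical per-step rewards that keep fluctuating without converging
rewards = [1.0 + 0.5 * math.sin(0.7 * k) for k in range(50)]

def cumulative(rews, gamma):
    out, tot = [], 0.0
    for k, r in enumerate(rews):
        tot += (gamma ** k) * r  # discounted return up to step k
        out.append(tot)
    return out

undisc = cumulative(rewards, 1.0)  # gamma = 1.0
disc = cumulative(rewards, 0.9)    # gamma = 0.9

# the discounted curve plateaus (tiny final increments), while the
# un-discounted curve still grows by roughly 1 per step
assert undisc[-1] - undisc[-2] > 0.4
assert disc[-1] - disc[-2] < 0.02
```

Because the discount factor shrinks the final increments geometrically, the discounted curve is monotone and saturating, which is exactly the learning-curve shape that BOS's monotonicity assumption expects.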
[Figure 11 plots: un-discounted cumulative reward v.s. epochs (top), and discounted cumulative reward v.s. epochs (bottom).]
[Figure 12 plot: average return v.s. total number of steps (×10^5); methods: GP-UCB (γ = 1.0), BO-BOS (N = 50, γ = 0.9).]
Figure 12. Best-found return (averaged over 5 episodes) v.s. the total number of steps of the robot in the environment (averaged over 30
random initializations) using the Swimmer-v2 task, with standard error.
[Figure 13 plot: validation error v.s. wall-clock time (hours); methods: GP-UCB, Hyperband, BO-BOS.]
Figure 13. Best-found validation error of XGBoost v.s. run-time with standard error (averaged over 30 random initializations), obtained
using joint hyperparameter tuning and feature selection.
The dataset is taken from the UCI Machine Learning Repository (Dheeru & Karra Taniskidou, 2017) and represents a binary classification problem: whether an email is spam or not. We use 3065 emails as the training set and the remaining 1536 emails as the validation set; each email is described by 57 features. The maximum number of features for each hyperparameter setting is set to N = 50. In addition to Figure 4 in the main text, the same plot with error bars (standard errors) is shown in Figure 13.