
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 9, SEPTEMBER 2023

Surrogate Modeling for Bayesian Optimization Beyond a Single Gaussian Process

Qin Lu, Member, IEEE, Konstantinos D. Polyzos, Student Member, IEEE, Bingcong Li, Member, IEEE, and Georgios B. Giannakis, Fellow, IEEE

Abstract—Bayesian optimization (BO) has well-documented merits for optimizing black-box functions with an expensive evaluation cost. Such functions emerge in applications as diverse as hyperparameter tuning, drug discovery, and robotics. BO hinges on a Bayesian surrogate model to sequentially select query points so as to balance exploration with exploitation of the search space. Most existing works rely on a single Gaussian process (GP) based surrogate model, where the kernel function form is typically preselected using domain knowledge. To bypass such a design process, this paper leverages an ensemble (E) of GPs to adaptively select the surrogate model fit on-the-fly, yielding a GP mixture posterior with enhanced expressiveness for the sought function. Acquisition of the next evaluation input using this EGP-based function posterior is then enabled by Thompson sampling (TS) that requires no additional design parameters. To endow function sampling with scalability, random feature-based kernel approximation is leveraged per GP model. The novel EGP-TS readily accommodates parallel operation. To further establish convergence of the proposed EGP-TS to the global optimum, analysis is conducted based on the notion of Bayesian regret for both sequential and parallel settings. Tests on synthetic functions and real-world applications showcase the merits of the proposed method.

Index Terms—Bayesian optimization, Gaussian processes, ensemble learning, Thompson sampling, Bayesian regret analysis.

Manuscript received 3 October 2022; revised 13 February 2023; accepted 26 March 2023. Date of publication 5 April 2023; date of current version 4 August 2023. This work was supported by NSF grants 1901134, 2128593, 2126052, 2212318, and 2220292. K. D. Polyzos was also supported by the Onassis Foundation Scholarship. Recommended for acceptance by M. Zhang. (Qin Lu and Konstantinos D. Polyzos contributed equally.) (Corresponding author: Qin Lu.)
The authors are with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: qlu@umn.edu; polyz003@umn.edu; lixx5599@umn.edu; georgios@umn.edu).
This article has supplementary downloadable material available at https://doi.org/10.1109/TPAMI.2023.3264741, provided by the authors.
Digital Object Identifier 10.1109/TPAMI.2023.3264741

I. INTRODUCTION

A NUMBER of machine learning and artificial intelligence (AI) applications boil down to optimizing an 'expensive-to-evaluate' black-box function, including hyperparameter tuning [1], drug discovery [2], and policy optimization in robotics [3]. As in hyperparameter tuning, the lack of analytic expressions for the objective function and the overwhelming evaluation cost discourage grid search and the adoption of gradient-based solvers. To find the global optimum under a limited evaluation budget, Bayesian optimization (BO) offers a principled framework by leveraging a statistical model to guide the acquisition of query points on-the-fly [4], [5].

While BO can automate the selection of the best-performing machine learning model along with its optimal hyperparameters, it still necessitates domain-specific expert knowledge to design both the surrogate model and the acquisition function [1]. In the Gaussian process (GP) based surrogate model, one has to select the kernel type and the corresponding hyperparameters. Also, a decision has to be made on the selection from the available acquisition functions, and the associated design parameters if there are any. Minimizing such design efforts so as to automate BO is especially appealing for modern AI tasks. Given that in many setups BO is inherently time-consuming, parallelizing function evaluations to reduce convergence time is also of utmost importance. Further, rigorous analysis is desired to establish convergence of BO algorithms to the global optimum. To address the aforementioned desiderata, the goal of the present work is to develop a BO method that entails the least tuning effort, accommodates parallel operation, and enjoys convergence guarantees.

A. Related Works

Prior art is outlined next to contextualize our contributions.

Ensemble BO: Several choices are available for the surrogate model, acquisition function, and acquisition optimizer for BO [4]. Without prior knowledge of the problem at hand, combining the merits of different options can intuitively robustify performance. As pointed out in the 2020 black-box optimization challenge, ensembling methods can empirically boost BO performance for hyperparameter tuning [6]. In a broader sense, the ensemble rule has been applied to BO in different contexts, including high-dimensional input [7], and meta learning [8]. In the basic BO setup, combining acquisition functions has been explored for a single GP-based surrogate model in a principled way [9], [10]. The complementary setting of an ensemble of GP surrogate models with a given acquisition function has not been touched upon.

Thompson Sampling (TS) and Regret Analysis for BO: Since its invention by [11], TS had not received much attention in the bandit community until the past decade, when its empirical success [12] and theoretical guarantees [13] have been well documented. In the context of BO, TS has been recently explored under different settings, including high dimensionality [14], inputs with categorical variables [15], [16], as well as distributed learning [17], [18]. Without additional design parameters, TS is very attractive for automated machine learning. Convergence of TS for BO has been recently established using regret analysis, both in the Bayesian [13], [17], and in the frequentist setting [19], [20]. Although TS has been investigated with a mixture prior for linear bandits [21], its counterpart in BO with the associated regret analysis has not been studied so far.

Parallel BO: To reduce the convergence time of BO approaches, parallel function evaluations at distributed computing resources are well motivated. Coupled with upper confidence bound [22] and expected improvement [23] based acquisition rules, this parallel operation typically relies on additional hyperparameters or selection rules to ensure the diversity of query points at different locations. On the other hand, TS-based parallel processing necessitates no additional design relative to the sequential setting [18], and enjoys rigorous convergence guarantees [17]. Moreover, parallel BO has also been investigated for input spaces with high dimensions [7] as well as categorical variables [15].

Kernel Selection for GPs: Discovery of the form of the kernel function has been considered for conventional GP learning; see, e.g., [24], [25], [26], [27]. These approaches usually operate in batch mode and rely on a large number of samples, thus rendering them inapplicable for BO, where data are not only acquired online, but also scarce due to the expensive evaluation cost. While an online kernel selection scheme has been put forth for prediction-oriented tasks using a candidate set of GP models [28], it entails additional design of the acquisition function before being applied to the BO context. How to automatically select the kernel function for the GP model in BO is still unexplored.

B. Contributions

Relative to the aforementioned previous works, the contributions of this work are summarized in the following four aspects.
c1) Rather than a single GP surrogate model with a preselected kernel function for BO as in previous works, an ensemble (E) of GPs is leveraged here to adaptively select the fitted model for the sought function by adjusting the per-GP weight on-the-fly. Capitalizing on the random feature (RF) based approximation per GP, acquisition of the next query input is facilitated by TS with scalability and no additional design parameters.
c2) The resulting EGP-TS approach readily accommodates parallel function evaluation (a)synchronously.
c3) Convergence of the novel EGP-TS approach to the global maximum is established by sublinear Bayesian regret for both the sequential and parallel settings.
c4) Tests on synthetic functions and real-world applications, including hyperparameter tuning for three machine learning models and robot pushing tasks, demonstrate the merits of EGP-TS relative to the single GP-based TS, and alternative ensemble approaches.

Relation With [28]: The EGP function model has been considered in our previous work [28] for supervised learning tasks. However, its adaptation to the BO context here is novel and well motivated for the purpose of kernel selection, which is important in practice. Coping with limited data in BO, this work differs from [28] in the following directions.
i) Unlike [28], which relies on a large dataset of passively labelled samples, the novel EGP-based BO entails extra design of acquisition functions, which select query points actively. Two novel EGP-based acquisition functions are devised and tested, namely, EGP-TS and EGP-EI.
ii) Although random feature-based approximation has been used also by [28], it serves a different purpose here. In [28], where the number of samples is large, the RF approximation alleviates the computational complexity of updating the GP model; whereas in the current BO context with limited labelled data, RFs are motivated to conduct function sampling with scalability in the TS-based acquisition function.
iii) An extra weight and model reinitialization is needed each time the kernel hyperparameters are updated using all data acquired (cf. lines 10–15 in Algorithm 1).
iv) Building on the novel EGP-TS approach, Bayesian regret analysis has been conducted to guarantee convergence to the global optimum. The analysis is novel and nontrivial to deal with the additional challenge brought by the EGP prior (cf. the proof sketch following Theorem 1).

Notation: Scalars are denoted by lowercase, column vectors by bold lowercase, and matrices by bold uppercase fonts. Superscripts ⊤ and −1 denote transpose and matrix inverse, respectively; 0_N stands for the N × 1 all-zero vector; I_N for the N × N identity matrix; and N(x; μ, K) for the probability density function (pdf) of a Gaussian random vector x with mean μ and covariance K.

II. PRELIMINARIES

Consider the following optimization problem

x_* = arg max_{x ∈ X} f(x),   (1)

where X is the feasible set for the d × 1 optimization variable x, and the objective f(x) is black-box, with analytic expression unavailable, and is often expensive to evaluate. This mathematical abstraction characterizes a variety of application domains. When tuning hyperparameters of machine learning models with x collecting the hyperparameters, the mapping to the validation accuracy f(x) is not available in closed form, and each evaluation is computationally demanding, especially for deep neural networks and large data sizes [1]. For example, it takes 4 days to train BERT-large on 64 TPUs [29]. The lack of an analytic expression discourages one from leveraging conventional gradient-based solvers to find x_*. Exhaustive enumeration is also inapplicable given the expensive evaluation cost. Fortunately, BO offers a theoretically elegant solution by judiciously selecting query pairs for a given evaluation budget [4], [5].

In short, BO relies on a statistical surrogate model to extract information from the evaluated input-output pairs D_t := {(x_τ, y_τ)}_{τ=1}^t so as to select the next query input x_{t+1}. Specifically, this procedure is implemented iteratively via two steps, that is: s1) obtain p(f(x)|D_t) based on the surrogate model; and, s2) find x_{t+1} = arg max_{x∈X} α(x|D_t) based on p(f(x)|D_t). Here, the so-termed acquisition function α, usually available in closed form, is designed to balance exploration with exploitation of the search space. There are multiple choices for both the surrogate model and the acquisition function; see, e.g., [4], [5].
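The two-step procedure s1)–s2) can be made concrete with a minimal sketch of the generic BO loop; the surrogate-fitting routine, the acquisition function, and the crude random-search maximizer below are illustrative stand-ins (not the specific choices analyzed in this paper), and all names are hypothetical.

```python
import numpy as np

def bo_loop(f, bounds, fit_surrogate, acquisition, budget, n_init=5, seed=0):
    """Generic BO loop: s1) update the surrogate posterior p(f|D_t);
    s2) maximize the acquisition alpha(x|D_t) to pick the next query.
    All components are user-supplied stand-ins."""
    rng = np.random.default_rng(seed)
    d = bounds.shape[0]
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, d))  # initial random queries
    y = np.array([f(x) for x in X])
    for _ in range(budget):
        posterior = fit_surrogate(X, y)                              # s1
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2048, d))
        x_next = cand[np.argmax(acquisition(cand, posterior))]       # s2 (random search)
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]
```

Any concrete surrogate/acquisition pair from [4], [5] can be plugged into this skeleton.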
Next, we will outline the GP based surrogate model, which is the most widely used in BO, and TS for the acquisition function.

A. GP-Based Surrogate Model and TS for Acquisition

GPs are established nonparametric Bayesian approaches to learning functions in a sample-efficient manner [30]. This sample efficiency makes them extremely appealing for surrogate modeling in BO when function evaluations are expensive. Specifically, to learn f(·) that links the input x_τ with the scalar output y_τ as x_τ → f(x_τ) → y_τ, a GP prior is assumed on the unknown f as f ∼ GP(0, κ(x, x′)), where κ(·, ·) is a positive-definite kernel (covariance) function measuring the pairwise similarity of any two inputs. Then, the joint prior pdf of the function evaluations f_t := [f(x_1), ..., f(x_t)]^⊤ at inputs X_t := [x_1, ..., x_t] (∀t) is Gaussian distributed as p(f_t|X_t) = N(f_t; 0_t, K_t), where K_t is a t × t covariance matrix whose (τ, τ′)th entry is [K_t]_{τ,τ′} = cov(f(x_τ), f(x_{τ′})) := κ(x_τ, x_{τ′}). The value f(x_τ) is linked with the noisy output y_τ via the per-datum likelihood p(y_τ|f(x_τ)) = N(y_τ; f(x_τ), σ_n^2), where σ_n^2 is the noise variance. The function posterior pdf after acquiring the input-output pairs D_t is then obtained according to Bayes' rule as [30]

p(f(x)|D_t) = N(f(x); f̂_t(x), σ_t^2(x)),   (2)

where the mean and variance are expressed via k_t(x) := [κ(x_1, x) ... κ(x_t, x)]^⊤ and y_t := [y_1 ... y_t]^⊤ as

f̂_t(x) = k_t^⊤(x)(K_t + σ_n^2 I_t)^{-1} y_t   (3a)
σ_t^2(x) = κ(x, x) − k_t^⊤(x)(K_t + σ_n^2 I_t)^{-1} k_t(x).   (3b)

With the function posterior pdf at hand, one readily selects the next evaluation point x_{t+1} using TS, where the function maximizer x_* in (1) is viewed as random. Specifically, TS selects the next query point by sampling from the posterior pdf p(x_*|D_t) = ∫ p(x_*|f(x)) p(f(x)|D_t) df(x). Upon approximating this integral using a sample from the function posterior p(f(x)|D_t), the next query is found as

x_{t+1} = arg max_{x∈X} f̃_t(x),   f̃_t(x) ∼ p(f(x)|D_t).   (4)

This random sampling procedure nicely balances exploration and exploitation. Sampling a function from the GP posterior p(f(x)|D_t) can be implemented by discretizing the input space X [17], leveraging the RF based parametric approximant [10], [31], or, more recently, relying on sparse GP decomposition for efficiency [32].

Specifically, the RF-based approximation leverages the spectral properties of (commonly used) stationary kernels to convert nonparametric GP learning into a parametric one, yielding [31], [33]

f̌(x) = φ_v^⊤(x) θ,   θ ∼ N(θ; 0_{2D}, σ_θ^2 I_{2D})
φ_v(x) := (1/√D) [sin(v_1^⊤x), cos(v_1^⊤x), ..., sin(v_D^⊤x), cos(v_D^⊤x)]^⊤,   (5)

where {v_i}_{i=1}^D are drawn i.i.d. from π_κ̄(v) – kernel κ's normalized spectral density – and σ_θ^2 is the magnitude of κ (cf. Appendix A in the supplementary file, available online).

Henceforth, the function posterior pdf will be captured by p(θ|D_t) = N(θ; θ̂_t, Σ_t), based on which TS will select the next query point as

x_{t+1} = arg max_{x∈X} φ_v^⊤(x) θ̃_t,   θ̃_t ∼ p(θ|D_t).   (6)

It is worth mentioning that the mean θ̂_t and covariance matrix Σ_t can be updated efficiently in a recursive Bayes manner with the inclusion of each new (input, evaluation) pair.

III. ENSEMBLE GPS WITH TS FOR BO

The performance of BO approaches depends critically on the chosen surrogate model. While most existing works rely on a single GP with a preselected kernel form, we here leverage an ensemble (E) of M GPs, each relying on a kernel function selected from a given dictionary K := {κ_1, ..., κ_M}. Set K can be constructed with kernels of different types and different hyperparameters. Specifically, each GP m ∈ M := {1, ..., M} places a unique prior on f as f|m ∼ GP(0, κ_m(x, x′)). Taking a weighted combination of the individual GP priors yields the EGP prior of f(x) given by

f(x) ∼ \sum_{m=1}^M w_0^m GP(0, κ_m(x, x′)),   \sum_{m=1}^M w_0^m = 1,   (7)

where w_0^m := Pr(i = m) is the prior probability that assesses the contribution of GP model m. Here, the latent variable i is introduced to indicate the contribution from GP m. While this non-Gaussian EGP prior (7) has been advocated for conventional prediction-oriented tasks in [28], the novelty here is its adaptation for BO along with the extra design step needed for query selection. Besides the EGP prior for BO, we will employ a TS-based acquisition function, which again relies on sampling from p(f(x)|D_t). Coupled with the EGP prior (7), this posterior pdf is expressed via the sum-product rule as

p(f(x)|D_t) = \sum_{m=1}^M Pr(i = m|D_t) p(f(x)|i = m, D_t),   (8)

which is a mixture of posterior GPs with per-GP weight w_t^m := Pr(i = m|D_t) given by

w_t^m ∝ Pr(i = m) p(D_t|i = m) = w_0^m p(D_t|i = m),   (9)

where p(D_t|i = m) is the marginal likelihood of the acquired data D_t for GP m. As with sampling from a Gaussian mixture (GM) distribution, drawing a sample f̃_t(x) from (8) is implemented by the following two steps

m_t ∼ CAT(M, w_t),   f̃_t(x) ∼ p(f(x)|i = m_t, D_t),   (10)

where CAT(M, w_t) represents a categorical distribution that assigns one of the values from M with probabilities w_t := [w_t^1, ..., w_t^M]^⊤.

There are several choices for the function sampling step (10) in the novel EGP-TS, as mentioned in Section II-A.
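Before turning to the RF-based implementation adopted next, the two-step draw in (10) can be sketched in its simplest form, namely on a discretized candidate set as in [17]; the exact GP posterior (2)-(3) is used per model, the per-GP weights w_t are assumed to be already available from (9), and all variable names are illustrative.

```python
import numpy as np

def gp_posterior(X, y, Xc, kernel, sigma_n):
    """Exact GP posterior (2)-(3) evaluated jointly over the candidate set Xc."""
    K = kernel(X, X) + sigma_n**2 * np.eye(len(X))
    Kc = kernel(X, Xc)                       # k_t(x) stacked for all candidates
    sol = np.linalg.solve(K, Kc)
    mean = sol.T @ y                         # (3a)
    cov = kernel(Xc, Xc) - Kc.T @ sol        # (3b)
    return mean, cov

def egp_ts_next(X, y, Xc, kernels, weights, sigma_n, rng):
    """Two-step draw of (10): m_t ~ CAT(M, w_t), then a joint sample of f(Xc)
    from GP m_t's posterior; the maximizer of the sampled path is x_{t+1}."""
    m = rng.choice(len(kernels), p=weights)
    mean, cov = gp_posterior(X, y, Xc, kernels[m], sigma_n)
    f_sample = rng.multivariate_normal(mean, cov + 1e-9 * np.eye(len(Xc)))
    return m, Xc[np.argmax(f_sample)]

# e.g., an isotropic RBF entry for the dictionary (lengthscale ell assumed known):
# rbf = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :])**2).sum(-1) / ell**2)
```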
Here, we will adopt the random feature (RF) based method, since it can not only efficiently draw a function path f̃_t(x) that is differentiable with respect to x, but also accommodate incremental updates of w_t^m (9) and p(f(x)|i = m, D_t) across iterates, as elaborated next.

A. RF-Based EGP-TS

When the kernels in the dictionary are shift-invariant, the RF vector φ_v^m(x) per GP m can be formed via (5) by first drawing i.i.d. random vectors {v_j^m}_{j=1}^D from π_κ̄^m(v), which is the spectral density of the standardized kernel κ̄^m. Let σ_{θ^m}^2 be the kernel magnitude so that κ_m = σ_{θ^m}^2 κ̄^m. The generative model for the sought function and the noisy output y per GP m can be characterized through the 2D × 1 vector θ^m as

p(θ^m) = N(θ^m; 0_{2D}, σ_{θ^m}^2 I_{2D})
p(f(x_t)|i = m, θ^m) = δ(f(x_t) − φ_v^{m⊤}(x_t) θ^m)
p(y_t|θ^m, x_t) = N(y_t; φ_v^{m⊤}(x_t) θ^m, σ_n^2).   (11)

This parametric form readily allows one to capture the function posterior pdf per GP m via p(θ^m|D_t) = N(θ^m; θ̂_t^m, Σ_t^m), which, together with the weight w_t^m, approximates the EGP function posterior (8). Next, we will describe how RF-based EGP-TS selects the next evaluation input x_{t+1}, and propagates the EGP function pdf by updating the set {w_t^m, θ̂_t^m, Σ_t^m, m ∈ M} from slot to slot.

Given D_t, the acquisition of x_{t+1} is obtained as the maximizer of the RF-based function sample f̃_t(x) based on (10), whose detailed implementation is given by

x_{t+1} = arg max_{x∈X} f̃_t(x),   where f̃_t(x) := φ_v^{m_t⊤}(x) θ̃_t,
m_t ∼ CAT(M, w_t),   θ̃_t ∼ p(θ^{m_t}|D_t),   (12)

which can be solved using gradient-based solvers because the objective is available in analytic form. Upon acquiring the evaluation output y_{t+1} for the selected input x_{t+1}, the updated weight w_{t+1}^m := Pr(i = m|D_t, x_{t+1}, y_{t+1}) can be obtained per GP m via Bayes' rule as

w_{t+1}^m = Pr(i = m|D_t, x_{t+1}) p(y_{t+1}|x_{t+1}, i = m, D_t) / p(y_{t+1}|x_{t+1}, D_t)
         = w_t^m N(y_{t+1}; ŷ_{t+1|t}^m, (σ_{t+1|t}^m)^2) / \sum_{m′=1}^M w_t^{m′} N(y_{t+1}; ŷ_{t+1|t}^{m′}, (σ_{t+1|t}^{m′})^2),   (13)

where the sum-product rule allows one to obtain the per-GP predictive likelihood as p(y_{t+1}|i = m, D_t, x_{t+1}) = ∫ p(y_{t+1}|θ^m, x_{t+1}) p(θ^m|D_t) dθ^m = N(y_{t+1}; ŷ_{t+1|t}^m, (σ_{t+1|t}^m)^2), with ŷ_{t+1|t}^m = φ_v^{m⊤}(x_{t+1}) θ̂_t^m and (σ_{t+1|t}^m)^2 = φ_v^{m⊤}(x_{t+1}) Σ_t^m φ_v^m(x_{t+1}) + σ_n^2.

Further, the posterior pdf of θ^m can be propagated in the recursive Bayes fashion as

p(θ^m|D_{t+1}) = p(θ^m|D_t) p(y_{t+1}|θ^m, x_{t+1}) / p(y_{t+1}|x_{t+1}, i = m, D_t) = N(θ^m; θ̂_{t+1}^m, Σ_{t+1}^m),   (14)

where the updated mean θ̂_{t+1}^m and covariance matrix Σ_{t+1}^m are

θ̂_{t+1}^m = θ̂_t^m + (σ_{t+1|t}^m)^{-2} Σ_t^m φ_v^m(x_{t+1}) (y_{t+1} − ŷ_{t+1|t}^m)   (15a)
Σ_{t+1}^m = Σ_t^m − (σ_{t+1|t}^m)^{-2} Σ_t^m φ_v^m(x_{t+1}) φ_v^{m⊤}(x_{t+1}) Σ_t^m.   (15b)

(Re)initialization: In accordance with existing BO implementations, EGP-TS initializes with a small number (t_0) of evaluation pairs D_{t_0} to obtain the kernel hyperparameter estimate α̂_{t_0}^m per GP m by maximizing the marginal likelihood. The weight w_{t_0}^m is then obtained via (9) using α̂_{t_0}^m. As the iterations proceed, the kernel hyperparameters per GP are updated every few iterations using all the acquired data, and subsequently the weights are reinitialized via the batch form (9) using the updated hyperparameters. Between updates of the hyperparameters, EGP-TS leverages (13) and (14) to incrementally propagate the function posterior pdf. Please refer to Algorithm 1 for the detailed implementation of (sequential) EGP-TS.

B. Parallel EGP-TS

As with the single GP-based TS [17], EGP-TS can readily accommodate parallel implementation in both synchronous and asynchronous settings without extra design. Suppose there are K computing centers/workers that conduct function evaluations in parallel. In the synchronous setup, K query points are assigned for the workers to evaluate simultaneously by implementing (10) K times. After all workers obtain the evaluated outputs, the EGP function posterior is then updated using the K input-output pairs. As for the asynchronous case, whenever a worker finishes her/his job, the EGP posterior is updated and the next evaluation point is acquired. Note that the asynchronous setup is very similar to the sequential one, except that multiple function evaluations are performed at the same time; see Algorithm 2 in the supplementary file, available online, for details. Algorithm 1 contains the implementation of synchronous parallel EGP-TS when K > 1.

The following two remarks are in order.

Remark 1 (EGP With Other Acquisition Functions): Besides TS, the EGP surrogate model can be coupled with other existing single GP-based acquisition functions, including the well-known expected improvement (EI) [34] and upper confidence bound (UCB) [35]. The most direct implementation per iteration is to first draw the model index m_t based on the weights w_t as in (10), and then proceed with the conventional EI/UCB acquisition rule for GP m_t. Results for this preliminary EGP-EI are presented in Appendix E, available online. Instead of sampling one GP model per iteration, one could alternatively build on the GP mixture pdf to devise the EI or UCB based acquisition rule. Further investigation along this direction is deferred to our future agenda.
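As a hedged illustration of the simplest variant in Remark 1, the sketch below draws one GP index from the EGP weights and then applies the standard EI rule to that GP; the per-model posterior callables and the candidate set are assumed to be supplied, and the EI formula is the conventional one from [34] rather than a prescription specific to this paper.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, s, y_best, xi=0.0):
    """Standard EI score given the posterior mean mu(x) and std s(x) of one GP."""
    s = np.maximum(s, 1e-12)
    z = (mu - y_best - xi) / s
    return (mu - y_best - xi) * norm.cdf(z) + s * norm.pdf(z)

def egp_ei_select(X_cand, posterior_mean_std, weights, y_best, rng):
    """Simplest EGP-EI variant of Remark 1: draw m_t ~ CAT(M, w_t), then apply
    the conventional EI rule for GP m_t; posterior_mean_std[m](X) -> (mu, s)."""
    m = rng.choice(len(weights), p=weights)
    mu, s = posterior_mean_std[m](X_cand)
    return m, X_cand[np.argmax(expected_improvement(mu, s, y_best))]
```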
Remark 2 (Relation With Fully Bayesian GP-Based BO): When the dictionary consists of kernel functions of the same type, the EGP prior amounts to a pseudo Bayesian GP model, where the kernel hyperparameters are chosen from a finite set. This EGP-based pseudo Bayesian model achieves a "sweet spot" between the Bayesian and non-Bayesian treatment of GP hyperparameters, where the former entails specifying a reasonable prior and also needs demanding MCMC sampling. In addition, the proposed EGP-TS framework not only accommodates different types of kernels, but also enjoys the upcoming convergence guarantees relative to fully Bayesian GP-based BO.

Algorithm 1: EGP-TS.
1: Input: Kernel dictionary K, number D of RFs, number K of workers, and w_0^m = 1/M ∀m
2: Initialization:
3: Randomly evaluate t_0 points to obtain D_{t_0};
4: for m = 1, 2, ..., M do
5:   Obtain kernel hyperparameter estimates α̂_{t_0}^m by maximizing the marginal likelihood;
6:   Draw D random vectors {v_i^m}_{i=1}^D from π_κ̄^m(v) using α̂_{t_0}^m;
7:   Obtain w_{t_0}^m, θ̂_{t_0}^m, and Σ_{t_0}^m based on (9) and (??);
8: end for
9: for t = t_0, t_0 + 1, ... do
10:   if Reinitialization then
11:     for m = 1, 2, ..., M do
12:       Obtain α̂_t^m by marginal likelihood maximization using D_t;
13:       Draw D random vectors {v_i^m}_{i=1}^D from π_κ̄^m(v) using α̂_t^m;
14:       Obtain w_t^m, θ̂_t^m, and Σ_t^m based on (9) and (??);
15:     end for
16:   end if
17:   for k = 1, 2, ..., K do
18:     Sample m_t^k based on pmf w_t;
19:     Sample θ̃_t^k from N(θ̂_t^{m_t^k}, Σ_t^{m_t^k});
20:     Obtain x_{t+1}^k = arg max_{x∈X} θ̃_t^{k⊤} φ^{m_t^k}(x);
21:     Evaluate x_{t+1}^k to obtain y_{t+1}^k;
22:   end for
23:   Update {w_{t+1}^m, θ̂_{t+1}^m, Σ_{t+1}^m}_m with {x_{t+1}^k, y_{t+1}^k}_k based on (13) and (14);
24:   D_{t+1} = D_t ∪ {x_{t+1}^k, y_{t+1}^k}_k;
25: end for

Fig. 1. Simple regret on the Ackley-5d, Zakharov, DropWave and Eggholder functions (from left to right). The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

Fig. 2. Simple regret on the Ackley-5d, Zakharov, DropWave and Eggholder functions (from left to right) using RBF kernels. The dictionary has 11 RBF kernels with lengthscales given by {10^c}_{c=−4}^{6}.

Fig. 3. Simple regret on the Robot pushing 3D and Robot pushing 4D tasks with dictionary (a) that has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2; and (b) that has 11 RBF kernels with characteristic lengthscales given by {10^c}_{c=−4}^{6}.

Fig. 4. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the NN hyperparameter tuning task. The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

IV. BAYESIAN REGRET ANALYSIS

To establish convergence of the proposed EGP-TS algorithm to the global optimum, analysis will be conducted via the notion of Bayesian regret over T slots, defined as

BR(T) := \sum_{t=1}^T E[f(x_*) − f(x_t)],   (16)

where the expectation is over all random quantities, including the function prior, the observations, and the sampling procedure. Unlike previous works that sample the function from a single GP prior [13], [17], here we draw f from the EGP prior (7) as

m_* ∼ CAT(M, w_0),   f(x) ∼ GP(0, κ_{m_*}(x, x′)).

This EGP prior presents an additional challenge to the regret analysis. Towards addressing this challenge, we will adapt the techniques in [21], where TS with a mixture prior is studied for linear bandits, but not in the BO context.

To proceed, we will need the following assumption and intermediate lemmas.

Assumption 1 (Smoothness of a GP Sample Path [36]): If x ∈ X ⊂ [0, 1]^d is compact and convex, there exist constants a, b, L > 0 such that for any f(x) ∼ GP(0, κ_m(x, x′))

Pr(sup_{x∈X} |∂f(x)/∂x_j| > L) ≤ a e^{−(L/b)^2},   ∀j ∈ {1, ..., d}.

Lemma 1 (Maximum Information Gain (MIG) [35]): Let I^m(f; y_A) represent the Shannon mutual information one can gain about the function f ∼ GP(0, κ_m) using observations y_A evaluated at the finite subset A := {x_1, ..., x_T} ⊂ X. For any m ∈ M, the MIG for commonly used kernels can be upper bounded by

γ_T := max_{A⊂X, |A|=T} max_{m∈M} I^m(f; y_A) ≤ Õ(T^c),   0 ≤ c < 1,

where Õ ignores polylog factors.

Lemma 2 (Ratio of Posterior Variances [22]): Let y_A and y_B denote the observations when evaluating f ∼ GP(0, κ_m) at A and B, which are finite subsets of X. With σ_A^m(x) and σ_{A∪B}^m(x) representing the posterior standard deviation of the GP conditioned on A and A ∪ B, there exists ρ_K ≥ 1 so that the following holds for |B| < K

(σ_A^m(x))^2 ≤ ρ_K (σ_{A∪B}^m(x))^2,   ∀x ∈ X, m ∈ M.

As stated in [35], Assumption 1 is satisfied for various commonly used stationary kernels that are four times differentiable, including Gaussian kernels and Matérn ones with parameter ν > 2, which implicitly allows EGP-TS to draw functions with scalability using RFs as in the preceding section. The MIG in Lemma 1 plays an important role in the regret bound. It is an information-theoretic measure quantifying the statistical difficulty of BO [13], [35]. Lemma 2 will be useful in deriving the regret in the parallel setup. After making these comments, we are ready to present a Bayesian regret upper bound pertinent to EGP-TS in the sequential setting.

Theorem 1: Under Assumption 1, the cumulative Bayesian regret (16) of EGP-TS over T slots is bounded by

BR(T) ≤ c_1 √(M T^{c+1} log T) + 2σ_n √(M T log T) + c_2,

where the constants c_1 := (2 + √d)(2/log(1 + σ_n^{−2}))^{1/2} and c_2 := 6MB + π^2 d/6 + √(2π) M/12 (B is a constant given in Lemma 3 in Appendix B, available online) do not depend on T.

Proof Sketch: The detailed proof of Theorem 1 is deferred to Appendix B, available online. The key step in the proof builds on the connection with UCB based approaches, which is manifested via decomposing the Bayesian regret (16) as

BR(T) = \sum_{t=1}^T E[f(x_*) − U_t^{m_*}(x_*)] + \sum_{t=1}^T E[U_t^{m_t}(x_t) − f(x_t)] =: BR_1(T) + BR_2(T),

where U_t^m(x) := μ_{t−1}^m(x) + β_t^{1/2} σ_{t−1}^m(x), with β_t specified by (17) in Appendix B, available online, is a UCB for f(x) under GP m. This decomposition of BR(T) holds since {m_t, x_t} and {m_*, x_*} are i.i.d. and U_t^m(x) is deterministic conditioned on D_{t−1}, yielding [13], [21]

E_{t−1}[U_t^{m_t}(x_t)] = E_{t−1}[U_t^{m_*}(x_*)],   ∀t.

Then, the Bayesian regret bound of EGP-TS can be established by upper bounding BR_1(T) and BR_2(T). Since f ∼ GP(0, κ_{m_*}), the former can be conveniently bounded based on related works that rely on a single GP [13], [17]. Specifically, BR_1(T) is proved to be upper bounded by a constant, because the probability that f(x_*) is larger than U_t^{m_*}(x_*) across all the slots is low [17].

To further bound BR_2(T), which involves the extra latent variable m_t sampled from the EGP posterior (cf. (10)), we adapt the technique in [21] that constructs a confidence set C_t for the latent variable such that m_* ∈ C_t holds with high probability; see Lemma 4 in Appendix B, available online. It turns out that BR_2(T) can also be bounded by the sum of posterior standard deviations, which further yields the upper bound given by the MIG along the lines of [35].

The proof of Theorem 1 in Appendix B, available online, involves an additional discretization step of X per step t, in order to cope with the continuous feasible set X.
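Although U_t^m(x) only enters the analysis (EGP-TS itself never computes it), it is cheap to evaluate under the RF parameterization of Section III; the sketch below assumes Phi_x stacks the RF features φ_v^m of the candidate points, and is only meant to connect the analyzed quantity with the implemented posterior.

```python
import numpy as np

def rf_ucb(Phi_x, theta_hat, Sigma, beta_t):
    """U_t^m(x) = mu_{t-1}^m(x) + sqrt(beta_t) * sigma_{t-1}^m(x) under the RF
    posterior of GP m: mu = Phi theta_hat, sigma^2 = diag(Phi Sigma Phi^T)."""
    mu = Phi_x @ theta_hat
    var = np.einsum("nd,dk,nk->n", Phi_x, Sigma, Phi_x)
    return mu + np.sqrt(beta_t) * np.sqrt(np.maximum(var, 0.0))
```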
Fig. 5. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the NN hyperparameter tuning task. The dictionary has 11 RBF kernels with characteristic lengthscales given by {10^c}_{c=−4}^{6}.

Fig. 6. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the SVM hyperparameter tuning task. The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

Fig. 7. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the SVM hyperparameter tuning task. The dictionary has 11 RBF kernels with characteristic lengthscales given by {10^c}_{c=−4}^{6}.

Fig. 8. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the GradientBoosting hyperparameter tuning task. The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

Fig. 9. The best validation accuracy (so far) versus the number of function evaluations on the Breast Cancer, Transportation, Iris, and Wine datasets (from left to right) for the GradientBoosting hyperparameter tuning task. The dictionary has 11 RBF kernels with characteristic lengthscales given by {10^c}_{c=−4}^{6}.
The following two theorems further establish the cumulative Bayesian regret bounds of parallel EGP-TS in the asynchronous and synchronous settings, whose proofs are deferred to Appendices C–D, available online.

Theorem 2 (Asynchronously Parallel Setting): For K workers conducting parallel function evaluations asynchronously, EGP-TS under Assumption 1 incurs the following cumulative Bayesian regret over T function evaluations

BR_asy(T) ≤ c_1 √(ρ_K M T^{c+1} log T) + 2σ_n √(M T log T) + c_2.

Theorem 3 (Synchronously Parallel Setting): For K workers performing T function evaluations synchronously, the cumulative Bayesian regret of EGP-TS under Assumption 1 is bounded by

BR_syn(T) ≤ (K − 1) √(d log(K − 1)) + 2σ_n √(M T log T) + c_2 + c_3 √(ρ_K M T^{c+1} log T) + c_4 √(M T^{c+1} log(T + K − 1)),

where the two constants are given by c_3 := 2(2/log(1 + σ_n^{−2}))^{1/2}, and c_4 := (2√d/log(1 + σ_n^{−2}))^{1/2}.

The first term of the regret bound in Theorem 2 is ρ_K times its counterpart in Theorem 1 for the sequential setting. It can be easily verified that the Bayesian regret bounds of parallel EGP-TS become equivalent to that of the sequential setting when K = 1 with ρ_1 = 1. Note that the regret bounds for parallel EGP-TS here are for the number of evaluations, which will typically exceed the bound in the sequential setup. This can certainly be the other way around if the evaluation time is of interest [17]. In all three settings, the cumulative Bayesian regret bounds of EGP-TS boil down to O(√(M T^{c+1} log T)) after ignoring irrelevant constants, which is sublinear in the number of evaluations when 0 ≤ c < 1. Hence, EGP-TS enjoys a diminishing average regret per evaluation as T grows, thereby establishing convergence to the global optimum.

V. NUMERICAL TESTS

In this section, the performance of the proposed EGP-TS is tested on a set of benchmark synthetic functions, two robotic tasks, and the hyperparameter tuning tasks of three machine learning models. The competing baselines are GP-EI [34], the default method for many traditional BO problems, and TS-based methods, including GP-TS with a preselected kernel type, fully Bayesian GP-TS, as well as two ensemble approaches, namely BanditBO [15] and EXP3BO [16]. It is worth mentioning that the latter two, combining multi-armed bandits and BO, are originally designed for inputs with categorical variables, but are adapted as ensemble methods here with each "arm" referring to a GP model with the same input variables.

The kernel hyperparameters per GP for all the TS-based methods other than fully Bayesian GP-TS are obtained by maximizing the marginal likelihood using sklearn. GP-EI is implemented using BoTorch with the ARD kernel, whose hyperparameters are refitted each iteration. The fully Bayesian GP model hinges on a pre-defined kernel type where the kernel hyperparameters are assumed to be random variables. In the present work, the RBF kernel is considered and a uniform prior is assumed for the amplitude σ_θ^2, the characteristic lengthscale, and the noise variance σ_n^2, within the intervals [1, 100], [10^{−3}, 10^3] and [0.1, 0.3], respectively. The fully Bayesian GP-TS proceeds by first drawing a sample of the kernel hyperparameters using the GPyTorch and Pyro Python packages, based on which function sampling is conducted. Existing kernel selection methods for conventional GP learning operate in batch mode using a large number of samples, hence being unsuitable for the low-data BO setting. For initialization in all the methods, the first 10 evaluation pairs are randomly selected and used to obtain the kernel hyperparameters per GP by maximizing the marginal likelihood. In EGP-TS, the per-GP prior weight is set as uniform, i.e., w_0^m = 1/M ∀m. Unless stated otherwise, the hyperparameters are refitted every 50 iterations for EGP-TS, and every iteration for the rest of the baselines. RF approximation with 50 spectral features is leveraged by all the TS-based approaches for fairness in comparison. All the experiments are repeated 10 times, and the average performance and the standard deviation of all competing approaches are reported.

Additional results concerning ablation studies of the EGP-TS approach, runtime comparison, and the parallel setting are deferred to Appendix B in the supplementary file, available online.

A. Tests on Synthetic Functions

We tested the competing methods on a suite of standard synthetic functions for BO, including Ackley-5d, Zakharov, Drop-wave, as well as Eggholder, where the latter two are challenging functions with many local optima. The performance metric per slot t is given by the simple regret (SR), defined as SR(t) := f(x_*) − max_{τ∈{1,...,t}} f(x_τ). First, to explore the effect of the kernel functions in the (E)GP model, we tested GP-TS with the kernel function being RBF with and without automatic relevance determination (ARD), and Matérn kernels with ν = 3/2, 5/2. For all the ensemble methods, the kernel dictionary is comprised of the aforementioned four kernel functions. It is evident from Fig. 1 that the form of the kernel function plays an important role in the performance of GP-TS. Combining different kernel functions, EGP-TS not only yields substantially improved performance relative to its GP-TS counterparts, but also requires the least design effort on the choice of the kernel function. In addition, EGP-TS achieves lower simple regret than BanditBO and EXP3BO. Although GP-EI is superior to the GP-TS baselines on the Zakharov function, EGP-TS yields better performance relative to the former, which demonstrates the benefit of ensembling GP models. Upon fixing the kernel type as RBF without ARD and constructing the dictionary as 11 RBF functions with lengthscales given by {10^c}_{c=−4}^{6}, EGP-TS is also compared with fully Bayesian GP-based TS in addition to the aforementioned baselines. Still, EGP-TS outperforms all competitors, as shown in Fig. 2.
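The simple-regret curves reported in Figs. 1 and 2 can be computed from the evaluation history alone; a minimal sketch (with a hypothetical history) is given below.

```python
import numpy as np

def simple_regret(f_opt, y_history):
    """SR(t) = f(x*) - max_{tau <= t} f(x_tau), returned for every slot t."""
    return f_opt - np.maximum.accumulate(np.asarray(y_history, dtype=float))

# usage with a hypothetical history of evaluations (maximization, f(x*) = 0):
# simple_regret(0.0, [-3.2, -1.5, -1.9, -0.4]) -> array([3.2, 1.5, 1.5, 0.4])
```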
B. Robot Pushing Tasks

The second experiment concerns a practical task in robotics, where a robot adjusts its action so as to push an object towards a given goal location. By minimizing the distance between the target location and the end position of the pushed object, we tested two scenarios with 3 and 4 input variables following [37]. The former optimizes the 2-D position of the robot and the push duration, and the latter entails optimizing an additional push angle. We used the GitHub code¹ from [37] to generate the movement of the object pushed by the robot. Each scenario was repeated for 10 randomly selected goal locations, and the average performance of the competing methods is depicted in Fig. 3. Adaptively selecting the kernel function from the dictionary with 4 distinct forms (that is, RBF with(out) ARD, and Matérn with ν = 3/2, 5/2), the proposed EGP-TS outperforms all the competitors, including GP-EI, GP-TS with a preselected kernel, and the other two ensemble methods, as shown in Fig. 3(a). The superior performance of EGP-TS when the kernel function is fixed as RBF is also shown in Fig. 3(b), which is in accordance with Fig. 2. It is worth highlighting that EGP-TS not only outperforms fully Bayesian GP-TS in simple regret, but also runs much faster.

¹ https://github.com/zi-w/Max-value-Entropy-Search

Fig. 10. Simple regret of EGP-EI versus GP-EI with a preselected kernel on the Ackley-5d, Zakharov, DropWave and Eggholder functions (from left to right). The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

Fig. 11. Simple regret of EGP-EI versus GP-EI with a preselected kernel on the Robot pushing 3D and Robot pushing 4D tasks (from left to right). The dictionary has 4 kernels with distinct forms: RBF with(out) ARD and Matérn with ν = 3/2, 5/2.

C. Hyperparameter Tuning Tasks

TABLE I: FEASIBLE VALUES OF THE HYPERPARAMETERS FOR DIFFERENT CLASSIFICATION MODELS

The last test deals with hyperparameter tuning tasks for three classification models, including a 2-layer FNN with ReLU activation function, support vector machine (SVM), and gradient boosting (GB). Note that although the FNN architecture does not yield state-of-the-art classification performance, it suffices to be used to evaluate different BO methods. We tested all the competing baselines on the Breast cancer [38], Iris [39], Transportation [40], as well as Wine [41] datasets. For all the datasets, 70% of the data are used as the training set, and the remaining are used as the validation set based on which the classification accuracy is calculated. The hyperparameters of the FNN consist of the number of neurons per layer, the learning rate, and the batch size. As for SVM, the values of C and γ are to be tuned. For GB, the hyperparameters include the learning rate, the subsample ratio, and the ratio of maximum features. Table I summarizes the feasible values of the hyperparameters of the three methods. For each set of hyperparameters, the evaluated validation accuracy is obtained as the average of 10 independent
runs on a given dataset. In FNN training, the number of epochs is chosen to be 20 and the optimizer used is Adam.

Figs. 4, 5, 6, 7, 8, and 9 depict the best validation accuracy (so far) of the competing methods versus the number of iterations for these three classification models. Apparently, EGP-TS with different kernel types or with RBF kernels of different lengthscales is shown to outperform the competitors in most of the cases, demonstrating the robustness of the EGP model across tasks.

D. Preliminary Results for EGP-EI

Here, we couple the EGP surrogate with the EI acquisition rule [34], yielding the novel EGP-EI approach. In line with the proposed EGP-TS, one could first select a GP model by random sampling based on the weights w_t, and then implement the EI acquisition function based on the chosen GP model. To benchmark the performance of this advocated EGP-EI, a comparison has been made relative to GP-EI with a preselected kernel function. We use BoTorch to implement both EGP-EI and GP-EI, with kernel hyperparameters updated every iteration and without RF approximation. As shown in Figs. 10 and 11, EGP-EI outperforms GP-EI in three out of the four synthetic functions and in both of the robotic tasks, which demonstrates the benefits accompanying the more expressive EGP model for the EI acquisition function. Rather than sampling a single GP from the EGP, future work includes investigation of the EI rule based on the GP mixture function model (cf. Remark 1).

VI. CONCLUSION

This work introduced a non-Gaussian EGP prior with adaptive kernel selection for the sought black-box function in BO. Capitalizing on the RF approximation per GP, acquisition of the subsequent query point is effected via TS, which bypasses the need for design parameters and can readily afford parallel implementation. Convergence of the proposed EGP-TS algorithm has been established by sublinear cumulative Bayesian regret in both the sequential and parallel settings. Numerical tests demonstrated the merits of EGP-TS relative to existing alternatives. Future work includes investigation of other acquisition functions based on the novel EGP surrogate model, as well as analysis via the notion of frequentist regret.

VII. PROOFS

A. Proof of Theorem 1

Before performing the Bayesian regret analysis for EGP-TS, the following lemma will be first presented.

Lemma 3 (Supremum of a GP Sample Path [42]): If f ∼ GP(0, κ_m) is a continuous sample path for any m ∈ M, then E[‖f‖_∞] = B < ∞, and further

max_{x∈X} |f(x_*) − f(x)| ≤ 2B.

Lemma 3 holds when kernels are twice differentiable, which is readily satisfied under Assumption 1.

To bound the cumulative Bayesian regret of EGP-TS, we will rely on its link with the corresponding upper confidence bound algorithm in [13]. Conditioned on GP model m and the past data D_{t−1}, the high probability upper confidence bound for f(x) is given by U_t^m(x) := μ_{t−1}^m(x) + β_t^{1/2} σ_{t−1}^m(x), where β_t := 2 log(t^2 |X_t|). Here, X_t is obtained by discretizing each dimension of X using n_t = t^2 d a b √π equally spaced grids. Thus, |X_t| = (n_t)^d, and

β_t = 4(d + 1) log t + 2d log(d a b √π) ≈ d log t.   (17)

With [x]_t representing the closest point to x in X_t, it can be easily verified that

‖x − [x]_t‖_1 ≤ d/n_t,   ∀x ∈ X.   (18)

Consider next the following decomposition

BR(T) := \sum_{t=1}^T E[f(x_*) − f(x_t)]
 (a)= \sum_{t=1}^T E[f(x_*) − f([x_*]_t)]   (=: A_1)
 + \sum_{t=1}^T E[f([x_*]_t) − U_t^{m_*}([x_*]_t)]   (=: A_2)
 + \sum_{t=1}^T E[U_t^{m_*}([x_*]_t) − U_t^{m_t}([x_t]_t)]   (=: A_3)
 + \sum_{t=1}^T E[U_t^{m_t}([x_t]_t) − f([x_t]_t)]   (=: A_4)
 + \sum_{t=1}^T E[f([x_t]_t) − f(x_t)].   (=: A_5)   (19)

Since {x_t, m_t} and {x_*, m_*} are identically distributed given D_{t−1}, the fact that U_t^m(x) is a deterministic function of D_{t−1} yields A_3 = 0 [13], [21].

Next, we will provide an upper bound for A_1 and A_5 following the proof in [17]. Letting L_max = sup_{j∈{1,...,d}} sup_{x∈X} |∂f(x)/∂x_j|, the union bound under Assumption 1 implies that

Pr(L_max ≥ c) ≤ d a e^{−(c/b)^2},

which allows us to obtain

E[|f(x) − f([x]_t)|] (a)≤ E[L_max ‖x − [x]_t‖_1] ≤ (d/n_t) E[L_max]
 (b)= (d/n_t) ∫_{c=0}^∞ Pr(L_max ≥ c) dc ≤ (d/n_t) ∫_{c=0}^∞ d a e^{−(c/b)^2} dc = √π d^2 a b/(2 n_t) = d/(2t^2),

where (a) results from (18), and (b) utilizes for L_max ≥ 0 the equality E[L_max] = ∫_{c=0}^∞ Pr(L_max ≥ c) dc. Hence, A_1 and A_5 are bounded by

A_1 = A_5 ≤ \sum_{t=1}^T d/(2t^2) ≤ π^2 d/12.   (20)
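For completeness, the Gaussian-tail integral invoked in step (b) and the cancellation with n_t can be spelled out as follows; this is a short verification consistent with the definitions above rather than an additional step of the original proof.

```latex
\int_{0}^{\infty} d\,a\,e^{-(c/b)^{2}}\,dc
  = d\,a\,b\int_{0}^{\infty} e^{-u^{2}}\,du
  = \frac{\sqrt{\pi}\,d\,a\,b}{2},
\qquad\Longrightarrow\qquad
\frac{d}{n_t}\cdot\frac{\sqrt{\pi}\,d\,a\,b}{2}
  = \frac{\sqrt{\pi}\,d^{2}ab}{2\,n_t}
  = \frac{d}{2t^{2}}
\quad\text{since } n_t = t^{2}\,d\,a\,b\sqrt{\pi}.
```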
Further, A_2 can be upper bounded as

A_2 ≤ \sum_{t=1}^T E[I(f([x_*]_t) > U_t^{m_*}([x_*]_t)) (f([x_*]_t) − U_t^{m_*}([x_*]_t))]
 ≤ \sum_{t=1}^T \sum_{m∈M} \sum_{x∈X_t} E[I(f(x) > U_t^m(x)) (f(x) − U_t^m(x))]
 (a)= \sum_{t=1}^T \sum_{m∈M} \sum_{x∈X_t} σ_{t−1}^m(x)/(√(2π) t^2 |X_t|)
 (b)≤ \sum_{t=1}^T \sum_{m∈M} \sum_{x∈X_t} 1/(√(2π) t^2 |X_t|) = √(2π) M/12,   (21)

where, since f(x) − U_t^m(x)|D_{t−1} ∼ N(−β_t^{1/2} σ_{t−1}^m(x), (σ_{t−1}^m(x))^2), (a) holds using the identity E[r I(r > 0)] = (σ/√(2π)) exp(−μ^2/(2σ^2)) if r ∼ N(μ, σ^2) and μ < 0. Inequality (b) is simply due to σ_{t−1}^m(x) ≤ 1.

The last step is to upper bound A_4, by constructing a confidence set C_t for the latent state per slot t so that m_* ∈ C_t holds with high probability [21]. We will replace [x_t]_t by x_t for notational brevity, given that the following result holds for both cases. Consider C_t := {m ∈ M : G_t^m ≤ 2σ_n √(N_{t−1}^m log T)}, where N_{t−1}^m = \sum_{τ=1}^{t−1} I(m_τ = m), and

G_t^m := \sum_{τ=1}^{t−1} I(m_τ = m) (L_τ^m(x_τ) − y_τ).   (22)

Here, L_t^m(x) = μ_{t−1}^m(x) − η σ_{t−1}^m(x) with η = √(2 log T) is a lower confidence bound for f(x) conditioned on model m. For later use, we will first present the following two lemmas, whose proofs are deferred to Sections VII-A1 and VII-A2.

Lemma 4: It holds that Pr(m_* ∉ C_t | D_{t−1}) ≤ 2M T^{−1}, ∀t ∈ T := {1, ..., T}.

Lemma 5: It holds that E[μ_{t−1}^{m_t}(x_t) − f(x_t)] < 2B, ∀m_t ∈ M, x_t ∈ X, t ∈ T.

The following decomposition will be applied towards bounding A_4

A_4 = E[\sum_{t=1}^T (U_t^{m_t}(x_t) − μ_{t−1}^{m_t}(x_t))] + E[\sum_{t=1}^T (μ_{t−1}^{m_t}(x_t) − f(x_t))]
 ≤ \sum_{t=1}^T E[β_t^{1/2} σ_{t−1}^{m_t}(x_t)]   (=: A_{4,1})
 + \sum_{t=1}^T E[2B I(m_t ∉ C_t)]   (=: A_{4,3})
 + \sum_{t=1}^T E[(μ_{t−1}^{m_t}(x_t) − f(x_t)) I(m_t ∈ C_t)],   (=: A_{4,2})   (23)

where the last inequality holds based on Lemma 5.

As m_* and m_t are identically distributed given D_{t−1} [13], [21], Lemma 4 allows A_{4,3} to be readily bounded by

A_{4,3} = 2B \sum_{t=1}^T E[I(m_* ∉ C_t)] ≤ 4MB.   (24)

Meanwhile, we have that

A_{4,2} = \sum_{t=1}^T E[(μ_{t−1}^{m_t}(x_t) − L_t^{m_t}(x_t)) I(m_t ∈ C_t)] + \sum_{t=1}^T E[(L_t^{m_t}(x_t) − y_t) I(m_t ∈ C_t)]
 (a)≤ \sum_{t=1}^T E[η σ_{t−1}^{m_t}(x_t)] + \sum_{m∈M} E[G_{t_max^m}^m] + 2MB
 (b)≤ \sum_{t=1}^T E[η σ_{t−1}^{m_t}(x_t)] + \sum_{m∈M} E[2σ_n √(N_T^m log T)] + 2MB
 (c)≤ \sum_{t=1}^T E[η σ_{t−1}^{m_t}(x_t)] + 2σ_n √(M T log T) + 2MB,   (25)

where, with t_max^m being the last slot that m is selected, (a) holds by leveraging the definition of G^m and bounding L_{t_max^m}^m(x) − y_{t_max^m} by 2B; (b) comes from the definition of C_t; and (c) leverages the Cauchy-Schwarz inequality to yield

\sum_{m=1}^M √(N_T^m) ≤ √(M \sum_{m=1}^M N_T^m) = √(M T).   (26)

Putting together the bounds for A_1 – A_5, the cumulative Bayesian regret of EGP-TS over T evaluations is bounded by

BR(T) ≤ (η + β_T^{1/2}) \sum_{t=1}^T E[σ_{t−1}^{m_t}(x_t)] + 2σ_n √(M T log T) + 6MB + π^2 d/6 + √(2π) M/12,   (27)

where the first term can be bounded with T_T^m := {t ∈ T : m_t = m} and T_m := |T_T^m| as

\sum_{t=1}^T E[σ_{t−1}^{m_t}(x_t)] ≤ E[\sum_{m=1}^M \sum_{t∈T_T^m} σ_{t−1}^m(x_t)]
 (a)≤ \sum_{m=1}^M E[\sum_{t=1}^{T_m} σ_{t−1}^m(x_t)]
 (b)≤ \sum_{m=1}^M E[(T_m \sum_{t=1}^{T_m} (σ_{t−1}^m(x_t))^2)^{1/2}]
 (c)≤ \sum_{m=1}^M (2 T_m γ_{T_m}/log(1 + σ_n^{−2}))^{1/2}
 (d)≤ \sum_{m=1}^M (2 T_m^{1+c}/log(1 + σ_n^{−2}))^{1/2}
 (e)≤ (2 M T^{1+c}/log(1 + σ_n^{−2}))^{1/2},   (28)

where (a) holds since σ_t^m(x) decreases as t grows; (b) is due to the Cauchy-Schwarz inequality; (c) leverages Lemmas 5.3 and 5.4 of [35], which bound the sum of posterior variances via the MIG; (d) follows upon bounding γ_{T_m} using Lemma 1; and (e) holds upon utilizing the following inequality based on the Cauchy-Schwarz inequality

\sum_{m=1}^M (T_m^{1+c})^{1/2} ≤ (M \sum_{m=1}^M T_m^{1+c})^{1/2} ≤ (M (\sum_{m=1}^M T_m)^{1+c})^{1/2} = (M T^{1+c})^{1/2}.

Upon plugging (28) into (27), Theorem 1 holds with η = √(2 log T) and β_T^{1/2} ≈ √(d log T).

1) Proof for Lemma 4: For m_* ∈ M, define the following event at slot t

E_t^{m_*} := {|f(x) − μ_{t−1}^{m_*}(x)| ≤ η σ_{t−1}^{m_*}(x)},   (29)

the collection of which over T slots is E_{1:T}^{m_*} := ∩_{t=1}^T E_t^{m_*}. With Ē_{1:T}^{m_*} representing its complement, it follows that

E[I(Ē_{1:T}^{m_*})] ≤ \sum_{t=1}^T \sum_{m∈M} E[E_{t−1}[I(Ē_t^m)]]
 = \sum_{t=1}^T \sum_{m∈M} E[Pr_{t−1}(|μ_{t−1}^m(x) − f(x)| > η σ_{t−1}^m(x))]
 (a)≤ M T^{−1},   (30)

where (a) comes from the inequality Pr(|r| > η) ≤ e^{−η^2/2} with r = (μ_{t−1}^m(x) − f(x))/σ_{t−1}^m(x) ∼ N(0, 1) and η = √(2 log T).

Since n_τ = f(x_τ) − y_τ ∼ N(0, σ_n^2), {n_τ}_{τ∈T_t^m} is then a martingale difference sequence w.r.t. {D_τ}_{τ∈T_t^m}, where T_t^m := {τ | m_τ = m, τ ∈ {1, ..., t}}. Thus,

G_t^m I(E_{1:T}^m) = \sum_{τ∈T_t^m} (L_τ^m(x_τ) − y_τ) I(|f(x_τ) − μ_{τ−1}^m(x_τ)| ≤ η σ_{τ−1}^m(x_τ))
 = \sum_{τ∈T_t^m} (L_τ^m(x_τ) − y_τ) I(L_τ^m(x_τ) < f(x_τ)) ≤ \sum_{τ∈T_t^m} n_τ.   (31)

For any m ∈ M and t ∈ T, u = |T_t^m| is random and takes values from {1, ..., t − 1}. For any u, Azuma's inequality yields

Pr_{t−1}(G_t^m I(E_{1:T}^m) ≥ 2σ_n √(u log T)) ≤ Pr(\sum_{τ∈T_t^m} n_τ ≥ 2σ_n √(u log T)) ≤ exp(−2 log T) = T^{−2},

based on which we arrive at

Pr_t(m_* ∉ C_t) ≤ \sum_{m∈M} \sum_{u=1}^{t−1} Pr(G_t^m ≥ 2σ_n √(u log T))
 ≤ \sum_{m∈M} \sum_{u=1}^{t−1} E[Pr_{t−1}(G_t^m I(E_{1:T}^m) ≥ 2σ_n √(u log T))] + Pr(Ē_{1:T}^{m_*}) ≤ 2M T^{−1},   (32)

thus finalizing the proof of Lemma 4.

2) Proof for Lemma 5: Since {x_t, m_t} and {x_*, m_*} are identically distributed conditioned on D_{t−1}, it holds that

E[μ_{t−1}^{m_t}(x_t)] = E[E_{t−1}[μ_{t−1}^{m_t}(x_t)]] = E[E_{t−1}[μ_{t−1}^{m_*}(x_*)]] = E[μ_{t−1}^{m_*}(x_*)].

Further, the identity E_{t−1}[μ_{t−1}^{m_*}(x_*)] = E_{t−1}[f(x_*)] with f ∼ GP(0, κ_{m_*}) yields the following result

E[μ_{t−1}^{m_t}(x_t) − f(x_t)] = E[μ_{t−1}^{m_t}(x_t) − μ_{t−1}^{m_*}(x_*) + μ_{t−1}^{m_*}(x_*) − f(x_*) + f(x_*) − f(x_t)]
 = E[f(x_*) − f(x_t)] ≤ 2B,

where, thanks to Lemma 3, the last inequality holds.

B. Proof of Theorem 2

For the asynchronous parallel setting, the upper confidence bound for the t-th function evaluation is given by

Ū_t^m(x) := μ_{D_{t−1}}^m(x) + β_t^{1/2} σ_{D_{t−1}}^m(x),   (33)

where D_{t−1} contains all the acquired data before evaluation index t is assigned. Here, |D_{t−1}| = t − K for t > K, and |D_{t−1}| = 0 for t ≤ K.

Leveraging a decomposition similar to that in (19), A_1–A_3 and A_5 can be derived as in Section A. Upon replacing the subscript t − 1 of μ and σ by D_{t−1}, the term A_4 can be bounded as in (23), that is

A_4 ≤ \sum_{t=1}^T E[(β_t^{1/2} + η) σ_{D_{t−1}}^{m_t}(x_t)] + 2σ_n √(M T log T) + 6MB,   (34)

where the first term can be further bounded based on Lemma 2 and (28) as

\sum_{t=1}^T E[(β_t^{1/2} + η) σ_{D_{t−1}}^{m_t}(x_t)] (a)≤ (β_T^{1/2} + η) \sum_{t=1}^T E[ρ_K^{1/2} σ_{t−1}^{m_t}(x_t)]
 ≤ (2 + √d) (2 ρ_K M T^{c+1} log T/log(1 + σ_n^{−2}))^{1/2}.   (35)

Thus, the cumulative Bayesian regret for parallel EGP-TS in the asynchronous setup can be established as in Theorem 2.

C. Proof of Theorem 3

The proof of Theorem 3 entails introducing

V_t^m(x) := μ_{D_{t−1}}^m(x) + β_{t+K−1}^{1/2} σ_{t−1}^m(x),   (36)

based on which the cumulative Bayesian regret can be decomposed after using (33) as (cf. (19))

BR_syn(T) := \sum_{t=1}^T E[f(x_*) − f(x_t)]
LU et al.: SURROGATE MODELING FOR BAYESIAN OPTIMIZATION BEYOND A SINGLE GAUSSIAN PROCESS 11295

C. Proof of Theorem 3

The proof of Theorem 3 entails introducing
$$
V_t^m(\mathbf{x}) := \mu_{\mathcal{D}_{t-1}}^m(\mathbf{x}) + \beta_{t+K-1}^{1/2}\, \sigma_{t-1}^m(\mathbf{x}) \tag{36}
$$
based on which the cumulative Bayesian regret can be decomposed after using (33) as (cf. (19))
$$
\mathrm{BR}^{\mathrm{syn}}(T) := \sum_{t=1}^{T} \mathbb{E}\big[f(\mathbf{x}^*) - f(\mathbf{x}_t)\big]
\overset{(a)}{=} \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[f(\mathbf{x}^*) - f([\mathbf{x}^*]_t)\big]}_{:=C_1}
+ \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[f([\mathbf{x}^*]_t) - \bar{U}_t^{m^*}([\mathbf{x}^*]_t)\big]}_{:=C_2}
+ \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[\bar{U}_t^{m^*}([\mathbf{x}^*]_t) - V_t^{m^*}([\mathbf{x}^*]_t)\big]}_{:=C_3}
+ \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[V_t^{m^*}([\mathbf{x}^*]_t) - V_t^{m_t}([\mathbf{x}_t]_t)\big]}_{:=C_4}
+ \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[V_t^{m_t}([\mathbf{x}_t]_t) - f([\mathbf{x}_t]_t)\big]}_{:=C_5}
+ \underbrace{\sum_{t=1}^{T} \mathbb{E}\big[f([\mathbf{x}_t]_t) - f(\mathbf{x}_t)\big]}_{:=C_6}. \tag{37}
$$
As with the proof of Theorem 1, it follows that
$$
C_1 = C_6 \leq \frac{\pi^2 d}{12}, \qquad C_4 = 0, \qquad C_2 \leq \frac{\sqrt{2\pi M}}{12}. \tag{38}
$$
Next, we will further bound $C_3$ and $C_5$, starting with
$$
C_3 = \sum_{t=1}^{K-1} \beta_t^{1/2}\, \mathbb{E}\big[\sigma_{\mathcal{D}_{t-1}}^{m^*}([\mathbf{x}^*]_t)\big]
- \sum_{t=T-K+1}^{T} \beta_{t+K-1}^{1/2}\, \mathbb{E}\big[\sigma_{t-1}^{m^*}([\mathbf{x}^*]_t)\big]
+ \sum_{t=K}^{T} \beta_t^{1/2}\, \mathbb{E}\big[\sigma_{\mathcal{D}_{t-1}}^{m^*}([\mathbf{x}^*]_t) - \sigma_{t-K}^{m^*}([\mathbf{x}^*]_t)\big]
\overset{(a)}{\leq} (K-1)\, \beta_{K-1}^{1/2}
$$
where (a) holds since $\sigma_{\mathcal{D}_{t-1}}^{m^*}([\mathbf{x}^*]_t) \leq \sigma_{t-K}^{m^*}([\mathbf{x}^*]_t)$, and $0 < \sigma_t^{m^*}(\mathbf{x}) \leq 1$.

Lastly, $C_5$ can be bounded as
$$
C_5 = \mathbb{E}\bigg[\sum_{t=1}^{T} V_t^{m_t}(\mathbf{x}_t) - \mu_{\mathcal{D}_{t-1}}^{m_t}(\mathbf{x}_t)\bigg]
+ \mathbb{E}\bigg[\sum_{t=1}^{T} \mu_{\mathcal{D}_{t-1}}^{m_t}(\mathbf{x}_t) - f(\mathbf{x}_t)\bigg]
\leq \sum_{t=1}^{T} \mathbb{E}\big[\beta_{t+K-1}^{1/2}\, \sigma_{t-1}^{m_t}(\mathbf{x}_t)\big]
+ \sum_{t=1}^{T} \mathbb{E}\big[2B\, I(m_t \notin \mathcal{C}_t)\big]
+ \sum_{t=1}^{T} \mathbb{E}\Big[\big(\mu_{\mathcal{D}_{t-1}}^{m_t}(\mathbf{x}_t) - f(\mathbf{x}_t)\big)\, I(m_t \in \mathcal{C}_t)\Big]
$$
$$
\leq \sum_{t=1}^{T} \mathbb{E}\big[\beta_{t+K-1}^{1/2}\, \sigma_{t-1}^{m_t}(\mathbf{x}_t) + \eta\, \sigma_{\mathcal{D}_{t-1}}^{m_t}(\mathbf{x}_t)\big]
+ 2\sigma_n \sqrt{M T \log T} + 6 M B
\leq \sum_{t=1}^{T} \mathbb{E}\big[\beta_{T+K-1}^{1/2}\, \sigma_{t-1}^{m_t}(\mathbf{x}_t) + \eta \rho_K\, \sigma_{t-1}^{m_t}(\mathbf{x}_t)\big]
+ 2\sigma_n \sqrt{M T \log T} + 6 M B \tag{39}
$$
which, based on the derivation of (28), and the bounds of other factors, yields the regret bound in Theorem 3.
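Note that the six bracketed differences in (37) telescope, which confirms that the decomposition introduces no slack; for each round $t$,
$$
\big[f(\mathbf{x}^*) - f([\mathbf{x}^*]_t)\big]
+ \big[f([\mathbf{x}^*]_t) - \bar{U}_t^{m^*}([\mathbf{x}^*]_t)\big]
+ \big[\bar{U}_t^{m^*}([\mathbf{x}^*]_t) - V_t^{m^*}([\mathbf{x}^*]_t)\big]
+ \big[V_t^{m^*}([\mathbf{x}^*]_t) - V_t^{m_t}([\mathbf{x}_t]_t)\big]
+ \big[V_t^{m_t}([\mathbf{x}_t]_t) - f([\mathbf{x}_t]_t)\big]
+ \big[f([\mathbf{x}_t]_t) - f(\mathbf{x}_t)\big]
= f(\mathbf{x}^*) - f(\mathbf{x}_t),
$$
so summing over $t$ and taking expectations recovers $\mathrm{BR}^{\mathrm{syn}}(T) = C_1 + C_2 + C_3 + C_4 + C_5 + C_6$; the labels $C_1, \ldots, C_6$ only reorganize the sum.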
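To make the orders of magnitude concrete, the short Python script below numerically checks three ingredients that the bounds above rely on. It is an illustrative sketch rather than material from the derivation itself: the $1/(2t^2)$ per-round discretization error in item (i) is an assumed rate chosen to reproduce a $\pi^2/12$-type constant of the kind appearing in (38), and the values of $\sigma_n$, $M$, and $T$ are arbitrary examples.

```python
# Illustrative numerical checks for the constants in the regret bounds above.
# Standalone sketch; the 1/(2 t^2) rate, sigma_n, M, and T are assumptions.
import math

# (i) A 1/(2 t^2) per-round discretization error sums to pi^2 / 12,
#     the type of constant bounding C1 = C6 in (38).
partial = sum(1.0 / (2 * t**2) for t in range(1, 100_000))
print(f"sum_t 1/(2 t^2)  = {partial:.6f}   vs   pi^2/12 = {math.pi**2 / 12:.6f}")

# (ii) The union bound over M models and u = 1, ..., t-1 accumulates at most
#      M (t-1) T^{-2} <= M / T per round, vanishing as T grows.
M, T = 5, 1000
worst_round = max(M * (t - 1) / T**2 for t in range(1, T + 1))
print(f"max_t M(t-1)/T^2 = {worst_round:.4f}   <=   M/T = {M / T:.4f}")

# (iii) The 2*sigma_n*sqrt(M T log T) term in (39) grows sublinearly in T,
#       so the average regret it induces per round goes to zero.
sigma_n = 0.1
for T in (10**2, 10**4, 10**6):
    term = 2 * sigma_n * math.sqrt(M * T * math.log(T))
    print(f"T = {T:>8d} : 2*sigma_n*sqrt(M*T*log T) / T = {term / T:.6f}")
```

With these example values, all three quantities are small or decreasing in $T$, mirroring the sublinear-in-$T$ behavior of the dominant term in (39).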
ACKNOWLEDGMENTS

The authors would like to thank the anonymous reviewers for their constructive feedback.

Qin Lu (Member, IEEE) received the PhD degree in electrical engineering from the University of Connecticut (UConn), in 2018. Currently, she is a postdoctoral research associate with the University of Minnesota, Twin Cities. Her research interests span the areas of machine learning, data science, and network science, with special focus on Bayesian inference, Bayesian optimization, and spatio-temporal inference over graphs. In the past, she has worked on statistical signal processing, multiple target tracking, and underwater acoustic communications. She was awarded the Summer Fellowship and the Doctoral Dissertation Fellowship with UConn. She was also a recipient of the Women of Innovation Award in Collegian Innovation and Leadership by the Connecticut Technology Council in March 2018.

Konstantinos D. Polyzos (Student Member, IEEE) received the diploma degree from the Department of Electrical Engineering and Computer Technology, University of Patras, Greece, in 2018. Currently, he is working toward the PhD degree with the Department of Electrical and Computer Engineering (ECE), University of Minnesota (UMN) – Twin Cities. He is a member of the SPiNCOM Research Group under the supervision of Prof. Georgios B. Giannakis. His research interests span the areas of machine learning, signal processing, network science, and data science. Lately, he focuses on learning over graphs, which can model complex networks including financial, social, and biological ones, to list a few. In the past, he has worked on the development of automatic aerial target recognition systems using passive Radar data. He has been awarded the UMN ECE Department Fellowship (2019), the Gerondelis Foundation scholarship (2020), the Onassis Foundation scholarship (2021), the Best Paper Award at the CIT & DS 2019 International Conference (2019), and the Outstanding Reviewer Award (top 10%) at the International Conference on Machine Learning (ICML 2022).

Bingcong Li (Member, IEEE) received the MSc and PhD degrees in electrical and computer engineering (ECE) from the University of Minnesota (UMN), in 2019 and 2022, respectively. He is now with Huawei as a research engineer. His research interests lie in optimization and machine learning systems. He received the National Scholarship twice from China, in 2014 and 2015, and the UMN ECE Department Fellowship in 2017.

Georgios B. Giannakis (Fellow, IEEE) received the diploma in electrical engineering from the National Technical University of Athens, Greece, in 1981, and the MSc degree in electrical engineering, the MSc degree in mathematics, and the PhD degree in electrical engineering from the University of Southern California (USC), in 1983, 1986, and 1986, respectively. He was a faculty member with the University of Virginia from 1987 to 1998, and since 1999, he has been a professor with the University of Minnesota, where he holds an ADC endowed chair and a University of Minnesota McKnight presidential chair in ECE, and serves as director of the Digital Technology Center. His general interests span the areas of statistical learning, signal processing, communications, and networking, subjects on which he has published more than 480 journal papers, 780 conference papers, 25 book chapters, two edited books, and two research monographs. His current research focuses on data science and network science with applications to the Internet of Things and power networks with renewables. He is the (co-)inventor of 34 issued patents and the (co-)recipient of 10 best journal paper awards from the IEEE Signal Processing (SP) and Communications Societies, including the G. Marconi Prize Paper Award in Wireless Communications. He also received the IEEE-SPS Norbert Wiener Society Award (2019); EURASIP's A. Papoulis Society Award (2020); Technical Achievement Awards from the IEEE-SPS (2000) and from EURASIP (2005); the IEEE ComSoc Education Award (2019); and the IEEE Fourier Technical Field Award (2015). He is a member of the Academia Europaea, and a fellow of the National Academy of Inventors, the European Academy of Sciences, and EURASIP. He has served the IEEE in a number of posts, including that of a distinguished lecturer for IEEE-SPS.