Reading 3 - Russo & Van Roy 2014
This paper considers the use of a simple posterior sampling algorithm to balance between exploration and exploitation when
learning to optimize actions such as in multiarmed bandit problems. The algorithm, also known as Thompson Sampling and as
probability matching, offers significant advantages over the popular upper confidence bound (UCB) approach, and can be applied
to problems with finite or infinite action spaces and complicated relationships among action rewards. We make two theoretical
contributions. The first establishes a connection between posterior sampling and UCB algorithms. This result lets us convert
regret bounds developed for UCB algorithms into Bayesian regret bounds for posterior sampling. Our second theoretical
contribution is a Bayesian regret bound for posterior sampling that applies broadly and can be specialized to many model classes.
This bound depends on a new notion we refer to as the eluder dimension, which measures the degree of dependence among
action rewards. Compared to Bayesian regret bounds established for UCB algorithms on specific model classes, our general bound matches the
best available for linear models and is stronger than the best available for generalized linear models. Further, our analysis
provides insight into performance advantages of posterior sampling, which are highlighted through simulation results that
demonstrate performance surpassing recently proposed UCB algorithms.
1. Introduction. We consider an optimization problem faced by an agent who is uncertain about how actions influence performance. The agent selects actions sequentially, and upon each action observes a reward. A reward function governs the mean reward of each action. The agent represents his initial beliefs through a prior distribution over reward functions. As rewards are observed, the agent learns about the reward function, and this allows him to improve his behavior. Good performance requires adaptively sampling actions in a way that strikes an effective balance between exploring poorly understood actions and exploiting previously acquired knowledge to attain high rewards. In this paper, we study a simple algorithm for selecting actions and provide finite-time performance guarantees that apply across a broad class of models.
The problem we study has attracted a great deal of recent interest and is often referred to as the multiarmed bandit (MAB) problem with dependent arms. We refer to the problem as one of learning to optimize to emphasize its divergence from the classical MAB literature. In the typical MAB framework, there are a finite number of actions that are modeled independently; sampling one action provides no information about the rewards that can be gained through selecting other actions. In contrast, we allow for infinite action spaces and for general forms of model uncertainty, captured by a prior distribution over a set of possible reward functions. Recent papers have addressed this problem in cases where the relationship among action rewards takes a known parametric form. For example, Abbasi-Yadkori et al. [2], Dani et al. [16], and Rusmevichientong and Tsitsiklis [31] study the case where actions are described by a finite number of features and the reward function is linear in these features. Other authors have studied cases where the reward function is Lipschitz continuous (Bubeck et al. [13], Kleinberg et al. [23], Valko et al. [36]), sampled from a Gaussian process (Srinivas et al. [35]), or takes the form of a generalized (Filippi et al. [18]) or sparse (Abbasi-Yadkori et al. [3]) linear model.
Each paper cited above studies an upper confidence bound (UCB) algorithm. Such an algorithm forms an optimistic estimate of the mean reward value for each action, taking it to be the highest statistically plausible value. It then selects an action that maximizes these optimistic estimates. Optimism encourages selection of poorly understood actions, which leads to informative observations. As data accumulate, optimistic estimates are adapted, and this process of exploration and learning converges toward optimal behavior.
We study an alternative algorithm that we refer to as posterior sampling. It is also known as Thompson sampling and as probability matching. The algorithm randomly selects an action according to the probability it is optimal. Although posterior sampling was first proposed almost 80 years ago, it has until recently received little attention in the literature on MABs. While its asymptotic convergence has been established in some generality (May et al. [30]), not much else is known about its theoretical properties in the case of dependent arms, or even in the case of independent arms with general prior distributions. Our work provides some of the first theoretical guarantees.
Our interest in posterior sampling is motivated by several potential advantages over UCB algorithms, which we highlight in §4.3. While particular UCB algorithms can be extremely effective, performance and computational tractability depend critically on the confidence sets used by the algorithm. For any given model, there is a great deal of design flexibility in choosing the structure of these sets. Because posterior sampling avoids the need for explicit confidence bounds, its use greatly simplifies the design process and admits practical implementations in cases where UCB algorithms are computationally onerous. In addition, we show through simulations that posterior sampling outperforms various UCB algorithms that have been proposed in the literature.
In this paper, we make two theoretical contributions. The first establishes a connection between posterior sampling and UCB algorithms. In particular, we show that while the regret of a UCB algorithm can be bounded in terms of the confidence bounds used by the algorithm, the Bayesian regret of posterior sampling can be bounded in an analogous way by any sequence of confidence bounds. In this sense, posterior sampling preserves many of the appealing theoretical properties of UCB algorithms without requiring explicit, designed optimism. Further, due to this connection, existing analysis available for specific UCB algorithms immediately translates into Bayesian regret bounds for posterior sampling.
Our second theoretical contribution is a Bayesian regret bound for posterior sampling that applies broadly and can be specialized to many specific model classes. Our bound depends on a new notion of dimension that measures the degree of dependence among actions. We compare our notion of dimension to the Vapnik-Chervonenkis dimension and explain why that and other measures of dimension used in the supervised learning literature do not suffice when it comes to analyzing posterior sampling.
The remainder of this paper is organized as follows. The next section discusses related literature, and §3 provides a formal problem statement. We describe UCB and posterior sampling algorithms in §4. We establish in §5 a connection between them, which we apply in §6 to convert existing bounds for UCB algorithms into bounds for posterior sampling. Section 7 develops a new notion of dimension and presents Bayesian regret bounds that depend on it. Section 8 presents simulation results. A closing section makes concluding remarks.
2. Related literature. One distinction of results presented in this paper is that they center around Bayesian regret as a measure of performance. In the next subsection, we discuss this choice and how it relates to measures used in other work. Following that, we review prior results and their relation to the results of this paper.
2.1. Measures of performance. Several recent papers have established theoretical results on posterior sampling. One difference between this work and ours is that we focus on a different measure of performance. These papers all study the algorithm's regret, which measures its cumulative loss relative to an algorithm that always selects the optimal action, for some fixed reward function. To derive these bounds, each paper fixes an uninformative prior distribution with a convenient analytic structure, and studies posterior sampling when this particular prior is used. With one exception (Agrawal and Goyal [6]), the focus is on the classical MAB problem, where sampling one action provides no information about others.
Posterior sampling can be applied to a much broader class of problems, and one of its greatest strengths is its ability to incorporate prior knowledge in a flexible and coherent way. We therefore aim to provide results that accommodate the use of a wide range of models. Accordingly, most of our results allow for an arbitrary prior distribution over a particular class of mean reward functions. To derive meaningful results at this level of generality, we study the algorithm's expected regret, where the expectation is taken with respect to the prior distribution over reward functions. This quantity is sometimes called the algorithm's Bayesian regret. We find it to be a practically relevant measure of performance and find this choice allows for more elegant analysis. Further, as we discuss in §3, the Bayesian regret bounds we provide in some cases immediately yield regret bounds. In addition, studying Bayesian regret reveals deep connections between posterior sampling and the principle of optimism in the face of uncertainty, which we feel provides new conceptual insight into the algorithm's performance.
Optimism in the face of uncertainty is a general principle and is not inherently tied to a particular measure of performance. Indeed, algorithms based on this principle have been shown to be asymptotically efficient in terms of both regret (Lai and Robbins [26]) and Bayesian regret (Lai [25]), to satisfy order-optimal minimax regret bounds (Audibert and Bubeck [8]), to satisfy order-optimal bounds on regret and Bayesian regret when the reward function is linear (Rusmevichientong and Tsitsiklis [31]), and to satisfy strong bounds when the reward function is sampled from a Gaussian process prior (Srinivas et al. [35]). We take a very general view of optimistic algorithms, allowing upper confidence bounds to be constructed in an essentially arbitrary way based on the history of observations and possibly the prior distribution over reward functions.
2.2. Related results. Though it was first proposed in 1933, posterior sampling has until recently received relatively little attention. Interest in the algorithm grew after empirical studies (Chapelle and Li [15]) demonstrated performance exceeding state-of-the-art methods. An asymptotic convergence result was established by May et al. [30], but finite-time guarantees remain limited. The development of further performance bounds was raised as an open problem at the 2012 Conference on Learning Theory (Li and Chapelle [29]).
Three recent papers (Agrawal and Goyal [4, 5]; Kauffmann et al. [22]) provide regret bounds for posterior sampling when applied to MAB problems with finitely many independent actions and rewards that follow Bernoulli processes. These results demonstrate that posterior sampling is asymptotically optimal for the class of problems considered. A key feature of the bounds is their dependence on the difference between the optimal and suboptimal mean reward values. Such bounds tend not to be meaningful when the number of actions is large or the gaps are small, though they can sometimes be converted to bounds that are independent of this gap.

In this paper, we establish distribution-independent bounds. When the action space $\mathcal{A}$ is finite, we establish a finite-time Bayesian regret bound of order $\sqrt{|\mathcal{A}|\,T\log T}$. This matches what is implied by the analysis of Agrawal and Goyal [5]. However, our bound does not require that actions be modeled independently, and our approach leads to meaningful bounds for problems with large or infinite action sets.
Only one other paper has studied posterior sampling in a context involving dependent actions (Agrawal and Goyal [6]). That paper considers a contextual bandit model with arms whose mean reward values are given by a $d$-dimensional linear model. The cumulative $T$-period regret is shown to be of order $(d^2/\epsilon)\sqrt{T^{1+\epsilon}}$ with probability at least $1-\delta$. Here $\epsilon \in (0,1)$ is a parameter used by the algorithm that controls how quickly the posterior distribution concentrates. The Bayesian regret bounds we will establish are stronger than those implied by the results of Agrawal and Goyal [6]. In particular, we provide a Bayesian regret bound of order $d\sqrt{T}\log T$ that holds for any compact set of actions. This is order optimal up to a factor of $\log T$ (Rusmevichientong and Tsitsiklis [31]).
We are also the first to establish finite-time performance bounds for several other problem classes. One applies to
linear models when the vector of coefficients is likely to be sparse; this bound is stronger than the aforementioned
one that applies to linear models in the absence of sparsity assumptions. We establish the first bounds for posterior
sampling when applied to generalized linear models and to problems with a general Gaussian prior. Finally, we
establish bounds that apply very broadly and depend on a new notion of dimension.
Unlike most of the relevant literature, we study MAB problems in a general framework, allowing for complicated
relationships between the rewards generated by different actions. The closest related work is that of Amin et al. [7],
who consider the problem of learning the optimum of a function that lies in a known, but otherwise arbitrary set of
functions. They provide bounds based on a new notion of dimension, but unfortunately, this notion does not
provide a bound for posterior sampling. Other work provides general bounds for contextual bandit problems where
the context space is allowed to be infinite, but the action space is small (see, e.g., Beygelzimer et al. [10]). Our
model captures contextual bandits as a special case, but we emphasize problem instances with large or infinite
action sets, and where the goal is to learn without sampling every possible action.
A focus of our paper is the connection between posterior sampling and UCB approaches. We discuss UCB
algorithms in some detail in §4. UCB algorithms have been the primary approach considered in the segment of
the stochastic MAB literature that treats models with dependent arms. Other approaches are the knowledge
gradient algorithm (Ryzhov et al. [32]), forced exploration schemes for linear bandits (Abbasi-Yadkori et al. [1],
Rusmevichientong and Tsitsiklis [31], Deshpande and Montanari [17]), and exponential weighting schemes
(Beygelzimer et al. [10]).
There is an immense and rapidly growing literature on bandits with independent arms and on adversarial bandits.
Theoretical work on stochastic bandits with independent arms often focuses on UCB algorithms (Auer et al. [9],
Lai and Robbins [26]), or on the Gittins index approach (Gittins and Jones [19]). Bubeck and Cesa-Bianchi [11]
provide a review of work on UCB algorithms and on adversarial bandits. Gittins et al. [20] cover work on Gittins
indices and related extensions.
Since an initial version of this paper was made publicly available, the literature on the analysis of posterior
sampling has rapidly grown. Korda et al. [24] extend their earlier work (Kauffmann et al. [22]) to the case where
reward distributions lie in the one-dimensional exponential family. Bubeck and Liu [12] combine the regret
decomposition we derive in §5 with the confidence bound analysis of Audibert and Bubeck [8] to tighten the
bound provided in §6.1, and also consider a problem setting where the regret of posterior sampling is bounded
uniformly over time. Li [28] explores a connection between posterior sampling and exponential weighting schemes,
and Gopalan et al. [21] study the asymptotic growth rate of regret in problems with dependent arms.
3. Problem formulation. We consider a model involving a set of actions $\mathcal{A}$ and a set of real-valued functions $\mathcal{F} = \{f_\rho : \mathcal{A} \mapsto \mathbb{R} \mid \rho \in \Theta\}$, indexed by a parameter that takes values from an index set $\Theta$. We will define random variables with respect to a probability space $(\Omega, \mathbb{F}, \mathbb{P})$. A random variable $\theta$ indexes the true reward function $f_\theta$. At each time $t$, the agent is presented with a possibly random subset $\mathcal{A}_t \subseteq \mathcal{A}$ and selects an action $A_t \in \mathcal{A}_t$, after which she observes a reward $R_t$.

We denote by $H_t$ the history $(\mathcal{A}_1, A_1, R_1, \ldots, \mathcal{A}_{t-1}, A_{t-1}, R_{t-1}, \mathcal{A}_t)$ of observations available to the agent when choosing an action $A_t$. The agent employs a policy $\pi = \{\pi_t \mid t \in \mathbb{N}\}$, which is a deterministic sequence of functions, each mapping the history $H_t$ to a probability distribution over actions $\mathcal{A}$. For each realization of $H_t$, $\pi_t(H_t)$ is a distribution over $\mathcal{A}$ with support $\mathcal{A}_t$, though with some abuse of notation, we will often write this distribution as $\pi_t$. The action $A_t$ is selected by sampling from the distribution $\pi_t$, so that $\mathbb{P}(A_t \in \cdot \mid \pi_t) = \pi_t(\cdot)$.

We assume that $\mathbb{E}[R_t \mid H_t, \theta, A_t] = f_\theta(A_t)$. In other words, the realized reward is the mean reward perturbed by zero-mean noise. We will also assume that for each $f \in \mathcal{F}$ and $t \in \mathbb{N}$, $\arg\max_{a \in \mathcal{A}_t} f(a)$ is nonempty with probability one, though algorithms and results can be generalized to handle cases where this assumption does not hold.
The $T$-period regret of a policy $\pi$ is $\text{Regret}(T, \pi, \theta) := \sum_{t=1}^{T}\mathbb{E}\bigl[f_\theta(A_t^*) - f_\theta(A_t) \mid \theta\bigr]$, where $A_t^* \in \arg\max_{a \in \mathcal{A}_t} f_\theta(a)$ denotes the optimal action at time $t$. The $T$-period Bayesian regret is defined by $\text{BayesRegret}(T, \pi) := \mathbb{E}[\text{Regret}(T, \pi, \theta)]$, where the expectation is taken with respect to the prior distribution over $\theta$. Hence
$$\text{BayesRegret}(T, \pi) = \mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A_t^*) - f_\theta(A_t)\bigr].$$
Remark 1. Measurability assumptions are required for the above expectations to be well defined. To avoid
technicalities that do not present fundamental obstacles in the contexts we consider, we will not explicitly address
measurability issues in this paper and instead simply assume that functions under consideration satisfy conditions
that ensure relevant expectations are well defined.
Remark 2. All equalities between random variables in this paper hold almost surely with respect to the
underlying probability space.
3.1. On regret and Bayesian regret. To interpret results about the regret and Bayesian regret of various
algorithms and to appreciate their practical implications, it is useful to take note of several properties of and
relationships between these performance measures. For starters, asymptotic bounds on Bayesian regret are
essentially asymptotic bounds on regret. In particular, if $\text{BayesRegret}(T, \pi) = O(g(T))$ for some nonnegative
function $g$, then an application of Markov's inequality shows $\text{Regret}(T, \pi, \theta) = O_P(g(T))$. Here $O_P$ indicates that
$\text{Regret}(T, \pi, \theta)/g(T)$ is stochastically bounded under the prior distribution. In other words, for all $\epsilon > 0$, there
exists $M > 0$ such that
$$\mathbb{P}\left(\frac{\text{Regret}(T, \pi, \theta)}{g(T)} \ge M\right) \le \epsilon \quad \text{for all } T \in \mathbb{N}.$$
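As a brief sketch of the step behind this claim (nothing here beyond Markov's inequality and the definition of Bayesian regret):

```latex
% Markov's inequality applied to the nonnegative random variable Regret(T, pi, theta):
\[
\mathbb{P}\!\left(\frac{\operatorname{Regret}(T,\pi,\theta)}{g(T)} \ge M\right)
\;\le\; \frac{\mathbb{E}\,\operatorname{Regret}(T,\pi,\theta)}{M\,g(T)}
\;=\; \frac{\operatorname{BayesRegret}(T,\pi)}{M\,g(T)}.
\]
% If BayesRegret(T, pi) <= c*g(T) for all T, choosing M >= c/epsilon makes the right-hand side at most epsilon.
```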
This observation can be further extended to establish a sense in which Bayesian regret is robust to prior mis-specification. In particular, if the agent's prior over $\theta$ is $\mu$ but for convenience he selects actions as though his prior were an alternative $\tilde\mu$, the resulting Bayesian regret satisfies
$$\mathbb{E}_{\theta_0 \sim \mu}\bigl[\text{Regret}(T, \pi, \theta_0)\bigr] \;\le\; \left\|\frac{d\mu}{d\tilde\mu}\right\|_{\tilde\mu, \infty} \mathbb{E}_{\theta_0 \sim \tilde\mu}\bigl[\text{Regret}(T, \pi, \theta_0)\bigr],$$
where $d\mu/d\tilde\mu$ is the Radon-Nikodym derivative$^1$ of $\mu$ with respect to $\tilde\mu$ and $\|\cdot\|_{\tilde\mu,\infty}$ is the essential supremum magnitude with respect to $\tilde\mu$. Note that the final term on the right-hand side is the Bayesian regret for a problem with prior $\tilde\mu$ without mis-specification.
It is also worth noting that an algorithm's Bayesian regret can only differ significantly from its worst-case
regret if regret varies significantly depending on the realization of 6. This provides one method of converting
$^1$ Note that the Radon-Nikodym derivative is only well defined when $\mu$ is absolutely continuous with respect to $\tilde\mu$.
Bayesian regret bounds to regret bounds. For example, consider the linear model $f_\theta(a) = \theta^\top\phi(a)$, where $\Theta = \{\rho \in \mathbb{R}^d : \|\rho\|_2 = s\}$ is the boundary of a hypersphere in $\mathbb{R}^d$. Let $\mathcal{A}_t = \mathcal{A}$ for each $t$ and let the set of feature vectors be $\{\phi(a) : a \in \mathcal{A}\} = \{u \in \mathbb{R}^d : \|u\|_2 \le 1\}$. Consider a problem instance where $\theta$ is uniformly distributed over $\Theta$, and the noise terms $R_t - f_\theta(A_t)$ are independent of $\theta$. By symmetry, the regret of most reasonable algorithms for this problem should be the same for all realizations of $\theta$, and indeed this is the case for posterior sampling. Therefore, in this setting, Bayesian regret is equal to worst-case regret. This view also suggests that to attain strong minimax regret bounds, one should not choose a uniform prior as in Agrawal and Goyal [5], but should instead place more prior weight on the worst-possible realizations of $\theta$ (see the discussion of "least favorable" prior distributions in Lehmann and Casella [27]).
3.2. On changing action sets. Our stochastic model of action sets $\mathcal{A}_t$ is distinct relative to most of the MAB literature, which assumes that $\mathcal{A}_t = \mathcal{A}$. This construct allows our formulation to address a variety of practical issues that are usually viewed as beyond the scope of standard MAB formulations. Let us provide three examples.

Example 1 (Contextual Models). The contextual MAB model is a special case of the formulation presented above. In such a model, an exogenous Markov process $X_t$ taking values in a set $\mathcal{X}$ influences rewards. In particular, the expected reward at time $t$ is given by $f_\theta(a, X_t)$. However, this is mathematically equivalent to a problem with stochastic time-varying decision sets $\mathcal{A}_t$. In particular, one can define the set of actions to be the set of state-action pairs $\mathcal{A} := \{(x, a) : x \in \mathcal{X},\, a \in \mathcal{A}(x)\}$, and the set of available actions to be $\mathcal{A}_t = \{(X_t, a) : a \in \mathcal{A}(X_t)\}$.
Example 2 (Cautious Actions). In some applications, one may want to explore without risking terrible performance. This can be accomplished by restricting the set $\mathcal{A}_t$ to conservative actions. Then, the instantaneous regret in our framework is the gap between the reward from the chosen action and the reward from the best conservative action. In many settings, the Bayesian regret bounds we will establish for posterior sampling imply that the algorithm either attains near-optimal performance or converges to a point where any better decision is unacceptably risky.

A number of formulations of this flavor are amenable to efficient implementations of posterior sampling. For example, consider a problem where $\mathcal{A}$ is a polytope or ellipsoid in $\mathbb{R}^d$ and $f_\theta(a) = \langle a, \theta\rangle$. Suppose $\theta$ has a Gaussian prior and that reward noise is Gaussian. Then, the posterior distribution of $\theta$ is Gaussian. Consider the ellipsoidal confidence set $\mathcal{E}_t = \{u : \|u - \mu_t\|_{\Sigma_t^{-1}} \le \beta\}$, for some scalar constant $\beta > 0$, where $\mu_t$ and $\Sigma_t$ are the mean and covariance matrix of $\theta$, conditioned on $H_t$. One can attain good worst-case performance with high probability by solving the robust optimization problem $V^{\text{robust}} := \max_{a \in \mathcal{A}}\min_{u \in \mathcal{E}_t}\langle a, u\rangle$, which is a tractable linear saddle-point problem. Letting our cautious set be given by
$$\mathcal{A}_t := \Bigl\{a \in \mathcal{A} : \min_{u \in \mathcal{E}_t}\langle a, u\rangle \ge \alpha V^{\text{robust}}\Bigr\}$$
for some scalar constant $\alpha > 0$, we can then select an optimal cautious action given a posterior sample $\hat\theta_t$ by solving $\max_{a \in \mathcal{A}_t}\langle a, \hat\theta_t\rangle$, which is equivalent to
$$\text{maximize } \langle a, \hat\theta_t\rangle \quad \text{subject to } a \in \mathcal{A},\;\; \min_{u \in \mathcal{E}_t}\langle a, u\rangle \ge \alpha V^{\text{robust}}.$$
Example 3 (Adaptive Adversaries). Consider a model in which rewards are influenced by the choices of an adaptive adversary. At each time period, the adversary selects an action $A_t^-$ from some set $\mathcal{A}^-$ based on past observations. The agent observes this action, responds with an action $A_t^+$ selected from a set $\mathcal{A}^+$, and receives a reward that depends on the pair of actions $(A_t^+, A_t^-)$. This fits our framework if the action $A_t$ is taken to be the pair $(A_t^+, A_t^-)$, and the set of actions available to the agent is $\mathcal{A}_t = \{(a, A_t^-) \mid a \in \mathcal{A}^+\}$.
4. Algorithms. We will establish finite-time performance bounds for posterior sampling by leveraging prior
results pertaining to UCB algorithms and a connection we will develop between the two classes of algorithms.
To set the stage for our analysis, we discuss the algorithms in this section.
4.1. UCB algorithms. UCB algorithms have received a great deal of attention in the MAB literature. Such an algorithm makes use of a sequence of UCBs $U = \{U_t \mid t \in \mathbb{N}\}$, each of which is a function that takes the history $H_t$ as its argument. For each realization of $H_t$, $U_t(H_t)$ is a function mapping $\mathcal{A}$ to $\mathbb{R}$. With some abuse of notation, we will often write this function as $U_t$ and its value at $a \in \mathcal{A}$ as $U_t(a)$. The UCB $U_t(a)$ represents the greatest value of $f_\theta(a)$ that is statistically plausible given $H_t$. A UCB algorithm selects an action $A_t \in \arg\max_{a \in \mathcal{A}_t} U_t(a)$ that maximizes the UCB. We will assume that the argmax operation breaks ties among optima in a deterministic way. As such, each action is determined by the history $H_t$, and for the policy $\pi = \{\pi_t \mid t \in \mathbb{N}\}$ followed by a UCB algorithm, each action distribution $\pi_t$ concentrates all probability on a single action.
As a concrete example, consider Algorithm 1 proposed by Auer et al. [9] to address MAB problems with a finite number of independent actions. For such problems, $\mathcal{A}_t = \mathcal{A}$, $\theta$ is a vector with one independent component per action, and the reward function is given by $f_\theta(a) = \theta_a$. The algorithm begins by selecting each action once. Then, for each subsequent time $t$, the algorithm generates point estimates of action rewards, forms UCBs based on them, and selects actions accordingly. For each action $a$, the point estimate $\hat\theta_t(a)$ is taken to be the average reward obtained from samples of action $a$ taken prior to time $t$. The UCB is produced by adding an "uncertainty bonus" $\beta\sqrt{\log t / N_t(a)}$ to the point estimate, where $N_t(a)$ is the number of times action $a$ was selected prior to time $t$ and $\beta$ is an algorithm parameter generally selected based on reward variances. This uncertainty bonus leads to an optimistic assessment of expected reward when there is uncertainty, and it is this optimism that encourages exploration that reduces uncertainty. As $N_t(a)$ increases, uncertainty about action $a$ diminishes and so does the uncertainty bonus. The $\log t$ term ensures that the agent does not permanently rule out any action, which is important as there is always some chance of obtaining an overly pessimistic estimate by observing an unlucky sequence of rewards.
2. Select Action:
$$A_t \in \arg\max_{a \in \mathcal{A}}\; \hat\theta_t(a) + \beta\sqrt{\frac{\log t}{N_t(a)}}$$
4. Increment $t$ and Go to Step 2.
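For illustration, the following Python sketch implements this style of UCB rule. The Gaussian reward noise, the particular value of $\beta$, and the round-robin initialization are assumptions of the example rather than prescriptions of the paper.

```python
import numpy as np

def ucb_bandit(reward_means, T=1000, beta=2.0, rng=np.random.default_rng(0)):
    """Sketch of the Algorithm 1 rule: add an uncertainty bonus beta*sqrt(log t / N_t(a))."""
    K = len(reward_means)
    counts = np.zeros(K)          # N_t(a): number of times each arm has been played
    estimates = np.zeros(K)       # hat{theta}_t(a): empirical mean reward of each arm
    total_reward = 0.0
    for t in range(1, T + 1):
        if t <= K:                # play each arm once to initialize the estimates
            a = t - 1
        else:
            bonus = beta * np.sqrt(np.log(t) / counts)
            a = int(np.argmax(estimates + bonus))
        r = rng.normal(reward_means[a], 1.0)             # noisy reward (illustrative Gaussian noise)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]   # running average update
        total_reward += r
    return total_reward

print(ucb_bandit(reward_means=[0.1, 0.5, 0.9]))
```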
Our second example treats a linear bandit problem. Here, we assume $\theta$ is drawn from a normal distribution $N(\mu_0, \Sigma_0)$ but without assuming that the covariance matrix is diagonal. We consider a linear reward function $f_\theta(a) = \langle\phi(a), \theta\rangle$ and assume the reward noise $R_t - f_\theta(A_t)$ is normally distributed and independent from $(H_t, A_t, \theta)$. One can show that, conditioned on the history $H_t$, $\theta$ remains normally distributed. Algorithm 2 presents a UCB algorithm for this problem. The expectations can be computed efficiently via Kalman filtering. The algorithm employs the UCB $\langle\phi(a), \mu_t\rangle + \beta\log(t)\|\phi(a)\|_{\Sigma_t}$. The term $\|\phi(a)\|_{\Sigma_t}^2$ captures the posterior variance of $\theta$ in the direction $\phi(a)$, and, as with the case of independent arms, causes the uncertainty bonus $\beta\log(t)\|\phi(a)\|_{\Sigma_t}$ to diminish as the number of observations increases.
Algorithm 2 (Linear UCB)
1. Update Statistics:
$\mu_t \leftarrow \mathbb{E}[\theta \mid H_t]$
$\Sigma_t \leftarrow \mathbb{E}[(\theta - \mu_t)(\theta - \mu_t)^\top \mid H_t]$
2. Select Action:
$A_t \in \arg\max_{a \in \mathcal{A}}\{\langle\phi(a), \mu_t\rangle + \beta\log(t)\|\phi(a)\|_{\Sigma_t}\}$
3. Increment $t$ and Go to Step 1.
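A minimal numpy sketch of this linear-Gaussian UCB rule follows; the standard-normal prior, the noise standard deviation, and the value of $\beta$ are illustrative assumptions, and the conjugate Gaussian update plays the role of the Kalman filter step mentioned above.

```python
import numpy as np

def linear_ucb(Phi, theta_true, T=500, beta=0.5, noise_sd=1.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 2: select argmax_a <phi(a), mu_t> + beta*log(t)*||phi(a)||_{Sigma_t}."""
    n, d = Phi.shape
    P = np.eye(d)                 # posterior precision (prior theta ~ N(0, I), an assumption)
    b = np.zeros(d)               # precision-weighted mean
    regret = 0.0
    best = np.max(Phi @ theta_true)
    for t in range(1, T + 1):
        Sigma = np.linalg.inv(P)
        mu = Sigma @ b
        widths = np.sqrt(np.einsum('ij,jk,ik->i', Phi, Sigma, Phi))   # ||phi(a)||_{Sigma_t}
        a = int(np.argmax(Phi @ mu + beta * np.log(t + 1.0) * widths))
        r = Phi[a] @ theta_true + rng.normal(0.0, noise_sd)
        # Conjugate Gaussian update (a Kalman filter step for a static state).
        P += np.outer(Phi[a], Phi[a]) / noise_sd**2
        b += Phi[a] * r / noise_sd**2
        regret += best - Phi[a] @ theta_true
    return regret

Phi = np.random.default_rng(1).normal(size=(20, 3))   # 20 actions, 3 features (illustrative)
print(linear_ucb(Phi, theta_true=np.array([1.0, -0.5, 0.3])))
```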
4.2. Posterior sampling. The posterior sampling algorithm simply samples each action according to the probability it is optimal. In particular, the algorithm applies action sampling distributions $\pi_t = \mathbb{P}(A_t^* \in \cdot \mid H_t)$, where $A_t^*$ is a random variable that satisfies $A_t^* \in \arg\max_{a \in \mathcal{A}_t} f_\theta(a)$. Practical implementations typically operate by, at each time $t$, sampling an index $\hat\theta_t \in \Theta$ from the distribution $\mathbb{P}(\theta \in \cdot \mid H_t)$ and then generating an action $A_t \in \arg\max_{a \in \mathcal{A}_t} f_{\hat\theta_t}(a)$. To illustrate, let us provide concrete examples that address problems analogous to Algorithms 1 and 2.

Our first example involves a model with independent arms. In particular, suppose $\theta$ is drawn from a normal distribution $N(\mu_0, \Sigma_0)$ with a diagonal covariance matrix $\Sigma_0$, the reward function is given by $f_\theta(a) = \theta_a$, and the reward noise $R_t - f_\theta(A_t)$ is normally distributed and independent from $(H_t, A_t, \theta)$. It then follows that, conditioned on the history $H_t$, $\theta$ remains normally distributed with independent components. Algorithm 3 provides an implementation of posterior sampling for this problem. The expectations are easy to compute and can be computed recursively.
Algorithm 3 (Independent posterior sampling)
1. Sample Model:
$\hat\theta_t \sim N(\mu_{t-1}, \Sigma_{t-1})$
2. Select Action:
$A_t \in \arg\max_{a \in \mathcal{A}}\hat\theta_t(a)$
3. Update Statistics: For each $a$,
$\mu_{t,a} \leftarrow \mathbb{E}[\theta_a \mid H_t]$
$\Sigma_{t,aa} \leftarrow \mathbb{E}[(\theta_a - \mu_{t,a})^2 \mid H_t]$
4. Increment $t$ and Go to Step 1.
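The following sketch is one illustrative implementation of Algorithm 3, assuming a standard-normal prior on each arm's mean reward and Gaussian noise with known standard deviation; these choices are assumptions of the example.

```python
import numpy as np

def gaussian_thompson(reward_means, T=1000, noise_sd=1.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 3 with independent N(0, 1) priors on each arm's mean reward."""
    K = len(reward_means)
    post_mean = np.zeros(K)
    post_var = np.ones(K)
    total = 0.0
    for _ in range(T):
        # 1. Sample a model from the posterior.
        theta_hat = rng.normal(post_mean, np.sqrt(post_var))
        # 2. Act greedily with respect to the sample.
        a = int(np.argmax(theta_hat))
        r = rng.normal(reward_means[a], noise_sd)
        # 3. Conjugate normal-normal posterior update for the sampled arm only.
        prec = 1.0 / post_var[a] + 1.0 / noise_sd**2
        post_mean[a] = (post_mean[a] / post_var[a] + r / noise_sd**2) / prec
        post_var[a] = 1.0 / prec
        total += r
    return total

print(gaussian_thompson([0.1, 0.5, 0.9]))
```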
Our second example treats a linear bandit problem. Algorithm 4 presents a posterior sampling analog to Algorithm 2. As before, we assume $\theta$ is drawn from a normal distribution $N(\mu_0, \Sigma_0)$. We consider a linear reward function $f_\theta(a) = \langle\phi(a), \theta\rangle$ and assume the reward noise $R_t - f_\theta(A_t)$ is normally distributed and independent from $(H_t, A_t, \theta)$.
Algorithm 4 (Linear posterior sampling)
1. Sample Model:
$\hat\theta_t \sim N(\mu_{t-1}, \Sigma_{t-1})$
2. Select Action:
$A_t \in \arg\max_{a \in \mathcal{A}}\langle\phi(a), \hat\theta_t\rangle$
3. Update Statistics:
$\mu_t \leftarrow \mathbb{E}[\theta \mid H_t]$
$\Sigma_t \leftarrow \mathbb{E}[(\theta - \mu_t)(\theta - \mu_t)^\top \mid H_t]$
4. Increment $t$ and Go to Step 1.
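A corresponding sketch of Algorithm 4 is below, again under the illustrative assumptions of a standard-normal prior over $\theta$ and Gaussian reward noise; the posterior is maintained through its precision matrix, which is equivalent to the conjugate Gaussian update.

```python
import numpy as np

def linear_thompson(Phi, theta_true, T=500, noise_sd=1.0, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 4: sample theta_hat ~ N(mu_t, Sigma_t), then maximize <phi(a), theta_hat>."""
    n, d = Phi.shape
    P = np.eye(d)                 # posterior precision (prior theta ~ N(0, I), an assumption)
    b = np.zeros(d)
    regret = 0.0
    best = np.max(Phi @ theta_true)
    for _ in range(T):
        Sigma = np.linalg.inv(P)
        mu = Sigma @ b
        theta_hat = rng.multivariate_normal(mu, Sigma)    # step 1: sample model
        a = int(np.argmax(Phi @ theta_hat))               # step 2: select action
        r = Phi[a] @ theta_true + rng.normal(0.0, noise_sd)
        P += np.outer(Phi[a], Phi[a]) / noise_sd**2       # step 3: update statistics
        b += Phi[a] * r / noise_sd**2
        regret += best - Phi[a] @ theta_true
    return regret

print(linear_thompson(np.random.default_rng(2).normal(size=(30, 4)),
                      theta_true=np.array([1.0, 0.0, -1.0, 0.5])))
```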
4.3. Potential advantages of posterior sampling. Well-designed optimistic algorithms can be extremely
effective. When simple and efficient UCB algorithms are available, they may be preferable to posterior sampling.
Our work is primarily motivated by the challenges in designing optimistic algorithms to address very complicated
problems, and the important advantages posterior sampling sometimes offers in such cases.
The performance of a UCB algorithm depends critically on the choice of UCBs $(U_t : t \in \mathbb{N})$. These functions should be chosen so that $U_t(A_t^*) \ge f_\theta(A_t^*)$ with high probability. However, unless the posterior distribution of $f_\theta(a)$ can be expressed in closed form, computing high quantiles of this distribution can require extensive Monte Carlo simulation for each possible action. In addition, since $A_t^*$ depends on $\theta$, it isn't even clear that $U_t(a)$ should be set to a particular quantile of the posterior distribution of $f_\theta(a)$. Strong frequentist UCBs were recently developed
(Cappé et al. [14]) for problems with independent arms, but it is often unclear how to generate tight confidence
sets for more complicated models. In fact, even in the case of a linear model, the design of confidence sets has
required sophisticated tools from the study of multivariate self-normalized martingale processes (Abbasi-Yadkori
et al. [2]). We believe posterior sampling provides a powerful tool for practitioners, as a posterior sampling
approach can be designed for complicated models without sophisticated statistical analysis. Further, posterior
sampling does not require computing posterior distributions but only sampling from posterior distributions.
As such, Markov chain Monte Carlo methods can often be used to efficiently generate samples even when the
posterior distribution is complex.
Posterior sampling can also offer critical computational advantages over UCB algorithms when the action space
is intractably large. Consider the problem of learning to solve a linear program, where the action set $\mathcal{A}$ is a polytope encoded in terms of linear inequalities, and $f_\theta(a) := \langle\phi(a), \theta\rangle$ is a linear function. Algorithm 2 becomes impractical, because as observed by Dani et al. [16], the action selection step entails solving a problem equivalent to linearly constrained negative definite quadratic optimization, which is NP-hard (Sahni [33]).$^2$ By contrast, the action selection step of Algorithm 4 only requires solving a linear program. The figure below displays the level sets of the linear objective $\langle\phi(a), \theta\rangle$ and of the UCBs used by Algorithm 2. While posterior sampling preserves the linear structure of the functions $f_\theta(a)$, it is challenging to maximize the UCBs of Algorithm 2, which are strictly convex.

$^2$ Dani et al. [16] study a slightly different UCB algorithm, but the optimization step shares the same structure.
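To make the computational contrast concrete, here is a small illustrative sketch (not from the paper) of a single action-selection step of Algorithm 4 over a polytope, using an off-the-shelf LP solver; the polytope, the Gaussian posterior parameters, and the identity feature map $\phi(a) = a$ are assumptions made for the example. The corresponding step of Algorithm 2 would instead require maximizing the convex function $\langle a, \mu_t\rangle + \beta\log(t)\|a\|_{\Sigma_t}$ over the polytope, which is the hard problem described above.

```python
import numpy as np
from scipy.optimize import linprog

def thompson_lp_step(A_ub, b_ub, mu, Sigma, rng=np.random.default_rng(0)):
    """One action-selection step of Algorithm 4 when the action set is the polytope
    {a : A_ub @ a <= b_ub} and f_theta(a) = <a, theta> (identity features, illustrative)."""
    theta_hat = rng.multivariate_normal(mu, Sigma)          # sample theta from the posterior
    # linprog minimizes, so negate the sampled objective to maximize <a, theta_hat>.
    res = linprog(c=-theta_hat, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * len(mu))
    return res.x                                             # the selected action A_t

# Illustrative 2-d polytope: the unit square expressed as linear inequalities.
A_ub = np.vstack([np.eye(2), -np.eye(2)])
b_ub = np.array([1.0, 1.0, 0.0, 0.0])
print(thompson_lp_step(A_ub, b_ub, mu=np.zeros(2), Sigma=np.eye(2)))
```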
[Figure: (a) Tractable problem of maximizing a linear function over a polytope. (b) Intractable problem of maximizing an upper confidence bound over a polytope.]
5.2. Posterior sampling regret decomposition. We now show that the Bayesian regret of posterior sampling admits an analogous decomposition and that, with some important differences discussed below, it holds for an arbitrary sequence of upper confidence bounds. The following proposition makes this precise; we let $\pi^{\rm PS}$ denote the policy followed by posterior sampling.

Proposition 1. For any UCB sequence $U = \{U_t \mid t \in \mathbb{N}\}$ and any $T \in \mathbb{N}$,
$$\text{BayesRegret}(T, \pi^{\rm PS}) = \mathbb{E}\sum_{t=1}^{T}\bigl[U_t(A_t) - f_\theta(A_t)\bigr] + \mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A_t^*) - U_t(A_t^*)\bigr]. \tag{3}$$
Proof. Note that, conditioned on $H_t$, the optimal action $A_t^*$ and the action $A_t$ selected by posterior sampling are identically distributed, and $U_t$ is deterministic. Hence $\mathbb{E}[U_t(A_t^*) \mid H_t] = \mathbb{E}[U_t(A_t) \mid H_t]$. Therefore
$$\mathbb{E}\bigl[f_\theta(A_t^*) - f_\theta(A_t) \mid H_t\bigr] = \mathbb{E}\bigl[U_t(A_t) - f_\theta(A_t) \mid H_t\bigr] + \mathbb{E}\bigl[f_\theta(A_t^*) - U_t(A_t^*) \mid H_t\bigr],$$
and summing over $t$ and taking expectations yields the result. $\square$
To compare (2) and (3), consider the case where $f_\theta$ takes values in $[0, C]$. Then both decompositions bound (Bayesian) regret by a sum of two terms: one accumulating the amount by which the upper confidence bound at the selected action overestimates its mean reward, and one accumulating the amount by which the mean reward of the optimal action exceeds its upper confidence bound.

An important difference is that the decomposition (2) of a UCB algorithm's regret involves the particular sequence $U$ used by that algorithm, whereas the decomposition (3) of posterior sampling's Bayesian regret holds simultaneously for all UCB sequences. This severs the analysis of posterior sampling from any specific choice of confidence bounds, which is a crucial advantage: constructing appropriate confidence bounds can require substantial effort, and a poor choice degrades the performance of a UCB algorithm without affecting posterior sampling significantly.

We have shown how UCB sequences can be used to bound the Bayesian regret of posterior sampling, and we apply this concept in the next two sections. Note that although the analysis makes use of UCBs, the actual posterior sampling algorithm never computes them.
Proof. Let $N_t(a) = \sum_{i=1}^{t}\mathbb{1}(A_i = a)$ denote the number of times $a$ is sampled over the first $t$ periods, and let $\hat\mu_t(a) = N_t(a)^{-1}\sum_{i=1}^{t}\mathbb{1}(A_i = a)R_i$ denote the empirical average reward from these samples. Define upper and lower confidence bounds as follows:
$$U_t(a) = \min\left\{\hat\mu_{t-1}(a) + \sqrt{\frac{2 + 6\log T}{N_{t-1}(a)}},\; 1\right\}, \qquad L_t(a) = \max\left\{\hat\mu_{t-1}(a) - \sqrt{\frac{2 + 6\log T}{N_{t-1}(a)}},\; 0\right\}. \tag{5}$$
The next lemma, which is a consequence of analysis in Abbasi-Yadkori et al. [2], shows these hold with high
probability. Were actions sampled in an iid fashion, this lemma would follow immediately from the Hoeffding
inequality. For more details, see Appendix A.
Lemma 1. If $U_t(a)$ and $L_t(a)$ are defined as in (5), then $\mathbb{P}\bigl(\bigcup_{t=1}^{T}\{f_\theta(a) \notin [L_t(a), U_t(a)]\}\bigr) \le 1/T$.
First, consider the case where $T \le K$. Since $f_\theta(a) \in [0, 1]$, $\text{BayesRegret}(T, \pi^{\rm PS}) \le T = \min\{K, T\}$.

Now, assume $T > K$. Then
$$\text{BayesRegret}(T, \pi^{\rm PS}) \le \mathbb{E}\left[\sum_{t=1}^{T}(U_t - L_t)(A_t)\right] + T\,\mathbb{P}\left(\bigcup_{a \in \mathcal{A}}\bigcup_{t=1}^{T}\bigl\{f_\theta(a) \notin [L_t(a), U_t(a)]\bigr\}\right) \le \mathbb{E}\left[\sum_{t=1}^{T}(U_t - L_t)(A_t)\right] + K.$$
We now turn to bounding $\sum_{t=1}^{T}(U_t - L_t)(A_t)$. Let $\mathcal{T}_a = \{t \le T : A_t = a\}$ denote the periods in which $a$ is selected. Then $\sum_{t=1}^{T}(U_t - L_t)(A_t) = \sum_{a \in \mathcal{A}}\sum_{t \in \mathcal{T}_a}(U_t - L_t)(a)$. We show
$$\sum_{t \in \mathcal{T}_a}(U_t - L_t)(a) \le 2 + 2\sqrt{2 + 6\log T}\sum_{j=0}^{N_T(a)-1}(j+1)^{-1/2}$$
and
$$\sum_{j=0}^{N_T(a)-1}(j+1)^{-1/2} \le \int_{0}^{N_T(a)} x^{-1/2}\,dx = 2\sqrt{N_T(a)}.$$
Hence
$$\sum_{t=1}^{T}(U_t - L_t)(A_t) \le 2K + 4\sqrt{2 + 6\log T}\sum_{a \in \mathcal{A}}\sqrt{N_T(a)} \overset{(a)}{\le} 2K + 4\sqrt{KT(2 + 6\log T)} \overset{(b)}{=} 2\min\{K, T\} + 4\sqrt{KT(2 + 6\log T)},$$
where (a) follows from the Cauchy-Schwarz inequality and (b) follows from the assumption that $T > K$. $\square$
6.2. Linear and generalized linear models. We now consider function classes that represent linear and generalized linear models. The bound of Proposition 2 applies so long as the number of actions is finite, but here we will establish alternative bounds that depend on the dimension of the function class rather than the number of actions. Such bounds accommodate problems with infinite action sets and can be much stronger than the bound of Proposition 2 if there are many actions.

The Bayesian regret bounds we provide in this section derive from regret bounds of the UCB literature. In particular, we will establish a Bayesian regret bound that is as strong for the case of linear models and stronger for the case of generalized linear models. Since the results of §7 to a large extent supersede those we present here, we aim to be brief and avoid formal proofs in this section's discussion of the bounds and how they follow from results in the literature.
6.2.1. Linear models. In the "linear bandit" problem studied by Abbasi-Yadkori et al. [2, 3], Dani et al. [16], and Rusmevichientong and Tsitsiklis [31], reward functions are parameterized by a vector $\theta \in \Theta \subseteq \mathbb{R}^d$, and there is a known feature mapping $\phi: \mathcal{A} \mapsto \mathbb{R}^d$ such that $f_\theta(a) = \langle\phi(a), \theta\rangle$. The following proposition establishes Bayesian regret bounds for such problems. The proposition uses the term $\sigma$-sub-Gaussian to describe any random variable $X$ that satisfies $\mathbb{E}\exp(\lambda X) \le \exp(\lambda^2\sigma^2/2)$ for all $\lambda \in \mathbb{R}$.

Proposition 3. Fix positive constants $\sigma$, $c_1$, and $c_2$. If $\Theta \subseteq \mathbb{R}^d$, $f_\rho(a) = \langle\phi(a), \rho\rangle$ for some $\phi: \mathcal{A} \mapsto \mathbb{R}^d$, $\sup_{\rho \in \Theta}\|\rho\|_2 \le c_1$ and $\sup_{a \in \mathcal{A}}\|\phi(a)\|_2 \le c_2$, and for each $t$, $R_t - f_\theta(A_t)$ conditioned on $(H_t, A_t, \theta)$ is $\sigma$-sub-Gaussian, then
$$\text{BayesRegret}(T, \pi^{\rm PS}) = O\bigl(d\log T\sqrt{T}\bigr),$$
along with a second bound in which the dependence on the dimension $d$ is essentially replaced by one on $\mathbb{E}\sqrt{\|\theta\|_0}$. The "zero norm" $\|\theta\|_0$ is the number of nonzero components of $\theta$, which can be much smaller than $d$ when the reward function admits a sparse representation. Note that $\tilde O$ ignores logarithmic factors. Both bounds follow from our regret decomposition (3) together with the analysis of Abbasi-Yadkori et al. [2] in the case of the first bound, and the analysis of Abbasi-Yadkori et al. [3] in the case of the second bound. We now provide a brief sketch of how these bounds can be derived.
$$\text{BayesRegret}(T, \pi^{\rm PS}) \le \mathbb{E}\sum_{t=1}^{T}\bigl[U_t(A_t) - L_t(A_t)\bigr] + 2C\sum_{t=1}^{T}\bigl[\mathbb{P}\bigl(f_\theta(A_t^*) > U_t(A_t^*)\bigr) + \mathbb{P}\bigl(f_\theta(A_t) < L_t(A_t)\bigr)\bigr]. \tag{6}$$
The analyses of Abbasi-Yadkori et al. [2, 3] can be used to bound both terms of this equation. In their work, the confidence set for $\theta$ at time $t$ is an ellipsoid $\Theta_t$ centered at a regularized least-squares estimate computed from the observations available up to time $t$. The upper and lower confidence bounds are obtained by maximizing and minimizing $\langle\phi(a), \rho\rangle$ over $\rho \in \Theta_t$ and truncating to $[-C, C]$, e.g., $U_t(a) := \max\{-C, \min\{C, \max_{\rho \in \Theta_t}\langle\phi(a), \rho\rangle\}\}$. Their analysis shows that $\mathbb{P}(\theta \notin \Theta_t \mid H_t)$ can be made small enough to control the second term of (6), while their bounds on the widths of these confidence intervals yield a bound of order $d\sqrt{T}\log T$ on the first term, which leads to the bounds stated in Proposition 3.
6.2.2. Generalized linear models. We now consider generalized linear models, in which $f_\theta(a) := g(\langle\phi(a), \theta\rangle)$ for a known, strictly increasing link function $g$. The analysis of Filippi et al. [18] can be combined with our regret decomposition in the same way as in the linear case, but with one caveat: it requires an initial set of actions $a_1, \ldots, a_d$ with linearly independent feature vectors. Whether such actions exist or the agent must seek them out through a brief period of designed exploration, the approach is the same: to apply the results from Filippi et al. [18], we consider a variant of posterior sampling whose first $d$ actions take this form, after which actions are sampled from the posterior distribution over optimal actions. We denote this modified algorithm by $\pi^{\rm PS'}$ and present a result whose proof combines the regret decomposition (3) with the analysis of Filippi et al. [18]; it applies to this variant that includes a designed exploration period.

Proposition 4. Suppose $f_\theta(a) = g(\langle\phi(a), \theta\rangle)$, where $g$ is an increasing, continuously differentiable function, and the feature vectors $\phi(a_1), \ldots, \phi(a_d)$ of the initial actions are linearly independent, so that $\sum_i\phi(a_i)\phi(a_i)^\top$ is nonsingular. Then the Bayesian regret of $\pi^{\rm PS'}$ satisfies a bound analogous to that of Proposition 3, inflated by a factor $r = \sup_{\rho, a} g'(\langle\phi(a), \rho\rangle)\big/\inf_{\rho, a} g'(\langle\phi(a), \rho\rangle)$ that measures the degree of nonlinearity of the link function.
6.3. Gaussian processes. In this section, we consider the case where the reward function $f_\theta$ is sampled from a Gaussian process. That is, the stochastic process $(f_\theta(a) : a \in \mathcal{A})$ is such that for any $a_1, \ldots, a_k \in \mathcal{A}$, the collection $f_\theta(a_1), \ldots, f_\theta(a_k)$ follows a multivariate Gaussian distribution. Srinivas et al. [35] study a UCB algorithm designed for such problems and provide general regret bounds. Again, through the regret decomposition (3), their analysis provides a Bayesian regret bound for posterior sampling.

For simplicity, we focus our discussion on the case where $\mathcal{A}$ is finite, so that $(f_\theta(a) : a \in \mathcal{A})$ follows a multivariate Gaussian distribution. As shown by Srinivas et al. [35], the results extend to infinite action sets through a discretization argument as long as certain smoothness conditions are satisfied.
When confidence bounds hold, a UCB algorithm incurs high regret from sampling an action only when the confidence bound at that action is loose. In that case, one would expect the algorithm to learn a lot about $f_\theta$ based on the observed reward. This suggests the algorithm's cumulative regret may be bounded in an appropriate sense by the total amount it is expected to learn. Leveraging the structure of the Gaussian distribution, Srinivas et al. [35] formalize this idea. They bound the regret of their UCB algorithm in terms of the maximum amount that any algorithm could learn about $f_\theta$. They use an information-theoretic measure of learning: the information gain. This is defined to be the difference between the entropy of the prior distribution of $(f_\theta(a) : a \in \mathcal{A})$ and the entropy of the posterior. The maximum possible information gain is denoted $\gamma_T$, where the maximum is taken over all sequences $a_1, \ldots, a_T$.$^3$ Their analysis also supports the following result on posterior sampling.
Srinivas et al. [35] also provide bounds on $\gamma_T$ for kernels commonly used in Gaussian process regression,
including the linear kernel, radial basis kernel, and Matérn kernel. Combined with the above proposition, this
yields explicit Bayesian regret bounds in these cases.
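These quantities can be computed directly for a finite action set. The sketch below is an illustrative numpy implementation, not taken from the paper: it assumes a squared-exponential kernel and noise level, selects actions by posterior sampling, updates the Gaussian posterior by rank-one conditioning after each observation, and evaluates the information gain $\tfrac{1}{2}\log\det(I + \sigma^{-2}K_T)$ of the selected actions.

```python
import numpy as np

def rbf_kernel(X, Y, ls=0.5):
    """Squared-exponential kernel (an illustrative choice of prior covariance)."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * ls**2))

def gp_thompson_with_info_gain(X, f_true, T=30, noise_sd=0.1, rng=np.random.default_rng(0)):
    """Posterior sampling over a finite action set with a GP prior, tracking information gain."""
    K = rbf_kernel(X, X)                    # prior covariance of (f(a) : a in A)
    mean, cov = np.zeros(len(X)), K.copy()
    picked = []
    for _ in range(T):
        jitter = 1e-9 * np.eye(len(X))      # numerical safeguard for sampling
        f_hat = rng.multivariate_normal(mean, cov + jitter)   # sample a reward function
        a = int(np.argmax(f_hat))
        y = f_true[a] + rng.normal(0.0, noise_sd)
        picked.append(a)
        # Rank-one Gaussian conditioning on the noisy observation at action a.
        k = cov[:, a]
        denom = cov[a, a] + noise_sd**2
        mean = mean + k * (y - mean[a]) / denom
        cov = cov - np.outer(k, k) / denom
    K_T = K[np.ix_(picked, picked)]
    info_gain = 0.5 * np.linalg.slogdet(np.eye(len(picked)) + K_T / noise_sd**2)[1]
    return picked, info_gain

X = np.linspace(0, 1, 25).reshape(-1, 1)    # 25 actions on a grid (illustrative)
print(gp_thompson_with_info_gain(X, f_true=np.sin(6 * X[:, 0]))[1])
```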
We will briefly comment on their analysis and how it provides a bound for posterior sampling. First, note that the posterior distribution is Gaussian, which suggests a UCB of the form $U_t(a) := \mu_{t-1}(a) + \sqrt{\beta_t}\,\sigma_{t-1}(a)$, where $\mu_{t-1}(a)$ is the posterior mean, $\sigma_{t-1}(a)$ is the posterior standard deviation of $f_\theta(a)$, and $\beta_t$ is a confidence parameter. We can provide a Bayesian regret bound by bounding both terms of (3). The next lemma bounds the second term. Some new analysis is required since the Gaussian distribution is unbounded, and we study Bayesian regret, whereas Srinivas et al. [35] bound regret with high probability under the prior.
Lemma 2. If $U_t(a) = \mu_{t-1}(a) + \sqrt{\beta_t}\,\sigma_{t-1}(a)$ and $\beta_t = 2\ln\bigl((t^2 + 1)|\mathcal{A}|/\sqrt{2\pi}\bigr)$, then for all $T \in \mathbb{N}$ the second term of (3) satisfies $\mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A_t^*) - U_t(A_t^*)\bigr]^+ \le \sum_{t=1}^{\infty}(t^2 + 1)^{-1}$.

For any $a \in \mathcal{A}$ and $t$, properties of the Gaussian distribution give
$$\mathbb{E}\bigl[\mathbb{1}\{f_\theta(a) - U_t(a) > 0\}\bigl(f_\theta(a) - U_t(a)\bigr) \mid H_t\bigr] \le \sigma_{t-1}(a)\,\frac{e^{-\beta_t/2}}{\sqrt{2\pi}} = \frac{\sigma_{t-1}(a)}{(t^2 + 1)|\mathcal{A}|} \le \frac{1}{(t^2 + 1)|\mathcal{A}|}. \tag{7}$$
The final inequality above follows from the assumption that $\sigma_0(a) \le 1$. The claim follows from (7) since
$$\mathbb{E}\sum_{t=1}^{T}\bigl[f_\theta(A_t^*) - U_t(A_t^*)\bigr]^+ \le \sum_{t=1}^{\infty}\sum_{a \in \mathcal{A}}\mathbb{E}\bigl[\mathbb{1}\{f_\theta(a) - U_t(a) > 0\}\bigl(f_\theta(a) - U_t(a)\bigr)\bigr] \le \sum_{t=1}^{\infty}\frac{1}{t^2 + 1}.$$
Here, the second equality follows by the tower property of conditional expectation, and the final step uses the Cauchy-Schwarz inequality. Therefore, to establish a Bayesian regret bound, it is sufficient to provide a worst-case bound on the sum of posterior variances $\sum_{t=1}^{T}\sigma_{t-1}^2(a_t)$ that holds for any $a_1, \ldots, a_T$. Under the assumption that $\sigma_0(a) \le 1$, the proof of Lemma 5.4 of Srinivas et al. [35] shows that $\sigma_{t-1}^2(a_t) \le \alpha^{-1}\log\bigl(1 + \sigma^{-2}\sigma_{t-1}^2(a_t)\bigr)$ for a constant $\alpha > 0$ that depends only on the noise variance. At the same time, Lemma 5.3 of Srinivas et al. [35] shows the information gain from selecting $a_1, \ldots, a_T$ is equal to $\frac{1}{2}\sum_{t=1}^{T}\log\bigl(1 + \sigma^{-2}\sigma_{t-1}^2(a_t)\bigr)$. This shows that for any actions $a_1, \ldots, a_T$, the sum of posterior variances $\sum_{t=1}^{T}\sigma_{t-1}^2(a_t)$ can be bounded in terms of the information gain from selecting them. Therefore $\sum_{t=1}^{T}\sigma_{t-1}^2(A_t)$ can be bounded in terms of the largest possible information gain $\gamma_T$.
7. Bounds for general function classes. The previous section treated models in which the relationship among
action rewards takes a simple and tractable form. Indeed, nearly all of the MAB literature focuses on such problems.
Posterior sampling can be applied to a much broader class of models. As such, more general results that hold
beyond restrictive cases are of particular interest. In this section, we provide a Bayesian regret bound that applies
when the reward function lies in a known, but otherwise arbitrary class of uniformly bounded real-valued functions
$\mathcal{F}$. Our analysis of this abstract framework yields a more general result that applies beyond the scope of specific
problems that have been studied in the literature, and also identifies factors that unify more specialized prior results.
Further, our more general result when specialized to linear models recovers the strongest known Bayesian regret
bound and in the case of generalized linear models, yields a bound stronger than that established in prior literature.
If $\mathcal{F}$ is not appropriately restricted, it is impossible to guarantee any reasonably attractive level of Bayesian regret. For example, in a case where $\mathcal{A} = [0, 1]$, $f_\theta(a) = \mathbb{1}(\theta = a)$, $\mathcal{F} = \{f_\theta \mid \theta \in [0, 1]\}$, and $\theta$ is uniformly distributed over $[0, 1]$, it is easy to see that the Bayesian regret of any algorithm over $T$ periods is $T$, which is no different from the worst level of performance an agent can experience.
This example highlights the fact that Bayesian regret bounds must depend on the function class $\mathcal{F}$. The bound we develop in this section depends on $\mathcal{F}$ through two measures of complexity. The first is the Kolmogorov dimension, which measures the growth rate of the covering numbers of $\mathcal{F}$ and is closely related to measures of complexity that are common in the supervised learning literature. It roughly captures the sensitivity of $\mathcal{F}$ to statistical overfitting. The second measure is a new notion we introduce, which we refer to as the eluder dimension. This captures how effectively the value of unobserved actions can be inferred from observed samples. We highlight in §7.3 why notions of dimension common to the supervised learning literature are insufficient for our purposes.
Though the results of this section are very general, they do not apply to the entire range of problems represented
by the formulation we introduced in §3. In particular, throughout the scope of this section, we fix constants $C > 0$
and $\sigma > 0$ and impose two simplifying assumptions. The first concerns boundedness of reward functions.
7.1. Confidence bounds. The construction of tight confidence sets for specific classes of functions presents
technical challenges. Even for the relatively simple case of linear bandit problems, significant analysis is required.
It is therefore perhaps surprising that, as we show in this section, one can construct strong confidence sets for an
arbitrary class of functions without much additional sophistication. While the focus of our work is on providing a
Bayesian regret bound for posterior sampling, the techniques we introduce for constructing confidence sets may
find broader use.
The confidence sets constructed here are centered around least-squares estimates $\hat f_t^{\rm LS} \in \arg\min_{f \in \mathcal{F}} L_{2,t}(f)$, where $L_{2,t}(f) = \sum_{k=1}^{t-1}\bigl(f(A_k) - R_k\bigr)^2$ is the cumulative squared prediction error.$^4$ The sets take the form $\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat f_t^{\rm LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$, where $\beta_t$ is an appropriately chosen confidence parameter, and the empirical 2-norm $\|\cdot\|_{2,E_t}$ is defined by $\|g\|_{2,E_t}^2 = \sum_{k=1}^{t-1} g^2(A_k)$. Hence $\|f - f_\theta\|_{2,E_t}^2$ measures the cumulative discrepancy between the previous predictions of $f$ and $f_\theta$.

$^4$ The results can be extended to the case where the infimum of $L_{2,t}(f)$ is unattainable by selecting a function with squared prediction error sufficiently close to the infimum.
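For a finite function class over a finite action set, these objects can be computed directly. The sketch below is illustrative (the class, the observed data, and the confidence parameter $\beta$ are assumptions of the example): it represents each $f \in \mathcal{F}$ by its vector of values, selects the least-squares estimate, forms the confidence set $\mathcal{F}_t$, and reports the widths $\sup_{f\in\mathcal{F}_t} f(a) - \inf_{f\in\mathcal{F}_t} f(a)$ that reappear in §7.2.

```python
import numpy as np

def confidence_set_and_widths(F, actions_played, rewards, beta):
    """F: (num_functions, num_actions) array with F[j, a] = f_j(a).
    Returns indices of functions in the confidence set and the width at each action."""
    preds = F[:, actions_played]                          # f(A_k) for each f and each past period
    sq_err = np.sum((preds - rewards) ** 2, axis=1)       # L_{2,t}(f), cumulative squared error
    f_ls = np.argmin(sq_err)                              # least-squares estimate \hat f_t^LS
    # ||f - \hat f^LS||_{2,E_t}^2 = sum_k (f(A_k) - \hat f^LS(A_k))^2
    dev = np.sum((preds - preds[f_ls]) ** 2, axis=1)
    in_set = np.flatnonzero(dev <= beta)
    widths = F[in_set].max(axis=0) - F[in_set].min(axis=0)   # w_{F_t}(a) for every action a
    return in_set, widths

F = np.array([[0.1, 0.9, 0.5],       # three candidate reward functions over three actions
              [0.2, 0.8, 0.6],
              [0.9, 0.1, 0.4]])
in_set, widths = confidence_set_and_widths(
    F, actions_played=[1, 1, 2], rewards=[0.85, 0.9, 0.55], beta=0.05)
print(in_set, widths)
```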
The following lemma is the key to constructing strong confidence sets $(\mathcal{F}_t : t \in \mathbb{N})$. For an arbitrary function $f$, it bounds the squared error of $f$ from below in terms of the empirical loss of the true function $f_\theta$ and the aggregate empirical discrepancy $\|f - f_\theta\|_{2,E_t}^2$ between $f$ and $f_\theta$. It establishes that for any function $f$, with high probability, the random process $(L_{2,t}(f) : t \in \mathbb{N})$ never falls below the process $(L_{2,t}(f_\theta) + \frac{1}{2}\|f - f_\theta\|_{2,E_t}^2 : t \in \mathbb{N})$ by more than a fixed constant. A proof of the lemma is provided in the appendix.

Lemma 3. For any $\delta > 0$ and $f: \mathcal{A} \mapsto \mathbb{R}$, with probability at least $1 - \delta$,
$$L_{2,t}(f) \ge L_{2,t}(f_\theta) + \tfrac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log(1/\delta) \quad \text{simultaneously for all } t \in \mathbb{N}.$$

By Lemma 3, with high probability, $f$ can enjoy lower squared error than $f_\theta$ only if its empirical deviation $\|f - f_\theta\|_{2,E_t}^2$ from $f_\theta$ is less than $8\sigma^2\log(1/\delta)$. Through a union bound, this property holds uniformly for all functions in a finite subset of $\mathcal{F}$. Using this fact and a discretization argument, together with the observation that $L_{2,t}(\hat f_t^{\rm LS}) \le L_{2,t}(f_\theta)$, we can establish the following result, which is proved in the appendix. Let $N(\mathcal{F}, \alpha, \|\cdot\|_\infty)$ denote the $\alpha$-covering number of $\mathcal{F}$ in the sup-norm $\|\cdot\|_\infty$, and let $\beta_t^*(\mathcal{F}, \delta, \alpha)$ denote the confidence parameter defined in (8), which combines a term proportional to $\sigma^2\log\bigl(N(\mathcal{F}, \alpha, \|\cdot\|_\infty)/\delta\bigr)$ with a term accounting for the discretization error at scale $\alpha$. If $\beta_t = \beta_t^*(\mathcal{F}, \delta, \alpha)$ for all $t$, then
$$\mathbb{P}\Bigl(f_\theta \in \bigcap_{t=1}^{\infty}\mathcal{F}_t\Bigr) \ge 1 - 2\delta.$$
While the expression (8) defining the confidence parameter is complicated, it can be bounded by simple expressions in important cases. We provide three examples.

Example 6 (Generalized Linear Models). Consider the case of a $d$-dimensional generalized linear model $f_\theta(a) := g(\langle\phi(a), \theta\rangle)$, where $g$ is an increasing Lipschitz continuous function. Fix $g$, $\gamma = \sup_{a \in \mathcal{A}}\|\phi(a)\|$, and $s = \sup_{\rho \in \Theta}\|\rho\|$. Then, the previous argument shows $\log N(\mathcal{F}, \alpha, \|\cdot\|_\infty) = O(d\log(1/\alpha))$. Again, choosing $\alpha = 1/t^2$ yields a confidence parameter $\beta_t^*(\mathcal{F}, \delta, 1/t^2) = O(d\log(t/\delta))$.
The confidence parameter $\beta_t^*(\mathcal{F}, 1/t^2, 1/t^2)$ is closely related to the following concept.

Definition (Kolmogorov dimension). The Kolmogorov dimension of a function class $\mathcal{F}$ is
$$\dim_K(\mathcal{F}) := \limsup_{\alpha\downarrow 0}\frac{\log N(\mathcal{F}, \alpha, \|\cdot\|_\infty)}{\log(1/\alpha)}.$$
In particular, the choice $\delta = \alpha = 1/t^2$ yields a confidence parameter of order $\dim_K(\mathcal{F})\log t$.

Proof. By definition, $\beta_t^*(\mathcal{F}, 1/t^2, 1/t^2) = O\bigl(\log N(\mathcal{F}, 1/t^2, \|\cdot\|_\infty) + \log t\bigr)$. The result follows from the fact that $\limsup_{t\to\infty}\log\bigl(N(\mathcal{F}, 1/t^2, \|\cdot\|_\infty)\bigr)/\log(t^2) = \dim_K(\mathcal{F})$. $\square$
7.2. Bayesian regret bounds. In this section, we introduce a new notion of complexity—the eluder dimension—and then use it to develop a Bayesian regret bound. First, we note that, using the regret decomposition (3) and the confidence sets $(\mathcal{F}_t : t \in \mathbb{N})$ constructed in the previous section, we can bound the Bayesian regret of posterior sampling in terms of confidence interval widths $w_{\mathcal{F}_t}(a) := \sup_{f \in \mathcal{F}_t} f(a) - \inf_{f \in \mathcal{F}_t} f(a)$. In particular, the following lemma follows from our regret decomposition (3).

Lemma 4. For all $T \in \mathbb{N}$, if $\inf_{f \in \mathcal{F}_t} f(a) \le f_\theta(a) \le \sup_{f \in \mathcal{F}_t} f(a)$ for all $t \in \mathbb{N}$ and $a \in \mathcal{A}$ with probability at least $1 - 1/T$, then
$$\text{BayesRegret}(T, \pi^{\rm PS}) \le C + \mathbb{E}\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t).$$
We can use the confidence sets constructed in the previous section to guarantee that the conditions of Lemma 4 hold. In particular, choosing $\delta = 1/2T$ in (8) guarantees that $f_\theta \in \bigcap_{t=1}^{\infty}\mathcal{F}_t$ with probability at least $1 - 1/T$. Our remaining task is to provide a worst-case bound on the sum $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$. First, consider the linearly parameterized model where $f_\rho(a) := \langle\phi(a), \rho\rangle$ for each $\rho \in \Theta \subseteq \mathbb{R}^d$. Then, it can be shown that the confidence set takes the form $\mathcal{F}_t := \{f_\rho : \rho \in \Theta_t\}$, where $\Theta_t \subseteq \mathbb{R}^d$ is an ellipsoid. When an action $A_t$ is sampled, the ellipsoid shrinks in the direction $\phi(A_t)$. Here, the explicit geometric structure of the confidence set implies that the width shrinks not only at $A_t$ but also at any other action whose feature vector is not orthogonal to $\phi(A_t)$. Some linear algebra leads to a worst-case bound on $\sum_{t} w_{\mathcal{F}_t}(A_t)$. For a general class of functions, the situation is much subtler, and we need to measure the way in which the width at each action can be reduced by sampling other actions. To do this, we introduce the following notion of dependence.
Definition 2. An action $a \in \mathcal{A}$ is $\epsilon$-dependent on actions $\{a_1, \ldots, a_n\} \subseteq \mathcal{A}$ with respect to $\mathcal{F}$ if any pair of functions $f, \tilde f \in \mathcal{F}$ satisfying $\sqrt{\sum_{i=1}^{n}(f(a_i) - \tilde f(a_i))^2} \le \epsilon$ also satisfies $f(a) - \tilde f(a) \le \epsilon$. Further, $a$ is $\epsilon$-independent of $\{a_1, \ldots, a_n\}$ if it is not $\epsilon$-dependent on the set.

Intuitively, an action $a$ is independent of $\{a_1, \ldots, a_n\}$ if two functions that make similar predictions at $\{a_1, \ldots, a_n\}$ can nevertheless differ significantly in their predictions at $a$. The above definition measures the "similarity" of predictions at $\epsilon$-scale, and measures whether two functions make similar predictions at $\{a_1, \ldots, a_n\}$ based on the cumulative discrepancy $\sqrt{\sum_{i=1}^{n}(f(a_i) - \tilde f(a_i))^2}$. This measure of dependence suggests the following notion of dimension. In this definition, we imagine that the sequence of elements in $\mathcal{A}$ is chosen by an eluder that hopes to show the agent poorly understood actions for as long as possible.

Definition 3. The $\epsilon$-eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the length $d$ of the longest sequence of elements in $\mathcal{A}$ such that, for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors.
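As an illustration of the definition, the brute-force sketch below computes the length of the longest sequence in which every element is $\epsilon$-independent of its predecessors, for a small finite class represented by its value matrix; the class of indicator-style functions is an assumption chosen for the example (it is not the example of §7.3), and the search is exponential-time in general.

```python
import itertools
import numpy as np

def is_independent(F, prefix, a, eps):
    """a is eps-independent of prefix if two functions are close (cumulatively, within eps)
    on the prefix yet differ by more than eps at a."""
    for f, g in itertools.combinations(F, 2):
        close_on_prefix = np.sqrt(sum((f[i] - g[i]) ** 2 for i in prefix)) <= eps
        if close_on_prefix and abs(f[a] - g[a]) > eps:
            return True
    return False

def longest_indep_sequence(F, actions, eps, prefix=()):
    """Length of the longest sequence in which every element is eps-independent of its
    predecessors (brute force; here we evaluate independence at eps' = eps)."""
    best = len(prefix)
    for a in actions:
        if is_independent(F, prefix, a, eps):
            best = max(best, longest_indep_sequence(F, actions, eps, prefix + (a,)))
    return best

F = np.eye(4)                     # f_j(a) = 1 if a == j else 0 (illustrative class)
print(longest_indep_sequence(F, actions=range(4), eps=0.5))   # prints 3 for this class
```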
Recall that a vector space has dimension $d$ if and only if $d$ is the length of the longest sequence of elements such that each element is linearly independent or, equivalently, $0$-independent of its predecessors. Definition 3 replaces the requirement of linear independence with $\epsilon$-independence. This extension is advantageous as it captures both nonlinear dependence and approximate dependence. The following result uses our new notion of dimension to bound the number of times the width of the confidence interval for a selected action $A_t$ can exceed a threshold.

Proposition 8. If $(\beta_t \ge 0 \mid t \in \mathbb{N})$ is a nondecreasing sequence and $\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat f_t^{\rm LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$, then for all $T \in \mathbb{N}$ and $\epsilon > 0$,
$$\sum_{t=1}^{T}\mathbb{1}\bigl(w_{\mathcal{F}_t}(A_t) > \epsilon\bigr) \le \Bigl(\frac{4\beta_T}{\epsilon^2} + 1\Bigr)\dim_E(\mathcal{F}, \epsilon).$$
satisfying $Kd + 1 \le \tau \le Kd + d$, we will construct $K$ disjoint subsequences $B_1, \ldots, B_K$. First let $B_i = (a_i)$ for $i = 1, \ldots, K$. If $a_{K+1}$ is $\epsilon$-dependent on each subsequence $B_1, \ldots, B_K$, our claim is established. Otherwise, select a subsequence $B_i$ such that $a_{K+1}$ is $\epsilon$-independent of it and append $a_{K+1}$ to $B_i$. Repeat this process for elements with indices $j > K + 1$ until $a_j$ is $\epsilon$-dependent on each subsequence or $j = \tau$. In the latter scenario $\sum_i |B_i| \ge Kd$, but since each element of a subsequence $B_i$ is $\epsilon$-independent of its predecessors, $|B_i| \le d$. In this case, $|B_i| = d$ for each $i$, so $a_\tau$ must be $\epsilon$-dependent on each subsequence.
Now, consider taking $(a_1, \ldots, a_\tau)$ to be the subsequence $(A_{t_1}, \ldots, A_{t_\tau})$ of $(A_1, \ldots, A_T)$ consisting of those elements $A_t$ for which $w_{\mathcal{F}_t}(A_t) > \epsilon$. As we have established, each $A_{t_j}$ is $\epsilon$-dependent on fewer than $4\beta_T/\epsilon^2$ disjoint subsequences of $(A_1, \ldots, A_{t_j-1})$. It follows that each $a_j$ is $\epsilon$-dependent on fewer than $4\beta_T/\epsilon^2$ disjoint subsequences of $(a_1, \ldots, a_{j-1})$. Combining this with the fact that we have established that there is some $a_j$ that is $\epsilon$-dependent on at least $\tau/d - 1$ disjoint subsequences of $(a_1, \ldots, a_{j-1})$, we have $\tau/d - 1 \le 4\beta_T/\epsilon^2$. It follows that $\tau \le (4\beta_T/\epsilon^2 + 1)d$, which is our desired result. $\square$
Using Proposition 8, one can bound the sum $\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t)$, as established by the following lemma.

Lemma 5. If $(\beta_t \ge 0 \mid t \in \mathbb{N})$ is a nondecreasing sequence and $\mathcal{F}_t := \{f \in \mathcal{F} : \|f - \hat f_t^{\rm LS}\|_{2,E_t} \le \sqrt{\beta_t}\}$, then
$$\sum_{t=1}^{T} w_{\mathcal{F}_t}(A_t) \le 1 + \dim_E(\mathcal{F}, T^{-1})\,C + 4\sqrt{\dim_E(\mathcal{F}, T^{-1})\,\beta_T\,T}$$
for all $T \in \mathbb{N}$.

Proof. To reduce notation, write $d = \dim_E(\mathcal{F}, T^{-1})$ and $w_t = w_{\mathcal{F}_t}(A_t)$. Reorder the sequence $(w_1, \ldots, w_T) \to (w_{i_1}, \ldots, w_{i_T})$, where $w_{i_1} \ge w_{i_2} \ge \cdots \ge w_{i_T}$. We have
$$\sum_{t=1}^{T} w_t = \sum_{t=1}^{T} w_{i_t} = \sum_{t=1}^{T} w_{i_t}\mathbb{1}\{w_{i_t} \le T^{-1}\} + \sum_{t=1}^{T} w_{i_t}\mathbb{1}\{w_{i_t} > T^{-1}\} \le 1 + \sum_{t=1}^{T} w_{i_t}\mathbb{1}\{w_{i_t} > T^{-1}\}.$$
Now, if $w_{i_t} > \epsilon$ for some $\epsilon \ge T^{-1}$, then by Proposition 8 at most $(4\beta_T/\epsilon^2 + 1)d$ widths can exceed $\epsilon$, so $t \le (4\beta_T/\epsilon^2 + 1)d$. Rearranging, whenever $w_{i_t} > T^{-1}$ and $t > d$ we have $w_{i_t} \le \min\{C,\, 2\sqrt{d\beta_T/(t-d)}\}$. Hence
$$\sum_{t=1}^{T} w_{i_t}\mathbb{1}\{w_{i_t} > T^{-1}\} \le dC + \sum_{t=d+1}^{T} 2\sqrt{\frac{d\beta_T}{t-d}} \le dC + 4\sqrt{d\beta_T T},$$
which establishes the claim. $\square$
Combining Lemma 4, Lemma 5, and the confidence parameter of §7.1 (with $\delta = 1/2T$ and discretization scale $\alpha = 1/t^2$) yields a general Bayesian regret bound of order
$$\text{BayesRegret}(T, \pi^{\rm PS}) = \tilde O\Bigl(\sqrt{\dim_E(\mathcal{F}, T^{-1})\,\dim_K(\mathcal{F})\,T}\Bigr)$$
for all $T \in \mathbb{N}$.
Example 7 (Linear Models). Consider the case of a $d$-dimensional linear model $f_\rho(a) := \langle\phi(a), \rho\rangle$ with $\gamma = \sup_{a \in \mathcal{A}}\|\phi(a)\|$ and $s = \sup_{\rho \in \Theta}\|\rho\|$. Then, $\dim_E(\mathcal{F}, \epsilon) = O(d\log(1/\epsilon))$ and $\dim_K(\mathcal{F}) = O(d)$. The preceding bound therefore yields an $O(d\sqrt{T}\log T)$ Bayesian regret bound. This is tight to within a factor of $\log T$ (Rusmevichientong and Tsitsiklis [31]), and matches the best available bound for a linear UCB algorithm (Abbasi-Yadkori et al. [2]).

One advantage of studying posterior sampling in a general framework is that it allows bounds to be derived for specific classes of models by specializing more general results. This advantage is highlighted by the ease of developing a performance guarantee for generalized linear models. The problem is reduced to one of bounding the eluder dimension, and such a bound follows almost immediately from the analysis of linear models. In the prior literature, extending results from linear to generalized linear models required significant technical developments, as presented in Filippi et al. [18].
7.3. Relation to the VC dimension. To close our section on general bounds, we discuss important differences between our new notion of eluder dimension and complexity measures used in the analysis of supervised learning problems. We begin with an example that illustrates how a class of functions that is learnable in constant time in a supervised learning context may require an arbitrarily long duration when learning to optimize.

In the preceding example, the $\epsilon$-eluder dimension is $n$ for $\epsilon \in (0, 1)$. On the other hand, the VC dimension, which characterizes the sample complexity of supervised learning, is 1. To highlight conceptual differences between the eluder dimension and the VC dimension, we will now define VC dimension in a way analogous to how we defined eluder dimension. We begin with a notion of independence.
Definition 5. The VC dimension of a class of binary-valued functions with domain $\mathcal{A}$ is the largest cardinality of a set $\mathcal{X} \subseteq \mathcal{A}$ such that every $a \in \mathcal{X}$ is VC-independent of $\mathcal{X}\setminus\{a\}$.
In the above example, any two actions are VC-dependent because knowing the label of one action could completely determine the value of the other action. However, this only happens if the sampled action has label 1. If it has label 0, one cannot infer anything about the value of the other action. Instead of capturing the possibility that one could gain useful information about the reward function through exploration, we need a stronger requirement that guarantees one will gain useful information through exploration. Such a requirement is captured by the following concept.
In this setting, the problem of choosing UCBs reduces to choosing a single confidence parameter $\beta$. For more complicated problems, however, significant analysis may be required to choose a structural form for confidence sets. The results in this section suggest that it can be quite challenging to use such analysis to derive confidence bounds that lead to strong empirical performance.
[Figure: average regret versus time period (100-1,000) for four algorithms — a: Linear UCB; b: Gaussian UCB; c: Posterior sampling; d: Gaussian UCB—Tuned heuristic.]
In particular, this is challenging even for linear models: the paper of Abbasi-Yadkori et al. [2] uses sophisticated tools from the study of multivariate self-normalized martingales to derive a confidence bound that is stronger than those of Dani et al. [16] or Rusmevichientong and Tsitsiklis [31], but their algorithm still incurs about three and a half times the regret of posterior sampling. This highlights a crucial advantage of posterior sampling that we have emphasized throughout this paper: it separates confidence bound analysis from algorithm design.
Finally, it should be noted that the algorithms of Abbasi-Yadkori et al. [2] and Srinivas et al. [35] have parameters that must be chosen by the user. We have attempted to set these values in a way that minimizes average regret over the 1,000-period time horizon. Both algorithms construct confidence bounds that hold with a prespecified probability $1 - \delta \in [0, 1]$. Higher levels of $\delta$ lead to lower UCBs, which we find improves performance. We set $\delta = 1$ to minimize the average regret of the algorithms. The algorithm of Abbasi-Yadkori et al. [2] requires two other choices. We used a line search to set the algorithm's regularization parameter to the value which minimizes cumulative regret. The algorithm of Abbasi-Yadkori et al. [2] also requires a uniform upper bound on $\|\theta\|$, but the Gaussian distribution is unbounded. We avoid this issue by providing the realized value of $\|\theta\|$ as an input to the algorithm.
9. Conclusion. This paper has considered the use of a simple posterior sampling algorithm for learning to optimize actions when the decision maker is uncertain about how his actions influence performance. We argue that, particularly for difficult problem instances, this algorithm offers significant potential advantages because of its design simplicity and computational tractability. Despite its great potential, not much is known about posterior sampling when there are dependencies between actions. Our work has taken a significant step toward filling this gap. We showed that the Bayesian regret of posterior sampling can be decomposed in terms of confidence bounds, which allowed us to establish a number of new results on posterior sampling by leveraging prior work on UCB algorithms. We then used this regret decomposition to analyze posterior sampling in a very general framework and developed Bayesian regret bounds that depend on a new notion of dimension.

In constructing these bounds, we have identified two factors that control the hardness of a particular problem: an agent's ability to quickly attain near-optimal performance depends on the extent to which the reward value at one action can be inferred by sampling other actions. However, to select an action, the agent must reach conclusions about many possible actions, and an error in its evaluation of any one could result in large regret. Our second measure of complexity controls for the difficulty of maintaining appropriate confidence sets simultaneously at every action. While our bounds are nearly tight in some cases, further analysis is likely to yield stronger results in other cases. We hope, however, that our work provides a conceptual foundation for the study of such problems and inspires further investigation.
Acknowledgments. Daniel Russo is supported by a Burt and Deedee McMurtry Stanford Graduate Fellowship. This work was supported, in part, by the National Science Foundation Award [CMMI-0968707].
Appendix A. Details regarding Lemma 1. Lemma 1 follows as a special case of Theorem 1 of Abbasi-Yadkori et al. [2], which is much more general. Note that because the reward noise $R_t - f_\theta(A_t)$ is bounded in $[-1, 1]$, it is sub-Gaussian. Equation (12) in Abbasi-Yadkori et al. [2] gives a specialization of Theorem 1 to the problem we consider. It shows that for all $\delta > 0$, with probability at least $1 - \delta$, the deviation $\bigl|\sum_{k=1}^{t}\mathbb{1}(A_k = a)(R_k - f_\theta(a))\bigr|$ is bounded, simultaneously for all $t$, by a term of order $\sqrt{(1 + N_t(a))\log\bigl(\sqrt{1 + N_t(a)}/\delta\bigr)}$. We choose $\delta = 1/T$ and use that for $t \le T$, $N_t(a) \le T - 1$, to obtain a deviation bound that holds with probability at least $1 - 1/T$. Since $1 + N_t(a) \le 2N_t(a)$ whenever $a$ has been played at least once, with probability at least $1 - 1/T$ the confidence bounds defined in (5) contain $f_\theta(a)$ for all $t \le T$.
B.1. Preliminaries: Martingale exponential inequalities. Consider random variables $(Z_n \mid n \in \mathbb{N})$ adapted to the filtration $(\mathcal{H}_n : n = 0, 1, \ldots)$. Assume $\mathbb{E}[\exp\{\lambda Z_i\}]$ is finite for all $\lambda$. Define the conditional mean $\mu_i = \mathbb{E}[Z_i \mid \mathcal{H}_{i-1}]$, and define the conditional cumulant generating function of the centered random variable $[Z_i - \mu_i]$ by $\psi_i(\lambda) = \log \mathbb{E}[\exp(\lambda[Z_i - \mu_i]) \mid \mathcal{H}_{i-1}]$. Let
$$M_n(\lambda) = \exp\bigg\{\sum_{i=1}^{n} \big(\lambda[Z_i - \mu_i] - \psi_i(\lambda)\big)\bigg\}.$$

Lemma 6. $(M_n(\lambda) \mid n \in \mathbb{N})$ is a martingale and $\mathbb{E}[M_n(\lambda)] = 1$.

Proof. By definition,
$$\mathbb{E}[M_1(\lambda) \mid \mathcal{H}_0] = \mathbb{E}\big[\exp\{\lambda[Z_1 - \mu_1] - \psi_1(\lambda)\} \mid \mathcal{H}_0\big] = 1.$$
Then, for any $n \ge 2$,
$$\begin{aligned}
\mathbb{E}[M_n(\lambda) \mid \mathcal{H}_{n-1}] &= \mathbb{E}\bigg[\exp\bigg\{\sum_{i=1}^{n-1} \big(\lambda[Z_i - \mu_i] - \psi_i(\lambda)\big)\bigg\}\exp\big\{\lambda[Z_n - \mu_n] - \psi_n(\lambda)\big\} \,\bigg|\, \mathcal{H}_{n-1}\bigg] \\
&= \exp\bigg\{\sum_{i=1}^{n-1} \big(\lambda[Z_i - \mu_i] - \psi_i(\lambda)\big)\bigg\}\, \mathbb{E}\big[\exp\{\lambda[Z_n - \mu_n] - \psi_n(\lambda)\} \mid \mathcal{H}_{n-1}\big] \\
&= \exp\bigg\{\sum_{i=1}^{n-1} \big(\lambda[Z_i - \mu_i] - \psi_i(\lambda)\big)\bigg\} = M_{n-1}(\lambda). \qquad \square
\end{aligned}$$

Lemma 7. For all $x \ge 0$ and $\lambda \ge 0$,
$$\mathbb{P}\bigg(\sum_{i=1}^{n} \lambda Z_i \le x + \sum_{i=1}^{n} \big[\lambda\mu_i + \psi_i(\lambda)\big] \ \ \forall n \in \mathbb{N}\bigg) \ge 1 - e^{-x}.$$

Proof. For any $\lambda$, $M_n(\lambda)$ is a martingale with $\mathbb{E}[M_n(\lambda)] = 1$. Therefore, for any stopping time $\tau$, $\mathbb{E}[M_{\tau \wedge n}(\lambda)] = 1$. For arbitrary $x \ge 0$, define $\tau_x = \inf\{n \ge 0 \mid M_n(\lambda) \ge x\}$ and note that $\tau_x$ is a stopping time corresponding to the first time $M_n(\lambda)$ crosses the boundary at $x$. Then $\mathbb{E}[M_{\tau_x \wedge n}(\lambda)] = 1$ and by Markov's inequality,
$$\mathbb{P}\big(M_{\tau_x \wedge n}(\lambda) \ge x\big) \le \frac{\mathbb{E}[M_{\tau_x \wedge n}(\lambda)]}{x} = \frac{1}{x}.$$
We note that the event $\{M_{\tau_x \wedge n}(\lambda) \ge x\} = \bigcup_{k=1}^{n}\{M_k(\lambda) \ge x\}$. So we have shown that for all $x > 0$ and $n \ge 1$,
$$\mathbb{P}\bigg(\bigcup_{k=1}^{n}\{M_k(\lambda) \ge x\}\bigg) \le \frac{1}{x}.$$
Replacing $x$ with $e^x$, letting $n \to \infty$, and taking the logarithm inside the event $\{M_k(\lambda) \le e^x\}$ yields the statement of the lemma. $\square$
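Lemma 7 is easy to check numerically. The sketch below, which assumes NumPy, simulates i.i.d. standard Gaussian increments, for which $\mu_i = 0$ and $\psi_i(\lambda) = \lambda^2/2$, and estimates the probability that the bound fails at some time before a fixed horizon; by the lemma, this estimate should not exceed $e^{-x}$.

```python
import numpy as np

def violation_rate(lam=0.5, x=2.0, horizon=1000, trials=20000, seed=0):
    """Estimate P(exists n <= horizon: sum_i lam*Z_i > x + sum_i psi_i(lam))
    for i.i.d. N(0, 1) increments Z_i, where psi_i(lam) = lam**2 / 2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((trials, horizon))
    lhs = lam * np.cumsum(z, axis=1)                     # sum_i lam * Z_i
    rhs = x + np.arange(1, horizon + 1) * lam ** 2 / 2   # x + sum_i psi_i(lam)
    return (lhs > rhs).any(axis=1).mean()

# The empirical violation rate should sit below e^{-x} = e^{-2} here.
print(violation_rate(), "<=", np.exp(-2.0))
```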
The next step is to show that, for any fixed $f \in \mathcal{F}$, with probability at least $1 - \delta$,
$$L_{2,t}(f) \ge L_{2,t}(f_\theta) + \tfrac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log(1/\delta) \qquad \text{simultaneously for all } t \in \mathbb{N}.$$
We will transform our problem to apply the general exponential martingale result shown above. We set $\mathcal{H}_{t-1}$ to be the $\sigma$-algebra generated by $(H_t, A_t, \theta)$. By previous assumptions, $\epsilon_t := R_t - f_\theta(A_t)$ satisfies $\mathbb{E}[\epsilon_t \mid \mathcal{H}_{t-1}] = 0$ and $\mathbb{E}[\exp\{\lambda\epsilon_t\} \mid \mathcal{H}_{t-1}] \le \exp\{\lambda^2\sigma^2/2\}$ a.s. for all $\lambda$. Define $Z_t = (f_\theta(A_t) - R_t)^2 - (f(A_t) - R_t)^2$.

Proof. By definition, $\sum_{k=1}^{t} Z_k = L_{2,t+1}(f_\theta) - L_{2,t+1}(f)$. Some calculation shows that $Z_t = -(f(A_t) - f_\theta(A_t))^2 + 2(f(A_t) - f_\theta(A_t))\,\epsilon_t$. Therefore the conditional mean and conditional cumulant generating function satisfy
$$\mu_t = \mathbb{E}[Z_t \mid \mathcal{H}_{t-1}] = -(f(A_t) - f_\theta(A_t))^2,$$
$$\psi_t(\lambda) = \log\mathbb{E}\big[\exp\big(2\lambda(f(A_t) - f_\theta(A_t))\epsilon_t\big) \mid \mathcal{H}_{t-1}\big] \le \frac{\big(2\lambda[f(A_t) - f_\theta(A_t)]\big)^2\sigma^2}{2}.$$
Applying Lemma 7 and rearranging terms shows that, for all $x \ge 0$ and $\lambda \ge 0$,
$$\mathbb{P}\bigg(\sum_{k=1}^{t} Z_k \le \frac{x}{\lambda} + \sum_{k=1}^{t}\big(f(A_k) - f_\theta(A_k)\big)^2(2\lambda\sigma^2 - 1) \ \ \forall t \in \mathbb{N}\bigg) \ge 1 - e^{-x}.$$
Choosing $\lambda = 1/(4\sigma^2)$ and $x = \log(1/\delta)$, and using $\sum_{k=1}^{t} Z_k = L_{2,t+1}(f_\theta) - L_{2,t+1}(f)$ together with the definition of $\|\cdot\|_{2,E_t}$, gives the claim. $\square$

Applying this result with $\delta/|\mathcal{F}^\alpha|$ in place of $\delta$ to each element $f^\alpha$ of a finite $\alpha$-cover $\mathcal{F}^\alpha$ of $\mathcal{F}$, together with a union bound, shows that with probability at least $1 - \delta$, for all $t \in \mathbb{N}$ and $f^\alpha \in \mathcal{F}^\alpha$,
$$L_{2,t}(f^\alpha) - L_{2,t}(f_\theta) \ge \tfrac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log\big(|\mathcal{F}^\alpha|/\delta\big).$$
Therefore, with probability at least $1 - \delta$, for all $t \in \mathbb{N}$ and $f \in \mathcal{F}$,
$$L_{2,t}(f) - L_{2,t}(f_\theta) \ge \tfrac{1}{2}\|f - f_\theta\|_{2,E_t}^2 - 4\sigma^2\log\big(|\mathcal{F}^\alpha|/\delta\big) + \Big(\tfrac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - \tfrac{1}{2}\|f - f_\theta\|_{2,E_t}^2 + L_{2,t}(f) - L_{2,t}(f^\alpha)\Big),$$
where $f^\alpha$ denotes an element of the cover with $\|f - f^\alpha\|_\infty \le \alpha$. The term in parentheses is a discretization error, which the next lemma controls.
Discretization error
Lemma 8. If $f^\alpha$ satisfies $\|f - f^\alpha\|_\infty \le \alpha$, then with probability at least $1 - \delta$,
$$\Big|\tfrac{1}{2}\|f^\alpha - f_\theta\|_{2,E_t}^2 - \tfrac{1}{2}\|f - f_\theta\|_{2,E_t}^2 + L_{2,t}(f) - L_{2,t}(f^\alpha)\Big| \le \alpha t\Big[8C + \sqrt{8\sigma^2\ln(4t^2/\delta)}\Big] \qquad \forall\, t \in \mathbb{N}. \tag{B1}$$

Proof. Since any two functions $f, f^\alpha \in \mathcal{F}$ satisfy $\|f - f^\alpha\|_\infty \le C$, it is enough to consider $\alpha \le C$. We find
$$\big|(f^\alpha)^2(a) - f^2(a)\big| \le \max_{-\alpha \le y \le \alpha}\big|(f(a) + y)^2 - f(a)^2\big| = 2|f(a)|\alpha + \alpha^2 \le 2C\alpha + \alpha^2,$$
which bounds each squared term entering the left-hand side of (B1). Since $|R_t| \le C + |\epsilon_t|$, this shows that with probability at least $1 - \delta$ the discretization error is bounded for all $t$ by $\alpha\eta_t$, where $\eta_t := t\big[8C + 2\sqrt{2\sigma^2\ln(4t^2/\delta)}\big]$. $\square$
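Together, the two results above justify a least-squares confidence set for $f_\theta$. The Python sketch below constructs such a set in the simplest special case of a finite function class with $\alpha = 0$, where the discretization term vanishes and the radius reduces to $8\sigma^2\log(|\mathcal{F}|/\delta)$; this simplification is an assumption made for illustration, not the paper's general covering-number construction, and the interface is ours.

```python
import numpy as np

def ls_confidence_set(F_values, action_indices, rewards, sigma, delta):
    """Confidence set for a finite function class.

    F_values: array of shape (num_functions, num_actions), F_values[j, a] = f_j(a).
    action_indices: indices of the actions A_1, ..., A_{t-1} played so far.
    rewards: the observed rewards R_1, ..., R_{t-1}.
    Returns indices of the functions whose empirical squared distance to the
    least-squares fit is at most beta = 8 * sigma^2 * log(|F| / delta)."""
    preds = F_values[:, action_indices]               # f_j(A_k) for each j, k
    losses = ((preds - rewards) ** 2).sum(axis=1)     # squared-error loss of each f_j
    f_hat = int(losses.argmin())                      # empirical least-squares minimizer
    beta = 8 * sigma ** 2 * np.log(len(F_values) / delta)
    dist_sq = ((preds - preds[f_hat]) ** 2).sum(axis=1)
    return np.flatnonzero(dist_sq <= beta)
```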
Appendix C. Bounds on eluder dimension for common function classes. Definition 3, which defines the eluder dimension of a class of functions, can be equivalently written as follows: the $\epsilon$-eluder dimension of a class of functions $\mathcal{F}$ is the length of the longest sequence $a_1, \ldots, a_\tau$ such that for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors.
C.1. Finite action spaces. If the action set $\mathcal{A}$ is finite, then any action that has already appeared in the sequence is $\epsilon'$-dependent on its predecessors, so such a sequence cannot contain repeated actions and $\dim_E(\mathcal{F}, \epsilon) \le |\mathcal{A}|$ for any class $\mathcal{F}$.
For the parameterized classes considered next, given a sequence $a_1, \ldots, a_\tau$, define
$$w_k := \sup\Big\{(f_{\rho_1} - f_{\rho_2})(a_k) : \rho_1, \rho_2 \in \Theta,\ \textstyle\sum_{i=1}^{k-1}(f_{\rho_1} - f_{\rho_2})^2(a_i) \le (\epsilon')^2\Big\}. \tag{C1}$$
Bounding the eluder dimension then reduces to bounding the number of times $w_k \ge \epsilon'$ can occur along any such sequence.
Proposition 11. Suppose $\Theta \subset \mathbb{R}^d$ and $f_\theta(a) = \theta^{\mathrm T}\phi(a)$. Assume there exist constants $\gamma$ and $S$ such that for all $a \in \mathcal{A}$ and $\rho \in \Theta$, $\|\rho\|_2 \le S$ and $\|\phi(a)\|_2 \le \gamma$. Then $\dim_E(\mathcal{F}, \epsilon) \le 3d\,\frac{e}{e-1}\ln\{3 + 3(2S/\epsilon)^2\} + 1$.
To simplify the notation, define $w_k$ as in (C1), $\phi_k = \phi(a_k)$, $\rho = \rho_1 - \rho_2$, and $\Phi_k := \sum_{i=1}^{k-1}\phi_i\phi_i^{\mathrm T}$. In this case, $\sum_{i=1}^{k-1}(f_{\rho_1} - f_{\rho_2})^2(a_i) = \rho^{\mathrm T}\Phi_k\rho$, and by the triangle inequality $\|\rho\|_2 \le 2S$. The proof follows by bounding the number of times $w_k \ge \epsilon'$ can occur.
Step 1. If $w_k \ge \epsilon'$, then $\phi_k^{\mathrm T} V_k^{-1}\phi_k \ge \tfrac{1}{2}$, where $V_k := \Phi_k + \lambda I$ and $\lambda = (\epsilon'/(2S))^2$.
Proof. We find $w_k \le \max\{\rho^{\mathrm T}\phi_k : \rho^{\mathrm T}\Phi_k\rho \le (\epsilon')^2,\ \rho^{\mathrm T} I\rho \le (2S)^2\} \le \max\{\rho^{\mathrm T}\phi_k : \rho^{\mathrm T} V_k\rho \le 2(\epsilon')^2\} = \sqrt{2(\epsilon')^2}\,\|\phi_k\|_{V_k^{-1}}$. The second inequality follows because any $\rho$ that is feasible for the first maximization problem must satisfy $\rho^{\mathrm T} V_k\rho \le (\epsilon')^2 + \lambda(2S)^2 = 2(\epsilon')^2$. By this result, $w_k \ge \epsilon'$ implies $\|\phi_k\|_{V_k^{-1}}^2 \ge \tfrac{1}{2}$. $\square$
Step 2. If $w_i \ge \epsilon'$ for each $i < k$, then $\det V_k \ge \lambda^d(\tfrac{3}{2})^{k-1}$ and $\det V_k \le \big(\gamma^2(k-1)/d + \lambda\big)^d$.
Step 3. Complete proof.
Proof. Manipulating the result of Step 2 shows that $k$ must satisfy the inequality $(\tfrac{3}{2})^{(k-1)/d} \le \alpha_0\,[(k-1)/d] + 1$, where $\alpha_0 = \gamma^2/\lambda = (2S\gamma/\epsilon')^2$. Let $B(x, \alpha) := \max\{B : (1+x)^B \le \alpha B + 1\}$. The number of times $w_k \ge \epsilon'$ can occur is bounded by $dB(\tfrac{1}{2}, \alpha_0) + 1$.
We now derive an explicit bound on $B(x, \alpha)$ for any $x \le 1$. Note that any $B \ge 1$ must satisfy the inequality $B\ln\{1+x\} \le \ln\{1+\alpha\} + \ln B$. Since $\ln\{1+x\} \ge x/(1+x)$, using the transformation of variables $y = B[x/(1+x)]$ gives
$$y \le \ln\{1+\alpha\} + \ln\Big(\frac{1+x}{x}\Big) + \ln y \ \Longrightarrow\ y \le \frac{e}{e-1}\Big(\ln\{1+\alpha\} + \ln\Big(\frac{1+x}{x}\Big)\Big).$$
This implies $B(x, \alpha) \le \frac{1+x}{x}\,\frac{e}{e-1}\big(\ln\{1+\alpha\} + \ln\frac{1+x}{x}\big)$, and the claimed bound follows by taking $\alpha = \alpha_0$ and $x = \tfrac{1}{2}$. $\square$
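The quantity $B(x, \alpha)$ can also be evaluated directly, which gives a quick sanity check on the explicit bound just derived. The short Python sketch below does so by brute force for small inputs; the function names are ours, not the paper's.

```python
import math

def B_exact(x, alpha):
    """Largest integer B with (1 + x)**B <= alpha * B + 1."""
    B = 0
    while (1 + x) ** (B + 1) <= alpha * (B + 1) + 1:
        B += 1
    return B

def B_upper(x, alpha):
    """Explicit bound ((1+x)/x) * (e/(e-1)) * (ln(1+alpha) + ln((1+x)/x))."""
    c = (1 + x) / x
    return c * math.e / (math.e - 1) * (math.log(1 + alpha) + math.log(c))

# Example with x = 1/2 and alpha = 400, e.g. alpha_0 = (2*S/eps')**2 for S = 1, eps' = 0.1.
print(B_exact(0.5, 400), "<=", B_upper(0.5, 400))
```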
Proposition 12. Suppose $\Theta \subset \mathbb{R}^d$ and $f_\theta(a) = g(\theta^{\mathrm T}\phi(a))$, where $g(\cdot)$ is a differentiable and strictly increasing function. Assume that there exist constants $\underline{h}$, $\bar{h}$, $\gamma$, and $S$ such that for all $a \in \mathcal{A}$ and $\rho \in \Theta$, $0 < \underline{h} \le g'(\rho^{\mathrm T}\phi(a)) \le \bar{h}$, $\|\rho\|_2 \le S$, and $\|\phi(a)\|_2 \le \gamma$. Then, writing $r := \bar{h}/\underline{h}$,
$$\dim_E(\mathcal{F}, \epsilon) \le 3d\,r^2\,\frac{e}{e-1}\ln\Big\{3r^2 + 3r^2\Big(\frac{2S\bar{h}}{\epsilon}\Big)^2\Big\} + 1.$$
The proof follows three steps that closely mirror those used to prove Proposition 11.
Step 1. If $w_k \ge \epsilon'$, then $\phi_k^{\mathrm T} V_k^{-1}\phi_k \ge 1/(2r^2)$, where $V_k := \Phi_k + \lambda I$ and $\lambda = (\epsilon'/(2S\bar{h}))^2$.
Proof. By definition, $w_k \le \max\{g(\rho_1^{\mathrm T}\phi_k) - g(\rho_2^{\mathrm T}\phi_k) : \sum_{i=1}^{k-1}\big(g(\rho_1^{\mathrm T}\phi_i) - g(\rho_2^{\mathrm T}\phi_i)\big)^2 \le (\epsilon')^2,\ \rho^{\mathrm T} I\rho \le (2S)^2\}$. By the uniform bounds on $g'(\cdot)$, this is less than $\max\{\bar{h}\rho^{\mathrm T}\phi_k : \underline{h}^2\rho^{\mathrm T}\Phi_k\rho \le (\epsilon')^2,\ \rho^{\mathrm T} I\rho \le (2S)^2\} \le \max\{\bar{h}\rho^{\mathrm T}\phi_k : \underline{h}^2\rho^{\mathrm T} V_k\rho \le 2(\epsilon')^2\} = \sqrt{2(\epsilon')^2 r^2}\,\|\phi_k\|_{V_k^{-1}}$. By this result, $w_k \ge \epsilon'$ implies $\|\phi_k\|_{V_k^{-1}}^2 \ge 1/(2r^2)$. $\square$
Step 2. If $w_i \ge \epsilon'$ for each $i < k$, then $\det V_k \ge \lambda^d\big(1 + 1/(2r^2)\big)^{k-1}$ and $\det V_k \le \big(\gamma^2(k-1)/d + \lambda\big)^d$.
Step 3. Complete proof.
Proof. The above inequalities imply that $k$ must satisfy $\big(1 + 1/(2r^2)\big)^{(k-1)/d} \le \alpha_0\,[(k-1)/d] + 1$, where $\alpha_0 = \gamma^2/\lambda$. Therefore, as in the linear case, the number of times $w_k \ge \epsilon'$ can occur is bounded by $dB(1/(2r^2), \alpha_0) + 1$. Plugging these constants into the earlier bound $B(x, \alpha) \le \frac{1+x}{x}\,\frac{e}{e-1}\big(\ln\{1+\alpha\} + \ln\frac{1+x}{x}\big)$ and using $1 + x \le 3/2$ yields the result. $\square$
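Both proofs rest on counting how often $\phi_k^{\mathrm T} V_k^{-1}\phi_k$ exceeds a threshold along the sequence of feature vectors. The Python sketch below counts these events for the linear case of Proposition 11 with threshold $1/2$; it illustrates the proof mechanics under the stated assumptions rather than providing a general procedure for computing the eluder dimension. By Steps 2 and 3, the returned count can never exceed $dB(\tfrac{1}{2}, \alpha_0) + 1$, no matter how the sequence is chosen.

```python
import numpy as np

def count_large_information_steps(phis, eps_prime, S):
    """Count the indices k with phi_k^T V_k^{-1} phi_k >= 1/2 along a feature sequence.

    phis: array of shape (T, d), the feature vectors phi(a_1), ..., phi(a_T).
    eps_prime, S: the constants epsilon' and S from the proof of Proposition 11;
    the regularizer is lambda = (eps_prime / (2 * S))**2 as in Step 1."""
    T, d = phis.shape
    lam = (eps_prime / (2 * S)) ** 2
    V = lam * np.eye(d)
    count = 0
    for phi in phis:
        if phi @ np.linalg.solve(V, phi) >= 0.5:
            count += 1
        V += np.outer(phi, phi)   # V_{k+1} = V_k + phi_k phi_k^T
    return count
```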
References
[1] Abbasi-Yadkori Y, Antos A, Szepesvári C (2009) Forced-exploration based algorithms for playing in stochastic linear bandits. COLT Workshop: On-line Learn. Limited Feedback.
[2] Abbasi-Yadkori Y, Pál D, Szepesvári C (2011) Improved algorithms for linear stochastic bandits. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS) (Curran Associates, Red Hook, NY), 2312-2320.
[3] Abbasi-Yadkori Y, Pál D, Szepesvári C (2012) Online-to-confidence-set conversions and application to sparse stochastic bandits. Lawrence ND, Girolami MA, eds. Proc. 15th Internat. Conf. Artificial Intelligence Statist. (AISTATS), JMLR Workshop and Conference Proceedings, Vol. 22, 1-9.
[4] Agrawal S, Goyal N (2012) Analysis of Thompson sampling for the multi-armed bandit problem. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory (COLT), JMLR Workshop and Conference Proceedings, Vol. 23, 39.1-39.26.
[5] Agrawal S, Goyal N (2012) Further optimal regret bounds for Thompson sampling. arXiv preprint arXiv:1209.3353.
[6] Agrawal S, Goyal N (2012) Thompson sampling for contextual bandits with linear payoffs. arXiv preprint arXiv:1209.3352.
[7] Amin K, Kearns M, Syed U (2011) Bandits, query learning, and the haystack dimension. Kakade SM, von Luxburg U, eds. Proc. 24th Annual Conf. Learn. Theory (COLT), JMLR Workshop and Conference Proceedings, Vol. 19, 87-106.
[8] Audibert J-Y, Bubeck S (2009) Minimax policies for adversarial and stochastic bandits. Proc. 22nd Annual Conf. Learn. Theory (COLT) (Omnipress, Madison, WI), 773-818.
[9] Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learn. 47(2):235-256.
[10] Beygelzimer A, Langford J, Li L, Reyzin L, Schapire RE (2011) Contextual bandit algorithms with supervised learning guarantees. Proc. 14th Internat. Conf. Artificial Intelligence Statist. (AISTATS), JMLR Workshop and Conference Proceedings, Vol. 15, 19-26.
[11] Bubeck S, Cesa-Bianchi N (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721.
[12] Bubeck S, Liu C-Y (2013) Prior-free and prior-dependent regret bounds for Thompson sampling. Adv. Neural Inform. Processing Systems (NIPS), 638-646.
[13] Bubeck S, Munos R, Stoltz G, Szepesvári C (2011) X-armed bandits. J. Machine Learn. Res. 12:1655-1695.
[14] Cappé O, Garivier A, Maillard O-A, Munos R, Stoltz G (2013) Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist. 41(3):1516-1541.
[15] Chapelle O, Li L (2011) An empirical evaluation of Thompson sampling. Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira FCN, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS) (Curran Associates, Red Hook, NY), 2249-2257.
[16] Dani V, Hayes TP, Kakade SM (2008) Stochastic linear optimization under bandit feedback. Proc. 21st Annual Conf. Learn. Theory (COLT) (Omnipress, Madison, WI), 355-366.
[17] Deshpande Y, Montanari A (2012) Linear bandits in high dimension and recommendation systems. 50th Annual Allerton Conf. Communication, Control, Comput. (IEEE, Piscataway, NJ), 1750-1754.
[18] Filippi S, Cappé O, Garivier A, Szepesvári C (2010) Parametric bandits: The generalized linear case. Lafferty J, Williams C, Shawe-Taylor J, Zemel RS, Culotta A, eds. Adv. Neural Inform. Processing Systems (NIPS) (Curran Associates, Red Hook, NY), 586-594.
[19] Gittins JC, Jones DM (1979) A dynamic allocation index for the discounted multiarmed bandit problem. Biometrika 66(3):561-565.
[20] Gittins J, Glazebrook K, Weber R (2011) Multi-Armed Bandit Allocation Indices (John Wiley & Sons, Chichester, UK).
[21] Gopalan A, Mannor S, Mansour Y (2013) Thompson sampling for complex bandit problems. arXiv preprint arXiv:1311.0466.
[22] Kaufmann E, Korda N, Munos R (2012) Thompson sampling: An asymptotically optimal finite time analysis. Bshouty NH, Stoltz G, Vayatis N, Zeugmann T, eds. Algorithmic Learn. Theory, Lecture Notes in Computer Science, Vol. 7568 (Springer-Verlag, Berlin Heidelberg), 199-213.
[23] Kleinberg R, Slivkins A, Upfal E (2008) Multi-armed bandits in metric spaces. Proc. 40th ACM Sympos. Theory Comput. (ACM, New York), 681-690.
[24] Korda N, Kaufmann E, Munos R (2013) Thompson sampling for one-dimensional exponential family bandits. Burges CJC, Bottou L, Ghahramani Z, Weinberger KQ, eds. Adv. Neural Inform. Processing Systems (NIPS), 1448-1456.
[25] Lai TL (1987) Adaptive treatment allocation and the multi-armed bandit problem. Ann. Statist. 15(3):1091-1114.
[26] Lai TL, Robbins H (1985) Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6(1):4-22.
[27] Lehmann EL, Casella G (1998) Theory of Point Estimation, 2nd ed. (Springer-Verlag, New York).
[28] Li L (2013) Generalized Thompson sampling for contextual bandits. arXiv preprint arXiv:1310.7163.
[29] Li L, Chapelle O (2012) Open problem: Regret bounds for Thompson sampling. Mannor S, Srebro N, Williamson RC, eds. Proc. 25th Annual Conf. Learn. Theory (COLT), JMLR Workshop and Conference Proceedings, Vol. 23, 43.1-43.3.
[30] May BC, Korda N, Lee A, Leslie DS (2012) Optimistic Bayesian sampling in contextual-bandit problems. J. Machine Learn. Res. 13(1):2069-2106.
[31] Rusmevichientong P, Tsitsiklis JN (2010) Linearly parameterized bandits. Math. Oper. Res. 35(2):395-411.
[32] Ryzhov IO, Powell WB, Frazier PI (2012) The knowledge gradient algorithm for a general class of online learning problems. Oper. Res. 60(1):180-195.
[33] Sahni S (1974) Computationally related problems. SIAM J. Comput. 3(4):262-279.
[34] Scott SL (2010) A modern Bayesian look at the multi-armed bandit. Appl. Stochastic Models Bus. Indust. 26(6):639-658.
[35] Srinivas N, Krause A, Kakade SM, Seeger M (2012) Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Trans. Inform. Theory 58(5):3250-3265.
[36] Valko M, Carpentier A, Munos R (2013) Stochastic simultaneous optimistic optimization. Dasgupta S, McAllester D, eds. Proc. 30th Internat. Conf. Machine Learn. (ICML-13), JMLR Workshop and Conference Proceedings, Vol. 28, 19-27.