On the Problem of Best Arm Retention

Chen, Houshuang; He, Yuchen; Zhang, Chihao

doi:10.1007/978-981-97-7752-5_1

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14752)

Included in the following conference series:

International Workshop on Frontiers in Algorithmics


Abstract

This paper presents a comprehensive study of the Best Arm Retention (BAR) problem in stochastic multi-armed bandit settings: after some trials, the learner must retain m of the n arms such that the best arm is among those retained. We study the problem from several perspectives.

We begin by revisiting the lower bound for the $(\varepsilon ,\delta )$-PAC algorithm for Best Arm Identification (BAI), where we remove the previously imposed restriction of $\delta <0.5$ in the lower bound found in the literature.
By refining the technique above, we obtain optimal bounds for $(\varepsilon ,\delta )$-PAC algorithms for BAR.
We further study another variant of the problem, called r-BAR, which has recently found applications in streaming algorithms for multi-armed bandits. The goal of the r-BAR problem is to ensure the expected gap between the best arm and the optimal arm retained is less than r. We prove tight sample complexity for the problem.
We explore the regret minimization problem for r-BAR and develop an algorithm that goes beyond pure exploration. We also propose a conjecture regarding the optimal regret in this setting.

Notes

1. T can be a stopping time.
2. We use mean gap to refer to the mean difference between $i^*$ and the other fixed arm, and expected gap to denote the expected difference in means between $i^*$ and the optimal arm of an arm subset, where the randomness of the expectation arises from the arm subset.

References

1. Assadi, S., Wang, C.: Single-pass streaming lower bounds for multi-armed bandits exploration with instance-sensitive sample complexity. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2022)
2. Audibert, J., Bubeck, S., Munos, R.: Best arm identification in multi-armed bandits. In: Proceedings of the 23rd Conference on Learning Theory (COLT) (2010)
3. Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in multi-armed bandits problems. In: Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT) (2009)
4. Chen, C.H., He, D., Fu, M., Lee, L.H.: Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4), 579–595 (2008). https://doi.org/10.1287/IJOC.1080.0268
5. Chen, H., He, Y., Zhang, C.: On interpolating experts and multi-armed bandits. arXiv preprint arXiv:2307.07264 (2023). https://doi.org/10.48550/ARXIV.2307.07264
6. Chen, L., Li, J., Qiao, M.: Towards instance optimal bounds for best arm identification. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)
7. Chen, S., Lin, T., King, I., Lyu, M.R., Chen, W.: Combinatorial pure exploration of multi-armed bandits. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS) (2014)
8. Degenne, R., Ménard, P., Shang, X., Valko, M.: Gamification of pure exploration for linear bandits. In: Proceedings of the 37th International Conference on Machine Learning (ICML) (2020)
9. Even-Dar, E., Mannor, S., Mansour, Y.: Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7, 1079–1105 (2006)
10. Fiez, T., Jain, L., Jamieson, K.G., Ratliff, L.: Sequential experimental design for transductive linear bandits. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
11. Garivier, A., Kaufmann, E.: Optimal best arm identification with fixed confidence. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)
12. He, Y., Ye, Z., Zhang, C.: Understanding memory-regret trade-off for streaming stochastic multi-armed bandits. arXiv preprint arXiv:2405.19752 (2024)
13. Howard, S.R., Ramdas, A.: Sequential estimation of quantiles with applications to A/B testing and best-arm identification. Bernoulli 28(3), 1704–1728 (2022)
14. Jamieson, K., Malloy, M., Nowak, R., Bubeck, S.: lil’UCB: an optimal exploration algorithm for multi-armed bandits. In: Proceedings of the 27th Conference on Learning Theory (COLT) (2014)
15. Jourdan, M., Degenne, R., Kaufmann, E.: An $\varepsilon $-best-arm identification algorithm for fixed-confidence and beyond. In: Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS) (2023)
16. Kalyanakrishnan, S., Stone, P.: Efficient selection of multiple bandit arms: theory and practice. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)
17. Kalyanakrishnan, S., Tewari, A., Auer, P., Stone, P.: PAC subset selection in stochastic multi-armed bandits. In: Proceedings of the 29th International Conference on Machine Learning (ICML) (2012)
18. Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on Machine Learning (ICML) (2013)
19. Kaufmann, E., Cappé, O., Garivier, A.: On the complexity of best arm identification in multi-armed bandit models. J. Mach. Learn. Res. 17, 1–42 (2016)
20. Kone, C., Kaufmann, E., Richert, L.: Bandit Pareto set identification: the fixed budget setting. arXiv preprint arXiv:2311.03992 (2023). https://doi.org/10.48550/ARXIV.2311.03992
21. Lattimore, T., Gyorgy, A.: Mirror descent and the information ratio. In: Proceedings of the 34th Conference on Learning Theory (COLT) (2021)
22. Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press, Cambridge (2020)
23. Locatelli, A., Gutzeit, M., Carpentier, A.: An optimal algorithm for the thresholding bandit problem. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)
24. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)
25. Mason, B., Jain, L., Tripathy, A., Nowak, R.: Finding all $\epsilon $-good arms in stochastic bandits. In: Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS) (2020)
26. Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58(5), 527–535 (1952)
27. Russo, D.: Simple Bayesian algorithms for best arm identification. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)
28. Siegmund, D.: Sequential Analysis: Tests and Confidence Intervals. Springer, Heidelberg (2013)
29. Simchi-Levi, D., Wang, C., Xu, J.: On experimentation with heterogeneous subgroups: An asymptotic optimal $\delta $-weighted-PAC design. Available at SSRN 4721755 (2024)
30. Simchowitz, M., Jamieson, K., Recht, B.: The simulator: understanding adaptive sampling in the moderate-confidence regime. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)
31. Topsøe, F.: Some bounds for the logarithmic function. Inequality Theory Appl. 4, 137 (2007)
32. Wang, P.A., Tzeng, R.C., Proutiere, A.: Fast pure exploration via Frank-Wolfe. In: Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
33. You, W., Qin, C., Wang, Z., Yang, S.: Information-directed selection for top-two algorithms. In: Proceedings of the 36th Conference on Learning Theory (COLT) (2023)
34. Zhao, Y., Stephens, C., Szepesvári, C., Jun, K.S.: Revisiting simple regret: fast rates for returning a good arm. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023)


Author information

Authors and Affiliations

Shanghai Jiao Tong University, Shanghai, 200240, China
Houshuang Chen, Yuchen He & Chihao Zhang


Corresponding author

Correspondence to Chihao Zhang.

Editor information

Editors and Affiliations

The Hong Kong Polytechnic University, Hong Kong, China
Bo Li
City University of Hong Kong, Hong Kong, China
Minming Li
Chinese Academy of Sciences, Beijing, China
Xiaoming Sun

Appendices

A Proof of Lemma 1

Let $T_i(s)$ denote the index of the s-th pull of arm i for $s\le T_i$. Define the log-likelihood , abbreviated as $L_T$ when the context is clear. By applying the chain rule to $L_T$, we have

where the second equality follows from and that $r_t$ is independent of $\mathcal {F}_{t-1}$ conditioned on $a_t$. With , we apply Wald’s Lemma (see e.g. [28]) to to obtain:

(1)

The remaining task is to prove for any event $\mathcal {E}\in \mathcal {F}_T$, we reformulate the definition of $L_T$ as

Summing over all $\omega \in \mathcal {E}$, we obtain

(2)

Continuing to lower bound Eq. (2), we have

where the inequality follows from the Jensen inequality. Rearranging, we get . Similarly, . Hence, we conclude

which completes our proof in conjunction with Eq. (1).
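The argument above follows the standard change-of-measure pattern. As a sketch only, in assumed notation that need not match the paper's: let $\mathcal {H},\mathcal {H}'$ be two instances on the same arms, let $p_i,p_i'$ be the reward distributions of arm i under each, let $\mathbb {P}_{\mathcal {H}},\mathbb {P}_{\mathcal {H}'}$ be the induced probability measures on a discrete sample space, and let $\mathcal {E}$ be an event with $0<\mathbb {P}_{\mathcal {H}}(\mathcal {E})<1$. The log-likelihood ratio and its chain-rule decomposition would read

$$L_T=\log \frac{\mathbb {P}_{\mathcal {H}}(a_1,r_1,\dots ,a_T,r_T)}{\mathbb {P}_{\mathcal {H}'}(a_1,r_1,\dots ,a_T,r_T)}=\sum _{i=1}^n\sum _{s=1}^{T_i}\log \frac{p_i(r_{T_i(s)})}{p_i'(r_{T_i(s)})},$$

and Wald's Lemma gives

$$\mathbb {E}_{\mathcal {H}}[L_T]=\sum _{i=1}^n\mathbb {E}_{\mathcal {H}}[T_i]\cdot \textrm{KL}(p_i,p_i').$$

For the event step, $\mathbb {P}_{\mathcal {H}'}(\omega )=e^{-L_T(\omega )}\,\mathbb {P}_{\mathcal {H}}(\omega )$ for every outcome $\omega $, so summing over $\omega \in \mathcal {E}$ and applying Jensen's inequality to the convex function $x\mapsto e^{-x}$ yields

$$\mathbb {P}_{\mathcal {H}'}(\mathcal {E})=\mathbb {P}_{\mathcal {H}}(\mathcal {E})\cdot \mathbb {E}_{\mathcal {H}}\left[ e^{-L_T}\mid \mathcal {E}\right] \ge \mathbb {P}_{\mathcal {H}}(\mathcal {E})\cdot \exp \left( -\mathbb {E}_{\mathcal {H}}[L_T\mid \mathcal {E}]\right) ,$$

that is, $\mathbb {E}_{\mathcal {H}}[L_T\mid \mathcal {E}]\ge \log \frac{\mathbb {P}_{\mathcal {H}}(\mathcal {E})}{\mathbb {P}_{\mathcal {H}'}(\mathcal {E})}$, and the same bound holds with $\mathcal {E}$ replaced by its complement $\overline{\mathcal {E}}$. Combining the two cases by the law of total expectation gives

$$\mathbb {E}_{\mathcal {H}}[L_T]\ge \mathbb {P}_{\mathcal {H}}(\mathcal {E})\log \frac{\mathbb {P}_{\mathcal {H}}(\mathcal {E})}{\mathbb {P}_{\mathcal {H}'}(\mathcal {E})}+\mathbb {P}_{\mathcal {H}}(\overline{\mathcal {E}})\log \frac{\mathbb {P}_{\mathcal {H}}(\overline{\mathcal {E}})}{\mathbb {P}_{\mathcal {H}'}(\overline{\mathcal {E}})}=\mathfrak {d}\left( \mathbb {P}_{\mathcal {H}}(\mathcal {E}),\mathbb {P}_{\mathcal {H}'}(\mathcal {E})\right) .$$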

B Bounds of KL Divergence

We will utilize the following inequalities from [31] to bound the KL divergence.

Fact 2

The following inequalities hold.

(a)
$\log (1+x)\ge \frac{x}{1+x}, \forall x>-1$;
(b)
$\log (1+x)\ge \frac{x}{1+x}(1+\frac{x}{2+x})=\frac{2x}{2+x}, \forall x>0$;
(c)
$\log (1+x)\ge \frac{x}{1+x}\frac{2+x}{2}, \text { if } -1<x\le 0$.
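The three inequalities of Fact 2 can be sanity-checked numerically. The following is a minimal sketch, not a proof: the helper name `fact2_gaps`, the grid, and the tolerance are choices made here for illustration. It evaluates, at each grid point, the difference between $\log (1+x)$ and every lower bound that applies at that point.

```python
import math


def fact2_gaps(x: float) -> dict:
    """Return log(1+x) minus each lower bound of Fact 2 that applies at x.

    A non-negative gap means the corresponding inequality holds at x.
    (Hypothetical helper, written only for this numerical check.)
    """
    if x <= -1.0:
        raise ValueError("Fact 2 requires x > -1")
    lhs = math.log1p(x)
    gaps = {"a": lhs - x / (1.0 + x)}  # (a): valid for all x > -1
    if x > 0.0:
        gaps["b"] = lhs - 2.0 * x / (2.0 + x)  # (b): valid for x > 0
    else:
        gaps["c"] = lhs - x / (1.0 + x) * (2.0 + x) / 2.0  # (c): -1 < x <= 0
    return gaps


# Grid over (-1, 10]; the tolerance absorbs floating-point rounding near x = 0.
GRID = [-0.99 + 0.01 * k for k in range(1100)]
TOL = 1e-12
violations = [(x, name) for x in GRID
              for name, gap in fact2_gaps(x).items() if gap < -TOL]
print("violations:", violations)
```

The two forms of the bound in (b) agree because $\frac{x}{1+x}\left( 1+\frac{x}{2+x}\right) =\frac{x}{1+x}\cdot \frac{2+2x}{2+x}=\frac{2x}{2+x}$.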

Lemma 6

(Restate Lemma 2). $\mathfrak {d}(\frac{1-\delta }{2}+\frac{1}{2n},1-\delta )=\varOmega \left( \frac{1-\delta }{2}-\frac{1}{2n}\right) $ if $1-\delta =\frac{1+\varOmega (1)}{n}$.

Proof

By definition,

where the inequality follows from (a) & (b) of Fact 2.
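To illustrate the statement of Lemma 6, the sketch below fixes the hypothetical choice $1-\delta =c/n$ with $c=2$ (one concrete instance of $1+\varOmega (1)$; the constant is an assumption made here) and computes the ratio between the two sides for growing n. It uses the binary KL divergence $\mathfrak {d}(x,y)=x\log \frac{x}{y}+(1-x)\log \frac{1-x}{1-y}$. A ratio that stays bounded away from zero is consistent with the claimed $\varOmega (\cdot )$ bound, but this is a numerical illustration, not a proof.

```python
import math


def kl_bernoulli(x: float, y: float) -> float:
    """Binary KL divergence d(x, y), with the convention 0 * log 0 = 0."""
    if not (0.0 <= x <= 1.0 and 0.0 < y < 1.0):
        raise ValueError("need 0 <= x <= 1 and 0 < y < 1")
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / y)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return total


def lemma6_ratio(n: int, c: float = 2.0) -> float:
    """Ratio d(p, q) / (q - p) with q = 1 - delta = c / n and p = q/2 + 1/(2n).

    The value c = 2 is an illustrative assumption, not taken from the paper.
    """
    q = c / n
    p = q / 2.0 + 1.0 / (2.0 * n)
    return kl_bernoulli(p, q) / (q / 2.0 - 1.0 / (2.0 * n))


for n in (10, 100, 1000, 10**6):
    print(n, lemma6_ratio(n))
```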

Lemma 7

(Restate Lemma 3). Let $x_1,x_2,\dots ,x_n \in [0,1]$ have average $a:=\frac{\sum _{i}x_i}{n}$, and let $b\in [0,1]$ satisfy $a<b$. Then $\sum _{i:x_i<b}\mathfrak {d}(x_i,b)\ge n\cdot \mathfrak {d}(a,b).$

Proof

Recall from Fact 1 that $\mathfrak {d}(\cdot ,y)$ is convex for any fixed $y$. Let $S=\left\{ \,i:x_i<b\,\right\} $ and $k=\left| S\right| $; note that $k\ge 1$ because $a<b$. By the convexity of $\mathfrak {d}(\cdot ,b)$, we have $\frac{1}{k}\sum _{i\in S}\mathfrak {d}(x_i,b)\ge \mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) .$ Every $i\notin S$ has $x_i\ge b$, so $\sum _{i\in S}x_i\le an-(n-k)b$, and $\frac{an-(n-k)b}{k}<b$ because $a<b$. Since, by Fact 1, $\mathfrak {d}(x,b)>\mathfrak {d}(y,b)$ whenever $x<y<b$,

$$\sum _{i\in S}\mathfrak {d}(x_i,b)\ge k\cdot \mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) \ge k\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) .$$

Using the convexity of $\mathfrak {d}(\cdot ,b)$ again, we get

$$\frac{k}{n}\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) +\frac{n-k}{n}\cdot \mathfrak {d}(b,b)\ge \mathfrak {d}(a,b),$$

which implies $k\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) \ge n\cdot \mathfrak {d}(a,b)$ since $\mathfrak {d}(b,b)=0$.
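The inequality of Lemma 7 can be checked on random inputs. The sketch below is a numerical sanity check under assumptions made here (the helper names, the sampling ranges, and the tolerance are all illustrative), using the binary KL divergence $\mathfrak {d}(x,y)=x\log \frac{x}{y}+(1-x)\log \frac{1-x}{1-y}$. The case $b=1$ is excluded because $\mathfrak {d}(x,1)$ is infinite for $x<1$.

```python
import math
import random


def kl_bernoulli(x: float, y: float) -> float:
    """Binary KL divergence d(x, y), with the convention 0 * log 0 = 0."""
    if not (0.0 <= x <= 1.0 and 0.0 < y < 1.0):
        raise ValueError("need 0 <= x <= 1 and 0 < y < 1")
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / y)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return total


def lemma7_sides(xs, b):
    """Return (sum over x_i < b of d(x_i, b), n * d(a, b)) where a is the mean of xs."""
    n = len(xs)
    a = sum(xs) / n
    if not a < b:
        raise ValueError("Lemma 7 requires the average a to be below b")
    lhs = sum(kl_bernoulli(x, b) for x in xs if x < b)
    rhs = n * kl_bernoulli(a, b)
    return lhs, rhs


def count_violations(trials: int = 2000, seed: int = 0) -> int:
    """Count random instances on which the inequality fails (expected: 0)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(rng.randint(1, 20))]
        b = rng.uniform(0.01, 0.99)
        if sum(xs) / len(xs) >= b:
            continue  # hypothesis a < b not met; skip this instance
        lhs, rhs = lemma7_sides(xs, b)
        if lhs < rhs - 1e-9:
            bad += 1
    return bad


print("violations:", count_violations())
```

For example, with $x=(0,1)$ and $b=0.6$ the average is $a=0.5$, only $x_1=0$ lies below b, and the left-hand side $\mathfrak {d}(0,0.6)=\log 2.5$ comfortably exceeds $2\,\mathfrak {d}(0.5,0.6)$.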

Lemma 8

(Restate Lemma 4). For any $0<a<b<1$, if $\frac{b-a}{a}={\Omega }(1)$, then $\mathfrak {d}(b,a)={\Omega }\left( b\cdot \log \frac{b}{a}\right) .$

Proof

By definition of the KL divergence, $\mathfrak {d}(b,a)=b\log \frac{b}{a}+(1-b)\log \frac{1-b}{1-a}.$

By Fact 2 (b) & (c),

$$b\log \frac{b}{a}=b\log \left( 1+\frac{b-a}{a}\right) \ge (b-a)\left( 1+\frac{(b-a)/a}{2+(b-a)/a}\right) $$

and

$$(1-b)\log \frac{1-b}{1-a}=(1-b)\log \left( 1+\frac{a-b}{1-a}\right) \ge -(b-a)\left( 1-\frac{b-a}{2(1-a)}\right) . $$

Therefore if $r:=\frac{b-a}{a}={\Omega }(1)$,

$$\begin{aligned} \mathfrak {d}(b,a) &= \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a}+\frac{1}{1+r/(2+r)}b\log \frac{b}{a}+(1-b)\log \frac{1-b}{1-a}\\ &\ge \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a} +(b-a)-(b-a)\left( 1-\frac{b-a}{2(1-a)}\right) \\ &\ge \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a}. \end{aligned}$$
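The last line of the derivation gives an explicit constant: $1-\frac{1}{1+r/(2+r)}=\frac{r}{2+2r}$, so $\mathfrak {d}(b,a)\ge \frac{r}{2(1+r)}\cdot b\log \frac{b}{a}$, and the factor is bounded below by a positive constant when $r={\Omega }(1)$. The sketch below checks this explicit inequality on a grid; the helper names, grid, and tolerance are assumptions made here for illustration.

```python
import math


def kl_bernoulli(x: float, y: float) -> float:
    """Binary KL divergence d(x, y), with the convention 0 * log 0 = 0."""
    if not (0.0 <= x <= 1.0 and 0.0 < y < 1.0):
        raise ValueError("need 0 <= x <= 1 and 0 < y < 1")
    total = 0.0
    if x > 0.0:
        total += x * math.log(x / y)
    if x < 1.0:
        total += (1.0 - x) * math.log((1.0 - x) / (1.0 - y))
    return total


def lemma8_constant(r: float) -> float:
    """Simplified form of 1 - 1 / (1 + r / (2 + r))."""
    return r / (2.0 + 2.0 * r)


def lemma8_sides(a: float, b: float):
    """Return (d(b, a), r / (2 + 2r) * b * log(b / a)) for 0 < a < b < 1."""
    if not 0.0 < a < b < 1.0:
        raise ValueError("need 0 < a < b < 1")
    r = (b - a) / a
    return kl_bernoulli(b, a), lemma8_constant(r) * b * math.log(b / a)


# Check every pair a < b on the grid {0.01, 0.02, ..., 0.99}.
POINTS = [k / 100.0 for k in range(1, 100)]
violations = [(a, b) for a in POINTS for b in POINTS
              if a < b and lemma8_sides(a, b)[0] < lemma8_sides(a, b)[1] - 1e-12]
print("violations:", violations)
```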

C Details of the OSMD Algorithm Corresponding to Proposition 1

For completeness, we provide a description of the OSMD algorithm used in Algorithm 1. For more detailed information, please refer to the work of [21].

Let $\varDelta _{(n-1)}$ denote the $(n-1)$-dimensional probability simplex, defined as $\varDelta _{(n-1)}=\left\{ \,\textbf{q}\in \mathbb R^n_{\ge 0}: \sum _{i=1}^n \textbf{q}(i)=1\,\right\} $. Here, $\textbf{q}(i)$ denotes the i-th coordinate of the vector $\textbf{q}$. Consider a function $F:\mathbb R^n\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} $. The Bregman divergence with respect to F is defined as $B_F(\textbf{q},\textbf{p}) = F(\textbf{q})-F(\textbf{p}) - \langle \nabla F(\textbf{p}),\textbf{q}-\textbf{p}\rangle $ for any $\textbf{q},\textbf{p}\in \mathbb R^n$ at which F is finite and differentiable at $\textbf{p}$.

The algorithm proposed in [21] is designed for the loss setting, where each pull incurs a loss associated with the pulled arm instead of a reward. To adapt their algorithm to our setting, we perform a simple reduction by defining the loss of each arm as $\ell _t(i)=1-r_t(i)$, where $r_t(i)$ is the reward of ${arm}_i$. It is straightforward to verify that the results in [21] also hold for the reward setting. Let $\eta $ be the learning rate and $F:\mathbb R^{\left| S\right| }\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} $ be the potential function, where S is the arm set. Without loss of generality, we index the arms in S by $[\left| S\right| ]$.

By choosing $\eta = \sqrt{\frac{8}{L}}$ and $F(\textbf{q}) = -2\sum _{i=1}^{\left| S\right| } \sqrt{\textbf{q}(i)}$, the conclusion in Proposition 1 can be directly derived from Theorem 11 in [21].
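For the potential $F(\textbf{q}) = -2\sum _i \sqrt{\textbf{q}(i)}$, the Bregman divergence defined above has a simple closed form: since $\nabla F(\textbf{p})(i)=-1/\sqrt{\textbf{p}(i)}$, one gets $B_F(\textbf{q},\textbf{p})=\sum _i \frac{(\sqrt{\textbf{q}(i)}-\sqrt{\textbf{p}(i)})^2}{\sqrt{\textbf{p}(i)}}$. The sketch below computes the divergence both from the definition and from this closed form so that the two can be compared. It covers only the divergence, not the OSMD update itself, and the function names are hypothetical.

```python
import math


def potential(q):
    """F(q) = -2 * sum_i sqrt(q(i)), defined for q with non-negative entries."""
    return -2.0 * sum(math.sqrt(x) for x in q)


def potential_gradient(p):
    """Gradient of F at p; requires strictly positive entries."""
    if any(x <= 0.0 for x in p):
        raise ValueError("gradient needs p(i) > 0 for all i")
    return [-1.0 / math.sqrt(x) for x in p]


def bregman(q, p):
    """B_F(q, p) = F(q) - F(p) - <grad F(p), q - p>, straight from the definition."""
    grad = potential_gradient(p)
    inner = sum(g * (qi - pi) for g, qi, pi in zip(grad, q, p))
    return potential(q) - potential(p) - inner


def bregman_closed_form(q, p):
    """Closed form: sum_i (sqrt(q(i)) - sqrt(p(i)))^2 / sqrt(p(i))."""
    return sum((math.sqrt(qi) - math.sqrt(pi)) ** 2 / math.sqrt(pi)
               for qi, pi in zip(q, p))


q = [0.5, 0.3, 0.2]  # two points of the probability simplex, chosen arbitrarily
p = [0.2, 0.2, 0.6]
print(bregman(q, p), bregman_closed_form(q, p))
```

The closed form makes non-negativity of $B_F$ evident, and shows that $B_F(\textbf{p},\textbf{p})=0$.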

D Details of the MedianElimination Algorithm Corresponding to Proposition 2

For completeness, we present the description of the MedianElimination algorithm we used in Algorithm 1. For more detailed information, please refer to Theorem 10 of [9].
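As a rough illustration, the following is a minimal sketch of the classical Median Elimination procedure as it is commonly stated following [9]: start with $\varepsilon _1=\varepsilon /4$ and $\delta _1=\delta /2$, sample every surviving arm $\frac{4}{\varepsilon _\ell ^2}\log \frac{3}{\delta _\ell }$ times, discard the half with the lowest empirical means, then set $\varepsilon _{\ell +1}=\frac{3}{4}\varepsilon _\ell $ and $\delta _{\ell +1}=\delta _\ell /2$. The variant invoked in Algorithm 1 may differ (for instance in when it stops), so this should not be read as the paper's exact procedure. The `pull` callback, the Bernoulli test environment, and tie-breaking by sort order are assumptions made here.

```python
import math
import random


def median_elimination(pull, n_arms: int, eps: float, delta: float) -> int:
    """Classical Median Elimination sketch; returns the index of the surviving arm.

    pull(i) must return a reward in [0, 1] for arm i (hypothetical interface).
    """
    survivors = list(range(n_arms))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(survivors) > 1:
        # Number of samples per surviving arm in this round.
        n_samples = math.ceil(4.0 / eps_l ** 2 * math.log(3.0 / delta_l))
        means = {i: sum(pull(i) for _ in range(n_samples)) / n_samples
                 for i in survivors}
        # Keep the better half (rounded up) according to empirical means.
        survivors.sort(key=lambda i: means[i], reverse=True)
        survivors = survivors[: math.ceil(len(survivors) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return survivors[0]


def bernoulli_arms(means, seed: int = 0):
    """Build a pull function for Bernoulli arms with the given means (test helper)."""
    rng = random.Random(seed)
    return lambda i: 1.0 if rng.random() < means[i] else 0.0


arm_means = [0.9, 0.5, 0.4, 0.3, 0.2, 0.1]  # arm 0 is best by a wide margin
chosen = median_elimination(bernoulli_arms(arm_means), len(arm_means), eps=0.4, delta=0.2)
print("selected arm:", chosen)
```

Stopping the loop once `len(survivors)` reaches a target size m, rather than 1, would retain a set of arms instead of a single one; whether that matches the use in Algorithm 1 is not confirmed by the text here.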


About this paper

Cite this paper

Chen, H., He, Y., Zhang, C. (2025). On the Problem of Best Arm Retention. In: Li, B., Li, M., Sun, X. (eds) Frontiers of Algorithmics. IJTCS-FAW 2024. Lecture Notes in Computer Science, vol 14752. Springer, Singapore. https://doi.org/10.1007/978-981-97-7752-5_1


DOI: https://doi.org/10.1007/978-981-97-7752-5_1
Published: 29 December 2024
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-7751-8
Online ISBN: 978-981-97-7752-5
eBook Packages: Computer Science (R0)
