Abstract
This paper presents a comprehensive study of the Best Arm Retention (BAR) problem in stochastic multi-armed bandit settings: after a number of trials, retain m of the n arms such that the best arm is among those retained. We explore several perspectives of the problem.
- We begin by revisiting the lower bound for \((\varepsilon ,\delta )\)-PAC algorithms for Best Arm Identification (BAI), removing the restriction \(\delta <0.5\) that lower bounds in the literature previously imposed.
- By refining the technique above, we obtain optimal bounds for \((\varepsilon ,\delta )\)-PAC algorithms for BAR.
- We further study another variant of the problem, called r-BAR, which has recently found applications in streaming algorithms for multi-armed bandits. The goal of the r-BAR problem is to ensure that the expected gap between the best arm and the optimal retained arm is less than r. We prove tight sample complexity bounds for the problem.
- We explore the regret minimization problem for r-BAR and develop an algorithm that goes beyond pure exploration. We also propose a conjecture regarding the optimal regret in this setting.
Notes
1. T can be a stopping time.
2. We use mean gap to refer to the mean difference between \(i^*\) and the other fixed arm, and expected gap to denote the expected difference in means between \(i^*\) and the optimal arm of an arm subset, where the randomness of the expectation arises from the arm subset.
References
Assadi, S., Wang, C.: Single-pass streaming lower bounds for multi-armed bandits exploration with instance-sensitive sample complexity. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2022)
Audibert, J., Bubeck, S., Munos, R.: Best arm identification in multi-armed bandits. In: Proceedings of the 23rd Conference on Learning Theory (COLT) (2010)
Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in multi-armed bandits problems. In: Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT) (2009)
Chen, C.H., He, D., Fu, M., Lee, L.H.: Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4), 579–595 (2008). https://doi.org/10.1287/IJOC.1080.0268
Chen, H., He, Y., Zhang, C.: On interpolating experts and multi-armed bandits. arXiv preprint arXiv:2307.07264 (2023). https://doi.org/10.48550/ARXIV.2307.07264
Chen, L., Li, J., Qiao, M.: Towards instance optimal bounds for best arm identification. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)
Chen, S., Lin, T., King, I., Lyu, M.R., Chen, W.: Combinatorial pure exploration of multi-armed bandits. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS) (2014)
Degenne, R., Ménard, P., Shang, X., Valko, M.: Gamification of pure exploration for linear bandits. In: Proceedings of the 37th International Conference on Machine Learning (ICML) (2020)
Even-Dar, E., Mannor, S., Mansour, Y.: Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7, 1079–1105 (2006)
Fiez, T., Jain, L., Jamieson, K.G., Ratliff, L.: Sequential experimental design for transductive linear bandits. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)
Garivier, A., Kaufmann, E.: Optimal best arm identification with fixed confidence. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)
He, Y., Ye, Z., Zhang, C.: Understanding memory-regret trade-off for streaming stochastic multi-armed bandits. arXiv preprint arXiv:2405.19752 (2024)
Howard, S.R., Ramdas, A.: Sequential estimation of quantiles with applications to A/B testing and best-arm identification. Bernoulli 28(3), 1704–1728 (2022)
Jamieson, K., Malloy, M., Nowak, R., Bubeck, S.: lil’UCB: an optimal exploration algorithm for multi-armed bandits. In: Proceedings of the 27th Conference on Learning Theory (COLT) (2014)
Jourdan, M., Degenne, R., Kaufmann, E.: An \(\varepsilon \)-best-arm identification algorithm for fixed-confidence and beyond. In: Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS) (2023)
Kalyanakrishnan, S., Stone, P.: Efficient selection of multiple bandit arms: theory and practice. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)
Kalyanakrishnan, S., Tewari, A., Auer, P., Stone, P.: PAC subset selection in stochastic multi-armed bandits. In: Proceedings of the 29th International Conference on Machine Learning (ICML) (2012)
Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on Machine Learning (ICML) (2013)
Kaufmann, E., Cappé, O., Garivier, A.: On the complexity of best arm identification in multi-armed bandit models. J. Mach. Learn. Res. 17, 1–42 (2016)
Kone, C., Kaufmann, E., Richert, L.: Bandit Pareto set identification: the fixed budget setting. arXiv preprint arXiv:2311.03992 (2023). https://doi.org/10.48550/ARXIV.2311.03992
Lattimore, T., Gyorgy, A.: Mirror descent and the information ratio. In: Proceedings of the 34th Conference on Learning Theory (COLT) (2021)
Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press, Cambridge (2020)
Locatelli, A., Gutzeit, M., Carpentier, A.: An optimal algorithm for the thresholding bandit problem. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)
Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)
Mason, B., Jain, L., Tripathy, A., Nowak, R.: Finding all \(\epsilon \)-good arms in stochastic bandits. In: Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS) (2020)
Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58(5), 527–535 (1952)
Russo, D.: Simple Bayesian algorithms for best arm identification. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)
Siegmund, D.: Sequential Analysis: Tests and Confidence Intervals. Springer, Heidelberg (2013)
Simchi-Levi, D., Wang, C., Xu, J.: On experimentation with heterogeneous subgroups: An asymptotic optimal \(\delta \)-weighted-PAC design. Available at SSRN 4721755 (2024)
Simchowitz, M., Jamieson, K., Recht, B.: The simulator: understanding adaptive sampling in the moderate-confidence regime. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)
Topsøe, F.: Some bounds for the logarithmic function. Inequality Theory Appl. 4, 137 (2007)
Wang, P.A., Tzeng, R.C., Proutiere, A.: Fast pure exploration via Frank-Wolfe. In: Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)
You, W., Qin, C., Wang, Z., Yang, S.: Information-directed selection for top-two algorithms. In: Proceedings of the 36th Conference on Learning Theory (COLT) (2023)
Zhao, Y., Stephens, C., Szepesvári, C., Jun, K.S.: Revisiting simple regret: fast rates for returning a good arm. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023)
Appendices
A Proof of Lemma 1
Let \(T_i(s)\) denote the index of the s-th pull of arm i for \(s\le T_i\). Define the log-likelihood
, abbreviated as \(L_T\) when the context is clear. By applying the chain rule to \(L_T\), we have

where the second equality follows from
and that \(r_t\) is independent of \(\mathcal {F}_{t-1}\) conditioned on \(a_t\). With
, we apply Wald’s Lemma (see e.g. [28]) to
to obtain:

The remaining task is to prove
for any event \(\mathcal {E}\in \mathcal {F}_T\). To this end, we reformulate the definition of \(L_T\) as

Summing over all \(\omega \in \mathcal {E}\), we obtain

Continuing to lower bound Eq. (2), we have

where the inequality follows from Jensen’s inequality. Rearranging, we get
. Similarly,
. Hence, we conclude

which completes our proof in conjunction with Eq. (1).
B Bounds of KL Divergence
We will utilize the following inequalities from [31] to bound the KL divergence.
Fact 2
The following inequalities hold.
- (a) \(\log (1+x)\ge \frac{x}{1+x}, \forall x>-1\);
- (b) \(\log (1+x)\ge \frac{x}{1+x}(1+\frac{x}{2+x})=\frac{2x}{2+x}, \forall x>0\);
- (c) \(\log (1+x)\ge \frac{x}{1+x}\frac{2+x}{2}, \text { if } -1<x\le 0\).
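The three inequalities of Fact 2 can be sanity-checked numerically. The following is a minimal sketch rather than a proof: the helper name `check_fact2`, the sample points, and the tolerance `tol` are illustrative choices that do not come from the paper.

```python
import math

def check_fact2(xs_pos, xs_neg, tol=1e-12):
    """Return True if inequalities (a)-(c) of Fact 2 hold on the sample points.

    xs_pos: points with x > 0; xs_neg: points with -1 < x <= 0.
    tol absorbs floating-point rounding near x = 0 (an assumption of this sketch).
    """
    ok = True
    for x in xs_pos + xs_neg:      # (a) holds for all x > -1
        ok = ok and math.log1p(x) >= x / (1 + x) - tol
    for x in xs_pos:               # (b) holds for x > 0
        ok = ok and math.log1p(x) >= 2 * x / (2 + x) - tol
    for x in xs_neg:               # (c) holds for -1 < x <= 0
        ok = ok and math.log1p(x) >= x / (1 + x) * (2 + x) / 2 - tol
    return ok

# Sample points spanning several orders of magnitude on both sides of 0.
xs_pos = [10.0 ** e for e in range(-6, 4)]
xs_neg = ([-1 + 10.0 ** e for e in range(-6, 0)]
          + [-(10.0 ** e) for e in range(-6, 0)]
          + [0.0])

print(check_fact2(xs_pos, xs_neg))
```

Using `math.log1p` instead of `math.log(1 + x)` keeps the comparison meaningful for very small `x`, where the two sides of each inequality differ only in higher-order terms.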
Lemma 6
(Restate Lemma 2). \(\mathfrak {d}(\frac{1-\delta }{2}+\frac{1}{2n},1-\delta )=\varOmega \left( \frac{1-\delta }{2}-\frac{1}{2n}\right) \) if \(1-\delta =\frac{1+\varOmega (1)}{n}\).
Proof
By definition,

where the inequality follows from (a) & (b) of Fact 2.
Lemma 7
(Restate Lemma 3). For any \(x_1,x_2,\dots ,x_n \in [0,1]\) whose average \(a:=\frac{\sum _{i}x_i}{n}\) satisfies \(a<b\) for some \(b\in [0,1]\), we have \(\sum _{i:x_i<b}\mathfrak {d}(x_i,b)\ge n\cdot \mathfrak {d}(a,b).\)
Proof
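Recall from Fact 1 that \(\mathfrak {d}(\cdot ,y)\) is convex for any fixed y. Let \(S=\left\{ \,i:x_i<b\,\right\} \) and \(k=\left\| S\right\| \); note that \(k\ge 1\) since \(a<b\). By the convexity of \(\mathfrak {d}(\cdot ,b)\), we have \(\frac{1}{k}\sum _{i\in S}\mathfrak {d}(x_i,b)\ge \mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) .\) Since \(x_i\ge b\) for every \(i\notin S\), we have \(\sum _{i\in S}x_i\le an-(n-k)b\). Since \(\mathfrak {d}(x,b)>\mathfrak {d}(y,b)\) if \(x<y<b\) by Fact 1,
\[\mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) \ge \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) .\]
Using the convexity of \(\mathfrak {d}(\cdot ,b)\) again, we get
\[\frac{k}{n}\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) +\frac{n-k}{n}\cdot \mathfrak {d}(b,b)\ge \mathfrak {d}(a,b),\]
which implies \(k\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) \ge n\cdot \mathfrak {d}(a,b)\) since \(\mathfrak {d}(b,b)=0\).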
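The inequality of Lemma 7 can be probed on random inputs. The sketch below assumes that \(\mathfrak {d}(x,y)\) is the KL divergence between Bernoulli distributions with means x and y (consistent with its use in Lemma 8); the helper names `bern_kl` and `lemma7_sides`, the sampling scheme, and the tolerance are illustrative choices.

```python
import math
import random

def bern_kl(x, y):
    """KL divergence d(x, y) between Bernoulli(x) and Bernoulli(y), with 0*log(0) = 0.

    Requires 0 < y < 1 so that both logarithms are finite.
    """
    def term(p, q):
        return 0.0 if p == 0 else p * math.log(p / q)
    return term(x, y) + term(1 - x, 1 - y)

def lemma7_sides(xs, b):
    """Return (lhs, rhs) = (sum of d(x_i, b) over x_i < b, n * d(a, b))."""
    n = len(xs)
    a = sum(xs) / n
    lhs = sum(bern_kl(x, b) for x in xs if x < b)
    rhs = n * bern_kl(a, b)
    return lhs, rhs

rng = random.Random(0)  # fixed seed, purely for reproducibility
violations = 0
for _ in range(1000):
    n = rng.randint(1, 20)
    xs = [rng.random() for _ in range(n)]
    a = sum(xs) / n
    b = a + (1 - a) * rng.uniform(0.01, 0.99)  # guarantees a < b < 1
    lhs, rhs = lemma7_sides(xs, b)
    if lhs < rhs - 1e-9:  # small slack for floating-point rounding
        violations += 1

print(violations)
```

A count of zero is what the lemma predicts; a random search like this can only fail to find counterexamples and does not replace the convexity argument in the proof.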
Lemma 8
(Restate Lemma 4). For any \(0<a<b<1\), if \(\frac{b-a}{a}={\Omega }(1)\), then \(\mathfrak {d}(b,a)={\Omega }\left( b\cdot \log \frac{b}{a}\right) .\)
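As a rough numerical illustration of the statement (not a proof), the sketch below evaluates the ratio \(\mathfrak {d}(b,a)/(b\log \frac{b}{a})\) on a small grid. It reads the condition \(\frac{b-a}{a}={\Omega }(1)\) as \(b\ge 2a\), which is an illustrative choice of constant; the helper names and the grid are likewise assumptions of this sketch.

```python
import math

def bern_kl(x, y):
    """KL divergence d(x, y) between Bernoulli(x) and Bernoulli(y), for 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lemma8_ratio(a, b):
    """Ratio d(b, a) / (b * log(b / a)); the lemma says it is bounded away from 0."""
    return bern_kl(b, a) / (b * math.log(b / a))

# Grid with b >= 2a throughout; b is capped at 0.999 to stay inside (0, 1).
ratios = [
    lemma8_ratio(a, min(c * a, 0.999))
    for a in (1e-4, 1e-3, 1e-2, 0.1, 0.3, 0.45)
    for c in (2, 4, 16, 256)
]

print(min(ratios))
```

On this grid the ratio stays above a fixed positive constant, and it is smallest when b is close to 2a and a is small, which matches the role of the \({\Omega }(1)\) condition in the statement.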
Proof
By definition of the KL divergence, \(\mathfrak {d}(b,a)=b\log \frac{b}{a}+(1-b)\log \frac{1-b}{1-a}.\)
By Fact 2 (b) & (c),
and
Therefore if \(r:=\frac{b-a}{a}={\Omega }(1)\),
C Details of the OSMD Algorithm Corresponding to Proposition 1
For completeness, we provide a description of the OSMD algorithm used in Algorithm 1. For more detailed information, please refer to the work of [21].
Let \(\varDelta _{(n-1)}\) denote the \((n-1)\)-dimensional probability simplex, defined as \(\varDelta _{(n-1)}=\left\{ \,\textbf{q}\in \mathbb R^n_{\ge 0}: \sum _{i=1}^n \textbf{q}(i)=1\,\right\} \). Here, \(\textbf{q}(i)\) represents the value at the i-th position of vector \(\textbf{q}\). Consider a function \(F:\mathbb R^n\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} \). The Bregman divergence with respect to F is defined as \(B_F(\textbf{q},\textbf{p}) = F(\textbf{q})-F(\textbf{p}) - \langle \nabla F(\textbf{p}),\textbf{q}-\textbf{p}\rangle \) for any \(\textbf{q},\textbf{p}\in \mathbb R^n\).
The algorithm proposed in [21] is designed for the loss setting, where each pull incurs a loss associated with the pulled arm instead of a reward. To adapt the algorithm to our setting, we use a simple reduction: define the loss of each arm as \(\ell _t(i)=1-r_t(i)\), where \(r_t(i)\) is the reward of \({arm}_i\). It is straightforward to verify that the results in [21] also hold for the reward setting. Let \(\eta \) be the learning rate and \(F:\mathbb R^{\left\| S\right\| }\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} \) be the potential function, where S is the arm set. Without loss of generality, we index the arms in S by \([\left\| S\right\| ]\).

By choosing \(\eta = \sqrt{\frac{8}{L}}\) and \(F(\textbf{q}) = -2\sum _{i=1}^{\left\| S\right\| } \sqrt{\textbf{q}(i)}\), the conclusion in Proposition 1 can be directly derived from Theorem 11 in [21].
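To make the ingredients above concrete, the following is a minimal sketch of one generic online stochastic mirror descent step with the potential \(F(\textbf{q}) = -2\sum _{i} \sqrt{\textbf{q}(i)}\). It is not a reproduction of the algorithm in [21]: the function names, the bisection solver for the normalization constant, the importance-weighted loss estimate, and the numeric value of `eta` in the usage lines are all assumptions made for illustration.

```python
import math

def potential(q):
    """F(q) = -2 * sum_i sqrt(q(i)), the potential named in the text."""
    return -2.0 * sum(math.sqrt(x) for x in q)

def grad_potential(q):
    """Gradient of F; requires q(i) > 0 for every i."""
    return [-1.0 / math.sqrt(x) for x in q]

def bregman(q, p):
    """B_F(q, p) = F(q) - F(p) - <grad F(p), q - p>."""
    g = grad_potential(p)
    inner = sum(gi * (qi - pi) for gi, qi, pi in zip(g, q, p))
    return potential(q) - potential(p) - inner

def osmd_step(p, loss_est, eta, iters=100):
    """One step: argmin over the simplex of eta * <q, loss_est> + B_F(q, p).

    Stationarity gives q(i) = (1/sqrt(p(i)) + eta*loss_est(i) + lam)^(-2), where
    lam is chosen so that q sums to 1. Assumes loss_est(i) >= 0, which puts
    lam in the interval (-min_i c(i), 0].
    """
    c = [1.0 / math.sqrt(pi) + eta * li for pi, li in zip(p, loss_est)]
    lo, hi = -min(c), 0.0
    for _ in range(iters):           # bisection: the sum is decreasing in lam
        lam = (lo + hi) / 2.0
        total = sum((ci + lam) ** -2 for ci in c)
        if total > 1.0:
            lo = lam
        else:
            hi = lam
    q = [(ci + hi) ** -2 for ci in c]
    s = sum(q)
    return [x / s for x in q]        # remove residual rounding error

# Hypothetical usage: 4 arms, uniform start, arm 2 pulled with reward 0.
p = [0.25] * 4
arm, reward = 2, 0.0
loss = 1.0 - reward                  # reduction from rewards to losses
loss_est = [0.0] * 4
loss_est[arm] = loss / p[arm]        # importance-weighted estimate (assumption)
q = osmd_step(p, loss_est, eta=0.5)  # eta chosen arbitrarily for the demo
print(q)
```

In this demo the probability of the arm that just incurred a loss decreases while the others increase. The closed form for each coordinate is specific to this potential; a different F would need a different solver for the projection step.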
D Details of the MedianElimination Algorithm Corresponding to Proposition 2
For completeness, we present the description of the MedianElimination algorithm we used in Algorithm 1. For more detailed information, please refer to Theorem 10 of [9].
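As a point of reference, the textbook form of Median Elimination from [9] can be sketched as follows. This is an assumption-laden sketch rather than the exact procedure used in Algorithm 1, which may differ in constants or in its stopping rule; the function name and the `pull` callback interface are hypothetical.

```python
import math
import random

def median_elimination(pull, n_arms, eps, delta):
    """Sketch of Median Elimination; returns the index of the surviving arm.

    pull(i) is a hypothetical callback returning one reward in [0, 1] from arm i.
    """
    arms = list(range(n_arms))
    eps_l, delta_l = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        # Sample every surviving arm (4 / eps_l^2) * ln(3 / delta_l) times.
        n_samples = math.ceil(4.0 / eps_l ** 2 * math.log(3.0 / delta_l))
        means = {
            i: sum(pull(i) for _ in range(n_samples)) / n_samples
            for i in arms
        }
        # Keep the better half: arms ranked at or above the empirical median.
        arms.sort(key=lambda i: means[i], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
        # Tighten accuracy and confidence for the next round.
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2.0
    return arms[0]

# Hypothetical usage with Bernoulli arms; the means are made up for the demo.
rng = random.Random(0)
true_means = [0.1, 0.9, 0.1, 0.1]
best = median_elimination(
    lambda i: float(rng.random() < true_means[i]),
    len(true_means), eps=0.2, delta=0.1,
)
print(best)
```

Halving the arm set each round while shrinking \(\varepsilon \) geometrically is what keeps the total number of pulls linear in the number of arms, since the per-round cost drops with the number of survivors.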

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Chen, H., He, Y., Zhang, C. (2025). On the Problem of Best Arm Retention. In: Li, B., Li, M., Sun, X. (eds) Frontiers of Algorithmics. IJTCS-FAW 2024. Lecture Notes in Computer Science, vol 14752. Springer, Singapore. https://doi.org/10.1007/978-981-97-7752-5_1
DOI: https://doi.org/10.1007/978-981-97-7752-5_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-7751-8
Online ISBN: 978-981-97-7752-5
eBook Packages: Computer Science, Computer Science (R0)