On the Problem of Best Arm Retention

  • Conference paper
Frontiers of Algorithmics (IJTCS-FAW 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14752)


This paper presents a comprehensive study of the Best Arm Retention (BAR) problem in stochastic multi-armed bandit settings: given n arms, retain m of them after some trials such that the best arm is among those retained. We explore the problem from several perspectives.

  • We begin by revisiting the lower bound for the \((\varepsilon ,\delta )\)-PAC algorithm for Best Arm Identification (BAI), where we remove the previously imposed restriction of \(\delta <0.5\) in the lower bound found in the literature.

  • By refining the technique above, we obtain optimal bounds for \((\varepsilon ,\delta )\)-PAC algorithms for BAR.

  • We further study another variant of the problem, called r-BAR, which has recently found applications in streaming algorithms for multi-armed bandits. The goal of the r-BAR problem is to ensure that the expected gap between the best arm and the optimal retained arm is less than r. We prove tight sample complexity bounds for the problem.

  • We explore the regret minimization problem for r-BAR and develop an algorithm beyond pure exploration. We also propose a conjecture regarding the optimal regret in this setting.
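To make the retention requirement concrete, here is a minimal sketch of a naive baseline: pull every arm equally often and keep the m arms with the highest empirical means. This is not one of the algorithms studied in the paper and carries none of its guarantees. The Bernoulli rewards, the function name `retain_top_m`, and the parameter `pulls_per_arm` are assumptions made for illustration.

```python
import random

def retain_top_m(means, m, pulls_per_arm, rng):
    """Naive baseline: uniform sampling, then keep the m best empirical means.

    means          -- true Bernoulli means (hypothetical instance)
    m              -- number of arms to retain
    pulls_per_arm  -- uniform budget per arm (illustrative parameter)
    """
    empirical = []
    for mu in means:
        wins = sum(1 for _ in range(pulls_per_arm) if rng.random() < mu)
        empirical.append(wins / pulls_per_arm)
    # Sort arm indices by empirical mean, best first, and keep the top m.
    order = sorted(range(len(means)), key=lambda i: empirical[i], reverse=True)
    return set(order[:m])

rng = random.Random(0)
means = [0.9, 0.1, 0.1, 0.1]
retained = retain_top_m(means, m=2, pulls_per_arm=200, rng=rng)
best_arm = max(range(len(means)), key=lambda i: means[i])
print(best_arm in retained)
```

A run succeeds in the BAR sense when `best_arm in retained` holds. The paper's question is how few pulls suffice to make this happen with the required probability.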

  1. T can be a stopping time.

  2. We use mean gap to refer to the mean difference between \(i^*\) and the other fixed arm, and expected gap to denote the expected difference in means between \(i^*\) and the optimal arm of an arm subset, where the randomness of the expectation arises from the arm subset.


  1. Assadi, S., Wang, C.: Single-pass streaming lower bounds for multi-armed bandits exploration with instance-sensitive sample complexity. In: Proceedings of the 35th Annual Conference on Neural Information Processing Systems (NeurIPS) (2022)

  2. Audibert, J., Bubeck, S., Munos, R.: Best arm identification in multi-armed bandits. In: Proceedings of the 23rd Conference on Learning Theory (COLT) (2010)

  3. Bubeck, S., Munos, R., Stoltz, G.: Pure exploration in multi-armed bandits problems. In: Proceedings of the 20th International Conference on Algorithmic Learning Theory (ALT) (2009)

  4. Chen, C.H., He, D., Fu, M., Lee, L.H.: Efficient simulation budget allocation for selecting an optimal subset. INFORMS J. Comput. 20(4), 579–595 (2008)

  5. Chen, H., He, Y., Zhang, C.: On interpolating experts and multi-armed bandits. arXiv preprint arXiv:2307.07264 (2023)

  6. Chen, L., Li, J., Qiao, M.: Towards instance optimal bounds for best arm identification. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)

  7. Chen, S., Lin, T., King, I., Lyu, M.R., Chen, W.: Combinatorial pure exploration of multi-armed bandits. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NeurIPS) (2014)

  8. Degenne, R., Ménard, P., Shang, X., Valko, M.: Gamification of pure exploration for linear bandits. In: Proceedings of the 37th International Conference on Machine Learning (ICML) (2020)

  9. Even-Dar, E., Mannor, S., Mansour, Y.: Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res. 7, 1079–1105 (2006)

  10. Fiez, T., Jain, L., Jamieson, K.G., Ratliff, L.: Sequential experimental design for transductive linear bandits. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NeurIPS) (2019)

  11. Garivier, A., Kaufmann, E.: Optimal best arm identification with fixed confidence. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)

  12. He, Y., Ye, Z., Zhang, C.: Understanding memory-regret trade-off for streaming stochastic multi-armed bandits. arXiv preprint arXiv:2405.19752 (2024)

  13. Howard, S.R., Ramdas, A.: Sequential estimation of quantiles with applications to A/B testing and best-arm identification. Bernoulli 28(3), 1704–1728 (2022)

  14. Jamieson, K., Malloy, M., Nowak, R., Bubeck, S.: lil’UCB: an optimal exploration algorithm for multi-armed bandits. In: Proceedings of the 27th Conference on Learning Theory (COLT) (2014)

  15. Jourdan, M., Degenne, R., Kaufmann, E.: An \(\varepsilon \)-best-arm identification algorithm for fixed-confidence and beyond. In: Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS) (2023)

  16. Kalyanakrishnan, S., Stone, P.: Efficient selection of multiple bandit arms: theory and practice. In: Proceedings of the 27th International Conference on Machine Learning (ICML) (2010)

  17. Kalyanakrishnan, S., Tewari, A., Auer, P., Stone, P.: PAC subset selection in stochastic multi-armed bandits. In: Proceedings of the 29th International Conference on Machine Learning (ICML) (2012)

  18. Karnin, Z., Koren, T., Somekh, O.: Almost optimal exploration in multi-armed bandits. In: Proceedings of the 30th International Conference on Machine Learning (ICML) (2013)

  19. Kaufmann, E., Cappé, O., Garivier, A.: On the complexity of best arm identification in multi-armed bandit models. J. Mach. Learn. Res. 17, 1–42 (2016)

  20. Kone, C., Kaufmann, E., Richert, L.: Bandit Pareto set identification: the fixed budget setting. arXiv preprint arXiv:2311.03992 (2023)

  21. Lattimore, T., Gyorgy, A.: Mirror descent and the information ratio. In: Proceedings of the 34th Conference on Learning Theory (COLT) (2021)

  22. Lattimore, T., Szepesvári, C.: Bandit Algorithms. Cambridge University Press, Cambridge (2020)

  23. Locatelli, A., Gutzeit, M., Carpentier, A.: An optimal algorithm for the thresholding bandit problem. In: Proceedings of the 33rd International Conference on Machine Learning (ICML) (2016)

  24. Mannor, S., Tsitsiklis, J.N.: The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res. 5, 623–648 (2004)

  25. Mason, B., Jain, L., Tripathy, A., Nowak, R.: Finding all \(\epsilon \)-good arms in stochastic bandits. In: Proceedings of the 33rd Annual Conference on Neural Information Processing Systems (NeurIPS) (2020)

  26. Robbins, H.: Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58(5), 527–535 (1952)

  27. Russo, D.: Simple Bayesian algorithms for best arm identification. In: Proceedings of the 29th Conference on Learning Theory (COLT) (2016)

  28. Siegmund, D.: Sequential Analysis: Tests and Confidence Intervals. Springer, Heidelberg (2013)

  29. Simchi-Levi, D., Wang, C., Xu, J.: On experimentation with heterogeneous subgroups: An asymptotic optimal \(\delta \)-weighted-PAC design. Available at SSRN 4721755 (2024)

  30. Simchowitz, M., Jamieson, K., Recht, B.: The simulator: understanding adaptive sampling in the moderate-confidence regime. In: Proceedings of the 30th Conference on Learning Theory (COLT) (2017)

  31. Topsøe, F.: Some bounds for the logarithmic function. Inequality Theory Appl. 4, 137 (2007)

  32. Wang, P.A., Tzeng, R.C., Proutiere, A.: Fast pure exploration via Frank-Wolfe. In: Proceedings of the 34th Annual Conference on Neural Information Processing Systems (NeurIPS) (2021)

  33. You, W., Qin, C., Wang, Z., Yang, S.: Information-directed selection for top-two algorithms. In: Proceedings of the 36th Conference on Learning Theory (COLT) (2023)

  34. Zhao, Y., Stephens, C., Szepesvári, C., Jun, K.S.: Revisiting simple regret: fast rates for returning a good arm. In: Proceedings of the 40th International Conference on Machine Learning (ICML) (2023)

Author information

Corresponding author

Correspondence to Chihao Zhang.

A Proof of Lemma 1

Let \(T_i(s)\) denote the index of the s-th pull of arm i for \(s\le T_i\). Define the log-likelihood […], abbreviated as \(L_T\) when the context is clear. By applying the chain rule to \(L_T\), we have

[…]

where the second equality follows from […] and the fact that \(r_t\) is independent of \(\mathcal {F}_{t-1}\) conditioned on \(a_t\). With […], we apply Wald’s Lemma (see e.g. [28]) to […] to obtain:

[…]

The remaining task is to prove […] for any event \(\mathcal {E}\in \mathcal {F}_T\). We reformulate the definition of \(L_T\) as

[…]

Summing over all \(\omega \in \mathcal {E}\), we obtain

[…]

Continuing to lower bound Eq. (2), we have

[…]

where the inequality follows from Jensen's inequality. Rearranging, we get […]. Similarly, […]. Hence, we conclude

[…]

which completes our proof in conjunction with Eq. (1).

B Bounds of KL Divergence

We will utilize the following inequalities from [31] to bound the KL divergence.

Fact 2

The following inequalities hold.

  1. (a)

    \(\log (1+x)\ge \frac{x}{1+x}, \forall x>-1\);

  2. (b)

    \(\log (1+x)\ge \frac{x}{1+x}(1+\frac{x}{2+x})=\frac{2x}{2+x}, \forall x>0\);

  3. (c)

    \(\log (1+x)\ge \frac{x}{1+x}\frac{2+x}{2}, \text { if } -1<x\le 0\).
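The three bounds in Fact 2 can be checked numerically on a grid. The sketch below is only a sanity check, not a proof. The helper names and the grid are illustrative choices, and `tol` absorbs floating-point rounding.

```python
import math

def lower_a(x):
    # Fact 2 (a): x / (1 + x), valid for x > -1
    return x / (1 + x)

def lower_b(x):
    # Fact 2 (b): x/(1+x) * (1 + x/(2+x)) = 2x / (2 + x), valid for x > 0
    return 2 * x / (2 + x)

def lower_c(x):
    # Fact 2 (c): x/(1+x) * (2+x)/2, valid for -1 < x <= 0
    return x / (1 + x) * (2 + x) / 2

def check_fact2(tol=1e-12):
    xs_pos = [k / 100 for k in range(1, 1001)]   # grid on (0, 10]
    xs_neg = [-k / 100 for k in range(0, 100)]   # grid on (-1, 0]
    ok_a = all(math.log1p(x) >= lower_a(x) - tol for x in xs_pos + xs_neg)
    ok_b = all(math.log1p(x) >= lower_b(x) - tol for x in xs_pos)
    ok_c = all(math.log1p(x) >= lower_c(x) - tol for x in xs_neg)
    return ok_a, ok_b, ok_c

print(check_fact2())
```

`math.log1p` is used instead of `math.log(1 + x)` because all three bounds are tight near x = 0, where the direct form loses precision.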

Lemma 6

(Restate Lemma 2). \(\mathfrak {d}(\frac{1-\delta }{2}+\frac{1}{2n},1-\delta )=\varOmega \left( \frac{1-\delta }{2}-\frac{1}{2n}\right) \) if \(1-\delta =\frac{1+\varOmega (1)}{n}\).


By definition,

[…]

where the inequality follows from (a) & (b) of Fact 2.
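As a numerical illustration of Lemma 6, the sketch below fixes \(1-\delta = c/n\) with c = 2 (an arbitrary instance of the condition \(1-\delta =\frac{1+\varOmega (1)}{n}\)) and computes the ratio of the two sides as n grows. The function names and the choice of c are assumptions for illustration. The Bernoulli form of \(\mathfrak {d}\) is the one used in the proof of Lemma 8.

```python
import math

def kl_bernoulli(x, y):
    """Binary KL divergence d(x, y) for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def lemma6_ratio(n, c=2.0):
    """Ratio of d((1-delta)/2 + 1/(2n), 1-delta) to (1-delta)/2 - 1/(2n),
    where 1 - delta = c / n and c > 1 is an illustrative constant."""
    p = c / n                        # p plays the role of 1 - delta
    x = p / 2 + 1 / (2 * n)
    return kl_bernoulli(x, p) / (p / 2 - 1 / (2 * n))

ratios = [lemma6_ratio(n) for n in (10, 100, 1000, 10**4, 10**5)]
print(ratios)
```

The ratio staying bounded away from zero as n increases is what the \(\varOmega (\cdot )\) statement asserts for this choice of c.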

Lemma 7

(Restate Lemma 3). For any \(x_1,x_2,\dots ,x_n \in [0,1]\) with average \(a:=\frac{\sum _{i}x_i}{n}\) and any \(b\in [0,1]\) with \(a<b\), we have \(\sum _{i:x_i<b}\mathfrak {d}(x_i,b)\ge n\cdot \mathfrak {d}(a,b).\)


Recall from Fact 1 that \(\mathfrak {d}(\cdot ,y)\) is convex for any fixed y. Let \(S=\left\{ \,i:x_i<b\,\right\} \) and \(k=\left\| S\right\| \). By the convexity of \(\mathfrak {d}(\cdot ,b)\), we have \(\frac{1}{k}\sum _{i\in S}\mathfrak {d}(x_i,b)\ge \mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) .\) Since \(x_i\ge b\) for every \(i\notin S\), we have \(\sum _{i\in S}x_i\le an-(n-k)b\). Combining this with the fact that \(\mathfrak {d}(x,b)>\mathfrak {d}(y,b)\) whenever \(x<y<b\) (Fact 1), we obtain

$$\sum _{i\in S}\mathfrak {d}(x_i,b)\ge k\cdot \mathfrak {d}\left( \frac{\sum _{i\in S}x_i}{k},b\right) \ge k\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) .$$

Using the convexity of \(\mathfrak {d}(\cdot ,b)\) again, we get

$$\frac{k}{n}\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) +\frac{n-k}{n}\cdot \mathfrak {d}(b,b)\ge \mathfrak {d}(a,b),$$

which implies \(k\cdot \mathfrak {d}\left( \frac{an-(n-k)b}{k},b\right) \ge n\cdot \mathfrak {d}(a,b)\) since \(\mathfrak {d}(b,b)=0\).
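The inequality of Lemma 7 can be spot-checked on random inputs. The following sketch assumes \(\mathfrak {d}\) is the Bernoulli KL divergence with the convention \(0\log 0=0\). The function names, the seed, and the sampling ranges are illustrative.

```python
import math
import random

def kl_bernoulli(x, y):
    """Binary KL divergence d(x, y), with the convention 0 * log 0 = 0."""
    total = 0.0
    if x > 0:
        total += x * math.log(x / y)
    if x < 1:
        total += (1 - x) * math.log((1 - x) / (1 - y))
    return total

def lemma7_gap(xs, b):
    """Left-hand side minus right-hand side of the inequality in Lemma 7."""
    n = len(xs)
    a = sum(xs) / n
    lhs = sum(kl_bernoulli(x, b) for x in xs if x < b)
    return lhs - n * kl_bernoulli(a, b)

rng = random.Random(1)
gaps = []
for _ in range(1000):
    xs = [rng.random() for _ in range(rng.randint(1, 20))]
    b = rng.uniform(0.01, 0.99)
    if sum(xs) / len(xs) < b:        # the lemma assumes a < b
        gaps.append(lemma7_gap(xs, b))
print(min(gaps) >= -1e-9)
```

For n = 1 the two sides coincide, so the gap is exactly zero. The small negative tolerance only guards against rounding.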

Lemma 8

(Restate Lemma 4). For any \(0<a<b<1\), if \(\frac{b-a}{a}={\Omega }(1)\), then \(\mathfrak {d}(b,a)={\Omega }\left( b\cdot \log \frac{b}{a}\right) .\)


By definition of the KL divergence, \(\mathfrak {d}(b,a)=b\log \frac{b}{a}+(1-b)\log \frac{1-b}{1-a}.\)

By Fact 2 (b) & (c),

$$b\log \frac{b}{a}=b\log \left( 1+\frac{b-a}{a}\right) \ge (b-a)\left( 1+\frac{(b-a)/a}{2+(b-a)/a}\right) ,$$

and

$$(1-b)\log \frac{1-b}{1-a}=(1-b)\log \left( 1+\frac{a-b}{1-a}\right) \ge -(b-a)\left( 1-\frac{b-a}{2(1-a)}\right) . $$

Therefore if \(r:=\frac{b-a}{a}={\Omega }(1)\),

$$\begin{aligned} \mathfrak {d}(b,a) &= \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a}+\frac{1}{1+r/(2+r)}b\log \frac{b}{a}+(1-b)\log \frac{1-b}{1-a}\\ &\ge \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a} +(b-a)-(b-a)\left( 1-\frac{b-a}{2(1-a)}\right) \\ &\ge \left( 1-\frac{1}{1+r/(2+r)}\right) b\log \frac{b}{a}. \end{aligned}$$

The claim follows since \(1-\frac{1}{1+r/(2+r)}=\frac{r}{2+2r}={\Omega }(1)\) when \(r={\Omega }(1)\).
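The last display gives an explicit constant: the coefficient \(1-\frac{1}{1+r/(2+r)}\) simplifies to \(\frac{r}{2+2r}\). The sketch below checks the resulting bound \(\mathfrak {d}(b,a)\ge \frac{r}{2+2r}\, b\log \frac{b}{a}\) on a small grid. The grid values and function names are illustrative choices.

```python
import math

def kl_bernoulli(b, a):
    """d(b, a) = b log(b/a) + (1-b) log((1-b)/(1-a)) for a, b in (0, 1)."""
    return b * math.log(b / a) + (1 - b) * math.log((1 - b) / (1 - a))

def lemma8_holds(a, r):
    """Check d(b, a) >= coeff * b * log(b/a) for b = a * (1 + r)."""
    b = a * (1 + r)
    coeff = 1 - 1 / (1 + r / (2 + r))    # equals r / (2 + 2r)
    return kl_bernoulli(b, a) >= coeff * b * math.log(b / a) - 1e-12

grid = [(a, r) for a in (0.01, 0.05, 0.1, 0.2, 0.4)
        for r in (0.5, 1.0, 2.0, 5.0) if a * (1 + r) < 1]
print(all(lemma8_holds(a, r) for a, r in grid))
```

For example, r = 1 gives the constant 1/4, so \(\mathfrak {d}(2a,a)\ge \frac{1}{4}\cdot 2a\log 2\) whenever \(2a<1\).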

C Details of the OSMD Algorithm Corresponding to Proposition 1

For completeness, we provide a description of the OSMD algorithm used in Algorithm 1. For more detailed information, please refer to the work of [21].

Let \(\varDelta _{(n-1)}\) denote the \((n-1)\)-dimensional probability simplex, defined as \(\varDelta _{(n-1)}=\left\{ \,\textbf{q}\in \mathbb R^n_{\ge 0}: \sum _{i=1}^n \textbf{q}(i)=1\,\right\} \). Here, \(\textbf{q}(i)\) denotes the i-th coordinate of the vector \(\textbf{q}\). Consider a function \(F:\mathbb R^n\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} \). The Bregman divergence with respect to F is defined as \(B_F(\textbf{q},\textbf{p}) = F(\textbf{q})-F(\textbf{p}) - \langle \nabla F(\textbf{p}),\textbf{q}-\textbf{p}\rangle \) for any \(\textbf{q},\textbf{p}\in \mathbb R^n\).
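The definition of the Bregman divergence translates directly into code. The sketch below uses the negative entropy \(F(\textbf{q})=\sum _i \textbf{q}(i)\log \textbf{q}(i)\) as an example potential. This choice is made only for illustration and is not the potential used in this appendix. For it, \(B_F\) coincides with the KL divergence on the simplex, which gives a convenient check. All function names are hypothetical.

```python
import math

def bregman(F, grad_F, q, p):
    """B_F(q, p) = F(q) - F(p) - <grad F(p), q - p>."""
    inner = sum(g * (qi - pi) for g, qi, pi in zip(grad_F(p), q, p))
    return F(q) - F(p) - inner

# Example potential (an illustrative choice): negative entropy.
def neg_entropy(q):
    return sum(qi * math.log(qi) for qi in q)

def grad_neg_entropy(q):
    return [math.log(qi) + 1 for qi in q]

q = [0.5, 0.3, 0.2]
p = [0.2, 0.2, 0.6]
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
print(abs(bregman(neg_entropy, grad_neg_entropy, q, p) - kl) < 1e-12)
```

Passing `F` and `grad_F` as arguments keeps the function independent of the potential, so any differentiable F with a known gradient can be substituted.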

The algorithm proposed in [21] is designed for the loss setting, where each pull incurs a loss associated with the pulled arm instead of a reward. To adapt their algorithm to our setting, we perform a simple reduction by defining the loss of each arm as \(\ell _t(i)=1-r_t(i)\), where \(r_t(i)\) is the reward of \({arm}_i\). It is straightforward to verify that the results in [21] also hold for the reward setting. Let \(\eta \) be the learning rate and \(F:\mathbb R^{\left\| S\right\| }\rightarrow \mathbb R\cup \left\{ \,\infty \,\right\} \) be the potential function, where S is the arm set. Without loss of generality, we index the arms in S by \([\left\| S\right\| ]\).

[Algorithm listing: OSMD]

By choosing \(\eta = \sqrt{\frac{8}{L}}\) and \(F(\textbf{q}) = -2\sum _{i=1}^{\left\| S\right\| } \sqrt{\textbf{q}(i)}\), the conclusion in Proposition 1 can be directly derived from Theorem 11 in [21].
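For intuition, one mirror-descent update with the potential \(F(\textbf{q}) = -2\sum _i \sqrt{\textbf{q}(i)}\) can be sketched as follows. The code implements the generic update \(\textbf{q}' = \arg \min _{\textbf{q}'\in \varDelta } \eta \langle \textbf{q}',\hat{\ell }\rangle + B_F(\textbf{q}',\textbf{q})\). The importance-weighted loss estimate in the usage lines, the bisection solver, and all names are assumptions and may differ from the exact listing in [21].

```python
import math

def tsallis_osmd_step(q, loss_est, eta, iters=100):
    """One mirror-descent step for F(q) = -2 * sum(sqrt(q_i)) on the simplex.

    Assumes q is in the interior of the simplex and loss_est >= 0.
    The first-order condition gives q'_i = (c_i - mu)^(-2) with
    c_i = 1/sqrt(q_i) + eta * loss_est_i, where mu normalises the sum to 1.
    """
    c = [1 / math.sqrt(qi) + eta * li for qi, li in zip(q, loss_est)]
    lo, hi = 0.0, min(c) - 1.0       # the sum is <= 1 at lo and >= 1 at hi
    for _ in range(iters):           # bisection on the multiplier mu
        mu = (lo + hi) / 2
        if sum((ci - mu) ** -2 for ci in c) > 1:
            hi = mu
        else:
            lo = mu
    mu = (lo + hi) / 2
    new_q = [(ci - mu) ** -2 for ci in c]
    total = sum(new_q)
    return [x / total for x in new_q]    # remove residual bisection error

q = [0.25] * 4
# Assumed estimator: arm 0 was pulled with loss 1, weighted by 1 / q[0].
loss_est = [1.0 / q[0], 0.0, 0.0, 0.0]
new_q = tsallis_osmd_step(q, loss_est, eta=0.1)
print(new_q)
```

The update lowers the probability of the arm that just incurred a loss and redistributes the mass over the others, while keeping every coordinate strictly positive.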

D Details of the MedianElimination Algorithm Corresponding to Proposition 2

For completeness, we describe the MedianElimination algorithm used in Algorithm 1. For more details, please refer to Theorem 10 of [9].

[Algorithm listing: MedianElimination]
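As a rough guide, the textbook form of Median Elimination from [9] can be sketched as below. Each round samples every surviving arm, discards the worse half by empirical mean, and tightens the accuracy and confidence parameters. The `pull` interface, the Bernoulli test instance, and the tie-breaking rule are assumptions, and the variant invoked inside Algorithm 1 may differ (for instance in when it stops).

```python
import math
import random

def median_elimination(pull, n_arms, eps, delta):
    """Textbook Median Elimination, following Even-Dar et al. [9].

    pull(i) returns one reward in [0, 1] from arm i (hypothetical interface).
    Returns the index of the single arm left after repeated halving.
    """
    arms = list(range(n_arms))
    eps_l, delta_l = eps / 4, delta / 2
    while len(arms) > 1:
        # Sample each surviving arm (4 / eps_l^2) * log(3 / delta_l) times.
        t = math.ceil(4 / eps_l ** 2 * math.log(3 / delta_l))
        means = {i: sum(pull(i) for _ in range(t)) / t for i in arms}
        # Keep the better half; ties are broken by sort order (a simplification
        # of "remove arms whose empirical mean is below the median").
        arms.sort(key=lambda i: means[i], reverse=True)
        arms = arms[: math.ceil(len(arms) / 2)]
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return arms[0]

rng = random.Random(0)
true_means = [0.9, 0.2, 0.2, 0.2]    # hypothetical Bernoulli instance
best = median_elimination(
    lambda i: float(rng.random() < true_means[i]), 4, eps=0.2, delta=0.1
)
print(best)
```

Stopping the loop once m arms remain, rather than one, would turn the same halving scheme into a retention procedure. Whether that matches Algorithm 1 is not confirmed by the text here.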

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Chen, H., He, Y., Zhang, C. (2025). On the Problem of Best Arm Retention. In: Li, B., Li, M., Sun, X. (eds) Frontiers of Algorithmics. IJTCS-FAW 2024. Lecture Notes in Computer Science, vol 14752. Springer, Singapore.

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-7751-8

  • Online ISBN: 978-981-97-7752-5

  • eBook Packages: Computer Science, Computer Science (R0)
