Stochastic Q-Learning For Large Discrete Action Spaces
1 Department of Computer, Electrical and Mathematical Science and Engineering, King Abdullah University of Science and Technology (KAUST), Thuwal, KSA. 2 School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA. Correspondence to: Fares Fourati <fares.fourati@kaust.edu.sa>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Abstract

In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of n actions, only consider a variable stochastic set of a sub-linear number of actions, possibly as small as O(log(n)). The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.

1. Introduction

Reinforcement learning (RL), a continually evolving field of machine learning, has achieved notable successes, especially when combined with deep learning (Sutton & Barto, 2018; Wang et al., 2022). While there have been several advancements, standard RL algorithms suffer in terms of computational efficiency (Akkerman et al., 2023). Identifying the optimal actions might entail cycling through all of them, in general, multiple times within different states, which is computationally expensive and may become prohibitive with large discrete action spaces (Tessler et al., 2019).

Such challenges apply to various domains, including combinatorial optimization (Mazyavkina et al., 2021; Fourati et al., 2023; 2024b;a), natural language processing (He et al., 2015; 2016a;b; Tessler et al., 2019), communications and networking (Luong et al., 2019; Fourati & Alouini, 2021), recommendation systems (Dulac-Arnold et al., 2015), transportation (Al-Abbasi et al., 2019; Haliem et al., 2021; Li et al., 2022), and robotics (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020; Seyde et al., 2021; 2022; Gonzalez et al., 2023; Ireland & Montana, 2024). Although tailored solutions leveraging action space structures and dimensions may suffice in specific contexts, their applicability across diverse problems, possibly unstructured, still needs to be expanded. We complement these works by proposing a general method that addresses a broad spectrum of problems, accommodating structured and unstructured single- and multi-dimensional large discrete action spaces.

Value-based and actor-based approaches are both prominent in RL. Value-based approaches, in which the agent implicitly optimizes its policy by maximizing a value function, demonstrate superior generalization capabilities but demand significant computational resources, particularly in complex settings. Conversely, actor-based approaches, in which the agent directly optimizes its policy, offer computational efficiency but often encounter challenges in generalizing across multiple and unexplored actions (Dulac-Arnold et al., 2015). While both hold unique advantages and challenges, they represent distinct avenues for addressing the complexities of decision-making in large action spaces. However, comparing them falls outside the scope of this work. While some previous methods have focused on the latter (Dulac-Arnold et al., 2015), our work concentrates on the former. Specifically, we aim to exploit the natural generalization inherent in value-based RL approaches while reducing their per-step computational complexity.
Q-learning, as introduced by Watkins & Dayan (1992) for discrete action and state spaces, stands out as one of the most famous examples of value-based RL methods and remains one of the most widely used ones in the field. As an off-policy learning method, it decouples the learning process from the agent's current policy, allowing it to leverage past experiences from various sources, which becomes advantageous in complex environments. In each step of Q-learning, the agent updates its action-value estimates based on the observed reward and the estimated value of the best action in the next state.

Some approaches have been proposed to apply Q-learning to continuous state spaces, leveraging deep neural networks (Mnih et al., 2013; Van Hasselt et al., 2016). Moreover, several improvements have also been suggested to address its inherent estimation bias (Hasselt, 2010; Van Hasselt et al., 2016; Zhang et al., 2017; Lan et al., 2020; Wang et al., 2021). However, despite the different progress and its numerous advantages, a significant challenge still needs to be solved in Q-learning-like methods when confronted with large discrete action spaces. The computational complexity associated with selecting actions and updating Q-functions increases proportionally with the increasing number of actions, which renders the conventional approach impractical as the number of actions substantially increases. Consequently, we confront a crucial question: Is it possible to mitigate the complexity of the different Q-learning methods while maintaining a good performance?

This work proposes a novel, simple, and practical approach for handling general, possibly unstructured, single-dimensional or multi-dimensional, large discrete action spaces. Our approach targets the computational bottleneck in value-based methods caused by the search for a maximum (max and arg max) in every learning iteration, which scales as O(n), i.e., linearly with the number of possible actions n. Through randomization, we can reduce this linear per-step computational complexity to logarithmic.

We introduce stoch max and stoch arg max, which, instead of exhaustively searching for the precise maximum across the entire set of actions, rely on at most two random subsets of actions, both of sub-linear sizes, possibly each of size ⌈log(n)⌉. The first subset is randomly sampled from the complete set of actions, and the second from the previously exploited actions. These stochastic maximization techniques amortize the computational overhead of standard maximization operations in various Q-learning methods (Watkins & Dayan, 1992; Hasselt, 2010; Mnih et al., 2013; Van Hasselt et al., 2016). Stochastic maximization methods significantly accelerate the agent's steps, including action selection and value-function updates in value-based RL methods, making them practical for handling challenging, large-scale, real-world problems.
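To make this concrete, the following is a minimal Python sketch of how such stochastic maximization operators could be implemented for a tabular Q-function. The function names, the use of log base 2, and the small memory of previously exploited actions are illustrative assumptions, not the paper's exact implementation.

```python
import math
import random

def candidate_set(n_actions, memory):
    """Random subset of size ceil(log(n)) unioned with remembered actions."""
    k = max(1, math.ceil(math.log2(n_actions)))   # assuming log base 2
    return set(random.sample(range(n_actions), k)) | set(memory)

def stoch_argmax(q, state, n_actions, memory):
    """Best action within the stochastic candidate set, not over all n actions."""
    candidates = candidate_set(n_actions, memory)
    return max(candidates, key=lambda a: q.get((state, a), 0.0))

def stoch_max(q, state, n_actions, memory):
    """Value of the best candidate; a lower bound on the exact maximum."""
    return q.get((state, stoch_argmax(q, state, n_actions, memory)), 0.0)
```

In such a sketch, the memory would be updated with the action returned by stoch arg max after each exploitation step, so that previously found good actions remain candidates in later rounds.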
We propose Stochastic Q-learning, Stochastic Double Q-learning, StochDQN, and StochDDQN, which are obtained by changing max and arg max to stoch max and stoch arg max in Q-learning (Watkins & Dayan, 1992), Double Q-learning (Hasselt, 2010), the deep Q-network (DQN) (Mnih et al., 2013), and the double DQN (DDQN) (Van Hasselt et al., 2016), respectively. Furthermore, we observed that our approach works even for the on-policy Sarsa (Rummery & Niranjan, 1994).

We conduct a theoretical analysis of the proposed method, proving the convergence of Stochastic Q-learning, which integrates these techniques for action selection and value updates, establishing a lower bound on the probability of sampling an optimal action from a random set of size ⌈log(n)⌉, and analyzing the error of stochastic maximization compared to exact maximization. Furthermore, we evaluate the proposed RL algorithms on environments from Gymnasium (Brockman et al., 2016). For the stochastic deep RL algorithms, the evaluations were performed on control tasks within the multi-joint dynamics with contact (MuJoCo) environment (Todorov et al., 2012) with discretized actions (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). These evaluations demonstrate that the stochastic approaches outperform non-stochastic ones regarding wall-time speedup and sometimes rewards. Our key contributions are summarized as follows:

• We introduce novel stochastic maximization techniques, denoted stoch max and stoch arg max, offering a compelling alternative to conventional deterministic maximization operations, particularly beneficial for handling large discrete action spaces, ensuring sub-linear complexity concerning the number of actions.

• We present a suite of value-based algorithms suitable for large discrete action spaces, including Stochastic Q-learning, Stochastic Sarsa, Stochastic Double Q-learning, StochDQN, and StochDDQN, which integrate stochastic maximization within Q-learning, Sarsa, Double Q-learning, DQN, and DDQN, respectively.

• We analyze stochastic maximization and demonstrate the convergence of Stochastic Q-learning. Furthermore, we empirically validate our approach on tasks from the Gymnasium and MuJoCo environments, encompassing various dimensional discretized actions.
2. Related Works

While RL has shown promise in diverse domains, practical applications often grapple with real-world complexities. A significant hurdle arises when dealing with large discrete action spaces (Dulac-Arnold et al., 2015; 2021). Previous research has investigated strategies to address this challenge by leveraging the combinatorial or the dimensional structures in the action space (He et al., 2016b; Tavakoli et al., 2018; Tessler et al., 2019; Delarue et al., 2020; Seyde et al., 2021; 2022; Fourati et al., 2023; 2024b;a; Akkerman et al., 2023; Ireland & Montana, 2024). For example, He et al. (2016b) leveraged the combinatorial structure of their language problem through sub-action embeddings. Compressed sensing was employed in (Tessler et al., 2019) for text-based games with combinatorial actions. Delarue et al. (2020) formulated the combinatorial action decision of a vehicle routing problem as a mixed-integer program. Moreover, Akkerman et al. (2023) introduced dynamic neighbourhood construction specifically for structured combinatorial large discrete action spaces. Previous works tailored solutions for multi-dimensional spaces, such as those in (Seyde et al., 2021; 2022; Ireland & Montana, 2024), among others; while practical in multi-dimensional spaces, they may not be helpful for single-dimensional large action spaces. While relying on the structure of the action space is practical in some settings, not all problems with large action spaces are multi-dimensional or structured. We complement these works by making no assumptions about the structure of the action space.

Some approaches have proposed factorizing the action spaces to reduce their size. For example, these include factorizing into binary subspaces (Lagoudakis & Parr, 2003; Sallans & Hinton, 2004; Pazis & Parr, 2011; Dulac-Arnold et al., 2012), expert demonstration (Tennenholtz & Mannor, 2019), tensor factorization (Mahajan et al., 2021), and symbolic representations (Cui & Khardon, 2016). Additionally, some hierarchical and multi-agent RL approaches employed factorization as well (Zhang et al., 2020; Kim et al., 2021; Peng et al., 2021; Enders et al., 2023). While some of these methods effectively handle large action spaces for certain problems, they necessitate the design of a representation for each discrete action. Even then, for some problems, the resulting space may still be large.

Methods presented in (Van Hasselt & Wiering, 2009; Dulac-Arnold et al., 2015; Wang et al., 2020) combine continuous-action policy gradients with nearest-neighbour search to generate continuous actions and identify the nearest discrete actions. These are interesting methods but require continuous-to-discrete mapping and are mainly policy-based rather than value-based approaches. In the works of Kalashnikov et al. (2018) and Quillen et al. (2018), the cross-entropy method (Rubinstein, 1999) was utilized to approximate action maximization. This approach requires multiple iterations (r) for a single action selection. During each iteration, it samples n′ values, where n′ < n, fits a Gaussian distribution to m < n′ of these samples, and subsequently draws a new batch of n′ samples from this Gaussian distribution. As a result, this approximation remains costly, with a complexity of O(rn′). Additionally, in the work of Van de Wiele et al. (2020), a neural network was trained to predict the optimal action in combination with a uniform search. This approach involves the use of an expensive autoregressive proposal distribution to generate n′ actions and samples a large number of actions (m), thus remaining computationally expensive, with O(n′ + m). In (Metz et al., 2017), sequential DQN allows the agent to choose sub-actions one by one, which increases the number of steps needed to solve a problem and requires d steps with a linear complexity of O(i) for a discretization granularity i. Additionally, Tavakoli et al. (2018) employ a branching technique with duelling DQN for combinatorial control problems. Their approach has a complexity of O(di) for actions with discretization granularity i and d dimensions, whereas our method, in a similar setting, achieves O(d log(i)). Another line of work introduces action elimination techniques, such as the action elimination DQN (Zahavy et al., 2018), which employs an action elimination network guided by an external elimination signal from the environment. However, it requires this domain-specific signal and can be computationally expensive (O(n′), where n′ ≪ n is the number of remaining actions). In contrast, curriculum learning, as proposed by Farquhar et al. (2020), initially limits an agent's action space, gradually expanding it during training for efficient exploration. However, its effectiveness relies on having an informative restricted action space, and as the action space size grows, its complexity scales linearly with its size, eventually reaching O(n).

In the context of combinatorial bandits with a single state but large discrete action spaces, previous works have exploited the combinatorial structure of actions, where each action is a subset of main arms. For instance, for submodular reward functions, which imply diminishing returns when adding arms, in (Fourati et al., 2023) and (Fourati et al., 2024b), stochastic greedy algorithms are used to avoid exact search. The former evaluates the marginal gains of adding and removing sub-actions (arms), while the latter assumes monotonic rewards and considers adding the best arm until a cardinality constraint is met. For general reward functions, Fourati et al. (2024a) propose using approximation algorithms to evaluate and add sub-actions. While these methods are practical for bandits, they exploit the combinatorial structure of their problems and consider a single-state scenario, which is different from general RL problems.

While some approaches above are practical for handling specific problems with large discrete action spaces, they often exploit the dimensional or combinatorial structures inherent in their considered problems. In contrast, we complement these approaches by proposing a solution to tackle any general, potentially unstructured, single-dimensional or multi-dimensional, large discrete action space without relying on structure assumptions. Our proposed solution is general, simple, and efficient.
et al., 2013) and DDQN (Van Hasselt et al., 2016) algorithms with the stochastic maximization operations. In these modified approaches, we replace the ε_s-greedy exploration strategy with the same exploration policy as in Eq. (6).

For StochDQN, we employ a deep neural network as a function approximator to estimate the action-value function, represented as Q(s, a; θ) ≈ Q(s, a), where θ denotes the weights of the Q-network. This network is trained by minimizing a loss between its predictions and bootstrapped target values.

A non-maximizing action may still yield values similar to those obtained from an optimal action. Therefore, the difference between stochastic maximization and exact maximization might be a more informative metric than just the probability of finding an exact maximizer. Thus, at time step t, given state s and the current estimated Q-function Q_t, we define the estimation error ε_t(s) as follows:

$$\epsilon_t(s) \;\triangleq\; \max_{a \in \mathcal{A}} Q_t(s, a) \;-\; \operatorname{stochmax}_{a \in \mathcal{A}} Q_t(s, a). \qquad (9)$$
In general, a stable Q-function does not have to be static and might still vary over the rounds; the critical characteristic is that its maximizing action remains the same even when its values are updated. Although the stoch max has sub-linear complexity compared to the max, without any assumption on the value distributions, the following corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the stoch max matches precisely the output of max.

Corollary 5.3. For a given state s, assuming a time range where the Q-function becomes stable in that state, ε_t(s) is expected to converge to zero after n/⌈log(n)⌉ iterations.

Recalling the definition of the similarity ratio ω_t, it follows that ω_t(s) = 1 − ε_t(s)/max_{a∈A} Q_t(s, a). Therefore, for a given state s, where the Q-function becomes stable, given the boundedness of iterates in Q-learning, it is expected that ω_t converges to one. This observation was confirmed, even with continuous states and using neural networks as function approximators, in Section 7.4.

6. Stochastic Q-learning Convergence

In this section, we analyze the convergence of Stochastic Q-learning, described in Algorithm 1. This algorithm employs the policy π_Q^S(s), as defined in Eq. (6), with ε_s > 0 to guarantee that P_π[a_t = a | s_t = s] > 0 for all state-action pairs (s, a) ∈ S × A. The value update rule, on the other hand, uses the update rule specified in Eq. (7).

In the convergence analysis, we focus on memoryless maximization. While the stoch arg max operator for action selection can be employed with or without memory, we assume a memoryless stoch max operator for value updates, which means that value updates are performed by maximizing over a randomly sampled subset of actions from A, sampled independently from both the next state s′ and the set used for the stoch arg max.

For a stochastic variable subset of actions C ⊆ A, following some probability distribution P : 2^A → [0, 1], we consider, without loss of generality, Q(·, ∅) = 0, and define, according to P, a target Q-function, denoted as Q*, as:

$$Q^*(s, a) \;\triangleq\; \mathbb{E}\left[\, r(s, a) + \gamma \max_{b \in C \sim \mathcal{P}} Q^*(s', b) \;\middle|\; s, a \right]. \qquad (11)$$

Remark 6.1. The Q* defined above depends on the sampling distribution P. Therefore, it does not represent the optimal value function of the original MDP problem; instead, it is optimal under the condition where only a random subset of actions following the distribution P is available to the agent at each time step. However, as the sampling cardinality increases, it increasingly better approximates the optimal value function of the original MDP, and it fully recovers the optimal Q-function of the original problem when the sampling distribution becomes P(A) = 1.

The following theorem states the convergence of the iterates Q_t of Stochastic Q-learning with memoryless stochastic maximization to the Q*, defined in Eq. (11), for any sampling distribution P, regardless of the cardinality.

Theorem 6.2. For a finite MDP, as described in Section 3, let C be a randomly independently sampled subset of actions from A, of any cardinality, following any distribution P, exclusively sampled for the value updates, for the Stochastic Q-learning, as described in Algorithm 1, given by the following update rule:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left( r_t + \gamma \max_{b \in C \sim \mathcal{P}} Q_t(s_{t+1}, b) \right),$$

given any initial estimate Q_0, Q_t converges with probability 1 to Q*, defined in Eq. (11), as long as Σ_t α_t(s, a) = ∞ and Σ_t α_t²(s, a) < ∞ for all (s, a) ∈ S × A.

The theorem's result demonstrates that for any cardinality of actions, Stochastic Q-learning converges to Q*, as defined in Eq. (11), which recovers the convergence guarantees of Q-learning when the sampling distribution is P(A) = 1.

Remark 6.3. In principle, any size can be used, balancing time complexity and approximation. Our empirical experiments focused on log(n) to illustrate the method's ability to recover Q-learning, even with a few actions. Using √n will approach the value function of Q-learning more closely compared to using log(n), albeit at the cost of higher complexity than log(n).

The theorem shows that even with memoryless stochastic maximization, using randomly sampled O(log(n)) actions, convergence is still guaranteed. However, relying on memory-based stochastic maximization helps minimize the approximation error in stochastic maximization, as shown in Corollary 5.3, and outperforms Q-learning, as shown in the experiments in Section 7.1.
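As an illustration of the update rule in Theorem 6.2, the following is a minimal tabular sketch of one Stochastic Q-learning step, assuming a uniform sampling distribution over subsets of size ⌈log(n)⌉; the function and variable names are illustrative rather than taken from the paper's implementation.

```python
import math
import random
from collections import defaultdict

def stochastic_q_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    """One Stochastic Q-learning update: the bootstrap max is taken over a
    random subset C of actions instead of the full action set."""
    k = max(1, math.ceil(math.log2(n_actions)))          # |C| = ceil(log(n))
    C = random.sample(range(n_actions), k)                # memoryless sampling
    target = r + gamma * max(Q[(s_next, b)] for b in C)   # max over C only
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# usage with a defaultdict-backed tabular Q-function
Q = defaultdict(float)
stochastic_q_update(Q, s=0, a=3, r=1.0, s_next=1, n_actions=1000)
```

With P(A) = 1, i.e., C = A, this reduces to the standard Q-learning update.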
In the following, we provide a sketch of the proof addressing the extra stochasticity due to stochastic maximization. The full proof is provided in Appendix A.

We tackle the additional stochasticity, which depends on the sampling distribution P, by defining an operator Φ which, for any q : S × A → R, is as follows:

$$(\Phi q)(s, a) \;\triangleq\; \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in C} q(s', b) \right]. \qquad (12)$$

We then demonstrate that it is a contraction in the sup-norm, as shown in Lemma 6.4, which we prove in Appendix A.2.

Lemma 6.4. The operator Φ, defined in Eq. (12), is a contraction in the sup-norm, with a contraction factor γ, i.e., ‖Φq₁ − Φq₂‖∞ ≤ γ‖q₁ − q₂‖∞.
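As a quick sanity check of Lemma 6.4 (not part of the paper), the contraction property of an operator of the form (12) can be verified numerically on a small random MDP, here assuming a uniform distribution over fixed-size action subsets:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A, k, gamma = 4, 5, 2, 0.9

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel
r = rng.random((S, A))                                          # bounded rewards
subsets = list(itertools.combinations(range(A), k))             # support of the set distribution
p_C = np.full(len(subsets), 1.0 / len(subsets))                 # uniform P over size-k subsets

def apply_operator(q):
    out = np.zeros((S, A))
    for w, C in zip(p_C, subsets):
        # E_{s'}[ r + gamma * max_{b in C} q(s', b) ] for every (s, a)
        bootstrap = q[:, C].max(axis=1)                          # shape (S,)
        out += w * (r + gamma * P @ bootstrap)
    return out

q1, q2 = rng.random((S, A)), rng.random((S, A))
lhs = np.abs(apply_operator(q1) - apply_operator(q2)).max()
rhs = gamma * np.abs(q1 - q2).max()
assert lhs <= rhs + 1e-12   # contraction with factor gamma
```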
Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.

Ireland, D. and Montana, G. Revalued: Regularised ensemble value-decomposition for factorisable Markov decision processes. arXiv preprint arXiv:2401.08850, 2024.

Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in Neural Information Processing Systems, 6, 1993.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Kim, M., Park, J., et al. Learning collaborative policies to solve NP-hard routing problems. Advances in Neural Information Processing Systems, 34:10418–10430, 2021.

Lagoudakis, M. G. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 424–431, 2003.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin Q-learning: Controlling the estimation bias of Q-learning. arXiv preprint arXiv:2002.06487, 2020.

Li, S., Wei, C., and Wang, Y. Combining decision making and trajectory planning for lane changing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 23(9):16110–16136, 2022.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y.-C., and Kim, D. I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.

Mahajan, A., Samvelyan, M., Mao, L., Makoviychuk, V., Garg, A., Kossaifi, J., Whiteson, S., Zhu, Y., and Anandkumar, A. Reinforcement learning in factored action spaces using tensor decompositions. arXiv preprint arXiv:2110.14538, 2021.

Mazyavkina, N., Sviridov, S., Ivanov, S., and Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.

Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep RL. arXiv preprint arXiv:1705.05035, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. PMLR, 2016.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Pazis, J. and Parr, R. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1185–1192, 2011.

Peng, B., Rashid, T., Schroeder de Witt, C., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. FACMAC: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.

Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. IEEE, 2018.

Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1:127–190, 1999.

Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering, Cambridge, UK, 1994.

Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Seyde, T., Gilitschenski, I., Schwarting, W., Stellato, B., Riedmiller, M., Wulfmeier, M., and Rus, D. Is bang-bang control all you need? Solving continuous control with Bernoulli policies. Advances in Neural Information Processing Systems, 34:27209–27221, 2021.

Seyde, T., Werner, P., Schwarting, W., Gilitschenski, I., Riedmiller, M., Rus, D., and Wulfmeier, M. Solving continuous control via Q-learning. arXiv preprint arXiv:2210.12566, 2022.

Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, 2018.

Tang, Y. and Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5981–5988, 2020.

Tavakoli, A., Pardo, F., and Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Tennenholtz, G. and Mannor, S. The natural language of actions. In International Conference on Machine Learning, pp. 6196–6205. PMLR, 2019.

Wang, G., Shi, D., Xue, C., Jiang, H., and Wang, Y. BiC-DDPG: Bidirectionally-coordinated nets for deep multi-agent reinforcement learning. In International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 337–354. Springer, 2020.

Wang, H., Lin, S., and Zhang, J. Adaptive ensemble Q-learning: Minimizing estimation bias via error feedback. Advances in Neural Information Processing Systems, 34:24778–24790, 2021.

Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., and Miao, Q. Deep reinforcement learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8:279–292, 1992.

Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J., and Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.

Zhang, T., Guo, S., Tan, T., Hu, X., and Chen, F. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 33:21579–21590, 2020.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double Q-learning. In IJCAI, pp. 3455–3461, 2017.
This policy, with ε_s > 0, ensures that P_π[a_t = a | s_t = s] > 0 for all (s, a) ∈ S × A.

Furthermore, during the training, to update the Q-function, given any initial estimate Q_0, we consider a Stochastic Q-learning which uses the stoch max operation as in the following stochastic update rule:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left( r_t + \gamma \operatorname{stochmax}_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right). \qquad (16)$$

For the function updates, we consider a stoch max without memory, which involves a max over a random subset of actions C sampled from a set probability distribution P defined over the combinatorial space of actions, i.e., P : 2^A → [0, 1], which can be a uniform distribution over the action sets of size ⌈log(n)⌉.

Hence, for a random subset of actions C, the update rule of Stochastic Q-learning can be written as:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left( r_t + \gamma \max_{b \in C} Q_t(s_{t+1}, b) \right). \qquad (17)$$

For the transition probability distribution P : S × A × S → [0, 1], the set probability distribution P : 2^A → [0, 1], the reward function r : S × A → R, and the discount factor γ ∈ [0, 1], we define the following contraction operator Φ, defined for a function q : S × A → R, as

$$(\Phi q)(s, a) = \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in C} q(s', b) \right]. \qquad (23)$$
It is now immediate from Lemma 6.4, which we prove in Appendix A.2, that

$$\left| \mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \right| = \left| (\Phi Q_t)(s, a) - Q^*(s, a) \right| \leq \gamma \left\| Q_t - Q^* \right\|_\infty = \gamma \left\| \Delta_t \right\|_\infty, \qquad (24)$$

where Δ_t ≜ Q_t − Q* and F_t(s, a) ≜ r(s, a) + γ max_{b∈C} Q_t(s′, b) − Q*(s, a).

Moreover,

$$
\begin{aligned}
\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right]
&= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in C} Q_t(s', b) - Q^*(s, a) - \left( (\Phi Q_t)(s, a) - Q^*(s, a) \right) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in C} Q_t(s', b) - (\Phi Q_t)(s, a) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&= \operatorname{var}\left[ r(s, a) + \gamma \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\
&= \operatorname{var}\left[ r(s, a) \mid \mathcal{F}_t \right] + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] + 2\gamma \operatorname{cov}\left( r(s, a), \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right) \\
&= \operatorname{var}\left[ r(s, a) \mid \mathcal{F}_t \right] + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right].
\end{aligned}
$$

The last line follows from the fact that the randomness of max_{b∈C} Q_t(s′, b), given F_t, only depends on the random set C and the next state s′. Moreover, we consider the reward r(s, a) independent of the set C and the next state s′, by not using the same set C for both the action selection and the value update.

Given that r is bounded, its variance is bounded by some constant B. Therefore,

$$
\begin{aligned}
\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right]
&\leq B + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\
&= B + \gamma^2 \,\mathbb{E}\left[ \left( \max_{b \in C} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] - \gamma^2 \left( \mathbb{E}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \right)^2 \\
&\leq B + \gamma^2 \,\mathbb{E}\left[ \left( \max_{b \in C} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&\leq B + \gamma^2 \,\mathbb{E}\left[ \left( \max_{s' \in \mathcal{S}} \max_{b \in \mathcal{A}} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&\leq B + \gamma^2 \left( \max_{s' \in \mathcal{S}} \max_{b \in \mathcal{A}} Q_t(s', b) \right)^2 \\
&= B + \gamma^2 \left\| Q_t \right\|_\infty^2 \\
&= B + \gamma^2 \left\| \Delta_t + Q^* \right\|_\infty^2 \\
&\leq B + 2\gamma^2 \left\| Q^* \right\|_\infty^2 + 2\gamma^2 \left\| \Delta_t \right\|_\infty^2 \\
&\leq \left( B + 2\gamma^2 \left\| Q^* \right\|_\infty^2 \right)\left( 1 + \left\| \Delta_t \right\|_\infty^2 \right) + 2\gamma^2 \left( 1 + \left\| \Delta_t \right\|_\infty^2 \right) \\
&\leq 2 \max\left\{ B + 2\gamma^2 \left\| Q^* \right\|_\infty^2,\; 2\gamma^2 \right\} \left( 1 + \left\| \Delta_t \right\|_\infty^2 \right) \\
&\leq 2 \max\left\{ B + 2\gamma^2 \left\| Q^* \right\|_\infty^2,\; 2\gamma^2 \right\} \left( 1 + \left\| \Delta_t \right\|_\infty \right)^2. \qquad (25)
\end{aligned}
$$

Then, by Eq. (24), Eq. (25), and Theorem 1 in (Jaakkola et al., 1993), Δ_t converges to zero with probability 1, i.e., Q_t converges to Q* with probability 1.
Therefore,

$$
\begin{aligned}
\left\| \Phi q_1 - \Phi q_2 \right\|_\infty
&= \max_{s, a} \left| \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ \left( r(s, a) + \gamma \max_{b \in C} q_1(s', b) \right) - \left( r(s, a) + \gamma \max_{b \in C} q_2(s', b) \right) \right] \right| \\
&= \max_{s, a} \left| \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left( \max_{b \in C} q_1(s', b) - \max_{b \in C} q_2(s', b) \right) \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left| \max_{b \in C} q_1(s', b) - \max_{b \in C} q_2(s', b) \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \max_{z, b} \left| q_1(z, b) - q_2(z, b) \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} \mathcal{P}(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left\| q_1 - q_2 \right\|_\infty \\
&= \gamma \left\| q_1 - q_2 \right\|_\infty.
\end{aligned}
$$
B. Stochastic Maximization

We analyze the proposed stochastic maximization method by comparing its error to that of exact maximization. First, we consider the case without memory, where C = R, and then the case with memory, where M ≠ ∅. Finally, we provide a specialized bound for the case where the action values follow a uniform distribution.

$$p \;\geq\; p_{a_0^*}.$$

The probability p_{a₀*} of finding a₀* is the probability of sampling a₀* within the random set R of size ⌈log(n)⌉, which is the fraction of all possible combinations of size ⌈log(n)⌉ that include a₀*.

This fraction can be calculated as $\binom{n-1}{\lceil \log(n) \rceil - 1}$ divided by all possible combinations of size ⌈log(n)⌉, which is $\binom{n}{\lceil \log(n) \rceil}$.

Therefore, $p_{a_0^*} = \binom{n-1}{\lceil \log(n) \rceil - 1} \big/ \binom{n}{\lceil \log(n) \rceil}$.

Consequently,

$$p \;\geq\; \frac{\lceil \log(n) \rceil}{n}. \qquad (27)$$
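The last step follows from a standard binomial identity; spelling out the intermediate computation for completeness:

$$p_{a_0^*} = \frac{\binom{n-1}{\lceil \log(n) \rceil - 1}}{\binom{n}{\lceil \log(n) \rceil}} = \frac{(n-1)!}{(\lceil \log(n) \rceil - 1)!\,(n - \lceil \log(n) \rceil)!} \cdot \frac{\lceil \log(n) \rceil!\,(n - \lceil \log(n) \rceil)!}{n!} = \frac{\lceil \log(n) \rceil}{n},$$

and hence p ≥ p_{a₀*} = ⌈log(n)⌉/n.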
A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed.
However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the key characteristic
is that its maximizing action remains the same even when its values are updated. Although the stoch max has sub-linear
complexity compared to the max, without any assumption of the value distributions, the following Corollary shows that, on
average, for a stable Q-function, after a certain number of iterations, the output of the stoch max matches exactly the output
of max.
For a given discrete state s, M keeps track of the most recent best action found. For C = R ∪ M,

$$\operatorname{stochmax}_{a \in \mathcal{A}} Q(s, a) \;=\; \max_{a \in C} Q(s, a) \;\geq\; \max_{a \in \mathcal{M}} Q(s, a). \qquad (28)$$

Therefore, for a given state s, on average, if the Q-function is stable, then within n/⌈log(n)⌉ time steps, M will contain the optimal action a*. Therefore, on average, after n/⌈log(n)⌉ time steps, stoch max_{a∈A} Q(s, a) ≥ max_{a∈A} Q(s, a). We know that stoch max_{a∈A} Q(s, a) ≤ max_{a∈A} Q(s, a). Therefore, for a stable Q-function, on average, after n/⌈log(n)⌉ time steps, stoch max_{a∈A} Q(s, a) becomes max_{a∈A} Q(s, a).
Lemma B.2. For a given state s, if the values of the sampled actions follow independently a uniform distribution on the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)], then

$$\mathbb{E}\left[ \max_{a \in \mathcal{R}} Q_t(s, a) \;\middle|\; s, a_t^* \right] \;=\; Q_t(s, a_t^*) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.$$

Proof. For a given state s, we assume a uniformly randomly sampled subset R of ⌈log(n)⌉ actions, and the values of the sampled actions are independent and follow a uniform distribution in the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)]. Therefore, the cumulative distribution function (CDF) for the value of an action a, given the state s and the optimal action a*_t, is:

$$F(y; s, a) =
\begin{cases}
0 & \text{for } y < Q_t(s, a_t^*) - b_t(s) \\
\dfrac{y - Q_t(s, a_t^*) + b_t(s)}{b_t(s)} & \text{for } y \in [\,Q_t(s, a_t^*) - b_t(s),\; Q_t(s, a_t^*)\,] \\
1 & \text{for } y > Q_t(s, a_t^*).
\end{cases}$$

If we select ⌈log(n)⌉ such actions, the CDF of the maximum of these actions, denoted F_max, is the following:

$$
\begin{aligned}
F_{\max}(x; s, a) &= \mathbb{P}\left( \max_{a \in \mathcal{R}} Q_t(s, a) \leq x \right) \\
&= \prod_{a \in \mathcal{R}} \mathbb{P}\left( Q_t(s, a) \leq x \right) \\
&= \prod_{a \in \mathcal{R}} F(x; s, a) \\
&= F(x; s, a)^{\lceil \log(n) \rceil}.
\end{aligned}
$$

The second line follows from the independence of the values, and the last line follows from the assumption that all actions follow the same uniform distribution.

In the normalized scale, where values are rescaled to [0, 1], the CDF of the maximum is therefore given by:

$$F_{\max}(x; s, a) =
\begin{cases}
0 & \text{for } x < 0 \\
x^{\lceil \log(n) \rceil} & \text{for } x \in [0, 1] \\
1 & \text{for } x > 1.
\end{cases}$$
We employed the identity $\int_0^1 x \, d\mu(x) = \int_0^1 \left(1 - \mu(x)\right) dx$, which can be demonstrated through integration by parts. To return to the original scale, we can first multiply by b_t(s) and then add Q_t(s, a*_t) − b_t(s), resulting in:

$$\mathbb{E}\left[ \max_{a \in \mathcal{R}} Q_t(s, a) \;\middle|\; s, a_t^* \right] \;=\; Q_t(s, a_t^*) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.$$

As an example of this setting, for Q_t(s, a*_t) = 100 and b_t = 100, in a setting with n = 1000 actions, ⌈log(n)⌉ + 1 = 11. Hence, E[max_{k∈R} Q_t(s, k) | s, a*_t] ≈ 91. This shows that even with a randomly sampled set of actions R, the stoch max can be close to the max. We simulate this setting in the experiments in Fig. 6.
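The numbers in this example can be reproduced with a short Monte Carlo check (an illustrative script, not part of the paper's experimental code):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, q_star, b = 1000, 100.0, 100.0
k = math.ceil(math.log2(n))                 # ceil(log(1000)) = 10 with log base 2

# values of k randomly sampled actions, uniform on [q_star - b, q_star]
samples = rng.uniform(q_star - b, q_star, size=(100_000, k))
print(samples.max(axis=1).mean())           # ~90.9, matching q_star - b/(k + 1)
```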
Our proposed stochastic maximization does not solely rely on the randomly sampled subset of actions R but also considers actions from previous experiences through M. Therefore, the expected stoch max should be higher than the above result, providing an upper bound on the expected ε_t, as described in the following corollary of Lemma B.2.

Corollary B.3. For a given discrete state s, if the values of the sampled actions follow independently a uniform distribution on the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)], then the expected value of ε_t(s) satisfies:

$$\mathbb{E}\left[ \epsilon_t(s) \mid s \right] \;\leq\; \frac{b_t(s)}{\lceil \log(n) \rceil + 1}. \qquad (30)$$

Proof. At time step t, given a state s and the current estimated Q-function Q_t, ε_t(s) is defined as in Eq. (9), i.e., ε_t(s) = max_{a∈A} Q_t(s, a) − stoch max_{a∈A} Q_t(s, a).

For a given state s, a uniformly randomly sampled subset R of ⌈log(n)⌉ actions, and a subset of some previously played actions M ⊂ E, using the law of total expectation,
C. Pseudocodes
D. Experimental Details
D.1. Environments
We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic
maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments
(Brockman et al., 2016). Stochastic Q-learning and Stochastic Double Q-learning are tested on the CliffWalking-v0, the
FrozenLake-v1, and a generated MDP environment, while stochastic deep Q-learning approaches are tested on MuJoCo
control tasks (Todorov et al., 2012).
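As an illustration of how a continuous MuJoCo action space can be turned into a large discrete one, the wrapper below is a plausible sketch using the standard Gymnasium API, not necessarily the exact discretization used in the paper: each action dimension is split into a fixed number of bins and the Cartesian product is enumerated.

```python
import itertools
import numpy as np
import gymnasium as gym

class DiscretizedActions(gym.ActionWrapper):
    """Expose a Box action space as Discrete(i**d): i bins per dimension, d dimensions."""
    def __init__(self, env, bins_per_dim):
        super().__init__(env)
        low, high = env.action_space.low, env.action_space.high
        grids = [np.linspace(l, h, bins_per_dim) for l, h in zip(low, high)]
        self._actions = np.array(list(itertools.product(*grids)), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(len(self._actions))

    def action(self, index):
        return self._actions[index]

# HalfCheetah-v4 has 6 action dimensions; 4 bins per dimension gives 4**6 = 4096 actions,
# the action-space size reported for Half Cheetah in the figures below.
env = DiscretizedActions(gym.make("HalfCheetah-v4"), bins_per_dim=4)
```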
θ ← θ + α ∇_θ L(θ).
s ← s′.
end while
end for
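Since the pseudocode listings themselves are not reproduced here, the following is a rough Python sketch of what a single StochDQN-style interaction step could look like, reusing the stochastic candidate-set idea from above with a Q-network instead of a table; the epsilon-greedy wrapper, the replay buffer, and the network interface are assumptions for illustration only.

```python
import math
import random

def stoch_dqn_step(env, state, q_net, replay, memory_actions, n_actions, epsilon=0.1):
    """One interaction step: stochastic action selection over a small candidate set,
    then store the transition for a later stochastic target update."""
    k = max(1, math.ceil(math.log2(n_actions)))
    candidates = set(random.sample(range(n_actions), k)) | set(memory_actions)
    if random.random() < epsilon:
        action = random.choice(list(candidates))                   # explore within the candidate set
    else:
        action = max(candidates, key=lambda a: q_net(state, a))    # stoch arg max
    next_state, reward, terminated, truncated, _ = env.step(action)
    replay.append((state, action, reward, next_state, terminated))
    memory_actions.append(action)                                   # feeds future stoch arg max calls
    return next_state, terminated or truncated
```

The corresponding value update would then apply the stochastic target of Eq. (17) to a minibatch sampled from the replay buffer.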
D.2. Algorithms
D.2.1. Stochastic Maximization

We have two scenarios, one for discrete and the other for continuous states. For discrete states, E is a dictionary whose keys are the states in S, with corresponding values being the latest played action in every state. In contrast, for continuous states, E comprises the actions in the replay buffer. Indeed, we do not consider the whole set E either. Instead, we only consider a subset M ⊂ E. For discrete states, for a given state s, M includes the latest two exploited actions in state s. For continuous states, where it is impossible to retain the last exploited action for each state, we consider a randomly sampled subset M ⊂ E, which includes ⌈log(n)⌉ actions, even though they were played in different states. In the experiments involving continuous states, we demonstrate that this was sufficient to achieve good results; see Section 7.3.
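A minimal sketch of how the memory M described above could be maintained in the two scenarios (the container choices and names here are illustrative assumptions):

```python
import math
import random
from collections import defaultdict, deque

# Discrete states: keep the latest two exploited actions per state.
exploited = defaultdict(lambda: deque(maxlen=2))

def memory_discrete(state):
    return list(exploited[state])            # M for a given discrete state

# Continuous states: sample ceil(log(n)) actions from those stored in the replay buffer.
def memory_continuous(replay_actions, n_actions):
    k = max(1, math.ceil(math.log2(n_actions)))
    k = min(k, len(replay_actions))
    return random.sample(list(replay_actions), k) if k > 0 else []
```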
Figure 5: Comparison results for the stochastic and deterministic methods. The x-axis represents the number of possible
actions, and the y-axis represents the time step duration of the agent in seconds.
Figure 6: stoch max with (w/) and without (w/o) memory M vs. max on uniformly distributed action values as described
in Section B. The x-axis and y-axis represent the steps and the values, respectively.
E. Additional Results
E.1. Wall Time Speed
Stochastic maximization methods exhibit logarithmic complexity with respect to the number of actions, as confirmed in Fig. 5a. Therefore, both StochDQN and StochDDQN, which apply these techniques for action selection and updates, have substantially faster execution times than DQN and DDQN, whose per-step cost grows linearly with the number of actions; this can be seen in Fig. 5b, which shows the complete step duration for the deep Q-learning methods, including action selection and network update. The proposed methods are nearly as fast as a random algorithm, which samples and selects actions randomly and performs no updates.
Figure 7: Comparison results for the stochastic and non-stochastic methods for the Inverted Pendulum with 512 actions.
Figure 8: Comparison results for the stochastic and non-stochastic methods for the Half Cheetah with 4096 actions.
For deep Q-networks, on InvertedPendulum-v4, both stoch max and max return similar values (Fig. 7a), ω_t approaches one rapidly (Fig. 7b), and ε_t remains below 0.5 (Fig. 7c). In the case of HalfCheetah-v4, both stoch max and max return similar values (Fig. 8a), ω_t quickly converges to one (Fig. 8b), and ε_t stays below eight (Fig. 8c).

While the difference ε_t remains bounded, the values of both stoch max and max increase over the rounds as the agent explores better options. This leads to the ratio ω_t converging to one as the error becomes negligible over the rounds, as expected according to Eq. (34).
Figure 9: Stochastic vs. non-stochastic deep Q-learning variants on Inverted Pendulum and Half Cheetah, with steps on the x-axis and average returns, smoothed over a window of size 100, on the y-axis.
Figure 10: Comparing stochastic and non-stochastic Q-learning approaches on the Cliff Walking, with steps on the x-axis,
instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards on the y-axis
for plot (b).
Figure 11: Comparing stochastic and non-stochastic Q-learning approaches on the Frozen Lake, with steps on the x-axis,
instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards on the y-axis
for plot (b).
Figure 12: Comparing stochastic and non-stochastic Q-learning approaches on the generated MDP environment, with steps
on the x-axis, instantaneous rewards smoothed over a size 1000 window on the y-axis for plot (a), and cumulative rewards
on the y-axis for plot (b).