Stochastic Q-learning for Large Discrete Action Spaces

Fares Fourati 1 Vaneet Aggarwal 2 Mohamed-Slim Alouini 1

1 Department of Computer, Electrical and Mathematical Science and Engineering, King Abdullah University of Science and Technology (KAUST), Thuwal, KSA. 2 School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA. Correspondence to: Fares Fourati <fares.fourati@kaust.edu.sa>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2405.10310v1 [cs.LG] 16 May 2024

Abstract

In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of n actions, only consider a variable stochastic set of a sub-linear number of actions, possibly as small as O(log(n)). The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.

1. Introduction

Reinforcement learning (RL), a continually evolving field of machine learning, has achieved notable successes, especially when combined with deep learning (Sutton & Barto, 2018; Wang et al., 2022). While there have been several advances in the field, a significant challenge lies in navigating complex environments with large discrete action spaces (Dulac-Arnold et al., 2015; 2021). In such scenarios, standard RL algorithms suffer in terms of computational efficiency (Akkerman et al., 2023). Identifying the optimal actions might entail cycling through all of them, in general, multiple times within different states, which is computationally expensive and may become prohibitive with large discrete action spaces (Tessler et al., 2019).

Such challenges apply to various domains, including combinatorial optimization (Mazyavkina et al., 2021; Fourati et al., 2023; 2024b;a), natural language processing (He et al., 2015; 2016a;b; Tessler et al., 2019), communications and networking (Luong et al., 2019; Fourati & Alouini, 2021), recommendation systems (Dulac-Arnold et al., 2015), transportation (Al-Abbasi et al., 2019; Haliem et al., 2021; Li et al., 2022), and robotics (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020; Seyde et al., 2021; 2022; Gonzalez et al., 2023; Ireland & Montana, 2024). Although tailored solutions leveraging action space structures and dimensions may suffice in specific contexts, their applicability across diverse problems, possibly unstructured, still needs to be expanded. We complement these works by proposing a general method that addresses a broad spectrum of problems, accommodating structured and unstructured single and multi-dimensional large discrete action spaces.

Value-based and actor-based approaches are both prominent in RL. Value-based approaches, which entail the agent implicitly optimizing its policy by maximizing a value function, demonstrate superior generalization capabilities but demand significant computational resources, particularly in complex settings. Conversely, actor-based approaches, which entail the agent directly optimizing its policy, offer computational efficiency but often encounter challenges in generalizing across multiple and unexplored actions (Dulac-Arnold et al., 2015). While both hold unique advantages and challenges, they represent distinct avenues for addressing the complexities of decision-making in large action spaces. However, comparing them falls outside the scope of this work. While some previous methods have focused on the latter (Dulac-Arnold et al., 2015), our work concentrates on the former. Specifically, we aim to exploit the natural generalization inherent in value-based RL approaches while reducing their per-step computational complexity.


Q-learning, as introduced by Watkins & Dayan (1992) for discrete action and state spaces, stands out as one of the most famous examples of value-based RL methods and remains one of the most widely used ones in the field. As an off-policy learning method, it decouples the learning process from the agent's current policy, allowing it to leverage past experiences from various sources, which becomes advantageous in complex environments. In each step of Q-learning, the agent updates its action value estimates based on the observed reward and the estimated value of the best action in the next state.

Some approaches have been proposed to apply Q-learning to continuous state spaces, leveraging deep neural networks (Mnih et al., 2013; Van Hasselt et al., 2016). Moreover, several improvements have also been suggested to address its inherent estimation bias (Hasselt, 2010; Van Hasselt et al., 2016; Zhang et al., 2017; Lan et al., 2020; Wang et al., 2021). However, despite this progress and its numerous advantages, a significant challenge still needs to be solved in Q-learning-like methods when confronted with large discrete action spaces. The computational complexity associated with selecting actions and updating Q-functions increases proportionally with the number of actions, which renders the conventional approach impractical as the number of actions substantially increases. Consequently, we confront a crucial question: Is it possible to mitigate the complexity of the different Q-learning methods while maintaining good performance?

This work proposes a novel, simple, and practical approach for handling general, possibly unstructured, single-dimensional or multi-dimensional, large discrete action spaces. Our approach targets the computational bottleneck in value-based methods caused by the search for a maximum (max and arg max) in every learning iteration, which scales as O(n), i.e., linearly with the number of possible actions n. Through randomization, we can reduce this linear per-step computational complexity to logarithmic.

We introduce stoch max and stoch arg max, which, instead of exhaustively searching for the precise maximum across the entire set of actions, rely on at most two random subsets of actions, both of sub-linear sizes, possibly each of size ⌈log(n)⌉. The first subset is randomly sampled from the complete set of actions, and the second from the previously exploited actions. These stochastic maximization techniques amortize the computational overhead of standard maximization operations in various Q-learning methods (Watkins & Dayan, 1992; Hasselt, 2010; Mnih et al., 2013; Van Hasselt et al., 2016). Stochastic maximization methods significantly accelerate the agent's steps, including action selection and value-function updates in value-based RL methods, making them practical for handling challenging, large-scale, real-world problems.

We propose Stochastic Q-learning, Stochastic Double Q-learning, StochDQN, and StochDDQN, which are obtained by changing max and arg max to stoch max and stoch arg max in Q-learning (Watkins & Dayan, 1992), Double Q-learning (Hasselt, 2010), the deep Q-network (DQN) (Mnih et al., 2013), and the double DQN (DDQN) (Van Hasselt et al., 2016), respectively. Furthermore, we observed that our approach works even for the on-policy Sarsa (Rummery & Niranjan, 1994).

We conduct a theoretical analysis of the proposed method, proving the convergence of Stochastic Q-learning, which integrates these techniques for action selection and value updates, establishing a lower bound on the probability of sampling an optimal action from a random set of size ⌈log(n)⌉, and analyzing the error of stochastic maximization compared to exact maximization. Furthermore, we evaluate the proposed RL algorithms on environments from Gymnasium (Brockman et al., 2016). For the stochastic deep RL algorithms, the evaluations were performed on control tasks within the multi-joint dynamics with contact (MuJoCo) environment (Todorov et al., 2012) with discretized actions (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). These evaluations demonstrate that the stochastic approaches outperform non-stochastic ones regarding wall-time speedup and sometimes rewards. Our key contributions are summarized as follows:

• We introduce novel stochastic maximization techniques, denoted stoch max and stoch arg max, offering a compelling alternative to conventional deterministic maximization operations, particularly beneficial for handling large discrete action spaces, ensuring sub-linear complexity concerning the number of actions.

• We present a suite of value-based algorithms suitable for large discrete action spaces, including Stochastic Q-learning, Stochastic Sarsa, Stochastic Double Q-learning, StochDQN, and StochDDQN, which integrate stochastic maximization within Q-learning, Sarsa, Double Q-learning, DQN, and DDQN, respectively.

• We analyze stochastic maximization and demonstrate the convergence of Stochastic Q-learning. Furthermore, we empirically validate our approach on tasks from the Gymnasium and MuJoCo environments, encompassing various dimensional discretized actions.


2. Related Works

While RL has shown promise in diverse domains, practical applications often grapple with real-world complexities. A significant hurdle arises when dealing with large discrete action spaces (Dulac-Arnold et al., 2015; 2021). Previous research has investigated strategies to address this challenge by leveraging the combinatorial or the dimensional structures in the action space (He et al., 2016b; Tavakoli et al., 2018; Tessler et al., 2019; Delarue et al., 2020; Seyde et al., 2021; 2022; Fourati et al., 2023; 2024b;a; Akkerman et al., 2023; Ireland & Montana, 2024). For example, He et al. (2016b) leveraged the combinatorial structure of their language problem through sub-action embeddings. Compressed sensing was employed in (Tessler et al., 2019) for text-based games with combinatorial actions. Delarue et al. (2020) formulated the combinatorial action decision of a vehicle routing problem as a mixed-integer program. Moreover, Akkerman et al. (2023) introduced dynamic neighbourhood construction specifically for structured combinatorial large discrete action spaces. Previous works have tailored solutions for multi-dimensional spaces, such as those in (Seyde et al., 2021; 2022; Ireland & Montana, 2024), among others; while practical in multi-dimensional spaces, they may not be helpful for single-dimensional large action spaces. While relying on the structure of the action space is practical in some settings, not all problems with large action spaces are multi-dimensional or structured. We complement these works by making no assumptions about the structure of the action space.

Some approaches have proposed factorizing the action spaces to reduce their size. For example, these include factorizing into binary subspaces (Lagoudakis & Parr, 2003; Sallans & Hinton, 2004; Pazis & Parr, 2011; Dulac-Arnold et al., 2012), expert demonstration (Tennenholtz & Mannor, 2019), tensor factorization (Mahajan et al., 2021), and symbolic representations (Cui & Khardon, 2016). Additionally, some hierarchical and multi-agent RL approaches employed factorization as well (Zhang et al., 2020; Kim et al., 2021; Peng et al., 2021; Enders et al., 2023). While some of these methods effectively handle large action spaces for certain problems, they necessitate the design of a representation for each discrete action. Even then, for some problems, the resulting space may still be large.

Methods presented in (Van Hasselt & Wiering, 2009; Dulac-Arnold et al., 2015; Wang et al., 2020) combine continuous-action policy gradients with nearest neighbour search to generate continuous actions and identify the nearest discrete actions. These are interesting methods but require continuous-to-discrete mapping and are mainly policy-based rather than value-based approaches. In the works of Kalashnikov et al. (2018) and Quillen et al. (2018), the cross-entropy method (Rubinstein, 1999) was utilized to approximate action maximization. This approach requires multiple iterations (r) for a single action selection. During each iteration, it samples n′ values, where n′ < n, fits a Gaussian distribution to m < n′ of these samples, and subsequently draws a new batch of n′ samples from this Gaussian distribution. As a result, this approximation remains costly, with a complexity of O(rn′). Additionally, in the work of Van de Wiele et al. (2020), a neural network was trained to predict the optimal action in combination with a uniform search. This approach involves the use of an expensive autoregressive proposal distribution to generate n′ actions and samples a large number of actions (m), thus remaining computationally expensive, with O(n′ + m). In (Metz et al., 2017), sequential DQN allows the agent to choose sub-actions one by one, which increases the number of steps needed to solve a problem and requires d steps with a linear complexity of O(i) for a discretization granularity i. Additionally, Tavakoli et al. (2018) employ a branching technique with duelling DQN for combinatorial control problems. Their approach has a complexity of O(di) for actions with discretization granularity i and d dimensions, whereas our method, in a similar setting, achieves O(d log(i)). Another line of work introduces action elimination techniques, such as the action elimination DQN (Zahavy et al., 2018), which employs an action elimination network guided by an external elimination signal from the environment. However, it requires this domain-specific signal and can be computationally expensive (O(n′), where n′ ≤ n is the number of remaining actions). In contrast, curriculum learning, as proposed by Farquhar et al. (2020), initially limits an agent's action space, gradually expanding it during training for efficient exploration. However, its effectiveness relies on having an informative restricted action space, and as the action space size grows, its complexity scales linearly with its size, eventually reaching O(n).

In the context of combinatorial bandits with a single state but large discrete action spaces, previous works have exploited the combinatorial structure of actions, where each action is a subset of main arms. For instance, for submodular reward functions, which imply diminishing returns when adding arms, in (Fourati et al., 2023) and (Fourati et al., 2024b), stochastic greedy algorithms are used to avoid exact search. The former evaluates the marginal gains of adding and removing sub-actions (arms), while the latter assumes monotonic rewards and considers adding the best arm until a cardinality constraint is met. For general reward functions, Fourati et al. (2024a) propose using approximation algorithms to evaluate and add sub-actions. While these methods are practical for bandits, they exploit the combinatorial structure of their problems and consider a single-state scenario, which is different from general RL problems.

While some approaches above are practical for handling specific problems with large discrete action spaces, they often exploit the dimensional or combinatorial structures inherent in their considered problems. In contrast, we complement these approaches by proposing a solution to tackle any general, potentially unstructured, single-dimensional or multi-dimensional, large discrete action space without relying on structure assumptions. Our proposed solution is general, simple, and efficient.


3. Problem Description

In the context of a Markov decision process (MDP), we have specific components: a finite set of actions denoted as A, a finite set of states denoted as S, a transition probability distribution P : S × A × S → [0, 1], a bounded reward function r : S × A → R, and a discount factor γ ∈ [0, 1]. Furthermore, for time step t, we denote the chosen action as a_t, the current state as s_t, and the received reward as r_t ≜ r(s_t, a_t). Additionally, for time step t, we define a learning rate function α_t : S × A → [0, 1].

The cumulative reward an agent receives during an episode in an MDP with variable length time T is the return R_t. It is calculated as the discounted sum of rewards from time step t until the episode terminates: R_t ≜ Σ_{i=t}^{T} γ^{i−t} r_i. RL aims to learn a policy π : S → A mapping states to actions that maximizes the expected return across all episodes. The state-action value function, denoted as Q^π(s, a), represents the expected return when starting from a given state s, taking action a, and following a policy π afterwards. The function Q^π can be expressed recursively using the Bellman equation:

    Q^π(s, a) = r(s, a) + γ Σ_{s′∈S} P(s′ | s, a) Q^π(s′, π(s′)).    (1)

Two main categories of policies are commonly employed in RL systems: value-based and actor-based policies (Sutton & Barto, 2018). This study primarily concentrates on the former type, where the value function directly influences the policy's decisions. An example of a value-based policy in a state s is an ε_s-greedy algorithm, which selects the action with the highest Q-function value with probability (1 − ε_s), where ε_s ≥ 0 may depend on the state s, requiring the use of the arg max operation, as follows:

    π_Q(s) = { play randomly               with probability ε_s,
             { arg max_{a∈A} Q(s, a)       otherwise.                    (2)

Furthermore, during the training, to update the Q-function, Q-learning (Watkins & Dayan, 1992), for example, uses the following update rule, which requires a max operation:

    Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t)
                        + α_t(s_t, a_t) [ r_t + γ max_{b∈A} Q_t(s_{t+1}, b) ].    (3)

Therefore, the computational complexity of both the action selections in Eq. (2) and the Q-function updates in Eq. (3) scales linearly with the cardinality n of the action set A, making this approach infeasible as the number of actions increases significantly. The same complexity issues remain for other Q-learning variants, such as Double Q-learning (Hasselt, 2010), DQN (Mnih et al., 2013), and DDQN (Van Hasselt et al., 2016), among several others.
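To make this per-step cost concrete, the following minimal sketch (our own illustrative code, not from the paper) implements the ε_s-greedy selection of Eq. (2) and the update of Eq. (3) for a tabular Q; both np.argmax and np.max scan all n entries of Q[s], which is exactly the linear-in-n bottleneck discussed above.

```python
import numpy as np

def egreedy_action(Q, s, eps, rng):
    # Eq. (2): explore with probability eps, otherwise scan all n actions.
    n = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(n))
    return int(np.argmax(Q[s]))                 # O(n) arg max

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Eq. (3): the bootstrap target also needs an O(n) max over Q[s_next].
    target = r + gamma * np.max(Q[s_next])      # O(n) max
    Q[s, a] += alpha * (target - Q[s, a])
```

Replacing these two full scans with searches over small random subsets of actions is precisely what the stoch max and stoch arg max operators of the next section do.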


When representing the value function as a parameterized function, such as a neural network, taking only the current state s as input and outputting the values for all actions, as proposed in DQN (Mnih et al., 2013), the network must accommodate a large number of output nodes, which results in increasing memory overhead and necessitates extensive predictions and maximization over these final outputs in the last layer. A notable point about this approach is that it does not exploit contextual information (representation) of actions, if available, which leads to lower generalization capability across actions with similar features and fails to generalize over new actions.

Previous works have considered generalization over actions by taking the features of an action a and the current state s as inputs to the Q-network and predicting its value (Zahavy et al., 2018; Metz et al., 2017; Van de Wiele et al., 2020). However, it leads to further complications when the value function is modeled as a parameterized function with both state s and action a as inputs. Although this approach allows for improved generalization across the action space by leveraging contextual information from each action and generalizing across similar ones, it requires evaluating the function for each action within the action set A. This results in a linear increase in the number of function calls as the number of actions grows. This scalability issue becomes particularly problematic when dealing with computationally expensive function approximators, such as deep neural networks (Dulac-Arnold et al., 2015). Addressing these challenges forms the motivation behind this work.

4. Proposed Approach

To alleviate the computational burden associated with maximizing a Q-function at each time step, especially when dealing with large action spaces, we introduce stochastic maximization methods with sub-linear complexity relative to the size of the action set A. Then, we integrate these methods into different value-based RL algorithms.

4.1. Stochastic Maximization

We introduce stochastic maximization as an alternative to maximization when dealing with large discrete action spaces. Instead of conducting an exhaustive search for the precise maximum across the entire set of actions A, stochastic maximization searches for a maximum within a stochastic subset of actions of sub-linear size relative to the total number of actions. In principle, any size can be used, trading off time complexity and approximation. We mainly focus on O(log(n)) to illustrate the power of the method in recovering Q-learning, even with such a small number of actions, with logarithmic complexity.

We consider two approaches to stochastic maximization: memoryless and memory-based approaches. The memoryless one samples a random subset of actions R ⊆ A with a sublinear size and seeks the maximum within this subset. On the other hand, the memory-based one expands the randomly sampled set to include a few actions M, with a sublinear size, from the latest exploited actions E, and uses the combined sets to search for a stochastic maximum. Stochastic maximization, which may miss the exact maximum in both versions, is always upper-bounded by deterministic maximization, which finds the exact maximum. However, by construction, it has sublinear complexity in the number of actions, making it appealing when maximizing over large action spaces becomes impractical.

Formally, given a state s, which may be discrete or continuous, along with a Q-function, a random subset of actions R ⊆ A, and a memory subset M ⊆ E (empty in the memoryless case), each subset being of sublinear size, such as at most O(log(n)) each, the stoch max is the maximum value computed from the union set C = R ∪ M, defined as:

    stoch max_{k∈A} Q_t(s, k) ≜ max_{k∈C} Q_t(s, k).    (4)

Besides, the stoch arg max is computed as follows:

    stoch arg max_{k∈A} Q_t(s, k) ≜ arg max_{k∈C} Q_t(s, k).    (5)

In the analysis of stochastic maximization, we explore both memory-based and memoryless maximization. In the analysis and experiments, we consider the random set R to consist of ⌈log(n)⌉ actions. When memory-based, in our experiments, within a given discrete state, we consider the two most recently exploited actions in that state. For continuous states, where it is impossible to retain the latest exploited actions for each state, we consider a randomly sampled subset M ⊆ E, which includes ⌈log(n)⌉ actions, even though they were played in different states. We demonstrate that this approach was sufficient to achieve good results in the benchmarks considered; see Section 7.3. Our Stochastic Q-learning convergence analysis considers memoryless stochastic maximization with a random set R of any size.

Remark 4.1. By setting C equal to A, we essentially revert to standard approaches. Consequently, our method is an extension of non-stochastic maximization. However, in pursuit of our objective to make RL practical for large discrete action spaces, for a given state s, in our analysis and experiments, we keep the union set C limited to at most 2⌈log(n)⌉ actions, ensuring sub-linear (logarithmic) complexity.

4.2. Stochastic Q-learning

We introduce Stochastic Q-learning, described in Algorithm 1, and Stochastic Double Q-learning, described in Algorithm 2 in Appendix C, which replace the max and arg max operations in Q-learning and Double Q-learning with stoch max and stoch arg max, respectively. Furthermore, we introduce Stochastic Sarsa, described in Algorithm 3 in Appendix C, which replaces the maximization in the greedy action selection (arg max) in Sarsa.

Algorithm 1 Stochastic Q-learning
  Initialize Q(s, a) for all s ∈ S, a ∈ A
  for each episode do
    Observe state s.
    for each step of episode do
      Choose a from s with policy π^S_Q(s).
      Take action a, observe r, s′.
      b* ← stoch arg max_{b∈A} Q(s′, b).
      δ ← r + γ Q(s′, b*) − Q(s, a).
      Q(s, a) ← Q(s, a) + α(s, a) δ.
      s ← s′.
    end for
  end for

Our proposed solution takes a distinct approach from the conventional method of selecting the action with the highest Q-function value from the complete set of actions A. Instead, it uses stochastic maximization, which finds a maximum within a stochastic subset C ⊆ A, constructed as explained in Section 4.1. Our stochastic policy π^S_Q(s) uses an ε_s-greedy algorithm in a given state s, with a probability of (1 − ε_s), for ε_s > 0, and is defined as follows:

    π^S_Q(s) ≜ { play randomly                       with probability ε_s,
               { stoch arg max_{a∈A} Q(s, a)         otherwise.              (6)

Furthermore, during the training, to update the Q-function, our proposed Stochastic Q-learning uses the following rule:

    Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t)
                        + α_t(s_t, a_t) [ r_t + γ stoch max_{b∈A} Q_t(s_{t+1}, b) ].    (7)

While Stochastic Q-learning, like Q-learning, employs the same values for action selection and action evaluation, Stochastic Double Q-learning, similar to Double Q-learning, learns two separate Q-functions. For each update, one Q-function determines the policy, while the other determines the value of that policy. Both stochastic learning methods remove the maximization bottleneck from exploration and training updates, making these proposed algorithms significantly faster than their deterministic counterparts.
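The following minimal sketch (our own code; the function names, set sizes, and the Gymnasium-style environment interface with integer state and action indices are assumptions, not the authors' implementation) shows how Eqs. (4)-(7) and Algorithm 1 fit together for a tabular Q-function, with the candidate set C = R ∪ M built from ⌈log(n)⌉ random actions plus recently exploited ones, as described in Section 4.1.

```python
import math
import numpy as np

def sample_candidates(n, memory, rng):
    # C = R ∪ M: a random subset R of ⌈log(n)⌉ actions plus up to
    # ⌈log(n)⌉ recently exploited actions M (memoryless if memory is empty).
    k = max(1, math.ceil(math.log(n)))
    R = rng.choice(n, size=k, replace=False)
    M = np.array(memory[-k:], dtype=int) if memory else np.empty(0, dtype=int)
    return np.unique(np.concatenate([R, M]))

def stoch_argmax(q_row, cand):
    return int(cand[np.argmax(q_row[cand])])      # Eq. (5), O(|C|)

def stoch_max(q_row, cand):
    return float(q_row[cand].max())               # Eq. (4), O(|C|)

def stochastic_q_learning(env, n_states, n_actions, episodes,
                          alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    memory = {s: [] for s in range(n_states)}     # recently exploited actions
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            cand = sample_candidates(n_actions, memory[s], rng)
            a = (int(rng.integers(n_actions)) if rng.random() < eps
                 else stoch_argmax(Q[s], cand))   # policy of Eq. (6)
            s2, r, term, trunc, _ = env.step(a)
            upd = sample_candidates(n_actions, memory[s2], rng)
            target = r + gamma * stoch_max(Q[s2], upd)   # update of Eq. (7)
            Q[s, a] += alpha * (target - Q[s, a])
            memory[s].append(a)
            s, done = s2, (term or trunc)
    return Q
```

Since |C| = O(log(n)), each action selection and each value update touches only a logarithmic number of Q-entries instead of all n.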


4.3. Stochastic Deep Q-network

We introduce Stochastic DQN (StochDQN), described in Algorithm 4 in Appendix C, and Stochastic DDQN (StochDDQN) as efficient variants of deep Q-networks. These variants substitute the maximization steps in the DQN (Mnih et al., 2013) and DDQN (Van Hasselt et al., 2016) algorithms with the stochastic maximization operations. In these modified approaches, we replace the ε_s-greedy exploration strategy with the same exploration policy as in Eq. (6).

For StochDQN, we employ a deep neural network as a function approximator to estimate the action-value function, represented as Q(s, a; θ) ≈ Q(s, a), where θ denotes the weights of the Q-network. This network is trained by minimizing a series of loss functions denoted as L_i(θ_i), with these loss functions changing at each iteration i as follows:

    L_i(θ_i) ≜ E_{s,a∼ρ(·)} [ (y_i − Q(s, a; θ_i))² ],    (8)

where y_i ≜ E[ r + γ stoch max_{b∈A} Q(s′, b; θ_{i−1}) | s, a ]. In this context, y_i represents the target value for an iteration i, and ρ(·) is a probability distribution that covers states and actions. Like the DQN approach, we keep the parameters fixed from the previous iteration, denoted as θ_{i−1}, when optimizing the loss function L_i(θ_i).

These target values depend on the network weights, which differ from the fixed targets typically used in supervised learning. We employ stochastic gradient descent for the training. While StochDQN, like DQN, employs the same values for action selection and evaluation, StochDDQN, like DDQN, trains two separate value functions. It does this by randomly assigning each experience to update one of the two value functions, resulting in two sets of weights, θ and θ′. For each update, one set of weights determines the policy, while the other set determines the values.
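As a rough illustration of Eq. (8), the sketch below (a PyTorch stand-in under our own assumptions about the network and replay-batch format, not the paper's code) computes the StochDQN regression target by maximizing the target network's outputs over a sampled subset of action indices only; everything else matches an ordinary DQN step.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, s):
        return self.net(s)                          # (batch, n_actions)

def stochdqn_loss(q_net, target_net, batch, gamma):
    # batch: states s, int64 actions a, rewards r, next states s2, float done.
    s, a, r, s2, done = batch
    n = q_net.net[-1].out_features
    k = max(1, math.ceil(math.log(n)))
    cand = torch.randperm(n)[:k]                    # memoryless action subset
    with torch.no_grad():
        q_next = target_net(s2)[:, cand].max(dim=1).values   # stoch max
        y = r + gamma * (1.0 - done) * q_next                # target of Eq. (8)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return F.mse_loss(q_sa, y)
```

A memory-based variant would simply append recently exploited action indices to cand before taking the maximum.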


5. Stochastic Maximization Analysis

In the following, we study stochastic maximization with and without memory compared to exact maximization.

5.1. Memoryless Stochastic Maximization

Memoryless stochastic maximization, i.e., C = R ∪ ∅, does not always yield an optimal maximizer. To return an optimal action, this action needs to be randomly sampled from the set of actions. Finding an exact maximizer, without relying on memory M, is a random event with a probability p, representing the likelihood of sampling such an exact maximizer. In the following lemma, we provide a lower bound on the probability of discovering an optimal action within a uniformly randomly sampled subset C = R of ⌈log(n)⌉ actions, which we prove in Appendix B.1.1.

Lemma 5.1. For any given state s, the probability p of sampling an optimal action from a uniformly randomly chosen subset C of size ⌈log(n)⌉ actions is at least ⌈log(n)⌉/n.

While finding an exact maximizer through sampling may not always occur, the rewards of near-optimal actions can still be similar to those obtained from an optimal action. Therefore, the difference between stochastic maximization and exact maximization might be a more informative metric than just the probability of finding an exact maximizer. Thus, at time step t, given state s and the current estimated Q-function Q_t, we define the estimation error Δ_t(s) as follows:

    Δ_t(s) ≜ max_{a∈A} Q_t(s, a) − stoch max_{a∈A} Q_t(s, a).    (9)

Furthermore, we define the similarity ratio ω_t(s) as follows:

    ω_t(s) ≜ stoch max_{a∈A} Q_t(s, a) / max_{a∈A} Q_t(s, a).    (10)

It can be seen from the definitions that Δ_t(s) ≥ 0 and ω_t(s) ≤ 1. While sampling the exact maximizer is not always possible, near-optimal actions may yield near-optimal values, providing good approximations, i.e., Δ_t(s) ≈ 0 and ω_t(s) ≈ 1. In general, this difference depends on the value distribution over the actions.

While we do not make any specific assumptions about the value distribution in our work, we note that with some simplifying assumptions on the value distributions over the actions, one can derive more specialized guarantees. For example, assuming that the rewards are uniformly distributed over the actions, we demonstrate in Appendix B.3 that for a given discrete state s, if the values of the sampled actions independently follow a uniform distribution over the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)], where b_t(s) represents the range of the Q_t(s, ·) values over the actions in state s at time step t, then the expected value of Δ_t(s), even without memory, is: E[Δ_t(s) | s] ≤ b_t(s)/(⌈log(n)⌉ + 1). Furthermore, we empirically demonstrate that for the considered control problems, the difference Δ_t(s) is not large, and the ratio ω_t(s) is close to one, as shown in Section 7.4.

5.2. Stochastic Maximization with Memory

While memoryless stochastic maximization could approach the maximum value or find it with the probability p, lower-bounded in Lemma 5.1, it does not converge to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6 in Appendix E.2.1. However, memory-based stochastic maximization, i.e., C = R ∪ M with M ≠ ∅, can become an exact maximization when the Q-function becomes stable, as we state in Corollary 5.3, which we prove in Appendix B.2.1, and as confirmed in Fig. 6.

Definition 5.2. A Q-function is considered stable for a given time range and state s when its maximizing action in that state remains unchanged for all subsequent steps within that time, even if the Q-function's values themselves change.

A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the critical characteristic is that its maximizing action remains the same even when its values are updated. Although the stoch max has sub-linear complexity compared to the max, without any assumption on the value distributions, the following corollary shows that, on average, for a stable Q-function, after a certain number of iterations, the output of the stoch max matches precisely the output of the max.

Corollary 5.3. For a given state s, assuming a time range where the Q-function becomes stable in that state, Δ_t(s) is expected to converge to zero after n/⌈log(n)⌉ iterations.

Recalling the definition of the similarity ratio ω_t, it follows that ω_t(s) = 1 − Δ_t(s)/max_{a∈A} Q_t(s, a). Therefore, for a given state s, where the Q-function becomes stable, given the boundedness of iterates in Q-learning, it is expected that ω_t converges to one. This observation was confirmed, even with continuous states and using neural networks as function approximators, in Section 7.4.

6. Stochastic Q-learning Convergence

In this section, we analyze the convergence of Stochastic Q-learning, described in Algorithm 1. This algorithm employs the policy π^S_Q(s), as defined in Eq. (6), with ε_s > 0 to guarantee that P_π[a_t = a | s_t = s] > 0 for all state-action pairs (s, a) ∈ S × A. The values, on the other hand, are updated with the rule specified in Eq. (7).

In the convergence analysis, we focus on memoryless maximization. While the stoch arg max operator for action selection can be employed with or without memory, we assume a memoryless stoch max operator for value updates, which means that value updates are performed by maximizing over a randomly sampled subset of actions from A, sampled independently from both the next state s′ and the set used for the stoch arg max.

For a stochastic variable subset of actions C ⊆ A, following some probability distribution P : 2^A → [0, 1], we consider, without loss of generality, Q(·, ∅) = 0, and define, according to P, a target Q-function, denoted as Q*, as:

    Q*(s, a) ≜ E[ r(s, a) + γ max_{b∈C∼P} Q*(s′, b) | s, a ].    (11)

Remark 6.1. The Q* defined above depends on the sampling distribution P. Therefore, it does not represent the optimal value function of the original MDP problem; instead, it is optimal under the condition where only a random subset of actions following the distribution P is available to the agent at each time step. However, as the sampling cardinality increases, it increasingly better approximates the optimal value function of the original MDP and fully recovers the optimal Q-function of the original problem when the sampling distribution becomes P(A) = 1.

The following theorem states the convergence of the iterates Q_t of Stochastic Q-learning with memoryless stochastic maximization to Q*, defined in Eq. (11), for any sampling distribution P, regardless of the cardinality.

Theorem 6.2. For a finite MDP, as described in Section 3, let C be a randomly independently sampled subset of actions from A, of any cardinality, following any distribution P, exclusively sampled for the value updates, for Stochastic Q-learning, as described in Algorithm 1, given by the following update rule:

    Q_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Q_t(s_t, a_t)
                        + α_t(s_t, a_t) [ r_t + γ max_{b∈C∼P} Q_t(s_{t+1}, b) ].

Given any initial estimate Q_0, Q_t converges with probability 1 to Q*, defined in Eq. (11), as long as Σ_t α_t(s, a) = ∞ and Σ_t α_t²(s, a) < ∞ for all (s, a) ∈ S × A.

The theorem's result demonstrates that for any cardinality of actions, Stochastic Q-learning converges to Q*, as defined in Eq. (11), which recovers the convergence guarantees of Q-learning when the sampling distribution is P(A) = 1.

Remark 6.3. In principle, any size can be used, balancing time complexity and approximation. Our empirical experiments focused on log(n) to illustrate the method's ability to recover Q-learning, even with a few actions. Using √n will approach the value function of Q-learning more closely compared to using log(n), albeit at the cost of higher complexity than log(n).

The theorem shows that even with memoryless stochastic maximization, using randomly sampled O(log(n)) actions, convergence is still guaranteed. However, relying on memory-based stochastic maximization helps minimize the approximation error in stochastic maximization, as shown in Corollary 5.3, and outperforms Q-learning, as shown in the experiments in Section 7.1.

In the following, we provide a sketch of the proof addressing the extra stochasticity due to stochastic maximization. The full proof is provided in Appendix A.

We tackle the additional stochasticity, which depends on the sampling distribution P, by defining an operator Φ, which for any q : S × A → R, is as follows:

    (Φq)(s, a) ≜ Σ_{C∈2^A} P(C) Σ_{s′∈S} P(s′ | s, a) [ r(s, a) + γ max_{b∈C} q(s′, b) ].    (12)

We then demonstrate that it is a contraction in the sup-norm, as shown in Lemma 6.4, which we prove in Appendix A.2.

Lemma 6.4. The operator Φ, defined in Eq. (12), is a contraction in the sup-norm, with a contraction factor γ, i.e., ‖Φq_1 − Φq_2‖_∞ ≤ γ ‖q_1 − q_2‖_∞.


We then use the above lemma to establish the convergence of Stochastic Q-learning. Given any initial estimate Q_0, using the considered update rule for Stochastic Q-learning, subtracting Q*(s_t, a_t) from both sides, and letting Δ_t(s, a) ≜ Q_t(s, a) − Q*(s, a), yields

    Δ_{t+1}(s_t, a_t) = (1 − α_t(s_t, a_t)) Δ_t(s_t, a_t) + α_t(s_t, a_t) F_t(s_t, a_t),  with

    F_t(s, a) ≜ r(s, a) + γ max_{b∈C} Q_t(s′, b) − Q*(s, a).    (13)

With F_t representing the past at time t,

    E[F_t(s, a) | F_t] = Σ_{C∈2^A} P(C) Σ_{s′∈S} P(s′ | s, a) [F_t(s, a)] = (Φ Q_t)(s, a) − Q*(s, a).

Using the fact that Q* = Φ Q* and Lemma 6.4,

    ‖E[F_t(s, a) | F_t]‖_∞ ≤ γ ‖Q_t − Q*‖_∞ = γ ‖Δ_t‖_∞.    (14)

Given that r is bounded, its variance is bounded by some constant B. Thus, as shown in Appendix A.1, for C = max{B + γ²‖Q*‖²_∞, γ²}, var[F_t(s, a) | F_t] ≤ C (1 + ‖Δ_t‖_∞)². Then, by this inequality, Eq. (14), and Theorem 1 in (Jaakkola et al., 1993), Δ_t converges to zero with probability 1, i.e., Q_t converges to Q* with probability 1.

7. Experiments

We compare stochastic maximization to exact maximization and evaluate the proposed RL algorithms in Gymnasium (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) environments. The stochastic tabular Q-learning approaches are tested on CliffWalking-v0, FrozenLake-v1, and a generated MDP environment. Additionally, the stochastic deep Q-network approaches are tested on control tasks and compared against their deterministic counterparts, as well as against DDPG (Lillicrap et al., 2015), A2C (Mnih et al., 2016), and PPO (Schulman et al., 2017), using Stable-Baselines implementations (Hill et al., 2018), which can directly handle continuous action spaces. Further details can be found in Appendix D.

7.1. Stochastic Q-learning Average Return

We test Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with discrete states and actions. Interestingly, as shown in Fig. 1, our stochastic algorithms outperform their deterministic counterparts. Furthermore, we observe that Stochastic Q-learning outperforms all the methods considered regarding the cumulative rewards in the FrozenLake-v1. Moreover, in the CliffWalking-v0 (as shown in Fig. 10), as well as for the generated MDP environment with 256 actions (as shown in Fig. 12), all the stochastic and non-stochastic methods reach the optimal policy in a similar number of steps.

Figure 1: Comparison of stochastic vs. non-stochastic value-based variants on the FrozenLake-v1, with steps on the x-axis and cumulative rewards on the y-axis.

7.2. Exponential Wall Time Speedup

Stochastic maximization methods exhibit logarithmic complexity regarding the number of actions. Therefore, StochDQN and StochDDQN, which apply these techniques for action selection and updates, have exponentially faster execution times than DQN and DDQN, as confirmed in Fig. 2.

Figure 2: Comparison of wall time in seconds of stochastic and non-stochastic DQN methods on various action set sizes.

For the time duration of action selection alone, please refer to Appendix E.1. The time analysis results show that the proposed methods are nearly as fast as a random algorithm that selects actions randomly. Specifically, in the experiments with the InvertedPendulum-v4, the stochastic methods took around 0.003 seconds per step for a set of 1000 actions, while the non-stochastic methods took 0.18 seconds, which indicates that the stochastic versions are 60 times faster than their deterministic counterparts. Furthermore, for the HalfCheetah-v4 experiment, we considered 4096 actions, where one (D)DQN step takes 0.6 seconds, needing around 17 hours to run for 100,000 steps, while the Stoch(D)DQN needs around 2 hours to finish the same 100,000 steps. In other words, we can easily run for 10x more steps in the same period (seconds). This makes the stochastic methods more practical, especially with large action spaces.
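The per-step gap can be illustrated in isolation with a small timing script (our own stand-in, with an artificial per-action evaluation cost; the paper's measurements use the actual Q-networks): maximizing over all n actions requires n evaluations, whereas the stochastic variant needs only ⌈log(n)⌉.

```python
import math
import time

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

def q_value(s, a):
    # Stand-in for one expensive Q(s, a) evaluation (e.g., a network that
    # takes state and action features as input, as discussed in Section 3).
    return float(np.tanh(W @ (s + 0.01 * a)).sum())

def timed_argmax(s, actions):
    t0 = time.perf_counter()
    best = max(actions, key=lambda a: q_value(s, a))
    return best, time.perf_counter() - t0

for n in (1_000, 4_096):
    s = rng.normal(size=64)
    subset = rng.choice(n, size=math.ceil(math.log(n)), replace=False)
    _, t_full = timed_argmax(s, range(n))      # n evaluations
    _, t_stoch = timed_argmax(s, subset)       # ceil(log(n)) evaluations
    print(f"n={n}: full {t_full:.4f}s vs. stochastic {t_stoch:.5f}s")
```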


7.3. Stochastic Deep Q-network Average Return

Fig. 3 shows the performance of various RL algorithms on the InvertedPendulum-v4 task, which has 512 actions. StochDQN achieves the optimal average return in fewer steps than DQN, with a lower per-step time (as shown in Section 7.2). Interestingly, while DDQN struggles, StochDDQN nearly reaches the optimal average return, demonstrating the effectiveness of stochasticity. StochDQN and StochDDQN significantly outperform DDQN, A2C, and PPO by obtaining higher average returns in fewer steps. Similarly, Fig. 9b in Appendix E.3 shows the results for the HalfCheetah-v4 task, which has 4096 actions. Stochastic methods, particularly StochDDQN, achieve results comparable to the non-stochastic methods. Notably, all DQN methods (stochastic and non-stochastic) outperform PPO and A2C, highlighting their efficiency in such scenarios.

Figure 3: Comparison of stochastic DQN variants against other RL algorithms on the InvertedPendulum-v4, with 1000 actions, with steps on the x-axis and average returns on the y-axis.

Remark 7.1. While comparing them falls outside the scope of our work, we note that DDQN was proposed to mitigate the inherent overestimation in DQN. However, exchanging overestimation for underestimation bias is not always beneficial, as our results demonstrate and as shown in other studies such as (Lan et al., 2020).

7.4. Stochastic Maximization

This section analyzes stochastic maximization by tracking the returned values of ω_t (Eq. (10)) across the steps. As shown in Fig. 4, for StochDQN on the InvertedPendulum-v4, ω_t approaches one rapidly, and similarly for the HalfCheetah-v4, as shown in Appendix E.2.2. Furthermore, we track the returned values of the difference Δ_t (Eq. (9)) and show that it is bounded by small values in both environments, as illustrated in Appendix E.2.2.

Figure 4: The stoch max and max ratio values tracked over the steps on the InvertedPendulum-v4.

8. Discussion

In this work, we focus on adapting value-based methods, which excel in generalization compared to actor-based approaches (Dulac-Arnold et al., 2015). However, this advantage comes at the cost of lower computational efficiency due to the maximization operation required for action selection and value function updates. Therefore, our primary motivation is to provide a computationally efficient alternative for situations with general large discrete action spaces.

We focus mainly on Q-learning-like methods among value-based approaches due to their off-policy nature and proven success in various applications. We demonstrate that these methods can be applied to large discrete action spaces while achieving exponentially lower complexity and maintaining good performance. Furthermore, our proposed stochastic maximization method performs well even when applied to the on-policy Sarsa algorithm, extending its potential beyond off-policy methods. Consequently, the suggested stochastic approach offers broader applicability to other value-based approaches, resulting in lower complexity and improved efficiency with large discrete action spaces.

While the primary goal of this work is to reduce the complexity and wall time of Q-learning-like algorithms, our experiments revealed that stochastic methods not only achieve shorter step times (in seconds) but also, in some cases, yield higher rewards and exhibit faster convergence in terms of the number of steps compared to other methods. These improvements can be attributed to several factors. Firstly, introducing more stochasticity into the greedy choice through stoch arg max enhances exploration. Secondly, Stochastic Q-learning specifically helps to reduce the inherent overestimation in Q-learning-like methods (Hasselt, 2010; Lan et al., 2020; Wang et al., 2021). This reduction is achieved using stoch max, a lower bound to the max operation.

Q-learning methods, focused initially on discrete actions, can be adapted to tackle continuous problems with discretization techniques and stochastic maximization. Our control experiments show that Q-network methods with discretization achieve superior performance to algorithms with continuous actions, such as PPO, by obtaining higher rewards in fewer steps, which aligns with observations in previous works that highlight the potential of discretization for solving continuous control problems (Dulac-Arnold et al., 2015; Tavakoli et al., 2018; Tang & Agrawal, 2020). Notably, the logarithmic complexity of the proposed stochastic methods concerning the number of considered actions makes them well-suited for scenarios with finer-grained discretization, leading to more practical implementations.

9. Conclusion

We propose adapting Q-learning-like methods to mitigate the computational bottleneck associated with the max and arg max operations in these methods. By reducing the maximization complexity from linear to sublinear using stoch max and stoch arg max, we pave the way for practical and efficient value-based RL for large discrete action spaces. We prove the convergence of Stochastic Q-learning, analyze stochastic maximization, and empirically show that it performs well with significantly low complexity.


Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Akkerman, F., Luy, J., van Heeswijk, W., and Schiffer, M. Handling large discrete action spaces via dynamic neighborhood construction. arXiv preprint arXiv:2305.19891, 2023.

Al-Abbasi, A. O., Ghosh, A., and Aggarwal, V. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12):4714–4727, 2019.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.

Cui, H. and Khardon, R. Online symbolic gradient-based optimization for factored action mdps. In IJCAI, pp. 3075–3081, 2016.

Delarue, A., Anderson, R., and Tjandraatmadja, C. Reinforcement learning with combinatorial actions: An application to vehicle routing. Advances in Neural Information Processing Systems, 33:609–620, 2020.

Dulac-Arnold, G., Denoyer, L., Preux, P., and Gallinari, P. Fast reinforcement learning with large action sets using error-correcting output codes for mdp factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 180–194. Springer, 2012.

Dulac-Arnold, G., Evans, R., van Hasselt, H., Sunehag, P., Lillicrap, T., Hunt, J., Mann, T., Weber, T., Degris, T., and Coppin, B. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679, 2015.

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468, 2021.

Enders, T., Harrison, J., Pavone, M., and Schiffer, M. Hybrid multi-agent deep reinforcement learning for autonomous mobility on demand systems. In Learning for Dynamics and Control Conference, pp. 1284–1296. PMLR, 2023.

Farquhar, G., Gustafson, L., Lin, Z., Whiteson, S., Usunier, N., and Synnaeve, G. Growing action spaces. In International Conference on Machine Learning, pp. 3040–3051. PMLR, 2020.

Fourati, F. and Alouini, M.-S. Artificial intelligence for satellite communication: A review. Intelligent and Converged Networks, 2(3):213–243, 2021.

Fourati, F., Aggarwal, V., Quinn, C., and Alouini, M.-S. Randomized greedy learning for non-monotone stochastic submodular maximization under full-bandit feedback. In International Conference on Artificial Intelligence and Statistics, pp. 7455–7471. PMLR, 2023.

Fourati, F., Alouini, M.-S., and Aggarwal, V. Federated combinatorial multi-agent multi-armed bandits. arXiv preprint arXiv:2405.05950, 2024a.

Fourati, F., Quinn, C. J., Alouini, M.-S., and Aggarwal, V. Combinatorial stochastic-greedy bandit. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp. 12052–12060, 2024b.

Gonzalez, G., Balakuntala, M., Agarwal, M., Low, T., Knoth, B., Kirkpatrick, A. W., McKee, J., Hager, G., Aggarwal, V., Xue, Y., et al. Asap: A semi-autonomous precise system for telesurgery during communication delays. IEEE Transactions on Medical Robotics and Bionics, 5(1):66–78, 2023.

Haliem, M., Mani, G., Aggarwal, V., and Bhargava, B. A distributed model-free ride-sharing approach for joint matching, pricing, and dispatching using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 22(12):7931–7942, 2021.

Hasselt, H. Double q-learning. Advances in neural information processing systems, 23, 2010.

He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with an unbounded action space. arXiv preprint arXiv:1511.04636, 5, 2015.

He, J., Chen, J., He, X., Gao, J., Li, L., Deng, L., and Ostendorf, M. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1621–1630, Berlin, Germany, August 2016a. Association for Computational Linguistics. doi: 10.18653/v1/P16-1153. URL https://aclanthology.org/P16-1153.

He, J., Ostendorf, M., He, X., Chen, J., Gao, J., Li, L., and Deng, L. Deep reinforcement learning with a combinatorial action space for predicting popular Reddit threads. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1838–1848, Austin, Texas, November 2016b. Association for Computational Linguistics. doi: 10.18653/v1/D16-1189. URL https://aclanthology.org/D16-1189.


Hill, A., Raffin, A., Ernestus, M., Gleave, A., Kanervisto, A., Traore, R., Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., and Wu, Y. Stable baselines. https://github.com/hill-a/stable-baselines, 2018.

Ireland, D. and Montana, G. Revalued: Regularised ensemble value-decomposition for factorisable markov decision processes. arXiv preprint arXiv:2401.08850, 2024.

Jaakkola, T., Jordan, M., and Singh, S. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 6, 1993.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., et al. Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.

Kim, M., Park, J., et al. Learning collaborative policies to solve np-hard routing problems. Advances in Neural Information Processing Systems, 34:10418–10430, 2021.

Lagoudakis, M. G. and Parr, R. Reinforcement learning as classification: Leveraging modern classifiers. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 424–431, 2003.

Lan, Q., Pan, Y., Fyshe, A., and White, M. Maxmin q-learning: Controlling the estimation bias of q-learning. arXiv preprint arXiv:2002.06487, 2020.

Li, S., Wei, C., and Wang, Y. Combining decision making and trajectory planning for lane changing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 23(9):16110–16136, 2022.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Liang, Y.-C., and Kim, D. I. Applications of deep reinforcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21(4):3133–3174, 2019.

Mahajan, A., Samvelyan, M., Mao, L., Makoviychuk, V., Garg, A., Kossaifi, J., Whiteson, S., Zhu, Y., and Anandkumar, A. Reinforcement learning in factored action spaces using tensor decompositions. arXiv preprint arXiv:2110.14538, 2021.

Mazyavkina, N., Sviridov, S., Ivanov, S., and Burnaev, E. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.

Metz, L., Ibarz, J., Jaitly, N., and Davidson, J. Discrete sequential prediction of continuous actions for deep rl. arXiv preprint arXiv:1705.05035, 2017.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. PMLR, 2016.

Nair, V. and Hinton, G. E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.

Pazis, J. and Parr, R. Generalized value functions for large action sets. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1185–1192, 2011.

Peng, B., Rashid, T., Schroeder de Witt, C., Kamienny, P.-A., Torr, P., Böhmer, W., and Whiteson, S. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021.

Quillen, D., Jang, E., Nachum, O., Finn, C., Ibarz, J., and Levine, S. Deep reinforcement learning for vision-based robotic grasping: A simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6284–6291. IEEE, 2018.

Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodology and computing in applied probability, 1:127–190, 1999.

Rummery, G. A. and Niranjan, M. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.

Sallans, B. and Hinton, G. E. Reinforcement learning with factored states and actions. The Journal of Machine Learning Research, 5:1063–1088, 2004.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.


Seyde, T., Gilitschenski, I., Schwarting, W., Stellato, B., Riedmiller, M., Wulfmeier, M., and Rus, D. Is bang-bang control all you need? solving continuous control with bernoulli policies. Advances in Neural Information Processing Systems, 34:27209–27221, 2021.

Seyde, T., Werner, P., Schwarting, W., Gilitschenski, I., Riedmiller, M., Rus, D., and Wulfmeier, M. Solving continuous control via q-learning. arXiv preprint arXiv:2210.12566, 2022.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT press, 2018.

Tang, Y. and Agrawal, S. Discretizing continuous action space for on-policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 5981–5988, 2020.

Tavakoli, A., Pardo, F., and Kormushev, P. Action branching architectures for deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Tennenholtz, G. and Mannor, S. The natural language of actions. In International Conference on Machine Learning, pp. 6196–6205. PMLR, 2019.

Tessler, C., Zahavy, T., Cohen, D., Mankowitz, D. J., and Mannor, S. Action assembly: Sparse imitation learning for text based games with combinatorial action spaces. arXiv preprint arXiv:1905.09700, 2019.

Todorov, E., Erez, T., and Tassa, Y. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pp. 5026–5033. IEEE, 2012.

Van de Wiele, T., Warde-Farley, D., Mnih, A., and Mnih, V. Q-learning in enormous action spaces via amortized approximate maximization. arXiv preprint arXiv:2001.08116, 2020.

Van Hasselt, H. and Wiering, M. A. Using continuous action spaces to solve discrete problems. In 2009 International Joint Conference on Neural Networks, pp. 1149–1156. IEEE, 2009.

Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.

Wang, G., Shi, D., Xue, C., Jiang, H., and Wang, Y. Bic-ddpg: Bidirectionally-coordinated nets for deep multi-agent reinforcement learning. In International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 337–354. Springer, 2020.

Wang, H., Lin, S., and Zhang, J. Adaptive ensemble q-learning: Minimizing estimation bias via error feedback. Advances in Neural Information Processing Systems, 34:24778–24790, 2021.

Wang, X., Wang, S., Liang, X., Zhao, D., Huang, J., Xu, X., Dai, B., and Miao, Q. Deep reinforcement learning: a survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.

Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8:279–292, 1992.

Zahavy, T., Haroush, M., Merlis, N., Mankowitz, D. J., and Mannor, S. Learn what not to learn: Action elimination with deep reinforcement learning. Advances in neural information processing systems, 31, 2018.

Zhang, T., Guo, S., Tan, T., Hu, X., and Chen, F. Generating adjacency-constrained subgoals in hierarchical reinforcement learning. Advances in Neural Information Processing Systems, 33:21579–21590, 2020.

Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double q-learning. In IJCAI, pp. 3455–3461, 2017.


A. Stochastic Q-learning Convergence Proofs

In this section, we prove Theorem 6.2, which states the convergence of Stochastic Q-learning. This algorithm uses a stochastic policy for action selection, employing a stoch arg max with or without memory, possibly dependent on the current state s. For value updates, it utilizes a stoch max without memory, independent of the following state s'.

A.1. Proof of Theorem 6.2

Proof. Stochastic Q-learning employs a stochastic policy in a given state s, which uses the stoch arg max operation, with or without memory M, with probability (1 − ε_s), for ε_s > 0, which can be summarized by the following equation:

$$\pi^S_Q(s) = \begin{cases} \text{play randomly} & \text{with probability } \varepsilon_s \\ \operatorname{stoch\,arg\,max}_{a \in \mathcal{A}} Q(s, a) & \text{otherwise.} \end{cases} \quad (15)$$

This policy, with ε_s > 0, ensures that P_π[a_t = a | s_t = s] > 0 for all (s, a) ∈ S × A.

Furthermore, during training, to update the Q-function, given any initial estimate Q_0, we consider a Stochastic Q-learning which uses the stoch max operation as in the following stochastic update rule:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \operatorname{stoch\,max}_{b \in \mathcal{A}} Q_t(s_{t+1}, b) \right]. \quad (16)$$

For the function updates, we consider a stoch max without memory, which involves a max over a random subset of actions C sampled from a set probability distribution P defined over the combinatorial space of actions, i.e., P : 2^A → [0, 1], which can be a uniform distribution over the action sets of size ⌈log(n)⌉.

Hence, for a random subset of actions C, the update rule of Stochastic Q-learning can be written as:

$$Q_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_t + \gamma \max_{b \in C} Q_t(s_{t+1}, b) \right]. \quad (17)$$

We define an optimal Q-function, denoted as Q*, as follows:

$$Q^*(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \operatorname{stoch\,max}_{b \in \mathcal{A}} Q^*(s', b) \,\middle|\, s, a \right] \quad (18)$$
$$\phantom{Q^*(s, a)} = \mathbb{E}\left[ r(s, a) + \gamma \max_{b \in C} Q^*(s', b) \,\middle|\, s, a \right]. \quad (19)$$

Subtracting Q*(s_t, a_t) from both sides and letting

$$\Delta_t(s, a) = Q_t(s, a) - Q^*(s, a) \quad (20)$$

yields

$$\Delta_{t+1}(s_t, a_t) = (1 - \alpha_t(s_t, a_t))\, \Delta_t(s_t, a_t) + \alpha_t(s_t, a_t) F_t(s_t, a_t), \quad (21)$$

with

$$F_t(s, a) = r(s, a) + \gamma \max_{b \in C} Q_t(s', b) - Q^*(s, a). \quad (22)$$

For the transition probability distribution P : S × A × S → [0, 1], the set probability distribution P : 2^A → [0, 1], the reward function r : S × A → R, and the discount factor γ ∈ [0, 1], we define the following contraction operator H, defined for a function q : S × A → R as

$$(Hq)(s, a) = \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in C} q(s', b) \right]. \quad (23)$$

Therefore, with F_t representing the past at time step t,

$$\mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] = \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in C} Q_t(s', b) \right] - Q^*(s, a) = (H Q_t)(s, a) - Q^*(s, a).$$


Using the fact that Q* = HQ*,

$$\mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] = (H Q_t)(s, a) - (H Q^*)(s, a).$$

It is now immediate from Lemma 6.4, which we prove in Appendix A.2, that

$$\left\| \mathbb{E}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \right\|_{\infty} \leq \gamma \left\| Q_t - Q^* \right\|_{\infty} = \gamma \left\| \Delta_t \right\|_{\infty}. \quad (24)$$

Moreover,

$$\begin{aligned}
\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] &= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in C} Q_t(s', b) - Q^*(s, a) - (H Q_t)(s, a) + Q^*(s, a) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&= \mathbb{E}\left[ \left( r(s, a) + \gamma \max_{b \in C} Q_t(s', b) - (H Q_t)(s, a) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&= \operatorname{var}\left[ r(s, a) + \gamma \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\
&= \operatorname{var}\left[ r(s, a) \mid \mathcal{F}_t \right] + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] + 2\gamma \operatorname{cov}\left( r(s, a), \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right) \\
&= \operatorname{var}\left[ r(s, a) \mid \mathcal{F}_t \right] + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right].
\end{aligned}$$

The last line follows from the fact that the randomness of max_{b∈C} Q_t(s', b), given the past, only depends on the random set C and the next state s'. Moreover, we consider the reward r(s, a) independent of the set C and the next state s', by not using the same set C for both the action selection and the value update.

Given that r is bounded, its variance is bounded by some constant B. Therefore,

$$\begin{aligned}
\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] &\leq B + \gamma^2 \operatorname{var}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right] \\
&= B + \gamma^2 \mathbb{E}\left[ \left( \max_{b \in C} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] - \gamma^2 \mathbb{E}\left[ \max_{b \in C} Q_t(s', b) \,\middle|\, \mathcal{F}_t \right]^2 \\
&\leq B + \gamma^2 \mathbb{E}\left[ \left( \max_{b \in C} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&\leq B + \gamma^2 \mathbb{E}\left[ \left( \max_{s' \in \mathcal{S}} \max_{b \in \mathcal{A}} Q_t(s', b) \right)^2 \,\middle|\, \mathcal{F}_t \right] \\
&\leq B + \gamma^2 \left( \max_{s' \in \mathcal{S}} \max_{b \in \mathcal{A}} Q_t(s', b) \right)^2 \\
&= B + \gamma^2 \left\| Q_t \right\|_{\infty}^2 \\
&= B + \gamma^2 \left\| \Delta_t + Q^* \right\|_{\infty}^2 \\
&\leq B + \gamma^2 \left\| Q^* \right\|_{\infty}^2 + \gamma^2 \left\| \Delta_t \right\|_{\infty}^2 \\
&\leq \left( B + \gamma^2 \left\| Q^* \right\|_{\infty}^2 \right) \left( 1 + \left\| \Delta_t \right\|_{\infty}^2 \right) + \gamma^2 \left( 1 + \left\| \Delta_t \right\|_{\infty}^2 \right) \\
&\leq \max\left\{ B + \gamma^2 \left\| Q^* \right\|_{\infty}^2,\ \gamma^2 \right\} \left( 1 + \left\| \Delta_t \right\|_{\infty}^2 \right) \\
&\leq \max\left\{ B + \gamma^2 \left\| Q^* \right\|_{\infty}^2,\ \gamma^2 \right\} \left( 1 + \left\| \Delta_t \right\|_{\infty} \right)^2.
\end{aligned}$$

Therefore, for the constant C = max{B + γ²‖Q*‖²_∞, γ²},

$$\operatorname{var}\left[ F_t(s, a) \mid \mathcal{F}_t \right] \leq C \left( 1 + \left\| \Delta_t \right\|_{\infty} \right)^2. \quad (25)$$

Then, by Eq. (24), Eq. (25), and Theorem 1 in (Jaakkola et al., 1993), Δ_t converges to zero with probability 1, i.e., Q_t converges to Q* with probability 1.


A.2. Proof of Lemma 6.4

Proof. For the transition probability distribution P : S × A × S → [0, 1], the set probability distribution P defined over the combinatorial space of actions, i.e., P : 2^A → [0, 1], the reward function r : S × A → R, and the discount factor γ ∈ [0, 1], for a function q : S × A → R, the operator H is defined as follows:

$$(Hq)(s, a) = \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ r(s, a) + \gamma \max_{b \in C} q(s', b) \right]. \quad (26)$$

Therefore,

$$\begin{aligned}
\left\| Hq_1 - Hq_2 \right\|_{\infty} &= \max_{s, a} \left| \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a) \left[ \left( r(s, a) + \gamma \max_{b \in C} q_1(s', b) \right) - \left( r(s, a) + \gamma \max_{b \in C} q_2(s', b) \right) \right] \right| \\
&= \max_{s, a} \left| \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left[ \max_{b \in C} q_1(s', b) - \max_{b \in C} q_2(s', b) \right] \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left| \max_{b \in C} q_1(s', b) - \max_{b \in C} q_2(s', b) \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \max_{z, b} \left| q_1(z, b) - q_2(z, b) \right| \\
&\leq \max_{s, a} \sum_{C \in 2^{\mathcal{A}}} P(C) \sum_{s' \in \mathcal{S}} P(s' \mid s, a)\, \gamma \left\| q_1 - q_2 \right\|_{\infty} \\
&= \gamma \left\| q_1 - q_2 \right\|_{\infty}.
\end{aligned}$$


B. Stochastic Maximization

We analyze the proposed stochastic maximization method by comparing its error to that of exact maximization. First, we consider the case without memory, where C = R, and then the case with memory, where M ≠ ∅. Finally, we provide a specialized bound for the case where the action values follow a uniform distribution.

B.1. Memoryless Stochastic Maximization

In the following lemma, we give a lower bound on the probability of finding an optimal action within a uniformly sampled subset R of ⌈log(n)⌉ actions. We prove that, for a given state s, the probability p of sampling an optimal action within the uniformly randomly sampled subset R of size ⌈log(n)⌉ is lower bounded as p ≥ ⌈log(n)⌉/n.

B.1.1. Proof of Lemma 5.1

Proof. In the presence of multiple maximizers, we focus on one of them, denoted as a*_0; the probability p of sampling at least one maximizer is then lower-bounded by the probability p_{a*_0} of finding a*_0, i.e.,

$$p \geq p_{a^*_0}.$$

The probability p_{a*_0} of finding a*_0 is the probability of sampling a*_0 within the random set R of size ⌈log(n)⌉, which is the fraction of all possible combinations of size ⌈log(n)⌉ that include a*_0.

This fraction is $\binom{n-1}{\lceil \log(n) \rceil - 1}$ divided by the number of all possible combinations of size ⌈log(n)⌉, which is $\binom{n}{\lceil \log(n) \rceil}$. Therefore,

$$p_{a^*_0} = \frac{\binom{n-1}{\lceil \log(n) \rceil - 1}}{\binom{n}{\lceil \log(n) \rceil}} = \frac{\lceil \log(n) \rceil}{n}.$$

Consequently,

$$p \geq \frac{\lceil \log(n) \rceil}{n}. \quad (27)$$
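For illustration, memoryless stochastic maximization can be sketched in Python as follows. This is an illustrative sketch rather than the implementation used in the experiments; it assumes a base-2 logarithm and a hypothetical array q_values standing in for Q_t(s, ·).

import math
import random

def stoch_max_memoryless(q_values):
    # Memoryless stoch max: maximize over a random subset R of
    # ceil(log2(n)) actions sampled uniformly without replacement.
    n = len(q_values)
    k = max(1, math.ceil(math.log2(n)))
    subset = random.sample(range(n), k)            # the random subset R
    best = max(subset, key=lambda a: q_values[a])  # stoch arg max over R
    return best, q_values[best]

# By Lemma 5.1, each call includes an optimal action with probability
# at least ceil(log2(n)) / n.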

B.2. Stochastic Maximization with Memory

While stochastic maximization without memory can approach the maximum value, or find it with the probability p lower-bounded in Lemma 5.1, it never converges to an exact maximization, as it keeps sampling purely at random, as can be seen in Fig. 6. However, stochastic maximization with memory becomes an exact maximization once the Q-function becomes stable, which we prove in the following corollary. Although the stoch max has sub-linear complexity compared to the max, the corollary shows that, without any assumption on the value distributions, for a stable Q-function, the output of the stoch max matches the output of max on average after a certain number of iterations.

Definition B.1. A Q-function is considered stable for a given state s if its best action in that state remains unchanged for all subsequent steps, even if the Q-function's values themselves change.

A straightforward example of a stable Q-function occurs during validation periods when no function updates are performed. However, in general, a stable Q-function does not have to be static and might still vary over the rounds; the key characteristic is that its maximizing action remains the same even when its values are updated.

B.2.1. Proof of Corollary 5.3

Proof. We formalize the problem as a geometric distribution where the success event is sampling a subset of size ⌈log(n)⌉ that includes at least one maximizer. The geometric distribution gives the probability that the first subset including an optimal action is found after k independent calls, each with success probability p. From Lemma 5.1, we have p ≥ ⌈log(n)⌉/n. Therefore, on average, success requires at most

$$\frac{1}{p} \leq \frac{n}{\lceil \log(n) \rceil}$$

calls.


For a given discrete state s, M keeps track of the most recent best action found. For C = R ∪ M,

$$\operatorname{stoch\,max}_{a \in \mathcal{A}} Q(s, a) = \max_{a \in C} Q(s, a) \geq \max_{a \in \mathcal{M}} Q(s, a). \quad (28)$$

Therefore, for a given state s, on average, if the Q-function is stable, then within n/⌈log(n)⌉ calls, M will contain the optimal action a*. Therefore, on average, after n/⌈log(n)⌉ time steps,

$$\operatorname{stoch\,max}_{a \in \mathcal{A}} Q(s, a) \geq \max_{a \in \mathcal{M}} Q(s, a) = \max_{a \in \mathcal{A}} Q(s, a).$$

We know that stoch max_{a∈A} Q(s, a) ≤ max_{a∈A} Q(s, a). Therefore, for a stable Q-function, on average, after n/⌈log(n)⌉ time steps, stoch max_{a∈A} Q(s, a) becomes max_{a∈A} Q(s, a).
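As a rough illustration of the memory mechanism (a sketch under our own naming, not the released implementation), the following Python snippet extends the memoryless routine by keeping, per discrete state, the most recently found best action and maximizing over C = R ∪ M.

import math
import random

class StochMaxWithMemory:
    # Stochastic maximization over C = R ∪ M, where M stores the most
    # recent best action found for each discrete state.

    def __init__(self, n_actions):
        self.n = n_actions
        self.k = max(1, math.ceil(math.log2(n_actions)))
        self.memory = {}  # state -> last best action found in that state

    def stoch_max(self, state, q_values):
        candidates = set(random.sample(range(self.n), self.k))  # R
        if state in self.memory:                                # M
            candidates.add(self.memory[state])
        best = max(candidates, key=lambda a: q_values[a])
        self.memory[state] = best
        return best, q_values[best]

# Once the maximizing action is stable, it enters the memory after roughly
# n / ceil(log2(n)) calls on average (Corollary 5.3), and the routine
# returns the exact maximum from then on.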

B.3. Stochastic Maximization with Uniformly Distributed Rewards

While the above corollary outlines an upper bound on the average number of calls needed to eventually determine the exact optimal action, the following lemma offers insights into the expected maximum value of a randomly sampled subset of ⌈log(n)⌉ actions when their values are uniformly distributed.

Lemma B.2. For a given state s and a uniformly randomly sampled subset R of size ⌈log(n)⌉ actions, if the values of the sampled actions independently follow a uniform distribution on the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)], then the expected value of the maximum Q-function within this random subset is:

$$\mathbb{E}\left[ \max_{k \in R} Q_t(s, k) \,\middle|\, s, a^*_t \right] = Q_t(s, a^*_t) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}. \quad (29)$$

Proof. For a given state s, we assume a uniformly randomly sampled subset R of size ⌈log(n)⌉ actions, whose values are independent and follow a uniform distribution on the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)]. Therefore, the cumulative distribution function (CDF) for the value y of an action a, given the state s and the optimal action a*_t, is:

$$G(y; s, a) = \begin{cases} 0 & \text{for } y < Q_t(s, a^*_t) - b_t(s) \\ \dfrac{y - \left( Q_t(s, a^*_t) - b_t(s) \right)}{b_t(s)} & \text{for } y \in [Q_t(s, a^*_t) - b_t(s),\, Q_t(s, a^*_t)] \\ 1 & \text{for } y > Q_t(s, a^*_t). \end{cases}$$

We define the normalized variable x = (y − (Q_t(s, a*_t) − b_t(s)))/b_t(s), whose CDF is

$$F(x; s, a) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \in [0, 1] \\ 1 & \text{for } x > 1. \end{cases}$$

If we select ⌈log(n)⌉ such actions, the CDF of their (normalized) maximum, denoted F_max, is the following:

$$F_{\max}(x; s, a) = P\left( \max_{a \in R} Q_t(s, a) \leq x \right) = \prod_{a \in R} P\left( Q_t(s, a) \leq x \right) = \prod_{a \in R} F(x; s, a) = F(x; s, a)^{\lceil \log(n) \rceil}.$$

The second equality follows from the independence of the values, and the last from the assumption that all actions follow the same uniform distribution.

The CDF of the maximum is therefore given by:

$$F_{\max}(x; s, a) = \begin{cases} 0 & \text{for } x < 0 \\ x^{\lceil \log(n) \rceil} & \text{for } x \in [0, 1] \\ 1 & \text{for } x > 1. \end{cases}$$


Now, we can determine the desired expected value as

$$\begin{aligned}
\mathbb{E}\left[ \frac{\max_{a \in R} Q_t(s, a) - \left( Q_t(s, a^*_t) - b_t(s) \right)}{b_t(s)} \right] &= \int_{-\infty}^{\infty} x \, dF_{\max}(x; s, a) \\
&= \int_0^1 x \, dF_{\max}(x; s, a) \\
&= \left[ x F_{\max}(x; s, a) \right]_0^1 - \int_0^1 F_{\max}(x; s, a) \, dx \\
&= 1 - \int_0^1 x^{\lceil \log(n) \rceil} \, dx \\
&= 1 - \frac{1}{\lceil \log(n) \rceil + 1}.
\end{aligned}$$

We employed the identity $\int_0^1 x \, d\mu(x) = \int_0^1 (1 - \mu(x)) \, dx$, which can be demonstrated through integration by parts. To return to the original scale, we multiply by b_t(s) and then add Q_t(s, a*_t) − b_t(s), resulting in:

$$\mathbb{E}\left[ \max_{a \in R} Q_t(s, a) \,\middle|\, s, a^*_t \right] = Q_t(s, a^*_t) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.$$

As an example of this setting, for Q_t(s, a*_t) = 100, b_t(s) = 100, and n = 1000 actions, ⌈log(n)⌉ + 1 = 11. Hence, E[max_{k∈R} Q_t(s, k) | s, a*_t] ≈ 91. This shows that even with a randomly sampled set of actions R, the stoch max can be close to the max. We simulate this setting in the experiments in Fig. 6.
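The closed form above is easy to sanity-check numerically. The short simulation below is our own illustrative sketch of the worked example (base-2 logarithm assumed): it repeatedly draws ⌈log₂(n)⌉ uniform values on [0, 100] and averages their maximum.

import math
import random

# Monte Carlo check of Lemma B.2 for Q_t(s, a*) = 100, b_t(s) = 100, n = 1000.
n, q_star, b = 1000, 100.0, 100.0
k = math.ceil(math.log2(n))           # |R| = ceil(log2(1000)) = 10

trials = 100_000
total = 0.0
for _ in range(trials):
    # values of the k sampled actions, uniform on [q_star - b, q_star]
    total += max(random.uniform(q_star - b, q_star) for _ in range(k))

print(total / trials)                  # empirical mean, about 90.9
print(q_star - b / (k + 1))            # closed form: 100 - 100/11 ~ 90.9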
Our proposed stochastic maximization does not rely solely on the randomly sampled subset of actions R but also considers actions from previous experiences through M. Therefore, the expected stoch max is at least as large as the above result, which provides an upper bound on the expected δ_t, as described in the following corollary of Lemma B.2.

Corollary B.3. For a given discrete state s, if the values of the sampled actions independently follow a uniform distribution on the interval [Q_t(s, a*_t) − b_t(s), Q_t(s, a*_t)], then the expected value of δ_t(s) satisfies:

$$\mathbb{E}\left[ \delta_t(s) \mid s \right] \leq \frac{b_t(s)}{\lceil \log(n) \rceil + 1}. \quad (30)$$

Proof. At time step t, given a state s and the current estimated Q-function Q_t, δ_t(s) is defined as follows:

$$\delta_t(s) = \max_{a \in \mathcal{A}} Q_t(s, a) - \operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a). \quad (31)$$

For a given state s, a uniformly randomly sampled subset R of size ⌈log(n)⌉ actions, and a subset M ⊂ E of some previously played actions, using the law of total expectation,

$$\begin{aligned}
\mathbb{E}\left[ \delta_t(s) \mid s \right] &= \mathbb{E}\left[ \mathbb{E}\left[ \delta_t(s) \mid s, a^*_t \right] \mid s \right] \\
&= \mathbb{E}\left[ \mathbb{E}\left[ \max_{k \in \mathcal{A}} Q_t(s, k) - \operatorname{stoch\,max}_{k \in \mathcal{A}} Q_t(s, k) \,\middle|\, s, a^*_t \right] \,\middle|\, s \right] \\
&= \mathbb{E}\left[ \mathbb{E}\left[ \max_{k \in \mathcal{A}} Q_t(s, k) - \max_{k \in R \cup \mathcal{M}} Q_t(s, k) \,\middle|\, s, a^*_t \right] \,\middle|\, s \right] \\
&\leq \mathbb{E}\left[ \mathbb{E}\left[ \max_{k \in \mathcal{A}} Q_t(s, k) - \max_{k \in R} Q_t(s, k) \,\middle|\, s, a^*_t \right] \,\middle|\, s \right] \\
&= \mathbb{E}\left[ Q_t(s, a^*_t) - \mathbb{E}\left[ \max_{k \in R} Q_t(s, k) \,\middle|\, s, a^*_t \right] \,\middle|\, s \right].
\end{aligned}$$


Therefore, by Lemma B.2:

$$\begin{aligned}
\mathbb{E}\left[ \delta_t(s) \mid s \right] &\leq \mathbb{E}\left[ Q_t(s, a^*_t) - \left( Q_t(s, a^*_t) - \frac{b_t(s)}{\lceil \log(n) \rceil + 1} \right) \,\middle|\, s \right] \\
&= \mathbb{E}\left[ \frac{b_t(s)}{\lceil \log(n) \rceil + 1} \,\middle|\, s \right] \\
&= \frac{b_t(s)}{\lceil \log(n) \rceil + 1}.
\end{aligned}$$


C. Pseudocodes

Algorithm 2 Stochastic Double Q-learning

Initialize Q^A(s, a) and Q^B(s, a) for all s ∈ S, a ∈ A, n = |A|
for each episode do
  Observe state s.
  for each step of episode do
    Choose a from s via Q^A + Q^B with policy π^S_{Q^A + Q^B}(s) in Eq. (6).
    Take action a, observe r, s'.
    Choose either UPDATE(A) or UPDATE(B), for example randomly.
    if UPDATE(A) then
      δ^A ← r + γ Q^B(s', stoch arg max_{b∈A} Q^A(s', b)) − Q^A(s, a).
      Q^A(s, a) ← Q^A(s, a) + α(s, a) δ^A.
    else if UPDATE(B) then
      δ^B ← r + γ Q^A(s', stoch arg max_{b∈A} Q^B(s', b)) − Q^B(s, a).
      Q^B(s, a) ← Q^B(s, a) + α(s, a) δ^B.
    end if
    s ← s'.
  end for
end for

Algorithm 3 Stochastic Sarsa

Initialize Q(s, a) for all s ∈ S, a ∈ A, n = |A|
for each episode do
  Observe state s.
  Choose a from s with policy π^S_Q(s) in Eq. (6).
  for each step of episode do
    Take action a, observe r, s'.
    Choose a' from s' with policy π^S_Q(s') in Eq. (6).
    Q(s, a) ← Q(s, a) + α(s, a)[r + γ Q(s', a') − Q(s, a)].
    s ← s'; a ← a'.
  end for
end for
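To make the tabular procedures above concrete, the following Python sketch shows one Stochastic Q-learning step, combining the stochastic ε-greedy policy of Eq. (15) with the update rule of Eq. (17). It is an illustration under our own naming (the containers Q and memory, and the helper stoch_argmax, are hypothetical), not the released implementation; note that the update and the action selection use independently sampled subsets, as required in Appendix A.

import math
import random
from collections import defaultdict

def stoch_argmax(q_row, candidates):
    # arg max of Q(s, .) restricted to a candidate subset C
    return max(candidates, key=lambda a: q_row[a])

def stochastic_q_learning_step(Q, memory, s, a, r, s_next, actions,
                               alpha=0.1, gamma=0.95, eps=0.1):
    n = len(actions)
    k = max(1, math.ceil(math.log2(n)))

    # Value update (Eq. (17)): stoch max without memory over a fresh subset C.
    C = random.sample(actions, k)
    target = r + gamma * max(Q[s_next][b] for b in C)
    Q[s][a] += alpha * (target - Q[s][a])

    # Action selection (Eq. (15)): stochastic epsilon-greedy with memory M.
    if random.random() < eps:
        a_next = random.choice(actions)
    else:
        R = set(random.sample(actions, k))      # independent of C above
        if s_next in memory:
            R.add(memory[s_next])               # candidates = R ∪ M
        a_next = stoch_argmax(Q[s_next], R)
        memory[s_next] = a_next                 # remember the exploited action
    return a_next

Q = defaultdict(lambda: defaultdict(float))     # tabular Q[s][a]
memory = {}                                     # per-state memory M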

D. Experimental Details
D.1. Environments
We test our proposed algorithms on a standardized set of environments using open-source libraries. We compare stochastic
maximization to exact maximization and evaluate the proposed stochastic RL algorithms on Gymnasium environments
(Brockman et al., 2016). Stochastic Q-learning and Stochastic Double Q-learning are tested on the CliffWalking-v0, the
FrozenLake-v1, and a generated MDP environment, while stochastic deep Q-learning approaches are tested on MuJoCo
control tasks (Todorov et al., 2012).

D.1.1. Environments with Discrete States and Actions

We generate an MDP environment with 3 states and 256 actions, where rewards follow a normal distribution with mean −50 and standard deviation 50. Furthermore, while our approach is designed for large discrete action spaces, we also test it in Gymnasium environments (Brockman et al., 2016) with only four discrete actions, such as CliffWalking-v0 and FrozenLake-v1. CliffWalking-v0 involves navigating a grid world from the starting point to the destination without falling off a cliff. FrozenLake-v1 requires moving from the starting point to the goal without stepping into any holes on the frozen surface, which can be challenging due to the slippery nature of the ice.


Algorithm 4 Stochastic Deep Q-Network (StochDQN)

Algorithm parameters: learning rate α ∈ (0, 1], replay buffer E, update rate τ.
Initialize: neural network Q(s, a; θ) with random weights θ, target network Q̂(s, a; θ⁻) with θ⁻ = θ, set of actions A of size n.
for each episode do
  Initialize state s.
  while not terminal state is reached do
    Choose a from s using a stochastic policy as defined in Eq. (15) using Q(s, ·; θ).
    Take action a, observe reward r(s, a) and next state s'.
    Store (s, a, r(s, a), s') in replay buffer E.
    Compute target values for the mini-batch:
      y_i = r_i                                                              if s'_i is terminal,
      y_i = r_i + γ Q̂(s'_i, stoch arg max_{a'∈A} Q̂(s'_i, a'; θ⁻); θ⁻)        otherwise.
    Perform a gradient descent step on the loss:
      L(θ) = (1/⌈log(n)⌉) Σ_{i=1}^{⌈log(n)⌉} (y_i − Q(s_i, a_i; θ))².
    Update the target network weights:
      θ⁻ ← τ · θ + (1 − τ) · θ⁻.
    Update the Q-network weights using gradient descent:
      θ ← θ − α ∇_θ L(θ).
    s ← s'.
  end while
end for
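As a sketch of how the stoch arg max interacts with a state-action-input network, the PyTorch-style snippet below follows the architecture of Appendix D.2.3 (two hidden layers of size 64 with ReLU and a single output); the class and function names are ours, and the memory actions are assumed to be a tensor of actions drawn from the replay buffer.

import math
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Q(s, a; theta): takes a concatenated (state, action) vector and
    # outputs a single scalar value, as described in Appendix D.2.3.
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, actions):
        # state: (d_s,), actions: (m, d_a) -> values: (m,)
        s = state.expand(actions.shape[0], -1)
        return self.net(torch.cat([s, actions], dim=-1)).squeeze(-1)

def stoch_argmax(q_net, state, all_actions, memory_actions):
    # Evaluate Q only on R ∪ M, with |R| = ceil(log2(n)) sampled actions.
    n = all_actions.shape[0]
    k = max(1, math.ceil(math.log2(n)))
    idx = random.sample(range(n), k)                      # random subset R
    candidates = torch.cat([all_actions[idx], memory_actions], dim=0)
    with torch.no_grad():
        values = q_net(state, candidates)
    return candidates[values.argmax()]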


D.1.2. Environments with Continuous States: Discretizing Control Tasks

We test the stochastic deep Q-learning approaches on MuJoCo (Todorov et al., 2012) control tasks with continuous states and discretized actions. We discretize each action dimension into i equally spaced values, creating a discrete action space with n = i^d d-dimensional actions. We mainly focus on the inverted pendulum and the half-cheetah. The inverted pendulum involves a cart that can be moved left or right, with the goal of balancing a pole on top using a 1D force; with i = 512, this results in 512 actions. The half-cheetah is a robot with nine body parts aiming to maximize forward speed; it can apply torque to 6 joints, resulting in 6-dimensional actions, and with i = 4 this yields 4096 actions. An example of how such a discretized action set can be built is sketched below.
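The discrete action set can be constructed, for instance, as the Cartesian product of per-dimension grids. This is an illustrative sketch; the action ranges are assumed here to be [−1, 1] in every dimension.

import itertools
import numpy as np

def discretize_action_space(d, i, low=-1.0, high=1.0):
    # Build the n = i**d discrete actions obtained by discretizing each of
    # the d continuous action dimensions into i equally spaced values.
    grid = np.linspace(low, high, i)
    return np.array(list(itertools.product(grid, repeat=d)))

pendulum_actions = discretize_action_space(d=1, i=512)   # 512 actions
cheetah_actions = discretize_action_space(d=6, i=4)      # 4**6 = 4096 actions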

D.2. Algorithms
D.2.1. Stochastic Maximization

We have two scenarios, one for discrete and one for continuous states. For discrete states, E is a dictionary whose keys are the states in S and whose values are the latest played action in each state. For continuous states, E comprises the actions in the replay buffer. In both cases, we do not consider the whole set E; instead, we only consider a subset M ⊂ E. For discrete states, for a given state s, M includes the latest two exploited actions in state s. For continuous states, where it is impossible to retain the last exploited action for each state, we consider a randomly sampled subset M ⊂ E of ⌈log(n)⌉ actions, even though they were played in different states. In the experiments involving continuous states, we demonstrate that this is sufficient to achieve good results; see Section 7.3.
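The two memory constructions can be sketched as follows (illustrative helper names of our own; the replay buffer is assumed to expose its stored actions as an indexable sequence):

import math
import random

def memory_discrete(last_actions_per_state, state):
    # Discrete states: M holds the latest exploited actions for this state.
    return last_actions_per_state.get(state, [])[-2:]   # last two actions

def memory_continuous(replay_buffer_actions, n_actions):
    # Continuous states: M is a random sample of ceil(log2(n)) actions drawn
    # from the replay buffer, regardless of the state in which they were played.
    k = max(1, math.ceil(math.log2(n_actions)))
    k = min(k, len(replay_buffer_actions))
    return random.sample(list(replay_buffer_actions), k)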

D.2.2. Tabular Q-learning Methods

We set the training parameters the same for all the Q-learning variants and follow hyper-parameters similar to those in (Hasselt, 2010). We set the discount factor to 0.95 and apply a dynamic polynomial learning rate α with α_t(s, a) = 1/z_t(s, a)^0.8, where z_t(s, a) is the number of times the pair (s, a) has been visited, initially set to one for all pairs. For the exploration rate, we use a decaying ε, defined as ε(s) = 1/√z(s), where z(s) is the number of times state s has been visited, initially set to one for all states. For Double Q-learning, z_t(s, a) = z^A_t(s, a) if Q^A is updated and z_t(s, a) = z^B_t(s, a) if Q^B is updated, where z^A_t and z^B_t store the number of updates of each action for the corresponding value function. We averaged the results over ten repetitions. For Stochastic Q-learning, we track a dictionary D with keys being the states and values being the latest exploited action; thus, for a state s, the memory M = D(s) is the latest exploited action in that state.
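The visit-count-based schedules above can be written compactly as follows (a sketch; the count containers are our own bookkeeping, initialized to one as described):

from collections import defaultdict

state_action_visits = defaultdict(lambda: 1)   # z_t(s, a), initialized to one
state_visits = defaultdict(lambda: 1)          # z(s), initialized to one

def learning_rate(s, a):
    # Polynomial learning rate alpha_t(s, a) = 1 / z_t(s, a)^0.8.
    return 1.0 / (state_action_visits[(s, a)] ** 0.8)

def exploration_rate(s):
    # Decaying exploration rate eps(s) = 1 / sqrt(z(s)).
    return 1.0 / (state_visits[s] ** 0.5)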

D.2.3. Deep Q-network Methods

We set the training parameters the same for all the deep Q-learning variants. We set the discount factor to 0.99 and the learning rate α to 0.001. Our neural network takes an input of size equal to the sum of the dimensions of the states and actions, with a single output neuron. The network consists of two hidden linear layers, each of size 64, followed by a ReLU activation function (Nair & Hinton, 2010). We keep the exploration rate ε the same for all states, initialize it at 1, and apply a decay factor of 0.995, with a minimum threshold of 0.01. For n total actions, we train the network with stochastic batches of size ⌈log(n)⌉ uniformly sampled from a buffer of size 2⌈log(n)⌉. We averaged the results over five repetitions. For the stochastic methods, we consider the actions in the sampled batch as the memory set M. We choose the batch size in this way to keep the complexity of Stochastic Q-learning within O(log(n)).

D.3. Compute and Implementation


We implement the different Q-learning methods using Python 3.9, NumPy 1.23.4, and PyTorch 2.0.1. For proximal policy optimization (PPO) (Schulman et al., 2017), advantage actor-critic (A2C) (Mnih et al., 2016), and deep deterministic policy gradient (DDPG) (Lillicrap et al., 2015), we use the implementations of Stable-Baselines (Hill et al., 2018). We measure the training time on an 11th Gen Intel(R) Core(TM) i7-1165G7 CPU @ 2.80 GHz with 16.0 GB of RAM.


Figure 5: Comparison results for the stochastic and deterministic methods: (a) action selection time and (b) full step duration. The x-axis represents the number of possible actions, and the y-axis represents the time step duration of the agent in seconds.

Figure 6: stoch max with (w/) and without (w/o) memory M vs. max on uniformly distributed action values as described
in Section B. The x-axis and y-axis represent the steps and the values, respectively.

E. Additional Results
E.1. Wall Time Speed
Stochastic maximization exhibits logarithmic complexity in the number of actions, as confirmed in Fig. 5a. Therefore, both StochDQN and StochDDQN, which apply these techniques for action selection and value updates, have per-step execution times that grow only logarithmically with the number of actions, rather than linearly as for DQN and DDQN. This can be seen in Fig. 5b, which shows the complete step duration for the deep Q-learning methods, including action selection and network update. The proposed methods are nearly as fast as a random algorithm, which samples and selects actions randomly and performs no updates.

E.2. Stochastic Maximization

E.2.1. Stochastic Maximization vs. Maximization with Uniform Rewards

In the setting described in Section B.3, with 5000 independently and uniformly distributed action values in the range [0, 100], as shown in Fig. 6, the stoch max without memory (i.e., M = ∅) reaches around 91 on average and keeps fluctuating around that value, while the stoch max with memory M quickly achieves the optimal value.

E.2.2. Stochastic Maximization Analysis

In this section, we analyze stochastic maximization by tracking the values it returns across rounds, the ratio ω_t (Eq. (10)), and the difference δ_t (Eq. (9)), which we restate here. At time step t, given a state s and the current estimated Q-function Q_t, we define the non-negative underestimation error δ_t(s) as follows:

$$\delta_t(s) = \max_{a \in \mathcal{A}} Q_t(s, a) - \operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a). \quad (32)$$

Figure 7: Comparison results for the stochastic and non-stochastic methods for the Inverted Pendulum with 512 actions: (a) stoch max vs. max, (b) ratio ω_t, (c) difference δ_t.

Figure 8: Comparison results for the stochastic and non-stochastic methods for the Half Cheetah with 4096 actions: (a) stoch max vs. max, (b) ratio ω_t, (c) difference δ_t.

Furthermore, we define the ratio ω_t(s) as follows:

$$\omega_t(s) = \frac{\operatorname{stoch\,max}_{a \in \mathcal{A}} Q_t(s, a)}{\max_{a \in \mathcal{A}} Q_t(s, a)}. \quad (33)$$

It follows that:

$$\omega_t(s) = 1 - \frac{\delta_t(s)}{\max_{a \in \mathcal{A}} Q_t(s, a)}. \quad (34)$$
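During evaluation, these two quantities can be logged with a few lines (an illustrative sketch; q_values stands for Q_t(s, ·) and stoch_max_value for the output of the stochastic maximizer):

def underestimation_metrics(q_values, stoch_max_value):
    # Compute the difference delta_t(s) (Eq. (32)) and the ratio omega_t(s)
    # (Eq. (33)) between exact and stochastic maximization.
    exact_max = max(q_values)
    delta = exact_max - stoch_max_value          # non-negative by construction
    omega = stoch_max_value / exact_max          # assumes exact_max != 0
    return delta, omega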

For deep Q-networks, on InvertedPendulum-v4, both stoch max and max return similar values (Fig. 7a), ω_t approaches one rapidly (Fig. 7b), and δ_t remains below 0.5 (Fig. 7c). In the case of HalfCheetah-v4, both stoch max and max return similar values (Fig. 8a), ω_t quickly converges to one (Fig. 8b), and δ_t remains bounded below eight (Fig. 8c).

While the difference δ_t remains bounded, the values of both stoch max and max increase over the rounds as the agent discovers better options. This leads to the ratio ω_t converging to one, as the error becomes negligible relative to the growing values, in line with Eq. (34).

E.3. Stochastic Q-network Reward Analysis


As illustrated in Fig. 9a and Fig. 9b for the inverted pendulum and half-cheetah experiments, which involve 512 and 4096 actions, respectively, both StochDQN and StochDDQN attain the optimal average return in a number of rounds comparable to DQN and DDQN. Additionally, StochDQN reaches the optimal rewards the fastest on the inverted pendulum. Furthermore, while DDQN did not perform well on the inverted pendulum task, its modification, StochDDQN, reached the optimal rewards.


Figure 9: Stochastic vs. non-stochastic deep Q-learning variants on (a) Inverted Pendulum and (b) Half Cheetah, with steps on the x-axis and average returns, smoothed over a window of size 100, on the y-axis.

Figure 10: Comparing stochastic and non-stochastic Q-learning approaches on Cliff Walking, with steps on the x-axis, instantaneous rewards smoothed over a window of size 1000 on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).

E.4. Stochastic Q-learning Reward Analysis


We tested Stochastic Q-learning, Stochastic Double Q-learning, and Stochastic Sarsa in environments with both discrete states and actions. Interestingly, as shown in Fig. 11, our stochastic algorithms outperform their deterministic counterparts in terms of cumulative rewards, with Stochastic Q-learning outperforming all the considered methods. Moreover, on CliffWalking-v0 (as shown in Fig. 10), as well as on the generated MDP environment with 256 possible actions (as shown in Fig. 12), all the stochastic and non-stochastic algorithms reach the optimal policy in a similar number of steps.


Figure 11: Comparing stochastic and non-stochastic Q-learning approaches on Frozen Lake, with steps on the x-axis, instantaneous rewards smoothed over a window of size 1000 on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).

Figure 12: Comparing stochastic and non-stochastic Q-learning approaches on the generated MDP environment, with steps on the x-axis, instantaneous rewards smoothed over a window of size 1000 on the y-axis for plot (a), and cumulative rewards on the y-axis for plot (b).
