1. Options evaluation:
    δ ← r − QU(s, ω, a)
    if s′ is non-terminal then
        δ ← δ + γ(1 − βω,ϑ(s′)) QΩ(s′, ω) + γ βω,ϑ(s′) max_ω̄ QΩ(s′, ω̄)
    end
    QU(s, ω, a) ← QU(s, ω, a) + α δ

2. Options improvement:
    θ ← θ + αθ (∂ log πω,θ(a | s) / ∂θ) QU(s, ω, a)
    ϑ ← ϑ − αϑ (∂βω,ϑ(s′) / ∂ϑ) (QΩ(s′, ω) − VΩ(s′))

if βω,ϑ terminates in s′ then
    choose new ω according to ε-soft(πΩ(s′))
s ← s′
until s′ is terminal

Learning QU in addition to QΩ is computationally wasteful both in terms of the number of parameters and samples. A practical solution is to only learn QΩ and derive an estimate of QU from it. Because QU is an expectation over next …

Figure 2: After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic ("OC") recovers faster than the primitive actor-critic ("AC-PG") and SARSA(0). Each line is averaged over 350 runs.

As can be seen in Figure 2, when the goal suddenly changes, the option-critic agent recovers faster. Furthermore, the initial set of options is learned from scratch at a rate comparable to primitive methods. Despite the simplicity of the domain, we are not aware of other methods which could have solved this task without incurring a cost much larger than when using primitive actions alone (McGovern and Barto 2001; Şimşek and Barto 2009).
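Referring back to the tabular update rules listed above, the following is a minimal Python sketch of one evaluation/improvement step. It assumes a softmax (Boltzmann) parameterization of the intra-option policy, a sigmoid termination function, and VΩ approximated by a max over options; all names, shapes, and step sizes are illustrative assumptions, not the authors' implementation.

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def option_critic_step(s, w, a, r, s_next, done,
                       Q_U, Q_Omega, theta, vartheta,
                       gamma=0.99, alpha=0.5, alpha_theta=0.25, alpha_vartheta=0.25):
    """One evaluation/improvement step of the tabular updates above (sketch only).

    Q_U      : array [S, W, A], QU(s, w, a)
    Q_Omega  : array [S, W],    QOmega(s, w); assumed maintained elsewhere
               (e.g. derived from Q_U), it is not updated by this sketch.
    theta    : array [S, W, A], softmax intra-option policy parameters
    vartheta : array [S, W],    sigmoid termination parameters
    """
    beta = 1.0 / (1.0 + np.exp(-vartheta[s_next, w]))        # termination prob. at s'

    # 1. Options evaluation: one-step intra-option Q-learning error.
    delta = r - Q_U[s, w, a]
    if not done:
        delta += gamma * ((1.0 - beta) * Q_Omega[s_next, w]
                          + beta * Q_Omega[s_next].max())
    Q_U[s, w, a] += alpha * delta

    # 2. Options improvement.
    # Intra-option policy gradient step (softmax score function).
    pi = softmax(theta[s, w])
    score = -pi
    score[a] += 1.0                                           # d log pi(a|s) / d theta
    theta[s, w] += alpha_theta * score * Q_U[s, w, a]

    # Termination gradient step; VOmega(s') is approximated here by
    # max over options, i.e. a greedy policy over options (an assumption).
    advantage = Q_Omega[s_next, w] - Q_Omega[s_next].max()
    vartheta[s_next, w] -= alpha_vartheta * beta * (1.0 - beta) * advantage

    return delta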
Figure 3: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.

In the two temporally extended settings, with 4 options and 8 options, termination events are more likely to occur near the doorways (Figure 3), agreeing with the intuition that they would be good subgoals. As opposed to (Sutton, Precup, and Singh 1999), we did not encode this knowledge ourselves but simply let the agents find options that would maximize the expected discounted return.

Pinball Domain

… 2011) of order 3. We experimented with 2, 3 or 4 options. We used Boltzmann policies for the intra-option policies and linear-sigmoid functions for the termination functions. The learning rates were set to 0.01 for the critic and 0.001 for both the intra-option and termination gradients. We used an epsilon-greedy policy over options with ε = 0.01.

In (Konidaris and Barto 2009), an option can only be used and updated after a gestation period of 10 episodes. As learning is fully integrated in option-critic, by 40 episodes a near-optimal set of options had already been learned in all settings. From a qualitative point of view, the options exhibit temporal extension and specialization (Figure 4). We also observed that across many successful trajectories the red option would consistently be used in the vicinity of the goal.

Figure 5: Learning curves in the Pinball domain (undiscounted return vs. episodes; curves for 2, 3 and 4 options).
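As a rough illustration of the parameterization described above for the Pinball domain (Boltzmann intra-option policies and linear-sigmoid termination functions over a state feature vector, with an ε-greedy policy over options), the following sketch assumes a generic feature vector φ(s); all names and shapes are assumptions for illustration, not the authors' code.

import numpy as np

_rng = np.random.default_rng(0)

def boltzmann_intra_option_policy(theta_w, phi_s, temperature=1.0):
    """Boltzmann (softmax) distribution over actions for one option.

    theta_w : array [num_actions, num_features], linear weights of option w
    phi_s   : array [num_features], feature vector phi(s) (e.g. a low-order basis)
    """
    prefs = theta_w @ phi_s / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def linear_sigmoid_termination(vartheta_w, phi_s):
    """Termination probability beta_w(s) = sigmoid(vartheta_w . phi(s))."""
    return 1.0 / (1.0 + np.exp(-(vartheta_w @ phi_s)))

def epsilon_greedy_option(q_omega_s, epsilon=0.01, rng=_rng):
    """Epsilon-greedy choice over options given QOmega(s, .)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_omega_s)))
    return int(np.argmax(q_omega_s))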
… options. Our approach also relies on the Bellman equations …
Figure 8: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8
options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.
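The caption above names termination regularization, entropy regularization, and a baseline for the intra-option policy gradients. The sketch below shows one common way such terms can be combined, under the assumptions that the entropy bonus enters the intra-option policy gradient and the termination regularizer acts as a small margin on the option advantage; it is illustrative only, not a transcription of the authors' code.

import numpy as np

def softmax(prefs):
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def regularized_directions(prefs, a, q_u, baseline, beta,
                           q_omega_sw, v_omega_s,
                           entropy_reg=0.01, termination_reg=0.01):
    """Illustrative gradient directions combining the terms named in Figure 8.

    prefs : softmax action preferences of the active option at the current state
    beta  : sigmoid-parameterized termination probability of the active option
    """
    pi = softmax(prefs)

    # Intra-option policy gradient with a baseline ...
    score = -pi
    score[a] += 1.0                                   # d log pi(a|s) / d prefs
    # ... plus an entropy bonus (gradient of H(pi) w.r.t. the preferences).
    entropy = -(pi * np.log(pi)).sum()
    entropy_grad = -pi * (np.log(pi) + entropy)
    policy_direction = score * (q_u - baseline) + entropy_reg * entropy_grad

    # Termination gradient: descend on the (regularized) option advantage.
    advantage = q_omega_sw - v_omega_s + termination_reg
    termination_direction = -beta * (1.0 - beta) * advantage

    return policy_direction, termination_direction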
Figure 9 (panel labels: Option 0, Option 1; horizontal axis: Time): Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with "white" representing a segment during which option 1 was active and "black" for option 2.
… use the DQN framework to implement a gradient-based option learner, which uses intrinsic rewards to learn the internal policies of options, and extrinsic rewards to learn the policy over options. As opposed to our framework, descriptions of the subgoals are given as inputs to the option learners. Option-critic is conceptually general and does not require intrinsic motivation for learning the options.

Discussion

We developed a general gradient-based approach for learning simultaneously the intra-option policies and termination functions, as well as the policy over options, in order to optimize a performance objective for the task at hand. Our ALE experiments demonstrate successful end-to-end learning of options in the presence of nonlinear function approximation. As noted, our approach only requires specifying the number of options. However, if one wanted to use additional pseudo-rewards, the option-critic framework would easily accommodate it. In this case, the internal policies and termination function gradients would simply need to be taken with respect to the pseudo-rewards instead of the task reward. A simple instance of this idea, which we used in some of the experiments, is to use additional rewards to encourage options that are indeed temporally extended, by adding a penalty whenever a switching event occurs. Our approach can work seamlessly with any other heuristic for biasing the set of options towards some desirable property (e.g. compositionality or sparsity), as long as it can be expressed as an additive reward structure. However, as seen in the results, such biasing is not necessary to produce good results.

The option-critic architecture relies on the policy gradient theorem, and as discussed in (Thomas 2014), the gradient estimators can be biased in the discounted case. By introducing factors of the form γ^t ∏_{i=1}^{t} (1 − β_i) in our updates (Thomas 2014, eq. (3)), it would be possible to obtain unbiased estimates. However, we do not recommend this approach, since the sample complexity of the unbiased estimators is generally too high and the biased estimators performed well in our experiments.

Perhaps the biggest remaining limitation of our work is the assumption that all options apply everywhere. In the case of function approximation, a natural extension to initiation sets is to use a classifier over features, or some other form of function approximation. As a result, determining which options are allowed may have a cost similar to evaluating a policy over options (unlike in the tabular setting, where options with sparse initiation sets lead to faster decisions). This is akin to eligibility traces, which are more expensive than using no trace in the tabular case, but have the same complexity with function approximation. If initiation sets are to be learned, the main constraint that needs to be added is that the options and the policy over them lead to an ergodic chain in the augmented state-option space. This can be expressed as a flow condition that links initiation sets with terminations. The precise description of this condition, as well as sparsity regularization for initiation sets, is left for future work.

Acknowledgements

The authors gratefully acknowledge financial support for this work by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Fonds de recherche du Quebec - Nature et Technologies (FRQNT).
Appendix

Augmented Process

If ω_t has been initiated or is executing at time t, then the discounted probability of transitioning to (s_{t+1}, ω_{t+1}) is:

\[
P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t) = \sum_a \pi_{\omega_t}(a \mid s_t)\, \gamma P(s_{t+1} \mid s_t, a) \Big[ (1 - \beta_{\omega_t}(s_{t+1}))\, \mathbf{1}_{\omega_{t+1} = \omega_t} + \beta_{\omega_t}(s_{t+1})\, \pi_\Omega(\omega_{t+1} \mid s_{t+1}) \Big].
\]

When conditioning the process from (s_t, ω_{t−1}), the discounted probability of transitioning to (s_{t+1}, ω_t) is:

\[
P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1}) = \Big[ (1 - \beta_{\omega_{t-1}}(s_t))\, \mathbf{1}_{\omega_t = \omega_{t-1}} + \beta_{\omega_{t-1}}(s_t)\, \pi_\Omega(\omega_t \mid s_t) \Big] \sum_a \pi_{\omega_t}(a \mid s_t)\, \gamma P(s_{t+1} \mid s_t, a).
\]

More generally, the k-step discounted probabilities can be expressed recursively as follows:

\[
P_\gamma^{(k)}(s_{t+k}, \omega_{t+k} \mid s_t, \omega_t) = \sum_{s_{t+1}} \sum_{\omega_{t+1}} P_\gamma^{(1)}(s_{t+1}, \omega_{t+1} \mid s_t, \omega_t)\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k} \mid s_{t+1}, \omega_{t+1}),
\]
\[
P_\gamma^{(k)}(s_{t+k}, \omega_{t+k-1} \mid s_t, \omega_{t-1}) = \sum_{s_{t+1}} \sum_{\omega_t} P_\gamma^{(1)}(s_{t+1}, \omega_t \mid s_t, \omega_{t-1})\, P_\gamma^{(k-1)}(s_{t+k}, \omega_{t+k-1} \mid s_{t+1}, \omega_t).
\]

Proof of the Intra-Option Policy Gradient Theorem

… where (7) follows from the assumption that θ only appears in the intra-option policies. Substituting (7) into (6) yields a recursion which, using the previous remarks about the augmented process, can be transformed into:

\[
\frac{\partial Q_\Omega(s, \omega)}{\partial \theta} = \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a) + \sum_a \pi_{\omega,\theta}(a \mid s) \sum_{s'} \gamma P(s' \mid s, a) \sum_{\omega'} \Big[ \beta_{\omega,\vartheta}(s')\, \pi_\Omega(\omega' \mid s') + (1 - \beta_{\omega,\vartheta}(s'))\, \mathbf{1}_{\omega' = \omega} \Big] \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}
\]
\[
= \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a) + \sum_{s'} \sum_{\omega'} P_\gamma^{(1)}(s', \omega' \mid s, \omega)\, \frac{\partial Q_\Omega(s', \omega')}{\partial \theta}
\]
\[
= \sum_{k=0}^{\infty} \sum_{s', \omega'} P_\gamma^{(k)}(s', \omega' \mid s, \omega) \sum_a \frac{\partial \pi_{\omega',\theta}(a \mid s')}{\partial \theta} Q_U(s', \omega', a).
\]

The gradient of the expected discounted return with respect to θ is then:

\[
\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s,\omega} \sum_{k=0}^{\infty} P_\gamma^{(k)}(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a).
\]
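To make the augmented process concrete, the following toy script (all sizes and values arbitrary and purely illustrative) builds the one-step discounted transition matrix P_γ^{(1)} over state-option pairs exactly as defined in the Augmented Process section above, and checks that every row sums to γ, which follows because the action, state-transition, and option-switching distributions each sum to one; the k-step kernels are then simply matrix powers, matching the recursion.

import numpy as np

rng = np.random.default_rng(0)
nS, nO, nA, gamma = 4, 2, 3, 0.9

# Random MDP dynamics and option components (rows normalized to distributions).
P = rng.random((nS, nA, nS)); P /= P.sum(-1, keepdims=True)                    # P(s'|s,a)
pi = rng.random((nO, nS, nA)); pi /= pi.sum(-1, keepdims=True)                 # pi_w(a|s)
pi_Omega = rng.random((nS, nO)); pi_Omega /= pi_Omega.sum(-1, keepdims=True)   # pi_Omega(w|s)
beta = rng.random((nO, nS))                                                    # beta_w(s')

# One-step discounted transition over the augmented (state, option) space:
# P1[(s,w), (s',w')] = sum_a pi_w(a|s) * gamma * P(s'|s,a)
#                      * ((1 - beta_w(s')) * 1{w'=w} + beta_w(s') * pi_Omega(w'|s'))
P1 = np.zeros((nS * nO, nS * nO))
for s in range(nS):
    for w in range(nO):
        for s_next in range(nS):
            p_s_next = gamma * (pi[w, s] @ P[s, :, s_next])
            for w_next in range(nO):
                stay = (1.0 - beta[w, s_next]) * (w_next == w)
                switch = beta[w, s_next] * pi_Omega[s_next, w_next]
                P1[s * nO + w, s_next * nO + w_next] = p_s_next * (stay + switch)

# Each row sums to gamma, and the k-step kernel is the k-th matrix power.
assert np.allclose(P1.sum(axis=1), gamma)
P2 = P1 @ P1           # P_gamma^(2), matching the recursive definition
print(P2.sum(axis=1))  # each row sums to gamma**2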