CHAPTER 21
Methods
In the last chapter, we discussed Markov decision processes. The implicit assumptions there were that the state space is known and that the transition probability function and the reward function are available to the decision maker before decision making begins. These assumptions do not hold in modern practical decision-making systems such as autonomous vehicles, advanced robotics and wireless communication. For example, in an autonomous vehicle it is too complicated to completely characterize the state space and the associated transition probability function. Hence, the idea in reinforcement learning is to apply machine learning, i.e. interact with the environment to obtain data which can then be used for learning optimal policies. Based on how machine learning is applied to these tasks, we can divide reinforcement learning into two categories:
1. Indirect learning: In indirect learning, or model-based learning, a model for the MDP is learned from samples. Once a model is learned, we can apply the numerical methods that we discussed in the previous chapter to obtain the optimal policy. Indirect learning tends to work well when the state space is small.
2. Direct learning: In indirect learning, we first learn the model of the environment and then obtain the policy. However, in most situations, it is easier to learn the optimal policy directly and bypass the process of learning the MDP model. Direct learning is the only feasible method for decision-making problems with large state spaces.
In this book, we will deal with only direct reinforcement learning methodologies.
In (2), we have re-written the sum on the right-hand side of (1) as an expectation of the value function, where the expectation is with respect to the conditional distribution of the next state given that the current state is $s$ and the action taken is $a$.
Since we do not know the transition probability function, we cannot evaluate the expectation in (2). In reinforcement learning, we are interested in approximating this expectation (and other equivalent formulations) using samples obtained as and when the decision maker interacts with the environment.
$$\hat{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}. \qquad (3)$$
The averaging equation in (3) assumes that we can store the $n$ numbers in memory to obtain the average. However, this may not be possible when $n$ is very large or the computational resources are limited. Is there a way to compute averages recursively (or online), i.e. as and when data arrives? Let $\hat{X}_k$ denote the average when $k$ values are available; then $\hat{X}_{k+1}$ can be written recursively as follows:
$$\hat{X}_{k+1} = \frac{X_1 + X_2 + \cdots + X_{k+1}}{k+1} = \frac{X_1 + X_2 + \cdots + X_k}{k+1} + \frac{X_{k+1}}{k+1} = \frac{k}{k+1}\cdot\frac{X_1 + X_2 + \cdots + X_k}{k} + \frac{X_{k+1}}{k+1} = \hat{X}_k\left(1 - \frac{1}{k+1}\right) + \frac{X_{k+1}}{k+1} = \hat{X}_k + \varepsilon_{k+1}\left(X_{k+1} - \hat{X}_k\right), \qquad (4)$$
where $\varepsilon_k = \frac{1}{k}$.
Equation (4) is of the form:
$$\text{Updated estimate} = \text{Old estimate} + \text{step size} \times (\text{New data} - \text{Old estimate}). \qquad (5)$$
In all RL algorithms, some version of (5) is used to update the estimates.
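As a quick illustration, here is a minimal Python sketch of the recursive update in (4)-(5); the input list is an arbitrary example and the function name is ours.

```python
def online_average(stream):
    """Running average computed with the recursive update in (4)-(5)."""
    x_hat = 0.0
    for k, x in enumerate(stream, start=1):
        step = 1.0 / k                      # epsilon_k = 1/k
        x_hat = x_hat + step * (x - x_hat)  # new = old + step * (new data - old)
    return x_hat

data = [2.0, 4.0, 6.0, 8.0]
print(online_average(data))  # 5.0, identical to sum(data) / len(data)
```

Note that only a single number is stored at any time, which is exactly the point of the online form.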
What is the connection between averages and the problem of computing expectations mentioned in Section 21.1.2.1? For this we can use the famous law of large numbers, which roughly states that the empirical average of a sequence of random variables converges to the expectation of the random variable.
Theorem 1 Strong law of large numbers (SLLN)
Suppose $\{X_k\}$ is a sequence of independent and identically distributed (i.i.d.) random variables. Then $\hat{X}_k \to \mu$ almost surely if and only if $E[|X_1|] < \infty$ and $E[X_1] = \mu$.
While the SLLN in Theorem 1 only applies to the i.i.d. case, extensions of the SLLN exist for Markov chains (which is what is actually needed in reinforcement learning); however, these are beyond the scope of the current textbook.
21.2 Q-learning
One of the first algorithms to gain popularity is the Q-learning algorithm. Q-learning can be viewed as a reinforcement learning approach to the value iteration algorithm for computing the optimal policy. Consider again the value iteration equation in (1). Re-writing the value function in terms of the Q-function and the expectation operator, we have
$$\begin{aligned}
Q_k(s,a) &= r(s,a) + \gamma \sum_{s'} P_{s,s'}(a)\, V_{k-1}(s') \\
&= r(s,a) + \gamma \sum_{s'} P_{s,s'}(a)\, \max_{a'} Q_{k-1}(s',a') \\
&= \sum_{s'} P_{s,s'}(a)\left[ r(s,a) + \gamma \max_{a'} Q_{k-1}(s',a') \right] \\
&= E\left[ r(s,a) + \gamma \max_{a'} Q_{k-1}(s',a') \,\middle|\, s, a \right]. \qquad (6)
\end{aligned}$$
As before, since we cannot compute the expectation in (6), we approximate it using iterative averaging of the form in (5), as follows:
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \left( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \right). \qquad (7)$$
Using the update in (7), the Q-learning algorithm can be written as follows (a short code sketch is given after the pseudocode):
1 Initialize $\hat{Q}(s,a) = 0 \;\; \forall s, a$.
2 for all episodes do
3   for $t = 1, \dots$ do
4     Obtain $s_t$
5     Obtain greedy policy: $\mu(s) = \arg\max_a \hat{Q}(s,a)$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
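As a rough sketch, the algorithm above can be expressed in Python as follows. The environment interface (reset(), step(), and the list env.actions) is an assumed Gym-style convention, not something defined in this chapter.

```python
from collections import defaultdict

def q_learning(env, num_episodes, gamma=0.9, step_size=0.1):
    """Tabular Q-learning using the update in (7) with greedy action selection."""
    Q = defaultdict(float)                  # Q[(s, a)], initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 5: greedy action with respect to the current estimates
            a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Steps 8-9: temporal difference and Q-value update
            td_target = r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += step_size * (td_target - Q[(s, a)])
            s = s_next
    return Q
```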
Figure 2: State transition diagram for the MDP in Example 1
21.2.1 Q-learning algorithm with exploration
The problem with Q-learning in Example 1 is that once we start to greedily select actions based on the estimated Q-function, we reinforce past errors and end up selecting only a (suboptimal) subset of actions. Therefore, exploration is required. There are many exploration strategies available in the literature. One of the simplest, yet most effective, strategies is the $\varepsilon$-greedy strategy. The idea is to behave greedily most of the time, but once in a while, i.e. with probability $\varepsilon$, to select randomly from the available actions with equal probability. The Q-learning algorithm with the $\varepsilon$-greedy strategy is provided in Algorithm 2; a short sketch of the action-selection step follows the pseudocode.
1 Initialize $\hat{Q}(s,a) = 0 \;\; \forall s, a$.
2 for all episodes do
3   for $t = 1, \dots$ do
4     Obtain $s_t$
5     Obtain policy:
      $\mu(s) = \begin{cases} \arg\max_a \hat{Q}(s,a) & \text{with probability } 1-\varepsilon \\ a \in A \text{ uniformly at random} & \text{with probability } \varepsilon \end{cases}$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
Algorithm 2: Q-learning with $\varepsilon$-greedy exploration
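Only the action-selection step changes relative to the previous sketch. A minimal ε-greedy selection helper, assuming Q is a dictionary keyed by (state, action), could look as follows:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit
```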
The idea behind Monte Carlo methods arises from the definition of the value function as the
expected cumulative reward starting from any state, reproduced in Equation (8) for
convenience.
$$V_\mu(s) = E_\mu\left[ G_t \mid s_t = s \right], \qquad (8)$$
where $G_t$ is the cumulative reward (in the previous chapter, the notation $J_t$ was used instead of $G_t$), as defined in the previous chapter. Hence, given a policy $\mu$ and episodes generated from the policy $\mu$, an estimate of the value function at any state $s$ can be obtained by averaging the discounted cumulative reward obtained once state $s$ is visited. Based on how the rewards are computed, two algorithms can be formulated: Algorithm 3 is the first visit Monte Carlo algorithm, where cumulative rewards are computed starting only from the first time a state is visited; Algorithm 4 is the every visit Monte Carlo algorithm, where the average involves the cumulative reward every time after a state is visited.
1 Initialize empty lists: $V(s) \;\forall s$, $\mathrm{CumReturn}(s) \;\forall s$.
2 for each episode $s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T$ do
3   $G \leftarrow 0$
4   for each $t = T-1, \dots, 0$ (starting from the end of the episode) do
5     $G \leftarrow \gamma G + r_{t+1}$
6     Append $G$ to $\mathrm{CumReturn}(s_t)$
7     $V(s) = \text{Average}(\mathrm{CumReturn}(s))$
Algorithm 4: Every visit Monte Carlo
The difference between the two algorithms is that the extra step of checking whether a state has been visited previously (Step 6 in Algorithm 3) is avoided in Algorithm 4. A minimal code sketch of the every visit variant is given below.
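The following is a rough Python sketch of Algorithm 4; each episode is assumed to be a list of (state, action, reward) tuples, and the reward indexing convention is slightly simplified relative to the pseudocode.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every visit Monte Carlo estimate of the value function."""
    cum_returns = defaultdict(list)        # CumReturn(s)
    for episode in episodes:
        G = 0.0
        # work backwards so G accumulates the discounted return from each state
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            cum_returns[state].append(G)   # every visit: record at every occurrence
    return {s: sum(g) / len(g) for s, g in cum_returns.items()}
```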
However, notice from policy iteration that the policy improvement step requires Q-function estimates. It is straightforward to estimate the Q-function values instead of the value function by keeping track of both the state and the action pairs during an episode. Algorithm 5 gives the pseudo-code for estimating the Q-function values in a first visit Monte Carlo setting.
1 Initialize empty lists: $Q(s,a) \;\forall s, a$, $\mathrm{CumReturn}(s,a) \;\forall s, a$.
2 for each episode $s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T$ do
3   $G \leftarrow 0$
4   for each $t = T-1, \dots, 0$ (starting from the end of the episode) do
5     $G \leftarrow \gamma G + r_{t+1}$
Figure 5: Example 2. (a) State transition diagram; (b) single episode.
Earlier in this chapter, in Section 21.2, we introduced Q-learning; Q-learning can be viewed within the framework of TD learning. We start with the simple TD(0) learning algorithm in Section 21.4.1 and generalize to the TD($\lambda$) algorithm in Section 21.4.2.
Consider again the definition of the value function in Equation (8). The basic idea of TD (0)
arises from noticing the following relation:
$$V_\mu(s) = E_\mu\left[ G_t \mid s_t = s \right] = E_\mu\left[ r_t + \gamma V_\mu(s_{t+1}) \mid s_t = s \right]. \qquad (9)$$
As in Q-learning, the idea behind TD(0) is that, given an estimate of the value function, the estimate can be updated as follows:
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t) \right), \qquad (10)$$
5     $\delta_t = r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t)$
6     Update value function: $\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \delta_t$
The quantity $\delta_t$ in Step 5 of Algorithm 6, the difference between the one-step lookahead and the current estimate, is referred to as the temporal difference; hence the name temporal difference algorithm. A minimal code sketch of TD(0) policy evaluation follows.
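The sketch below assumes the same Gym-style environment interface as before and a policy given as a function from states to actions.

```python
from collections import defaultdict

def td0(env, policy, num_episodes, gamma=0.9, step_size=0.1):
    """TD(0) policy evaluation using the update in (10)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * V[s_next] - V[s]  # temporal difference (Step 5)
            V[s] += step_size * delta             # value update (Step 6)
            s = s_next
    return V
```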
The TD(0) algorithm uses a one-step prediction to update the value function. A natural extension is to use an $n$-step prediction to update the value function. Notice from (11) that as $n \to \infty$, the $n$-step prediction converges to the Monte Carlo estimate. Using (11) and replacing the one-step prediction in the update (10), we have the following update equation:
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma r_{t+1} + \cdots + \gamma^n r_{t+n} + \gamma^{n+1} \hat{V}_\mu(s_{t+n+1}) - \hat{V}_\mu(s_t) \right). \qquad (12)$$
The temporal difference $\delta_t^n$ (the superscript $n$ refers to the $n$-step prediction) can be written as follows:
$$\begin{aligned}
\delta_t^n &= \sum_{k=0}^{n} \gamma^k r_{t+k} + \gamma^{n+1}\hat{V}(s_{t+n+1}) - \hat{V}(s_t) \\
&= \sum_{k=0}^{n} \gamma^k r_{t+k} + \gamma^{n+1}\hat{V}(s_{t+n+1}) - \hat{V}(s_t) + \sum_{k=0}^{n-1} \gamma^{k+1}\hat{V}(s_{t+k+1}) - \sum_{k=0}^{n-1} \gamma^{k+1}\hat{V}(s_{t+k+1}) \\
&= \sum_{k=0}^{n} \gamma^k \delta_{t+k}, \qquad (13)
\end{aligned}$$
where, in the second step, we added and subtracted the last two sums, and the third step is obtained by re-arranging the terms (the discounted rewards and value differences telescope into one-step temporal differences). The algorithm for $n$-step TD is exactly the same as Algorithm 6, except that $\delta_t^n$ replaces $\delta_t$ in Step 5. A short numerical check of the identity in (13) is given below.
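As a sanity check of (13), the following snippet verifies numerically, for arbitrary illustrative numbers, that the n-step TD error equals the discounted sum of one-step TD errors:

```python
gamma, n = 0.9, 3
rewards = [1.0, 0.5, -0.2, 2.0]        # r_t, ..., r_{t+n}
values = [0.3, 0.1, 0.7, -0.4, 0.2]    # V(s_t), ..., V(s_{t+n+1})

# left-hand side of (13): the n-step TD error
lhs = (sum(gamma**k * rewards[k] for k in range(n + 1))
       + gamma**(n + 1) * values[n + 1] - values[0])

# right-hand side of (13): discounted sum of one-step TD errors
deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(n + 1)]
rhs = sum(gamma**k * deltas[k] for k in range(n + 1))

print(abs(lhs - rhs) < 1e-9)  # True
```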
What is the connection of TD($\lambda$) with the $n$-step prediction? The formal connection is through the use of eligibility traces; however, the topic is outside the scope of the current book. The interested reader is referred to Chapter 13 of Reinforcement Learning: An Introduction by Sutton and Barto. In short, in the TD($\lambda$) algorithm, the temporal difference error $\delta_t^n$ is weighted by $(1-\lambda)\lambda^n \;\forall n \geq 0$. Hence, the value function update is given by
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \sum_{k=0}^{\infty} (1-\lambda)\lambda^k \delta_t^k = \hat{V}_\mu(s_t) + \varepsilon_t \sum_{k=0}^{\infty} \gamma^k \lambda^k \delta_{t+k}, \qquad (14)$$
where the right-hand side is obtained by substituting for $\delta_t^n$ from (13). Therefore, $\lambda \to 0$ gives TD(0), and $\lambda \to 1$ gives Monte Carlo.
in the case of Monte Carlo), by keeping track of both the state and the action pair. For example, Steps 5 and 6 in the TD(0) algorithm can be replaced as follows:
5     $\delta_t = r_t + \gamma \hat{Q}(s_{t+1}, \mu(s_{t+1})) - \hat{Q}(s_t, a_t)$
6     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
A minimal code sketch of this update is given below.
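A rough sketch of this Q-value version of the TD(0) update, with Q assumed to be a defaultdict(float) keyed by (state, action) and the policy a function from states to actions:

```python
def td0_q_update(Q, s, a, r, s_next, policy, gamma=0.9, step_size=0.1):
    """One TD(0) update for Q-values: bootstrap with the action the current
    policy would take in the next state (Steps 5-6 above)."""
    delta = r + gamma * Q[(s_next, policy(s_next))] - Q[(s, a)]
    Q[(s, a)] += step_size * delta
    return Q
```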
Using the estimated Q-values, the policy iteration algorithm with a greedy policy improvement step is as follows:
1 Start with some initial policy $\mu_0$
2 for $k = 0, 1, 2, \dots$ do
3   Obtain $\hat{Q}_{\mu_k}$
4   Policy improvement step: $\mu_{k+1}(s) = \arg\max_a \hat{Q}_{\mu_k}(s, a)$
Figure 6: MDP: Example of bias during Q-learning.
$$\hat{Q}_2(s_t, a_t) \leftarrow \hat{Q}_2(s_t, a_t) + \varepsilon_t \left( r_t + \gamma \hat{Q}_2(s_{t+1}, a^*) - \hat{Q}_2(s_t, a_t) \right) \qquad (15)$$
Of course, the roles of $Q_1$ and $Q_2$ in (15) can be interchanged. However, note that only one estimate is updated at each time step so as to preserve the independence between the two estimates. One way to decide is to randomly choose, with equal probability, which of the two to update. Putting all this together, the Q-learning algorithm with double learning is shown in Algorithm 9, followed by a short code sketch.
1 Initialize $\hat{Q}_1(s,a) = 0$, $\hat{Q}_2(s,a) = 0 \;\; \forall s, a$.
5   Obtain greedy policy: $\mu(s) = \arg\max_a \left[ \hat{Q}_1(s,a) + \hat{Q}_2(s,a) \right]$
10    $a^* = \arg\max_a \hat{Q}_1(s_{t+1}, a)$
11    $\delta_t = r_t + \gamma \hat{Q}_2(s_{t+1}, a^*) - \hat{Q}_2(s_t, a_t)$
12    Update Q2-values: $\hat{Q}_2(s_t, a_t) \leftarrow \hat{Q}_2(s_t, a_t) + \varepsilon_t \delta_t$
13  else
14    $a^* = \arg\max_a \hat{Q}_2(s_{t+1}, a)$
15    $\delta_t = r_t + \gamma \hat{Q}_1(s_{t+1}, a^*) - \hat{Q}_1(s_t, a_t)$
16    Update Q1-values: $\hat{Q}_1(s_t, a_t) \leftarrow \hat{Q}_1(s_t, a_t) + \varepsilon_t \delta_t$
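The coin-flip structure of Algorithm 9 can be summarized in a short Python sketch of a single update step; Q1 and Q2 are assumed to be defaultdict(float) tables keyed by (state, action), and the discount factor is included explicitly.

```python
import random

def double_q_step(Q1, Q2, s, a, r, s_next, actions, gamma=0.9, step_size=0.1):
    """One double Q-learning update (Steps 10-16 of Algorithm 9)."""
    if random.random() < 0.5:
        # update Q2: Q1 selects the greedy next action, Q2 evaluates it
        a_star = max(actions, key=lambda act: Q1[(s_next, act)])
        delta = r + gamma * Q2[(s_next, a_star)] - Q2[(s, a)]
        Q2[(s, a)] += step_size * delta
    else:
        # update Q1: Q2 selects the greedy next action, Q1 evaluates it
        a_star = max(actions, key=lambda act: Q2[(s_next, act)])
        delta = r + gamma * Q1[(s_next, a_star)] - Q1[(s, a)]
        Q1[(s, a)] += step_size * delta
```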
where $w$ is the weight vector and $x(s)$ is the $d$-dimensional feature vector for state $s$.
• Artificial neural network (ANN): For example, Figure 7 (left) shows a single-hidden-layer neural network, which can be written as
$$\hat{V}_{W,b}(s) = \sigma\left(W x(s) + b\right),$$
where $W$ is the weight matrix, $x(s)$ is the $d$-dimensional feature vector for state $s$, $b$ is the bias vector and $\sigma(\cdot)$ is the non-linear activation function. We can also use a deep neural network for function approximation, which can be written as
$$\hat{V}_{\{W_i\},\{b_i\}}(s) = \sigma\left(W_1\, \sigma\left(W_2 \cdots \sigma\left(W_k x(s) + b_k\right) \cdots + b_2\right) + b_1\right),$$
From here on, we will denote by $\theta$ the hyper-parameters of the function approximator. For example, for the single layer neural network in Figure 7, the hyper-parameters would be the weight matrix and the bias vector, i.e. $\theta = \{W, b\}$, and the value function parameterized by the hyper-parameters will be denoted by $\hat{V}_\theta(s)$.
Figure 7: Artificial neural networks. The figure on the left shows a single-hidden-layer neural network and the figure on the right shows a deep neural network.
The next component that we require for a function approximator is the loss function. Consider again the TD learning equation in (9), reproduced here for convenience:
$$V_\mu(s) = E_\mu\left[ r_t + \gamma V_\mu(s_{t+1}) \mid s_t = s \right], \qquad (16)$$
which gives the following update for the TD algorithm (again, reproduced here for convenience):
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t) \right). \qquad (17)$$
Equations (16) and (17) can be interpreted as updating the current value function $\hat{V}_\mu(s_t)$ towards the "target" $r_t + \gamma \hat{V}_\mu(s_{t+1})$. Hence, the loss function can be defined as
$$\mathrm{loss}^{TD}_\theta(s, a, s') = \left( \hat{V}_\mu(s) - \mathrm{target}(s', a) \right)^2. \qquad (18)$$
The next question is how to optimize the hyper-parameters so as to minimize the loss function. We follow ideas from machine learning and choose to optimize the hyper-parameters using a gradient-based algorithm as follows:
5     Obtain greedy policy: $\mu(s) = \arg\max_a \hat{Q}(s, a)$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update hyper-parameters:
      $\theta_t \leftarrow \theta_t - \varepsilon_t \nabla_\theta\, \mathrm{loss}_\theta(s_t, a_t, s_{t+1}) = \theta_t + \varepsilon_t \delta_t \nabla_\theta Q_\theta(s_t, a_t)$
(the factor of 2 from the squared loss is absorbed into $\varepsilon_t$). A minimal sketch of this update with a linear approximator is given below.
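As a concrete instance of Step 9, the following sketch performs the semi-gradient update with a linear approximator Q_theta(s, a) = theta^T x(s, a); the feature map x(s, a) and its dimension are assumptions made only for illustration.

```python
import numpy as np

def semi_gradient_q_step(theta, x, s, a, r, s_next, actions,
                         gamma=0.99, step_size=0.01):
    """One semi-gradient Q-update; for a linear model, grad_theta Q = x(s, a)."""
    q_sa = theta @ x(s, a)
    q_next = max(theta @ x(s_next, act) for act in actions)
    delta = r + gamma * q_next - q_sa            # Step 8: TD error
    return theta + step_size * delta * x(s, a)   # Step 9: parameter update
```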
21.6.1.1 Issues with DQN
Most of these issues are common across schemes that use a 'large' function approximator.
1. Exploration: In order to obtain good values of the hyper-parameters, all the states and actions should be visited 'often'. However, depending on the transition function and the reward function, this may not always be the case. Hence, $Q_\theta(s, a)$ may have varying levels of accuracy across different $s$ and $a$. A classical methodology to alleviate this problem is to introduce exploration strategies such as the $\varepsilon$-greedy exploration strategy.
2. Stability: In contrast to classical supervised learning algorithms, in the loss function (for example, in (19)) the 'true' label, i.e. the target, is itself a function of the hyper-parameters, which can make the training process unstable. There are two procedures to stabilize the training process (a short code sketch of both follows this list):
– Experience replay: Past transitions are stored in a buffer and mini-batches are sampled from it for each gradient step. This mini-batch sampling reduces the correlation between the samples used to update the parameters and the decisions made from those updates.
– Lazy update of the target network: Consider creating a "target Q-network" with parameters $\phi$, i.e. $Q_\phi(s, a)$. Then the loss function in (19) can be replaced by
$$\mathrm{loss}^{Q}_\theta(s, a, s') = \left( Q_\theta(s, a) - \left( r(s,a) + \gamma \max_{a'} Q_\phi(s', a') \right) \right)^2.$$
The hyper-parameters of the "online network" (the original network with hyper-parameters $\theta$) are updated as in Step 9 of Algorithm 10. This reduces the instability that arises when a small change in the hyper-parameters $\theta$ also shifts the target, causing the loss function to differ significantly from its previous value. In order to remain consistent, we set $\phi \leftarrow \theta$ every, say, few thousand steps.
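Below is a brief, self-contained sketch of both stabilization devices; the parameter vectors and the "gradient step" are stand-ins, since the actual network and loss depend on the architecture chosen above.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

# Lazy target-network update: the target parameters phi are refreshed from the
# online parameters theta only every `copy_every` gradient steps.
theta = np.zeros(8)                              # online parameters (illustrative size)
phi = theta.copy()                               # target parameters
copy_every = 1000
for step in range(1, 5001):
    theta = theta + 0.001 * np.random.randn(8)   # stand-in for a gradient step on the loss
    if step % copy_every == 0:
        phi = theta.copy()                       # phi <- theta
```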
21.6.2.1 Double DQN
In double DQN, the idea is to use double learning along with DQN. It is well known that DQN overestimates the Q-values (due to the maximization bias inherent in reinforcement learning algorithms) and hence results in sub-optimal policies. Using the double learning idea along with DQN is known to largely alleviate this issue.
Rather than introduce a new network for double learning, we can re-use the target network in DQN to achieve double learning, as shown below:
$$\mathrm{loss}^{Q}_\theta(s, a, s') = \left( Q_\theta(s, a) - \left( r(s,a) + \gamma\, Q_\phi\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right) \right) \right)^2, \qquad (22)$$
i.e. the online network selects the greedy action while the target network estimates its value. A small helper that computes this target is sketched below.
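In the sketch, q_online and q_target are assumed to map a state to a vector of Q-values, one entry per action; the names are ours.

```python
import numpy as np

def double_dqn_target(q_online, q_target, s_next, r, gamma=0.99):
    """Online network selects the greedy action, target network evaluates it."""
    a_star = int(np.argmax(q_online(s_next)))     # action selection: online network
    return r + gamma * q_target(s_next)[a_star]   # action evaluation: target network
```

The loss for a sampled transition is then the squared difference between Q_theta(s, a) and this target, as in (22).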
$$P(t) = \frac{|\omega_t|^\alpha}{\sum_t |\omega_t|^\alpha}, \qquad (24)$$
where $\alpha$ is a hyper-parameter that controls the sampling. As $\alpha$ goes to $0$, we get uniform sampling, and as $\alpha \to 1$, more priority is assigned to samples with higher training error (or "more informative" samples). A modified version of prioritized sampling, given in (25), has an extra term $\varepsilon$, a small positive constant, which ensures that each experience has a nonzero probability of being sampled:
$$P(t) = \frac{\left(|\omega_t| + \varepsilon\right)^\alpha}{\sum_t \left(|\omega_t| + \varepsilon\right)^\alpha}. \qquad (25)$$
However, prioritized sampling of the experiences introduces bias into the training process. Importance sampling weights can be used to correct for this bias; however, these are beyond the scope of the current textbook. A small sketch of the sampling probabilities in (24)-(25) is given below.
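The sketch below takes the TD error magnitude as the priority ω_t; the constant names and values are our assumptions.

```python
import numpy as np

def priority_probabilities(td_errors, alpha=0.6, eps=1e-3):
    """Probabilities proportional to (|TD error| + eps)^alpha, as in (25)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# sample one buffer index according to the priorities
errors = np.array([0.1, 2.0, 0.5, 0.0])
probs = priority_probabilities(errors)
index = np.random.choice(len(errors), p=probs)
```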
where, in (26), the Q-function is written as the sum of the value function and a new function referred to as the advantage function. The advantage function characterizes the "advantage" that we get from selecting an action; note that $A(s, a)$ can be either positive or negative. The accompanying figure shows the architecture of the dueling DQN.
value function network redundant. A "trick" to prevent this from happening is to re-write (26) as follows:
$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|} \sum_{a' \in A} A(s, a'). \qquad (27)$$
This effectively forces the mean (across the action space) of the advantage function to zero. With these modifications, the network achieves faster convergence and performance comparable to the DQN architecture. A small sketch of this aggregation step is given below.
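A one-line sketch of the aggregation in (27), with illustrative head outputs for a three-action state:

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), as in (27)."""
    return value + advantages - advantages.mean()

print(dueling_q_values(1.5, np.array([0.2, -0.1, 0.5])))  # [1.5  1.2  1.8]
```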
21.7 Exercises
Ex 1. Consider an MDP with four states $\{A, B, C, D\}$ shown in Figure 4. In each state, we can perform either action $a$ or action $b$. The figure shows the state transition function and the reward obtained for each action. Consider the Q-learning algorithm with $\gamma = 0.9$ (discount factor) and step size $\alpha = 1$. Find the estimated Q-values if the MDP begins in state $A$ and performs the actions $\{a, a, b, a, b, a\}$. Assume that the Q-learning algorithm is initialized with zero values.