CHAPTER 21
Methods
In the last chapter, we discussed Markov decision processes. The implicit assumptions there were that the state space is known and that the transition probability function and the reward function are available to the decision maker before decision making begins. These assumptions do not hold in modern practical decision-making systems such as autonomous vehicles, advanced robotics and wireless communication. For example, in an autonomous vehicle it is too complicated to completely characterize the state space and the associated transition probability function. Hence, the idea in reinforcement learning is to apply machine learning, i.e. interact with the environment to obtain data which can then be used for learning optimal policies. Based on how machine learning is applied to these tasks, we can divide reinforcement learning into two categories:
1. Indirect learning: In indirect learning, or model-based learning, a model for the MDP is learned from samples. Once a model is learned, we can apply the numerical methods that we discussed in the previous chapter to obtain the optimal policy. Indirect learning tends to work well when the state space is small.
2. Direct learning: In indirect learning, we first learn the model of the environment and then obtain the policy. However, in most situations, it is easier to learn the optimal policy directly and bypass the process of learning the MDP model. Direct learning is the only feasible method for decision-making problems with large state spaces.
In this book, we will deal with only direct reinforcement learning methodologies.
In (2), we have re-written the sum on the right-hand side of (1) as an expectation of the value function, where the expectation is with respect to the conditional distribution of the next state given that the current state is $s$ and the action taken is $a$.
Since we do not know the transition probability function, we cannot evaluate the expectation in (2). In reinforcement learning, we are interested in approximating this expectation (and other equivalent formulations) using samples obtained as and when the decision maker interacts with the environment.
$$\hat{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}. \qquad (3)$$
The averaging equation in (3) assumes that we can store the $n$ numbers in memory to obtain the average. However, this may not be possible when $n$ is very large or the computational resources are limited. Is there a way to compute averages recursively (or online), i.e. as and when data arrives? Let $\hat{X}_k$ denote the average when $k$ values are available; then $\hat{X}_{k+1}$ can be written recursively as follows:
$$\hat{X}_{k+1} = \frac{X_1 + X_2 + \cdots + X_{k+1}}{k+1} = \frac{X_1 + X_2 + \cdots + X_k}{k+1} + \frac{X_{k+1}}{k+1} = \frac{k}{k+1}\cdot\frac{X_1 + X_2 + \cdots + X_k}{k} + \frac{X_{k+1}}{k+1} = \hat{X}_k\left(1 - \frac{1}{k+1}\right) + \frac{X_{k+1}}{k+1} = \hat{X}_k + \varepsilon_{k+1}\left(X_{k+1} - \hat{X}_k\right), \qquad (4)$$
where $\varepsilon_k = \frac{1}{k}$.
Equation (4) is of the form:
$$\text{Updated estimate} = \text{Old estimate} + \text{step size} \times (\text{New data} - \text{Old estimate}). \qquad (5)$$
In all RL algorithms, some version of (5) is used to update the estimates.
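As a quick illustration, here is a minimal Python sketch of the recursive update in (4)-(5); the input list is an arbitrary example and the function name is ours.

```python
def online_average(stream):
    """Running average computed with the recursive update in (4)-(5)."""
    x_hat = 0.0
    for k, x in enumerate(stream, start=1):
        step = 1.0 / k                      # epsilon_k = 1/k
        x_hat = x_hat + step * (x - x_hat)  # new = old + step * (new data - old)
    return x_hat

data = [2.0, 4.0, 6.0, 8.0]
print(online_average(data))  # 5.0, identical to sum(data) / len(data)
```

Note that only a single number is stored at any time, which is exactly the point of the online form.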
What is the connection between averages and the problem of computing expectations mentioned in Section 21.1.2.1? For this we can use the famous law of large numbers, which roughly states that the empirical average of a sequence of random variables converges to the expectation of the random variable.
Theorem 1 Strong law of large numbers (SLLN)
Suppose $\{X_k\}$ is a sequence of independent and identically distributed (i.i.d.) random variables. Then $\hat{X}_k \to \mu$ almost surely if and only if $E[|X_1|] < \infty$ and $E[X_1] = \mu$.
While the SLLN in Theorem 1 only applies to the i.i.d. case, extensions of the SLLN exist for Markov chains (which is what is actually needed in reinforcement learning); however, these are beyond the scope of the current textbook.
21.2 Q-learning
One of the first algorithms to gain popularity is the Q-learning algorithm. Q-learning can be viewed as a reinforcement learning approach to the value iteration algorithm for computing the optimal policy. Consider again the value iteration equation in (1). Re-writing the value function in terms of the Q-function and the expectation operator, we have
$$\begin{aligned}
Q_k(s,a) &= r(s,a) + \gamma \sum_{s'} P_{s,s'}(a)\, V_{k-1}(s') \\
&= r(s,a) + \gamma \sum_{s'} P_{s,s'}(a)\, \max_{a'} Q_{k-1}(s',a') \\
&= \sum_{s'} P_{s,s'}(a)\left[ r(s,a) + \gamma \max_{a'} Q_{k-1}(s',a') \right] \\
&= E\left[ r(s,a) + \gamma \max_{a'} Q_{k-1}(s',a') \,\middle|\, s, a \right]. \qquad (6)
\end{aligned}$$
As before, since we cannot compute the expectation in (6), we approximate it using iterative averaging of the form in (5), as follows:
$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \left( r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t) \right). \qquad (7)$$
Using the update in (7), the Q-learning algorithm can be written as follows (a short code sketch is given after the pseudocode):
1 Initialize $\hat{Q}(s,a) = 0 \;\; \forall s, a$.
2 for all episodes do
3   for $t = 1, \dots$ do
4     Obtain $s_t$
5     Obtain greedy policy: $\mu(s) = \arg\max_a \hat{Q}(s,a)$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
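As a rough sketch, the algorithm above can be expressed in Python as follows. The environment interface (reset(), step(), and the list env.actions) is an assumed Gym-style convention, not something defined in this chapter.

```python
from collections import defaultdict

def q_learning(env, num_episodes, gamma=0.9, step_size=0.1):
    """Tabular Q-learning using the update in (7) with greedy action selection."""
    Q = defaultdict(float)                  # Q[(s, a)], initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 5: greedy action with respect to the current estimates
            a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Steps 8-9: temporal difference and Q-value update
            td_target = r + gamma * max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += step_size * (td_target - Q[(s, a)])
            s = s_next
    return Q
```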
Figure 2: State transition diagram for the MDP in Example 1
21.2.1 Q-learning algorithm with exploration
The problem with Q-learning in Example 1 is that once we start to greedily select actions based on the estimated Q-function, we reinforce past errors and end up selecting only a (suboptimal) subset of actions. Therefore, exploration is required. There are many exploration strategies available in the literature. One of the simplest, yet most effective, strategies is the $\varepsilon$-greedy strategy. The idea is to behave greedily most of the time, but once in a while, i.e. with probability $\varepsilon$, to select randomly from the available actions with equal probability. The Q-learning algorithm with the $\varepsilon$-greedy strategy is provided in Algorithm 2; a short sketch of the action-selection step follows the pseudocode.
1 Initialize $\hat{Q}(s,a) = 0 \;\; \forall s, a$.
2 for all episodes do
3   for $t = 1, \dots$ do
4     Obtain $s_t$
5     Obtain policy:
      $\mu(s) = \begin{cases} \arg\max_a \hat{Q}(s,a) & \text{with probability } 1-\varepsilon \\ a \in A \text{ uniformly at random} & \text{with probability } \varepsilon \end{cases}$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
Algorithm 2: Q-learning with $\varepsilon$-greedy exploration
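Only the action-selection step changes relative to the previous sketch. A minimal ε-greedy selection helper, assuming Q is a dictionary keyed by (state, action), could look as follows:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Greedy action with probability 1 - epsilon, uniform random otherwise."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(state, a)])  # exploit
```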
The idea behind Monte Carlo methods arises from the definition of the value function as the
expected cumulative reward starting from any state, reproduced in Equation (8) for
convenience.
$$V_\mu(s) = E_\mu\left[ G_t \mid s_t = s \right], \qquad (8)$$
where $G_t$ is the cumulative reward (in the previous chapter, the notation $J_t$ was used instead of $G_t$), as defined in the previous chapter. Hence, given a policy $\mu$ and episodes generated from the policy $\mu$, an estimate of the value function at any state $s$ can be obtained by averaging the discounted cumulative reward obtained once state $s$ is visited. Based on how the rewards are computed, two algorithms can be formulated: Algorithm 3 is the first visit Monte Carlo algorithm, where cumulative rewards are computed starting only from the first time a state is visited; Algorithm 4 is the every visit Monte Carlo algorithm, where the average involves the cumulative reward every time after a state is visited.
1 Initialize empty lists: $V(s) \;\forall s$, $\mathrm{CumReturn}(s) \;\forall s$.
2 for each episode $s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T$ do
3   $G \leftarrow 0$
4   for each $t = T-1, \dots, 0$ (starting from the end of the episode) do
5     $G \leftarrow \gamma G + r_{t+1}$
6     Append $G$ to $\mathrm{CumReturn}(s_t)$
7     $V(s) = \text{Average}(\mathrm{CumReturn}(s))$
Algorithm 4: Every visit Monte Carlo
The difference between the two algorithms is that the extra step of checking whether a state has been visited previously (Step 6 in Algorithm 3) is avoided in Algorithm 4. A minimal code sketch of the every visit variant is given below.
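The following is a rough Python sketch of Algorithm 4; each episode is assumed to be a list of (state, action, reward) tuples, and the reward indexing convention is slightly simplified relative to the pseudocode.

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma=0.9):
    """Every visit Monte Carlo estimate of the value function."""
    cum_returns = defaultdict(list)        # CumReturn(s)
    for episode in episodes:
        G = 0.0
        # work backwards so G accumulates the discounted return from each state
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            cum_returns[state].append(G)   # every visit: record at every occurrence
    return {s: sum(g) / len(g) for s, g in cum_returns.items()}
```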
However, notice from policy iteration that the policy improvement step requires Q-function estimates. It is straightforward to estimate the Q-function values instead of the value function by keeping track of both the state and the action pairs during an episode. Algorithm 5 gives the pseudo-code for estimating the Q-function values in a first visit Monte Carlo setting.
1 Initialize empty lists: $Q(s,a) \;\forall s, a$, $\mathrm{CumReturn}(s,a) \;\forall s, a$.
2 for each episode $s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_T, a_T, r_T$ do
3   $G \leftarrow 0$
4   for each $t = T-1, \dots, 0$ (starting from the end of the episode) do
5     $G \leftarrow \gamma G + r_{t+1}$
Figure 5: Example 2. (a) State transition diagram; (b) single episode.
Earlier in this chapter, in Section 21.2, we introduced Q-learning; Q-learning can be viewed within the framework of TD learning. We start with the simple TD(0) learning algorithm in Section 21.4.1 and generalize to the TD($\lambda$) algorithm in Section 21.4.2.
Consider again the definition of the value function in Equation (8). The basic idea of TD (0)
arises from noticing the following relation:
$$V_\mu(s) = E_\mu\left[ G_t \mid s_t = s \right] = E_\mu\left[ r_t + \gamma V_\mu(s_{t+1}) \mid s_t = s \right]. \qquad (9)$$
As in Q-learning, the idea behind TD(0) is that, given an estimate of the value function, the estimate can be updated as follows:
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t) \right), \qquad (10)$$
5     $\delta_t = r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t)$
6     Update value function: $\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \delta_t$
The quantity $\delta_t$ in Step 5 of Algorithm 6, the difference between the one-step lookahead and the current estimate, is referred to as the temporal difference; hence the name temporal difference algorithm. A minimal code sketch of TD(0) policy evaluation follows.
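The sketch below assumes the same Gym-style environment interface as before and a policy given as a function from states to actions.

```python
from collections import defaultdict

def td0(env, policy, num_episodes, gamma=0.9, step_size=0.1):
    """TD(0) policy evaluation using the update in (10)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * V[s_next] - V[s]  # temporal difference (Step 5)
            V[s] += step_size * delta             # value update (Step 6)
            s = s_next
    return V
```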
The TD(0) algorithm uses a one-step prediction to update the value function. A natural extension is to use an $n$-step prediction to update the value function. Notice from (11) that as $n \to \infty$, the $n$-step prediction converges to the Monte Carlo estimate. Using (11) and replacing the one-step prediction in the update (10), we have the following update equation:
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma r_{t+1} + \cdots + \gamma^n r_{t+n} + \gamma^{n+1} \hat{V}_\mu(s_{t+n+1}) - \hat{V}_\mu(s_t) \right). \qquad (12)$$
The temporal difference $\delta_t^n$ (the superscript $n$ refers to the $n$-step prediction) can be written as follows:
$$\begin{aligned}
\delta_t^n &= \sum_{k=0}^{n} \gamma^k r_{t+k} + \gamma^{n+1}\hat{V}(s_{t+n+1}) - \hat{V}(s_t) \\
&= \sum_{k=0}^{n} \gamma^k r_{t+k} + \gamma^{n+1}\hat{V}(s_{t+n+1}) - \hat{V}(s_t) + \sum_{k=0}^{n-1} \gamma^{k+1}\hat{V}(s_{t+k+1}) - \sum_{k=0}^{n-1} \gamma^{k+1}\hat{V}(s_{t+k+1}) \\
&= \sum_{k=0}^{n} \gamma^k \delta_{t+k}, \qquad (13)
\end{aligned}$$
where, in the second step, we added and subtracted the last two sums, and the third step is obtained by re-arranging the terms (the discounted rewards and value differences telescope into one-step temporal differences). The algorithm for $n$-step TD is exactly the same as Algorithm 6, except that $\delta_t^n$ replaces $\delta_t$ in Step 5. A short numerical check of the identity in (13) is given below.
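As a sanity check of (13), the following snippet verifies numerically, for arbitrary illustrative numbers, that the n-step TD error equals the discounted sum of one-step TD errors:

```python
gamma, n = 0.9, 3
rewards = [1.0, 0.5, -0.2, 2.0]        # r_t, ..., r_{t+n}
values = [0.3, 0.1, 0.7, -0.4, 0.2]    # V(s_t), ..., V(s_{t+n+1})

# left-hand side of (13): the n-step TD error
lhs = (sum(gamma**k * rewards[k] for k in range(n + 1))
       + gamma**(n + 1) * values[n + 1] - values[0])

# right-hand side of (13): discounted sum of one-step TD errors
deltas = [rewards[k] + gamma * values[k + 1] - values[k] for k in range(n + 1)]
rhs = sum(gamma**k * deltas[k] for k in range(n + 1))

print(abs(lhs - rhs) < 1e-9)  # True
```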
What is the connection of TD($\lambda$) with the $n$-step prediction? The formal connection is through the use of eligibility traces; however, the topic is outside the scope of the current book. The interested reader is referred to Chapter 13 of Reinforcement Learning: An Introduction by Sutton and Barto. In short, in the TD($\lambda$) algorithm, the temporal difference error $\delta_t^n$ is weighted by $(1-\lambda)\lambda^n \;\forall n \geq 0$. Hence, the value function update is given by
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \sum_{k=0}^{\infty} (1-\lambda)\lambda^k \delta_t^k = \hat{V}_\mu(s_t) + \varepsilon_t \sum_{k=0}^{\infty} \gamma^k \lambda^k \delta_{t+k}, \qquad (14)$$
where the right-hand side is obtained by substituting for $\delta_t^n$ from (13). Therefore, $\lambda \to 0$ gives TD(0), and $\lambda \to 1$ gives Monte Carlo.
in the case of Monte Carlo), by keeping track of both the state and the action pair. For example, Steps 5 and 6 in the TD(0) algorithm can be replaced as follows:
5     $\delta_t = r_t + \gamma \hat{Q}(s_{t+1}, \mu(s_{t+1})) - \hat{Q}(s_t, a_t)$
6     Update Q-values: $\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \varepsilon_t \delta_t$
A minimal code sketch of this update is given below.
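A rough sketch of this Q-value version of the TD(0) update, with Q assumed to be a defaultdict(float) keyed by (state, action) and the policy a function from states to actions:

```python
def td0_q_update(Q, s, a, r, s_next, policy, gamma=0.9, step_size=0.1):
    """One TD(0) update for Q-values: bootstrap with the action the current
    policy would take in the next state (Steps 5-6 above)."""
    delta = r + gamma * Q[(s_next, policy(s_next))] - Q[(s, a)]
    Q[(s, a)] += step_size * delta
    return Q
```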
Using the estimated Q-values, the policy iteration algorithm with a greedy policy improvement step is as follows:
1 Start with some initial policy $\mu_0$
2 for $k = 0, 1, 2, \dots$ do
3   Obtain $\hat{Q}_{\mu_k}$
4   Policy improvement step: $\mu_{k+1}(s) = \arg\max_a \hat{Q}_{\mu_k}(s, a)$
Figure 6: MDP: Example of bias during Q-learning.
$$\hat{Q}_2(s_t, a_t) \leftarrow \hat{Q}_2(s_t, a_t) + \varepsilon_t \left( r_t + \gamma \hat{Q}_2(s_{t+1}, a^*) - \hat{Q}_2(s_t, a_t) \right) \qquad (15)$$
Of course, the roles of $Q_1$ and $Q_2$ in (15) can be interchanged. However, note that only one estimate is updated at each time step so as to preserve the independence between the two estimates. One way to decide is to randomly choose, with equal probability, which of the two to update. Putting all this together, the Q-learning algorithm with double learning is shown in Algorithm 9, followed by a short code sketch.
1 Initialize $\hat{Q}_1(s,a) = 0$, $\hat{Q}_2(s,a) = 0 \;\; \forall s, a$.
5   Obtain greedy policy: $\mu(s) = \arg\max_a \left[ \hat{Q}_1(s,a) + \hat{Q}_2(s,a) \right]$
10    $a^* = \arg\max_a \hat{Q}_1(s_{t+1}, a)$
11    $\delta_t = r_t + \gamma \hat{Q}_2(s_{t+1}, a^*) - \hat{Q}_2(s_t, a_t)$
12    Update Q2-values: $\hat{Q}_2(s_t, a_t) \leftarrow \hat{Q}_2(s_t, a_t) + \varepsilon_t \delta_t$
13  else
14    $a^* = \arg\max_a \hat{Q}_2(s_{t+1}, a)$
15    $\delta_t = r_t + \gamma \hat{Q}_1(s_{t+1}, a^*) - \hat{Q}_1(s_t, a_t)$
16    Update Q1-values: $\hat{Q}_1(s_t, a_t) \leftarrow \hat{Q}_1(s_t, a_t) + \varepsilon_t \delta_t$
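The coin-flip structure of Algorithm 9 can be summarized in a short Python sketch of a single update step; Q1 and Q2 are assumed to be defaultdict(float) tables keyed by (state, action), and the discount factor is included explicitly.

```python
import random

def double_q_step(Q1, Q2, s, a, r, s_next, actions, gamma=0.9, step_size=0.1):
    """One double Q-learning update (Steps 10-16 of Algorithm 9)."""
    if random.random() < 0.5:
        # update Q2: Q1 selects the greedy next action, Q2 evaluates it
        a_star = max(actions, key=lambda act: Q1[(s_next, act)])
        delta = r + gamma * Q2[(s_next, a_star)] - Q2[(s, a)]
        Q2[(s, a)] += step_size * delta
    else:
        # update Q1: Q2 selects the greedy next action, Q1 evaluates it
        a_star = max(actions, key=lambda act: Q2[(s_next, act)])
        delta = r + gamma * Q1[(s_next, a_star)] - Q1[(s, a)]
        Q1[(s, a)] += step_size * delta
```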
where $w$ is the weight vector and $x(s)$ is the $d$-dimensional feature vector for state $s$.
• Artificial neural network (ANN): For example, Figure 7 (left) shows a single-hidden-layer neural network, which can be written as
$$\hat{V}_{W,b}(s) = \sigma\left(W x(s) + b\right),$$
where $W$ is the weight matrix, $x(s)$ is the $d$-dimensional feature vector for state $s$, $b$ is the bias vector and $\sigma(\cdot)$ is the non-linear activation function. We can also use a deep neural network for function approximation, which can be written as
$$\hat{V}_{\{W_i\},\{b_i\}}(s) = \sigma\left(W_1\, \sigma\left(W_2 \cdots \sigma\left(W_k x(s) + b_k\right) \cdots + b_2\right) + b_1\right),$$
From here on, we will denote by $\theta$ the hyper-parameters of the function approximator. For example, for the single layer neural network in Figure 7, the hyper-parameters would be the weight matrix and the bias vector, i.e. $\theta = \{W, b\}$, and the value function parameterized by the hyper-parameters will be denoted by $\hat{V}_\theta(s)$.
Figure 7: Artificial neural networks. The figure on the left shows a single-hidden-layer neural network and the figure on the right shows a deep neural network.
The next component that we require for a function approximator is the loss function. Consider again the TD learning equation in (9), reproduced here for convenience:
$$V_\mu(s) = E_\mu\left[ r_t + \gamma V_\mu(s_{t+1}) \mid s_t = s \right], \qquad (16)$$
which gives the following update for the TD algorithm (again, reproduced here for convenience):
$$\hat{V}_\mu(s_t) \leftarrow \hat{V}_\mu(s_t) + \varepsilon_t \left( r_t + \gamma \hat{V}_\mu(s_{t+1}) - \hat{V}_\mu(s_t) \right). \qquad (17)$$
Equations (16) and (17) can be interpreted as updating the current value function $\hat{V}_\mu(s_t)$ towards the "target" $r_t + \gamma \hat{V}_\mu(s_{t+1})$. Hence, the loss function can be defined as
$$\mathrm{loss}^{TD}_\theta(s, a, s') = \left( \hat{V}_\mu(s) - \mathrm{target}(s', a) \right)^2. \qquad (18)$$
The next question is how to optimize the hyper-parameters so as to minimize the loss function. We follow ideas from machine learning and choose to optimize the hyper-parameters using a gradient-based algorithm as follows:
5     Obtain greedy policy: $\mu(s) = \arg\max_a \hat{Q}(s, a)$
8     $\delta_t = r_t + \gamma \max_a \hat{Q}(s_{t+1}, a) - \hat{Q}(s_t, a_t)$
9     Update hyper-parameters:
      $\theta_t \leftarrow \theta_t - \varepsilon_t \nabla_\theta\, \mathrm{loss}_\theta(s_t, a_t, s_{t+1}) = \theta_t + \varepsilon_t \delta_t \nabla_\theta Q_\theta(s_t, a_t)$
(the factor of 2 from the squared loss is absorbed into $\varepsilon_t$). A minimal sketch of this update with a linear approximator is given below.
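As a concrete instance of Step 9, the following sketch performs the semi-gradient update with a linear approximator Q_theta(s, a) = theta^T x(s, a); the feature map x(s, a) and its dimension are assumptions made only for illustration.

```python
import numpy as np

def semi_gradient_q_step(theta, x, s, a, r, s_next, actions,
                         gamma=0.99, step_size=0.01):
    """One semi-gradient Q-update; for a linear model, grad_theta Q = x(s, a)."""
    q_sa = theta @ x(s, a)
    q_next = max(theta @ x(s_next, act) for act in actions)
    delta = r + gamma * q_next - q_sa            # Step 8: TD error
    return theta + step_size * delta * x(s, a)   # Step 9: parameter update
```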
21.6.1.1 Issues with DQN
Most of these issues are common across schemes that use a 'large' function approximator.
1. Exploration: In order to obtain good values of the hyper-parameters, all the states and actions should be visited 'often'. However, depending on the transition function and the reward function, this may not always be the case. Hence, $Q_\theta(s, a)$ may have varying levels of accuracy across different $s$ and $a$. A classical methodology to alleviate this problem is to introduce exploration strategies such as the $\varepsilon$-greedy exploration strategy.
2. Stability: In contrast to classical supervised learning algorithms, in the loss function (for example, in (19)) the 'true' label, i.e. the target, is itself a function of the hyper-parameters, which can make the training process unstable. There are two procedures to stabilize the training process (a short code sketch of both follows this list):
– Experience replay: Past transitions are stored in a buffer and mini-batches are sampled from it for each gradient step. This mini-batch sampling reduces the correlation between the samples used to update the parameters and the decisions made from those updates.
– Lazy update of the target network: Consider creating a "target Q-network" with parameters $\phi$, i.e. $Q_\phi(s, a)$. Then the loss function in (19) can be replaced by
$$\mathrm{loss}^{Q}_\theta(s, a, s') = \left( Q_\theta(s, a) - \left( r(s,a) + \gamma \max_{a'} Q_\phi(s', a') \right) \right)^2.$$
The hyper-parameters of the "online network" (the original network with hyper-parameters $\theta$) are updated as in Step 9 of Algorithm 10. This reduces the instability that arises when a small change in the hyper-parameters $\theta$ also shifts the target, causing the loss function to differ significantly from its previous value. In order to remain consistent, we set $\phi \leftarrow \theta$ every, say, few thousand steps.
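Below is a brief, self-contained sketch of both stabilization devices; the parameter vectors and the "gradient step" are stand-ins, since the actual network and loss depend on the architecture chosen above.

```python
import random
from collections import deque
import numpy as np

class ReplayBuffer:
    """Fixed-capacity store of transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # uniform sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

# Lazy target-network update: the target parameters phi are refreshed from the
# online parameters theta only every `copy_every` gradient steps.
theta = np.zeros(8)                              # online parameters (illustrative size)
phi = theta.copy()                               # target parameters
copy_every = 1000
for step in range(1, 5001):
    theta = theta + 0.001 * np.random.randn(8)   # stand-in for a gradient step on the loss
    if step % copy_every == 0:
        phi = theta.copy()                       # phi <- theta
```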
21.6.2.1 Double DQN
In double DQN, the idea is to use double learning along with DQN. It is well known that DQN overestimates the Q-values (due to the maximization bias inherent in reinforcement learning algorithms) and hence results in sub-optimal policies. Using the double learning idea along with DQN is known to largely alleviate this issue.
Rather than introduce a new network for double learning, we can re-use the target network in DQN to achieve double learning, as shown below:
$$\mathrm{loss}^{Q}_\theta(s, a, s') = \left( Q_\theta(s, a) - \left( r(s,a) + \gamma\, Q_\phi\!\left(s', \arg\max_{a'} Q_\theta(s', a')\right) \right) \right)^2, \qquad (22)$$
i.e. the online network selects the greedy action while the target network estimates its value. A small helper that computes this target is sketched below.
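In the sketch, q_online and q_target are assumed to map a state to a vector of Q-values, one entry per action; the names are ours.

```python
import numpy as np

def double_dqn_target(q_online, q_target, s_next, r, gamma=0.99):
    """Online network selects the greedy action, target network evaluates it."""
    a_star = int(np.argmax(q_online(s_next)))     # action selection: online network
    return r + gamma * q_target(s_next)[a_star]   # action evaluation: target network
```

The loss for a sampled transition is then the squared difference between Q_theta(s, a) and this target, as in (22).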
$$P(t) = \frac{|\omega_t|^\alpha}{\sum_t |\omega_t|^\alpha}, \qquad (24)$$
where $\alpha$ is a hyper-parameter that controls the sampling. As $\alpha$ goes to $0$, we get uniform sampling, and as $\alpha \to 1$, more priority is assigned to samples with higher training error (or "more informative" samples). A modified version of prioritized sampling, given in (25), has an extra term $\varepsilon$, a small positive constant, which ensures that each experience has a nonzero probability of being sampled:
$$P(t) = \frac{\left(|\omega_t| + \varepsilon\right)^\alpha}{\sum_t \left(|\omega_t| + \varepsilon\right)^\alpha}. \qquad (25)$$
However, prioritized sampling of the experiences introduces bias into the training process. Importance sampling weights can be used to correct for this bias; however, these are beyond the scope of the current textbook. A small sketch of the sampling probabilities in (24)-(25) is given below.
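The sketch below takes the TD error magnitude as the priority ω_t; the constant names and values are our assumptions.

```python
import numpy as np

def priority_probabilities(td_errors, alpha=0.6, eps=1e-3):
    """Probabilities proportional to (|TD error| + eps)^alpha, as in (25)."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

# sample one buffer index according to the priorities
errors = np.array([0.1, 2.0, 0.5, 0.0])
probs = priority_probabilities(errors)
index = np.random.choice(len(errors), p=probs)
```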
where, in (26), the Q-function is written as the sum of the value function and a new function referred to as the advantage function. The advantage function characterizes the "advantage" that we get from selecting an action; note that $A(s, a)$ can be either positive or negative. The accompanying figure shows the architecture of the dueling DQN.
value function network redundant. A "trick" to prevent this from happening is to re-write (26) as follows:
$$Q(s, a) = V(s) + A(s, a) - \frac{1}{|A|} \sum_{a' \in A} A(s, a'). \qquad (27)$$
This effectively forces the mean (across the action space) of the advantage function to zero. With these modifications, the network achieves faster convergence and performance comparable to the DQN architecture. A small sketch of this aggregation step is given below.
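A one-line sketch of the aggregation in (27), with illustrative head outputs for a three-action state:

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), as in (27)."""
    return value + advantages - advantages.mean()

print(dueling_q_values(1.5, np.array([0.2, -0.1, 0.5])))  # [1.5  1.2  1.8]
```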
21.7 Exercises
Ex 1. Consider an MDP with four states $\{A, B, C, D\}$ shown in Figure 4. In each state, we can perform either action $a$ or action $b$. The figure shows the state transition function and the reward obtained for each action. Consider the Q-learning algorithm with $\gamma = 0.9$ (discount factor) and step size $\alpha = 1$. Find the estimated Q-values if the MDP begins in state $A$ and performs the actions $\{a, a, b, a, b, a\}$. Assume that the Q-learning algorithm is initialized with zero values.