Deep Reinforcement Learning
with Python
Second Edition
Sudharsan Ravichandiran
BIRMINGHAM - MUMBAI
Deep Reinforcement Learning with Python
Second Edition
Copyright © 2020 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing or its dealers and distributors, will be held liable for any damages caused
or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
ISBN 978-1-83921-068-6
www.packt.com
To my adorable mom, Kasthuri, and to my beloved dad, Ravichandiran.
- Sudharsan Ravichandiran
packt.com
Subscribe to our online digital library for full access to over 7,000 books and videos,
as well as industry leading tools to help you plan your personal development and
advance your career. For more information, please visit our website.
Why subscribe?
• Spend less time learning and more time coding with practical eBooks and
Videos from over 4,000 industry professionals
• Learn better with Skill Plans built especially for you
• Get a free eBook or video every month
• Fully searchable for easy access to vital information
• Copy and paste, print, and bookmark content
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.Packt.com
and as a print book customer, you are entitled to a discount on the eBook copy. Get
in touch with us at customercare@packtpub.com for more details.
At www.Packt.com, you can also read a collection of free technical articles, sign up for
a range of free newsletters, and receive exclusive discounts and offers on Packt books
and eBooks.
Contributors
Valerii Babushkin is the senior director of data science at X5 Retail Group, where
he leads a team of 100+ people in the area of natural language processing, machine
learning, computer vision, data analysis, and A/B testing. Valerii is a Kaggle
competitions Grand Master, ranking globally in the top 30. He studied cybernetics
at Moscow Polytechnical University and mechatronics at Karlsruhe University
of Applied Sciences and has worked with Packt as an author of the Python Machine
Learning Tips, Tricks, and Techniques course and a technical reviewer for some books
on reinforcement learning.
Table of Contents
Preface xv
Chapter 1: Fundamentals of Reinforcement Learning 1
Key elements of RL 2
Agent 2
Environment 2
State and action 2
Reward 3
The basic idea of RL 3
The RL algorithm 5
RL agent in the grid world 5
How RL differs from other ML paradigms 9
Markov Decision Processes 10
The Markov property and Markov chain 11
The Markov Reward Process 13
The Markov Decision Process 13
Fundamental concepts of RL 16
Math essentials 16
Expectation 16
Action space 18
Policy 19
Deterministic policy 20
Stochastic policy 20
Episode 23
Episodic and continuous tasks 25
Horizon 25
Return and discount factor 26
Small discount factor 27
Toy text 81
Algorithms 81
Environment synopsis 82
Summary 82
Questions 83
Further reading 83
Chapter 3: The Bellman Equation and Dynamic Programming 85
The Bellman equation 86
The Bellman equation of the value function 86
The Bellman equation of the Q function 90
The Bellman optimality equation 93
The relationship between the value and Q functions 95
Dynamic programming 97
Value iteration 97
The value iteration algorithm 99
Solving the Frozen Lake problem with value iteration 107
Policy iteration 115
Algorithm – policy iteration 118
Solving the Frozen Lake problem with policy iteration 125
Is DP applicable to all environments? 129
Summary 130
Questions 131
Chapter 4: Monte Carlo Methods 133
Understanding the Monte Carlo method 134
Prediction and control tasks 135
Prediction task 135
Control task 135
Monte Carlo prediction 136
MC prediction algorithm 140
Types of MC prediction 144
First-visit Monte Carlo 145
Every-visit Monte Carlo 146
Implementing the Monte Carlo prediction method 147
Understanding the blackjack game 147
The blackjack environment in the Gym library 158
Every-visit MC prediction with the blackjack game 160
First-visit MC prediction with the blackjack game 166
Incremental mean updates 167
MC prediction (Q function) 168
Monte Carlo control 170
MC control algorithm 172
On-policy Monte Carlo control 174
Summary 622
Questions 623
Further reading 623
Chapter 16: Deep Reinforcement Learning with Stable Baselines 625
Installing Stable Baselines 626
Creating our first agent with Stable Baselines 626
Evaluating the trained agent 627
Storing and loading the trained agent 627
Viewing the trained agent 628
Putting it all together 629
Vectorized environments 629
SubprocVecEnv 630
DummyVecEnv 631
Integrating custom environments 631
Playing Atari games with a DQN and its variants 632
Implementing DQN variants 633
Lunar lander using A2C 634
Creating a custom network 635
Swinging up a pendulum using DDPG 636
Viewing the computational graph in TensorBoard 637
Training an agent to walk using TRPO 639
Installing the MuJoCo environment 640
Implementing TRPO 643
Recording the video 646
Training a cheetah bot to run using PPO 648
Making a GIF of a trained agent 649
Implementing GAIL 651
Summary 652
Questions 652
Further reading 653
Chapter 17: Reinforcement Learning Frontiers 655
Meta reinforcement learning 656
Model-agnostic meta learning 657
Understanding MAML 660
MAML in a supervised learning setting 663
MAML in a reinforcement learning setting 665
Hierarchical reinforcement learning 668
MAXQ value function Decomposition 668
Imagination augmented agents 672
Summary 676
Questions 677
Further reading 677
Appendix 1 – Reinforcement Learning Algorithms 679
Reinforcement learning algorithm 679
Value Iteration 679
Policy Iteration 680
First-Visit MC Prediction 680
Every-Visit MC Prediction 681
MC Prediction – the Q Function 681
MC Control Method 682
On-Policy MC Control – Exploring starts 683
On-Policy MC Control – Epsilon-Greedy 683
Off-Policy MC Control 684
TD Prediction 685
On-Policy TD Control – SARSA 685
Off-Policy TD Control – Q Learning 686
Deep Q Learning 686
Double DQN 687
REINFORCE Policy Gradient 688
Policy Gradient with Reward-To-Go 688
REINFORCE with Baseline 689
Advantage Actor Critic 689
Asynchronous Advantage Actor-Critic 690
Deep Deterministic Policy Gradient 690
Twin Delayed DDPG 691
Soft Actor-Critic 692
Trust Region Policy Optimization 693
PPO-Clipped 694
PPO-Penalty 695
Categorical DQN 695
Distributed Distributional DDPG 697
DAgger 698
Deep Q learning from demonstrations 698
MaxEnt Inverse Reinforcement Learning 699
MAML in Reinforcement Learning 700
Appendix 2 – Assessments 701
Chapter 1 – Fundamentals of Reinforcement Learning 701
Chapter 2 – A Guide to the Gym Toolkit 702
Chapter 3 – The Bellman Equation and Dynamic Programming 702
Chapter 4 – Monte Carlo Methods 703
Preface
With significant enhancement in the quality and quantity of algorithms in recent
years, this second edition of Hands-On Reinforcement Learning with Python has been
revamped into an example-rich guide to learning state-of-the-art reinforcement
learning (RL) and deep RL algorithms with TensorFlow 2 and the OpenAI Gym
toolkit.
The book has several new chapters dedicated to new RL techniques including
distributional RL, imitation learning, inverse RL, and meta RL. You will learn
to leverage Stable Baselines, an improvement over OpenAI's Baselines library, to
implement popular RL algorithms effortlessly. The book concludes with an overview
of promising approaches such as meta-learning and imagination augmented agents
in research.
Chapter 2, A Guide to the Gym Toolkit, provides a complete guide to OpenAI's Gym
toolkit. We will understand several interesting environments provided by Gym in
detail by implementing them. We will begin our hands-on RL journey from this
chapter by implementing several fundamental RL concepts using Gym.
Chapter 3, The Bellman Equation and Dynamic Programming, will help us understand
the Bellman equation in detail with extensive math. Next, we will learn two
interesting classic RL algorithms called the value and policy iteration methods,
which we can use to find the optimal policy. We will also see how to implement
value and policy iteration methods for solving the Frozen Lake problem.
Chapter 4, Monte Carlo Methods, explains the model-free method, Monte Carlo.
We will learn what prediction and control tasks are, and then we will look into
Monte Carlo prediction and Monte Carlo control methods in detail. Next, we will
implement the Monte Carlo method to solve the blackjack game using the Gym
toolkit.
Chapter 5, Understanding Temporal Difference Learning, deals with one of the most
popular and widely used model-free methods called Temporal Difference (TD)
learning. First, we will learn how the TD prediction method works in detail, and then
we will explore the on-policy TD control method called SARSA and the off-policy
TD control method called Q learning in detail. We will also implement TD control
methods to solve the Frozen Lake problem using Gym.
Chapter 6, Case Study – The MAB Problem, explains one of the classic problems in
RL called the multi-armed bandit (MAB) problem. We will start the chapter by
understanding what the MAB problem is and then we will learn about several
exploration strategies such as epsilon-greedy, softmax exploration, upper confidence
bound, and Thompson sampling methods for solving the MAB problem in detail.
Chapter 8, A Primer on TensorFlow, deals with one of the most popular deep learning
libraries called TensorFlow. We will understand how to use TensorFlow by
implementing a neural network to recognize handwritten digits. Next, we will learn
to perform several math operations using TensorFlow. Later, we will learn about
TensorFlow 2.0 and see how it differs from the previous TensorFlow versions.
Chapter 9, Deep Q Network and Its Variants, enables us to kick-start our deep RL
journey. We will learn about one of the most popular deep RL algorithms called the
Deep Q Network (DQN). We will understand how DQN works step by step along
with the extensive math. We will also implement a DQN to play Atari games. Next,
we will explore several interesting variants of DQN, called Double DQN, Dueling
DQN, DQN with prioritized experience replay, and DRQN.
Chapter 10, Policy Gradient Method, covers policy gradient methods. We will
understand how the policy gradient method works along with the detailed
derivation. Next, we will learn several variance reduction methods such as policy
gradient with reward-to-go and policy gradient with baseline. We will also
understand how to train an agent for the Cart Pole balancing task using policy
gradient.
Chapter 11, Actor-Critic Methods – A2C and A3C, deals with several interesting actor-
critic methods such as advantage actor-critic and asynchronous advantage actor-
critic. We will learn how these actor-critic methods work in detail, and then we will
implement them for a mountain car climbing task using OpenAI Gym.
Chapter 12, Learning DDPG, TD3, and SAC, covers state-of-the-art deep RL algorithms
such as deep deterministic policy gradient, twin delayed DDPG, and soft actor-critic,
along with their step-by-step derivations. We will also learn how to implement the DDPG
algorithm for performing the inverted pendulum swing-up task using Gym.
Chapter 13, TRPO, PPO, and ACKTR Methods, deals with several popular policy
gradient methods such as TRPO and PPO. We will dive into the math behind TRPO
and PPO step by step and understand how TRPO and PPO help an agent find the
optimal policy. Next, we will learn to implement PPO for performing the inverted
pendulum swing-up task. At the end, we will learn about the actor-critic method
called Actor-Critic using Kronecker-Factored Trust Region (ACKTR) in detail.
Chapter 15, Imitation Learning and Inverse RL, explains imitation and inverse RL
algorithms. First, we will understand how supervised imitation learning, DAgger,
and deep Q learning from demonstrations work in detail. Next, we will learn
about maximum entropy inverse RL. At the end of the chapter, we will learn about
generative adversarial imitation learning.
Chapter 16, Deep Reinforcement Learning with Stable Baselines, helps us to understand
how to implement deep RL algorithms using a library called Stable Baselines. We
will learn what Stable Baselines is and how to use it in detail by implementing
several interesting deep RL algorithms such as DQN, A2C, DDPG, TRPO, and PPO.
Chapter 17, Reinforcement Learning Frontiers, covers several interesting avenues in RL,
such as meta RL, hierarchical RL, and imagination augmented agents in detail.
To get the most out of this book, you will need the following:
• Anaconda
• Python
• Any web browser
Once the file is downloaded, please make sure that you unzip or extract the folder
using the latest version of your archive extraction tool.
The code bundle for the book is also hosted on GitHub at https://github.com/
PacktPublishing/Deep-Reinforcement-Learning-with-Python. We also have other
code bundles from our rich catalog of books and videos available at https://github.
com/PacktPublishing/. Check them out!
Conventions used
There are a number of text conventions used throughout this book.
CodeInText: Indicates code words in text, database table names, folder names,
filenames, file extensions, pathnames, dummy URLs, user input, and Twitter
handles. For example: "epsilon_greedy computes the optimal policy."
def epsilon_greedy(epsilon):
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q)
When we wish to draw your attention to a particular part of a code block, the
relevant lines or items are highlighted:
def epsilon_greedy(epsilon):
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q)
Bold: Indicates a new term, an important word, or words that you see on the screen,
for example, in menus or dialog boxes. For example:
"The Markov Reward Process (MRP) is an extension of the Markov chain with the
reward function."
Get in touch
Feedback from our readers is always welcome.
General feedback: Email feedback@packtpub.com, and mention the book's title in the
subject of your message. If you have questions about any aspect of this book, please
email us at questions@packtpub.com.
Errata: Although we have taken every care to ensure the accuracy of our content,
mistakes do happen. If you have found a mistake in this book, we would be grateful
if you would report this to us. Please visit http://www.packtpub.com/submit-errata,
select your book, click on the Errata Submission Form link, and enter the details.
Piracy: If you come across any illegal copies of our works in any form on the
Internet, we would be grateful if you would provide us with the location address
or website name. Please contact us at copyright@packtpub.com with a link to the
material.
If you are interested in becoming an author: If there is a topic that you have
expertise in and you are interested in either writing or contributing to a book,
please visit http://authors.packtpub.com.
Reviews
Please leave a review. Once you have read and used this book, why not leave a
review on the site that you purchased it from? Potential readers can then see and use
your unbiased opinion to make purchase decisions, we at Packt can understand what
you think about our products, and our authors can see your feedback on their book.
Thank you!
1
Fundamentals of Reinforcement Learning
Reinforcement Learning (RL) is one of the areas of Machine Learning (ML). Unlike
other ML paradigms, such as supervised and unsupervised learning, RL works in a
trial and error fashion by interacting with its environment. In this chapter, we will
cover the following topics:
• Key elements of RL
• The basic idea of RL
• The RL algorithm
• How RL differs from other ML paradigms
• Markov Decision Processes
• Fundamental concepts of RL
• Applications of RL
• RL glossary
We will begin the chapter by defining the key elements of RL, which will help explain
the basic idea of RL.
Key elements of RL
Let's begin by understanding some key elements of RL.
Agent
An agent is a software program that learns to make intelligent decisions. We can
say that an agent is a learner in the RL setting. For instance, a chess player can be
considered an agent since the player learns to make the best moves (decisions) to win
the game. Similarly, Mario in a Super Mario Bros video game can be considered an
agent since Mario explores the game and learns to make the best moves in the game.
Environment
The environment is the world of the agent. The agent stays within the environment.
For instance, coming back to our chess game, a chessboard is called the environment
since the chess player (agent) learns to play the game of chess within the chessboard
(environment). Similarly, in Super Mario Bros, the world of Mario is called the
environment.
The agent interacts with the environment and moves from one state to another
by performing an action. In the chess game environment, the action is the move
performed by the player (agent). The action is usually denoted by a.
Reward
We learned that the agent interacts with an environment by performing an action
and moves from one state to another. Based on the action, the agent receives a
reward. A reward is nothing but a numerical value, say, +1 for a good action and -1
for a bad action. How do we decide if an action is good or bad?
In our chess game example, if the agent makes a move in which it takes one of the
opponent's chess pieces, then it is considered a good action and the agent receives
a positive reward. Similarly, if the agent makes a move that leads to the opponent
taking the agent's chess piece, then it is considered a bad action and the agent
receives a negative reward. The reward is denoted by r.
Similarly, in an RL setting, we will not teach the agent what to do or how to do it;
instead, we will give a reward to the agent for every action it does. We will give
a positive reward to the agent when it performs a good action and we will give a
negative reward to the agent when it performs a bad action. The agent begins by
performing a random action and if the action is good, we then give the agent a
positive reward so that the agent understands it has performed a good action and it
will repeat that action. If the action performed by the agent is bad, then we will give
the agent a negative reward so that the agent will understand it has performed a bad
action and it will not repeat that action.
Thus, RL can be viewed as a trial and error learning process where the agent tries out
different actions and learns the good action, which gives a positive reward.
Consider the analogy of training a dog to catch a ball: rather than teaching the dog
explicitly, we throw the ball, and whenever the dog catches it, we give it a cookie. In
this analogy, the dog represents the agent, and giving a cookie to the dog upon it
catching the ball is a positive reward, while not giving a cookie is a negative
reward. So, the dog (agent) explores different actions, which are catching the ball
and not catching the ball, and understands that catching the ball is a good action as it
brings the dog a positive reward (getting a cookie).
Let's further explore the idea of RL with one more simple example. Let's suppose we
want to teach a robot (agent) to walk without hitting a mountain, as Figure 1.1 shows:
We will not teach the robot explicitly to not go in the direction of the mountain.
Instead, if the robot hits the mountain and gets stuck, we give the robot a negative
reward, say -1. So, the robot will understand that hitting the mountain is the wrong
action, and it will not repeat that action:
Similarly, when the robot walks in the right direction without hitting the mountain,
we give the robot a positive reward, say +1. So, the robot will understand that not
hitting the mountain is a good action, and it will repeat that action:
Thus, in the RL setting, the agent explores different actions and learns the best action
based on the reward it gets.
Now that we have a basic idea of how RL works, in the upcoming sections, we will
go into more detail and also learn the important concepts involved in RL.
The RL algorithm
The steps involved in a typical RL algorithm are as follows:
1. The agent interacts with the environment by performing an action in its current state.
2. By performing the action, the agent moves from one state to the next.
3. The agent then receives a reward based on the action it performed.
4. Based on the reward, the agent understands whether the action was good or bad.
5. If the action was good, that is, if the agent received a positive reward, then the agent will prefer performing that action; otherwise, it will try a different action in search of a positive reward.
RL is basically a trial and error learning process. Now, let's revisit our chess game
example. The agent (software program) is the chess player. So, the agent interacts
with the environment (chessboard) by performing an action (moves). If the agent
gets a positive reward for an action, then it will prefer performing that action; else it
will find a different action that gives a positive reward.
Ultimately, the goal of the agent is to maximize the reward it gets. If the agent
receives a good reward, then it means it has performed a good action. If the agent
performs a good action, then it implies that it can win the game. Thus, the agent
learns to win the game by maximizing the reward.
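To make these steps concrete, here is a minimal sketch of the agent-environment interaction loop using the Gym toolkit, which we cover in Chapter 2. The environment ID (FrozenLake-v0) and the exact reset/step signatures are assumptions that depend on your Gym version, and the agent below simply acts at random rather than learning:

import gym

# A minimal sketch of the agent-environment loop (not the book's code).
# FrozenLake-v0 and the 4-tuple returned by step() assume an older Gym release.
env = gym.make("FrozenLake-v0")

state = env.reset()
done = False

while not done:
    # The agent selects an action in the current state (random here).
    action = env.action_space.sample()
    # The environment returns the next state and a reward for that action.
    next_state, reward, done, info = env.step(action)
    # A learning agent would now use the reward to improve its behavior.
    state = next_state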
RL agent in the grid world
Let's reinforce our understanding of RL with a simple grid world environment.
The positions A to I in the environment are called the states of the environment.
The goal of the agent is to reach state I by starting from state A without visiting
the shaded states (B, C, G, and H). Thus, in order to achieve the goal, whenever
our agent visits a shaded state, we will give a negative reward (say -1) and when it
visits an unshaded state, we will give a positive reward (say +1). The actions in the
environment are moving up, down, right and left. The agent can perform any of these
four actions to reach state I from state A.
The first time the agent interacts with the environment (the first iteration), the agent
is unlikely to perform the correct action in each state, and thus it receives a negative
reward. That is, in the first iteration, the agent performs a random action in each
state, and this may lead the agent to receive a negative reward. But over a series of
iterations, the agent learns to perform the correct action in each state through the
reward it obtains, helping it achieve the goal. Let us explore this in detail.
Iteration 1
As we learned, in the first iteration, the agent performs a random action in each state.
For instance, look at the following figure. In the first iteration, the agent moves right
from state A and reaches the new state B. But since B is the shaded state, the agent
will receive a negative reward and so the agent will understand that moving right is
not a good action in state A. When it visits state A next time, it will try out a different
action instead of moving right:
As Figure 1.5 shows, from state B, the agent moves down and reaches the new state
E. Since E is an unshaded state, the agent will receive a positive reward, so the agent
will understand that moving down from state B is a good action.
From state E, the agent moves right and reaches state F. Since F is an unshaded state,
the agent receives a positive reward, and it will understand that moving right from
state E is a good action. From state F, the agent moves down and reaches the goal
state I and receives a positive reward, so the agent will understand that moving
down from state F is a good action.
Iteration 2
In the second iteration, from state A, instead of moving right, the agent tries out a
different action as the agent learned in the previous iteration that moving right is not
a good action in state A.
Thus, as Figure 1.6 shows, in this iteration the agent moves down from state A and
reaches state D. Since D is an unshaded state, the agent receives a positive reward
and now the agent will understand that moving down is a good action in state A:
As shown in the preceding figure, from state D, the agent moves down and reaches
state G. But since G is a shaded state, the agent will receive a negative reward and
so the agent will understand that moving down is not a good action in state D, and
when it visits state D next time, it will try out a different action instead of moving
down.
From G, the agent moves right and reaches state H. Since H is a shaded state, it will
receive a negative reward and understand that moving right is not a good action in
state G.
From H it moves right and reaches the goal state I and receives a positive reward, so
the agent will understand that moving right from state H is a good action.
Iteration 3
In the third iteration, the agent moves down from state A since, in the second
iteration, our agent learned that moving down is a good action in state A. So, the
agent moves down from state A and reaches the next state, D, as Figure 1.7 shows:
Now, from state D, the agent tries a different action instead of moving down since in
the second iteration our agent learned that moving down is not a good action in state
D. So, in this iteration, the agent moves right from state D and reaches state E.
From state E, the agent moves right as the agent already learned in the first iteration
that moving right from state E is a good action and reaches state F.
Now, from state F, the agent moves down since the agent learned in the first iteration
that moving down is a good action in state F, and reaches the goal state I.
Figure 1.8: The agent reaches the goal state without visiting the shaded states
As we can see, our agent has successfully learned to reach the goal state I from state
A without visiting the shaded states based on the rewards.
In this way, the agent will try out different actions in each state and understand
whether an action is good or bad based on the reward it obtains. The goal of the
agent is to maximize rewards. So, the agent will always try to perform good actions
that give a positive reward, and when the agent performs good actions in each state,
then it ultimately leads the agent to achieve the goal.
Note that these iterations are called episodes in RL terminology. We will learn more
about episodes later in the chapter.
How RL differs from other ML paradigms
We can categorize machine learning into the following types:
• Supervised learning
• Unsupervised learning
• RL
In supervised learning, the machine learns from training data. The training data
consists of a labeled pair of inputs and outputs. So, we train the model (agent)
using the training data in such a way that the model can generalize its learning to
new unseen data. It is called supervised learning because the training data acts as a
supervisor, since it has a labeled pair of inputs and outputs, and it guides the model
in learning the given task.
Now, let's understand the difference between supervised and reinforcement learning
with an example. Consider the dog analogy we discussed earlier in the chapter. In
supervised learning, to teach the dog to catch a ball, we will teach it explicitly by
specifying turn left, go right, move forward seven steps, catch the ball, and so on
in the form of training data. But in RL, we just throw a ball, and every time the dog
catches the ball, we give it a cookie (reward). So, the dog will learn to catch the ball
while trying to maximize the cookies (reward) it can get.
Let's consider one more example. Say we want to train the model to play chess using
supervised learning. In this case, we will have training data that includes all the
moves a player can make in each state, along with labels indicating whether it is a
good move or not. Then, we train the model to learn from this training data, whereas
in the case of RL, our agent will not be given any sort of training data; instead, we
just give a reward to the agent for each action it performs. Then, the agent will learn
by interacting with the environment and, based on the reward it gets, it will choose
its actions.
With RL, the agent constantly receives feedback from the user. This feedback
represents rewards (a reward could be ratings the user has given for a movie they
have watched, time spent watching a movie, time spent watching trailers, and so on).
Based on the rewards, an RL agent will understand the movie preference of the user
and then suggest new movies accordingly.
Since the RL agent is learning with the aid of rewards, it can understand if the user's
movie preference changes and suggest new movies according to the user's changed
movie preference dynamically.
Thus, we can say that in both supervised and unsupervised learning the model
(agent) learns based on the given training dataset, whereas in RL the agent learns
by directly interacting with the environment. Thus, RL is essentially an interaction
between the agent and its environment.
Markov Decision Processes
The Markov Decision Process (MDP) provides a mathematical framework for
modeling RL problems. To understand an MDP, first, we need to learn about the
Markov property and Markov chain.
The Markov property and Markov chain
The Markov property states that the future depends only on the present and not on
the past. For example, if we want to predict the weather and we know that the current state is
cloudy, we can predict that the next state could be rainy. We concluded that the next
state is likely to be rainy only by considering the current state (cloudy) and not the
previous states, which might have been sunny, windy, and so on.
However, the Markov property does not hold for all processes. For instance,
throwing a dice (the next state) has no dependency on the previous number that
showed up on the dice (the current state).
Moving from one state to another is called a transition, and its probability is called
a transition probability. We denote the transition probability by $P(s'|s)$. It indicates
the probability of moving from the state s to the next state $s'$. Say we have three
states (cloudy, rainy, and windy) in our Markov chain. Then we can represent the
probability of transitioning from one state to another using a table called a Markov
table, as shown in Table 1.1:
• From the state cloudy, we transition to the state rainy with 70% probability
and to the state windy with 30% probability.
• From the state rainy, we transition to the same state rainy with 80%
probability and to the state cloudy with 20% probability.
• From the state windy, we transition to the state rainy with 100% probability.
We can also represent this transition information of the Markov chain in the form of
a state diagram, as shown in Figure 1.9:
We can also formulate the transition probabilities into a matrix called the transition
matrix, as shown in Figure 1.10:
Thus, to conclude, we can say that the Markov chain or Markov process consists
of a set of states along with their transition probabilities.
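As a small illustration, the weather Markov chain described above can be written down and simulated in a few lines of Python. The state ordering (cloudy, rainy, windy) and the use of NumPy are simply one possible choice:

import numpy as np

# Transition matrix of the weather Markov chain (rows: current state,
# columns: next state), following the probabilities given in the text.
states = ["cloudy", "rainy", "windy"]
P = np.array([[0.0, 0.7, 0.3],   # from cloudy
              [0.2, 0.8, 0.0],   # from rainy
              [0.0, 1.0, 0.0]])  # from windy

# Simulate the chain: the next state depends only on the current state,
# which is exactly the Markov property.
current = 0  # start in the cloudy state
for _ in range(5):
    current = np.random.choice(len(states), p=P[current])
    print(states[current])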
The Markov Reward Process
The Markov Reward Process (MRP) is an extension of the Markov chain with the
reward function. A reward function tells us the reward we obtain in each state. For instance, based on
our previous weather example, the reward function tells us the reward we obtain
in the state cloudy, the reward we obtain in the state windy, and so on. The reward
function is usually denoted by R(s).
Thus, the MRP consists of states s, a transition probability $P(s'|s)$, and a reward
function R(s).
The Markov Decision Process
The Markov Decision Process (MDP) is an extension of the MRP with actions.
Let's understand this with an example. Given any environment, we can formulate
the environment using an MDP. For instance, let's consider the same grid world
environment we learned earlier. Figure 1.11 shows the grid world environment,
and the goal of the agent is to reach state I from state A without visiting the shaded
states:
An agent makes a decision (action) in the environment only based on the current
state the agent is in and not based on the past state. So, we can formulate our
environment as an MDP. We learned that the MDP consists of states, actions,
transition probabilities, and a reward function. Now, let's learn how this relates
to our RL environment:
States – A set of states present in the environment. Thus, in the grid world
environment, we have states A to I.
Actions – A set of actions that our agent can perform in each state. An agent
performs an action and moves from one state to another. Thus, in the grid world
environment, the set of actions is up, down, left, and right.
Transition probability – The probability of moving from a state s to the next state
$s'$ while performing an action a. It is denoted by $P(s'|s, a)$.
For example, in our grid world environment, say the transition probability of
moving from state A to state B while performing an action right is 100%. This can
be expressed as P(B|A, right) = 1.0. We can also view this in the state diagram, as
shown in Figure 1.12:
Suppose our agent is in state C and the transition probability of moving from state C
to state F while performing the action down is 90%, then it can be expressed as P(F|C,
down) = 0.9. We can also view this in the state diagram, as shown in Figure 1.13:
Reward function – The reward the agent obtains while transitioning from a state s
to the next state $s'$ by performing an action a. It is denoted by $R(s, a, s')$.
Say the reward we obtain while transitioning from state A to state B while
performing the action right is -1, then it can be expressed as R(A, right, B) = -1. We
can also view this in the state diagram, as shown in Figure 1.14:
Suppose our agent is in state C and say the reward we obtain while transitioning
from state C to state F while performing the action down is +1, then it can be
expressed as R(C, down, F) = +1. We can also view this in the state diagram, as
shown in Figure 1.15:
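As a rough sketch, the transition probabilities and rewards quoted in these examples can be stored in plain Python dictionaries keyed by (state, action, next state). This is only an illustration of the MDP components, not how Gym environments represent them internally:

# Transition probabilities P(s'|s, a) quoted in the text
P = {
    ("A", "right", "B"): 1.0,
    ("C", "down", "F"): 0.9,
}

# Reward function R(s, a, s') quoted in the text
R = {
    ("A", "right", "B"): -1,
    ("C", "down", "F"): +1,
}

print(P[("A", "right", "B")])   # 1.0
print(R[("C", "down", "F")])    # 1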
Fundamental concepts of RL
In this section, we will learn about several important fundamental RL concepts.
Math essentials
Before going ahead, let's quickly recap expectation from our high school days, as we
will be dealing with expectation throughout the book.
Expectation
Let's say we have a variable X and it has the values 1, 2, 3, 4, 5, 6. To compute the
average value of X, we can just sum all the values of X and divide by the number of
values of X. Thus, the average of X is (1+2+3+4+5+6)/6 = 3.5.
Now, let's suppose X is a random variable. The random variable takes values based
on a random experiment, such as throwing dice, tossing a coin, and so on. The
random variable takes different values with some probabilities. Let's suppose we
are throwing a fair dice, then the possible outcomes (X) are 1, 2, 3, 4, 5, and 6 and the
probability of occurrence of each of these outcomes is 1/6, as shown in Table 1.2:
How can we compute the average value of the random variable X? Since each value
has a probability of an occurrence, we can't just take the average. So, instead, we
compute the weighted average, that is, the sum of values of X multiplied by their
respective probabilities, and this is called expectation. The expectation of a random
variable X can be defined as:
$E(X) = \sum_{i=1}^{N} x_i \, p(x_i)$
Thus, the expectation of the random variable X is E(X) = 1(1/6) + 2(1/6) + 3(1/6) +
4(1/6) + 5 (1/6) + 6(1/6) = 3.5.
The expectation is also known as the expected value. Thus, the expected value of the
random variable X is 3.5. Thus, when we say expectation or the expected value of a
random variable, it basically means the weighted average.
Now, we will look into the expectation of a function of a random variable. Let
$f(x) = x^2$; then we can write:

$E[f(X)] = \sum_{i=1}^{N} f(x_i) \, p(x_i)$
Thus, the expected value of f(X) is given as E(f(X)) = 1(1/6) + 4(1/6) + 9(1/6) +
16(1/6) + 25(1/6) + 36(1/6) = 91/6 ≈ 15.17.
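The same calculation can be checked with a few lines of NumPy; the values and probabilities below are just those of the fair die from Table 1.2:

import numpy as np

# Outcomes of a fair die and their probabilities
values = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1/6)

E_X = np.sum(values * probs)        # E(X)    = 3.5
E_fX = np.sum(values**2 * probs)    # E(X^2) ≈ 15.17

print(E_X, E_fX)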
Action space
Consider the grid world environment shown in Figure 1.16:
In the preceding grid world environment, the goal of the agent is to reach state I
starting from state A without visiting the shaded states. In each of the states, the
agent can perform any of the four actions—up, down, left, and right—to achieve the
goal. The set of all possible actions in the environment is called the action space.
Thus, for this grid world environment, the action space will be [up, down, left, right].
Discrete action space: When our action space consists of actions that are discrete,
it is called a discrete action space. For instance, in the grid world environment,
our action space consists of four discrete actions, which are up, down, left, and
right, so it is called a discrete action space.
Continuous action space: When our action space consists of actions that are
continuous, it is called a continuous action space. For instance, suppose we are
training an agent to drive a car; then our action space will consist of several
actions that have continuous values, such as the speed at which we need to drive the
car, the number of degrees we need to rotate the wheel, and so on.
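As a quick illustration, we can inspect both kinds of action space with the Gym toolkit, which is introduced in the next chapter. The environment IDs below are commonly used ones and may differ slightly between Gym versions:

import gym

# A discrete action space: FrozenLake has four possible actions.
discrete_env = gym.make("FrozenLake-v0")
print(discrete_env.action_space)       # Discrete(4)

# A continuous action space: the action is a real-valued number.
continuous_env = gym.make("MountainCarContinuous-v0")
print(continuous_env.action_space)     # Box(...)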
Policy
A policy defines the agent's behavior in an environment. The policy tells the agent
what action to perform in each state. For instance, in the grid world environment, we
have states A to I and four possible actions. The policy may tell the agent to move
down in state A, move right in state D, and so on.
To interact with the environment for the first time, we initialize a random policy, that
is, the random policy tells the agent to perform a random action in each state. Thus,
in an initial iteration, the agent performs a random action in each state and tries to
learn whether the action is good or bad based on the reward it obtains. Over a series
of iterations, an agent will learn to perform good actions in each state, which gives a
positive reward. Thus, we can say that over a series of iterations, the agent will learn
a good policy that gives a positive reward.
This good policy is called the optimal policy. The optimal policy is the policy that
gets the agent a good reward and helps the agent to achieve the goal. For instance, in
our grid world environment, the optimal policy tells the agent to perform an action
in each state such that the agent can reach state I from state A without visiting the
shaded states.
The optimal policy is shown in Figure 1.17. As we can observe, the agent selects the
action in each state based on the optimal policy and reaches the terminal state I from
the starting state A without visiting the shaded states:
Thus, the optimal policy tells the agent to perform the correct action in each state so
that the agent can receive a good reward.
We can classify policies into two types:
• A deterministic policy
• A stochastic policy
Deterministic policy
The policy that we just covered is called a deterministic policy. A deterministic policy
tells the agent to perform one particular action in a state. Thus, the deterministic
policy maps the state to one particular action and is often denoted by $\mu$. Given a state
s at a time t, a deterministic policy tells the agent to perform one particular action a.
It can be expressed as:

$a_t = \mu(s_t)$

For instance, consider our grid world example. Given state A, the deterministic
policy $\mu$ tells the agent to perform the action down. This can be expressed as:

$\mu(A) = \text{down}$
Thus, according to the deterministic policy, whenever the agent visits state A, it
performs the action down.
Stochastic policy
Unlike a deterministic policy, a stochastic policy does not map a state directly to
one particular action; instead, it maps the state to a probability distribution over an
action space.
That is, we learned that given a state, the deterministic policy will tell the agent to
perform one particular action in the given state, so whenever the agent visits the
state it always performs the same particular action. But with a stochastic policy,
given a state, the stochastic policy will return a probability distribution over an
action space. So instead of performing the same action every time the agent visits
the state, the agent performs different actions each time based on a probability
distribution returned by the stochastic policy.
Let's understand this with an example; we know that our grid world environment's
action space consists of four actions, which are [up, down, left, right]. Given a state
A, the stochastic policy returns the probability distribution over the action space as
[0.10,0.70,0.10,0.10]. Now, whenever the agent visits state A, instead of selecting the
same particular action every time, the agent selects up 10% of the time, down 70% of
the time, left 10% of the time, and right 10% of the time.
The difference between the deterministic policy and stochastic policy is shown
in Figure 1.18. As we can observe, the deterministic policy maps the state to one
particular action, whereas the stochastic policy maps the state to the probability
distribution over an action space:
Thus, the stochastic policy maps the state to a probability distribution over the action
space and is often denoted by $\pi$. Say we have a state s and action a at a time t, then
we can express the stochastic policy as:

$a_t \sim \pi(s_t)$

Or it can also be expressed as $\pi(a_t|s_t)$.
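The short sketch below contrasts the two kinds of policy for state A of the grid world, using the action probabilities quoted above. Representing the policies as Python dictionaries is just an illustrative choice:

import numpy as np

actions = ["up", "down", "left", "right"]

# Deterministic policy: state A is always mapped to the same action.
def mu(state):
    return {"A": "down"}[state]

# Stochastic policy: state A is mapped to a distribution over the actions.
def pi(state):
    probs = {"A": [0.10, 0.70, 0.10, 0.10]}[state]
    return np.random.choice(actions, p=probs)

print(mu("A"))                        # always 'down'
print([pi("A") for _ in range(5)])    # mostly 'down', occasionally others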
We can classify stochastic policies into two types:
• Categorical policy
• Gaussian policy
Categorical policy
A stochastic policy is called a categorical policy when the action space is discrete.
That is, the stochastic policy uses a categorical probability distribution over the
action space to select actions when the action space is discrete. For instance, in the
grid world environment from the previous example, we select actions based on a
categorical probability distribution (discrete distribution) as the action space of the
environment is discrete. As Figure 1.19 shows, given state A, we select an action
based on the categorical probability distribution over the action space:
Figure 1.19: Probability of next move from state A for a discrete action space
Gaussian policy
A stochastic policy is called a Gaussian policy when our action space is continuous.
That is, the stochastic policy uses a Gaussian probability distribution over the action
space to select actions when the action space is continuous. Let's understand this
with a simple example. Suppose we are training an agent to drive a car and say we
have one continuous action in our action space. Let the action be the speed of the car,
and the value of the speed of the car ranges from 0 to 150 kmph. Then, the stochastic
policy uses the Gaussian distribution over the action space to select an action, as
Figure 1.20 shows:
We will learn more about the Gaussian policy in the upcoming chapters.
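As a small sketch, sampling an action from a Gaussian policy for the car-speed example might look as follows. The mean and standard deviation are made-up values standing in for whatever the policy (for example, a neural network) would output:

import numpy as np

mean_speed = 75.0   # made-up mean of the Gaussian policy
std_speed = 10.0    # made-up standard deviation

# Sample a continuous action and keep it within the valid range 0-150 kmph.
speed = np.random.normal(mean_speed, std_speed)
speed = np.clip(speed, 0.0, 150.0)
print(speed)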
Episode
The agent interacts with the environment by performing some actions, starting
from the initial state and reaching the final state. This agent-environment interaction
starting from the initial state until the final state is called an episode. For instance, in
a car racing video game, the agent plays the game by starting from the initial state
(the starting point of the race) and reaches the final state (the endpoint of the race).
This is considered an episode. An episode is also often called a trajectory (the path
taken by the agent) and is denoted by $\tau$.
An agent can play the game for any number of episodes, and each episode is
independent of the others. What is the use of playing the game for multiple
episodes? In order to learn the optimal policy, that is, the policy that tells the agent to
perform the correct action in each state, the agent plays the game for many episodes.
For example, let's say we are playing a car racing game; the first time, we may not
win the game, so we play the game several times to understand more about the
game and discover some good strategies for winning the game. Similarly, in the first
episode, the agent may not win the game and it plays the game for several episodes
to understand more about the game environment and good strategies to win the
game.
Say we begin the game from an initial state at a time step t = 0 and reach the final
state at a time step T, then the episode information consists of the agent-environment
interaction, such as state, action, and reward, starting from the initial state until the
final state, that is, $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T)$.
Let's strengthen our understanding of the episode and the optimal policy with the
grid world environment. We learned that in the grid world environment, the goal of
our agent is to reach the final state I starting from the initial state A without visiting
the shaded states. An agent receives a +1 reward when it visits the unshaded states
and a -1 reward when it visits the shaded states.
When we say generate an episode, it means going from the initial state to the final
state. The agent generates the first episode using a random policy and explores the
environment and over several episodes, it will learn the optimal policy.
Episode 1
As the Figure 1.22 shows, in the first episode, the agent uses a random policy and
selects a random action in each state from the initial state until the final state and
observes the reward:
Episode 2
In the second episode, the agent tries a different policy to avoid the negative rewards
it received in the previous episode. For instance, as we can observe in the previous
episode, the agent selected the action right in state A and received a negative reward,
so in this episode, instead of selecting the action right in state A, it tries a different
action, say down, as shown in Figure 1.23:
Episode n
Thus, over a series of episodes, the agent learns the optimal policy, that is, the policy
that takes the agent to the final state I from state A without visiting the shaded states,
as Figure 1.24 shows:
Episodic and continuous tasks
An RL task can be categorized as one of two types:
• An episodic task
• A continuous task
Episodic task: As the name suggests, an episodic task is one that has a terminal/
final state. That is, episodic tasks are tasks made up of episodes and thus they have
a terminal state. For example, in a car racing game, we start from the starting point
(initial state) and reach the destination (terminal state).
Continuous task: Unlike episodic tasks, continuous tasks do not contain any
episodes and so they don't have any terminal state. For example, a personal
assistance robot does not have a terminal state.
Horizon
Horizon is the time step until which the agent interacts with the environment. We
can classify the horizon into two categories:
• Finite horizon – the agent-environment interaction stops at a particular, final time step
• Infinite horizon – the agent-environment interaction has no final time step and continues forever
Return and discount factor
The return of a trajectory is the sum of the rewards the agent obtains along that
trajectory:

$R(\tau) = \sum_{t=0}^{T} r_t$

For example, if the rewards obtained along a trajectory are 2, 2, 1, and 2, then the
return of the trajectory is the sum of the rewards, that is,
$R(\tau) = 2 + 2 + 1 + 2 = 7$.
Thus, we can say that the goal of our agent is to maximize the return, that is,
maximize the sum of rewards (cumulative rewards) obtained over the episode. How
can we maximize the return? We can maximize the return if we perform the correct
action in each state. Okay, how can we perform the correct action in each state? We
can perform the correct action in each state by using the optimal policy. Thus, we can
maximize the return using the optimal policy. Thus, the optimal policy is the policy
that gets our agent the maximum return (sum of rewards) by performing the correct
action in each state.
Okay, how can we define the return for continuous tasks? We learned that in
continuous tasks there are no terminal states, so we can define the return as a sum of
rewards up to infinity:
$R(\tau) = r_0 + r_1 + r_2 + \cdots + r_\infty$
But how can we maximize the return that just sums to infinity? We introduce a new
term called the discount factor $\gamma$ and rewrite our return as:

$R(\tau) = \gamma^0 r_0 + \gamma^1 r_1 + \gamma^2 r_2 + \cdots$

$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$
Okay, but how is this discount factor $\gamma$ helping us? It helps us in preventing the
return from reaching infinity by deciding how much importance we give to future
rewards and immediate rewards. The value of the discount factor ranges from 0 to 1.
When we set the discount factor to a small value (close to 0), it implies that we give
more importance to immediate rewards than to future rewards. When we set the
discount factor to a high value (close to 1), it implies that we give more importance
to future rewards than to immediate rewards. Let's understand this with an example
with different discount factor values.
As we can observe, the discount factor is heavily decreased for the subsequent time
steps and more importance is given to the immediate reward r0 than the rewards
obtained at the future time steps. Thus, when we set the discount factor to a small
value, we give more importance to the immediate reward than future rewards.
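The effect of the discount factor is easy to reproduce in code. The reward sequence below is made up, but it shows how a small discount factor makes the return depend almost entirely on the immediate reward, while a larger one keeps future rewards relevant:

# Discounted return R(tau) = sum over t of gamma^t * r_t
def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1, 1, 1, 1, 1]   # a made-up reward sequence

print(discounted_return(rewards, 0.2))   # ~1.25: immediate reward dominates
print(discounted_return(rewards, 0.9))   # ~4.10: future rewards still matter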
As we can observe, the discount factor is decreased for subsequent time steps but
unlike the previous case, the discount factor is not decreased heavily. Thus, when we
set the discount factor to a high value, we give more importance to future rewards
than the immediate reward.
Thus, we have learned that when we set the discount factor to 0, the agent will never
learn, considering only the immediate reward, and when we set the discount factor
to 1 the agent will learn forever, looking for the future rewards that lead to infinity.
So, the optimal value of the discount factor lies between 0.2 and 0.8.
But the question is, why should we care about immediate and future rewards? We
give importance to immediate and future rewards depending on the tasks. In some
tasks, future rewards are more desirable than immediate rewards, and vice versa. In
a chess game, the goal is to defeat the opponent's king. If we give more importance
to the immediate reward, which is acquired by actions such as our pawn defeating
any opposing chessman, then the agent will learn to perform this sub-goal instead
of learning the actual goal. So, in this case, we give greater importance to future
rewards than the immediate reward, whereas in some cases, we prefer immediate
rewards over future rewards. Would you prefer chocolates if I gave them to you
today or 13 days later?
In the following two sections, we'll analyze the two fundamental functions of RL.
The value function
The value function, or the value of a state, is the return an agent would obtain
starting from that state and following policy $\pi$. Let's understand the value function
with an example. Suppose we generate the trajectory $\tau$ following some policy $\pi$
in our grid world environment, as shown in Figure 1.26:
Now, how do we compute the value of all the states in our trajectory? We learned
that the value of a state is the return (sum of rewards) an agent would obtain starting
from that state following policy $\pi$. The preceding trajectory is generated using policy
$\pi$, thus we can say that the value of a state is the return (sum of rewards) of the
trajectory starting from that state:
• The value of state A is the return of the trajectory starting from state A. Thus,
V(A) = 1 + 1 - 1 + 1 = 2.
• The value of state D is the return of the trajectory starting from state D. Thus,
V(D) = 1 - 1 + 1 = 1.
• The value of state E is the return of the trajectory starting from state E. Thus,
V(E) = -1 + 1 = 0.
• The value of state H is the return of the trajectory starting from state H. Thus,
V(H) = 1.
What about the value of the final state I? We learned the value of a state is the return
(sum of rewards) starting from that state. We know that we obtain a reward when
we transition from one state to another. Since I is the final state, we don't make any
transition from the final state, so there is no reward and thus no value for the final
state I.
In a nutshell, the value of a state is the return of the trajectory starting from that state.
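We can verify these values with a couple of lines of Python. The per-transition rewards below are read off the trajectory used above (A, D, E, H, and then the final state I):

# States visited along the trajectory and the reward for each transition
states = ["A", "D", "E", "H"]
rewards = [1, 1, -1, 1]   # rewards for A->D, D->E, E->H, H->I

# The value of each state is the return (sum of rewards) from that state onward.
for i, s in enumerate(states):
    print(s, sum(rewards[i:]))
# Prints: A 2, D 1, E 0, H 1 -- matching the values computed above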
Wait! There is a small change here: instead of taking the return directly as a value of
a state, we will use the expected return. Thus, the value function or the value of state
s can be defined as the expected return that the agent would obtain starting from
state s following policy $\pi$. It can be expressed as:

$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
Now, the question is, why the expected return? Why can't we just compute the value
of a state as the return directly? Because our return is a random variable and it takes
different values with some probability.
Let's understand this with a simple example. Suppose we have a stochastic policy $\pi$. We
learned that unlike the deterministic policy, which maps the state to the action directly,
the stochastic policy maps the state to the probability distribution over the action space.
Thus, the stochastic policy selects actions based on a probability distribution.
Let's suppose we are in state A and the stochastic policy returns the probability
distribution over the action space as [0.0,0.80,0.00,0.20]. It implies that with the
stochastic policy, in state A, we perform the action down 80% of the time, that is,
$\pi(\text{down}|A) = 0.8$, and the action right 20% of the time, that is, $\pi(\text{right}|A) = 0.2$.
Thus, in state A, our stochastic policy $\pi$ selects the action down 80% of the time and
the action right 20% of the time, and say our stochastic policy selects the action right
in states D and E and the action down in states B and F 100% of the time.
First, we generate an episode $\tau_1$ using our stochastic policy $\pi$, as shown in Figure
1.27:
For better understanding, let's focus only on the value of state A. The value of
state A is the return (sum of rewards) of the trajectory starting from state A. Thus,
$V(A) = R(\tau_1) = 1 + 1 + 1 + 1 = 4$.
Say we generate another episode $\tau_2$ using the same given stochastic policy $\pi$, as
shown in Figure 1.28:
The value of state A is the return (sum of rewards) of the trajectory from state A.
Thus, $V(A) = R(\tau_2) = -1 + 1 + 1 + 1 = 2$.
As you may observe, although we use the same policy, the values of state A in
trajectories $\tau_1$ and $\tau_2$ are different. This is because our policy is a stochastic policy
and it performs the action down in state A 80% of the time and the action right in state
A 20% of the time. So, when we generate a trajectory using policy $\pi$, the trajectory $\tau_1$
will occur 80% of the time and the trajectory $\tau_2$ will occur 20% of the time. Thus, the
return will be 4 for 80% of the time and 2 for 20% of the time.
Thus, instead of taking the value of the state as a return directly, we will take the
expected return, since the return takes different values with some probability. The
expected return is basically the weighted average, that is, the sum of the return
multiplied by their probability. Thus, we can write:
$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s]$
The value of a state A can be obtained as:
$V^{\pi}(A) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = A]$
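For instance, if we assume for simplicity that $\tau_1$ and $\tau_2$ are the only trajectories our policy can produce from state A, the expected return is just the probability-weighted average of their returns:

$V^{\pi}(A) \approx 0.8 \times R(\tau_1) + 0.2 \times R(\tau_2) = 0.8 \times 4 + 0.2 \times 2 = 3.6$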
Note that the value function depends on the policy, that is, the value of the state
varies based on the policy we choose. There can be many different value functions
according to different policies. The optimal value function V*(s) yields the maximum
value compared to all the other value functions. It can be expressed as:
$V^{*}(s) = \max_{\pi} V^{\pi}(s)$
For example, let's say we have two policies $\pi_1$ and $\pi_2$. Let the value of state s using
policy $\pi_1$ be $V^{\pi_1}(s) = 13$ and the value of state s using policy $\pi_2$ be $V^{\pi_2}(s) = 11$.
Then the optimal value of state s will be $V^{*}(s) = 13$ as it is the maximum. The policy
that gives the maximum state value is called the optimal policy $\pi^{*}$. Thus, in this case,
$\pi_1$ is the optimal policy as it gives the maximum state value.
We can view the value function in a table called a value table. Let's say we have two
states s0 and s1, then the value function can be represented as:
From the value table, we can tell that it is better to be in state s1 than state s0 as s1 has
a higher value. Thus, we can say that state s1 is the optimal state.
Q function
A Q function, also called the state-action value function, denotes the value of a
state-action pair. The value of a state-action pair is the return the agent would obtain
starting from state s and performing action a following policy $\pi$. The value of a state-
action pair or Q function is usually denoted by Q(s, a) and is known as the Q value or
state-action value. It is expressed as:

$Q^{\pi}(s, a) = [R(\tau) \mid s_0 = s, a_0 = a]$
Note that the only difference between the value function and Q function is that in
the value function we compute the value of a state, whereas in the Q function we
compute the value of a state-action pair. Let's understand the Q function with an
example. Consider the trajectory in Figure 1.29 generated using policy $\pi$:
We learned that the Q function computes the value of a state-action pair. Say we
need to compute the Q value of state-action pair A-down. That is the Q value of
moving down in state A. Then the Q value will be the return of our trajectory starting
from state A and performing the action down:
$$Q^{\pi}(A, \text{down}) = [R(\tau) \mid s_0 = A, a_0 = \text{down}]$$
$$Q(A, \text{down}) = 1 + 1 + 1 + 1 = 4$$
Let's suppose we need to compute the Q value of the state-action pair D-right. That
is the Q value of moving right in state D. The Q value will be the return of our
trajectory starting from state D and performing the action right:
$$Q^{\pi}(D, \text{right}) = [R(\tau) \mid s_0 = D, a_0 = \text{right}]$$
$$Q(D, \text{right}) = 1 + 1 + 1 = 3$$
Similarly, we can compute the Q value for all the state-action pairs. Similar to what
we learned about the value function, instead of taking the return directly as the Q
value of a state-action pair, we use the expected return because the return is a
random variable and it takes different values with some probability. So, we can
redefine our Q function as:
$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}[R(\tau) \mid s_0 = s, a_0 = a]$$
It implies that the Q value is the expected return the agent would obtain starting
from state s and performing action a following policy $\pi$.
Similar to the value function, the Q function depends on the policy, that is, the Q
value varies based on the policy we choose. There can be many different Q functions
according to different policies. The optimal Q function is the one that has the
maximum Q value over other Q functions, and it can be expressed as:
$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$
The optimal policy $\pi^{*}$ is the policy that gives the maximum Q value.
Like the value function, the Q function can be viewed in a table. It is called a Q table.
Let's say we have two states s0 and s1, and two actions 0 and 1; then the Q function
can be represented as follows:
As we can observe, the Q table represents the Q values of all possible state-action
pairs. We learned that the optimal policy is the policy that gets our agent the
maximum return (sum of rewards). We can extract the optimal policy from the Q
table by just selecting the action that has the maximum Q value in each state. Thus,
our optimal policy will select action 1 in state s0 and action 0 in state s1 since they
have the highest Q values, as shown in Table 1.6:
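As a quick illustration, the following is a minimal sketch (with made-up Q values) of how the greedy action can be read off a Q table with NumPy; the array and variable names here are purely illustrative:

import numpy as np

# Hypothetical Q table: rows are states (s0, s1), columns are actions (0, 1)
Q = np.array([[0.5, 1.2],   # Q(s0, 0), Q(s0, 1)
              [2.0, 0.3]])  # Q(s1, 0), Q(s1, 1)

# The optimal policy selects the action with the maximum Q value in each state
policy = np.argmax(Q, axis=1)
print(policy)  # [1 0] -> action 1 in state s0, action 0 in state s1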
Model-free learning: Model-free learning is when the agent does not know the
model dynamics of its environment. That is, in model-free learning, an agent tries to
find the optimal policy without the model dynamics.
Next, we'll discover the different types of environment an agent works within.
We have covered a lot of concepts of RL. Now, we'll finish the chapter by looking at
some exciting applications of RL.
Applications of RL
RL has evolved rapidly over the past couple of years with a wide range of
applications ranging from playing games to self-driving cars. One of the major
reasons for this evolution is Deep Reinforcement Learning (DRL), which is a
combination of RL and deep learning. We will learn about the various state-of-the-art
deep RL algorithms in the upcoming chapters, so be excited! In this section, we will
look at some real-life applications of RL:
RL glossary
We have learned several important and fundamental concepts of RL. In this section,
we revisit several important terms that are very useful for understanding the
upcoming chapters.
Agent: The agent is the software program that learns to make intelligent decisions,
such as a software program that plays chess intelligently.
Environment: The environment is the world of the agent. If we continue with the
chess example, a chessboard is the environment where the agent plays chess.
State: A state is a position or a moment in the environment that the agent can be in.
For example, all the positions on the chessboard are called states.
Action: The agent interacts with the environment by performing an action and
moves from one state to another, for example, moves made by chessmen are actions.
Reward: A reward is a numerical value that the agent receives based on its action.
Consider a reward as a point. For instance, an agent receives +1 point (reward) for a
good action and -1 point (reward) for a bad action.
Action space: The set of all possible actions in the environment is called the action
space. The action space is called a discrete action space when our action space
consists of discrete actions, and the action space is called a continuous action space
when our action space consists of continuous actions.
Policy: The agent makes a decision based on the policy. A policy tells the agent what
action to perform in each state. It can be considered the brain of an agent. A policy is
called a deterministic policy if it exactly maps a state to a particular action. Unlike a
deterministic policy, a stochastic policy maps the state to a probability distribution
over the action space. The optimal policy is the one that gives the maximum reward.
Episode: The agent-environment interaction from the initial state to the terminal
state is called an episode. An episode is often called a trajectory or rollout.
Horizon: The horizon can be considered an agent's lifespan, that is, the time step
until which the agent interacts with the environment. The horizon is called a finite
horizon if the agent-environment interaction stops at a particular time step, and it is
called an infinite horizon when the agent-environment interaction continues forever.
Discount factor: The discount factor helps to control whether we want to give
importance to the immediate reward or future rewards. The value of the discount
factor ranges from 0 to 1. A discount factor close to 0 implies that we give more
importance to immediate rewards, while a discount factor close to 1 implies that we
give more importance to future rewards than immediate rewards.
Value function: The value function or the value of the state is the expected return
that an agent would get starting from state s following policy $\pi$.
Q function: The Q function or the value of a state-action pair implies the expected
return an agent would obtain starting from state s and performing action a following
policy $\pi$.
Model-based and model-free learning: When the agent tries to learn the optimal
policy with the model dynamics, then it is called model-based learning; and when
the agent tries to learn the optimal policy without the model dynamics, then it is
called model-free learning.
Summary
We started the chapter by understanding the basic idea of RL. We learned that RL is
a trial and error learning process and the learning in RL happens based on a reward.
We then explored the difference between RL and the other ML paradigms, such as
supervised and unsupervised learning. Going ahead, we learned about the MDP and
how the RL environment can be modeled as an MDP. Next, we understood several
important fundamental concepts involved in RL, and at the end of the chapter we
looked into some real-life applications of RL.
Questions
Let's evaluate our newly acquired knowledge by answering these questions:
Further reading
For further information, refer to the following link:
Chapter 2: A Guide to the Gym Toolkit
OpenAI is an artificial intelligence (AI) research organization that aims to build
artificial general intelligence (AGI). OpenAI provides a famous toolkit called Gym
for training a reinforcement learning agent.
Let's suppose we need to train our agent to drive a car. We need an environment to
train the agent. Can we train our agent in the real-world environment to drive a car?
No, because we have learned that reinforcement learning (RL) is a trial-and-error
learning process, so while we train our agent, it will make a lot of mistakes during
learning. For example, let's suppose our agent hits another vehicle, and it receives a
negative reward. It will then learn that hitting other vehicles is not a good action and
will try not to perform this action again. But we cannot train the RL agent in the real-
world environment by hitting other vehicles, right? That is why we use simulators
and train the RL agent in the simulated environments.
There are many toolkits that provide a simulated environment for training an RL
agent. One such popular toolkit is Gym. Gym provides a variety of environments for
training an RL agent ranging from classic control tasks to Atari game environments.
We can train our RL agent to learn in these simulated environments using various
RL algorithms. In this chapter, first, we will install Gym and then we will explore
various Gym environments. We will also get hands-on with the concepts we have
learned in the previous chapter by experimenting with the Gym environment.
Throughout the book, we will use the Gym toolkit for building and evaluating
reinforcement learning algorithms, so in this chapter, we will make ourselves
familiar with the Gym toolkit.
Installing Anaconda
Anaconda is an open-source distribution of Python. It is widely used for scientific
computing and processing large volumes of data. It provides an excellent package
management environment, and it supports Windows, Mac, and Linux operating
systems. Anaconda comes with Python installed, along with popular packages used
for scientific computing such as NumPy, SciPy, and so on.
1. Open the Terminal and type the following command to download Anaconda:
wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
Note that we use Python version 3.6. Once the virtual environment is created, we can
activate it using the following command:
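A minimal sketch of the activation command, assuming the environment is named universe (the name used later in this chapter) and an Anaconda setup where source activate is available:

source activate universe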
That's it! Now that we have learned how to install Anaconda and create a virtual
environment, in the next section, we will learn how to install Gym.
We can install Gym directly using pip. Note that throughout the book, we will use
Gym version 0.15.4. We can install Gym using the following command:
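pip install gym==0.15.4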
cd ~
git clone https://github.com/openai/gym.git
cd gym
pip install -e '.[all]'
• Failed building wheel for pachi-py or failed building wheel for atari-py:
sudo apt-get update
sudo apt-get install xvfb libav-tools xorg-dev libsdl2-dev swig
cmake
Now that we have successfully installed Gym, in the next section, let's kickstart our
hands-on reinforcement learning journey.
Let's introduce one of the simplest environments called the Frozen Lake
environment. Figure 2.1 shows the Frozen Lake environment. As we can observe, in
the Frozen Lake environment, the goal of the agent is to start from the initial state S
and reach the goal state G:
So, the agent has to start from state S and reach the goal state G. But one issue is that
if the agent visits state H, which is the hole state, then the agent will fall into the hole
and die as shown in Figure 2.2:
So, we need to make sure that the agent starts from S and reaches G without falling
into the hole state H as shown in Figure 2.3:
Each grid box in the preceding environment is called a state, thus we have 16 states
(S to G) and we have 4 possible actions, which are up, down, left, and right. We
learned that our goal is to reach the state G from S without visiting H. So, we assign
+1 reward for the goal state G and 0 for all other states.
Thus, we have learned how the Frozen Lake environment works. Now, to train our
agent in the Frozen Lake environment, first, we need to create the environment by
coding it from scratch in Python. But luckily we don't have to do that! Since Gym
provides various environments, we can directly import the Gym toolkit and create a
Frozen Lake environment.
Now, we will learn how to create our Frozen Lake environment using Gym. Before
running any code, make sure that you have activated our virtual environment
universe. First, let's import the Gym library:
import gym
Next, we can create a Gym environment using the make function. The make function
requires the environment id as a parameter. In Gym, the id of the Frozen Lake
environment is FrozenLake-v0. So, we can create our Frozen Lake environment as
follows:
env = gym.make("FrozenLake-v0")
After creating the environment, we can see what our environment looks like using
the render function:
env.render()
That's it! Creating the environment using Gym is that simple. In the next section, we
will understand more about the Gym environment by relating all the concepts we
have learned in the previous chapter.
Let's now understand how to obtain all the above information from the Frozen Lake
environment we just created using Gym.
States
A state space consists of all of our states. We can obtain the number of states in our
environment by just typing env.observation_space as follows:
print(env.observation_space)
Discrete(16)
It implies that we have 16 discrete states in our state space starting from state S to
G. Note that, in Gym, the states will be encoded as a number, so the state S will be
encoded as 0, state F will be encoded as 1, and so on as Figure 2.5 shows:
Actions
We learned that the action space consists of all the possible actions in the
environment. We can obtain the action space by using env.action_space:
print(env.action_space)
Discrete(4)
It shows that we have 4 discrete actions in our action space, which are left, down,
right, and up. Note that, similar to states, actions also will be encoded into numbers
as shown in Table 2.1:
Let's suppose we are in state 2 (F). Now, if we perform action 1 (down) in state 2, we
can reach state 6 as shown in Figure 2.6:
As we can see, in a stochastic environment we reach the next states with some probability.
Now, let's learn how to obtain this transition probability using the Gym environment.
We can obtain the transition probability and the reward function by just typing
env.P[state][action]. So, to obtain the transition probability of moving from state
S to the other states by performing the action right, we can type env.P[S][right].
But we cannot just type state S and action right directly since they are encoded as
numbers. We learned that state S is encoded as 0 and the action right is encoded as
2, so, to obtain the transition probability of state S by performing the action right, we
type env.P[0][2] as the following shows:
print(env.P[0][2])
What does this imply? Our output is in the form of [(transition probability, next
state, reward, Is terminal state?)]. It implies that if we perform an action 2
(right) in state 0 (S) then:
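Each entry of this list can also be unpacked explicitly, which makes the format easier to read; a minimal sketch (assuming env is the Frozen Lake environment created earlier):

for prob, next_state, reward, done in env.P[0][2]:
    # Each entry is (transition probability, next state, reward, is terminal state?)
    print('probability: {}, next state: {}, reward: {}, terminal: {}'.format(
        prob, next_state, reward, done))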
Let's understand this with one more example. Let's suppose we are in state 3 (F) as
Figure 2.9 shows:
Say we perform action 1 (down) in state 3 (F). Then the transition probability of state
3 (F) by performing action 1 (down) can be obtained as the following shows:
print(env.P[3][1])
As we can observe, in the second row of our output, we have (0.33333, 7, 0.0,
True), and the last value here is marked as True. It implies that state 7 is a terminal
state. That is, if we perform action 1 (down) in state 3 (F) then we reach state 7 (H)
with 0.33333 probability, and since 7 (H) is a hole, the agent dies if it reaches state 7
(H). Thus 7(H) is a terminal state and so it is marked as True.
Thus, we have learned how to obtain the state space, action space, transition
probability, and the reward function using the Gym environment. In the next section,
we will learn how to generate an episode.
Before we begin, we initialize the state by resetting our environment; resetting puts
our agent back to the initial state. We can reset our environment using the reset()
function as shown as follows:
state = env.reset()
Action selection
In order for the agent to interact with the environment, it has to perform some
action in the environment. So, first, let's learn how to perform an action in the Gym
environment. Let's suppose we are in state 3 (F) as Figure 2.11 shows:
Say we need to perform action 1 (down) and move to the new state 7 (H). How can
we do that? We can perform an action using the step function. We just need to input
our action as a parameter to the step function. So, we can perform action 1 (down) in
state 3 (F) using the step function as follows:
env.step(1)
env.render()
As shown in Figure 2.12, the agent performs action 1 (down) in state 3 (F) and reaches
the next state 7 (H):
As you might have guessed, it implies that when we perform action 1 (down) in state
3 (F):
Thus:
We can also sample action from our action space and perform a random action to
explore our environment. We can sample an action using the sample function:
random_action = env.action_space.sample()
After we have sampled an action from our action space, then we perform our
sampled action using our step function:
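next_state, reward, done, info = env.step(random_action)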
Now that we have learned how to select actions in the environment, let's see how to
generate an episode.
Generating an episode
Now let's learn how to generate an episode. The episode is the agent environment
interaction starting from the initial state to the terminal state. The agent interacts
with the environment by performing some action in each state. An episode ends if
the agent reaches the terminal state. So, in the Frozen Lake environment, the episode
will end if the agent reaches the terminal state, which is either the hole state (H) or
goal state (G).
Let's understand how to generate an episode with the random policy. We learned
that the random policy selects a random action in each state. So, we will generate an
episode by taking a random action on each time step, and the episode will end when
the agent reaches the terminal state.
num_timesteps = 20

for t in range(num_timesteps):
    random_action = env.action_space.sample()
    next_state, reward, done, info = env.step(random_action)
If the next state is the terminal state, then break. This implies that our episode ends:
    if done:
        break
The complete snippet is provided below for clarity. The following code denotes
that on every time step, we select an action by randomly sampling from the action
space, and our episode will end if the agent reaches the terminal state:
import gym

env = gym.make("FrozenLake-v0")
state = env.reset()

num_timesteps = 20

for t in range(num_timesteps):
    # Select a random action by sampling from the action space
    random_action = env.action_space.sample()
    # Perform the action and observe the next state, reward, and terminal flag
    next_state, reward, done, info = env.step(random_action)
    env.render()
    # End the episode if the agent reaches the terminal state
    if done:
        break
The preceding code will print something similar to Figure 2.13. Note that you might
get a different result each time you run the preceding code since the agent is taking a
random action in each time step.
As we can observe from the following output, on each time step, the agent takes a
random action in each state and our episode ends once the agent reaches the terminal
state. As Figure 2.13 shows, in time step 4, the agent reaches the terminal state H, and
so the episode ends:
Instead of generating one episode, we can also generate a series of episodes by taking
some random action in each state:
import gym
env = gym.make("FrozenLake-v0")
num_episodes = 10
num_timesteps = 20

for i in range(num_episodes):
    state = env.reset()
    print('Time Step 0 :')
    env.render()

    for t in range(num_timesteps):
        # Take a random action on every time step
        random_action = env.action_space.sample()
        next_state, reward, done, info = env.step(random_action)
        env.render()
        # End the episode when the terminal state is reached
        if done:
            break
In the previous chapter, we learned that an agent can find the optimal policy (that is,
the correct action in each state) by generating several episodes. But in the preceding
example, we just took random actions in each state over all the episodes. How can
the agent find the optimal policy? So, in the case of the Frozen Lake environment,
how can the agent find the optimal policy that tells the agent to reach state G from
state S without visiting the hole states H?
So far we have understood how the Gym environment works using the basic Frozen
Lake environment, but Gym has so many other functionalities and also several
interesting environments. In the next section, we will learn about the other Gym
environments along with exploring the functionalities of Gym.
Cart-Pole balancing is one of the classical control problems. As shown in Figure 2.14,
the pole is attached to the cart and the goal of our agent is to balance the pole on the
cart, that is, the goal of our agent is to keep the pole standing straight up on the cart
as shown in Figure 2.15:
So the agent tries to push the cart left and right to keep the pole standing straight on
the cart. Thus our agent performs two actions, which are pushing the cart to the left
and pushing the cart to the right, to keep the pole standing straight on the cart. You
can also check out this very interesting video, https://youtu.be/qMlcsc43-lg, which
shows how the RL agent balances the pole on the cart by moving the cart left and
right.
Now, let's learn how to create the Cart-Pole environment using Gym. The
environment id of the Cart-Pole environment in Gym is CartPole-v0, so we can just
use our make function to create the Cart-Pole environment as shown below:
env = gym.make("CartPole-v0")
After creating the environment, we can view our environment using the render
function:
env.render()
We can also close the rendered environment using the close function:
env.close()
State space
Now, let's look at the state space of our Cart-Pole environment. Wait! What are the
states here? In the Frozen Lake environment, we had 16 discrete states from S to G.
But how can we describe the states here? Can we describe the state by cart position?
Yes! Note that the cart position is a continuous value. So, in this case, our state space
will be continuous values, unlike the Frozen Lake environment where our state space
had discrete values (S to G).
But with just the cart position alone we cannot describe the state of the environment
completely. So we include cart velocity, pole angle, and pole velocity at the tip. So
we can describe our state space by an array of values as shown as follows:
Thus, our state space contains an array of continuous values. Let's learn how we
can obtain this from Gym. In order to get the state space, we can just type env.
observation_space as shown as follows:
print(env.observation_space)
Box(4,)
Box implies that our state space consists of continuous values and not discrete
values. That is, in the Frozen Lake environment, we obtained the state space as
Discrete(16), which shows that we have 16 discrete states (S to G). But now we
have our state space denoted as Box(4,), which implies that our state space is
continuous and consists of an array of 4 values.
For example, let's reset our environment and see what the initial state looks
like. We can reset the environment using the reset function:
print(env.reset())
Note that the initial state is randomly initialized, so we will get different
values every time we run the preceding code.
The result of the preceding code implies that our initial state consists of an
array of 4 values that denote the cart position, cart velocity, pole angle, and pole
velocity at the tip, respectively. That is:
Okay, how can we obtain the maximum and minimum values of our state space? We
can obtain the maximum values of our state space using env.observation_space.
high and the minimum values of our state space using env.observation_space.low.
For example, let's look at the maximum value of our state space:
print(env.observation_space.high)
It implies that:
Similarly, we can obtain the minimum value of our state space as:
print(env.observation_space.low)
It states that:
Action space
Now, let's look at the action space. We already learned that in the Cart-Pole
environment we perform two actions, which are pushing the cart to the left and
pushing the cart to the right, and thus the action space is discrete since we have only
two discrete actions.
In order to get the action space, we can just type env.action_space as the following
shows:
print(env.action_space)
Discrete(2)
As we can observe, Discrete(2) implies that our action space is discrete, and we
have two actions in our action space. Note that the actions will be encoded into
numbers as shown in Table 2.4:
import gym
env = gym.make('CartPole-v0')
Set the number of episodes and number of time steps in the episode:
num_episodes = 100
num_timesteps = 50
for i in range(num_episodes):
    Return = 0
    state = env.reset()

    for t in range(num_timesteps):
        env.render()
        random_action = env.action_space.sample()
        # Perform the random action and accumulate the reward into the return
        next_state, reward, done, info = env.step(random_action)
        Return = Return + reward
        if done:
            break

    if i%10==0:
        print('Episode: {}, Return: {}'.format(i, Return))
env.close()
The preceding code will output the sum of rewards obtained over every 10 episodes:
Thus, we have learned about one of the interesting and classic control problems
called Cart-Pole balancing and how to create the Cart-Pole balancing environment
using Gym. Gym provides several other classic control environments as shown in
Figure 2.17:
You can also do some experimentation by creating any of the above environments
using Gym. We can check all the classic control environments offered by Gym here:
https://gym.openai.com/envs/#classic_control.
In this section, we will learn how to create the Atari game environment using Gym.
Gym provides about 59 Atari game environments including Pong, Space Invaders,
Air Raid, Asteroids, Centipede, Ms. Pac-Man, and so on. Some of the Atari game
environments provided by Gym are shown in Figure 2.18 to keep you excited:
In Gym, every Atari game environment has 12 different variants. Let's understand
this with the Pong game environment. The Pong game environment will have 12
different variants as explained in the following sections.
General environment
• Pong-v0 and Pong-v4: We can create a Pong environment with the
environment id as Pong-v0 or Pong-v4. Okay, what about the state of our
environment? Since we are dealing with the game environment, we can just
take the image of our game screen as our state. But we can't deal with the
raw image directly so we will take the pixel values of our game screen as the
state. We will learn more about this in the upcoming section.
• Pong-ram-v0 and Pong-ram-v4: This is similar to Pong-v0 and Pong-v4,
respectively. However, here, the state of the environment is the RAM of the
Atari machine, which is just the 128 bytes instead of the game screen's pixel
values.
Deterministic environment
• PongDeterministic-v0 and PongDeterministic-v4: In this type, as the name
suggests, the initial position of the game will be the same every time we
initialize the environment, and the state of the environment is the pixel
values of the game screen.
• Pong-ramDeterministic-v0 and Pong-ramDeterministic-v4: This is similar to
PongDeterministic-v0 and PongDeterministic-v4, respectively, but here the
state is the RAM of the Atari machine.
No frame skipping
• PongNoFrameskip-v0 and PongNoFrameskip-v4: In this type, no game
frame is skipped; all game screens are visible to the agent and the state is the
pixel value of the game screen.
• Pong-ramNoFrameskip-v0 and Pong-ramNoFrameskip-v4: This is similar
to PongNoFrameskip-v0 and PongNoFrameskip-v4, but here the state is the
RAM of the Atari machine.
Thus in the Atari environment, the state of our environment will be either the game
screen or the RAM of the Atari machine. Note that similar to the Pong game, all other
Atari games have the id in the same fashion in the Gym environment. For example,
suppose we want to create a deterministic Space Invaders environment; then we can
just create it with the id SpaceInvadersDeterministic-v0. Say we want to create a
Space Invaders environment with no frame skipping; then we can create it with the
id SpaceInvadersNoFrameskip-v0.
We can check out all the Atari game environments offered by Gym here: https://
gym.openai.com/envs/#atari.
State space
In this section, let's understand the state space of the Atari games in the Gym
environment. Let's learn this with the Pong game. We learned that in the Atari
environment, the state of the environment will be either the game screen's pixel
values or the RAM of the Atari machine. First, let's understand the state space where
the state of the environment is the game screen's pixel values.
env = gym.make("Pong-v0")
Here, the game screen is the state of our environment. So, we will just take the
image of the game screen as the state. However, we can't deal with the raw images
directly, so we will take the pixel values of the image (game screen) as our state. The
pixel array will have three dimensions, containing the image height, the image width,
and the number of channels.
Thus, the state of our environment will be an array containing the pixel values of the
game screen:
Note that the pixel values range from 0 to 255. In order to get the state space, we can
just type env.observation_space as the following shows:
print(env.observation_space)
Box(210, 160, 3)
This indicates that our state space is a 3D array with a shape of [210,160,3]. As we've
learned, 210 denotes the height of the image, 160 denotes the width of the image, and
3 represents the number of channels.
For example, we can reset our environment and see what the initial state looks
like. We can reset the environment using the reset function:
print(env.reset())
The preceding code will print an array representing the initial game screen's pixel
value.
Now, let's create a Pong environment where the state of our environment is the RAM
of the Atari machine instead of the game screen's pixel value:
env = gym.make("Pong-ram-v0")
print(env.observation_space)
Box(128,)
This implies that our state space is a 1D array containing 128 values. We can reset
our environment and see what the initial state looks like:
print(env.reset())
Note that this applies to all Atari games in the Gym environment, for example, if we
create a space invaders environment with the state of our environment as the game
screen's pixel value, then our state space will be a 3D array with a shape of Box(210,
160, 3). However, if we create the Space Invaders environment with the state of our
environment as the RAM of Atari machine, then our state space will be an array with
a shape of Box(128,).
Action space
Let's now explore the action space. In general, the Atari game environment has 18
actions in the action space, and the actions are encoded from 0 to 17 as shown in
Table 2.5:
Note that all the preceding 18 actions are not applicable to all the Atari game
environments and the action space varies from game to game. For instance, some
games use only the first six of the preceding actions as their action space, and some
games use only the first nine of the preceding actions as their action space, while
others use all of the preceding 18 actions. Let's understand this with an example
using the Pong game:
env = gym.make("Pong-v0")
print(env.action_space)
Discrete(6)
The code shows that we have 6 actions in the Pong action space, and the actions are
encoded from 0 to 5. So the possible actions in the Pong game are noop (no action),
fire, up, right, left, and down.
Let's now look at the action space of the Road Runner game. Just in case you have
not come across this game before, the game screen looks like this:
env = gym.make("RoadRunner-v0")
print(env.action_space)
Discrete(18)
This shows us that the action space in the Road Runner game includes all 18 actions.
import gym
env = gym.make('Tennis-v0')
env.render()
Set the number of episodes and the number of time steps in the episode:
num_episodes = 100
num_timesteps = 50
for i in range(num_episodes):
    Return = 0
    state = env.reset()

    for t in range(num_timesteps):
        env.render()
        random_action = env.action_space.sample()
        # Perform the random action and accumulate the reward into the return
        next_state, reward, done, info = env.step(random_action)
        Return = Return + reward
        if done:
            break

    if i%10==0:
        print('Episode: {}, Return: {}'.format(i, Return))
env.close()
The preceding code will output the return (sum of rewards) obtained over every 10
episodes:
To record the game, our system should support FFmpeg. FFmpeg is a framework
used for processing media files. So before moving ahead, make sure that your system
provides FFmpeg support.
We can record our game using the Monitor wrapper as the following code shows. It
takes three parameters: the environment; the directory where we want to save our
recordings; and the force option. If we set force = False, it implies that we need to
create a new directory every time we want to save new recordings, and when we
set force = True, old recordings in the directory will be cleared out and replaced by
new recordings:
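A minimal sketch of this wrapping (the directory name recording matches the output described below; the Monitor wrapper lives in gym.wrappers):

env = gym.wrappers.Monitor(env, 'recording', force=True)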
We just need to add the preceding line of code after creating our environment.
Let's take a simple example and see how the recordings work. Let's make our agent
randomly play the Tennis game for a single episode and record the agent's gameplay
as a video:
import gym

env = gym.make('Tennis-v0')
# Wrap the environment with the Monitor wrapper to record the gameplay
env = gym.wrappers.Monitor(env, 'recording', force=True)
env.reset()

for _ in range(5000):
    env.render()
    action = env.action_space.sample()
    next_state, reward, done, info = env.step(action)
    # Stop once the episode ends
    if done:
        break

env.close()
Once the episode ends, we will see a new directory called recording and we can find
the video file in MP4 format in this directory, which has our agent's gameplay as
shown in Figure 2.21:
Other environments
Apart from the classic control and the Atari game environments we've discussed,
Gym also provides several different categories of the environment. Let's find out
more about them.
Box2D
Box2D is a 2D physics simulator that is mainly used for training our agent to perform
continuous control tasks, such as walking. For example, Gym provides a Box2D
environment called BipedalWalker-v2, which we can use to train our agent to walk.
The BipedalWalker-v2 environment is shown in Figure 2.22:
We can check out several other Box2D environments offered by Gym here: https://
gym.openai.com/envs/#box2d.
MuJoCo
MuJoCo stands for Multi-Joint dynamics with Contact and is one of the most
popular simulators used for training our agent to perform continuous control
tasks. For example, MuJoCo provides an interesting environment called
HumanoidStandup-v2, which we can use to train our agent to stand up. The
HumanoidStandup-v2 environment is shown in Figure 2.23:
We can check out several other Mujoco environments offered by Gym here: https://
gym.openai.com/envs/#mujoco.
Robotics
Gym provides several environments for performing goal-based tasks for the fetch
and shadow hand robots. For example, Gym provides an environment called
HandManipulateBlock-v0, which we can use to train our agent to orient a box using a
robotic hand. The HandManipulateBlock-v0 environment is shown in Figure 2.24:
We can check out several other robotics environments offered by Gym here: https://
gym.openai.com/envs/#robotics.
Toy text
Toy text is the simplest text-based environment. We already learned about one such
environment at the beginning of this chapter, which is the Frozen Lake environment.
We can check out other interesting toy text environments offered by Gym here:
https://gym.openai.com/envs/#toy_text.
Algorithms
Instead of using our RL agent to play games, can we make use of our agent to
solve some interesting problems? Yes! The algorithmic environment provides
several interesting problems like copying a given sequence, performing addition,
and so on. We can make use of the RL agent to solve these problems by learning
how to perform computation. For instance, Gym provides an environment called
ReversedAddition-v0, which we can use to train our agent to add multiple digit
numbers.
Environment synopsis
We have learned about several types of Gym environment. Wouldn't it be nice if we
could have information about all the environments in a single place? Yes! The Gym
wiki provides a description of all the environments with their environment id, state
space, action space, and reward range in a table: https://github.com/openai/gym/
wiki/Table-of-environments.
We can also check all the available environments in Gym using the registry.all()
method:
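from gym import envs

print(envs.registry.all())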
The preceding code will print all the available environments in Gym.
Thus, in this chapter, we have learned about the Gym toolkit and also several
interesting environments offered by Gym. In the upcoming chapters, we will learn
how to train our RL agent in a Gym environment to find the optimal policy.
Summary
We started the chapter by understanding how to set up our machine by installing
Anaconda and the Gym toolkit. We learned how to create a Gym environment using
the gym.make() function. Later, we also explored how to obtain the state space of the
environment using env.observation_space and the action space of the environment
using env.action_space. We then learned how to obtain the transition probability
and reward function of the environment using env.P. Following this, we also learned
how to generate an episode using the Gym environment. We understood that in each
step of the episode we perform an action using the env.step() function.
In the next chapter, we will learn how to find the optimal policy using two
interesting algorithms called value iteration and policy iteration.
Questions
Let's evaluate our newly gained knowledge by answering the following questions:
Further reading
Check out the following resources for more information:
Chapter 3: The Bellman Equation and Dynamic Programming
In the previous chapter, we learned that in reinforcement learning our goal is to
find the optimal policy. The optimal policy is the policy that selects the correct
action in each state so that the agent can get the maximum return and achieve its
goal. In this chapter, we'll learn about two interesting classic reinforcement learning
algorithms called the value and policy iteration methods, which we can use to find
the optimal policy.
Before diving into the value and policy iteration methods directly, first, we will learn
about the Bellman equation. The Bellman equation is ubiquitous in reinforcement
learning and it is used for finding the optimal value and Q functions. We will
understand what the Bellman equation is and how it finds the optimal value and
Q functions.
After understanding the Bellman equation, we will learn about two interesting
dynamic programming methods called value and policy iterations, which use the
Bellman equation to find the optimal policy. At the end of the chapter, we will learn
how to solve the Frozen Lake problem by finding an optimal policy using the value
and policy iteration methods.
In this section, we'll learn what exactly the Bellman equation is and how we can
use it to find the optimal value and Q functions.
Let's understand the Bellman equation with an example. Say we generate a trajectory
$\tau$ using some policy $\pi$:
Let's suppose we need to compute the value of state s2. According to the Bellman
equation, the value of state s2 is given as:
$$V(s_2) = R(s_2, a_2, s_3) + \gamma V(s_3)$$
In the preceding equation, $R(s_2, a_2, s_3)$ implies the immediate reward we obtain
while performing action $a_2$ in state $s_2$ and moving to state $s_3$. From the trajectory,
we can tell that the immediate reward $R(s_2, a_2, s_3)$ is $r_2$. And the term $\gamma V(s_3)$ is the
discounted value of the next state.
Thus, according to the Bellman equation, the value of state s2 is given as:
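$$V(s_2) = r_2 + \gamma V(s_3)$$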
As we can see, when we perform an action a1 in state s1, with a probability 0.7, we
reach state s2, and with a probability 0.3, we reach state s3:
Thus, when we perform action a1 in state s1, there is a 70% chance the next state will
be s2 and a 30% chance the next state will be s3. We learned that the Bellman equation
is a sum of immediate reward and the discounted value of the next state. But when
our next state is not guaranteed due to the stochasticity present in the environment,
how can we define our Bellman equation?
In this case, we can slightly modify our Bellman equation with the expectations
(the weighted average), that is, a sum of the Bellman backup multiplied by the
corresponding transition probability of the next state:
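$$V(s) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V(s')\right]$$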
Let's understand this equation better by considering the same trajectory we just used.
As we notice, when we perform an action a1 in state s1, we go to s2 with a probability
of 0.70 and s3 with a probability of 0.30. Thus, we can write:
$$V(s_1) = P(s_2|s_1, a_1)\left[R(s_1, a_1, s_2) + \gamma V(s_2)\right] + P(s_3|s_1, a_1)\left[R(s_1, a_1, s_3) + \gamma V(s_3)\right]$$
Okay, but what if our policy is a stochastic policy? We learned that with a stochastic
policy, we select actions based on a probability distribution; that is, instead of
performing the same action in a state, we select an action based on the probability
distribution over the action space. Let's understand this with a different trajectory,
shown in Figure 3.3. As we see, in state s1, with a probability of 0.8, we select action a1
and reach state s2, and with a probability of 0.2, we select action a2 and reach state s3:
Thus, when we use a stochastic policy, our next state will not always be the same; it
will be different states with some probability. Now, how can we define the Bellman
equation including the stochastic policy?
Thus, our final Bellman equation of the value function can be written as:
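$$V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^{\pi}(s')\right] \quad (1)$$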
The preceding equation is also known as the Bellman expectation equation of the
value function. We can also express the above equation in expectation form. Let's
recollect the definition of expectation:
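$$\mathbb{E}_{x \sim p(x)}[f(x)] = \sum_{x} p(x)f(x)$$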
In equation (1), $f(x) = R(s, a, s') + \gamma V^{\pi}(s')$, and $p(x)$ corresponds to $P(s'|s, a)$ and $\pi(a|s)$,
which denote the probability of the stochastic environment and the stochastic policy,
respectively.
Thus, we can write the Bellman equation of the value function as:
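$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi,\, s' \sim P}\left[R(s, a, s') + \gamma V^{\pi}(s')\right] \quad (2)$$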
Let's understand this with an example. Say we generate a trajectory $\tau$ using some
policy $\pi$ as shown in Figure 3.4:
Let's suppose we need to compute the Q value of a state-action pair (s2, a2). Then,
according to the Bellman equation, we can write:
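$$Q^{\pi}(s_2, a_2) = R(s_2, a_2, s_3) + \gamma Q^{\pi}(s_3, a_3)$$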
Similar to what we learned in the Bellman equation of the value function, the
preceding Bellman equation works only when we have a deterministic environment
because in the stochastic environment our next state will not always be the same
and it will be based on a probability distribution. Suppose we have a stochastic
environment; then, when we perform an action a in state s, it is not guaranteed
that our next state will always be $s'$; it could be some other state too, with some
probability.
So, just like we did in the previous section, we can use the expectation (the weighted
average), that is, a sum of the Bellman backup multiplied by their corresponding
transition probability of the next state, and rewrite our Bellman equation of the Q
function as:
$$Q^{\pi}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma Q^{\pi}(s', a')\right]$$
Similarly, when we use a stochastic policy, our next state will not always be the
same; it will be different states with some probability. So, to include the stochastic
nature of the policy, we can rewrite our Bellman equation with the expectation (the
weighted average), that is, a sum of Bellman backup multiplied by the corresponding
probability of action, just like we did in the Bellman equation of the value function.
Thus, the Bellman equation of the Q function is given as:
$$Q^{\pi}(s, a) = \sum_{a} \pi(a|s) \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma Q^{\pi}(s', a')\right]$$
But wait! There is a small change in the above equation. Why do we need to add
the term $\sum_{a} \pi(a|s)$ in the case of a Q function? Because in the value function V(s),
we are given only a state s and we choose an action a based on the policy $\pi$. So,
we added the term $\sum_{a} \pi(a|s)$ to include the stochastic nature of the policy. But in
the case of the Q function Q(s, a), we will be given both state s and action a, so we
don't need to add the term $\sum_{a} \pi(a|s)$ in our equation since we are not selecting any
action a based on the policy $\pi$.
However, if you look at the above equation, we need to select action $a'$ based on the
policy $\pi$ while computing the Q value of the next state-action pair $Q(s', a')$ since
$a'$ will not be given. So, we can just place the term $\sum_{a'} \pi(a'|s')$ before the Q value of
the next state-action pair. Thus, our final Bellman equation of the Q function can
be written as:
$$Q^{\pi}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q^{\pi}(s', a')\right] \quad (3)$$
Equation (3) is also known as the Bellman expectation equation of the Q function. We
can also express the equation (3) in expectation form as:
$$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}(s', a')\right]\right] \quad (4)$$
Now that we have understood what the Bellman expectation equation is, in the next
section, we will learn about the Bellman optimality equation and explore how it is
useful for finding the optimal Bellman value and Q functions.
Okay, how can we compute the optimal Bellman value function that has the
maximum value?
We can compute the optimal Bellman value function by selecting the action that
gives the maximum value. But we don't know which action gives the maximum
value, so, we compute the value of state using all possible actions, and then we
select the maximum value as the value of the state.
That is, instead of using some policy $\pi$ to select the action, we compute the value
of the state using all possible actions, and then we select the maximum value as the
value of the state. Since we are not using any policy, we can remove the expectation
over the policy $\pi$ and add the max over the action and express our optimal Bellman
value function as:
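$$V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma V^{*}(s')\right]$$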
It's just the same as the Bellman equation, except here we are taking a maximum
over all the possible actions instead of the expectation (weighted average) over the
policy since we are only interested in the maximum value. Let's understand this
with an example. Say we are in a state s and we have two possible actions in the
state. Let the actions be 0 and 1. Then $V^{*}(s)$ is given as:
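$$V^{*}(s) = \max\left(\mathbb{E}_{s' \sim P}\left[R(s, 0, s') + \gamma V^{*}(s')\right],\; \mathbb{E}_{s' \sim P}\left[R(s, 1, s') + \gamma V^{*}(s')\right]\right)$$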
As we can observe from the above equation, we compute the state value using all
possible actions (0 and 1) and then select the maximum value as the value of the
state.
Now, let's look at the optimal Bellman Q function. We learned that the Bellman
equation of the Q function is expressed as:
$$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}(s', a')\right]\right]$$
Just like we learned with the optimal Bellman value function, instead of using the
policy to select action $a'$ in the next state $s'$, we choose all possible actions in that
state $s'$ and compute the maximum Q value. It can be expressed as:
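$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right]$$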
Let's understand this with an example. Say we are in a state s with an action a. We
perform action a in state s and reach the next state $s'$. We need to compute the Q
value for the next state $s'$. There can be many actions in state $s'$. Let's say we have
two actions 0 and 1 in state $s'$. Then we can write the optimal Bellman Q function as:
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma \max\left(Q^{*}(s', 0),\, Q^{*}(s', 1)\right)\right]$$
Thus, to summarize, the Bellman optimality equations of the value function and Q
function are:
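$$V^{*}(s) = \max_{a} \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma V^{*}(s')\right]$$
$$Q^{*}(s, a) = \mathbb{E}_{s' \sim P}\left[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right]$$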
We can also expand the expectation and rewrite the preceding Bellman optimality
equations as:
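$$V^{*}(s) = \max_{a} \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^{*}(s')\right]$$
$$Q^{*}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right]$$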
And the optimal Q function gives the maximum state-action value (Q value):
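$$Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$$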
Can we derive some relation between the optimal value function and optimal Q
function? We know that the optimal value function has the maximum expected
return when we start from a state s and the optimal Q function has the maximum
expected return when we start from state s performing some action a. So, we can say
that the optimal value function is the maximum of optimal Q value over all possible
actions, and it can be expressed as follows (that is, we can derive V from Q):
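$$V^{*}(s) = \max_{a} Q^{*}(s, a)$$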
Alright, now let's get back to our Bellman equations. Before going ahead, let's just
recap the Bellman equations:
• $Q^{\pi}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \sum_{a'} \pi(a'|s')\, Q^{\pi}(s', a')\right]$
• $Q^{*}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma \max_{a'} Q^{*}(s', a')\right]$
We learned that the optimal Bellman Q function is expressed as shown in the
preceding recap. If we have the optimal value function $V^{*}(s)$, then we can use it to
derive the optimal Bellman Q function (that is, we can derive Q from V):
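$$Q^{*}(s, a) = \sum_{s'} P(s'|s, a)\left[R(s, a, s') + \gamma V^{*}(s')\right]$$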
The preceding equation is one of the most useful identities in reinforcement learning,
and we will see how it will help us in finding the optimal policy in the upcoming
section.
As we can observe, we just obtained the optimal Bellman value function. Now that
we understand the Bellman equation and the relationship between the value and
the Q function, we can move on to the next section on how to make use of these
equations to find the optimal policy.
Dynamic programming
Dynamic programming (DP) is a technique for solving complex problems. In DP,
instead of solving a complex problem as a whole, we break the problem into simple
sub-problems, then for each sub-problem, we compute and store the solution. If the
same subproblem occurs, we don't recompute; instead, we use the already computed
solution. Thus, DP helps in drastically minimizing the computation time. It has its
applications in a wide variety of fields including computer science, mathematics,
bioinformatics, and so on.
Now, we will learn about two important methods that use DP to find the optimal
policy. The two methods are:
• Value iteration
• Policy iteration
Note that dynamic programming is a model-based method meaning that it will help
us to find the optimal policy only when the model dynamics (transition probability)
of the environment are known. If we don't have the model dynamics, we cannot
apply DP methods.
Value iteration
In the value iteration method, we try to find the optimal policy. We learned that
the optimal policy is the one that tells the agent to perform the correct action in
each state. In order to find the optimal policy, first, we compute the optimal value
function and once we have the optimal value function, we can use it to derive the
optimal policy. Okay, how can we compute the optimal value function? We can use
our optimal Bellman equation of the value function. We learned that, according to
the Bellman optimality equation, the optimal value function can be computed as:
In the The relationship between the value and Q functions section, we learned that given
the value function, we can derive the Q function:
Thus, we can compute the optimal value function by just taking the maximum over
the optimal Q function. So, in order to compute the value of a state, we compute the
Q value for all state-action pairs. Then, we select the maximum Q value as the value
of the state.
Let's understand this with an example. Say we have two states, s0 and s1, and we
have two possible actions in these states; let the actions be 0 and 1. First, we compute
the Q value for all possible state-action pairs. Table 3.1 shows the Q values for all
possible state-action pairs:
Then, in each state, we select the maximum Q value as the optimal value of a state.
Thus, the value of state s0 is 3 and the value of state s1 is 4. The optimal value of the
state (value function) is shown in Table 3.2:
Once we obtain the optimal value function, we can use it to extract the optimal policy.
Now that we have a basic understanding of how the value iteration method finds
the optimal value function, in the next section, we will go into detail and learn how
exactly the value iteration method works and how it finds the optimal policy from
the optimal value function.
1. Compute the optimal value function by taking the maximum over the Q
function, that is, $V^{*}(s) = \max_{a} Q^{*}(s, a)$
2. Extract the optimal policy from the computed optimal value function
Let's go into detail and learn exactly how the above two steps work. For better
understanding, let's perform the value iteration manually. Consider the small grid
world environment shown in Figure 3.5. Let's say we are in state A and our goal
is to reach state C without visiting the shaded state B, and say we have two actions,
0—left/right, and 1—up/down:
Can you think of what the optimal policy is here? The optimal policy here is the one
that tells us to perform action 1 in state A so that we can reach C without visiting B.
Now we will see how to find this optimal policy using value iteration.
That is, we compute the Q value for all state-action pairs and then we select the
maximum Q value as the value of a state.
$$Q(s, a) = \sum_{s'} P_{ss'}^{a}\left[R_{ss'}^{a} + \gamma V(s')\right] \quad (11)$$
Thus, using the preceding equation, we can compute the Q function. If you look
at the equation, to compute the Q function, we need the transition probability $P_{ss'}^{a}$,
the reward function $R_{ss'}^{a}$, and the value of the next state $V(s')$. The model dynamics
provide us with the transition probability $P_{ss'}^{a}$ and the reward function $R_{ss'}^{a}$. But what
about the value of the next state $V(s')$? We don't know the value of any states yet.
So, we will initialize the value function (state values) with random values or zeros as
shown in Table 3.4 and compute the Q function.
Iteration 1:
Let's compute the Q value of state A. We have two actions in state A, which are 0
and 1. So, first let's compute the Q value for state A and action 0 (note that we use
the discount factor $\gamma = 1$ throughout this section):
$$Q(A, 0) = P_{AA}^{0}\left[R_{AA}^{0} + \gamma V(A)\right] + P_{AB}^{0}\left[R_{AB}^{0} + \gamma V(B)\right] + P_{AC}^{0}\left[R_{AC}^{0} + \gamma V(C)\right]$$
$$= 0.1(0 + 0) + 0.8(-1 + 0) + 0.1(1 + 0) = -0.7$$
Now, let's compute the Q value for state A and action 1:
$$Q(A, 1) = P_{AA}^{1}\left[R_{AA}^{1} + \gamma V(A)\right] + P_{AB}^{1}\left[R_{AB}^{1} + \gamma V(B)\right] + P_{AC}^{1}\left[R_{AC}^{1} + \gamma V(C)\right]$$
$$= 0.1(0 + 0) + 0.0(-1 + 0) + 0.9(1 + 0) = 0.9$$
After computing the Q values for both the actions in state A, we can update the Q
table as shown in Table 3.5:
We learned that the optimal value of a state is just the max of the Q function. That
is, $V^{*}(s) = \max_{a} Q^{*}(s, a)$. By looking at Table 3.5, we can say that the value of state A,
V(A), is Q(A, 1) since Q(A, 1) has a higher value than Q(A, 0). Thus, V(A) = 0.9.
We can update the value of state A in our value table as shown in Table 3.6:
Similarly, in order to compute the value of state B, V(B), we compute the Q value of
Q(B, 0) and Q(B, 1) and select the highest Q value as the value of state B. In the same
way, to compute the values of other states, we compute the Q value for all state-
action pairs and select the maximum Q value as the value of a state.
After computing the value of all the states, our updated value table may resemble
Table 3.7. This is the result of the first iteration:
However, the value function (value table) shown in Table 3.7 obtained as a result of
the first iteration is not an optimal one. But why? We learned that the optimal value
function is the maximum of the optimal Q function. That is, $V^{*}(s) = \max_{a} Q^{*}(s, a)$.
Thus to find the optimal value function, we need the optimal Q function. But the
Q function may not be an optimal one in the first iteration as we computed the Q
function based on the randomly initialized state values.
As the following shows, when we started off computing the Q function, we used the
randomly initialized state values.
So, what we can do is, in the next iteration, while computing the Q function, we can
use the updated state values obtained as a result of the first iteration.
That is, in the second iteration, to compute the value function, we compute the
Q value of all state-action pairs and select the maximum Q value as the value
of a state. In order to compute the Q value, we need to know the state values;
in the first iteration, we used the randomly initialized state values, but in the
second iteration, we use the updated state values (value table) obtained from the
first iteration, as the following shows:
Iteration 2:
Let's compute the Q value of state A. Remember that while computing the Q value,
we use the updated state values from the previous iteration.
Q(A, 0) = P^{0}_{AA}[R^{0}_{AA} + \gamma V(A)] + P^{0}_{AB}[R^{0}_{AB} + \gamma V(B)] + P^{0}_{AC}[R^{0}_{AC} + \gamma V(C)]
        = 0.1(0 + 0.9) + 0.8(-1 - 0.2) + 0.1(1 + 0.5)
        = -0.72
Now, let's compute the Q value for state A and action 1:
Q(A, 1) = P^{1}_{AA}[R^{1}_{AA} + \gamma V(A)] + P^{1}_{AB}[R^{1}_{AB} + \gamma V(B)] + P^{1}_{AC}[R^{1}_{AC} + \gamma V(C)]
        = 0.1(0 + 0.9) + 0.0(-1 - 0.2) + 0.9(1 + 0.5)
        = 1.44
As we may observe, since the Q value of action 1 in state A is higher than action 0,
the value of state A becomes 1.44. Similarly, we compute the value for all the states
and update the value table. Table 3.8 shows the updated value table:
Iteration 3:
We repeat the same steps we saw in the previous iteration and compute the value
of all the states by selecting the maximum Q value. Remember that while computing
the Q value, we use the updated state values (value table) obtained from the
previous iteration. So, we use the updated state values from iteration 2 to compute
the Q value.
Table 3.9 shows the updated state values obtained as a result of the third iteration:
So, we repeat these steps for many iterations until we find the optimal value
function. But how can we understand whether we have found the optimal value
function or not? When the value function (value table) does not change over
iterations or when it changes by a very small fraction, then we can say that we
have attained convergence, that is, we have found an optimal value function.
Okay, how can we find out whether the value table is changing or not changing
from the previous iteration? We can calculate the difference between the value table
obtained from the previous iteration and the value table obtained from the current
iteration. If the difference is very small—say, the difference is less than a very small
threshold number—then we can say that we have attained convergence as there is
not much change in the value function.
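As a small illustration (with made-up numbers), the check can be written in a couple of lines of NumPy; the implementation later in this chapter uses the same idea with a much smaller threshold:

import numpy as np

# Hypothetical value tables from two consecutive iterations
previous_value_table = np.array([1.94, -0.73, 1.28])
current_value_table = np.array([1.95, -0.72, 1.30])

threshold = 0.05   # an illustrative threshold

# If the total absolute change is below the threshold, we treat it as converged
if np.sum(np.fabs(current_value_table - previous_value_table)) <= threshold:
    print('Converged')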
For example, let's suppose Table 3.10 shows the value table obtained as a result of
iteration 4:
As we can notice, the difference between the value table obtained as a result of
iteration 4 and iteration 3 is very small. So, we can say that we have attained
convergence and we take the value table obtained as a result of iteration 4 as
our optimal value function. Please note that the above example is just for better
understanding; in practice, we cannot attain convergence in just four iterations—it
usually takes many iterations.
Now that we have found the optimal value function, in the next step, we will use this
optimal value function to extract an optimal policy.
Now, how can we extract the optimal policy from the obtained optimal value
function?
We generally use the Q function to compute the policy. We know that the Q function
gives the Q value for every state-action pair. Once we have the Q values for all state-
action pairs, we extract the policy by selecting the action that has the maximum Q
value in each state. For example, consider the Q table in Table 3.12. It shows the Q
values for all state-action pairs. Now we can extract the policy from the Q function
(Q table) by selecting action 1 in the state s0 and action 0 in the state s1 as they have
the maximum Q value.
Okay, now we compute the Q function using the optimal value function obtained
from Step 1. Once we have the Q function, then we extract the policy by selecting the
action that has the maximum Q value in each state. Since we are computing the Q
function using the optimal value function, the policy extracted from the Q function
will be the optimal policy.
Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]
Now, while computing Q values, we use the optimal value function we obtained
from step 1. After computing the Q function, we can extract the optimal policy by
selecting the action that has the maximum Q value:
For instance, let's compute the Q value for all actions in state A using the optimal
value function. The Q value for action 0 in state A is computed as:
Q(A, 0) = P^{0}_{AA}[R^{0}_{AA} + \gamma V(A)] + P^{0}_{AB}[R^{0}_{AB} + \gamma V(B)] + P^{0}_{AC}[R^{0}_{AC} + \gamma V(C)]
        = 0.1(0 + 1.95) + 0.8(-1 - 0.72) + 0.1(1 + 1.3)
        = -0.951
The Q value for action 1 in state A is computed as:
Q(A, 1) = P^{1}_{AA}[R^{1}_{AA} + \gamma V(A)] + P^{1}_{AB}[R^{1}_{AB} + \gamma V(B)] + P^{1}_{AC}[R^{1}_{AC} + \gamma V(C)]
        = 0.1(0 + 1.95) + 0.0(-1 - 0.72) + 0.9(1 + 1.3)
        = 2.26
Since Q(A, 1) is higher than Q(A, 0), our optimal policy will select action 1 as the
optimal action in state A. Table 3.13 shows the Q table after computing the Q values
for all state-action pairs using the optimal value function:
From this Q table, we pick the action in each state that has the maximum value as an
optimal policy. Thus, our optimal policy would select action 1 in state A, action 1 in
state B, and action 1 in state C.
Thus, according to our optimal policy, if we perform action 1 in state A, we can reach
state C without visiting state B.
In this section, we learned how to compute the optimal policy using the value
iteration method. In the next section, we will learn how to implement the value
iteration method to compute the optimal policy in the Frozen Lake environment
using the Gym toolkit.
Let's recap the Frozen Lake environment a bit. In the Frozen Lake environment
shown in Figure 3.6, the following applies: S is the starting state, F denotes the
frozen states the agent can safely walk on, H denotes the hole states, and G is the
goal state.
We learned that in the Frozen Lake environment, our goal is to reach the goal state
G from the starting state S without visiting the hole states H. That is, while trying to
reach the goal state G from the starting state S, if the agent visits the hole states H,
then it will fall into the hole and die as Figure 3.7 shows:
So, we want the agent to avoid the hole states H to reach the goal state G as shown in
the following:
How can we achieve this goal? That is, how can we reach state G from S without
visiting H? We learned that the optimal policy tells the agent to perform the correct
action in each state. So, if we find the optimal policy, then we can reach state G
from S without visiting state H. Okay, how can we find the optimal policy? We
can use the value iteration method we just learned to find the optimal policy.
Remember that all 16 states (S to G) will be encoded from 0 to 15, and all four
actions (left, down, right, and up) will be encoded from 0 to 3 in the Gym toolkit.
In this section, we will learn how to find the optimal policy using the value iteration
method so that the agent can reach state G from S without visiting H.
import gym
import numpy as np
env = gym.make('FrozenLake-v0')
Let's look at the Frozen Lake environment using the render function:
env.render()
As we can notice, our agent is in state S and it has to reach state G without visiting
the H states. So, let's learn how to compute the optimal policy using the value
iteration method.
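Before writing the value iteration function, it may help to peek at how Gym exposes the model dynamics we will rely on. The env.P[s][a] attribute holds a list of (transition probability, next state, reward, done) tuples for performing action a in state s, and this is exactly what the upcoming code iterates over (a quick sketch, assuming the env object created above):

print(env.observation_space.n)   # 16 states, encoded from 0 to 15
print(env.action_space.n)        # 4 actions, encoded from 0 to 3
print(env.P[0][1])               # dynamics of performing action 1 (down) in state 0

We will use env.P in exactly this form in the following two steps.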
1. Compute the optimal value function by taking the maximum over the Q
function, that is, V^*(s) = \max_a Q^*(s, a)
2. Extract the optimal policy from the computed optimal value function
First, let's learn how to compute the optimal value function, and then we will see
how to extract the optimal policy from the computed optimal value function.
def value_iteration(env):
num_iterations = 1000
Set the threshold number for checking the convergence of the value function:
threshold = 1e-20
gamma = 1.0
Now, we will initialize the value table by setting the value of all states to zero:
value_table = np.zeros(env.observation_space.n)
for i in range(num_iterations):
Update the value table, that is, we learned that on every iteration, we use the
updated value table (state values) from the previous iteration:
updated_value_table = np.copy(value_table)
Now, we compute the value function (state value) by taking the maximum of the Q
values, V^*(s) = \max_a Q^*(s, a), where:

Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]
Thus, for each state, we compute the Q values of all the actions in the state and then
we update the value of the state as the one that has the maximum Q value:
for s in range(env.observation_space.n):
Compute the Q value of all the actions, Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]:
Q_values = [sum([prob*(r + gamma * updated_value_table[s_])
for prob, s_, r, _ in env.P[s][a]])
for a in range(env.action_space.n)]
Update the value of the state as the maximum Q value, V^*(s) = \max_a Q^*(s, a):
value_table[s] = max(Q_values)
After computing the value table, that is, the value of all the states, we check whether
the difference between the value table obtained in the current iteration and the
previous iteration is less than or equal to a threshold value. If the difference is less
than the threshold, then we break the loop and return the value table as our optimal
value function as the following code shows:
    if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
        break

return value_table
The complete snippet of the value_iteration function is shown here to give us more
clarity:

def value_iteration(env):
    num_iterations = 1000
    threshold = 1e-20
    gamma = 1.0
    value_table = np.zeros(env.observation_space.n)
    for i in range(num_iterations):
        updated_value_table = np.copy(value_table)
        for s in range(env.observation_space.n):
            Q_values = [sum([prob*(r + gamma * updated_value_table[s_])
                             for prob, s_, r, _ in env.P[s][a]])
                        for a in range(env.action_space.n)]
            value_table[s] = max(Q_values)
        if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
            break
    return value_table
Now that we have computed the optimal value function by taking the maximum
of the Q values, let's see how to extract the optimal policy from the optimal value
function.
def extract_policy(value_table):
gamma = 1.0
First, we initialize the policy with zeros, that is, we set the actions for all the states to
be zero:
policy = np.zeros(env.observation_space.n)
Now, we compute the Q function using the optimal value function obtained from
the previous step. We learned that the Q function can be computed as:
Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]
After computing the Q function, we can extract the policy by selecting the action that
has the maximum Q value. Since we are computing the Q function using the optimal
value function, the policy extracted from the Q function will be the optimal policy.
As the following code shows, for each state, we compute the Q values for all the
actions in the state and then we extract the policy by selecting the action that has the
maximum Q value.
for s in range(env.observation_space.n):
Compute the Q value of all the actions in the state, Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]:
Q_values = [sum([prob*(r + gamma * value_table[s_])
for prob, s_, r, _ in env.P[s][a]])
for a in range(env.action_space.n)]
Extract the policy by selecting the action that has the maximum Q value,
\pi^* = \arg\max_a Q(s, a):
policy[s] = np.argmax(np.array(Q_values))
return policy
The complete snippet of the extract_policy function is shown here to give us more
clarity:
def extract_policy(value_table):
    gamma = 1.0
    policy = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        Q_values = [sum([prob*(r + gamma * value_table[s_])
                         for prob, s_, r, _ in env.P[s][a]])
                    for a in range(env.action_space.n)]
        policy[s] = np.argmax(np.array(Q_values))
    return policy
That's it! Now, we will see how to extract the optimal policy in our Frozen Lake
environment.
First, we compute the optimal value function using our value_iteration function by
passing our Frozen Lake environment as the parameter:
optimal_value_function = value_iteration(env)
Next, we extract the optimal policy from the optimal value function using our
extract_policy function:
optimal_policy = extract_policy(optimal_value_function)
print(optimal_policy)
The preceding code will print the following. As we can observe, our optimal policy
tells us to perform the correct action in each state:
[0. 3. 3. 3. 0. 0. 0. 0. 3. 1. 0. 0. 0. 2. 1. 0.]
Now that we have learned what value iteration is and how to perform the value
iteration method to compute the optimal policy in our Frozen Lake environment,
in the next section, we will learn about another interesting method, called policy
iteration.
Policy iteration
In the value iteration method, first, we computed the optimal value function by
taking the maximum over the Q function (Q values) iteratively. Once we found the
optimal value function, we used it to extract the optimal policy. In policy iteration,
on the other hand, we try to compute the optimal value function using the policy
iteratively; once we have found the optimal value function, we can use it to extract
the optimal policy.
First, let's learn how to compute the value function using a policy. Say we have a
policy \pi; how can we compute the value function using the policy \pi? Here, we can
use our Bellman equation. We learned that, according to the Bellman equation, we
can compute the value function using the policy \pi as the following shows:

V^{\pi}(s) = \sum_{a} \pi(a|s) \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]

Let's suppose our policy is a deterministic policy, so we can remove the term
\sum_{a} \pi(a|s) from the preceding equation, since there is no stochasticity in the policy,
and rewrite our Bellman equation as:

V^{\pi}(s) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]
Thus using the above equation we can compute the value function using a policy.
Our goal is to find the optimal value function because once we have found the
optimal value function, we can use it to extract the optimal policy.
We will not be given any policy as an input. So, we will initialize the random policy
and compute the value function using the random policy. Then we check if the
computed value function is optimal or not. It will not be optimal since it is computed
based on the random policy.
So, we will extract a new policy from the computed value function, then we will use
the extracted new policy to compute the new value function, and then we will check
if the new value function is optimal. If it's optimal we will stop, else we repeat these
steps for a series of iterations. For a better understanding, look at the following steps:
Iteration 1: Let \pi_0 be the random policy. We use this random policy to compute the
value function V^{\pi_0}. Our value function will not be optimal as it is computed based
on the random policy. So, from V^{\pi_0}, we extract a new policy \pi_1.

Iteration 2: Now, we use the new policy \pi_1 derived from the previous iteration to
compute the new value function V^{\pi_1}, then we check if V^{\pi_1} is optimal. If it is optimal,
we stop, else from this value function V^{\pi_1}, we extract a new policy \pi_2.

Iteration 3: Now, we use the new policy \pi_2 derived from the previous iteration to
compute the new value function V^{\pi_2}, then we check if V^{\pi_2} is optimal. If it is optimal,
we stop, else from this value function V^{\pi_2}, we extract a new policy \pi_3.

We repeat this process for many iterations until we find the optimal value function
V^{\pi^*}, as the following shows:

\pi_0 \rightarrow V^{\pi_0} \rightarrow \pi_1 \rightarrow V^{\pi_1} \rightarrow \pi_2 \rightarrow V^{\pi_2} \rightarrow \pi_3 \rightarrow V^{\pi_3} \rightarrow \cdots \rightarrow \pi^* \rightarrow V^{\pi^*}
The preceding step is called policy evaluation and improvement. Policy evaluation
implies that at each step we evaluate the policy by checking if the value function
computed using that policy is optimal. Policy improvement means that at each step
we find the new improved policy to compute the optimal value function.
Once we have found the optimal value function V^{\pi^*}, it implies that we have
also found the optimal policy. That is, if V^{\pi^*} is optimal, then the policy that is used to
compute V^{\pi^*} will be an optimal policy.
To get a better understanding of how policy iteration works, let's look into the below
steps with pseudocode. In the first iteration, we will initialize a random policy and
use it to compute the value function:
policy = random_policy
value_function = compute_value_function(policy)
Since we computed the value function using the random policy, the computed value
function will not be optimal. So, we need to find a new policy with which we can
compute the optimal value function.
So, we extract a new policy from the value function computed using a random
policy:
new_policy = extract_policy(value_function)
Now, we will use this new policy to compute the new value function:
policy = new_policy
value_function = compute_value_function(policy)
If the new value function is optimal, we stop, else we repeat the preceding steps
for a number of iterations until we find the optimal value function. The following
pseudocode gives us a better understanding:
policy = random_policy
for i in range(num_iterations):
    value_function = compute_value_function(policy)
    new_policy = extract_policy(value_function)
    if value_function is optimal:
        break
    else:
        policy = new_policy
Wait! How do we say our value function is optimal? If the value function is not
changing over iterations, then we can say that our value function is optimal. Okay,
how can we check if the value function is not changing over iterations?
We learned that we compute the value function using a policy. If the policy is not
changing over iterations, then our value function also doesn't change over the
iterations. Thus, when the policy doesn't change over iterations, then we can say
that we have found the optimal value function.
Thus, over a series of iterations when the policy and new policy become the same,
then we can say that we obtained the optimal value function. The following final
pseudocode is given for clarity:
policy = random_policy
for i in range(num_iterations):
    value_function = compute_value_function(policy)
    new_policy = extract_policy(value_function)
    if policy == new_policy:
        break
    else:
        policy = new_policy
Thus, when the policy is not changing, that is, when the policy and new policy
become the same, then we can say that we obtained the optimal value function and
the policy that is used to compute the optimal value function will be the optimal
policy.
Remember that in the value iteration method, we compute the optimal value
function using the maximum over Q function (Q value) iteratively and once we
have found the optimal value function, we extract the optimal policy from it. But
in the policy iteration method, we compute the optimal value function using the
policy iteratively and once we have found the optimal value function, then the
policy that is used to compute the optimal value function will be the optimal policy.
Now that we have a basic understanding of how the policy iteration method works,
in the next section, we will get into the details and learn how to compute policy
iteration manually.
Now, let's get into the details and learn how exactly the preceding steps work. For a
clear understanding, let's perform policy iteration manually. Let's take the same grid
world environment we used in the value iteration method. Let's say we are in state A
and our goal is to reach state C without visiting the shaded state B, and say we have
two actions, 0 – left/right and 1 – up/down:
We know that in the above environment, the optimal policy is the one that tells us to
perform action 1 in state A so that we can reach C without visiting B. Now, we will
see how to find this optimal policy using policy iteration.
We start by initializing a random policy, which maps each of the states A, B, and C
to one of the two actions.
To understand this step better, let's quickly recollect how we compute the value
function in value iteration. In value iteration, we compute the optimal value
function as the maximum over the optimal Q function as the following shows:
In policy iteration, we compute the value function using a policy \pi, unlike value
iteration, where we computed the value function using the maximum over the Q
function. The value function using a policy \pi can be computed as:

V^{\pi}(s) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V^{\pi}(s')]

If you look at the preceding equation, to compute the value function, we need
the transition probability P^{a}_{ss'}, the reward function R^{a}_{ss'}, and the value of the next
state V(s'). The transition probability P^{a}_{ss'} and the reward function R^{a}_{ss'} can be
obtained from the model dynamics. But what about the value of the next state V(s')?
We don't know the value of any states yet. So, we will initialize the value function
(state values) with random values or zeros as Figure 3.15 shows and compute the
value function:
Iteration 1:
Let's compute the value of state A (note that here, we only compute the value for the
action given by the policy, unlike in value iteration, where we computed the Q value
for all the actions in the state and selected the maximum value).
So, the action given by the policy for state A is 1, and we can compute the value
of state A as the following shows (note that we use the discount factor \gamma = 1
throughout this section):

V(A) = P^{1}_{AA}[R^{1}_{AA} + \gamma V(A)] + P^{1}_{AB}[R^{1}_{AB} + \gamma V(B)] + P^{1}_{AC}[R^{1}_{AC} + \gamma V(C)]
     = 0.1(0 + 0) + 0.0(-1 + 0) + 0.9(1 + 0)
     = 0.9
Similarly, we compute the value for all the states using the action given by the policy.
Table 3.16 shows the updated state values obtained as a result of the first iteration:
However, the value function (value table) shown in Table 3.16 obtained as a result of
the first iteration will not be accurate. That is, the state values (value function) will
not be accurate according to the given policy.
Note that unlike the value iteration method, here we are not checking whether
our value function is optimal or not; we just check whether our value function is
accurately computed according to the given policy.
The value function will not be accurate because when we started off computing the
value function using the given policy, we used the randomly initialized state values:
So, in the next iteration, while computing the value function, we will use the updated
state values obtained as a result of the first iteration:
Iteration 2:
Now, in iteration 2, we compute the value function using the policy 𝜋𝜋. Remember
that while computing the value function, we will use the updated state values (value
table) obtained from iteration 1.
V(A) = P^{1}_{AA}[R^{1}_{AA} + \gamma V(A)] + P^{1}_{AB}[R^{1}_{AB} + \gamma V(B)] + P^{1}_{AC}[R^{1}_{AC} + \gamma V(C)]
     = 0.1(0 + 0.9) + 0.0(-1 - 0.2) + 0.9(1 + 0.1)
     = 1.08
Similarly, we compute the value for all the states using the action given by the policy.
Table 3.17 shows the updated state values obtained as a result of the second iteration:
Iteration 3:
Similarly, in iteration 3, we compute the value function using the policy 𝜋𝜋 and while
computing the value function, we will use the updated state values (value table)
obtained from iteration 2.
Table 3.18 shows the updated state values obtained from the third iteration:
We repeat this for many iterations until the value table does not change or changes
very little over iterations. For example, let's suppose Table 3.19 shows the value table
obtained as a result of iteration 4:
As we can see, the difference between the value tables obtained from iteration 4 and
iteration 3 is very small. So, we can say that the value table is not changing much
over iterations and we stop at this iteration and take this as our final value function.
Okay, how can we extract a new policy from the value function? (Hint: This step is
exactly the same as how we extracted a policy given the value function in step 2 of
the value iteration method.)
In order to extract a new policy, we compute the Q function using the value function
(value table) obtained from the previous step. Once we compute the Q function,
we pick up actions in each state that have the maximum value as a new policy.
We know that the Q function can be computed as:
Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]
Now, while computing Q values, we use the value function we obtained from the
previous step.
For instance, let's compute the Q value for all actions in state A using the value
function obtained from the previous step. The Q value for action 0 in state A is
computed as:
Q(A, 0) = P^{0}_{AA}[R^{0}_{AA} + \gamma V(A)] + P^{0}_{AB}[R^{0}_{AB} + \gamma V(B)] + P^{0}_{AC}[R^{0}_{AC} + \gamma V(C)]
        = 0.1(0 + 1.46) + 0.8(-1 - 0.9) + 0.1(1 + 0.61)
        = -1.21
The Q value for action 1 in state A is computed as:
Q(A, 1) = P^{1}_{AA}[R^{1}_{AA} + \gamma V(A)] + P^{1}_{AB}[R^{1}_{AB} + \gamma V(B)] + P^{1}_{AC}[R^{1}_{AC} + \gamma V(C)]
        = 0.1(0 + 1.46) + 0.0(-1 - 0.9) + 0.9(1 + 0.61)
        = 1.59
Table 3.21 shows the Q table after computing the Q values for all state-action pairs:
From this Q table, we pick up actions in each state that have the maximum value as a
new policy.
That is, the new policy selects action 1 in state A (since Q(A, 1) is higher than
Q(A, 0)), and likewise picks the highest-valued action in states B and C.
Thus, in this section, we learned how to compute the optimal policy using the policy
iteration method. In the next section, we will learn how to implement the policy
iteration method to compute the optimal policy in the Frozen Lake environment
using the Gym toolkit.
import gym
import numpy as np
env = gym.make('FrozenLake-v0')
We learned that in the policy iteration, we compute the value function using the
policy iteratively. Once we have found the optimal value function, then the policy
that is used to compute the optimal value function will be the optimal policy.
So, first, let's learn how to compute the value function using the policy.
def compute_value_function(policy):
num_iterations = 1000
threshold = 1e-20
gamma = 1.0
Now, we will initialize the value table by setting all the state values to zero:
value_table = np.zeros(env.observation_space.n)
for i in range(num_iterations):
Update the value table; that is, we learned that on every iteration, we use the
updated value table (state values) from the previous iteration:
updated_value_table = np.copy(value_table)
Now, we compute the value function using the given policy. We learned that the
value function can be computed according to some policy \pi as follows:

V^{\pi}(s) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V^{\pi}(s')]
Thus, for each state, we select the action according to the policy and then we update
the value of the state using the selected action as follows.
for s in range(env.observation_space.n):
    a = policy[s]
    value_table[s] = sum([prob * (r + gamma * updated_value_table[s_])
                          for prob, s_, r, _ in env.P[s][a]])
After computing the value table, that is, the value of all the states, we check whether
the difference between the value table obtained in the current iteration and the
previous iteration is less than or equal to a threshold value. If it is less, then we break
the loop and return the value table as an accurate value function of the given policy:
    if (np.sum(np.fabs(updated_value_table - value_table)) <= threshold):
        break

return value_table
Now that we have computed the value function of the policy, let's see how to extract
the policy from the value function.
def extract_policy(value_table):
    gamma = 1.0
    policy = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        Q_values = [sum([prob*(r + gamma * value_table[s_])
                         for prob, s_, r, _ in env.P[s][a]])
                    for a in range(env.action_space.n)]
        policy[s] = np.argmax(np.array(Q_values))
    return policy
def policy_iteration(env):
    num_iterations = 1000
    policy = np.zeros(env.observation_space.n)
    for i in range(num_iterations):
        value_function = compute_value_function(policy)
        new_policy = extract_policy(value_function)
If policy and new_policy are the same, then break the loop:
        if (np.all(policy == new_policy)):
            break
        policy = new_policy
    return policy
Now, let's learn how to perform policy iteration and find the optimal policy in
the Frozen Lake environment. So, we just feed the Frozen Lake environment to our
policy_iteration function as shown here and get the optimal policy:
optimal_policy = policy_iteration(env)
print(optimal_policy)
array([0., 3., 3., 3., 0., 0., 0., 0., 3., 1., 0., 0., 0., 2., 1., 0.])
As we can observe, our optimal policy tells us to perform the correct action in each
state. Thus, we learned how to perform the policy iteration method to compute the
optimal policy.
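As a quick sanity check (this is not part of the book's code), we can roll out the obtained policy for a number of episodes and count how often the agent actually reaches the goal state G. Since the Frozen Lake environment is slippery, the agent will not succeed in every episode even under the optimal policy:

num_episodes = 1000
num_success = 0.0

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = int(optimal_policy[state])
        state, reward, done, info = env.step(action)
    # in Frozen Lake, the reward is 1.0 only when the goal state is reached
    num_success += reward

print(num_success / num_episodes)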
Value iteration: In the value iteration method, we compute the optimal value
function by taking the maximum over the Q function (Q values) iteratively:
V^*(s) = \max_a Q^*(s, a)

where Q(s, a) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]. After finding the optimal value function,
we extract the optimal policy from it.
Policy iteration: In the policy iteration method, we compute the optimal value
function using the policy iteratively:
V^{\pi}(s) = \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V^{\pi}(s')]
We will start off with the random policy and compute the value function. Once
we have found the optimal value function, then the policy that is used to create the
optimal value function will be the optimal policy.
If you look at the preceding two equations, in order to find the optimal policy, we
compute the value function and the Q function. But to compute the value and Q
functions, we need to know the transition probability P^{a}_{ss'} of the environment, and
when we don't know the transition probability of the environment, we cannot
compute the value and Q functions in order to find the optimal policy.
That is, dynamic programming is a model-based method and to apply this method,
we need to know the model dynamics (transition probability) of the environment,
and when we don't know the model dynamics, we cannot apply the dynamic
programming method.
Okay, how can we find the optimal policy when we don't know the model dynamics
of the environment? In such a case, we can use model-free methods. In the next
chapter, we will learn about one of the interesting model-free methods, called
Monte Carlo, and how it is used to find the optimal policy without requiring the
model dynamics.
Summary
We started off the chapter by understanding the Bellman equation of the value
and Q functions. We learned that, according to the Bellman equation, the value
of a state is the sum of the immediate reward, and the discounted value of the
next state and the value of a state-action pair is the sum of the immediate reward
and the discounted value of the next state-action pair. Then we learned about the
optimal Bellman value function and the Q function, which gives the maximum value.
Moving forward, we learned about the relation between the value and Q functions.
We learned that the value function can be extracted from the Q function as
V^*(s) = \max_a Q^*(s, a), and then we learned that the Q function can be extracted
from the value function as:

Q^*(s, a) = \sum_{s'} P(s'|s, a) [R(s, a, s') + \gamma V^*(s')]
Later we learned about two interesting methods called value iteration and policy
iteration, which use dynamic programming to find the optimal policy.
In the value iteration method, first, we compute the optimal value function by taking
the maximum over the Q function iteratively. Once we have found the optimal value
function, then we use it to extract the optimal policy. In the policy iteration method,
we try to compute the optimal value function using the policy iteratively. Once we
have found the optimal value function, then the policy that is used to create the
optimal value function will be extracted as the optimal policy.
Questions
Let's try answering the following questions to assess our knowledge of what we
learned in this chapter:
Chapter 4: Monte Carlo Methods
In the previous chapter, we learned how to compute the optimal policy using
two interesting dynamic programming methods called value and policy iteration.
Dynamic programming is a model-based method and it requires the model dynamics
of the environment to compute the value and Q functions in order to find the
optimal policy.
But let's suppose we don't have the model dynamics of the environment. In that
case, how do we compute the value and Q functions? Here is where we use model-
free methods. Model-free methods do not require the model dynamics of the
environment to compute the value and Q functions in order to find the optimal
policy. One such popular model-free method is the Monte Carlo (MC) method.
We will begin the chapter by understanding what the MC method is, then we
will look into two important types of tasks in reinforcement learning called
prediction and control tasks. Later, we will learn how the Monte Carlo method is
used in reinforcement learning and how it is beneficial compared to the dynamic
programming method we learned about in the previous chapter. Moving forward,
we will understand what the MC prediction method is and the different types of
MC prediction methods. We will also learn how to train an agent to play blackjack
with the MC prediction method.
Going ahead, we will learn about the Monte Carlo control method and different
types of Monte Carlo control methods. Following this, we will learn how to train
an agent to play blackjack with the MC control method.
For instance, the Monte Carlo method approximates the expectation of a random
variable by sampling, and when the sample size is greater, the approximation will
be better. Let's suppose we have a random variable X and say we need to compute
the expected value of X, that is, E(X), then we can compute it by taking the sum of the
values of X multiplied by their respective probabilities as follows:

E(X) = \sum_{i=1}^{N} x_i p(x_i)
But instead of computing the expectation like this, can we approximate it with the
Monte Carlo method? Yes! We can estimate the expected value of X by just sampling
the values of X for some N times and compute the average value of X as the expected
value of X as follows:
1
𝔼𝔼𝑥𝑥𝑥𝑥𝑥𝑥𝑥𝑥𝑥 [𝑋𝑋] ≈ ∑ 𝑥𝑥𝑖𝑖
𝑁𝑁
𝑖𝑖
When N is larger our approximation will be better. Thus, with the Monte Carlo
method, we can approximate the solution through sampling and our approximation
will be better when the sample size is large.
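For instance, here is a minimal sketch of this idea using a fair six-sided die, whose true expected value is 3.5; as the number of samples N grows, the sample average approaches it:

import numpy as np

for N in [10, 100, 10000]:
    samples = np.random.randint(1, 7, size=N)   # N rolls of a fair die
    print(N, samples.mean())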
In the upcoming sections, we will learn how exactly the Monte Carlo method is used
in reinforcement learning.
Prediction task
In the prediction task, a policy \pi is given as an input and we try to predict the value
function or Q function using the given policy. But what is the use of doing this? Our
goal is to evaluate the given policy. That is, we need to determine whether the given
policy is good or bad. How can we determine that? If the agent obtains a good return
using the given policy, then we can say that our policy is good. Thus, to evaluate
the given policy, we need to understand what return the agent would obtain
if it uses the given policy. To obtain the return, we predict the value function or Q
function using the given policy.
That is, we learned that the value function or value of a state denotes the expected
return an agent would obtain starting from that state and following some policy \pi.
Thus, by predicting the value function using the given policy \pi, we can understand
the expected return the agent would obtain in each state if it uses the given policy \pi.
If the return is good, then we can say that the given policy is good.
Similarly, we learned that the Q function or Q value denotes the expected return the
agent would obtain starting from the state s and performing an action a, following
the policy \pi. Thus, by predicting the Q function using the given policy \pi, we can
understand the expected return the agent would obtain in each state-action pair if it
uses the given policy. If the return is good, then we can say that the given policy is
good.

Thus, we can evaluate the given policy \pi by computing the value and Q functions.
Note that, in the prediction task, we don't make any change to the given input policy.
We keep the given policy as fixed and predict the value function or Q function using
the given policy and obtain the expected return. Based on the expected return, we
evaluate the given policy.
Control task
Unlike the prediction task, in the control task, we will not be given any policy as
an input. In the control task, our goal is to find the optimal policy. So, we will start
off by initializing a random policy and we try to find the optimal policy iteratively.
That is, we try to find an optimal policy that gives the maximum return.
Thus, in a nutshell, in the prediction task, we evaluate the given input policy by
predicting the value function or Q function, which helps us to understand the
expected return an agent would get if it uses the given policy, while in the control
task our goal is to find the optimal policy and we will not be given any policy as
input; so we will start off by initializing a random policy and we try to find the
optimal policy iteratively.
Now that we have understood what prediction and control tasks are, in the next
section, we will learn how to use the Monte Carlo method for performing the
prediction and control tasks.
Why do we need the Monte Carlo method for predicting the value function of
the given policy? Why can't we predict the value function using the dynamic
programming methods we learned about in the previous chapter? We learned that
in order to compute the value function using the dynamic programming method,
we need to know the model dynamics (transition probability), and when we don't
know the model dynamics, we use the model-free methods.
The Monte Carlo method is a model-free method, meaning that it doesn't require the
model dynamics to compute the value function.
First, let's recap the definition of the value function. The value function or the value
of the state s can be defined as the expected return the agent would obtain starting
from the state s and following the policy \pi. It can be expressed as:

V^{\pi}(s) = E_{\tau \sim \pi}[R(\tau) \,|\, s_0 = s]
Okay, how can we estimate the value of the state (value function) using the Monte
Carlo method? At the beginning of the chapter, we learned that the Monte Carlo
method approximates the expected value of a random variable by sampling, and
when the sample size is greater, the approximation will be better. Can we leverage
this concept of the Monte Carlo method to predict the value of a state? Yes!
In order to approximate the value of a state using the Monte Carlo method, we
sample episodes (trajectories) following the given policy \pi for some N times, and then
we compute the value of the state as the average return of the state across the sampled
episodes. It can be expressed as:

V(s) \approx \frac{1}{N} \sum_{i=1}^{N} R_i(s)
From the preceding equation, we can understand that the value of a state s can be
approximated by computing the average return of the state s across some N episodes.
Our approximation will be better when N is higher.
Okay, let's get a better understanding of how the Monte Carlo method estimates the
value of a state (value function) with an example. Let's take our favorite grid world
environment we covered in Chapter 1, Fundamentals of Reinforcement Learning, as
shown in Figure 4.1. Our goal is to reach the state I from the state A without visiting
the shaded states, and the agent receives +1 reward when it visits the unshaded
states and -1 reward when it visits the shaded states:
Let's say we have a stochastic policy \pi. Let's suppose that, in state A, our stochastic
policy \pi selects the action down 80% of the time and the action right 20% of the time,
and that it selects the action right in states D and E and the action down in states B
and F 100% of the time.

First, we generate an episode \tau_1 using our given stochastic policy \pi, as Figure 4.2
shows:
For a better understanding, let's focus only on state A. Let's now compute the return
of state A. The return of a state is the sum of the rewards of the trajectory starting
from that state. Thus, the return of state A is computed as R1(A) = 1+1+1+1 = 4 where
the subscript 1 in R1 indicates the return from episode 1.
Say we generate another episode \tau_2 using the same given stochastic policy \pi, as
Figure 4.3 shows. Let's again compute the return of state A; say the return of state A
in this episode works out to R2(A) = 2.
Say we generate another episode \tau_3 using the same given stochastic policy \pi, as
Figure 4.4 shows:
Let's now compute the return of state A. The return of state A is R3(A) = 1+1+1+1 = 4.
Thus, we generated three episodes and computed the return of state A in all three
episodes. Now, how can we compute the value of the state A? We learned that in
the Monte Carlo method, the value of a state can be approximated by computing
the average return of the state across some N episodes (trajectories):
V(s) \approx \frac{1}{N} \sum_{i=1}^{N} R_i(s)
We need to compute the value of state A, so we can compute it by just taking the
average return of the state A across the N episodes as:
V(A) \approx \frac{1}{N} \sum_{i=1}^{N} R_i(A)

V(A) = \frac{1}{3} (R_1(A) + R_2(A) + R_3(A))

V(A) = \frac{4 + 2 + 4}{3} \approx 3.3
Thus, the value of state A is 3.3. Similarly, we can compute the value of all other
states by just taking the average return of the state across the three episodes.
Thus, in the Monte Carlo prediction method, to predict the value of a state (value
function) using the given input policy 𝜋𝜋, we generate some N episodes using the
given policy and then we compute the value of a state as the average return of the
state across these N episodes.
Note that while computing the return of the state, we can also
include the discount factor and compute the discounted return, but
for simplicity let's not include the discount factor.
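As a small sketch of this computation, the return of a state is just the sum of the rewards collected from that state onward, and including a discount factor only changes the weighting of later rewards (the reward list below matches the episode τ1 above):

def compute_return(rewards, gamma=1.0):
    # sum of (discounted) rewards collected after visiting the state
    R = 0.0
    for t, reward in enumerate(rewards):
        R += (gamma ** t) * reward
    return R

print(compute_return([1, 1, 1, 1]))             # 4.0, the return R1(A) computed above
print(compute_return([1, 1, 1, 1], gamma=0.9))  # 3.439, the discounted return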
Now that we have a basic understanding of how the Monte Carlo prediction
method predicts the value function of the given policy, let's look into more detail
by understanding the algorithm of the Monte Carlo prediction method in the next
section.
MC prediction algorithm
The Monte Carlo prediction algorithm is given as follows:
3. Compute the value of a state by just taking the average, that is:

V(s) = total_return(s) / N(s)
The preceding algorithm implies that the value of the state is just the average return
of the state across several episodes.
To get a better understanding of how exactly the preceding algorithm works, let's
take a simple example and compute the value of each state manually. Say we need
to compute the value of three states s0, s1, and s2. We know that we obtain a reward
when we transition from one state to another. Thus, the reward for the final state
will be 0 as we don't make any transitions from the final state. Hence, the value of
the final state s2 will be zero. Now, we need to find the value of two states s0 and s1.
Step 1:
Initialize the total_returns(s) and N(s) for all the states to zero as Table 4.1 shows:
Say we are given a stochastic policy 𝜋𝜋; in state s0 our stochastic policy selects the
action 0 for 50% of the time and action 1 for 50% of the time, and it selects action 1 in
state s1 for 100% of the time.
Step 2: Iteration 1:
Generate an episode using the given input policy 𝜋𝜋 , as Figure 4.5 shows:
Store all rewards obtained in the episode in the list called rewards. Thus,
rewards = [1, 1].
First, we compute the return of the state s0 (the sum of rewards from s0):

R(s0) = sum(rewards[0:]) = sum([1, 1]) = 2

Update the total return of the state s0 in our table as:

total_return(s0) = total_return(s0) + R(s0) = 0 + 2 = 2

Update the number of times the state s0 is visited as:

N(s0) = N(s0) + 1 = 0 + 1 = 1

Now, let's compute the return of the state s1 (the sum of rewards from s1):

R(s1) = sum(rewards[1:]) = sum([1]) = 1

Update the total return of the state s1 in our table as:

total_return(s1) = total_return(s1) + R(s1) = 0 + 1 = 1

Update the number of times the state s1 is visited as:

N(s1) = N(s1) + 1 = 0 + 1 = 1
Our updated table, after iteration 1, is as follows:
Iteration 2:
Say we generate another episode using the same given policy 𝜋𝜋 as Figure 4.6 shows:
Store all rewards obtained in the episode in the list called rewards. Thus,
rewards = [3, 1].
First, we compute the return of the state s0 (the sum of rewards from s0):

R(s0) = sum(rewards[0:]) = sum([3, 1]) = 4

Update the total return of the state s0 in our table as:

total_return(s0) = total_return(s0) + R(s0) = 2 + 4 = 6

Update the number of times the state s0 is visited as:

N(s0) = N(s0) + 1 = 1 + 1 = 2

Now, let's compute the return of the state s1 (the sum of rewards from s1):

R(s1) = sum(rewards[1:]) = sum([1]) = 1

Update the total return of the state s1 in our table as:

total_return(s1) = total_return(s1) + R(s1) = 1 + 1 = 2

Update the number of times the state s1 is visited as:

N(s1) = N(s1) + 1 = 1 + 1 = 2
Our updated table after the second iteration is as follows:
Since we are computing manually, for simplicity, let's stop at two iterations; that is,
we just generate only two episodes.
Step 3:
V(s) = total_return(s) / N(s)

Thus:

V(s0) = total_return(s0) / N(s0) = 6 / 2 = 3

V(s1) = total_return(s1) / N(s1) = 2 / 2 = 1
Thus, we computed the value of the state by just taking the average return across
multiple episodes. Note that in the preceding example, for our manual calculation,
we just generated two episodes, but for a better estimation of the value of the state,
we generate several episodes and then we compute the average return across those
episodes (not just 2).
Types of MC prediction
We just learned how the Monte Carlo prediction algorithm works. We can categorize
the Monte Carlo prediction algorithm into two types: first-visit Monte Carlo and
every-visit Monte Carlo.
The following shows the algorithm of first-visit MC; the key point is that we
compute the return for the state st only if it is occurring for the first time in the
episode:
3. Compute the value of a state by just taking the average, that is:

V(s) = total_return(s) / N(s)
1. Let total_return(s) be the sum of the return of a state across several episodes
and N(s) be the counter, that is, the number of times a state is visited across
several episodes. Initialize total_return(s) and N(s) as zero for all the states.
The policy 𝜋𝜋 is given as input
2. For M number of iterations:
1. Generate an episode using the policy 𝜋𝜋
2. Store all the rewards obtained in the episode in the list called rewards
3. For each step t in the episode, compute the return of the state st as
R(st) = sum(rewards[t:]), update the total return of the state as
total_return(st) = total_return(st) + R(st), and update the counter as
N(st) = N(st) + 1
3. Compute the value of a state by just taking the average, that is:

V(s) = total_return(s) / N(s)
Remember that the only difference between the first-visit MC and every-visit MC
methods is that in the first-visit MC method, we compute the return for a state only
for its first time of occurrence in the episode but in the every-visit MC method, the
return of the state is computed every time the state is visited in an episode. We can
choose between first-visit MC and every-visit MC based on the problem that we are
trying to solve.
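The following small sketch (with a hypothetical episode in which the state s1 occurs twice) shows the only place where the two methods differ; the first-visit condition is the single if check, and every-visit MC simply omits it:

from collections import defaultdict

total_return = defaultdict(float)
N = defaultdict(int)

# A hypothetical episode as a list of (state, action, reward) tuples
episode = [('s0', 1, 1), ('s1', 0, 1), ('s1', 1, 1), ('s2', 0, 1)]
states, actions, rewards = zip(*episode)

for t, state in enumerate(states):
    if state not in states[:t]:     # first-visit MC: only the first occurrence counts
        R = sum(rewards[t:])
        total_return[state] += R
        N[state] += 1
    # every-visit MC would omit the if check and always update

print(dict(total_return), dict(N))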
Now that we have understood how the Monte Carlo prediction method predicts
the value function of the given policy, in the next section, we will learn how to
implement the Monte Carlo prediction method.
The values of the cards Jack (J), King (K), and Queen (Q) will be considered as 10.
The value of the Ace (A) can be 1 or 11, depending on the player's choice. That is, the
player can decide whether the value of an Ace should be 1 or 11 during the game.
The value of the rest of the cards (2 to 10) is just their face value. For instance, the
value of the card 2 will be 2, the value of the card 3 will be 3, and so on.
We learned that the game consists of a player and a dealer. There can be many
players at a time but only one dealer. All the players compete with the dealer and not
with other players. Let's consider a case where there is only one player and a dealer.
Let's understand blackjack by playing the game along with different cases. Let's
suppose we are the player and we are competing with the dealer.
Initially, a player is given two cards. Both of these cards are face up, that is, both
of the player's cards are visible to the dealer. Similarly, the dealer is also given two
cards. But one of the dealer's cards is face up, and the other is face down. That is, the
dealer shows only one of their cards.
As we can see in Figure 4.7, the player has two cards (both face up) and the dealer
also has two cards (only one face up):
Figure 4.7: The player has 20, and the dealer has 2 with one card face down
Now, the player performs either of the two actions, which are Hit and Stand. If we
(the player) perform the action hit, then we get one more card. If we perform stand,
then it implies that we don't need any more cards and tells the dealer to show all
their cards. Whoever has a sum of cards value equal to 21 or a larger value than the
other player but not exceeding 21 wins the game.
We learned that the value of J, K, and Q is 10. As shown in Figure 4.7, we have cards
J and K, which sum to 20 (10+10). Thus, the total value of our cards is already a large
number and it didn't exceed 21. So we stand, and this action tells the dealer to show
their cards. As we can observe in Figure 4.8, the dealer has now shown all their cards
and the total value of the dealer's cards is 12 and the total value of our (the player's)
cards is 20, which is larger and also didn't exceed 21, so we win the game.
Figure 4.9 shows we have two cards and the dealer also has two cards and only one
of the dealer's card is visible to us:
Figure 4.9: The player has 13, and the dealer has 7 with one card face down
Now, we have to decide whether we should (perform the action) hit or stand. Figure
4.9 shows we have two cards, K and 3, which sums to 13 (10+3). Let's be a little
optimistic and hope that the total value of the dealer's cards will not be greater
than ours. So we stand, and this action tells the dealer to show their cards. As we
can observe in Figure 4.10, the sum of the dealer's cards is 17, but ours is only 13, so
we lose the game. That is, the dealer has got a larger value than us, and it did not
exceed 21, so the dealer wins the game, and we lose:
Figure 4.11 shows we have two cards and the dealer also has two cards but only one
of the dealer's cards is visible to us:
Figure 4.11: The player has 8, and the dealer has 10 with one card face down
Now, we have to decide whether we should (perform the action) hit or stand. We
learned that the goal of the game is to have a sum of cards value of 21, or a larger
value than the dealer while not exceeding 21. Right now, the total value of our cards
is just 3+5 = 8. Thus, we (perform the action) hit so that we can make our sum value
larger. After we hit, we receive a new card as shown in Figure 4.12:
Figure 4.12: The player has 18, and the dealer has 10 with one card face down
As we can observe, we got a new card. Now, the total value of our cards is 3+5+10
= 18. Again, we need to decide whether we should (perform the action) hit or stand.
Let's be a little greedy and (perform the action) hit so that we can make our sum
value a little larger. As shown in Figure 4.13, we hit and received one more card but
now the total value of our cards becomes 3+5+10+10 = 28, which exceeds 21, and this
is called a bust and we lose the game:
We learned that the value of the Ace can be either 1 or 11, and the player can decide
the value of the ace during the game. Let's learn how this works. As Figure 4.14
shows, we have been given two cards and the dealer also has two cards and only
one of the dealer's cards is face up:
Figure 4.14: The player has 10, and the dealer has 5 with one card face down
As we can see, the total value of our cards is 5+5 = 10. Thus, we hit so that we can
make our sum value larger. As Figure 4.15 shows, after performing the hit action we
received a new card, which is an Ace. Now, we can decide the value of the Ace to be
either 1 or 11. If we consider the value of Ace to be 1, then the total value of our cards
will be 5+5+1 = 11. But if we consider the value of the Ace to be 11, then the total
value of our cards will be 5+5+11 = 21. In this case, we consider the value of our Ace
to be 11 so that our sum value becomes 21.
Thus, we set the value of the Ace to be 11 and win the game, and in this case, the Ace
is called the usable Ace since it helped us to win the game:
Figure 4.15: The player uses the Ace as 11 and wins the game
Figure 4.16 shows we have two cards and the dealer has two cards with one face up:
Figure 4.16: The player has 13, and the dealer has 10 with one card face down
As we can observe, the total value of our cards is 13 (10+3). We (perform the action)
hit so that we can make our sum value a little larger:
Figure 4.17: The player has to use the Ace as a 1 else they go bust
As Figure 4.17 shows, we hit and received a new card, which is an Ace. Now we can
decide the value of Ace to be 1 or 11. If we choose 11, then our sum value becomes
10+3+11 = 23. As we can observe, when we set our ace to 11, then our sum value
exceeds 21, and we lose the game. Thus, instead of choosing Ace = 11, we set the
Ace value to be 1; so our sum value becomes 10+3+1 = 14.
Again, we need to decide whether we should (perform the action) hit or stand. Let's
say we stand hoping that the dealer sum value will be lower than ours. As Figure 4.18
shows, after performing the stand action, both of the dealer's cards are shown, and
the sum of the dealer's cards is 20, but ours is just 14, and so we lose the game; in
this case, the Ace is called a non-usable Ace since it did not help us to win the game.
Figure 4.18: The player has 14, and the dealer has 20 and wins
If the player's and the dealer's card totals are the same, say 20, then the game is
called a draw.
Now that we have understood how to play blackjack, let's implement the Monte
Carlo prediction method in the blackjack game. But before going ahead, first, let's
learn how the blackjack environment is designed in Gym.
import gym
env = gym.make('Blackjack-v0')
Now, let's look at the state of the blackjack environment; we can just reset our
environment and look at the initial state:
print(env.reset())
Note that every time we run the preceding code, we might get a different result,
as the initial state is randomly initialized. The preceding code will print something
like this:
(15, 9, True)
As we can observe, our state is represented as a tuple, but what does this mean?
We learned that in the blackjack game, we will be given two cards and we also get
to see one of the dealer's cards. Thus, 15 implies that the value of the sum of our
cards is 15, 9 implies the face value of one of the dealer's cards, and True implies
that we have a usable ace (it will be False if we don't have a usable ace).
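Since the state is just a tuple, we can also unpack it into named variables for readability (a small sketch, not from the book's code):

player_sum, dealer_card, usable_ace = env.reset()
print(player_sum, dealer_card, usable_ace)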
print(env.action_space)
Discrete(2)
As we can observe, it implies that we have two actions in our action space, which
are 0 and 1: action 0 implies stand and action 1 implies hit.

Okay, what about the reward? The reward will be assigned as follows: we get a
reward of +1.0 if we win the game, -1.0 if we lose the game, and 0 if the game is
a draw.
Now that we have understood how the blackjack environment is designed in Gym,
let's start implementing the MC prediction method in the blackjack game. First, we
will look at every-visit MC and then we will learn how to implement first-visit MC
prediction.
import gym
import pandas as pd
from collections import defaultdict
env = gym.make('Blackjack-v0')
Defining a policy
We learned that in the prediction method, we will be given an input policy and we
predict the value function of the given input policy. So, now, we first define a policy
function that acts as an input policy. That is, we define the input policy whose value
function will be predicted in the upcoming steps.
As shown in the following code, our policy function takes the state as an input and if
the state[0], the sum of our cards, value, is greater than 19, then it will return action 0
(stand), else it will return action 1 (hit):
def policy(state):
    return 0 if state[0] > 19 else 1
For example, let's generate an initial state by resetting the environment as shown as
follows:
state = env.reset()
print(state)
(20, 5, False)
As we can notice, state[0] = 20; that is, the value of the sum of our cards is 20, so in
this case, our policy will return the action 0 (stand) as the following shows:
print(policy(state))
Now that we have defined the policy, in the next sections, we will predict the value
function (state values) of this policy.
Generating an episode
Next, we generate an episode using the given policy, so we define a function called
generate_episode, which takes the policy as an input and generates the episode
using the given policy.
num_timesteps = 100
For a clear understanding, let's look into the function line by line:
def generate_episode(policy):
    episode = []
    state = env.reset()
    for t in range(num_timesteps):
        action = policy(state)
Perform the selected action, then store the state, action, and reward in our episode list:

        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
If the next state is a final state then break the loop, else update the next state to the
current state:
        if done:
            break
        state = next_state
    return episode
Let's take a look at what the output of our generate_episode function looks like.
Note that we generate an episode using the policy we defined earlier:
print(generate_episode(policy))
[((10, 2, False), 1, 0), ((20, 2, False), 0, 1.0)]
As we can observe, our output is in the form of [(state, action, reward)]. As the
preceding output shows, we have two states in our episode. We performed action 1
(hit) in the state (10, 2, False) and received a 0 reward, and we performed action 0
(stand) in the state (20, 2, False) and received a reward of 1.0.
Now that we have learned how to generate an episode using the given policy, next,
we will look at how to compute the value of the state (value function) using the
every-visit MC method.
First, we define the total_return and N as a dictionary for storing the total return
and the number of times the state is visited across episodes respectively:
total_return = defaultdict(float)
N = defaultdict(int)
Set the number of iterations, that is, the number of episodes, we want to generate:
num_iterations = 500000
for i in range(num_iterations):
Generate the episode using the given policy; that is, generate an episode using the
policy function we defined earlier:
    episode = generate_episode(policy)
Store all the states, actions, and rewards obtained from the episode:
    states, actions, rewards = zip(*episode)
For each step t in the episode, compute the return R of the state as the sum of
rewards, R(st) = sum(rewards[t:]):
    for t, state in enumerate(states):
        R = (sum(rewards[t:]))
Update the total return of the state as total_return(st) = total_return(st) + R:
        total_return[state] = total_return[state] + R
Update the number of times the state is visited in the episode as N(st) = N(st) + 1:
        N[state] = N[state] + 1
After computing the total_return and N we can just convert them into a pandas
data frame for a better understanding. Note that this is just to give a clear
understanding of the algorithm; we don't necessarily have to convert to the pandas
data frame, we can also implement this efficiently just by using the dictionary.
total_return = pd.DataFrame(total_return.items(),columns=['state',
'total_return'])
N = pd.DataFrame(N.items(),columns=['state', 'N'])
df = pd.merge(total_return, N, on="state")
df.head(10)
The preceding code will display the following. As we can observe, we have the total
return and the number of times the state is visited:
Figure 4.19: The total return and the number of times a state has been visited
Next, we can compute the value of the state as the average return:
V(s) = total_return(s) / N(s)
Thus, we can write:
df['value'] = df['total_return']/df['N']
df.head(10)
Figure 4.20: The value is calculated as the average of the return of each state
As we can observe, we now have the value of the state, which is just the average
return of the state across several episodes. Thus, we have successfully predicted the
value function of the given policy using the every-visit MC method.
Okay, let's check the value of some states and understand how accurately our value
function is estimated according to the given policy. Recall that when we started off,
to generate episodes, we used the input policy, which selects the action 0 (stand)
when the sum value is greater than 19 and the action 1 (hit) when the sum value is
19 or lower.
Let's evaluate the value of the state (21,9,False). As we can observe, the value of
the sum of our cards is already 21, so this is a good state and should have a high
value. Let's see what our estimated value of the state is:
df[df['state']==(21,9,False)]['value'].values
array([1.0])
Now, let's check the value of the state (5,8,False). As we can notice, the value of the
sum of our cards is just 5 and the dealer's visible card has a high value of 8; in
this case, the value of the state should be lower. Let's see what our estimated value of
the state is:
df[df['state']==(5,8,False)]['value'].values
array([-1.0])
Thus, we learned how to predict the value function of the given policy using the
every-visit MC prediction method. In the next section, we will look at how to
compute the value of the state using the first-visit MC method.
The first-visit MC code is the same as the every-visit code we just saw, except that we
compute the return of a state only for the first time that state is visited in the episode:
for i in range(num_iterations):
    episode = generate_episode(policy)
    states, actions, rewards = zip(*episode)
    for t, state in enumerate(states):
If the state is occurring for the first time in the episode, compute its return and
update the total return and visit count of the state:
        if state not in states[0:t]:
            R = (sum(rewards[t:]))
            total_return[state] = total_return[state] + R
            N[state] = N[state] + 1
You can obtain the complete code from the GitHub repo of the book and you will get
results similar to what we saw in the every-visit MC section.
Thus, we learned how to predict the value function of the given policy using the
first-visit and every-visit MC methods.
We learned that the value of a state is approximated as the average return of the
state across several episodes:
V(s) = total_return(s) / N(s)
Instead of using the arithmetic mean to approximate the value of the state, we can
also use the incremental mean, and it is expressed as:
N(st) = N(st) + 1
V(st) = V(st) + (1/N(st)) (Rt − V(st))
But why do we need incremental mean? Consider our environment as non-
stationary. In that case, we don't have to take the return of the state from all the
episodes and compute the average. As the environment is non-stationary we can
ignore returns from earlier episodes and use only the returns from the latest episodes
for computing the average. Thus, we can compute the value of the state using the
incremental mean with a constant learning rate α, as follows:
V(st) = V(st) + α (Rt − V(st))
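For illustration, here is a minimal sketch (not the book's listing) of the every-visit MC
prediction loop rewritten with the incremental mean, reusing the generate_episode and
policy functions defined earlier in this section:
V = defaultdict(float)
N = defaultdict(int)
for i in range(num_iterations):
    episode = generate_episode(policy)
    states, actions, rewards = zip(*episode)
    for t, state in enumerate(states):
        R = sum(rewards[t:])
        N[state] += 1
        # incremental mean: V(s) <- V(s) + (1/N(s)) * (R - V(s));
        # in a non-stationary setting, replace 1/N[state] with a constant step size alpha
        V[state] += (1.0 / N[state]) * (R - V[state])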
MC prediction (Q function)
So far, we have learned how to predict the value function of the given policy using
the Monte Carlo method. In this section, we will see how to predict the Q function
of the given policy using the Monte Carlo method.
Predicting the Q function of the given policy using the MC method is exactly the
same as how we predicted the value function in the previous section except that here
we use the return of the state-action pair, whereas in the case of the value function
we used the return of the state. That is, just like we approximated the value of a state
(value function) by computing the average return of the state across several episodes,
we can also approximate the value of a state-action pair (Q function) by computing
the average return of the state-action pair across several episodes.
Thus, we generate several episodes using the given policy π, then, we calculate
the total_return(s, a), the sum of the return of the state-action pair across several
episodes, and also we calculate N(s, a), the number of times the state-action pair is
visited across several episodes. Then we compute the Q function or Q value as the
average return of the state-action pair as shown as follows:
Q(s, a) = total_return(s, a) / N(s, a)
For instance, let's consider a small example. Say we have two states s0 and s1 and we
have two possible actions 0 and 1. Now, we compute total_return(s, a) and N(s, a).
Let's say our table after computation looks like Table 4.4:
Once we have this, we can compute the Q value by just taking the average, that is:
Q(s, a) = total_return(s, a) / N(s, a)
Thus, we can compute the Q function (Q value) for all state-action pairs by just taking
the average, that is:
Q(s, a) = total_return(s, a) / N(s, a)
Recall that in the MC prediction of the value function, we learned two types of MC—
first-visit MC and every-visit MC. In first-visit MC, we compute the return of the
state only for the first time the state is visited in the episode and in every-visit MC
we compute the return of the state every time the state is visited in the episode. The
same distinction applies here: in the MC prediction of the Q function, first-visit MC
computes the return of a state-action pair only for the first time the pair is visited in
the episode, while every-visit MC computes it every time the pair is visited.
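To make this concrete, here is a minimal sketch (under the assumptions of the earlier
code, and not the book's listing) of every-visit MC prediction of the Q function, reusing
the generate_episode and policy functions and keying the dictionaries on (state, action)
pairs:
total_return = defaultdict(float)
N = defaultdict(int)
for i in range(num_iterations):
    episode = generate_episode(policy)
    states, actions, rewards = zip(*episode)
    for t, (state, action) in enumerate(zip(states, actions)):
        # return of the state-action pair: sum of rewards from step t onward
        R = sum(rewards[t:])
        total_return[(state, action)] += R
        N[(state, action)] += 1
# Q value as the average return of each state-action pair
Q = {sa: total_return[sa] / N[sa] for sa in total_return}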
As mentioned in the previous section, instead of using the arithmetic mean, we can
also use the incremental mean. We learned that the value of a state can be computed
using the incremental mean as:
V(st) = V(st) + (1/N(st)) (Rt − V(st))
Similarly, we can compute the Q value of a state-action pair using the incremental
mean as:
Q(st, at) = Q(st, at) + (1/N(st, at)) (Rt − Q(st, at))
Okay, we learned that in the control task our goal is to find the optimal policy. First,
how can we compute a policy? We learned that the policy can be extracted from the
Q function. That is, if we have a Q function, then we can extract a policy by selecting,
in each state, the action that has the maximum Q value, as the following shows:
π(s) = arg max_a Q(s, a)
So, to compute a policy, we need to compute the Q function. But how can we
compute the Q function? We can compute the Q function similarly to what we
learned in the MC prediction method. That is, in the MC prediction method, we
learned that when given a policy, we can generate several episodes using that policy
and compute the Q function (Q value) as the average return of the state-action pair
across several episodes.
We can perform the same step here to compute the Q function. But in the control
method, we are not given any policy as input. So, we will initialize a random policy,
and then we compute the Q function using the random policy. That is, just like we
learned in the prediction method, we generate several episodes using our random
policy. Then we compute the Q function (Q value) as the average return of a state-
action pair across several episodes as the following shows:
Q(s, a) = total_return(s, a) / N(s, a)
Let's suppose after computing the Q function as the average return of the state-action
pair, our Q function looks like Table 4.5:
From the preceding Q function, we can extract a new policy by selecting an action
in each state that has the maximum Q value. That is, π(s) = arg max_a Q(s, a). Thus, our
new policy selects action 0 in state s0 and action 1 in state s1, as these have the maximum
Q value.
However, this new policy will not be an optimal policy because this new policy is
extracted from the Q function, which is computed using the random policy. That is,
we initialized a random policy and generated several episodes using the random
policy, then we computed the Q function by taking the average return of the state-
action pair across several episodes. Thus, we are using the random policy to compute
the Q function and so the new policy extracted from the Q function will not be an
optimal policy.
But now that we have extracted a new policy from the Q function, we can use
this new policy to generate episodes in the next iteration and compute the new Q
function. Then, from this new Q function, we extract a new policy. We repeat these
steps iteratively until we find the optimal policy. This is explained clearly in the
following steps:
Iteration 1—Let π0 be the random policy. We use this random policy to generate an
episode, and then we compute the Q function Q^π0 by taking the average return of the
state-action pair. Then, from this Q function Q^π0, we extract a new policy π1. This new
policy π1 will not be an optimal policy since it is extracted from the Q function, which
is computed using the random policy.
Iteration 2—So, we use the new policy π1 derived from the previous iteration to
generate an episode and compute the new Q function Q^π1 as the average return of a
state-action pair. Then, from this Q function Q^π1, we extract a new policy π2. If the
policy π2 is optimal we stop, else we go to iteration 3.
Iteration 3—Now, we use the new policy π2 derived from the previous iteration
to generate an episode and compute the new Q function Q^π2. Then, from this Q
function Q^π2, we extract a new policy π3. If π3 is optimal we stop, else we go to the
next iteration.
We repeat this process for several iterations until we find the optimal policy π*, as
shown in Figure 4.21:
π0 → Q^π0 → π1 → Q^π1 → π2 → Q^π2 → π3 → Q^π3 → ⋯ → π* → Q^π*
Figure 4.21: The path to finding the optimal policy
This step is called policy evaluation and improvement and is similar to the policy
iteration method we covered in Chapter 3, The Bellman Equation and Dynamic
Programming. Policy evaluation implies that at each step we evaluate the policy.
Policy improvement implies that at each step we are improving the policy by taking
the maximum Q value. Note that here, we select the policy in a greedy manner
meaning that we are selecting policy π by just taking the maximum Q value, and so
we can call our policy a greedy policy.
Now that we have a basic understanding of how the MC control method works, in
the next section, we will look into the algorithm of the MC control method and learn
about it in more detail.
MC control algorithm
The following steps show the Monte Carlo control algorithm. As we can observe,
unlike the MC prediction method, here, we will not be given any policy. So, we
start off by initializing the random policy and use the random policy to generate
an episode in the first iteration. Then, we will compute the Q function (Q value)
as the average return of the state-action pair.
Once we have the Q function, we extract a new policy by selecting an action in each
state that has the maximum Q value. In the next iteration, we use the extracted new
policy to generate an episode and compute the new Q function (Q value) as the
average return of the state-action pair. We repeat these steps for many iterations
to find the optimal policy.
One more thing we need to observe is that, just as we learned in the first-visit MC
prediction method, here we compute the return of the state-action pair only for
the first time a state-action pair is visited in the episode.
For a better understanding, we can compare the MC control algorithm with the
MC prediction of the Q function. One difference we can observe is that, here, we
compute the Q function in each iteration. But if you notice, in the MC prediction
of the Q function, we compute the Q function after all the iterations. The reason
for computing the Q function in every iteration here is that we need the Q function
to extract the new policy so that we can use the extracted new policy in the next
iteration to generate an episode:
1. Let total_return(s, a) be the sum of the return of a state-action pair across
several episodes and N(s, a) be the number of times a state-action pair is
visited across several episodes. Initialize total_return(s, a) and N(s, a) for
all state-action pairs to zero and initialize a random policy π
2. For M number of iterations:
1. Generate an episode using policy π
2. Store all rewards obtained in the episode in the list called rewards
3. For each step t in the episode:
From the preceding algorithm, we can observe that we generate an episode using
the policy π. Then, for each step in the episode, we compute the return of the state-
action pair and compute the Q function Q(st, at) as an average return; from this
Q function, we extract a new policy π. We repeat these steps iteratively to find the
optimal policy. Thus, we learned how to perform the control task using the Monte
Carlo method.
We can classify the control methods into two types:
• On-policy control
• Off-policy control
On-policy control—In the on-policy control method, the agent behaves using one
policy and also tries to improve the same policy. That is, in the on-policy method,
we generate episodes using one policy and also improve the same policy iteratively
to find the optimal policy. For instance, the MC control method, which we just
learned above, can be called on-policy MC control as we are generating episodes
using a policy π, and we also try to improve the same policy π on every iteration
to compute the optimal policy.
Off-policy control—In the off-policy control method, the agent behaves using one
policy b and tries to improve a different policy π. That is, in the off-policy method,
we generate episodes using one policy and we try to improve a different policy
iteratively to find the optimal policy.
We will learn how exactly the preceding two control methods work in detail in the
upcoming sections.
In MC control with exploring starts, we ensure exploration by giving every state-
action pair a non-zero probability of being selected as the initial state-action pair of
an episode. On each step of the episode, we compute the Q function (Q value) as the
average return:
Q(st, at) = total_return(st, at) / N(st, at)
Then we compute the updated policy π using the Q function by selecting, in each
state, the action that has the maximum Q value.
One of the major drawbacks of the exploring starts method is that it is not applicable
to every environment. That is, we can't just randomly choose any state-action pair as
an initial state-action pair because in some environments there can be only one state-
action pair that can act as an initial state-action pair. So we can't randomly select the
state-action pair as the initial state-action pair.
For example, suppose we are training an agent to play a car racing game; we can't
start the episode in a random position as the initial state and a random action as the
initial action because we have a fixed single starting state and action as the initial
state and action.
Thus, to overcome the problem in exploring starts, in the next section, we will learn
about the Monte Carlo control method with a new type of policy called the epsilon-
greedy policy.
First, let's learn what a greedy policy is. A greedy policy is one that selects the best
action available at the moment. For instance, let's say we are in some state A and we
have four possible actions in the state. Let the actions be up, down, left, and right. But
let's suppose our agent has explored only two actions, up and right, in the state A; the
Q value of actions up and right in the state A are shown in Table 4.6:
Table 4.6: The agent has only explored two actions in state A
We learned that the greedy policy selects the best action available at the moment.
So the greedy policy checks the Q table and selects the action that has the maximum
Q value in state A. As we can see, the action up has the maximum Q value. So our
greedy policy selects the action up in state A.
But one problem with the greedy policy is that it never explores the other possible
actions; instead, it always picks the best action available at the moment. In the
preceding example, the greedy policy always selects the action up. But there could
be other actions in state A that might be more optimal than the action up that the
agent has not explored yet. That is, we still have two more actions—down and left—in
state A that the agent has not explored yet, and they might be more optimal than the
action up.
So, now the question is whether the agent should explore all the other actions in
the state and select the best action as the one that has the maximum Q value or
exploit the best action out of already-explored actions. This is called an exploration-
exploitation dilemma.
Say there are many routes from our work to home and we have explored only two
routes so far. Thus, to reach home, we can select the route that takes us home most
quickly out of the two routes we have explored. However, there are still many other
routes that we have not explored yet that might be even better than our current
optimal route. The question is whether we should explore new routes (exploration)
or whether we should always use our current optimal route (exploitation).
To avoid this dilemma, we introduce a new policy called the epsilon-greedy policy.
Here, all actions are tried with a non-zero probability (epsilon). With a probability
epsilon, we explore different actions randomly and with a probability 1-epsilon, we
choose an action that has the maximum Q value. That is, with a probability epsilon,
we select a random action (exploration) and with a probability 1-epsilon we select
the best action (exploitation).
Say we set epsilon = 0.5; then we will generate a random number from the uniform
distribution and if the random number is less than epsilon (0.5), then we select a
random action (exploration), but if the random number is greater than or equal to
epsilon then we select the best action, that is, the action that has the maximum Q
value (exploitation).
So, in this way, we explore actions that we haven't seen before with the probability
epsilon and select the best actions out of the explored actions with the probability
1-epsilon. As Figure 4.22 shows, if the random number we generated from the
uniform distribution is less than epsilon, then we choose a random action. If the
random number is greater than or equal to epsilon, then we choose the best action:
The following snippet shows the Python code for the epsilon-greedy policy:
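A minimal sketch of such a snippet (assuming a Q dictionary keyed by (state, action)
pairs and a Gym environment env, as used elsewhere in this chapter; the function name
is illustrative):
import random
def epsilon_greedy(state, Q, epsilon=0.5):
    # with probability epsilon, explore: select a random action
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    # with probability 1 - epsilon, exploit: select the action with the maximum Q value
    return max(range(env.action_space.n), key=lambda a: Q[(state, a)])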
Now that we have understood what an epsilon-greedy policy is, and how it is used
to solve the exploration-exploitation dilemma, in the next section, we will look at
how to use the epsilon-greedy policy in the Monte Carlo control method.
The on-policy MC control algorithm follows the same steps as the MC control
algorithm we saw earlier, except that the policy we use and improve is the epsilon-
greedy policy. On each step t of the episode, we compute the Q function (Q value)
as the average return:
Q(st, at) = total_return(st, at) / N(st, at)
Then we compute the updated policy π using the Q function. Let a* = arg max_a Q(s, a).
The policy π selects the best action a* with probability 1 − epsilon and a random action
with probability epsilon.
As we can observe, in every iteration, we generate the episode using the policy
π and also we try to improve the same policy π in every iteration to compute the
optimal policy.
import gym
import pandas as pd
import random
from collections import defaultdict
env = gym.make('Blackjack-v0')
Initialize the dictionary for storing the Q values of the state-action pairs:
Q = defaultdict(float)
Initialize the dictionary for storing the total return of the state-action pair:
total_return = defaultdict(float)
Initialize the dictionary for storing the count of the number of times a state-action
pair is visited:
N = defaultdict(int)
def epsilon_greedy_policy(state,Q):
    epsilon = 0.5
Sample a random value from the uniform distribution; if the sampled value is less
than epsilon then we select a random action, else we select the best action that has
the maximum Q value as shown:
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key=lambda x: Q[(state,x)])
Generating an episode
Now, let's generate an episode using the epsilon-greedy policy. We define a function
called generate_episode, which takes the Q value as an input and returns the
episode.
num_timesteps = 100
def generate_episode(Q):
    episode = []
    state = env.reset()
    for t in range(num_timesteps):
        action = epsilon_greedy_policy(state,Q)
Perform the selected action and store the next state information:
        next_state, reward, done, info = env.step(action)
Store the state, action, and reward in the episode list:
        episode.append((state, action, reward))
If the next state is the final state then break the loop, else update the next state to the
current state:
        if done:
            break
        state = next_state
    return episode
num_iterations = 500000
for i in range(num_iterations):
We learned that in the on-policy control method, we will not be given any policy
as an input. So, we initialize a random policy in the first iteration and improve the
policy iteratively by computing the Q value. Since we extract the policy from the Q
function, we don't have to explicitly define the policy. As the Q value improves, the
policy also improves implicitly. That is, in the first iteration, we generate the episode
by extracting the policy (epsilon-greedy) from the initialized Q function. Over a
series of iterations, we will find the optimal Q function, and hence we also find the
optimal policy.
    episode = generate_episode(Q)
Store all the state-action pairs and all the rewards obtained in the episode in
separate lists:
    all_state_action_pairs = [(s, a) for s, a, r in episode]
    rewards = [r for s, a, r in episode]
For each step t in the episode, if the state-action pair is occurring for the first time
in the episode:
    for t, (state, action, _) in enumerate(episode):
        if (state, action) not in all_state_action_pairs[0:t]:
Compute the return R of the state-action pair as the sum of rewards, R(st, at) =
sum(rewards[t:]):
            R = sum(rewards[t:])
Update the total return of the state-action pair as total_return(st, at) = total_return(st,
at) + R(st, at):
            total_return[(state,action)] = total_return[(state,action)] + R
Update the number of times the state-action pair is visited as N(st, at) = N(st, at) + 1:
            N[(state, action)] += 1
Compute the Q value of the state-action pair as the average return, Q(st, at) =
total_return(st, at) / N(st, at):
            Q[(state, action)] = total_return[(state, action)] / N[(state, action)]
Thus, on every iteration, the Q value improves and so does the policy.
After all the iterations, we can have a look at the Q value of each state-action pair in
the pandas data frame for more clarity. First, let's convert the Q value dictionary into
a pandas data frame:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])
df.head(11)
As we can observe, we have the Q values for all the state-action pairs. Now we can
extract the policy by selecting the action that has the maximum Q value in each state.
For instance, say we are in the state (21,8, True). Now, should we perform action 0
(stand) or action 1 (hit)? It makes more sense to perform action 0 (stand) here, since
the value of the sum of our cards is already 21, and if we perform action 1 (hit) our
game will lead to a bust.
Note that due to stochasticity, you might get different results than those shown here.
Let's look at the Q values of all the actions in this state, (21,8, True):
df[124:126]
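More generally, we can extract the greedy policy from the learned Q values
programmatically. Here is a rough sketch (the extract_policy helper is hypothetical,
not from the book):
def extract_policy(Q):
    # pick, for each state, the action with the highest Q value seen so far
    policy = {}
    for (state, action), value in Q.items():
        if state not in policy or value > Q[(state, policy[state])]:
            policy[state] = action
    return policy
optimal_policy = extract_policy(Q)
# assuming the state (21, 8, True) was visited during training
print(optimal_policy[(21, 8, True)])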
In the next section, we will learn about an off-policy control method that uses two
different policies.
In the on-policy method, we generate an episode using the policy π and we improve
the same policy π iteratively to find the optimal policy. But in the off-policy method,
we generate an episode using a policy called the behavior policy b and we try to
iteratively improve a different policy called the target policy π.
That is, in the on-policy method, we learned that the agent generates an episode
using the policy π. Then for each step in the episode, we compute the return of the
state-action pair and compute the Q function Q(st, at) as an average return, then from
this Q function, we extract a new policy π. We repeat this step iteratively to find the
optimal policy π.
But in the off-policy method, the agent generates an episode using a policy called
the behavior policy b. Then for each step in the episode, we compute the return of
the state-action pair and compute the Q function Q(st, at) as an average return, then
from this Q function, we extract a new policy called the target policy π. We repeat
this step iteratively to find the optimal target policy π.
The behavior policy will usually be set to the epsilon-greedy policy and thus the
agent explores the environment with the epsilon-greedy policy and generates an
episode. Unlike the behavior policy, the target policy is set to be the greedy policy
and so the target policy will always select the best action in each state.
Let's now understand how the off-policy Monte Carlo method works exactly.
First, we will initialize the Q function with random values. Then we generate an
episode using the behavior policy, which is the epsilon-greedy policy. That is, from
the Q function we select the best action (the action that has the max Q value) with
probability 1-epsilon and we select the random action with probability epsilon.
Then for each step in the episode, we compute the return of the state-action pair and
compute the Q function Q(st, at) as an average return. Instead of using the arithmetic
mean to compute the Q function, we can use the incremental mean. We can compute
the Q function using the incremental mean as follows:
Q(st, at) = Q(st, at) + (1/N(st, at)) (Rt − Q(st, at))
1. Initialize the Q function Q(s, a) with random values, set the behavior policy
b to be epsilon-greedy, and also set the target policy π to be the greedy policy.
2. For M number of episodes:
1. Generate an episode using the behavior policy b
2. Initialize return R to 0
As we can observe from the preceding algorithm, first we set the Q values of all
the state-action pairs to random values and then we generate an episode using
the behavior policy. Then on each step of the episode, we compute the updated Q
function (Q values) using the incremental mean and then we extract the target policy
from the updated Q function. As we can notice, on every iteration, the Q function
is constantly improving and since we are extracting the target policy from the Q
function, our target policy will also be improving on every iteration.
Also, note that since it is an off-policy method, the episode is generated using the
behavior policy and we try to improve the target policy.
But wait! There is a small issue here. Since we are finding the target policy π from
the Q function, which is computed based on the episodes generated by a different
policy called the behavior policy, our target policy will be inaccurate. This is because
the distribution of the behavior policy and the target policy will be different. So, to
correct this, we introduce a new technique called importance sampling. This is a
technique for estimating the values of one distribution when given samples from
another.
Let us say we want to compute the expectation of a function f(x) where the value of x
is sampled from the distribution p(x), that is, x ~ p(x); then we can write:
E[f(x)] = ∫ f(x) (p(x)/q(x)) q(x) dx
We can approximate this expectation by sampling x_i from the distribution q(x) and
computing:
E[f(x)] ≈ (1/N) Σ_i f(x_i) (p(x_i)/q(x_i))
The ratio p(x) / q(x) is called the importance sampling ratio or importance correction.
Okay, how does importance sampling help us? We learned that with importance
sampling, we can estimate the value of one distribution by sampling from another
using the importance sampling ratio. In off-policy control, we can estimate the target
policy with the samples (episodes) from the behavior policy using the importance
sampling ratio.
In ordinary importance sampling, the importance sampling ratio will be the ratio
of the target policy to the behavior policy, π(a|s) / b(a|s), and in weighted importance
sampling, the importance sampling ratio will be the weighted ratio of the target
policy to the behavior policy, W π(a|s) / b(a|s).
Let's now understand how we use weighted importance sampling in the off-policy
Monte Carlo method. Let W be the weight and C(st, at) denote the cumulative sum
of weights across all the episodes. We learned that we compute the Q function (Q
values) using the incremental mean as:
Q(st, at) = Q(st, at) + (W / C(st, at)) (Rt − Q(st, at))
The algorithm of the off-policy Monte Carlo method is shown next. First, we generate
an episode using the behavior policy and then we initialize return R to 0 and the
weight W to 1. Then on every step of the episode, we compute the return and update
the cumulative weight as C(st, at) = C(st, at) + W. After updating the cumulative
weights, we update the Q value as Q(st, at) = Q(st, at) + (W / C(st, at)) (Rt − Q(st, at)).
From the Q value, we extract the target policy as π(st) = arg max_a Q(st, a). When
the action at given by the behavior policy and the target policy is not the same, then
we break the loop and generate the next episode; else we update the weight as
W = W × 1 / b(at | st).
The complete algorithm of the off-policy Monte Carlo method is explained in the
following steps:
1. Initialize the Q function Q(s, a) with random values, set the behavior policy b
to be epsilon-greedy and the target policy π to be the greedy policy, and initialize
the cumulative weights as C(s, a) = 0
2. For M number of episodes:
1. Generate an episode using the behavior policy b
2. Initialize return R to 0 and weight W to 1
3. For each step t in the episode, t = T-1, T-2,…, 0:
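The per-step updates described above (computing the return, updating C(st, at),
updating Q(st, at), extracting the greedy target policy, and updating the weight W)
can be sketched in code roughly as follows. This is a minimal illustration for the
blackjack environment used earlier, not the book's listing, and the helper names
behavior_policy and target_policy are illustrative:
import gym
import random
from collections import defaultdict

env = gym.make('Blackjack-v0')

Q = defaultdict(float)
C = defaultdict(float)    # cumulative sum of the weights, C(s, a)
epsilon = 0.5
gamma = 1.0
num_iterations = 500000
num_timesteps = 100

def behavior_policy(state):
    # epsilon-greedy behavior policy b
    if random.uniform(0, 1) < epsilon:
        return env.action_space.sample()
    return max(range(env.action_space.n), key=lambda a: Q[(state, a)])

def target_policy(state):
    # greedy target policy pi
    return max(range(env.action_space.n), key=lambda a: Q[(state, a)])

def generate_episode():
    episode = []
    state = env.reset()
    for t in range(num_timesteps):
        action = behavior_policy(state)
        next_state, reward, done, info = env.step(action)
        episode.append((state, action, reward))
        if done:
            break
        state = next_state
    return episode

for i in range(num_iterations):
    episode = generate_episode()
    R = 0.0
    W = 1.0
    # iterate over the episode in reverse, t = T-1, T-2, ..., 0
    for state, action, reward in reversed(episode):
        R = gamma * R + reward
        C[(state, action)] += W
        # weighted importance sampling update of the Q value
        Q[(state, action)] += (W / C[(state, action)]) * (R - Q[(state, action)])
        # if the behavior policy's action differs from the greedy target policy's
        # action, stop processing this episode
        if action != target_policy(state):
            break
        # W = W * 1 / b(a|s); under the epsilon-greedy behavior policy, the greedy
        # action is selected with probability 1 - epsilon + epsilon / |A|
        W = W / (1 - epsilon + epsilon / env.action_space.n)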
But one issue with the Monte Carlo method is that it is applicable only to episodic
tasks. We learned that in the Monte Carlo method, we compute the value of the state
by taking the average return of the state and the return is the sum of rewards of the
episode. But when there is no episode, that is, if our task is a continuous task (non-
episodic task), then we cannot apply the Monte Carlo method.
Okay, how do we compute the value of the state where we have a continuous task
and also where we don't know the model dynamics of the environment? Here is
where we use another interesting model-free method called temporal difference
learning. In the next chapter, we will learn exactly how temporal difference
learning works.
Summary
We started the chapter by understanding what the Monte Carlo method is.
We learned that in the Monte Carlo method, we approximate the expectation
of a random variable by sampling, and when the sample size is greater, the
approximation will be better. Then we learned about the prediction and control
tasks. In the prediction task, we evaluate the given policy by predicting the value
function or Q function, which helps us to understand the expected return an
agent would get if it uses the given policy. In the control task, our goal is to find
the optimal policy, and we will not be given any policy as input, so we start by
initializing a random policy and we try to find the optimal policy iteratively.
Moving forward, we learned how to use the Monte Carlo method to perform the
prediction task. We learned that the value of a state and the value of a state-action
pair can be computed by just taking the average return of the state and an average
return of state-action pair across several episodes, respectively.
Following this, we explored how to perform a control task using the Monte Carlo
method. We learned about two different types of control methods—on-policy and
off-policy control.
In the on-policy method, we generate episodes using one policy and also improve
the same policy iteratively to find the optimal policy. We first learned about the
Monte Carlo control exploring starts method, where every state-action pair has a
non-zero probability of being selected as the initial state-action pair of an episode,
to ensure exploration. Later, we learned about Monte Carlo
control with an epsilon-greedy policy where we select a random action (exploration)
with probability epsilon, and with probability 1-epsilon we select the best action
(exploitation).
At the end of the chapter, we discussed the off-policy Monte Carlo control method
where we use two different policies called the behavior policy, for generating the
episode, and the target policy, for finding the optimal policy.
Questions
Let's assess our knowledge of the Monte Carlo methods by answering the following
questions:
Chapter 5: Understanding Temporal Difference Learning
Temporal difference (TD) learning is one of the most popular and widely used
model-free methods. The reason for this is that TD learning combines the advantages
of both the dynamic programming (DP) method and the Monte Carlo (MC) method
we covered in the previous chapters.
We will also learn how to find the optimal policy in the Frozen Lake environment
using SARSA and the Q learning method. At the end of the chapter, we will compare
the DP, MC, and TD methods.
In this chapter, we will learn about the following topics:
• TD learning
• TD prediction method
• TD control method
• On-policy TD control – SARSA
• Off-policy TD control – Q learning
• Implementing SARSA and Q learning to find the optimal policy
TD learning
The TD learning algorithm was introduced by Richard S. Sutton in 1988. In the
introduction of the chapter, we learned that the reason the TD method became
popular is that it combines the advantages of DP and the MC method. But what
are those advantages?
First, let's recap quickly the advantages and disadvantages of DP and the MC method.
Remember how we estimated the value function in DP methods (value and policy
iteration)? We estimated the value function (the value of a state) as:
V(s) = Σ_s' P^a_ss' [R^a_ss' + γV(s')]
As you may recollect, we learned that in order to find the value of a state, we didn't
have to wait till the end of the episode. Instead, we bootstrap, that is, we estimate
the value of the current state V(s) by estimating the value of the next state V(s').
However, the disadvantage of DP is that we can apply the DP method only when
we know the model dynamics of the environment. That is, DP is a model-based
method and we should know the transition probability in order to use it. When
we don't know the model dynamics of the environment, we cannot apply the
DP method.
However, the disadvantage of the MC method is that in order to estimate the state
value or Q value we need to wait until the end of the episode, and if the episode
is long then it will cost us a lot of time. Also, we cannot apply MC methods to
continuous tasks (non-episodic tasks).
Now, let's get back to TD learning. The TD learning algorithm takes the benefits
of the DP and the MC methods into account. So, just like in DP, we perform
bootstrapping so that we don't have to wait until the end of an episode to compute
the state value or Q value, and just like the MC method, it is a model-free method
and so it does not require the model dynamics of the environment to compute the
state value or Q value. Now that we have the basic idea behind the TD learning
algorithm, let's get into the details and learn exactly how it works.
Similar to what we learned in Chapter 4, Monte Carlo Methods, we can use the
TD learning algorithm for both the prediction and control tasks, and so we can
categorize TD learning into:
• TD prediction
• TD control
We learned what the prediction and control methods mean in the previous chapter.
Let's recap that a bit before going forward.
In the prediction method, a policy is given as an input and we try to predict the
value function or Q function using the given policy. If we predict the value function
using the given policy, then we can say how good it is for the agent to be in each
state if it uses the given policy. That is, we can say what the expected return an agent
can get in each state if it acts according to the given policy.
In the control method, we are not given a policy as input, and the goal in the control
method is to find the optimal policy. So, we initialize a random policy and then we
try to find the optimal policy iteratively. That is, we try to find an optimal policy
that gives us the maximum return.
First, let's see how to use TD learning to perform prediction task, and then we will
learn how to use TD learning for the control task.
TD prediction
In the TD prediction method, the policy is given as input and we try to estimate the
value function using the given policy. TD learning bootstraps like DP, so it does not
have to wait till the end of the episode, and like the MC method, it does not require
the model dynamics of the environment to compute the value function or the Q
function. Now, let's see how the update rule of TD learning is designed, taking the
preceding advantages into account.
Recall that in the MC method, we approximate the value of a state by the return of
the state:
V(s) ≈ R(s)
However, a single return value cannot approximate the value of a state perfectly. So,
we generate N episodes and compute the value of a state as the average return of a
state across N episodes:
V(s) ≈ (1/N) Σ_{i=1}^{N} R_i(s)
But with the MC method, we need to wait until the end of the episode to compute
the value of a state and when the episode is long, it takes a lot of time. One more
problem with the MC method is that we cannot apply it to non-episodic tasks
(continuous tasks).
So, in TD learning, we make use of bootstrapping and estimate the value of a state as:
V(s) ≈ r + γV(s')
The preceding equation tells us that we can estimate the value of the state by only
taking the immediate reward r and the discounted value of the next state γV(s').
As you may observe from the preceding equation, similar to what we learned in DP
methods (value and policy iteration), we perform bootstrapping but here we don't
need to know the model dynamics.
V(s) ≈ r + γV(s')
However, a single value of r + γV(s') cannot approximate the value of a state
perfectly. So, we can take a mean value and instead of taking an arithmetic mean,
we can use the incremental mean.
In the MC method, we learned how to use the incremental mean to estimate the
value of the state, and it is given as follows:
V(st) = V(st) + (1/N(st)) (Rt − V(st))
Similarly, here in TD learning, we can use the incremental mean and estimate the
value of the state, as shown here:
Value of a state = value of a state + learning rate (reward + discount factor (value of next
state) − value of a state)
That is:
V(s) = V(s) + α(r + γV(s') − V(s))
where α is the learning rate.
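For a quick numerical check, here is a tiny sketch (hypothetical helper and toy values,
not from the book) of applying this update to a single transition:
def td_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    # V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

V = {'s1': 0.5, 's2': 0.8}
td_update(V, 's1', 0.0, 's2')
print(V['s1'])    # 0.5 + 0.1 * (0 + 1.0 * 0.8 - 0.5) = 0.53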
Now that we have seen the TD learning update rule and how TD learning is used to
estimate the value of a state, in the next section, we will look into the TD prediction
algorithm and get a clearer understanding of the TD learning method.
TD prediction algorithm
We learned that, in the prediction task, given a policy, we estimate the value function
using the given policy. So, we can say what the expected return an agent can obtain
in each state if it acts according to the given policy.
Before looking into the algorithm directly, for better understanding, first, let's
manually calculate and see how exactly the value of a state is estimated using the TD
learning update rule.
Let's explore TD prediction with the Frozen Lake environment. We have learned
that in the Frozen Lake environment, the goal of the agent is to reach the goal state
G from the starting state S without visiting the hole states H. If the agent visits state
G, we assign a reward of 1 and if it visits any other states, we assign a reward of 0.
Figure 5.2 shows the Frozen Lake environment:
We have four actions in our action space, which are up, down, left, and right, and we
have 16 states from S to G. Instead of encoding the states and actions into numbers,
for easier understanding, let's just keep them as they are. That is, let's just denote
each action by the strings up, down, left, and right, and let's denote each state by their
position in the grid. That is, the first state S is denoted by (1,1) and the second state F
is denoted by (1,2) and so on to the last state G, which is denoted by (4,4).
Now, let's learn how to perform TD prediction in the Frozen Lake environment.
We know that in the TD prediction method, we will be given a policy and we
predict the value function (state value) using a given policy. Let's suppose we are
given the following policy. It basically tells us what action to perform in each state:
Now, we will see how to estimate the value function of the preceding policy using
the TD learning method. Before going ahead, first, we initialize the values of all the
states with random values, as shown here:
Say we are in state (1,1) and as per the given policy we take the right action and move
to the next state (1,2), and we receive a reward r of 0. Let's keep the learning rate α as
0.1 and the discount factor γ as 1 throughout this section. Now, how can we update
the value of the state? Applying the TD update rule V(s) = V(s) + α(r + γV(s') − V(s))
with the randomly initialized values from the value table gives:
V(1, 1) = 0.87
So, we update the value of state (1,1) as 0.87 in the value table, as Figure 5.4 shows:
Now we are in state (1,2). We select the right action according to the given policy
in state (1,2) and move to the next state (1,3) and receive a reward r of 0. We can
compute the value of the state as:
Substituting the reward r = 0, the learning rate α = 0.1, and the discount factor
γ = 1, we can write:
V(1, 2) = 0.62
So, we update the value of state (1,2) to 0.62 in the value table, as Figure 5.5 shows:
Now we are in state (1,3). We select the left action according to our policy and move
to the next state (1,2) and receive a reward r of 0. We can compute the value of the
state as:
Substituting the value of the state V(s) with V(1,3) and the value of the next state
V(s') with V(1,2), we have:
V(1, 3) = 0.782
So, we update the value of state (1,3) to 0.782 in the value table, as Figure 5.6 shows:
Thus, in this way, we compute the value of every state using the given policy.
However, computing the value of the state just for one episode will not be accurate.
So, we repeat these steps for several episodes and compute the accurate estimates of
the state value (the value function).
1. Initialize state s
2. For each step in the episode:
   1. Perform the action given by the policy, move to the next state s', and
      observe the reward r
   2. Update the value of the state as V(s) = V(s) + α(r + γV(s') − V(s))
   3. Update s = s' and, if s is not a terminal state, repeat steps 1 to 3
Now that we have learned how the TD prediction method predicts the value
function of the given policy, in the next section, let's learn how to implement the
TD prediction method to predict the value of states in the Frozen Lake environment.
import gym
import pandas as pd
env = gym.make('FrozenLake-v0')
Define the random policy, which returns the random action by sampling from the
action space:
def random_policy():
    return env.action_space.sample()
Let's define the dictionary for storing the value of states, and we initialize the value
of all the states to 0.0:
V = {}
for s in range(env.observation_space.n):
    V[s] = 0.0
alpha = 0.85
gamma = 0.90
Set the number of episodes and the number of time steps in each episode:
num_episodes = 50000
num_timesteps = 1000
for i in range(num_episodes):
    s = env.reset()
    for t in range(num_timesteps):
        a = random_policy()
Perform the selected action and store the next state information:
        s_, r, done, _ = env.step(a)
Compute the value of the state as V(s) = V(s) + α(r + γV(s') − V(s)):
        V[s] += alpha * (r + gamma * V[s_] - V[s])
Update the next state to the current state, and if the next state is a terminal state,
break the loop:
        s = s_
        if done:
            break
After all the iterations, we will have values of all the states according to the given
random policy.
Before checking the values of the states, let's recollect that in Gym, all the states in the
Frozen Lake environment will be encoded into numbers. Since we have 16 states, all
the states will be encoded into numbers from 0 to 15 as Figure 5.7 shows:
Let's convert the value dictionary into a pandas data frame for better readability and
look at the values of the states:
df = pd.DataFrame(list(V.items()), columns=['state', 'value'])
df
As we can observe, we now have the values of all the states. The value of state 14 is
high since we can reach goal state 15 from state 14 easily, and also, as we can see,
the values of all the terminal states (hole states and the goal state) are zero.
Note that since we have initialized a random policy, you might get varying results
every time you run the previous code.
Now that we have understood how TD learning can be used for prediction tasks,
in the next section, we will learn how to use TD learning for control tasks.
TD control
In the control method, our goal is to find the optimal policy, so we will start off with
an initial random policy and then we will try to find the optimal policy iteratively.
In the previous chapter, we learned that the control method can be classified into
two categories:
• On-policy control
• Off-policy control
We learned what on-policy and off-policy control means in the previous chapter.
Let's recap that a bit before going ahead. In the on-policy control, the agent behaves
using one policy and tries to improve the same policy. That is, in the on-policy
method, we generate episodes using one policy and improve the same policy
iteratively to find the optimal policy. In the off-policy control method, the agent
behaves using one policy and tries to improve a different policy. That is, in the
off-policy method, we generate episodes using one policy and we try to improve
a different policy iteratively to find the optimal policy.
Now, we will learn how to perform control tasks using TD learning. First, we will
learn how to perform on-policy TD control and then we will learn about off-policy
TD control.
Okay, how can we compute the Q function in TD learning? First, let's recall how we
compute the value function. In TD learning, the value function is computed as:
V(s) = V(s) + α(r + γV(s') − V(s))
In the same way, we can compute the Q function of a state-action pair by using the
Q value of the next state-action pair in the update rule:
Q(s, a) = Q(s, a) + α(r + γQ(s', a') − Q(s, a))
Now, we compute the Q function using the preceding TD learning update rule, and
then we extract a policy from it. We can also call the preceding update rule the SARSA
update rule, since the update uses the tuple State-Action-Reward-State-Action, that is,
(s, a, r, s', a').
But wait! In the prediction method, we were given a policy as input, so we acted in
the environment using that policy and computed the value function. But here, we
don't have a policy as input. So how can we act in the environment?
So, first we initialize the Q function with random values or with zeros. Then we
extract a policy from this randomly initialized Q function and act in the environment.
Our initial policy will definitely not be optimal as it is extracted from the randomly
initialized Q function, but on every episode, we will update the Q function (Q
values). So, on every episode, we can use the updated Q function to extract a new
policy. Thus, we will obtain the optimal policy after a series of episodes.
One important point we need to note is that in the SARSA method, instead of
making our policy act greedily, we use the epsilon-greedy policy. That is, in a
greedy policy, we always select the action that has the maximum Q value. But, with
the epsilon-greedy policy we select a random action with probability epsilon, and
we select the best action (the action with the maximum Q value) with probability
1-epsilon.
Before looking into the algorithm directly, for a better understanding, first, let's
manually calculate and see how exactly the Q function (Q value) is estimated using
the SARSA update rule and how we can find the optimal policy.
Let us consider the same Frozen Lake environment. Before going ahead, we initialize
our Q table (Q function) with random values. Figure 5.9 shows the Frozen Lake
environment along with the Q table containing random values:
Figure 5.9: The Frozen Lake environment and Q table with random values
Suppose we are in state (4,2). Now we need to select an action in this state. How
can we select an action? We learned that in the SARSA method, we select an action
based on the epsilon-greedy policy. With probability epsilon, we select a random
action and with probability 1-epsilon we select the best action (the action that has
the maximum Q value). Suppose we use a probability 1-epsilon and select the best
action. So, in state (4,2), we move right as it has the highest Q value compared to the
other actions, as shown here:
Okay, so, we perform the right action in state (4,2) and move to the next state (4,3) as
Figure 5.11 shows:
Figure 5.11: We perform the action with the maximum Q value in state (4,2)
Thus, we moved right in state (4,2) to the next state (4,3) and received a reward r of 0.
Let's keep the learning rate α at 0.1, and the discount factor γ at 1. Now, how can we
update the Q value? Using the SARSA update rule, we have:
Q((4,2), right) = Q((4,2), right) + α(r + γQ((4,3), a') − Q((4,2), right))
Substituting the reward r = 0, the learning rate α = 0.1, and the discount factor γ = 1,
we can write:
Q((4,2), right) = Q((4,2), right) + 0.1(0 + 1 × Q((4,3), a') − Q((4,2), right))
From the previous Q table, we can observe that the Q value of Q((4,2), right) is 0.8.
Thus, substituting Q((4,2), right) with 0.8, we can rewrite the preceding equation as:
Q((4,2), right) = 0.8 + 0.1(0 + 1 × Q((4,3), a') − 0.8)
Because we have moved to the next state (4,3), we need to select an action in this
state in order to compute the Q value of the next state-action pair. So, we use our
same epsilon-greedy policy to select the action. That is, we select a random action
with a probability of epsilon, or we select the best action that has the maximum Q
value with a probability of 1-epsilon.
Suppose we use probability epsilon and select the random action. In state (4,3), we
select the right action randomly, as Figure 5.12 shows. As you can see, although the
right action does not have the maximum Q value, we selected it randomly with
probability epsilon:
1. Initialize state s
2. Extract a policy from Q(s, a) and select an action a to perform in state s
3. For each step in the episode:
   1. Perform the action a, move to the next state s', and observe the reward r
   2. In state s', select the action a' using the epsilon-greedy policy
   3. Update the Q value to
      Q(s, a) = Q(s, a) + α(r + γQ(s', a') − Q(s, a))
   4. Update s = s' and a = a' (update the next state s'-action a' pair
      to the current state s-action a pair)
   5. If s is not a terminal state, repeat steps 1 to 5
Now that we have learned how the SARSA algorithm works, in the next section,
let's implement the SARSA algorithm to find the optimal policy.
import gym
import random
env = gym.make('FrozenLake-v0')
Let's define the dictionary for storing the Q value of the state-action pair and
initialize the Q value of all the state-action pairs to 0.0:
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s,a)] = 0.0
Now, let's define the epsilon-greedy policy. We generate a random number from
the uniform distribution and if the random number is less than epsilon, we select the
random action, else we select the best action that has the maximum Q value:
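A minimal sketch of this function, consistent with how epsilon_greedy(s, epsilon) is
called later and with the Q dictionary defined above, might look like this:
def epsilon_greedy(state, epsilon):
    # explore: with probability epsilon, choose a random action
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    # exploit: with probability 1 - epsilon, choose the action with the maximum Q value
    else:
        return max(list(range(env.action_space.n)), key=lambda x: Q[(state, x)])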
Initialize the discount factor γ, the learning rate α, and the epsilon value:
alpha = 0.85
gamma = 0.90
epsilon = 0.8
Set the number of episodes and number of time steps in the episode:
num_episodes = 50000
num_timesteps = 1000
for i in range(num_episodes):
    s = env.reset()
    a = epsilon_greedy(s,epsilon)
    for t in range(num_timesteps):
Perform the selected action and store the next state information:
        s_, r, done, _ = env.step(a)
Select the action a' in the next state s' using the epsilon-greedy policy:
        a_ = epsilon_greedy(s_,epsilon)
Update the Q value using the SARSA update rule, Q(s, a) = Q(s, a) + α(r + γQ(s', a') − Q(s, a)):
        Q[(s,a)] += alpha * (r + gamma * Q[(s_,a_)] - Q[(s,a)])
Update s = s' and a = a' (update the next state s'-action a' pair to the current state
s-action a pair):
        s = s_
        a = a_
If the next state is a terminal state, break the loop:
        if done:
            break
Note that on every iteration we update the Q function. After all the iterations, we
will have the optimal Q function. Once we have the optimal Q function then we can
extract the optimal policy by selecting the action that has the maximum Q value in
each state.
Off-policy TD control – Q learning
We learned that in the SARSA method, we select action a in state s using the epsilon-
greedy policy, move to the next state s', and update the Q value using the update
rule shown here:
Q(s, a) = Q(s, a) + α(r + γQ(s', a') − Q(s, a))
But unlike SARSA, in Q learning, we use two different policies. One is the epsilon-
greedy policy and the other is a greedy policy. To select an action in the environment
we use an epsilon-greedy policy, but while updating the Q value of the next state-
action pair we use a greedy policy.
That is, we select action a in state s using the epsilon-greedy policy, move to the
next state s', and update the Q value using the update rule shown below:
Q(s, a) = Q(s, a) + α(r + γ max_a' Q(s', a') − Q(s, a))
As we can observe from the preceding equation, the max operator implies that in
state s', we select the action a' that has the maximum Q value.
Thus, to sum up, in the Q learning method we select an action in the environment
using the epsilon-greedy policy, but while computing the Q value of the next state-
action pair we use the greedy policy. Thus, the update rule of Q learning is given as:
Q(s, a) = Q(s, a) + α(r + γ max_a' Q(s', a') − Q(s, a))
Let's understand this better by manually calculating the Q value using our Q
learning update rule. Let's use the same Frozen Lake example. We initialize our Q
table with random values. Figure 5.13 shows the Frozen Lake environment, along
with the Q table containing random values:
Figure 5.13: The Frozen Lake environment with a randomly initialized Q table
Suppose we are in state (3,2). Now, we need to select some action in this state. How
can we select an action? We select an action using the epsilon-greedy policy. So, with
probability epsilon, we select a random action and with probability 1-epsilon we
select the best action that has the maximum Q value.
Say we use probability 1-epsilon and select the best action. So, in state (3,2), we select
the down action as it has the highest Q value compared to other actions in that state,
as Figure 5.14 shows:
Figure 5.14: We perform the action with the maximum Q value in state (3,2)
Okay, so, we perform the down action in state (3,2) and move to the next state (4,2), as
Figure 5.15 shows:
Thus, we move down in state (3,2) to the next state (4,2) and receive a reward r of 0.
Let's keep the learning rate α as 0.1, and the discount factor γ as 1. Now, how can we
update the Q value? Recall the Q learning update rule:
Q(s, a) = Q(s, a) + α(r + γ max_a' Q(s', a') − Q(s, a))
Substituting the state-action pair Q(s, a) with Q((3,2), down) and the next state s' with
(4,2) in the preceding equation, we can write:
Q((3,2), down) = Q((3,2), down) + α(r + γ max_a' Q((4,2), a') − Q((3,2), down))
Substituting the reward r = 0, the learning rate α = 0.1, and the discount factor
γ = 1, we can write:
Q((3,2), down) = Q((3,2), down) + 0.1(0 + 1 × max_a' Q((4,2), a') − Q((3,2), down))
From the previous Q table, we can observe that the Q value of Q((3,2), down) is 0.8.
Thus, substituting Q((3,2), down) with 0.8, we can rewrite the preceding equation as:
Q((3,2), down) = 0.8 + 0.1(0 + 1 × max_a' Q((4,2), a') − 0.8)
As we can observe, in the preceding equation we have the term max_a' Q((4,2), a'),
which represents the Q value of the next state-action pair as we moved to the new state
(4,2). In order to compute the Q value for the next state, first we need to select an action.
Here, we select an action using the greedy policy, that is, the action that has maximum
Q value.
As Figure 5.16 shows, the right action has the maximum Q value in state (4,2). So, we
select the right action and update the Q value of the next state-action pair:
Figure 5.16: We perform the action with the maximum Q value in state (4,2)
Thus, in this way, we update the Q function by updating the Q value of the state-
action pair in each step of the episode. We extract a new policy from the updated
Q function on every step of the episode and use this new policy. (Remember that we
select an action in the environment using the epsilon-greedy policy, but while updating
the Q value of the next state-action pair we use the greedy policy.) After several
episodes, we will have the optimal Q function. The Q learning algorithm given in the
following will help us understand this better.
1. Initialize state s
2. For each step in the episode:
   1. In state s, select the action a using the epsilon-greedy policy
   2. Perform the action a, move to the next state s', and observe the reward r
   3. Update the Q value to
      Q(s, a) = Q(s, a) + α(r + γ max_a' Q(s', a') − Q(s, a))
   4. Update s = s' (update the next state s' to the current state s)
   5. If s is not a terminal state, repeat steps 1 to 5
Now that we have learned how the Q learning algorithm works, in the next section,
let's implement Q learning to find the optimal policy.
import gym
import numpy as np
import random
env = gym.make('FrozenLake-v0')
Let's define the dictionary for storing the Q values of the state-action pairs, and
initialize the Q values of all the state-action pairs to 0.0:
Q = {}
for s in range(env.observation_space.n):
    for a in range(env.action_space.n):
        Q[(s,a)] = 0.0
Now, let's define the epsilon-greedy policy. We generate a random number from the
uniform distribution, and if the random number is less than epsilon we select the
random action, else we select the best action that has the maximum Q value:
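A minimal sketch of such a function (the exact listing may differ) looks like this:

def epsilon_greedy(state, epsilon):
    # explore: select a random action with probability epsilon
    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    # exploit: select the action with the maximum Q value in the given state
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])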
Initialize the discount factor γ, the learning rate α, and the epsilon value:
alpha = 0.85
gamma = 0.90
epsilon = 0.8
Set the number of episodes and the number of time steps in the episode:
num_episodes = 50000
num_timesteps = 1000
for i in range(num_episodes):
    s = env.reset()
    for t in range(num_timesteps):
        a = epsilon_greedy(s,epsilon)
Perform the selected action and store the next state information:
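        # perform the action using the standard Gym step interface (a sketch)
        s_, r, done, _ = env.step(a)

Then, update the Q value of the state-action pair using the Q learning update rule; a minimal sketch, using the alpha and gamma values defined earlier, is:

        Q[(s,a)] += alpha * (r + gamma * max([Q[(s_, a_)] for a_ in range(env.action_space.n)]) - Q[(s,a)])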
        s = s_
        if done:
            break
After all the iterations, we will have the optimal Q function. Then we can extract the
optimal policy by selecting the action that has the maximum Q value in each state.
Dynamic programming (DP), that is, the value and policy iteration methods, is a
model-based method, meaning that we compute the optimal policy using the model
dynamics of the environment. We cannot apply the DP method when we don't have
the model dynamics of the environment.
We also learned about the Monte Carlo (MC) method. MC is a model-free method,
meaning that we compute the optimal policy without using the model dynamics
of the environment. But one problem we face with the MC method is that it is
applicable only to episodic tasks and not to continuous tasks.
Summary
We started off the chapter by understanding what TD learning is and how it takes
advantage of both DP and the MC method. We learned that, just like DP, TD learning
bootstraps, and just like the MC method, TD learning is a model-free method.
Later, we learned how to perform a prediction task using TD learning, and then we
looked into the algorithm of the TD prediction method.
Going forward, we learned how to use TD learning for a control task. First, we
learned about the on-policy TD control method called SARSA, and then we learned
about the off-policy TD control method called Q learning. We also learned how to
find the optimal policy in the Frozen Lake environment using the SARSA and Q
learning methods.
In the next chapter, we will look into an interesting problem called the multi-armed
bandit problem.
Questions
Let's evaluate our newly acquired knowledge by answering the following questions:
Further reading
For further information, refer to the following link:
Chapter 6: Case Study – The MAB Problem
So far in the previous chapters, we have learned the fundamental concepts
of reinforcement learning and also several interesting reinforcement learning
algorithms. We learned about a model-based method called dynamic programming
and a model-free method called the Monte Carlo method, and then we learned
about the temporal difference method, which combines the advantages of dynamic
programming and the Monte Carlo method.
In this chapter, we will learn about one of the classic problems in reinforcement
learning called the multi-armed bandit (MAB) problem. We start the chapter by
understanding the MAB problem, and then we will learn about several exploration
strategies, called epsilon-greedy, softmax exploration, upper confidence bound, and
Thompson sampling, for solving the MAB problem. Following this, we will learn
how a MAB is useful in real-world use cases.
Moving forward, we will understand how to find the best advertisement banner
that is clicked on most frequently by users by framing it as a MAB problem. At
the end of the chapter, we will learn about contextual bandits and how they are
used in different use cases.
Slot machines are one of the most popular games in the casino, where we pull the
arm and get a reward. If we get 0 reward then we lose the game, and if we get +1
reward then we win the game. There can be several slot machines, and each slot
machine is referred to as an arm. For instance, slot machine 1 is referred to as arm
1, slot machine 2 is referred to as arm 2, and so on. Thus, whenever we say arm n,
it actually means that we are referring to slot machine n.
Each arm has its own probability distribution indicating the probability of
winning and losing the game. For example, let's suppose we have two arms. Let
the probability of winning if we pull arm 1 (slot machine 1) be 0.7 and the probability
of winning if we pull arm 2 (slot machine 2) be 0.5.
Then, if we pull arm 1, 70% of the time we win the game and get the +1 reward, and
if we pull arm 2, then 50% of the time we win the game and get the +1 reward.
Thus, we can say that pulling arm 1 is desirable as it makes us win the game 70% of
the time. However, this probability distribution of the arm (slot machine) will not
be given to us. We need to find out which arm helps us to win the game most of the
time and gives us a good reward.
Say we pulled arm 1 once and received a +1 reward, and we pulled arm 2 once
and received a 0 reward. Since arm 1 gives a +1 reward, we cannot come to the
conclusion that arm 1 is the best arm immediately after pulling it only once. We need
to pull both of the arms many times and compute the average reward we obtain from
each of the arms, and then we can select the arm that gives the maximum average
reward as the best arm.
Let's denote the arm by a and define the average reward obtained by pulling the arm a as:

$$Q(a) = \frac{\text{sum of rewards obtained from the arm}}{\text{number of times the arm is pulled}}$$

The optimal arm a* is the one that gives us the maximum average reward, that is:

$$a^* = \arg\max_{a} Q(a)$$
Okay, we have learned that the arm that gives the maximum average reward is the
optimal arm. But how can we find this?
We play the game for several rounds and we can pull only one arm in each round.
Say in the first round we pull arm 1 and observe the reward, and in the second round
we pull arm 2 and observe the reward. Similarly, in every round, we keep pulling
arm 1 or arm 2 and observe the reward. After completing several rounds of the
game, we compute the average reward of each of the arms, and then we select the
arm that has the maximum average reward as the best arm.
But this is not a good approach to find the best arm. Say we have 20 arms; if we keep
pulling a different arm in each round, then in most of the rounds we will lose the
game and get a 0 reward. Along with finding the best arm, our goal should be to
minimize the cost of identifying the best arm, and this is usually referred to as regret.
Thus, we need to find the best arm while minimizing regret. That is, we need to find
the best arm, but we don't want to end up selecting the arms that make us lose the
game in most of the rounds.
So, should we explore a different arm in each round, or should we select only the
arm that got us a good reward in the previous rounds? This leads to a situation
called the exploration-exploitation dilemma, which we learned about in Chapter
4, Monte Carlo Methods. So, to resolve this, we use the epsilon-greedy method and
select the arm that got us a good reward in the previous rounds with probability
1-epsilon and select the random arm with probability epsilon. After completing
several rounds, we select the best arm as the one that has the maximum average
reward.
First, we install the gym-bandits library from its source directory:

cd gym-bandits
pip install -e .
import gym_bandits
import gym
env = gym.make("BanditTwoArmedHighLowFixed-v0")
Since we created a 2-armed bandit, our action space will be 2 (as there are two arms),
as shown here:
print(env.action_space.n)

2

We can also check the probability distribution of each arm:

print(env.p_dist)

[0.8, 0.2]
It indicates that, with arm 1, we win the game 80% of the time and with arm 2, we
win the game 20% of the time. Our goal is to find out whether pulling arm 1 or arm 2
makes us win the game most of the time.
Now that we have learned how to create bandit environments in the Gym, in the
next section, we will explore different exploration strategies to solve the MAB
problem and we will implement them with the Gym.
Exploration strategies
At the beginning of the chapter, we learned about the exploration-exploitation
dilemma in the MAB problem. To overcome this, we use different exploration
strategies and find the best arm. The different exploration strategies are listed here:
• Epsilon-greedy
• Softmax exploration
• Upper confidence bound
• Thompson sampling
Now, we will explore all of these exploration strategies in detail and implement them
to find the best arm.
Epsilon-greedy
We learned about the epsilon-greedy algorithm in the previous chapters. With
epsilon-greedy, we select the best arm with a probability 1-epsilon and we select a
random arm with a probability epsilon. Let's take a simple example and learn how
to find the best arm with the epsilon-greedy method in more detail.
Say we have two arms—arm 1 and arm 2. Suppose with arm 1 we win the game
80% of the time and with arm 2 we win the game 20% of the time. So, we can say
that arm 1 is the best arm as it makes us win the game 80% of the time. Now, let's
learn how to find this with the epsilon-greedy method.
First, we initialize the count (the number of times the arm is pulled), sum_rewards
(the sum of rewards obtained from pulling the arm), and Q (the average reward
obtained by pulling the arm), as Table 6.1 shows:
Round 1:
Say, in round 1 of the game, we select a random arm with a probability epsilon, and
suppose we randomly pull arm 1 and observe the reward. Let the reward obtained
by pulling arm 1 be 1. So, we update our table with count of arm 1 set to 1, and sum_
rewards of arm 1 set to 1, and thus the average reward Q of arm 1 after round 1 is 1 as
Table 6.2 shows:
Round 2:
Say, in round 2, we select the best arm with a probability 1-epsilon. The best arm is
the one that has the maximum average reward. So, we check our table to see which
arm has the maximum average reward. Since arm 1 has the maximum average
reward, we pull arm 1 and observe the reward and let the reward obtained from
pulling arm 1 be 1.
So, we update our table with count of arm 1 to 2 and sum_rewards of arm 1 to 2, and
thus the average reward Q of arm 1 after round 2 is 1 as Table 6.3 shows:
Round 3:
Round 4:
Say, in round 4, we select the best arm with a probability 1-epsilon. So, we pull arm
1 since it has the maximum average reward. Let the reward obtained by pulling arm
1 be 0 this time. Now, we update our table with count of arm 1 to 3 and sum_rewards
of arm 1 to 2, and thus the average reward Q of arm 1 after round 4 will be 0.66
as Table 6.5 shows:
We repeat this process for several rounds; that is, for several rounds of the game,
we pull the best arm with a probability 1-epsilon and we pull a random arm with
probability epsilon.
Table 6.6 shows the updated table after 100 rounds of the game:
From Table 6.6, we can conclude that arm 1 is the best arm since it has the maximum
average reward.
Implementing epsilon-greedy
Now, let's learn to implement the epsilon-greedy method to find the best arm. First,
let's import the necessary libraries:
import gym
import gym_bandits
import numpy as np
For better understanding, let's create the bandit with only two arms:
env = gym.make("BanditTwoArmedHighLowFixed-v0")
print(env.p_dist)
[0.8, 0.2]
We can observe that with arm 1 we win the game with 80% probability and with
arm 2 we win the game with 20% probability. Here, the best arm is arm 1, as with
arm 1 we win the game with 80% probability. Now, let's see how to find this best
arm using the epsilon-greedy method.
Initialize the count for storing the number of times an arm is pulled:
count = np.zeros(2)
sum_rewards = np.zeros(2)
Q = np.zeros(2)
num_rounds = 100
def epsilon_greedy(epsilon):
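    # A minimal sketch of the function body (the exact listing may differ):
    # with probability epsilon, pull a random arm; otherwise pull the arm
    # with the maximum average reward
    if np.random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(Q)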
Now, let's play the game and try to find the best arm using the epsilon-greedy
method.
for i in range(num_rounds):
    arm = epsilon_greedy(epsilon=0.5)
Pull the arm and store the reward and next state information:
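    # pull the arm using the standard Gym step interface (a sketch)
    next_state, reward, done, info = env.step(arm)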
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm]/count[arm]
After all the rounds, we look at the average reward obtained from each of the arms:
print(Q)
[0.83783784 0.34615385]
Now, we can select the optimal arm as the one that has the maximum average
reward:
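For example, a minimal sketch (the arms are indexed from 0 in the code, so we add 1 to get the arm number):

print('The optimal arm is arm {}'.format(np.argmax(Q)+1))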
Since arm 1 has a higher average reward than arm 2, our optimal arm will be arm 1.
Thus, we have found the optimal arm using the epsilon-greedy method.
Softmax exploration
Softmax exploration, also known as Boltzmann exploration, is another useful
exploration strategy for finding the optimal arm.
In the epsilon-greedy policy, we learned that we select the best arm with probability
1-epsilon and a random arm with probability epsilon. As you may have noticed, in
the epsilon-greedy policy, all the non-best arms are explored equally. That is, all the
non-best arms have a uniform probability of being selected. For example, say we
have 4 arms and arm 1 is the best arm. Then we explore the non-best arms – [arm 2,
arm 3, arm 4] – uniformly.
Say arm 3 is never a good arm and it always gives a reward of 0. In this case, instead
of exploring arm 3 again, we can spend more time exploring arm 2 and arm 4. But
the problem with the epsilon-greedy method is that we explore all the non-best arms
equally. So, all the non-best arms – [arm 2, arm 3, arm 4] – will be explored equally.
To avoid this, if we can give priority to arm 2 and arm 4 over arm 3, then we can
explore arm 2 and arm 4 more than arm 3.
Okay, but how can we give priority to the arms? We can give priority to the arms by
assigning a probability to all the arms based on the average reward Q. The arm that
has the maximum average reward will have high probability, and all the non-best
arms have a probability proportional to their average reward.
For instance, as Table 6.7 shows, arm 1 is the best arm as it has a high average reward
Q. So, we assign a high probability to arm 1. Arms 2, 3, and 4 are the non-best arms,
and we need to explore them. As we can observe, arm 3 has an average reward of
0. So, instead of selecting all the non-best arms uniformly, we give more priority to
arms 2 and 4 than arm 3. So, the probability of arm 2 and 4 will be high compared
to arm 3:
$$P_t(a) \propto \frac{\exp(Q_t(a))}{\sum_{i=1}^{n} \exp(Q_t(i))} \qquad (1)$$
So, now the arm will be selected based on the probability. However, in the initial
rounds we will not know the correct average reward of each arm, so selecting the
arm based on the probability of average reward will be inaccurate in the initial
rounds. To avoid this, we introduce a new parameter called T. T is called the
temperature parameter.
We can rewrite the preceding equation with the temperature T, as shown here:
$$P_t(a) \propto \frac{\exp(Q_t(a)/T)}{\sum_{i=1}^{n} \exp(Q_t(i)/T)} \qquad (2)$$
Okay, how will this T help us? When T is high, all the arms have an equal probability
of being selected and when T is low, the arm that has the maximum average reward
will have a high probability. So, we set T to a high number in the initial rounds, and
after a series of rounds we reduce the value of T. This means that in the initial round
we explore all the arms equally and after a series of rounds, we select the best arm
that has a high probability.
Let's understand this with a simple example. Say we have four arms, arm 1 to arm
4. Suppose we pull arm 1 and receive a reward of 1. Then the average reward of arm
1 will be 1 and the average reward of all other arms will be 0, as Table 6.8 shows:
Now, if we convert the average reward to probabilities using the softmax function
given in equation (1), then our probabilities look like the following:
As we can observe, we have a 47% probability for arm 1 and a 17% probability for
all other arms. But we cannot assign a high probability to arm 1 by just pulling arm
1 once. So, we set T to a high number, say T = 30, and calculate the probabilities
based on equation (2). Now our probabilities become:
As we can see, now all the arms have an equal probability of being selected. We
explore the arms based on this probability, and over a series of rounds the value of T
is reduced, so the best arm ends up with a high probability. Let's suppose that
after some 30 rounds, the average reward of each arm is as follows:
Table 6.11: Average reward for each arm after 30+ rounds
We learned that the value of T is reduced over several rounds. Suppose the value of
T is reduced and it is now 0.3 (T=0.3); then the probabilities will become:
Table 6.12: Probabilities for each arm with T now set to 0.3
As we can see, arm 1 has a high probability compared to other arms. So, we select
arm 1 as the best arm and explore the non-best arms – [arm 2, arm 3, arm 4] – based
on their probabilities in the next rounds.
Thus, in the initial rounds, we don't know which arm is the best arm. So, instead of
assigning a high probability to an arm based on its average reward, we assign an equal
probability to all the arms by using a high value of T; over a series of rounds, we
reduce the value of T so that the arm with the highest average reward gets a high
probability.

Now, let's implement the softmax exploration method to find the best arm. First, we
import the necessary libraries:
import gym
import gym_bandits
import numpy as np
Let's take the same two-armed bandit we saw in the epsilon-greedy section:
env = gym.make("BanditTwoArmedHighLowFixed-v0")
count = np.zeros(2)
sum_rewards = np.zeros(2)
Q = np.zeros(2)
num_rounds = 100
Now, we define the softmax function, which selects an arm based on the probability
given by equation (2):

$$P_t(a) = \frac{\exp(Q_t(a)/T)}{\sum_{i=1}^{n} \exp(Q_t(i)/T)}$$
def softmax(T):
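    # A minimal sketch of the function body (the exact listing may differ):
    # compute each arm's selection probability from its average reward and the temperature T
    denom = sum([np.exp(q/T) for q in Q])
    probs = [np.exp(q/T)/denom for q in Q]
    # sample an arm according to this probability distribution
    arm = np.random.choice(env.action_space.n, p=probs)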
    return arm
Now, let's play the game and try to find the best arm using the softmax exploration
method.
T = 50
for i in range(num_rounds):
    arm = softmax(T)
Pull the arm and store the reward and next state information:
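    # pull the arm using the standard Gym step interface (a sketch)
    next_state, reward, done, info = env.step(arm)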
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm]/count[arm]
    T = T*0.99
After all the rounds, we check the Q value, that is, the average reward of all the arms:
print(Q)
[0.77700348 0.1971831 ]
As we can see, arm 1 has a higher average reward than arm 2, so we select arm 1 as
the optimal arm:
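For example, a minimal sketch:

print('The optimal arm is arm {}'.format(np.argmax(Q)+1))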
Thus, we have found the optimal arm using the softmax exploration method.
Upper confidence bound

Suppose we have two arms – arm 1 and arm 2. Let's say we played the game for 20
rounds by pulling arm 1 and arm 2 randomly and found that the mean reward of
arm 1 is 0.6 and the mean reward of arm 2 is 0.5. But how can we be sure that this
mean reward is actually accurate? That is, how can we be sure that this mean reward
represents the true mean (population mean)? This is where we use the confidence
interval.
The confidence interval denotes the interval within which the true value lies. So, in
our setting, the confidence interval denotes the interval within which the true mean
reward of the arm lies.
For instance, from Figure 6.2, we can see that the confidence interval of arm 1 is 0.2
to 0.9, which indicates that the mean reward of arm 1 lies in the range of 0.2 to 0.9.
0.2 is the lower confidence bound and 0.9 is the upper confidence bound. Similarly,
we can observe that the confidence interval of arm 2 is 0.5 to 0.7, which indicates
that the mean reward of arm 2 lies in the range of 0.5 to 0.7, where 0.5 is the lower
confidence bound and 0.7 is the upper confidence bound:
Okay, from Figure 6.2, we can see the confidence intervals of arm 1 and arm 2. Now,
how can we make a decision? That is, how can we decide whether to pull arm 1 or
arm 2? If we look closely, we can see that the confidence interval of arm 1 is large
and the confidence interval of arm 2 is small.
When the confidence interval is large, we are uncertain about the mean value. Since
the confidence interval of arm 1 is large (0.2 to 0.9), we are not sure what reward we
would obtain by pulling arm 1 because the average reward varies from as low as 0.2
to as high as 0.9. So, there is a lot of uncertainty in arm 1 and we are not sure whether
arm 1 gives a high reward or a low reward.
When the confidence interval is small, then we are certain about the mean value.
Since the confidence interval of arm 2 is small (0.5 to 0.7), we can be sure that we
will get a good reward by pulling arm 2 as our average reward is in the range of
0.5 to 0.7.
But what is the reason for the confidence interval of arm 2 being small and the
confidence interval of arm 1 being large? At the beginning of the section, we learned
that we played the game for 20 rounds by pulling arm 1 and arm 2 randomly and
computed the mean reward of arm 1 and arm 2. Say arm 2 has been pulled 15 times
and arm 1 has been pulled only 5 times. Since arm 2 has been pulled many times,
the confidence interval of arm 2 is small and it denotes a certain mean reward. Since
arm 1 has been pulled fewer times, the confidence interval of the arm is large and it
denotes an uncertain mean reward. Thus, it indicates that arm 2 has been explored
a lot more than arm 1.
Okay, coming back to our question, should we pull arm 1 or arm 2? In UCB, we
always select the arm that has a high upper confidence bound, so in our example,
we select arm 1 since it has a high upper confidence bound of 0.9. But why do
we have to select the arm that has the highest upper confidence bound? Selecting
the arm with the highest upper bound helps us to select the arm that gives the
maximum reward.
But there is a small catch here. When the confidence interval is large, we will not be
sure about the mean reward. For instance, in our example, we select arm 1 since it
has a high upper confidence bound of 0.9; however, since the confidence interval
of arm 1 is large, our mean reward could be anywhere from 0.2 to 0.9, and so we
can even get a low reward. But that's okay, we still select arm 1 as it promotes
exploration. When the arm is explored well, then the confidence interval gets
smaller.
As we play the game for several rounds by selecting the arm that has a high UCB,
our confidence interval of both arms will get narrower and denote a more accurate
mean value. For instance, as we can see in Figure 6.3, after playing the game for
several rounds, the confidence interval of both the arms becomes small and denotes
a more accurate mean value:
Figure 6.3: Confidence intervals for arms 1 and 2 after several rounds
From Figure 6.3, we can see that the confidence interval of both arms is small and
we have a more accurate mean, and since in UCB we select the arm that has the highest
UCB, we select arm 2 as the best arm.
Thus, in UCB, we always select the arm that has the highest upper confidence bound.
In the initial rounds, we may not select the best arm as the confidence interval of the
arms will be large in the initial round. But over a series of rounds, the confidence
interval gets smaller and we select the best arm.
Let N(a) be the number of times arm a was pulled and t be the total number of
rounds; then the upper confidence bound of arm a can be computed as:

$$\text{UCB}(a) = Q(a) + \sqrt{\frac{2\log(t)}{N(a)}} \qquad (3)$$

We select the arm that has the highest upper confidence bound as the best arm:

$$a^* = \arg\max_{a} \text{UCB}(a)$$
Implementing UCB
Now, let's learn how to implement the UCB algorithm to find the best arm.
import gym
import gym_bandits
import numpy as np
Let's create the same two-armed bandit we saw in the previous section:
env = gym.make("BanditTwoArmedHighLowFixed-v0")
count = np.zeros(2)
sum_rewards = np.zeros(2)
Q = np.zeros(2)
num_rounds = 100
Now, we define the UCB function, which returns the best arm as the one that has
the highest UCB:
def UCB(i):
Initialize the numpy array for storing the UCB of all the arms:
    ucb = np.zeros(2)
Before computing the UCB, we explore all the arms at least once, so for the first 2
rounds, we directly select the arm corresponding to the round number:
    if i < 2:
        return i
If the round is greater than 2, then we compute the UCB of all the arms as specified
in equation (3) and return the arm that has the highest UCB:
    else:
        for arm in range(2):
            ucb[arm] = Q[arm] + np.sqrt((2*np.log(sum(count))) / count[arm])
        return (np.argmax(ucb))
Now, let's play the game and try to find the best arm using the UCB method.
for i in range(num_rounds):
    arm = UCB(i)
Pull the arm and store the reward and next state information:
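    # pull the arm using the standard Gym step interface (a sketch)
    next_state, reward, done, info = env.step(arm)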
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm]/count[arm]
After all the rounds, we can select the optimal arm as the one that has the maximum
average reward:
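For example, a minimal sketch:

print('The optimal arm is arm {}'.format(np.argmax(Q)+1))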
Thompson sampling
Thompson sampling (TS) is another interesting exploration strategy to overcome
the exploration-exploitation dilemma and it is based on a beta distribution. So, before
diving into Thompson sampling, let's first understand the beta distribution. The beta
distribution is a continuous probability distribution whose density function is expressed as:

$$f(x) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1}$$

where $B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$ and $\Gamma(\cdot)$ is the gamma function.
The shape of the distribution is controlled by the two parameters α and β. When
the values of α and β are the same, then we will have a symmetric distribution.
For instance, as Figure 6.4 shows, since the values of α and β are both equal to two, we have a
symmetric distribution:

When the value of α is higher than β, then we will have a probability closer to 1 than
0. For instance, as Figure 6.5 shows, since the value of α is greater than the value of β, we have a
high probability closer to 1 than 0:

When the value of β is higher than α, then we will have a high probability closer to
0 than 1. For instance, as shown in the following plot, since the value of β is greater than the
value of α, we have a high probability closer to 0 than 1:
From Figure 6.7, we can see that it is better to pull arm 1 than arm 2 because arm 1
has a high probability close to 1, but arm 2 has a high probability close to 0. So, if
we pull arm 1, we get a reward of 1 and win the game, but if we pull arm 2 we get a
reward of 0 and lose the game. Thus, once we know the true distribution of the arms
then we can understand which arm is the best arm.
But how can we learn the true distribution of arm 1 and arm 2? This is where we use
the Thompson sampling method. Thompson sampling is a probabilistic method and
it is based on a prior distribution.
First, we take n samples from arm 1 and arm 2 and compute their distribution.
However, in the initial iterations, the computed distributions of arm 1 and arm
2 will not be the same as the true distribution, and so we will call this the prior
distribution. As Figure 6.8 shows, we have the prior distribution of arm 1 and arm
2, and it varies from the true distribution:
But over a series of iterations, we learn the true distribution of arm 1 and arm 2 and,
as Figure 6.9 shows, the prior distributions of the arms look the same as the true
distribution after a series of iterations:
Figure 6.9: The prior distributions move closer to the true distributions
Once we have learned the true distributions of all the arms, then we can easily select
the best arm. Okay, but how exactly do we learn the true distribution? Let's explore
this in more detail.
Here, we use the beta distribution as a prior distribution. Say we have two arms,
so we will have two beta distributions (prior distributions), and we initialize both α
and β to the same value, say 3, as Figure 6.10 shows:
Figure 6.10: Initialized prior distributions for arms 1 and 2 look the same
As we can see, since we initialized alpha and beta to the same value, the beta
distributions of arm 1 and arm 2 look the same.
In the first round, we just randomly sample a value from these two distributions and
select the arm that has the maximum sampled value. Let's say the sampled value of
arm 1 is high, so in this case, we pull arm 1. Say we win the game by pulling arm
1; then we update the distribution of arm 1 by incrementing the alpha value of the
distribution by 1; that is, we update the alpha value as α = α + 1. As Figure 6.11
shows, the alpha value of the distribution of arm 1 is incremented, and as we can see,
arm 1's beta distribution has a slightly higher probability closer to 1 compared to arm 2:
In the next round, we again sample a value randomly from these two distributions
and select the arm that has the maximum sampled value. Suppose, in this round as
well, we got the maximum sampled value from arm 1. Then we pull the arm 1 again.
Say we win the game by pulling arm 1, then we update the distribution of arm 1
by updating the alpha value to α = α + 1. As Figure 6.12 shows, the alpha value of
arm 1's distribution is incremented, and arm 1's beta distribution has a slightly high
probability close to 1:
Similarly, in the next round, we again randomly sample a value from these
distributions and pull the arm that has the maximum value. Say this time we got
the maximum value from arm 2, so we pull arm 2 and play the game. Suppose
we lose the game by pulling arm 2. Then we update the distribution of arm 2 by
updating the beta value as β = β + 1. As Figure 6.13 shows, the beta value of arm
2's distribution is incremented and the beta distribution of arm 2 has a slightly high
probability close to 0:
Again, in the next round, we randomly sample a value from the beta distribution of
arm 1 and arm 2. Say the sampled value of arm 2 is high, so we pull arm 2. Say we
lose the game again by pulling arm 2. Then we update the distribution of arm 2 by
updating the beta value as β = β + 1. As Figure 6.14 shows, the beta value of arm
2's distribution is incremented by 1 and also arm 2's beta distribution has a slightly
high probability close to 0:
Okay, so, did you notice what we are doing here? We are essentially increasing the
alpha value of the distribution of the arm if we win the game by pulling that arm,
else we increase the beta value. If we do this repeatedly for several rounds, then we
can learn the true distribution of the arm. Say after several rounds, our distribution
will look like Figure 6.15. As we can see, the distributions of both arms resemble the
true distributions:
Figure 6.15: Prior distributions for arms 1 and 2 after several rounds
Now if we sample a value from each of these distributions, then the sampled value
will always be high from arm 1 and we always pull arm 1 and win the game.
The steps involved in the Thompson sampling method are given here:

1. Initialize the beta distribution with alpha and beta set to equal values for all k
arms
2. Sample a value from the beta distribution of all k arms
3. Pull the arm whose sampled value is high
4. If we win the game, then update the alpha value of the distribution to
α = α + 1
5. If we lose the game, then update the beta value of the distribution to
β = β + 1
6. Repeat steps 2 to 5 for many rounds
import gym
import gym_bandits
import numpy as np
For better understanding, let's create the same two-armed bandit we saw in the
previous section:
env = gym.make("BanditTwoArmedHighLowFixed-v0")
count = np.zeros(2)
sum_rewards = np.zeros(2)
Q = np.zeros(2)
alpha = np.ones(2)
beta = np.ones(2)
num_rounds = 100
As the following code shows, we randomly sample values from the beta
distributions of both arms and return the arm that has the maximum sampled value:
def thompson_sampling(alpha,beta):
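    # A minimal sketch of the function body (the exact listing may differ):
    # draw one sample from each arm's beta distribution
    samples = [np.random.beta(alpha[i], beta[i]) for i in range(2)]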
    return np.argmax(samples)
Now, let's play the game and try to find the best arm using the Thompson sampling
method.
for i in range(num_rounds):
    arm = thompson_sampling(alpha,beta)
Pull the arm and store the reward and next state information:
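    # pull the arm using the standard Gym step interface (a sketch)
    next_state, reward, done, info = env.step(arm)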
    count[arm] += 1
    sum_rewards[arm] += reward
    Q[arm] = sum_rewards[arm]/count[arm]
If we win the game, that is, if the reward is equal to 1, then we update the value of
alpha to α = α + 1; else, we update the value of beta to β = β + 1:
    if reward==1:
        alpha[arm] = alpha[arm] + 1
    else:
        beta[arm] = beta[arm] + 1
After all the rounds, we can select the optimal arm as the one that has the highest
average reward:
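For example, a minimal sketch:

print('The optimal arm is arm {}'.format(np.argmax(Q)+1))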
Thus, we found the optimal arm using the Thompson sampling method.
Applications of MAB
So far, we have learned about the MAB problem and how we can solve it using
various exploration strategies. But our goal is not to just use these algorithms for
playing slot machines. We can apply the various exploration strategies to several
different use cases.
Bandits are widely used for website optimization, maximizing conversion rates,
online advertisements, campaigning, and so on.
Suppose we run a website with five different advertisement banners and we want to find
out which banner attracts the most user clicks. We can frame this problem as a MAB
problem. The five advertisement banners
represent the five arms of the bandit, and we assign +1 reward if the user clicks
the advertisement and 0 reward if the user does not click the advertisement. So, to
find out which advertisement banner is most clicked by the users, that is, which
advertisement banner can give us the maximum reward, we can use various
exploration strategies. In this section, let's just use an epsilon-greedy method to
find the best advertisement banner.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('ggplot')
Creating a dataset
Now, let's create a dataset. We generate a dataset with five columns denoting the five
advertisement banners, and we generate 100,000 rows, where the values in the rows
will be either 0 or 1, indicating whether the advertisement banner has been clicked
(1) or not clicked (0) by the user:
df = pd.DataFrame()
for i in range(5):
df['Banner_type_'+str(i)] = np.random.randint(0,2,100000)
df.head()
The preceding code will print the following. As we can see, we have the five
advertisement banners (0 to 4) and the rows consisting of values of 0 or 1, indicating
whether the banner has been clicked (1) or not clicked (0).
num_iterations = 100000
num_banner = 5
Initialize count for storing the number of times each banner was selected:
count = np.zeros(num_banner)
Initialize sum_rewards for storing the sum of rewards obtained from each banner:
sum_rewards = np.zeros(num_banner)
Q = np.zeros(num_banner)
banner_selected = []
def epsilon_greedy_policy(epsilon):
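    # A minimal sketch of the function body (the exact listing may differ):
    # with probability epsilon, show a random banner; otherwise show the banner
    # with the maximum average reward
    if np.random.uniform(0,1) < epsilon:
        return np.random.choice(num_banner)
    else:
        return np.argmax(Q)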
for i in range(num_iterations):
    banner = epsilon_greedy_policy(0.5)
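Get the reward, that is, whether the chosen banner was clicked in this row of the dataset (a minimal sketch that reads the reward from the DataFrame created above):

    reward = df.values[i, banner]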
    count[banner] += 1
    sum_rewards[banner] += reward
    Q[banner] = sum_rewards[banner]/count[banner]
    banner_selected.append(banner)
After all the rounds, we can select the best banner as the one that has the maximum
average reward:
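For example, a minimal sketch (banners are indexed from 0, matching the column names):

print('The best banner is banner {}'.format(np.argmax(Q)))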
We can also plot and see which banner is selected the most often:
ax = sns.countplot(banner_selected)
ax.set(xlabel='Banner', ylabel='Count')
plt.show()
The preceding code will plot the following. As we can see, banner 2 is selected most
often:
Thus, we have learned how to find the best advertisement banner by framing our
problem as a MAB problem.
Contextual bandits
We just learned how to use bandits to find the best advertisement banner for the
users. But the banner preference varies from user to user. That is, user A likes banner
1, but user B might like banner 3, and so on. Each user has their own preferences. So,
we have to personalize advertisement banners according to each user. How can we
do that? This is where we use contextual bandits.
In the MAB problem, we just perform the action and receive a reward. But with
contextual bandits, we take actions based on the state of the environment and the
state holds the context.
For instance, in the advertisement banner example, the state specifies the user
behavior and we will take action (show the banner) according to the state (user
behavior) that will result in the maximum reward (ad clicks).
Contextual bandits are widely used for personalizing content according to the
user's behavior. They are also used to solve the cold-start problems faced by
recommendation systems. Netflix uses contextual bandits for personalizing the
artwork for TV shows according to user behavior.
Summary
We started off the chapter by understanding what the MAB problem is and how it
can be solved using several exploration strategies. We first learned about the epsilon-
greedy method, where we select a random arm with a probability epsilon and
select the best arm with a probability 1-epsilon. Next, we learned about the softmax
exploration method, where we select the arm based on the probability distribution,
and the probability of each arm is proportional to the average reward.
Following this, we learned about the UCB algorithm, where we select the arm
that has the highest upper confidence bound. Then, we explored the Thompson
sampling method, where we learned the distributions of the arms based on the beta
distribution.
In the next chapter, we will learn about several interesting deep learning algorithms
that are essential for deep reinforcement learning.
Questions
Let's evaluate the knowledge we gained in this chapter by answering the following
questions:
5. What happens when the value of alpha is higher than the value of beta in the
beta distribution?
6. What are the steps involved in Thompson sampling?
7. What are contextual bandits?
Further reading
For more information, check out these interesting resources:
Chapter 7: Deep Learning Foundations
So far in the previous chapters, we have learned how several reinforcement learning
algorithms work and how they find the optimal policy. In the upcoming chapters,
we will learn about Deep Reinforcement Learning (DRL), which is a combination
of deep learning and reinforcement learning. To understand DRL, we need to have
a strong foundation in deep learning. So, in this chapter, we will learn several
important deep learning algorithms.
Deep learning is a subset of machine learning and it is all about neural networks.
Deep learning has been around for a decade, but the reason it is so popular right
now is because of the computational advancements and availability of huge volumes
of data. With this huge volume of data, deep learning algorithms can outperform
classic machine learning algorithms.
We will start off the chapter by understanding what biological and artificial neurons
are, and then we will learn about Artificial Neural Networks (ANNs) and how to
implement them. Moving forward, we will learn about several interesting deep
learning algorithms such as the Recurrent Neural Network (RNN), Long Short-
Term Memory (LSTM), Convolutional Neural Network (CNN), and Generative
Adversarial Network (GAN).
• CNNs
• GANs
Let's begin the chapter by understanding how biological and artificial neurons work.
A neuron can be defined as the basic computational unit of the human brain.
Neurons are the fundamental units of our brain and nervous system. Our brain
encompasses approximately 100 billion neurons. Each and every neuron is
connected to one another through a structure called a synapse, which is accountable
for receiving input from the external environment via sensory organs, for sending
motor instructions to our muscles, and for performing other activities.
A neuron can also receive inputs from other neurons through a branchlike structure
called a dendrite. These inputs are strengthened or weakened; that is, they are
weighted according to their importance and then they are summed together in the
cell body called the soma. From the cell body, these summed inputs are processed
and move through the axons and are sent to the other neurons.
Now, let's see how artificial neurons work. Let's suppose we have three inputs x1, x2,
and x3, to predict the output y. These inputs are multiplied by weights w1, w2, and w3
and are summed together, along with a bias b, as follows:

$$z = x_1 w_1 + x_2 w_2 + x_3 w_3 + b$$

But wait, doesn't this look like the equation of linear regression, $z = mx + b$?
Here, m is the weights (coefficients), x is the input, and b is the bias (intercept).
Well, yes. Then, what is the difference between neurons and linear regression?
In neurons, we introduce non-linearity to the result, z, by applying a function f(.)
called the activation or transfer function. Thus, our output becomes:
$$y = f(z)$$
Figure 7.2 shows a single artificial neuron:
So, a neuron takes the input, x, multiplies it by weights, w, and adds bias, b, to form z,
and then we apply the activation function on z and get the output, y.
A typical ANN consists of the following three layers:

• Input layer
• Hidden layer
• Output layer
Each layer has a collection of neurons, and the neurons in one layer interact with
all the neurons in the other layers. However, neurons in the same layer will not
interact with one another. This is simply because neurons from the adjacent layers
have connections or edges between them; however, neurons in the same layer do not
have any connections. We use the term nodes or units to represent the neurons in
the ANN.
Input layer
The input layer is where we feed input to the network. The number of neurons in
the input layer is the number of inputs we feed to the network. Each input will have
some influence on predicting the output. However, no computation is performed in
the input layer; it is just used for passing information from the outside world to the
network.
Hidden layer
Any layer between the input layer and the output layer is called a hidden layer.
It processes the input received from the input layer. The hidden layer is responsible
for deriving complex relationships between input and output. That is, the hidden
layer identifies the pattern in the dataset. It is majorly responsible for learning the
data representation and for extracting the features.
There can be any number of hidden layers; however, we have to choose a number
of hidden layers according to our use case. For a very simple problem, we can just
use one hidden layer, but while performing complex tasks such as image recognition,
we use many hidden layers, where each layer is responsible for extracting important
features. The network is called a deep neural network when we have many
hidden layers.
Output layer
After processing the input, the hidden layer sends its result to the output layer.
As the name suggests, the output layer emits the output. The number of neurons
in the output layer is based on the type of problem we want our network to solve.
If it is a binary classification, then the number of neurons in the output layer is one,
and it tells us which class the input belongs to. If it is a multi-class classification say,
with five classes, and if we want to get the probability of each class as an output,
then the number of neurons in the output layer is five, each emitting the probability.
If it is a regression problem, then we have one neuron in the output layer.
If we do not apply the activation function, then a neuron simply resembles the
linear regression. The aim of the activation function is to introduce a non-linear
transformation to learn the complex underlying patterns in the data.
Now let's look at some of the interesting commonly used activation functions.
One of the most commonly used activation functions is the sigmoid function, which is defined as follows:

$$f(x) = \frac{1}{1 + e^{-x}}$$
It is an S-shaped curve shown in Figure 7.4:
It is differentiable, meaning that we can find the slope of the curve at any two points.
It is monotonic, which implies it is either entirely non-increasing or non-decreasing.
The sigmoid function is also known as a logistic function. As we know that
probability lies between 0 and 1, and since the sigmoid function squashes the value
between 0 and 1, it is used for predicting the probability of output.
Next, we have the tanh function, which is defined as follows:

$$f(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$$
It also resembles the S-shaped curve. Unlike the sigmoid function, which is centered
on 0.5, the tanh function is 0-centered, as shown in the following diagram:
Another widely used activation function is the ReLU function, which is defined as follows:

$$f(x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$$
That is, f(x) returns zero when the value of x is less than zero, and f(x) returns x when
the value of x is greater than or equal to zero. It can also be expressed as follows:

$$f(x) = \max(0, x)$$
As we can see in the preceding diagram, when we feed any negative input to the
ReLU function, it converts the negative input to zero.
Finally, we have the softmax function, which is defined as follows:

$$f(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$
As shown in Figure 7.7, the softmax function converts its inputs to probabilities:
Now that we have learned about different activation functions, in the next section,
we will learn about forward propagation in ANNs.
Let's consider that we have two inputs, x1 and x2, and we have to predict the output, $\hat{y}$.
Since we have two inputs, the number of neurons in the input layer is two. We set
the number of neurons in the hidden layer to four, and the number of neurons in
the output layer to one. Now, the inputs are multiplied by weights, and then we
add bias and propagate the resultant value to the hidden layer where the activation
function is applied.
Before that, we need to initialize the weight matrix. In the real world, we don't
know which input is more important than the other so that we can weight them and
compute the output. Therefore, we randomly initialize the weights and bias value.
The weight and the bias value between the input to the hidden layer are represented
by Wxh and bh, respectively. What about the dimensions of the weight matrix? The
dimensions of the weight matrix must be the number of neurons in the current layer x
the number of neurons in the next layer. Why is that?
Because it is a basic matrix multiplication rule. To multiply any two matrices, AB,
the number of columns in matrix A must be equal to the number of rows in matrix
B. So, the dimension of the weight matrix, Wxh, should be the number of neurons in
the input layer x the number of neurons in the hidden layer, that is, 2 x 4:
So, we multiply the input X by the weight matrix Wxh and add the bias bh to get $z_1 = XW_{xh} + b_h$; then we apply the activation function to this result:

$$a_1 = \sigma(z_1)$$
After applying the activation function, we again multiply result a1 by a new weight
matrix and add a new bias value that is flowing between the hidden layer and the
output layer. We can denote this weight matrix and bias as Why and by, respectively.
The dimension of the weight matrix, Why, will be the number of neurons in the hidden
layer x the number of neurons in the output layer. Since we have four neurons in the
hidden layer and one neuron in the output layer, the Why matrix dimension will be
4 x 1. So, we multiply a1 by the weight matrix, Why, and add bias, by, and pass the
result z2 to the next layer, which is the output layer:
$$z_2 = a_1 W_{hy} + b_y$$

Next, we apply the sigmoid activation function to z2 and obtain the predicted output:

$$\hat{y} = \sigma(z_2)$$
This whole process from the input layer to the output layer is known as forward
propagation. Thus, in order to predict the output value, inputs are propagated from
the input layer to the output layer. During this propagation, they are multiplied by
their respective weights on each layer and an activation function is applied on top
of them. The complete forward propagation steps are given as follows:
$$z_1 = XW_{xh} + b_h$$
$$a_1 = \sigma(z_1)$$
$$z_2 = a_1 W_{hy} + b_y$$
$$\hat{y} = \sigma(z_2)$$
The preceding forward propagation steps can be implemented in Python as follows:
def forward_prop(X):
    z1 = np.dot(X,Wxh) + bh
    a1 = sigmoid(z1)
    z2 = np.dot(a1,Why) + by
    y_hat = sigmoid(z2)
    return y_hat
Forward propagation is cool, isn't it? But how do we know whether the output
generated by the neural network is correct? We define a new function called the
cost function (J), also known as the loss function (L), which tells us how well our
neural network is performing. There are many different cost functions. We will use
the mean squared error as a cost function, which can be defined as the mean of the
squared difference between the actual output and the predicted output:
$$J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Here, n is the number of training samples, y is the actual output, and $\hat{y}$ is the
predicted output.
Okay, so we learned that a cost function is used for assessing our neural network;
that is, it tells us how good our neural network is at predicting the output. But the
question is where is our network actually learning? In forward propagation, the
network is just trying to predict the output. But how does it learn to predict the
correct output? In the next section, we will examine this.
With gradient descent, the neural network learns the optimal values of the randomly
initialized weight matrices. With the optimal values of weights, our network can
predict the correct output and minimize the loss.
Now, we will explore how the optimal values of weights are learned using
gradient descent. Gradient descent is one of the most commonly used optimization
algorithms. It is used for minimizing the cost function, which allows us to minimize
the error and obtain the lowest possible error value. But how does gradient descent
find the optimal weights? Let's begin with an analogy.
Imagine we are on top of a hill, as shown in the following diagram, and we want
to reach the lowest point on the hill. There could be many regions that look like
the lowest points on the hill, but we have to reach the point that is actually the
lowest of all.
That is, we should not be stuck at a point believing it is the lowest point when the
global lowest point exists:
Similarly, we can represent our cost function as follows. It is a plot of cost against
weights. Our objective is to minimize the cost function. That is, we have to reach the
lowest point where the cost is the minimum. The solid dark point in the following
diagram shows the randomly initialized weights. If we move this point downward,
then we can reach the point where the cost is the minimum:
But how can we move this point (initial weight) downward? How can we descend
and reach the lowest point? Gradients are used for moving from one point to
another. So, we can move this point (initial weight) by calculating a gradient
of the cost function with respect to that point (initial weights), which is $\frac{\partial J}{\partial W}$.
Gradients are the derivatives that are actually the slope of a tangent line, as
illustrated in the following diagram. So, by calculating the gradient, we descend
(move downward) and reach the lowest point where the cost is the minimum.
Gradient descent is a first-order optimization algorithm, which means we only
take into account the first derivative when performing the updates:
Thus, with gradient descent, we move our weights to a position where the cost is
minimum. But still, how do we update the weights?
$$W = W - \alpha \frac{\partial J}{\partial W}$$

This implies weights = weights - α x gradients.
What is α? It is called the learning rate. As shown in the following diagram, if
the learning rate is small, then we take a small step downward and our gradient
descent can be slow.
If the learning rate is large, then we take a large step and our gradient descent will
be fast, but we might fail to reach the global minimum and become stuck at a local
minimum. So, the learning rate should be chosen optimally:
This whole process of backpropagating the network from the output layer to
the input layer and updating the weights of the network using gradient descent
to minimize the loss is called backpropagation. Now that we have a basic
understanding of backpropagation, we will strengthen our understanding by
learning about this in detail, step by step. We are going to look at some interesting
math, so put on your calculus hats and follow the steps.
So, we have two weights, one Wxh, which is the input to hidden layer weights, and
the other Why, which is the hidden to output layer weights. We need to find the
optimal values for these two weights that will give us the fewest errors. So, we need
to calculate the derivative of the cost function J with respect to these weights. Since
we are backpropagating, that is, going from the output layer to the input layer, our
first weight will be Why. So, now we need to calculate the derivative of J with respect
to Why. How do we calculate the derivative? First, let's recall our cost function, J:
$$J = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
We cannot calculate the derivative directly from the preceding equation since there is
no Why term. So, instead of calculating the derivative directly, we calculate the partial
derivative. Let's recall our forward propagation equations:

$$z_2 = a_1 W_{hy} + b_y$$
$$\hat{y} = \sigma(z_2)$$
First, we will calculate the partial derivative of J with respect to $\hat{y}$, and then from $\hat{y}$ we will
calculate the partial derivative with respect to z2. From z2, we can directly calculate the
derivative with respect to Why. This is basically the chain rule. So, the derivative of J with
respect to Why becomes:

$$\frac{\partial J}{\partial W_{hy}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_{hy}} \qquad (1)$$
$$\frac{\partial J}{\partial \hat{y}} = -(y - \hat{y})$$

$$\frac{\partial \hat{y}}{\partial z_2} = \sigma'(z_2)$$

Here, $\sigma'$ is the derivative of our sigmoid activation function. We know that the
sigmoid function is $\sigma(z) = \frac{1}{1 + e^{-z}}$, so the derivative of the sigmoid function
would be $\sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2}$.

Next we have:

$$\frac{\partial z_2}{\partial W_{hy}} = a_1$$

Thus, substituting all the preceding terms in equation (1), we can write:

$$\frac{\partial J}{\partial W_{hy}} = -(y - \hat{y}) \cdot \sigma'(z_2) \cdot a_1 \qquad (2)$$
Now we need to compute the derivative of J with respect to our next weight, Wxh.
Let's recall the forward propagation equations that involve Wxh:

$$z_1 = XW_{xh} + b_h$$
$$a_1 = \sigma(z_1)$$
Now, according to the chain rule, the derivative of J with respect to Wxh is given as:

$$\frac{\partial J}{\partial W_{xh}} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial W_{xh}} \qquad (3)$$

We have already computed the first two terms; the remaining terms are:

$$\frac{\partial z_2}{\partial a_1} = W_{hy}$$

$$\frac{\partial a_1}{\partial z_1} = \sigma'(z_1)$$

$$\frac{\partial z_1}{\partial W_{xh}} = X$$

Thus, substituting all the preceding terms in equation (3), we can write:

$$\frac{\partial J}{\partial W_{xh}} = -(y - \hat{y}) \cdot \sigma'(z_2) \cdot W_{hy} \cdot \sigma'(z_1) \cdot X \qquad (4)$$
After we have computed gradients for both weights, Why and Wxh, we will update our
initial weights according to the weight update rule:
$$W_{hy} = W_{hy} - \alpha \frac{\partial J}{\partial W_{hy}} \qquad (5)$$

$$W_{xh} = W_{xh} - \alpha \frac{\partial J}{\partial W_{xh}} \qquad (6)$$
That's it! This is how we update the weights of a network and minimize the loss.
Now, let's see how to implement the backpropagation algorithm in Python.
In both the equations (2) and (4), we have the term $-(y - \hat{y}) \cdot \sigma'(z_2)$, so instead of
computing it again and again, we just call it delta2:
delta2 = np.multiply(-(y-yHat),sigmoidPrime(z2))
Now, we compute the gradient with respect to Why. Refer to equation (2):
dJ_dWhy = np.dot(a1.T,delta2)
Next, we compute delta1 and the gradient with respect to Wxh; refer to equation (4):

delta1 = np.dot(delta2,Why.T)*sigmoidPrime(z1)
dJ_dWxh = np.dot(X.T,delta1)
We will update the weights according to our weight update rule equation (5) and (6)
as follows:
Why = Why - alpha * dJ_dWhy
Wxh = Wxh - alpha * dJ_dWxh
That's it. Apart from this, there are different variants of gradient descent methods
such as stochastic gradient descent, mini-batch gradient descent, Adam, RMSprop,
and more.
Before moving on, let's familiarize ourselves with some of the frequently used
terminology in neural networks:
• Forward pass: Forward pass implies forward propagating from the input
layer to the output layer.
• Backward pass: Backward pass implies backpropagating from the output
layer to the input layer.
• Epoch: The epoch specifies the number of times the neural network sees our
whole training data. So, we can say one epoch is equal to one forward pass
and one backward pass for all training samples.
• Batch size: The batch size specifies the number of training samples we use in
one forward pass and one backward pass.
• Number of iterations: The number of iterations implies the number of passes
where one pass = one forward pass + one backward pass.
Say that we have 12,000 training samples and our batch size is 6,000. Then it will
take us two iterations to complete one epoch. That is, in the first iteration, we pass
the first 6,000 samples and perform a forward pass and a backward pass; in the
second iteration, we pass the next 6,000 samples and perform a forward pass and a
backward pass. After two iterations, our neural network will see the whole 12,000
training samples, which makes it one epoch.
We will understand step-by-step how a neural network learns the XOR logic:
4. Initialize the weights and bias randomly. First, we initialize the input to
hidden layer weights:
Wxh = np.random.randn(num_input,num_hidden)
bh = np.zeros((1,num_hidden))
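Similarly, we initialize the hidden-to-output layer weights and bias, and define the sigmoid activation and the forward propagation steps. The following is a minimal sketch reconstructed from the forward_prop call used in the training loop below (num_output, the number of output neurons, is assumed to be defined):

Why = np.random.randn(num_hidden,num_output)
by = np.zeros((1,num_output))

def sigmoid(z):
    return 1/(1+np.exp(-z))

def forward_prop(X,Wxh,Why):
    z1 = np.dot(X,Wxh) + bh
    a1 = sigmoid(z1)
    z2 = np.dot(a1,Why) + by
    y_hat = sigmoid(z2)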
    return z1,a1,z2,y_hat
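Next, we define the cost function; a minimal sketch based on the mean squared error introduced earlier:

def cost_function(y, y_hat):
    # mean squared error between the actual and the predicted output
    J = np.mean((y - y_hat)**2)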
    return J
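We also need the backpropagation function. The following is a minimal sketch that mirrors the gradient expressions (2) and (4) derived earlier, assuming X and y hold the training inputs and outputs:

def sigmoid_derivative(z):
    return np.exp(-z)/((1+np.exp(-z))**2)

def backword_prop(y_hat, z1, a1, z2):
    # gradient with respect to Why, as in equation (2)
    delta2 = np.multiply(-(y-y_hat),sigmoid_derivative(z2))
    dJ_dWhy = np.dot(a1.T, delta2)
    # gradient with respect to Wxh, as in equation (4)
    delta1 = np.dot(delta2,Why.T)*sigmoid_derivative(z1)
    dJ_dWxh = np.dot(X.T, delta1)
    return dJ_dWxh, dJ_dWhy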
11. Set the learning rate and the number of training iterations:
alpha = 0.01
num_iterations = 5000
12. Now, let's start training the network with the following code:
cost = []
for i in range(num_iterations):
    z1,a1,z2,y_hat = forward_prop(X,Wxh,Why)
    dJ_dWxh, dJ_dWhy = backword_prop(y_hat, z1, a1, z2)

    #update weights
    Wxh = Wxh - alpha * dJ_dWxh
    Why = Why - alpha * dJ_dWhy

    #compute cost
    c = cost_function(y, y_hat)
    cost.append(c)
import matplotlib.pyplot as plt
plt.plot(cost)
plt.title('Cost Function')
plt.xlabel('Training Iterations')
plt.ylabel('Cost')
plt.show()
As you can observe in the following plot, the loss decreases over the training
iterations:
Consider the sentence The sun rises in the ____. If we were asked to predict the blank term in this sentence, we would probably say east. Why would we predict that the word east would be the right word here? Because we read the whole sentence, understood the context, and predicted that the word east would be an appropriate word to complete the sentence.
If we use a feedforward neural network (the one we learned about in the previous section) to predict the blank, it would not predict the right word. This is because, in feedforward networks, each input is independent of the other inputs; the network makes predictions based only on the current input and does not remember previous inputs.
Thus, the input to the network will just be the word preceding the blank, which
is the word the. With this word alone as an input, our network cannot predict the
correct word, because it doesn't know the context of the sentence, which means that
it doesn't know the previous set of words to understand the context of the sentence
and to predict an appropriate next word.
Here is where we use Recurrent Neural Networks (RNNs). They predict output not
only based on the current input, but also on the previous hidden state. Why do they
have to predict the output based on the current input and the previous hidden state?
Why can't they just use the current input and the previous input?
This is because the previous input will only store information about the previous
word, while the previous hidden state will capture the contextual information about
all the words in the sentence that the network has seen so far. Basically, the previous
hidden state acts like memory, and it captures the context of the sentence. With this
context and the current input, we can predict the relevant word.
For instance, let's take the same sentence, The sun rises in the ____. As shown in the
following figure, we first pass the word the as an input, and then we pass the next
word, sun, as input; but along with this, we also pass the previous hidden state, h0.
So, every time we pass the input word, we also pass a previous hidden state as an
input.
In the final step, we pass the word the, and also the previous hidden state h3, which
captures the contextual information about the sequence of words that the network
has seen so far. Thus, h3 acts as the memory and stores information about all the
previous words that the network has seen. With h3 and the current input word (the),
we can predict the relevant next word:
In a nutshell, an RNN uses the previous hidden state as memory, which captures
and stores the contextual information (input) that the network has seen so far.
RNNs are widely applied for use cases that involve sequential data, such as time
series, text, audio, speech, video, weather, and much more. They have been greatly
used in various natural language processing (NLP) tasks, such as language
translation, sentiment analysis, text generation, and so on.
As you can observe in the preceding diagram, the RNN contains a looped connection
in the hidden layer, which implies that we use the previous hidden state along with
the input to predict the output.
Still confused? Let's look at the following unrolled version of an RNN. But wait; what is the unrolled version of an RNN?
It means that we unroll the network over the complete sequence. Let's suppose that we have an input sentence with T words; then we will have one layer for each word, at time steps 0 to T−1, as shown in Figure 7.17:
As you can see in Figure 7.17, at the time step t = 1, the output y1 is predicted based
on the current input x1 and the previous hidden state h0. Similarly, at time step t =
2, y2 is predicted using the current input x2 and the previous hidden state h1. This
is how an RNN works; it takes the current input and the previous hidden state to
predict the output.
$\hat{y}_t = \text{softmax}(Vh_t)$
That is, the output at a time step, t = softmax (hidden to output layer weight x hidden state
at a time t).
We can also represent RNNs as shown in the following figure. As you can see, the
hidden layer is represented by an RNN block, which implies that our network is an
RNN, and previous hidden states are used in predicting the output:
We initialize the initial hidden state, $h_{init}$, with random values. As you can see in the preceding figure, the output, $\hat{y}_0$, is predicted based on the current input, $x_0$, and the previous hidden state, which is the initial hidden state, $h_{init}$, using the following formula:
$h_0 = \tanh(Ux_0 + Wh_{init})$
$\hat{y}_0 = \text{softmax}(Vh_0)$
Similarly, look at how the output, $\hat{y}_1$, is computed. It takes the current input, $x_1$, and the previous hidden state, $h_0$:
$h_1 = \tanh(Ux_1 + Wh_0)$
$\hat{y}_1 = \text{softmax}(Vh_1)$
Thus, in forward propagation to predict the output, RNN uses the current input and
the previous hidden state.
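To make this concrete, here is a minimal NumPy sketch of the RNN forward pass described above; the sequence length, the dimensions, and the softmax helper are illustrative assumptions, not values from this book:
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

# illustrative sizes: 10-dimensional inputs, 16 hidden units, 10 output classes, 5 time steps
input_dim, hidden_dim, output_dim, T = 10, 16, 10, 5
U = np.random.randn(hidden_dim, input_dim) * 0.01    # input to hidden
W = np.random.randn(hidden_dim, hidden_dim) * 0.01   # hidden to hidden
V = np.random.randn(output_dim, hidden_dim) * 0.01   # hidden to output

x = [np.random.randn(input_dim) for _ in range(T)]   # a dummy input sequence
h = np.zeros(hidden_dim)                             # initial hidden state h_init

for t in range(T):
    h = np.tanh(U @ x[t] + W @ h)    # h_t = tanh(U x_t + W h_{t-1})
    y_hat = softmax(V @ h)           # y_hat_t = softmax(V h_t)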
The final loss is the sum of the loss at all the time steps. Suppose that we have time steps 0 to T−1; then, the final loss can be given as follows:
$L = \sum_{j=0}^{T-1} L_j$
Figure 7.21 shows that the final loss is obtained by the sum of loss at all the time
steps:
We computed the loss, now our goal is to minimize the loss. How can we minimize
the loss? We can minimize the loss by finding the optimal weights of the RNN. As
we learned, we have three weights in RNNs: input to hidden, U, hidden to hidden,
W, and hidden to output, V.
We need to find optimal values for all of these three weights to minimize the loss.
We can use our favorite gradient descent algorithm to find the optimal weights. We
begin by calculating the gradients of the loss function with respect to all the weights;
then, we update the weights according to the weight update rule as follows:
$V = V - \alpha \frac{\partial L}{\partial V}$
$W = W - \alpha \frac{\partial L}{\partial W}$
$U = U - \alpha \frac{\partial L}{\partial U}$
However, we have a problem with the RNN. The gradient calculation involves
calculating the gradient with respect to the activation function. When we calculate
the gradient with respect to the sigmoid or tanh function, the gradient will become
very small. When we further backpropagate the network over many time steps and
multiply the gradients, the gradients will tend to get smaller and smaller. This is
called a vanishing gradient problem.
Since the gradient vanishes over time, we cannot learn information about long-
term dependencies, that is, RNNs cannot retain information for a long time in the
memory. The vanishing gradient problem occurs not only in RNNs but also in other
deep networks where we have many hidden layers and when we use sigmoid/tanh
functions.
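A quick numeric check illustrates the effect: the derivative of the sigmoid function is at most 0.25, so multiplying many such factors together while backpropagating over time steps shrinks the gradient rapidly. The following lines are a generic illustration, not tied to any specific network:
max_sigmoid_grad = 0.25   # the sigmoid derivative sigma(z)*(1 - sigma(z)) peaks at 0.25
for steps in (5, 10, 20):
    print(steps, max_sigmoid_grad ** steps)
# 5 -> ~9.8e-04, 10 -> ~9.5e-07, 20 -> ~9.1e-13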
Consider a short sentence such as The color of the sky is ____. An RNN can easily predict the blank as blue based on the information it has just seen, but it cannot capture long-term dependencies. What does that mean? Let's consider the following sentence to understand the problem better:
consider the following sentence to understand the problem better:
Archie lived in China for 13 years. He loves listening to good music. He is a fan of comics.
He is fluent in ____.
Now, if we were asked to predict the missing word in the preceding sentence, we
would predict it as Chinese, but how did we predict that? We simply remembered
the previous sentences and understood that Archie lived for 13 years in China. This
led us to the conclusion that Archie might be fluent in Chinese. An RNN, on the
other hand, cannot retain all of this information in its memory to say that Archie is
fluent in Chinese.
LSTM is a variant of an RNN that resolves the vanishing gradient problem and
retains information in the memory as long as it is required. Basically, RNN cells
are replaced with LSTM cells in the hidden units, as shown in Figure 7.22:
In the next section, we will understand how the LSTM cell works.
This is all achieved by special structures called gates. As shown in the following
diagram, a typical LSTM cell consists of three special gates called the input gate,
output gate, and forget gate:
These three gates are responsible for deciding what information to add, output,
and forget from the memory. With these gates, an LSTM cell effectively keeps
information in the memory only as long as required. Figure 7.24 shows a typical
LSTM cell:
If you look at the LSTM cell, the top horizontal line is called the cell state. It is where
the information flows. Information on the cell state will be constantly updated by
LSTM gates. Now, we will see the function of these gates:
Forget gate: The forget gate is responsible for deciding what information should not
be in the cell state. Look at the following statement:
Harry is a good singer. He lives in New York. Zayn is also a good singer.
As soon as we start talking about Zayn, the network will understand that the subject
has been changed from Harry to Zayn and the information about Harry is no longer
required. Now, the forget gate will remove/forget information about Harry from
the cell state.
Input gate: The input gate is responsible for deciding what information should be
stored in the memory. Let's consider the same example:
Harry is a good singer. He lives in New York. Zayn is also a good singer.
So, after the forget gate removes information from the cell state, the input gate
decides what information has to be in the memory. Here, since the information
about Harry is removed from the cell state by the forget gate, the input gate decides
to update the cell state with the information about Zayn.
Output gate: The output gate is responsible for deciding what information should
be shown from the cell state at a time, t. Now, consider the following sentence:
Zayn's debut album was a huge success. Congrats ____.
Here, congrats is an adjective that is used to describe a noun. The output layer will predict Zayn (a noun) to fill in the blank.
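The behavior of the three gates can be summarized with a small set of equations. The following NumPy sketch of a single LSTM cell step is a generic illustration; the weight names, shapes, and sizes are assumptions, not code from this book:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the forget (f), input (i), output (o)
    # gates and the candidate cell state (g)
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate memory
    c_t = f * c_prev + i * g         # update the cell state
    h_t = o * np.tanh(c_t)           # expose part of the cell state as output
    return h_t, c_t

# example usage with illustrative sizes (4 inputs, 3 hidden units)
n_in, n_h = 4, 3
W = {k: np.random.randn(n_h, n_in) * 0.1 for k in 'fiog'}
U = {k: np.random.randn(n_h, n_h) * 0.1 for k in 'fiog'}
b = {k: np.zeros(n_h) for k in 'fiog'}
h, c = lstm_cell_step(np.random.randn(n_in), np.zeros(n_h), np.zeros(n_h), W, U, b)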
Thus, using LSTM, we can overcome the vanishing gradient problem faced in
RNN. In the next section, we will learn another interesting algorithm called the
Convolutional Neural Network (CNN).
How does a computer see an image in the first place? When we feed the image to a computer, it basically converts it
it into a matrix of pixel values. The pixel values range from 0 to 255, and the
dimensions of this matrix will be of [image width x image height x number of channels].
A grayscale image has one channel, and colored images have three channels, red,
green, and blue (RGB).
Let's say we have a colored input image with a width of 11 and a height of 11, that
is 11 x 11, then our matrix dimension would be [11 x 11 x 3]. As you can see in [11 x
11 x 3], 11 x 11 represents the image width and height and 3 represents the channel
number, as we have a colored image. So, we will have a 3D matrix.
But it is hard to visualize a 3D matrix, so, for the sake of understanding, let's consider
a grayscale image as our input. Since the grayscale image has only one channel, we
will get a 2D matrix.
As shown in the following diagram, the input grayscale image will be converted into
a matrix of pixel values ranging from 0 to 255, with the pixel values representing the
intensity of pixels at that point:
The values given in the input matrix are just arbitrary values for
our understanding.
Okay, now we have an input matrix of pixel values. What happens next? How does the CNN come to understand that the image contains a horse? A CNN consists of the following three important layers:
• Convolutional layers
• Pooling layers
• Fully connected layers
With the help of these three layers, the CNN recognizes that the image contains a horse. Now we will explore each of these layers in detail.
Convolutional layers
The convolutional layer is the first and core layer of the CNN. It is one of the
building blocks of a CNN and is used for extracting important features from the
image.
We have an image of a horse. What do you think are the features that will help us
to understand that this is an image of a horse? We can say body structure, face, legs,
tail, and so on. But how does the CNN understand these features? This is where
we use a convolution operation that will extract all the important features from
the image that characterize the horse. So, the convolution operation helps us to
understand what the image is all about.
Okay, what exactly is this convolution operation? How is it performed? How does it extract the important features? Let's look at this in detail.
We take the filter matrix, slide it over the input matrix by one pixel, perform
element-wise multiplication, sum the results, and produce a single number. That's
pretty confusing, isn't it? Let's understand this better with the aid of the following
diagram:
As you can see in the previous diagram, we took the filter matrix and placed it on top
of the input matrix, performed element-wise multiplication, summed their results,
and produced the single number. This is demonstrated as follows:
(0 ∗ 0) + (13 ∗ 1) + (7 ∗ 1) + (7 ∗ 0) = 20
Now, we slide the filter over the input matrix by one pixel and perform the same
steps, as shown in Figure 7.29:
(13 ∗ 0) + (13 ∗ 1) + (7 ∗ 1) + (7 ∗ 0) = 20
Again, we slide the filter matrix by one pixel and perform the same operation, as
shown in Figure 7.30:
(7 ∗ 0) + (7 ∗ 1) + (9 ∗ 1) + (11 ∗ 0) = 16
Now, again, we slide the filter matrix over the input matrix by one pixel and perform
the same operation, as shown in Figure 7.31:
That is:
(7 ∗ 0) + (7 ∗ 1) + (11 ∗ 1) + (11 ∗ 0) = 18
Okay. What are we doing here? We are basically sliding the filter matrix over
the entire input matrix by one pixel, performing element-wise multiplication and
summing their results, which creates a new matrix called a feature map or activation
map. This is called the convolution operation.
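Here is a small NumPy sketch of this sliding-window computation for a 2D input and filter with a stride of one; the input and filter values below are arbitrary and chosen only to illustrate the mechanics:
import numpy as np

def convolve2d(inp, filt, stride=1):
    fh, fw = filt.shape
    out_h = (inp.shape[0] - fh) // stride + 1
    out_w = (inp.shape[1] - fw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = inp[i*stride:i*stride+fh, j*stride:j*stride+fw]
            # element-wise multiplication followed by a sum
            feature_map[i, j] = np.sum(window * filt)
    return feature_map

inp = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
filt = np.array([[1, 0],
                 [0, 1]])
print(convolve2d(inp, filt))   # [[ 6.  8.] [12. 14.]]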
As we've learned, the convolution operation is used to extract features, and the new
matrix, that is, the feature maps, represents the extracted features. If we plot the
feature maps, then we can see the features extracted by the convolution operation.
Figure 7.32 shows the actual image (the input image) and the convolved image (the
feature map). We can see that our filter has detected the edges from the actual image
as a feature:
Various filters are used for extracting different features from the image. For instance, if we use a sharpen filter, $\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$, then it will sharpen our image, as shown in the following figure:
Thus, we have learned that with filters, we can extract important features from the
image using the convolution operation. So, instead of using one filter, we can use
multiple filters to extract different features from the image and produce multiple
feature maps. So, the depth of the feature map will be the number of filters. If we
use seven filters to extract different features from the image, then the depth of our
feature map will be seven:
Okay, we have learned that different filters extract different features from the image.
But the question is, how can we set the correct values for the filter matrix so that
we can extract the important features from the image? Worry not! We just initialize
the filter matrix randomly, and the optimal values of the filter matrix, with which
we can extract the important features from the images, will be learned through
backpropagation. However, we just need to specify the size of the filter and the
number of filters we want to use.
Strides
We have just learned how a convolution operation works: we slide the filter matrix over the input matrix by one pixel and perform the convolution operation. But we don't have to slide over the input matrix only by one pixel; we can also slide over the input matrix by any number of pixels.
The number of pixels by which we slide the filter matrix over the input matrix is called the stride.
If we set the stride to 2, then we slide over the input matrix with the filter matrix by
two pixels. Figure 7.35 shows a convolution operation with a stride of 2:
But how do we choose the stride? We just learned that the stride is the number of pixels by which we move our filter matrix. When the stride is set to a small number, we can encode a more detailed representation of the image than when the stride is set to a large number. However, a stride with a high value takes less time to compute than one with a low value.
Padding
With the convolution operation, we are sliding over the input matrix with a filter
matrix. But in some cases, the filter does not perfectly fit the input matrix. What do
we mean by that? For example, let's say we are performing a convolution operation
with a stride of 2. There exists a situation where, when we move our filter matrix by
two pixels, it reaches the border and the filter matrix does not fit the input matrix.
That is, some part of our filter matrix is outside the input matrix, as shown in the
following diagram:
In this case, we perform padding. We can simply pad the input matrix with zeros so
that the filter can fit the input matrix, as shown in Figure 7.37. Padding with zeros on
the input matrix is called same padding or zero padding:
Instead of padding them with zeros, we can also simply discard the region of the
input matrix where the filter doesn't fit in. This is called valid padding:
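The size of the resulting feature map follows the standard output-size formula; the helper below is a generic illustration (not code from this book), where W is the input width, F the filter width, P the amount of zero padding, and S the stride:
def conv_output_size(W, F, P, S):
    # number of positions the filter can take along one dimension
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=11, F=3, P=0, S=1))   # 9  (valid padding, stride 1)
print(conv_output_size(W=11, F=3, P=1, S=1))   # 11 (same padding, stride 1)
print(conv_output_size(W=11, F=3, P=0, S=2))   # 5  (valid padding, stride 2)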
Pooling layers
Okay. Now, we are done with the convolution operation. As a result of the
convolution operation, we have some feature maps. But the feature maps are too
large in dimension. In order to reduce the dimensions of feature maps, we perform
a pooling operation. This reduces the dimensions of the feature maps and keeps
only the necessary details so that the amount of computation can be reduced.
For example, to recognize a horse from the image, we need to extract and keep
only the features of the horse; we can simply discard unwanted features, such
as the background of the image and more. A pooling operation is also called
a downsampling or subsampling operation, and it makes the CNN translation
invariant. Thus, the pooling layer reduces spatial dimensions by keeping only
the important features.
There are different types of pooling operations, including max pooling, average
pooling, and sum pooling.
In max pooling, we slide over the filter on the input matrix and simply take the
maximum value from the filter window, as Figure 7.39 shows:
In average pooling, we take the average value of the input matrix within the filter
window, and in sum pooling, we sum all the values of the input matrix within the
filter window.
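A max pooling operation with a 2 x 2 window and a stride of 2 can be sketched in NumPy as follows; this is a generic illustration and the input values are arbitrary:
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            pooled[i, j] = window.max()   # keep only the strongest activation
    return pooled

fm = np.array([[1, 3, 2, 4],
               [5, 6, 1, 2],
               [7, 2, 9, 0],
               [3, 4, 8, 6]])
print(max_pool(fm))   # [[6. 4.] [7. 9.]]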
Given any image, convolutional layers extract features from the image and produce
a feature map. Now, we need to classify these extracted features. So, we need
an algorithm that can classify these extracted features and tell us whether the
extracted features are the features of a horse, or something else. In order to make
this classification, we use a feedforward neural network. We flatten the feature map
and convert it into a vector, and feed it as an input to the feedforward network.
The feedforward network takes this flattened feature map as an input, applies an
activation function, such as sigmoid, and returns the output, stating whether the
image contains a horse or not; this is called a fully connected layer and is shown
in the following diagram:
As you will notice, first we feed the input image to the convolutional layer, where
we apply the convolution operation to extract important features from the image and
create the feature maps. We then pass the feature maps to the pooling layer, where
the dimensions of the feature maps will be reduced.
As shown in the previous diagram, we can have multiple convolutional and pooling
layers, and we should also note that the pooling layer does not necessarily have to
be there after every convolutional layer; there can be many convolutional layers
followed by a pooling layer.
So, after the convolutional and pooling layers, we flatten the resultant feature
maps and feed it to a fully connected layer, which is basically a feedforward neural
network that classifies the given input image based on the feature maps.
Now that we have learned how CNNs work, in the next section, we will learn about
another interesting algorithm called the generative adversarial network.
GANs are used extensively for generating new data points. They can be applied to
any type of dataset, but they are popularly used for generating images. Some of the
applications of GANs include generating realistic human faces, converting grayscale
images to colored images, translating text descriptions into realistic images, and
many more.
GANs have evolved so much in recent years that they can generate a very realistic
image. The following figure shows the evolution of GANs in generating images
over the course of five years:
Excited about GANs already? Now, we will see how exactly they work. Before going
ahead, let's consider a simple analogy. Let's say you are the police and your task is to
find counterfeit money, and the role of the counterfeiter is to create fake money and
cheat the police.
The counterfeiter constantly tries to create fake money in a way that is so realistic
that it cannot be differentiated from the real money. But the police have to identify
whether the money is real or fake. So, the counterfeiter and the police essentially
play a two-player game where one tries to defeat the other. GANs work something
like this. They consist of two important components:
• Generator
• Discriminator
You can perceive the generator as analogous to the counterfeiter, while the
discriminator is analogous to the police. That is, the role of the generator is to create
fake money, and the role of the discriminator is to identify whether the money is fake
or real.
Without going into detail, first, we will get a basic understanding of GANs. Let's
say we want our GAN to generate handwritten digits. How can we do that? First,
we will take a dataset containing a collection of handwritten digits; say, the MNIST
dataset. The generator learns the distribution of images in our dataset. Thus, it
learns the distribution of handwritten digits in our training set. Once it has learned the distribution of the images in our dataset, when we feed random noise to the generator, it will convert the random noise into a new handwritten digit similar to the ones in our training set, based on the learned distribution:
As Figure 7.45 shows, we feed random noise to the generator, and it then converts
this random noise into a new image similar to the one we have in our training set,
but not exactly the same as the images in the training set. The image generated by
the generator is called a fake image, and the images in our training set are called
real images. We feed both the real and fake images to the discriminator, which tells
us the probability of them being real. It returns 0 if the image is fake and 1 if the
image is real:
The generator is represented as $G(z; \theta_g)$, where z is the random noise input and $\theta_g$ denotes the parameters of the generator network.
Surprising, isn't it? How does the generator convert random noise to a realistic
image?
Let's say we have a dataset containing a collection of human faces and we want our
generator to generate a new human face. First, the generator learns all the features
of the face by learning the probability distribution of the images in our training set.
Once the generator learns the correct probability distribution, it can generate totally
new human faces.
But how does the generator learn the distribution of the training set? That is, how
does the generator learn the distribution of images of human faces in the training
set?
A generator is nothing but a neural network. So, what happens is that the neural
network learns the distribution of the images in our training set implicitly; let's call
this distribution a generator distribution, Pg. At the first iteration, the generator
generates a really noisy image. But over a series of iterations, it learns the exact
probability distribution of our training set and learns to generate a correct image
by tuning its $\theta_g$ parameter.
The goal of the discriminator is to discriminate between two classes. That is, given
an image x, it has to identify whether the image is from a real distribution or a
fake distribution (generator distribution). That is, the discriminator has to identify
whether the given input image is from the training set or the fake image generated
by the generator:
The discriminator is represented as $D(x; \theta_d)$, where x is an input image and $\theta_d$ denotes the parameters of the discriminator network.
Let's call the distribution of our training set the real data distribution, which is
represented by Pr. We know that the generator distribution is represented by Pg.
So, the discriminator D essentially tries to discriminate whether the image x is from
Pr or Pg.
We know that the goal of the generator is to generate an image in such a way as
to fool the discriminator into believing that the generated image is from a real
distribution.
In the first iteration, the generator generates a noisy image. When we feed this
image to the discriminator, it can easily detect that the image is from a generator
distribution. The generator takes this as a loss and tries to improve itself, as its goal
is to fool the discriminator. That is, if the generator knows that the discriminator
is easily detecting the generated image as a fake image, then it means that it is not
generating an image similar to those in the training set. This implies that it has not
learned the probability distribution of the training set yet.
So, the generator tunes its parameters in such a way as to learn the correct
probability distribution of the training set. As we know that the generator is a neural
network, we simply update the parameters of the network through backpropagation.
Once it has learned the probability distribution of the real images, then it can
generate images similar to the ones in the training set.
Okay, what about the discriminator? How does it learn? As we know, the role of
the discriminator is to discriminate between real and fake images.
If the discriminator incorrectly classifies the generated image; that is, if the
discriminator classifies the fake image as a real image, then it implies that the
discriminator has not learned to differentiate between the real and fake image. So,
we update the parameter of the discriminator network through backpropagation
to make the discriminator learn to classify between real and fake images.
So, basically, the generator is trying to fool the discriminator by learning the real
data distribution, Pr, and the discriminator is trying to find out whether the image
is from a real or fake distribution. Now the question is, when do we stop training
the network in light of the fact that the generator and discriminator are competing
against each other?
Basically, the goal of the GAN is to generate images similar to the one in the training
set. Say we want to generate a human face—we learn the distribution of images
in the training set and generate new faces. So, for a generator, we need to find the
optimal discriminator. What do we mean by that?
When Pg = Pr, then the discriminator cannot differentiate between whether the input
image is from a real or a fake distribution, so it will just return 0.5 as a probability, as
the discriminator will become confused between the two distributions when they are
the same.
$D(x) = \frac{P_r(x)}{P_r(x) + P_g(x)} = \frac{1}{2}$
So, when the discriminator just returns the probability of 0.5 for all the generator
images, then we can say that the generator has learned the distribution of images in
our training set and has fooled the discriminator successfully.
Architecture of a GAN
Figure 7.47 shows the architecture of a GAN:
As shown in the preceding diagram, generator G takes the random noise, z, as input
by sampling from a uniform or normal distribution and generates a fake image by
implicitly learning the distribution of the training set.
We sample an image, x, from the real data distribution, $x \sim P_r(x)$, or the fake data distribution, $x \sim P_g(x)$, and feed it to the discriminator, D. We feed real and fake images to the discriminator and the discriminator performs a binary classification task. That is, it returns 0 when the image is fake and 1 when the image is real.
When we write $x \sim P_r(x)$, it implies that image x is sampled from the real distribution, $P_r$. Similarly, $x \sim P_g(x)$ denotes that image x is sampled from the generator distribution, $P_g$, and $z \sim P_z(z)$ implies that the generator input, z, is sampled from the uniform distribution, $P_z$.
We've learned that both the generator and discriminator are neural networks and both of them update their parameters through backpropagation. We now need to find the optimal generator parameter, $\theta_g$, and the discriminator parameter, $\theta_d$.
Discriminator loss
Now we will look at the loss function of the discriminator. We know that the goal
of the discriminator is to classify whether the image is a real or a fake image. Let's
denote the discriminator by D.
$\max_{\theta_d} L(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x; \theta_d)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$
What does this mean, though? Let's understand each of the terms one by one.
First term
Let's look at the first term:
$\mathbb{E}_{x \sim P_r(x)}[\log D(x)]$
Here, $x \sim P_r(x)$ implies that we are sampling input x from the real data distribution, $P_r$, so x is a real image.
D(x) implies that we are feeding the input image x to the discriminator D, and the
discriminator will return the probability of input image x to be a real image. As x
is sampled from real data distribution Pr, we know that x is a real image. So, we
need to maximize the probability of D(x):
$\max D(x)$
But instead of maximizing raw probabilities, we maximize log probabilities, so we can write the following:
$\max \log D(x)$
So, our final equation becomes the following:
$\mathbb{E}_{x \sim P_r(x)}[\log D(x; \theta_d)]$
Second term
Now, let's look at the second term:
$\mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$
D(G(z)) implies that we are feeding the fake image generated by the generator to the
discriminator D and it will return the probability of the fake input image being a real
image.
If we subtract D(G(z)) from 1, then it will return the probability of the fake input
image being a fake image:
$1 - D(G(z))$
Since we know that G(z) is not a real image, the discriminator will maximize this probability. That is, the discriminator maximizes the probability of the generated image being classified as a fake image, so we write:
$\max \, (1 - D(G(z)))$
$\max \log(1 - D(G(z)))$
$\mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$ implies the expectation of the log likelihood of the input images generated by the generator being fake.
Final term
So, combining these two terms, the loss function of the discriminator is given as
follows:
$\max_{\theta_d} L(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x; \theta_d)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$
Here, $\theta_g$ and $\theta_d$ are the parameters of the generator and discriminator network respectively. So, the discriminator's goal is to find the right $\theta_d$ so that it can classify the image correctly.
Generator loss
The loss function of the generator is given as follows:
$\min_{\theta_g} L(D, G) = \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$
We know that the goal of the generator is to fool the discriminator to classify the fake
image as a real image.
In the Discriminator loss section, we saw that $\mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$ implies the probability of classifying the fake input image as a fake image, and the discriminator maximizes this term for correctly classifying the fake image as fake.
But the generator wants to minimize this probability. As the generator wants to fool the discriminator, it minimizes the probability of the fake input image being classified as fake by the discriminator. Thus, the loss function of the generator can be expressed as follows:
$\min_{\theta_g} L(D, G) = \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z; \theta_g); \theta_d))]$
Total loss
We just learned the loss functions of the generator and the discriminator. Combining these two losses, we can write our final loss function as follows:
$\min_{G} \max_{D} L(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$
So, our objective function is basically a min-max objective function, that is, a maximization for the discriminator and a minimization for the generator, and we find the optimal generator parameter, $\theta_g$, and discriminator parameter, $\theta_d$, by backpropagating through the respective networks.
For a mini-batch of m samples, the discriminator parameters, $\theta_d$, are updated using the gradient of its objective:
$\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D(x^{(i)}) + \log\left(1 - D(G(z^{(i)}))\right) \right]$
Similarly, the generator parameters, $\theta_g$, are updated using the gradient of its objective:
$\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D(G(z^{(i)}))\right)$
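As a small numeric illustration of these objectives (not code from this book), the following snippet evaluates the two loss terms for a mini-batch of discriminator outputs; D_real holds D(x) for real images and D_fake holds D(G(z)) for generated images, with the values made up for the example:
import numpy as np

eps = 1e-12   # avoid log(0)
D_real = np.array([0.9, 0.8, 0.95])   # discriminator outputs on real images
D_fake = np.array([0.1, 0.2, 0.05])   # discriminator outputs on fake images

# discriminator objective: log D(x) + log(1 - D(G(z))), to be maximized
disc_objective = np.mean(np.log(D_real + eps)) + np.mean(np.log(1 - D_fake + eps))

# generator objective: log(1 - D(G(z))), to be minimized
gen_objective = np.mean(np.log(1 - D_fake + eps))

print(disc_objective, gen_objective)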
Summary
We started off the chapter by understanding biological and artificial neurons. Then
we learned about ANNs and their layers. We learned different types of activation
functions and how they are used to introduce nonlinearity in the network.
Later, we learned about the forward and backward propagation in the neural
network. Next, we learned how to implement an ANN. Moving on, we learned
about RNNs and how they differ from feedforward networks. Next, we learned
about the variant of the RNN called LSTM. Going forward, we learned about CNNs,
how they use different types of layers, and the architecture of CNNs in detail.
At the end of the chapter, we learned about an interesting algorithm called the GAN. We understood the generator and discriminator components of a GAN and we also explored the architecture of a GAN in detail. Following that, we examined the loss function of a GAN in detail.
In the next chapter, we will learn about one of the most popularly used deep
learning frameworks, called TensorFlow.
Questions
Let's assess our understanding of deep learning algorithms by answering the
following questions:
Further reading
• To learn more about deep learning algorithms, you can check out my book Hands-on Deep Learning Algorithms with Python, also published by Packt, at https://www.packtpub.com/in/big-data-and-business-intelligence/hands-deep-learning-algorithms-python.
Chapter 8: A Primer on TensorFlow
TensorFlow is one of the most popular deep learning libraries. In upcoming chapters,
we will use TensorFlow to build deep reinforcement models. So, in this chapter, we
will get ourselves familiar with TensorFlow and its functionalities.
We will learn about what computational graphs are and how TensorFlow uses
them. We will also explore TensorBoard, which is a visualization tool provided by
TensorFlow used for visualizing models. Going forward, we will understand how to
build a neural network with TensorFlow to perform handwritten digit classification.
Moving on, we will learn about TensorFlow 2.0, which is the latest version of
TensorFlow. We will understand how TensorFlow 2.0 differs from its previous
versions and how it uses Keras as its high-level API.
In this chapter, we will cover the following topics:
• TensorFlow
• Computational graphs and sessions
• Variables, constants, and placeholders
• TensorBoard
• Handwritten digit classification in TensorFlow
• Math operations in TensorFlow
• TensorFlow 2.0 and Keras
What is TensorFlow?
TensorFlow is an open source software library from Google, which is extensively
used for numerical computation. It is one of the most used libraries for building
deep learning models. It is highly scalable and runs on multiple platforms, such
as Windows, Linux, macOS, and Android. It was originally developed by the
researchers and engineers of the Google Brain team.
TensorFlow 2.0 is the latest version of TensorFlow. In the upcoming chapters, we will
use TensorFlow 2.0 for building deep reinforcement learning models. However, it is
important to understand how TensorFlow 1.x works. So, first, we will learn to use
TensorFlow 1.x and then we will look into TensorFlow 2.0.
You can install TensorFlow easily through pip by just typing the following command in your terminal:
pip install tensorflow
To check that the installation works, run a small program that prints a message:
import tensorflow as tf
hello = tf.constant("Hello TensorFlow!")
with tf.Session() as sess:
    print(sess.run(hello))
The preceding program should print Hello TensorFlow!. If you get any errors, then you probably have not installed TensorFlow correctly.
There are two types of dependency in the computational graph, called direct and
indirect dependency. Say we have node b, the input of which is dependent on the
output of node a; this type of dependency is called direct dependency, as shown in
the following code:
a = tf.multiply(8,5)
b = tf.multiply(a,1)
When node b doesn't depend on node a for its input, it is called indirect dependency,
as shown in the following code:
a = tf.multiply(8,5)
b = tf.multiply(4,3)
We can create a graph and define operations inside it as follows (x and y here are example constants used only to illustrate adding an operation to the graph):
graph = tf.Graph()
with graph.as_default():
    x = tf.constant(10)
    y = tf.constant(20)
    z = tf.add(x, y, name='Add')
If we want to clear the default graph (that is, if we want to clear the previously defined variables and operations in the graph), then we can do this using tf.reset_default_graph().
Sessions
As mentioned in the previous section, a computational graph with operations on its
nodes and tensors to its edges is created, and in order to execute the graph, we use a
TensorFlow session.
sess = tf.Session()
After creating the session, we can execute our graph, using the sess.run() method.
a = tf.multiply(3,3)
print(a)
In order to execute the graph, we need to initialize and run the TensorFlow session,
as follows:
a = tf.multiply(3,3)
with tf.Session() as sess:
    print(sess.run(a))
Now that we have learned about sessions, in the next section, we will learn about
variables, constants, and placeholders.
Variables
Variables are containers used to store values. Variables are used as input to several
other operations in a computational graph. A variable can be created using the
tf.Variable() function, as shown in the following code:
x = tf.Variable(13)
We can also create a variable by drawing values from a random distribution (the shape used here is just for illustration):
W = tf.Variable(tf.random_normal([5, 5], stddev=0.35), name="weights")
As you can see in the preceding code, we create a variable, W, by randomly drawing values from a normal distribution with a standard deviation of 0.35.
It is used to set the name of the variable in the computational graph. So, in the
preceding code, Python saves the variable as W but in the TensorFlow graph, it will
be saved as weights.
Once we create a session, we run the initialization operation, which initializes all of
the defined variables, and only then can we run the other operations, as shown in
the following code:
x = tf.Variable(1212)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    print(sess.run(x))
Constants
Constants, unlike variables, cannot have their values changed. That is, constants are
immutable. Once they are assigned values, they cannot be changed throughout the
program. We can create constants using tf.constant(), as the following code shows:
x = tf.constant(13)
x = tf.placeholder("float", shape=None)
x = tf.placeholder("float", None)
y = x+3
If we run the preceding code, then it will return an error because we are trying to
compute y, where y= x+3 and x is a placeholder whose value is not assigned. As we
have learned, values for the placeholders will be assigned at runtime. We assign the
values of the placeholder using the feed_dict parameter. The feed_dict parameter
is basically a dictionary where the key represents the name of the placeholder, and
the value represents the value of the placeholder.
As you can see in the following code, we set feed_dict = {x:5}, which implies that the value for the x placeholder is 5:
with tf.Session() as sess:
    result = sess.run(y, feed_dict={x: 5})
    print(result)
This prints 8.0, since y = x + 3.
Introducing TensorBoard
TensorBoard is TensorFlow's visualization tool, which can be used to visualize a
computational graph. It can also be used to plot various quantitative metrics and
the results of several intermediate calculations. When we are training a really deep
neural network, it becomes confusing when we have to debug the network. So, if we
can visualize the computational graph in TensorBoard, we can easily understand
such complex models, debug them, and optimize them. TensorBoard also supports
sharing.
The tabs are pretty self-explanatory. The SCALARS tab shows useful information
about the scalar variables we use in our program. For example, it shows how the
value of a scalar variable called loss changes over several iterations.
The GRAPHS tab shows the computational graph. The DISTRIBUTIONS and
HISTOGRAMS tabs show the distribution of a variable. For example, our model's
weight distribution and histogram can be seen under these tabs. The EMBEDDINGS
tab is used for visualizing high-dimensional vectors, such as word embeddings.
Let's build a basic computational graph and visualize it in TensorBoard. Let's say we
have four constants, shown as follows:
x = tf.constant(1,name='x')
y = tf.constant(1,name='y')
a = tf.constant(3,name='a')
b = tf.constant(3,name='b')
Let's multiply x and y and a and b and save them as prod1 and prod2, as shown in the
following code:
prod1 = tf.multiply(x,y,name='prod1')
prod2 = tf.multiply(a,b,name='prod2')
sum = tf.add(prod1,prod2,name='sum')
As the name suggests, logdir specifies the directory where we want to store the graph, and graph specifies which graph we want to store:
with tf.Session() as sess:
    writer = tf.summary.FileWriter(logdir='./graphs', graph=sess.graph)
    print(sess.run(sum))
In the preceding code, ./graphs is the directory where we are storing our event file, and sess.graph specifies the current graph in our TensorFlow session. So, we are storing the current graph of the TensorFlow session in the graphs directory.
To start TensorBoard, go to your Terminal, locate the working directory, and type
the following:
tensorboard --logdir=graphs --port=8000
The logdir parameter indicates the directory where the event file is stored and port
is the port number. Once you run the preceding command, open your browser and
type http://localhost:8000/.
In the TensorBoard panel, under the GRAPHS tab, you can see the computational
graph:
As you may notice, all of the operations we have defined are clearly shown in the
graph.
In the previous section, we saw how prod1 and prod2 perform multiplication and
compute the result. We'll define a name scope called Product, and group the prod1
and prod2 operations, as shown in the following code:
with tf.name_scope("Product"):
    with tf.name_scope("prod1"):
        prod1 = tf.multiply(x,y,name='prod1')
    with tf.name_scope("prod2"):
        prod2 = tf.multiply(a,b,name='prod2')
with tf.name_scope("sum"):
    sum = tf.add(prod1,prod2,name='sum')
As you may notice, now, we have only two nodes, sum and Product:
Once we double-click on the nodes, we can see how the computation is happening.
As you can see, the prod1 and prod2 nodes are grouped under the Product scope,
and their results are sent to the sum node, where they will be added. You can see
how the prod1 and prod2 nodes compute their value:
The preceding graph is just a simple example. When we are working on a complex
project with a lot of operations, name scoping helps us to group similar operations
together and enables us to understand the computational graph better.
Now that we have learned about TensorFlow, in the next section, let's see how to
build handwritten digit classification using TensorFlow.
In this section, we will see how we can use our neural network to recognize these
handwritten digits, and we will get the hang of TensorFlow and TensorBoard.
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
tf.logging.set_verbosity(tf.logging.ERROR)
Now, load the MNIST dataset:
mnist = input_data.read_data_sets("data/mnist", one_hot=True)
In the preceding code, data/mnist implies the location where we store the MNIST dataset, and one_hot=True implies that we are one-hot encoding the labels (0 to 9).
We will see what we have in our data by executing the following code:
print(mnist.train.images.shape)
print(mnist.train.labels.shape)
print(mnist.test.images.shape)
We have 55000 images in the training set, each image is of size 784, and we have 10
labels, which are actually 0 to 9. Similarly, we have 10000 images in the test set.
img1 = mnist.train.images[0].reshape(28,28)
plt.imshow(img1, cmap='Greys')
Defining placeholders
As we have learned, we first need to define the placeholders for input and output.
Values for the placeholders will be fed in at runtime through feed_dict:
with tf.name_scope('input'):
    X = tf.placeholder("float", [None, num_input])
with tf.name_scope('output'):
    Y = tf.placeholder("float", [None, num_output])
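The placeholders above assume that the layer sizes have already been defined. A minimal set of definitions consistent with the MNIST data is shown below; the hidden layer sizes are illustrative choices, not values taken from this book:
num_input = 784      # 28 x 28 input pixels
num_hidden1 = 512    # neurons in the first hidden layer (illustrative)
num_hidden2 = 256    # neurons in the second hidden layer (illustrative)
num_hidden_3 = 128   # neurons in the third hidden layer (illustrative)
num_output = 10      # digits 0 to 9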
Since we have a four-layer network, we have four weights and four biases. We
initialize our weights by drawing values from the truncated normal distribution with
a standard deviation of 0.1. Remember, the dimensions of the weight matrix should
be the number of neurons in the previous layer x the number of neurons in the current layer.
For instance, the dimensions of weight matrix w3 should be the number of neurons in
hidden layer 2 x the number of neurons in hidden layer 3.
with tf.name_scope('weights'):
    weights = {
        'w1': tf.Variable(tf.truncated_normal([num_input, num_hidden1], stddev=0.1), name='weight_1'),
        'w2': tf.Variable(tf.truncated_normal([num_hidden1, num_hidden2], stddev=0.1), name='weight_2'),
        'w3': tf.Variable(tf.truncated_normal([num_hidden2, num_hidden_3], stddev=0.1), name='weight_3'),
        'out': tf.Variable(tf.truncated_normal([num_hidden_3, num_output], stddev=0.1), name='weight_4'),
    }
The shape of the bias should be the number of neurons in the current layer. For instance, the dimension of the b2 bias is the number of neurons in hidden layer 2. We set the bias value as a constant; 0.1 in all of the layers:
with tf.name_scope('biases'):
    biases = {
        'b1': tf.Variable(tf.constant(0.1, shape=[num_hidden1]), name='bias_1'),
        'b2': tf.Variable(tf.constant(0.1, shape=[num_hidden2]), name='bias_2'),
        'b3': tf.Variable(tf.constant(0.1, shape=[num_hidden_3]), name='bias_3'),
        'out': tf.Variable(tf.constant(0.1, shape=[num_output]), name='bias_4')
    }
Forward propagation
Now we'll define the forward propagation operation. We'll use ReLU activations in
all layers. In the last layers, we'll apply sigmoid activation, as shown in the following
code:
with tf.name_scope('Model'):
    with tf.name_scope('layer1'):
        layer_1 = tf.nn.relu(tf.add(tf.matmul(X, weights['w1']), biases['b1']))
    with tf.name_scope('layer2'):
        layer_2 = tf.nn.relu(tf.add(tf.matmul(layer_1, weights['w2']), biases['b2']))
    with tf.name_scope('layer3'):
        layer_3 = tf.nn.relu(tf.add(tf.matmul(layer_2, weights['w3']), biases['b3']))
    with tf.name_scope('output_layer'):
        y_hat = tf.nn.sigmoid(tf.matmul(layer_3, weights['out']) + biases['out'])
Next, we define the loss using TensorFlow's softmax cross-entropy function, tf.nn.softmax_cross_entropy_with_logits(), which takes two parameters:
• The logits parameter specifies the logits predicted by our network; for
example, y_hat
• The labels parameter specifies the actual labels; for example, true labels, Y
with tf.name_scope('Loss'):
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=y_hat, labels=Y))
Now, we need to minimize the loss using backpropagation. Don't worry! We don't
have to calculate the derivatives of all the weights manually. Instead, we can use
TensorFlow's optimizer:
learning_rate = 1e-4
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
Computing accuracy
We calculate the accuracy of our model as follows:
• The y_hat parameter denotes the predicted probability for each class of
our model. Since we have 10 classes, we will have 10 probabilities. If the
probability is high at position 7, then it means that our network predicts
the input image as digit 7 with high probability. The tf.argmax() function
returns the index of the largest value. Thus, tf.argmax(y_hat,1) gives the
index where the probability is high. Thus, if the probability is high at index 7,
then it returns 7.
• The Y parameter denotes the actual labels, and they are the one-hot encoded
values. That is, the Y parameter consists of zeros everywhere except at the
position of the actual image, where it has a value of 1. For instance, if the
input image is 7, then Y has a value of 0 at all indices except at index 7, where
it has a value of 1. Thus, tf.argmax(Y,1) returns 7 because that is where we
have a high value, 1.
The tf.equal(x, y) function takes x and y as inputs and returns the truth value of
(x == y) element-wise. Thus, correct_pred = tf.equal(predicted_digit,actual_
digit) consists of True where the actual and predicted digits are the same, and
False where the actual and predicted digits are not the same. We convert the
Boolean values in correct_pred into float values using TensorFlow's cast operation,
tf.cast(correct_pred, tf.float32). After converting them into float values, we
take the average using tf.reduce_mean().
with tf.name_scope('Accuracy'):
    predicted_digit = tf.argmax(y_hat, 1)
    actual_digit = tf.argmax(Y, 1)
    correct_pred = tf.equal(predicted_digit, actual_digit)
    accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
Creating a summary
We can also visualize how the loss and accuracy of our model change over several iterations in TensorBoard. So, we use the tf.summary module to get summaries of the variables. Since the loss and accuracy are scalar variables, we use tf.summary.scalar(), as shown in the following code:
tf.summary.scalar("Accuracy", accuracy)
tf.summary.scalar("Loss", loss)
Next, we merge all of the summaries we use in our graph, using tf.summary.merge_all(). We do this because when we have many summaries, running and storing them would become inefficient, so we run them once in our session instead of running them multiple times:
merge_summary = tf.summary.merge_all()
init = tf.global_variables_initializer()
Define the batch size, number of iterations, and learning rate, as follows:
learning_rate = 1e-4
num_iterations = 1000
batch_size = 128
Now, start the TensorFlow session and train the network: on every iteration, we sample a mini-batch of training data, run the optimizer on it, and print the loss and accuracy every 100 iterations:
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_iterations):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        sess.run(optimizer, feed_dict={X: batch_x, Y: batch_y})
        if i % 100 == 0:
            print('Iteration:', i, sess.run([loss, accuracy], feed_dict={X: batch_x, Y: batch_y}))
As you may notice from the following output, the loss decreases and the accuracy
increases over various training iterations:
If we double-click and expand Model, we can see that we have three hidden layers
and one output layer:
Similarly, we can double-click and see every node. For instance, if we open weights,
we can see how the four weights are initialized using truncated normal distribution,
and how it is updated using the Adam optimizer:
We can see how the accuracy is being calculated by double-clicking on the Accuracy
node:
Remember that we also stored a summary of our loss and accuracy variables. We
can find them under the SCALARS tab in TensorBoard. Figure 8.11 shows how the
loss decreases over iterations:
That's it. In the next section, we will learn about another interesting feature in
TensorFlow called eager execution.
x = tf.constant(11)
y = tf.constant(11)
z = x*y
With eager execution, we don't need to create a session; we can simply compute
z, just like we do in Python. In order to enable eager execution, just call the
tf.enable_eager_execution() function:
tf.enable_eager_execution()
x = tf.constant(11)
y = tf.constant(11)
z = x*y
print(z)
z.numpy()
121
sum = tf.add(x,y)
sum.numpy()
The tf.subtract function is used for finding the difference between two numbers:
difference = tf.subtract(x,y)
difference.numpy()
product = tf.multiply(x,y)
product.numpy()
division = tf.divide(x,y)
division.numpy()
array([0.33333334, 1. , 3. ], dtype=float32)
10.0
Next, let's find the index of the minimum and maximum elements:
tf.argmin(x).numpy()
tf.argmax(x).numpy()
Run the following code to find the squared difference between x and y:
x = tf.Variable([1,3,5,7,11])
y = tf.Variable([1])
tf.math.squared_difference(x,y).numpy()
Let's try typecasting; that is, converting from one data type into another.
print(x.dtype)
tf.int32
We can convert the type of x, which is tf.int32, into tf.float32, using tf.cast, as
shown in the following code:
x = tf.cast(x, dtype=tf.float32)
print(x.dtype)
tf.float32
x = [[3,6,9], [7,7,7]]
y = [[4,5,6], [5,5,5]]
Concatenate the two matrices row-wise (along axis 0):
tf.concat([x, y], 0).numpy()
array([[3, 6, 9],
       [7, 7, 7],
       [4, 5, 6],
       [5, 5, 5]], dtype=int32)
Concatenate them column-wise (along axis 1):
tf.concat([x, y], 1).numpy()
array([[3, 6, 9, 4, 5, 6],
       [7, 7, 7, 5, 5, 5]], dtype=int32)
tf.stack(x, axis=1).numpy()
array([[3, 7],
[6, 7],
[9, 7]], dtype=int32)
Now, let's see how to perform the reduce_mean operation. First, create x:
x = tf.Variable([[1.0, 5.0], [2.0, 3.0]])
x.numpy()
array([[1., 5.],
       [2., 3.]], dtype=float32)
Compute the mean value of x; that is, (1.0 + 5.0 + 2.0 + 3.0) / 4:
tf.reduce_mean(input_tensor=x).numpy()
2.75
Compute the mean across the rows (axis 0); that is, (1.0+2.0)/2 and (5.0+3.0)/2:
tf.reduce_mean(input_tensor=x, axis=0).numpy()
array([1.5, 4. ], dtype=float32)
Compute the mean across the columns (axis 1), keeping the dimensions; that is, (1.0+5.0)/2.0 and (2.0+3.0)/2.0:
tf.reduce_mean(input_tensor=x, axis=1, keepdims=True).numpy()
array([[3. ],
       [2.5]], dtype=float32)
tf.nn.softmax(x).numpy()
def square(x):
    return tf.multiply(x, x)
The gradients can be computed for the preceding square function using
tf.GradientTape, as follows:
36.0
To install TensorFlow 2.0, open your Terminal and type the following command:
Since TensorFlow 2.0 uses Keras as a high-level API, we will look at how Keras
works in the next section.
Bonjour Keras
Keras is another popularly used deep learning library. It was developed by François
Chollet at Google. It is well known for its fast prototyping, and it makes model
building simple. It is a high-level library, meaning that it does not perform any low-
level operations on its own, such as convolution. It uses a backend engine for doing
that, such as TensorFlow. The Keras API is available in tf.keras, and TensorFlow 2.0
uses it as the primary API.
First, import the sequential model and the Dense layer from tf.keras and create the model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential()
Now, add the first layer, a fully connected layer with 13 neurons that takes a 7-dimensional input:
model.add(Dense(13, input_dim=7, activation='relu'))
In the preceding code, Dense implies a fully connected layer, input_dim implies the dimension of our input, and activation specifies the activation function that we use. We can stack up as many layers as we want, one above another.
Define the next layer with the relu activation function, as follows:
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
The final code block of the sequential model is shown as follows. As you can see, the
Keras code is much simpler than the TensorFlow code:
model = Sequential()
model.add(Dense(13, input_dim=7, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Keras also provides a functional API for defining models. First, we import the required classes and define the input, specifying its shape:
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
input = Input(shape=(2,))
Now, we'll define our first fully connected layer with 10 neurons and relu activation, using the Dense class, as shown:
layer1 = Dense(10, activation='relu')
We defined layer1, but where is the input to layer1 coming from? We need to specify the input to layer1 in a bracket notation at the end, as shown:
layer1 = Dense(10, activation='relu')(input)
We define the next layer, layer2, with 10 neurons and relu activation. The input to layer2 comes from layer1, so that is added in the bracket at the end, as shown in the following code:
layer2 = Dense(10, activation='relu')(layer1)
Now, we can define the output layer with a single neuron and the sigmoid activation function. Input to the output layer comes from layer2, so that is added in the bracket at the end:
output = Dense(1, activation='sigmoid')(layer2)
After defining all of the layers, we define the model using a Model class, where we
need to specify inputs and outputs, as follows:
input = Input(shape=(2,))
layer1 = Dense(10, activation='relu')(input)
layer2 = Dense(10, activation='relu')(layer1)
output = Dense(1, activation='sigmoid')(layer2)
model = Model(inputs=input, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.evaluate(x=data_test,y=labels_test)
We can also evaluate the model on the same train set, and that will help us to
understand the training accuracy:
model.evaluate(x=data,y=labels)
That's it. Let's see how to use TensorFlow for the MNIST digit classification task in
the next section.
mnist = tf.keras.datasets.mnist
Create a training set and a test set with the following code:
(x_train, y_train), (x_test, y_test) = mnist.load_data()
Normalize the train and test sets by dividing the values of x by the maximum value of x; that is, 255.0:
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential()
Now, let's add layers to the model. We use a three-layer network with the ReLU
function in the hidden layer and softmax in the final layer:
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(256, activation="relu"))
model.add(tf.keras.layers.Dense(128, activation="relu"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))
Compile and train the model (the number of epochs used here is an illustrative choice):
model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10)
Finally, evaluate the model on the test set:
model.evaluate(x_test, y_test)
That's it! Writing code with the Keras API is that simple.
Summary
We started off this chapter by understanding TensorFlow and how it uses
computational graphs. We learned that computation in TensorFlow is represented by
a computational graph, which consists of several nodes and edges, where nodes are
mathematical operations, such as addition and multiplication, and edges are tensors.
Next, we learned that variables are containers used to store values, and they are used
as input to several other operations in a computational graph. We also learned that
placeholders are like variables, where we only define the type and dimension but do
not assign the values, and the values for the placeholders are fed at runtime. Moving
forward, we learned about TensorBoard, which is TensorFlow's visualization
tool and can be used to visualize a computational graph. We also explored eager
execution, which is more Pythonic and allows rapid prototyping.
We understood that, unlike graph mode, where we need to construct a graph every
time to perform any operation, eager execution follows the imperative programming
paradigm, where any operation can be performed immediately, without having to
create a graph, just like we do in Python.
In the next chapter, we begin our deep reinforcement learning (DRL) journey by
understanding one of the popular DRL algorithms, called the Deep Q Network
(DQN).
Questions
Let's put our knowledge of TensorFlow to the test by answering the following
questions:
Further reading
You can learn more about TensorFlow by checking its official documentation at
https://www.tensorflow.org/tutorials.
9
Deep Q Network and Its Variants
In this chapter, let's get started with one of the most popular Deep Reinforcement
Learning (DRL) algorithms called Deep Q Network (DQN). Understanding DQN
is very important as many of the state-of-the-art DRL algorithms are based on DQN.
The DQN algorithm was first proposed by researchers at Google's DeepMind in
2013 in the paper Playing Atari with Deep Reinforcement Learning. They described
the DQN architecture and explained why it was so effective at playing Atari games
with human-level accuracy. We begin the chapter by learning what exactly a deep
Q network is, and how it is used in reinforcement learning. Next, we will deep
dive into the algorithm of DQN. Then we will learn how to implement DQN to
play Atari games.
After learning about DQN, we will cover several variants of DQN, such as double
DQN, DQN with prioritized experience replay, dueling DQN, and the deep
recurrent Q network in detail.
In this chapter, we will cover the following topics:
• What is DQN?
• The DQN algorithm
• Playing Atari games with DQN
• Double DQN
• DQN with prioritized experience replay
• The dueling DQN
• The deep recurrent Q network
What is DQN?
The objective of reinforcement learning is to find the optimal policy, that is, the
policy that gives us the maximum return (the sum of rewards of the episode). In
order to compute the policy, first we compute the Q function. Once we have the Q
function, then we extract the policy by selecting an action in each state that has the
maximum Q value. For instance, let's suppose we have two states A and B and our
action space consists of two actions; let the actions be up and down. So, in order to
find which action to perform in state A and B, first we compute the Q value of all
state-action pairs, as Table 9.1 shows:
Once we have the Q value of all state-action pairs, then we select the action in each
state that has the maximum Q value. So, we select the action up in state A and down
in state B as they have the maximum Q value. We improve the Q function on every
iteration and once we have the optimal Q function, then we can extract the optimal
policy from it.
Now, let's revisit our grid world environment, as shown in Figure 9.1:
We learned that in the grid world environment, the goal of our agent is to reach state
I from state A without visiting the shaded states, and in each state, the agent has to
perform one of the four actions—up, down, left, right.
To compute the policy, first we compute the Q values of all state-action pairs. Here,
the number of states is 9 (A to I) and we have 4 actions in our action space, so our Q
table will consist of 9 x 4 = 36 rows containing the Q values of all possible state-action
pairs. Once we obtain the Q values, then we extract the policy by selecting the action
in each state that has the maximum Q value. But is it a good approach to compute
the Q value exhaustively for all state-action pairs? Let's explore this in more detail.
Let's suppose we have an environment where we have 1,000 states and 50 possible
actions in each state. In this case, our Q table will consist of 1,000 x 50 = 50,000 rows
containing the Q values of all possible state-action pairs. In cases like this, where
our environment consists of a large number of states and actions, it will be very
expensive to compute the Q values of all possible state-action pairs in an exhaustive
fashion.
Instead of computing Q values in this way, can we approximate them using any
function approximator, such as a neural network? Yes! We can parameterize our Q function by a parameter θ and compute the Q value, where θ is just the parameter of our neural network. So, we just feed the state of the environment
to a neural network and it will return the Q value of all possible actions in that state.
Once we obtain the Q values, then we can select the best action as the one that has
the maximum Q value.
For example, let's consider our grid world environment. As Figure 9.2 shows, we
just feed state D as an input to the network and it returns the Q value of all actions
in state D, which are up, down, left, and right, as output. Then, we select the action
that has the maximum Q value. Since action right has a maximum Q value, we select
action right in the state D:
Since we are using a neural network to approximate the Q value, the neural network
is called the Q network, and if we use a deep neural network to approximate the Q
value, then the deep neural network is called a deep Q network (DQN).
We can denote our Q function by Q_θ(s, a), where the subscript θ indicates that our Q function is parameterized by θ, and θ is just the parameter of our neural network.
We initialize the network parameter θ with random values and approximate the Q function (Q values), but since we initialized θ with random values, the approximated Q function will not be optimal. So, we train the network for several iterations by finding the optimal parameter θ. Once we find the optimal θ, we will have the optimal Q function. Then we can extract the optimal policy from the optimal Q function.
Okay, but how can we train our network? What about the training data and the
loss function? Is it a classification or regression task? Now that we have a basic
understanding of how DQN works, in the next section, we will get into the details
and address all these questions.
Understanding DQN
In this section, we will understand how exactly DQN works. We learned that we use
DQN to approximate the Q value of all the actions in the given input state. The Q
value is just a continuous number, so we are essentially using our DQN to perform
a regression task.
Okay, what about the training data? We use a buffer called a replay buffer to collect
the agent's experience and, based on this experience, we train our network. Let's
explore the replay buffer in detail.
Replay buffer
We know that the agent makes a transition from a state s to the next state s′ by performing some action a, and then receives a reward r. We can save this transition information (s, a, r, s′) in a buffer called a replay buffer or experience replay. The replay buffer is usually denoted by D. This transition information is basically the agent's experience. We store the agent's experience over several episodes in the replay buffer. The key idea of using the replay buffer to store the agent's experience is that we can train our DQN with experience (transitions) sampled from the buffer. A replay buffer is shown here:
The following steps help us to understand how we store the transition information in the replay buffer D:
Episode 1:
Episode 2:
Now, this information will be stored in the replay buffer, as Figure 9.6 shows:
Note that the replay buffer is of limited size, that is, a replay buffer will store only
a fixed amount of the agent's experience. So, when the buffer is full we replace the
old experience with new experience. A replay buffer is usually implemented as a
queue structure (first in first out) rather than a list. So, if the buffer is full when new
experience comes in, we remove the old experience and add the new experience
into the buffer.
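To make this concrete, here is a minimal sketch of a replay buffer built on a deque, matching the structure used in the implementation later in this chapter; the capacity of 5000 is an arbitrary choice:

from collections import deque
import random

replay_buffer = deque(maxlen=5000)

def store_transition(state, action, reward, next_state, done):
    # the oldest experience is discarded automatically once the buffer is full
    replay_buffer.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size):
    # sample random, uncorrelated transitions for training
    return random.sample(replay_buffer, batch_size)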
Loss function
We learned that in DQN, our goal is to predict the Q value, which is just a
continuous value. Thus, in DQN we basically perform a regression task. We
generally use the mean squared error (MSE) as the loss function for the regression
task. MSE can be defined as the average squared difference between the target value
and the predicted value, as shown here:
\text{MSE} = \frac{1}{K}\sum_{i=1}^{K}(y_i - \hat{y}_i)^2

Where y is the target value, ŷ is the predicted value, and K is the number of training samples.
Now, let's learn how to use MSE in the DQN and train the network. We can train
our network by minimizing the MSE between the target Q value and predicted Q
value. First, how can we obtain the target Q value? Our target Q value should be the
optimal Q value so that we can train our network by minimizing the error between
the optimal Q value and predicted Q value. But how can we compute the optimal
Q value? This is where the Bellman equation helps us. In Chapter 3, The Bellman
Equation and Dynamic Programming, we learned that the optimal Q value can be
obtained using the Bellman optimality equation:
Thus, according to the Bellman optimality equation, the optimal Q value is just the sum of the reward and the discounted maximum Q value of the next state-action pair, that is:

Q^*(s, a) = r + \gamma \max_{a'} Q^*(s', a')
So, we can define our loss as the difference between the target value (the optimal Q value) and the predicted value (the Q value predicted by the DQN) and express the loss function L as:

L(\theta) = r + \gamma \max_{a'} Q(s', a') - Q_\theta(s, a)
We know that we compute the predicted Q value using the network parameterized by θ. How can we compute the target value? We learned that the target value is the sum of the reward and the discounted maximum Q value of the next state-action pair. How do we compute the Q value of the next state-action pair? Similar to the predicted Q value, we can compute the Q value of the next state-action pair in the target using the same DQN parameterized by θ. So, we can rewrite our loss function as:

L(\theta) = r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)
As shown, both the target value and the predicted Q value are parameterized by θ. Instead of computing the loss as just the difference between the target Q value and the predicted Q value, we use the MSE as our loss function. We learned that we store the agent's experience in a buffer called a replay buffer. So, we randomly sample a minibatch of K transitions (s, a, r, s′) from the replay buffer and train the network by minimizing the MSE, as shown here:
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\Big(r_i + \gamma \max_{a'} Q_\theta(s'_i, a') - Q_\theta(s_i, a_i)\Big)^2
For simplicity of notation, we can denote the target value by y and rewrite the
preceding equation as:
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\big(y_i - Q_\theta(s_i, a_i)\big)^2
Where y_i = r_i + γ max_{a'} Q_θ(s'_i, a'). We have learned that the target value is just the sum of the reward and the discounted maximum Q value of the next state-action pair. But what if the next state s′ is a terminal state? If the next state s′ is terminal, then we cannot compute the Q value, as we don't take any action in the terminal state; so in that case, the target value will be just the reward, as shown here:

y_i = \begin{cases} r_i & \text{if } s'_i \text{ is terminal} \\ r_i + \gamma \max_{a'} Q_\theta(s'_i, a') & \text{if } s'_i \text{ is not terminal} \end{cases}
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\big(y_i - Q_\theta(s_i, a_i)\big)^2
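As a quick illustration, the target for a single transition could be computed as shown in the following sketch, where main_network is assumed to be the DQN described above and gamma is the discount factor:

import numpy as np

def compute_target(reward, next_state, done, gamma=0.9):
    # terminal state: the target is just the reward
    if done:
        return reward
    # non-terminal state: reward plus the discounted maximum Q value of the next state
    next_Q = main_network.predict(next_state)
    return reward + gamma * np.amax(next_Q[0])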
We train our network by minimizing the loss function. We can minimize the loss function by finding the optimal parameter θ. So, we use gradient descent to find the optimal parameter θ. We compute the gradient of our loss function, ∇_θ L(θ), and update our network parameter θ as:

\theta = \theta - \alpha \nabla_\theta L(\theta)
Target network
In the last section, we learned that we train the network by minimizing the loss
function, which is the MSE between the target value and the predicted value, as
shown here:
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\Big(r_i + \gamma \max_{a'} Q_\theta(s'_i, a') - Q_\theta(s_i, a_i)\Big)^2
However, there is a small issue with our loss function. We have learned that the target value is just the sum of the reward and the discounted maximum Q value of the next state-action pair. We compute this Q value of the next state-action pair in the target and the predicted Q value using the same network parameterized by θ, as shown here:
The problem is that, since the target and predicted values depend on the same parameter θ, this will cause instability in the MSE and the network learns poorly. It also causes a lot of divergence during training.
Let's understand this with a simple example. We will take arbitrary numbers to make it easier to understand. We know that we try to minimize the difference between the target value and the predicted value. So, on every iteration, we compute the gradient of the loss and update our network parameter θ so that we can make our predicted value the same as the target value.
Let's suppose in iteration 1, the target value is 13 and the predicted value is 11. So, we update our parameter θ to match the predicted value to the target value, which is 13. But in the next iteration, the target value changes to 15 and the predicted value becomes 13 since we updated our network parameter θ. So, again we update our parameter θ to match the predicted value to the target value, which is now 15. But in the next iteration, the target value changes to 17 and the predicted value becomes 15 since we updated our network parameter θ.
As Table 9.2 shows, on every iteration, the predicted value tries to be the same as the
target value, which keeps on changing:
This is because the predicted and target values both depend on the same parameter θ. If we update θ, then both the target and predicted values change. Thus, the predicted value keeps on trying to be the same as the target value, but the target value keeps on changing due to the update on the network parameter θ.
How can we avoid this? Can we freeze the target value for a while and compute only the predicted value so that our predicted value matches the target value? Yes! To do this, we introduce another neural network called a target network for computing the Q value of the next state-action pair in the target. The parameter of the target network is represented by θ′. So, our main deep Q network, which is used for predicting Q values, learns the optimal parameter θ using gradient descent. The target network is frozen for a while, and then the target network parameter θ′ is updated by just copying the main deep Q network parameter θ. Freezing the target network for a while and then updating its parameter θ′ with the main network parameter θ stabilizes the training.
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\Big(r_i + \gamma \max_{a'} Q_{\theta'}(s'_i, a') - Q_\theta(s_i, a_i)\Big)^2
Thus, the Q value of the next state-action pair in the target is computed by the target network parameterized by θ′, and the predicted Q value is computed by our main network parameterized by θ:
For notation simplicity, we can represent our target value by y and rewrite the
preceding equation as:
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\big(y_i - Q_\theta(s_i, a_i)\big)^2
Now, for each step in the episode, we feed the state of the environment to our
network and it outputs the Q values of all possible actions in that state. Then, we
select the action that has the maximum Q value:
If we only select the action that has the highest Q value, then we will not explore
any new actions. So, to avoid this, we select actions using the epsilon-greedy policy.
With the epsilon-greedy policy, we select a random action with probability epsilon
and with probability 1-epsilon, we select the best action that has the maximum
Q value.
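For example, an epsilon-greedy step can be sketched as follows; here main_network is assumed to be the DQN that predicts Q values and epsilon is the exploration rate:

import random
import numpy as np

def epsilon_greedy(state, epsilon, action_size):
    # explore: pick a random action with probability epsilon
    if random.uniform(0, 1) < epsilon:
        return np.random.randint(action_size)
    # exploit: pick the action with the maximum predicted Q value
    Q_values = main_network.predict(state)
    return np.argmax(Q_values[0])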
Note that, since we initialized our network parameter θ with random values, the action we select by taking the maximum Q value will not be the optimal action. But that's okay; we simply perform the selected action, move to the next state, and obtain the reward. If the action is good then we will receive a positive reward, and if it is bad then the reward will be negative. We store all this transition information (s, a, r, s′) in the replay buffer D.
Next, we randomly sample a minibatch of K transitions from the replay buffer and
compute the loss. We have learned that our loss function is computed as:
L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\big(y_i - Q_\theta(s_i, a_i)\big)^2

After computing the loss, we compute its gradient and update our main network parameter θ using gradient descent:

\theta = \theta - \alpha \nabla_\theta L(\theta)
We don't update the target network parameter θ′ at every time step. We freeze the target network parameter θ′ for several time steps, and then we copy the main network parameter θ to the target network parameter θ′.
We keep repeating the preceding steps for several episodes to approximate the optimal Q value. Once we have the optimal Q value, we extract the optimal policy from it. To give us a more detailed understanding, the algorithm of DQN is given in the next section.
Now that we have understood how DQN works, in this next section, we will learn
how to implement it.
Thus, now our DQN is a CNN. We feed the image of the game screen (the game
state) as input to the CNN, and it outputs the Q values of all the actions in the state.
As Figure 9.8 shows, given the image of the game screen, the convolutional layers
extract features from the image and produce a feature map. Next, we flatten the
feature map and feed the flattened feature map as input to the feedforward network.
The feedforward network takes this flattened feature map as input and returns the
Q values of all the actions in the state:
Now that we have understood the architecture of the DQN to play Atari games, in
the next section, we will start implementing it.
import random
import gym
import numpy as np
from collections import deque
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.optimizers import Adam
env = gym.make("MsPacman-v0")
action_size = env.action_space.n
To avoid this, we preprocess the game screen and then feed the preprocessed game
screen to the DQN. First, we crop and resize the game screen image, convert the
image to grayscale, normalize it, and then reshape the image to 88 x 80 x 1. Next, we
feed this preprocessed game screen image as input to the CNN, which returns the Q
values.
Now, let's define a function called preprocess_state, which takes the game state
(image of the game screen) as an input and returns the preprocessed game state:
def preprocess_state(state):
    # crop and downsample the frame (the slicing values are an assumption chosen to yield an 88 x 80 image)
    image = state[1:176:2, ::2]
    # convert the image to grayscale by averaging the color channels
    image = image.mean(axis=2)
    # blank out the background (color is an assumed precomputed constant, for example np.array([210, 164, 74]).mean())
    image[image==color] = 0
    # normalize the pixel values and reshape the image to 88 x 80 x 1
    image = (image - 128) / 128 - 1
    return np.expand_dims(image.reshape(88, 80), axis=2)
class DQN:
    def __init__(self, state_size, action_size):

        self.state_size = state_size
        self.action_size = action_size

        self.replay_buffer = deque(maxlen=5000)

        self.gamma = 0.9
        self.epsilon = 0.8
Define the update rate at which we want to update the target network:

        self.update_rate = 1000

        self.main_network = self.build_network()
        self.target_network = self.build_network()
        self.target_network.set_weights(self.main_network.get_weights())
    def build_network(self):
        model = Sequential()
        model.add(Conv2D(32, (8, 8), strides=4, padding='same', input_shape=self.state_size))
        model.add(Activation('relu'))
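The second and third convolutional layers follow the same pattern; the filter sizes in this sketch follow the standard DQN architecture and are an assumption here:

        model.add(Conv2D(64, (4, 4), strides=2, padding='same'))
        model.add(Activation('relu'))
        model.add(Conv2D(64, (3, 3), strides=1, padding='same'))
        model.add(Activation('relu'))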
Flatten the feature maps obtained as a result of the third convolutional layer, feed them to the fully connected layers, where the final layer outputs one Q value per action, and compile the model with the MSE loss:

        model.add(Flatten())
        model.add(Dense(512, activation='relu'))
        model.add(Dense(self.action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam())

        return model
Compute the predicted values using the main network, store them in Q_values, and then replace the Q value of the action that was performed with the computed target value target_Q:

        Q_values = self.main_network.predict(state)
        Q_values[0][action] = target_Q
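These two lines sit inside the train method. A minimal sketch of how the whole method might be wired together is shown here; the exact structure is an assumption based on the description above, and the states are assumed to already carry a batch dimension:

    def train(self, batch_size):
        # sample a minibatch of transitions from the replay buffer
        minibatch = random.sample(self.replay_buffer, batch_size)
        for state, action, reward, next_state, done in minibatch:
            # compute the target value using the target network
            if done:
                target_Q = reward
            else:
                target_Q = reward + self.gamma * np.amax(
                    self.target_network.predict(next_state)[0])
            # compute the predicted values and correct the value of the performed action
            Q_values = self.main_network.predict(state)
            Q_values[0][action] = target_Q
            # train the main network toward this corrected target
            self.main_network.fit(state, Q_values, epochs=1, verbose=0)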
    def update_target_network(self):
        self.target_network.set_weights(self.main_network.get_weights())
num_episodes = 500
num_timesteps = 20000
batch_size = 8
num_screens = 4
done = False
time_step = 0
for i in range(num_episodes):

Set Return to 0:

    Return = 0
    state = preprocess_state(env.reset())

    for t in range(num_timesteps):
If the number of transitions in the replay buffer is greater than the batch size, then
train the network:
        if len(dqn.replay_buffer) > batch_size:
            dqn.train(batch_size)
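For reference, a minimal sketch of the inner loop surrounding this check might look as follows; method names such as epsilon_greedy and store_transition are assumptions based on the description above, and env.step is the classic Gym API returning four values:

    for t in range(num_timesteps):
        time_step += 1
        # update the target network every update_rate time steps
        if time_step % dqn.update_rate == 0:
            dqn.update_target_network()
        # select an action, act in the environment, and store the transition
        action = dqn.epsilon_greedy(state)
        next_state, reward, done, info = env.step(action)
        next_state = preprocess_state(next_state)
        dqn.store_transition(state, action, reward, next_state, done)
        state = next_state
        Return += reward
        if done:
            break
        if len(dqn.replay_buffer) > batch_size:
            dqn.train(batch_size)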
By rendering the environment, we can also observe how the agent learns to play the
game over a series of episodes:
Now that we have learned how DQNs work and how to build a DQN to play Atari
games, in the next section, we will learn an interesting variant of DQN called the
double DQN.
One of the problems with a DQN is that it tends to overestimate the Q value of the next state-action pair in the target:

y = r + \gamma \max_{a'} Q_{\theta'}(s', a')
This overestimation is due to the presence of the max operator. Let's see how this overestimation happens with an example. Suppose we are in a state s′ and we have three actions a1, a2, and a3. Assume a3 is the optimal action in the state s′. When we estimate the Q values of all the actions in state s′, the estimated Q values will have some noise and differ from the actual values. Say, due to the noise, action a2 gets a higher Q value than the optimal action a3.
Now, if we select the best action as the one that has the maximum value, then we will end up selecting the action a2 instead of the optimal action a3, as shown here:

To avoid this overestimation, in double DQN we compute the target value using two Q functions:

y = r + \gamma Q_{\theta'}\big(s', \arg\max_{a'} Q_\theta(s', a')\big)

As we can observe, now we have two Q functions in our target value computation. One Q function parameterized by the main network parameter θ is used for action selection, and the other Q function parameterized by the target network parameter θ′ is used for Q value computation.
Let's understand the preceding equation by breaking it down into two steps:
• Action selection: First, we compute the Q values of all the next state-action pairs using the main network parameterized by θ, and then we select the action a′ that has the maximum Q value.
• Q value computation: Once we have selected the action a′, we compute its Q value using the target network parameterized by θ′.
Let's understand this with an example. Let's suppose state s′ is E; then we can write:

y = r + \gamma Q_{\theta'}\big(E, \arg\max_{a'} Q_\theta(E, a')\big)
First, we compute the Q values of all actions in state E using the main network parameterized by θ, and then we select the action that has the maximum Q value. Let's suppose the action that has the maximum Q value is right:
Now, we can compute the Q value using the target network parameterized by θ′ with the action selected by the main network, which is right. Thus, we can write:
y = r + \gamma Q_{\theta'}(E, \text{right})
Still not clear? The difference between how we compute the target value in DQN and
double DQN is shown here:
Thus, we learned that in a double DQN, we compute the target value using two Q functions. One Q function, parameterized by the main network parameter θ, is used for selecting the action that has the maximum Q value, and the other Q function, parameterized by the target network parameter θ′, computes the Q value using the action selected by the main network:
y = r + \gamma Q_{\theta'}\big(s', \arg\max_{a'} Q_\theta(s', a')\big)
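As a quick sketch, the double DQN target for a single transition can be computed as follows, assuming main_network and target_network are the two networks described above:

import numpy as np

def double_dqn_target(reward, next_state, done, gamma=0.9):
    if done:
        return reward
    # action selection with the main network
    best_action = np.argmax(main_network.predict(next_state)[0])
    # Q value computation with the target network
    next_Q = target_network.predict(next_state)[0][best_action]
    return reward + gamma * next_Q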
Apart from target value computation, double DQN works exactly the same as DQN.
To give us more clarity, the algorithm of double DQN is given in the next section.
Now that we have learned how the double DQN works, in the next section, we will
learn about an interesting variant of DQN called DQN with prioritized experience
replay.
Yes, but first, why do we need to assign a priority to a transition, and how can we decide which transitions should be given more priority than others? Let's explore this in more detail.
The TD error δ is the difference between the target value and the predicted value, as shown here:

\delta = r + \gamma \max_{a'} Q_{\theta'}(s', a') - Q_\theta(s, a)
A transition with a high TD error implies that our Q value estimate for that transition is far off, and so we need to learn more from that transition to minimize the error. A transition with a low TD error implies that our estimate for that transition is already good. We can always learn more from our mistakes rather than only focusing on what we are already good at, right? Similarly, we can learn more from the transitions that have a high TD error than from those that have a low TD error. Thus, we assign more priority to the transitions that have a high TD error and less priority to transitions that have a low TD error.
We know that the transition information consists of (s, a, r, s′), and along with this information, we also add the priority p and store the transition with its priority in our replay buffer as (s, a, r, s′, p). The following figure shows the replay buffer containing transitions along with their priorities:
In the next section, we will learn how to prioritize our transitions using the TD error
based on two different types of prioritization methods.
Types of prioritization
We can prioritize our transition using the following two methods:
• Proportional prioritization
• Rank-based prioritization
Proportional prioritization
We learned that the transition can be prioritized using the TD error, so the priority p
of the transition i will be just its TD error:
p_i = |\delta_i|
Note that we take the absolute value of the TD error as a priority to keep the priority
positive. Okay, what about a transition that has a TD error of zero? Say we have a
transition i and its TD error is 0, then the priority of the transition i will just be 0:
p_i = 0
But setting the priority of the transition to zero is not desirable, and if we set the
priority of a transition to zero then that particular transition will not be used in our
training at all. So, to avoid this issue, we will add a small value called epsilon to our
TD error. So, even if the TD error is zero, we will still have a small priority due to the
epsilon. To be more precise, adding an epsilon to the TD error guarantees that there
will be no transition with zero priority. Thus, our priority can be modified as:
p_i = |\delta_i| + \epsilon
Instead of having the priority as a raw number, we can convert it into a probability
so that we will have priorities ranging from 0 to 1. We can convert the priority to a
probability as shown here:
P(i) = \frac{p_i}{\sum_k p_k}
The preceding equation calculates the probability P of the transition i.
Can we also control the amount of prioritization? That is, instead of sampling only the prioritized transitions, can we also take a random transition? Yes! To do this, we introduce a new parameter called α and rewrite our equation as follows. When the value of α is high, say 1, then we sample only the transitions that have high priority, and when the value of α is low, say 0, then we sample transitions uniformly at random:
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
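A minimal sketch of proportional prioritization in NumPy is shown here; the values of alpha and epsilon are typical choices rather than values taken from the text:

import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, epsilon=0.01):
    # priority p_i = |delta_i| + epsilon, raised to the power alpha
    priorities = (np.abs(td_errors) + epsilon) ** alpha
    # convert the priorities into probabilities
    probs = priorities / priorities.sum()
    # sample transition indices according to these probabilities
    return np.random.choice(len(td_errors), batch_size, p=probs)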
Thus, we have learned how to assign priority to a transition using the proportional
prioritization method. In the next section, we will learn another prioritization
method called rank-based prioritization.
Rank-based prioritization
Rank-based prioritization is the simplest type of prioritization. Here, we assign
priority based on the rank of a transition. What is the rank of a transition? The rank
of a transition i can be defined as the location of the transition in the replay buffer
where the transitions are sorted from high TD error to low TD error.
Thus, we can define the priority of the transition i using rank as:
p_i = \frac{1}{\text{Rank}(i)}
Just as we learned in the previous section, we convert the priority into probability:
P(i) = \frac{p_i}{\sum_k p_k}
Similar to what we learned in the previous section, we can add a parameter to
control the amount of prioritization and express our final equation as:
P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}
Okay, what's the issue with this? It will lead to the problem of overfitting, and our
agent will be highly biased to those transitions that have a high TD error. To combat
this, we use importance weights w. The importance weights help us to reduce the
weights of transitions that have occurred many times. The importance weight w of
the transition i can be expressed as:
w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^{\beta}
In the preceding expression, N denotes the length of our replay buffer and P(i) denotes the probability of the transition i. Okay, what's that parameter β? It controls the importance weight. We start off with a small value of β, say 0.4, and anneal it toward 1.
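A small sketch of the importance weight computation is given here; normalizing by the maximum weight is a common convention and an assumption on our part:

def importance_weights(probs, buffer_size, beta=0.4):
    # probs is a NumPy array of the sampling probabilities P(i)
    weights = (1.0 / (buffer_size * probs)) ** beta
    # scale the weights so that the largest weight is 1 (assumed convention)
    return weights / weights.max()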
Thus, in this section, we have learned how to prioritize transitions in DQN with
prioritized experience replay. In the next section, we will learn about another
interesting variant of DQN called dueling DQN.
• Q function: The Q function gives the expected return an agent would obtain starting from state s, performing action a, and following the policy π.
• Value function: The value function gives the expected return an agent would obtain starting from state s and following the policy π.
Now if we think intuitively, what's the difference between the Q function and the value function? The Q function gives us the value of a state-action pair, while the value function gives the value of a state irrespective of the action. The difference between the Q function and the value function is called the advantage function, and it tells us how good the action a is compared to the average action in state s:

A(s, a) = Q(s, a) - V(s)

Thus, the advantage function tells us how good the action a is in state s compared to the average of all actions. Now that we have understood what the advantage function is, let's see why and how we can make use of it in the DQN.
Let's suppose we are in some state s and we have 20 possible actions to perform in
this state. Computing the Q values of all these 20 actions in state s is not going to
be useful because most of the actions will not have any effect on the state, and also
most of the actions will have a similar Q value. What do we mean by that? Let's
understand this with the grid world environment shown in Figure 9.12:
As we can see, the agent is in state A. In this case, what is the use of computing the
Q value of the action up in state A? Moving up will have no effect in state A, and it's
not going to take the agent anywhere. Similarly, think of an environment where our
action space is huge, say 100. In this case, most of the actions will not have any effect
in the given state. Also, when the action space is large, most of the actions will have
a similar Q value.
Now, let's talk about the value of a state. Note that not all the states are important
for an agent. There could be a state that always gives a bad reward no matter what
action we perform. In that case, it is not useful to compute the Q value of all possible
actions in the state if we know that the state is always going to give us a bad reward.
Thus, to solve this we can compute the Q function as the sum of the value function
and the advantage function. That is, with the value function, we can understand
whether a state is valuable or not without computing the values of all actions in the
state. And with the advantage function, we can understand whether an action is
really good or it just gives us the same value as all the other actions.
Now that we have a basic idea of dueling DQN, let's explore the architecture of
dueling DQN in the next section.
We learned that we compute the Q value by adding the state value and the
advantage value together, so we combine the value stream and the advantage stream
using another layer called the aggregate layer, and compute the Q value as Figure
9.14 shows.
Thus, the value stream computes the state value, the advantage stream computes the
advantage value, and the aggregate layer sums these streams and computes the Q
value:
But there is a small issue here. Just summing the state value and advantage
value in the aggregate layer and computing the Q value leads us to a problem of
identifiability.
So, to combat this problem, we make the advantage function have zero advantage for the selected action. We can achieve this by subtracting the average advantage value, that is, the average advantage of all actions in the action space, as shown here:
Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)
Thus, we can write our final equation for computing the Q value with parameters as:
Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a'; \theta, \alpha)\right)
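To illustrate, the following is a minimal sketch of a dueling head in Keras; the layer sizes and the state_size/action_size placeholders are arbitrary choices, not values from the text:

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, Lambda
from tensorflow.keras.models import Model

state_size, action_size = 4, 2     # placeholder dimensions

inputs = Input(shape=(state_size,))
x = Dense(128, activation='relu')(inputs)
value = Dense(1)(x)                # value stream V(s)
advantage = Dense(action_size)(x)  # advantage stream A(s, a)

# aggregate layer: Q(s, a) = V(s) + (A(s, a) - mean of A(s, a'))
q_values = Lambda(lambda streams: streams[0] +
                  (streams[1] - tf.reduce_mean(streams[1], axis=1, keepdims=True)))(
                  [value, advantage])

model = Model(inputs=inputs, outputs=q_values)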
Thus, the only difference between a dueling DQN and DQN is that in a dueling
DQN, instead of computing the Q values directly, we compute them by combining
the state value and the advantage value.
In the next section, we will explore another variant of DQN called deep recurrent
Q network.
But most real-world environments are only partially observable; we cannot see
all the states. For instance, consider an agent learning to walk in a real-world
environment. In this case, the agent will not have complete knowledge of the
environment (the real world); it will have no information outside its view.
Thus, in a POMDP, states provide only partial information, but keeping the
information about past states in the memory will help the agent to understand more
about the nature of the environment and find the optimal policy. Thus, in POMDP,
we need to retain information about past states in order to take the optimal action.
So, can we take advantage of recurrent neural networks to understand and retain
information about the past states as long as it is required? Yes, the Long Short-Term
Memory Recurrent Neural Network (LSTM RNN) is very useful for retaining,
forgetting, and updating the information as required. So, we can use the LSTM
layer in the DQN to retain information about the past states as long as it is required.
Retaining information about the past states helps when we have the problem of
POMDP.
Now that we have a basic understanding of why we need the deep recurrent Q network (DRQN) and how it solves the problem of a POMDP, in the next section, we will look into the architecture of the DRQN.
We pass the game screen as an input to the convolutional layer. The convolutional
layer convolves the image and produces a feature map. The resulting feature map is
then passed to the LSTM layer. The LSTM layer has memory to hold information. So,
it retains information about important previous game states and updates its memory
over time steps as required. Then, we feed the hidden state from the LSTM layer to
the fully connected layer, which outputs the Q value.
Figure 9.16 helps us to understand how exactly DRQN works. Let's suppose we need
to compute the Q value for the state st and the action at. Unlike DQN, we don't just
compute the Q value as Q(st, at) directly. As we can see, along with the current state
st we also use the hidden state ht to compute the Q value. The reason for using the
hidden state is that it holds information about the past game states in memory.
Since we are using the LSTM cells, the hidden state ht will consist of information
about the past game states in the memory as long as it is required:
Except for this change, DRQN works just like DQN. Wait. What about the replay
buffer? In DQN, we learned that we store the transition information in the replay
buffer and train our network by sampling a minibatch of experience. We also learned
that the transition information is placed sequentially in the replay buffer one after
another, so to avoid the correlated experience, we randomly sample a minibatch of
experience from the replay buffer and train the network.
But in the case of a DRQN, we need sequential information so that our network can retain information from past game states. At the same time, we don't want to overfit the network by training it with correlated experience. How can we achieve this?
After sampling the minibatch of episodes randomly, then we can train the DRQN
just like we trained the DQN network by minimizing the MSE loss. To learn more,
you can refer to the DRQN paper given in the Further reading section.
Summary
We started the chapter by learning what deep Q networks are and how they are used
to approximate the Q value. We learned that in a DQN, we use a buffer called the
replay buffer to store the agent's experience. Then, we randomly sample a minibatch
of experience from the replay buffer and train the network by minimizing the MSE.
Moving on, we looked at the algorithm of DQN in more detail, and then we learned
how to implement DQN to play Atari games.
Following this, we learned that the DQN overestimates the target value due to the max operator. So, we used double DQN, where we have two Q functions in our target value computation. One Q function parameterized by the main network parameter θ is used for action selection, and the other Q function parameterized by the target network parameter θ′ is used for Q value computation.
Going ahead, we learned about the DQN with prioritized experience replay, where
the transition is prioritized based on the TD error. We explored two different
types of prioritization methods called proportional prioritization and rank-based
prioritization.
Next, we learned about another interesting variant of DQN called dueling DQN. In
dueling DQN, instead of computing Q values directly, we compute them using two
streams called the value stream and the advantage stream.
At the end of the chapter, we learned about the DRQN and how it solves the problem of partially observable Markov decision processes.
In the next chapter, we will learn about another popular algorithm called policy
gradient.
Questions
Let's evaluate our understanding of DQN and its variants by answering the
following questions:
Further reading
For more information, we can refer to the following papers:
10
Policy Gradient Method
In the previous chapters, we learned how to use value-based reinforcement learning
algorithms to compute the optimal policy. That is, we learned that with value-based
methods, we compute the optimal Q function iteratively and from the optimal Q
function, we extract the optimal policy. In this chapter, we will learn about policy-
based methods, where we can compute the optimal policy without having to
compute the optimal Q function.
We will start the chapter by looking at the disadvantages of computing a policy from
the Q function, and then we will learn how policy-based methods learn the optimal
policy directly without computing the Q function. Next, we will examine one of the
most popular policy-based methods, called the policy gradient. We will first take
a broad overview of the policy gradient algorithm, and then we will learn more
about it in detail.
Going forward, we will also learn how to derive the policy gradient step by step
and examine the algorithm of the policy gradient method in more detail. At the end
of the chapter, we will learn about the variance reduction techniques in the policy
gradient method.
With value-based methods, we extract the optimal policy from the optimal Q
function (Q values), meaning we compute the Q values of all state-action pairs to
find the policy. We extract the policy by selecting an action in each state that has the
maximum Q value. For instance, let's say we have two states s0 and s1 and our action
space has two actions; let the actions be 0 and 1. First, we compute the Q value of all
the state-action pairs, as shown in the following table. Now, we extract policy from
the Q function (Q values) by selecting action 0 in state s0 and action 1 in state s1 as
they have the maximum Q value:
Later, we learned that it is difficult to compute the Q function when our environment
has a large number of states and actions as it would be expensive to compute the
Q values of all possible state-action pairs. So, we resorted to the Deep Q Network
(DQN). In DQN, we used a neural network to approximate the Q function (Q value).
Given a state, the network will return the Q values of all possible actions in that
state. For instance, consider the grid world environment. Given a state, our DQN will
return the Q values of all possible actions in that state. Then we select the action that
has the highest Q value. As we can see in Figure 10.1, given state E, DQN returns the
Q value of all possible actions (up, down, left, right). Then we select the right action in
state E since it has the maximum Q value:
One of the disadvantages of the value-based method is that it is suitable only for
discrete environments (environments with a discrete action space), and we cannot
apply value-based methods in continuous environments (environments with a
continuous action space).
We have learned that a discrete action space has a discrete set of actions; for example,
the grid world environment has discrete actions (up, down, left, and right) and the
continuous action space consists of actions that are continuous values, for example,
controlling the speed of a car.
So far, we have only dealt with a discrete environment where we had a discrete
action space, so we easily computed the Q value of all possible state-action pairs.
But how can we compute the Q value of all possible state-action pairs when our
action space is continuous? Say we are training an agent to drive a car and say we
have one continuous action in our action space. Let the action be the speed of the
car and the value of the speed of the car ranges from 0 to 150 kmph. In this case,
how can we compute the Q value of all possible state-action pairs with the action
being a continuous value?
In this case, we can discretize the continuous actions into speed (0 to 10) as action
1, speed (10 to 20) as action 2, and so on. After discretization, we can compute the
Q value of all possible state-action pairs. However, discretization is not always
desirable. We might lose several important features and we might end up in an
action space with a huge set of actions.
Most real-world problems, such as a self-driving car or a robot learning to walk, have a continuous action space. Apart from being continuous, the action space is often also high-dimensional. Thus, the DQN and other value-based methods cannot deal with a continuous action space effectively.
So, we use policy-based methods. With policy-based methods, we don't need to compute the Q function (Q values) to find the optimal policy; instead, we compute the policy directly. That is, we don't need the Q function to extract the policy. Policy-based methods have several advantages over value-based methods, and they can handle both discrete and continuous action spaces.
Okay, how do policy-based methods work, exactly? How do they find an optimal
policy without computing the Q function? We will learn about this in the next
section. Now that we have a basic understanding of what a policy gradient
method is, and also the disadvantages of value-based methods, in the next section
we will learn about a fundamental and interesting policy-based method called
policy gradient.
The policy gradient method uses a stochastic policy. We have learned that with a stochastic policy, we select an action based on the probability distribution over the action space. Say we have a stochastic policy π; then it gives the probability of taking an action a given the state s. It can be denoted by π(a|s). In the policy gradient method, we use a parameterized policy, so we can denote our policy as π_θ(a|s), where θ indicates that our policy is parameterized.
Say we have a neural network with a parameter θ. First, we feed the state of the
environment as an input to the network and it will output the probability of all
the actions that can be performed in the state. That is, it outputs a probability
distribution over an action space. We have learned that with policy gradient, we use
a stochastic policy. So, the stochastic policy selects an action based on the probability
distribution given by the neural network. In this way, we can directly compute the
policy without using the Q function.
Let's understand how the policy gradient method works with an example. Let's take
our favorite grid world environment for better understanding. We know that in the
grid world environment our action space has four possible actions: up, down, left,
and right.
Given any state as an input, the neural network will output the probability
distribution over the action space. That is, as shown in Figure 10.2, when we feed the
state E as an input to the network, it will return the probability distribution over all
actions in our action space. Now, our stochastic policy will select an action based on
the probability distribution given by the neural network. So, it will select action up
10% of the time, down 10% of the time, left 10% of the time, and right 70% of the time:
We should not get confused with the DQN and the policy gradient method. With
DQN, we feed the state as an input to the network, and it returns the Q values of all
possible actions in that state, then we select an action that has a maximum Q value.
But in the policy gradient method, we feed the state as input to the network, and it
returns the probability distribution over an action space, and our stochastic policy
uses the probability distribution returned by the neural network to select an action.
Okay, in the policy gradient method, the network returns the probability distribution
(action probabilities) over the action space, but how accurate are the probabilities?
How does the network learn?
Unlike supervised learning, here we will not have any labeled data to train our
network. So, our network does not know the correct action to perform in the given
state; that is, the network does not know which action gives the maximum reward.
So, the action probabilities given by our neural network will not be accurate in the
initial iterations, and thus we might get a bad reward.
But that is fine. We simply select the action based on the probability distribution
given by the network, store the reward, and move to the next state until the end of
the episode. That is, we play an episode and store the states, actions, and rewards.
Now, this becomes our training data. If we win the episode, that is, if we get a
positive return or high return (the sum of the rewards of the episode), then we
increase the probability of all the actions that we took in each state until the end
of the episode. If we get a negative return or low return, then we decrease the
probability of all the actions that we took in each state until the end of the episode.
Let's understand this with an example. Say we have states s1 to s8 and our goal is
to reach state s8. Say our action space consists of only two actions: left and right. So,
when we feed any state to the network, then it will return the probability distribution
over the two actions.
Consider the following trajectory (episode) τ1, where we select an action in each
state based on the probability distribution returned by the network using a stochastic
policy:
Let's suppose we generate another trajectory τ2, where we select an action in each
state based on the probability distribution returned by the network using a stochastic
policy, as shown in Figure 10.4:
Okay, but how exactly do we increase and decrease these probabilities? We learned
that if the return of the trajectory is positive, then we increase the probabilities of
all actions in the episode, else we decrease it. How can we do this exactly? This is
where backpropagation helps us. We know that we train the neural network by
backpropagation.
So, during backpropagation, the network calculates the gradients and updates the network parameter θ. The gradient updates are made in such a way that actions yielding a high return get high probabilities and actions yielding a low return get low probabilities.
In a nutshell, in the policy gradient method, we use a neural network to find the
optimal policy. We initialize the network parameter θ with random values. We feed
the state as an input to the network and it will return the action probabilities. In the
initial iteration, since the network is not trained with any data, it will give random
action probabilities. But we select actions based on the action probability distribution
given by the network and store the state, action, and reward until the end of the
episode. Now, this becomes our training data. If we win the episode, that is, if we
get a high return, then we assign high probabilities to all the actions of the episode,
else we assign low probabilities to all the actions of the episode.
Since we are using a neural network to find the optimal policy, we can call this
neural network a policy network. Now that we have a basic understanding of the
policy gradient method, in the next section, we will learn how exactly the neural
network finds the optimal policy; that is, we will learn how exactly the gradient
computation happens and how we train the network.
The goal of the policy gradient method is to find the optimal parameter θ of the neural network so that the network returns the correct probability distribution over the action space. Thus, the objective of our network is to assign high probabilities to actions that maximize the expected return of the trajectory. So, we can write our objective function J as:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R(\tau)]

Where:
• τ is the trajectory.
• τ ~ π_θ(τ) denotes that we are sampling the trajectory based on the policy π_θ given by the network parameterized by θ.
• R(τ) is the return of the trajectory τ.
Thus, maximizing our objective function maximizes the return of the trajectory.
How can we maximize the preceding objective function? We generally deal with
minimization problems, where we minimize the loss function (objective function)
by calculating the gradients of our loss function and updating the parameter using
gradient descent. But here, our goal is to maximize the objective function, so we
calculate the gradients of our objective function and perform gradient ascent. That is:
\theta = \theta + \alpha \nabla_\theta J(\theta)

Where ∇_θ J(θ) denotes the gradient of our objective function. Thus, we can find the optimal parameter θ of our network using gradient ascent.
The gradient ∇_θ J(θ) is derived as:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\big]

We will learn how exactly we derive this gradient in the next section. In this section, let's focus only on getting a good fundamental understanding of the policy gradient.
\theta = \theta + \alpha \nabla_\theta J(\theta)

Substituting the value of the gradient, our parameter update equation becomes:

\theta = \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)

Where:
• log π_θ(a_t|s_t) represents the log probability of taking an action a given the state s at a time t.
• R(τ) represents the return of the trajectory.
We learned that we update the gradients in such a way that actions yielding a high
return will get a high probability, and actions yielding a low return will get a low
probability. Let's now see how exactly we are doing that.
Case 1:

Suppose we generate an episode (trajectory) using the policy π_θ, where θ is the parameter of the network. After generating the episode, we compute the return of the episode. If the return of the episode is negative, say -1, that is, R(τ) = -1, then we decrease the probability of all the actions that we took in each state until the end of the episode.

For each step in the episode, t = 0, . . ., T-1, we update the parameter θ as:

\theta = \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau), \quad \text{with } R(\tau) = -1
It implies that we are decreasing the probability of all the actions that we took in
each state until the end of the episode.
Case 2:

Suppose we generate an episode (trajectory) using the policy π_θ, where θ is the parameter of the network. After generating the episode, we compute the return of the episode. If the return of the episode is positive, say +1, that is, R(τ) = +1, then we increase the probability of all the actions that we took in each state until the end of the episode.

For each step in the episode, t = 0, . . ., T-1, we update the parameter θ as:

\theta = \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau), \quad \text{with } R(\tau) = +1

We learned that, for each step in the episode, t = 0, . . ., T-1, we update the parameter θ as:

\theta = \theta + \alpha \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)
Thus, if the episode (trajectory) gives a high return, we will increase the probabilities of all the actions of the episode, else we decrease the probabilities. We learned that ∇_θ J(θ) = E_{τ~π_θ(τ)}[∇_θ log π_θ(a_t|s_t) R(τ)]. What about that expectation? We have not included that in our update equation yet. When we looked at the Monte Carlo method, we learned that we can approximate the expectation using the average. Thus, using the Monte Carlo approximation method, we change the expectation term to the sum over N trajectories. So, our update equation becomes:
\theta = \theta + \alpha \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)
Thus, first, we collect N trajectories \{\tau^i\}_{i=1}^{N} following the policy π_θ and compute the gradient as:

\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]
Then, we update the parameter as:

\theta = \theta + \alpha \nabla_\theta J(\theta)

But we can't find the optimal parameter θ by updating the parameter for just one iteration. So, we repeat the previous step for many iterations to find the optimal parameter.
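The following is a minimal sketch of how one such update could look in TensorFlow 2; policy_network and optimizer are assumed to be a Keras model that outputs action probabilities and a Keras optimizer, respectively:

import tensorflow as tf

def reinforce_update(states, actions, returns):
    with tf.GradientTape() as tape:
        # probability of every action in every visited state
        probs = policy_network(states)
        # pick out the probability of the action that was actually taken
        action_probs = tf.reduce_sum(
            probs * tf.one_hot(actions, probs.shape[-1]), axis=1)
        log_probs = tf.math.log(action_probs)
        # gradient ascent on J(theta) is gradient descent on -J(theta)
        loss = -tf.reduce_mean(log_probs * returns)
    grads = tape.gradient(loss, policy_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_network.trainable_variables))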
Let's deep dive into the interesting math and see how to calculate the derivative of
our objective function J with respect to the model parameter 𝜃𝜃 in simple steps. Don't
get intimidated by the upcoming equations, it's actually a pretty simple derivation.
Before going ahead, let's revise some math prerequisites in order to understand our
derivation better.
Let X be a discrete random variable whose probability mass function (pmf) is given as p(x). Let f be a function of the discrete random variable X. Then the expectation of the function f(X) can be defined as:

\mathbb{E}_{x \sim p(x)}[f(x)] = \sum_x p(x) f(x) \qquad (2)

We also use the derivative of the log function:

\nabla_\theta \log x = \frac{\nabla_\theta x}{x} \qquad (3)
We learned that the objective of our network is to maximize the expected return of the trajectory. Thus, we can write our objective function J as:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[R(\tau)] \qquad (4)

Where:
• τ is the trajectory.
• τ ~ π_θ(τ) shows that we are sampling the trajectory based on the policy π_θ given by the network parameterized by θ.
• R(τ) is the return of the trajectory.
As we can see, our objective function, equation (4), is in the expectation form. From the definition of the expectation given in equation (2), we can expand the expectation and rewrite equation (4) as:

J(\theta) = \int \pi_\theta(\tau)\, R(\tau)\, d\tau
Now, we calculate the derivative of our objective function J with respect to θ. Multiplying and dividing by π_θ(τ), we can write:

\nabla_\theta J(\theta) = \int \frac{\pi_\theta(\tau)}{\pi_\theta(\tau)} \nabla_\theta \pi_\theta(\tau)\, R(\tau)\, d\tau
Rearranging the preceding equation, we can write:

\nabla_\theta J(\theta) = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)}\, R(\tau)\, d\tau

From equation (3), we know that the ratio \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} is just \nabla_\theta \log \pi_\theta(\tau). From the definition of expectation given in equation (2), we can rewrite the preceding equation in expectation form as:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(\tau)\, R(\tau)\big] \qquad (5)

Now, let's expand the term \nabla_\theta \log \pi_\theta(\tau). The probability of a trajectory is the product of the probabilities of its individual steps:

\pi_\theta(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)
Where p(s_0) is the initial state distribution. Taking the log on both sides, we can write:

\log \pi_\theta(\tau) = \log \left[p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\right]
We know that the log of a product is equal to the sum of the logs, that is, log(ab) = log a + log b. Applying this log rule to the preceding equation, we can write:

\log \pi_\theta(\tau) = \log p(s_0) + \log \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)
Again, we apply the same rule, log of product = sum of logs, and change the log Π to Σ log, as shown here:

\log \pi_\theta(\tau) = \log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)
Now, taking the derivative with respect to θ on both sides, we get:

\nabla_\theta \log \pi_\theta(\tau) = \nabla_\theta \left[\log p(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\right]
Note that we are calculating the derivative with respect to θ and, as we can see in the preceding equation, the first and last terms on the right-hand side (RHS) do not depend on θ, so they become zero when we take the derivative. Thus, our equation becomes:

\nabla_\theta \log \pi_\theta(\tau) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)
Now that we have found the value for ∇_θ log π_θ(τ), substituting this into equation (5), we can write:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]
That's it. But can we also get rid of that expectation? Yes! We can use a Monte Carlo
approximation method and change the expectation to the sum over N trajectories. So,
our final gradient becomes:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right] \qquad (6)$$
Equation (6) shows that instead of updating a parameter based on a single trajectory,
we collect N number of trajectories and update the parameter based on its average
value over N trajectories.
Thus, after computing the gradient, we can update our parameter as:
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
Thus, in this section, we have learned how to derive a policy gradient. In the next
section, we will get into more details and learn about the policy gradient algorithm
step by step.
As we can see from the policy gradient algorithm, the parameter $\theta$ gets updated in every iteration. Since we are using the parameterized policy $\pi_\theta$, our policy is improved in every iteration.
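To make the update loop concrete, here is a minimal sketch (not the book's implementation) of how the Monte Carlo estimate of the policy gradient in equation (6) could be computed, assuming the per-step gradients of the log probabilities are already available as NumPy arrays:

import numpy as np

def reinforce_gradient(grad_log_probs, returns):
    # grad_log_probs[i]: array of shape (T_i, num_params) holding
    # grad_theta log pi_theta(a_t | s_t) for every step t of trajectory i
    # returns[i]: scalar return R(tau^i) of trajectory i
    N = len(grad_log_probs)
    grad = np.zeros_like(grad_log_probs[0][0])
    for g, R in zip(grad_log_probs, returns):
        grad += g.sum(axis=0) * R
    return grad / N

# Gradient ascent step: theta = theta + alpha * reinforce_gradient(grad_log_probs, returns)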
We learned that with the policy gradient method (the REINFORCE method), we
use a policy network that returns the probability distribution over the action space
and then we select an action based on the probability distribution returned by our
network using a stochastic policy. But this applies only to a discrete action space, where we use a categorical policy as our stochastic policy.
What if our action space is continuous? That is, when the action space is continuous,
how can we select actions? Here, our policy network cannot return the probability
distribution over the action space as the action space is continuous. So, in this case,
our policy network will return the mean and variance of the action as output, and
then we generate a Gaussian distribution using this mean and variance and select
an action by sampling from this Gaussian distribution using the Gaussian policy.
We will learn more about this in the upcoming chapters. Thus, we can apply the
policy gradient method to both discrete and continuous action spaces. Next, we
will look at two methods to reduce the variance of policy gradient updates.
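As a quick illustration of the Gaussian policy just described, here is a minimal sketch of how an action could be sampled once the policy network has produced a mean and variance (the function and argument names are assumptions, not the book's):

import numpy as np

def sample_gaussian_action(mean, variance):
    # The policy network outputs the mean and variance of the action;
    # we build a Gaussian from them and sample an action from it.
    std = np.sqrt(variance)
    return np.random.normal(loc=mean, scale=std)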
Thus, now we will learn the following two important methods for reducing the variance:

• Policy gradient with reward-to-go
• Policy gradient with the baseline function

Let's start with reward-to-go. We learned that we compute the policy gradient as:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]$$
Now, we make a small change in the preceding equation. We know that the return
of the trajectory is the sum of the rewards of that trajectory, that is:
$$R(\tau) = \sum_{t=0}^{T-1} r_t$$

Instead of using the return of the trajectory $R(\tau)$, we use something called the reward-to-go $R_t$, which is the return of the trajectory starting from state $s_t$. That is, instead of multiplying the log probabilities by the return of the full trajectory $R(\tau)$ at every step of the episode, we multiply them by the reward-to-go $R_t$. But why do we have to do this? Let's understand this in more detail with an example.
We learned that we generate N number of trajectories and compute the gradient as:
$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)\right]$$

For better understanding, let's take only one trajectory by setting N = 1, so we can write:

$$\nabla_\theta J(\theta) = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R(\tau)$$
Let's suppose we want to know how good the action right is in the state s2. If we
understand that the action right is a good action in the state s2, then we can increase
the probability of moving right in the state s2, else we decrease it. Okay, how can we
tell whether the action right is good in the state s2? As we learned in the previous
section (when discussing the REINFORCE method), if the return of the trajectory
𝑅𝑅𝑅𝑅𝑅𝑅 is high, then we increase the probability of the action right in the state s2, else
we decrease it.
But we don't have to do that now. Instead, we can compute the return (the sum of
the rewards of the trajectory) only starting from the state s2 because there is no use
in including all the rewards that we obtain from the trajectory before taking the
action right in the state s2. As Figure 10.6 shows, including all the rewards that we
obtain before taking the action right in the state s2 will not help us understand how
good the action right is in the state s2:
Thus, instead of taking the complete return of the trajectory at every step of the episode, we use the reward-to-go $R_t$, which is the return of the trajectory starting from the state $s_t$. That is, for a single trajectory, the gradient can be expanded as:

$$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_0|s_0)\, R_0 + \nabla_\theta \log \pi_\theta(a_1|s_1)\, R_1 + \dots + \nabla_\theta \log \pi_\theta(a_{T-1}|s_{T-1})\, R_{T-1}$$

Where $R_0$ indicates the return of the trajectory starting from the state $s_0$, $R_1$ indicates the return of the trajectory starting from the state $s_1$, and so on. If $R_0$ is a high value, then we increase the probability of the action up in the state $s_0$, else we decrease it. If $R_1$ is a high value, then we increase the probability of the action down in the state $s_1$, else we decrease it.

The reward-to-go can be expressed as:

$$R_t = \sum_{t'=t}^{T-1} r(s_{t'}, a_{t'})$$
The preceding equation states that the reward-to-go Rt is the sum of rewards of the
trajectory starting from the state st. Thus, now we can rewrite our gradient with
reward-to-go instead of the return of the trajectory as:
$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\sum_{t'=t}^{T-1} r(s_{t'}, a_{t'})\right]$$

Denoting the reward-to-go sum simply as $R_t$, we can write:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t\right]$$

After computing the gradient, we update the parameter using gradient ascent:

$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
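As a quick illustration, the reward-to-go of every step can be computed with a single backward pass over the episode's rewards (a minimal sketch; the implementation later in this chapter also folds in discounting and normalization):

def compute_reward_to_go(rewards):
    # rewards: [r_0, r_1, ..., r_{T-1}] collected in one episode
    rtg = [0.0] * len(rewards)
    running_sum = 0.0
    for t in reversed(range(len(rewards))):
        running_sum += rewards[t]
        rtg[t] = running_sum      # R_t = sum of rewards from step t onward
    return rtg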
Now that we have understood what policy gradient with reward-to-go is, in the next
section, we will look into the algorithm for more clarity.
In the reward-to-go variant of the algorithm, we use reward-to-go instead of the return of the trajectory. To get a clear understanding of how the reward-to-go policy gradient works, let's implement it in the next section.
For a clear understanding of how the policy gradient method works, we use
TensorFlow in the non-eager mode by disabling TensorFlow 2 behavior.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import gym
env = gym.make('CartPole-v0')
state_shape = env.observation_space.shape[0]
num_actions = env.action_space.n
gamma = 0.95
def discount_and_normalize_rewards(episode_rewards):
    # compute the discounted reward-to-go for every step of the episode
    discounted_rewards = np.zeros_like(episode_rewards)
    reward_to_go = 0.0
    for i in reversed(range(len(episode_rewards))):
        reward_to_go = reward_to_go * gamma + episode_rewards[i]
        discounted_rewards[i] = reward_to_go
    # normalize the discounted rewards to reduce variance
    discounted_rewards -= np.mean(discounted_rewards)
    discounted_rewards /= np.std(discounted_rewards)
    return discounted_rewards
Define layer 1 and layer 2 of the policy network. Note that the number of units in layer 2 is set to the number of actions (see the sketch below):
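A minimal sketch of the placeholder and layer definitions; the placeholder names (state_ph, action_ph, discounted_rewards_ph) and the number of hidden units are assumptions rather than the book's exact values:

state_ph = tf.placeholder(tf.float32, [None, state_shape], name="state_ph")
action_ph = tf.placeholder(tf.float32, [None, num_actions], name="action_ph")
discounted_rewards_ph = tf.placeholder(tf.float32, [None, ], name="discounted_rewards")

# layer 1: a hidden layer with ReLU activations
layer1 = tf.layers.dense(state_ph, units=32, activation=tf.nn.relu)

# layer 2: the output layer; the number of units equals the number of actions
layer2 = tf.layers.dense(layer1, units=num_actions)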
Obtain the probability distribution over the action space as an output of the network
by applying the softmax function to the result of layer 2:
prob_dist = tf.nn.softmax(layer2)
Recall that the gradient we want to compute is:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t\right]$$
After computing the gradient, we update the parameter of the network using
gradient ascent:
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
However, it is a standard convention to perform minimization rather than
maximization. So, we can convert the preceding maximization objective into the
minimization objective by just adding a negative sign. We can implement this using
tf.nn.softmax_cross_entropy_with_logits_v2. Thus, we can define the negative log
policy as:
neg_log_policy = tf.nn.softmax_cross_entropy_with_logits_v2(logits=layer2, labels=action_ph)
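A sketch of the loss that this negative log policy feeds into, multiplying it by the discounted reward-to-go (this assumes the discounted_rewards_ph placeholder sketched earlier):

loss = tf.reduce_mean(neg_log_policy * discounted_rewards_ph)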
Define the train operation for minimizing the loss using the Adam optimizer:
train = tf.train.AdamOptimizer(0.01).minimize(loss)
num_iterations = 1000
Start the TensorFlow session and initialize all the variables:

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for i in range(num_iterations):
Initialize empty lists for storing the states, actions, and rewards obtained in the episode, and reset the environment:

    episode_states, episode_actions, episode_rewards = [], [], []
    done = False
    Return = 0
    state = env.reset()
    while not done:
        state = state.reshape([1, 4])

Feed the state to the policy network; the network returns the probability distribution over the action space as output, which becomes our stochastic policy $\pi$:

        # state_ph is the placeholder sketched earlier
        pi = sess.run(prob_dist, feed_dict={state_ph: state})
        a = np.random.choice(range(pi.shape[1]), p=pi.ravel())
        env.render()

Perform the selected action, get the reward, and move to the next state:

        next_state, reward, done, info = env.step(a)
        Return += reward

Store the state, the one-hot encoded action, and the reward:

        action = np.zeros(num_actions)
        action[a] = 1
        episode_states.append(state)
        episode_actions.append(action)
        episode_rewards.append(reward)
        state = next_state
Once the episode is over, compute the discounted and normalized rewards:

    discounted_rewards = discount_and_normalize_rewards(episode_rewards)
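A sketch of the training step that feeds the collected episode into the network, under the same placeholder-name assumptions as before:

    feed_dict = {
        state_ph: np.vstack(episode_states),
        action_ph: np.vstack(episode_actions),
        discounted_rewards_ph: discounted_rewards,
    }
    loss_value, _ = sess.run([loss, train], feed_dict=feed_dict)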
    if i % 10 == 0:
        print("Iteration:{}, Return: {}".format(i, Return))
Now that we have learned how to implement the policy gradient algorithm
with reward-to-go, in the next section, we will learn another interesting variance
reduction technique called policy gradient with baseline.
In the policy gradient with baseline method, as before, we update the parameter of the network using gradient ascent:

$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$

Where the value of the gradient is:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, R_t\right]$$

Now, to reduce the variance, we subtract a baseline b from the reward-to-go $R_t$ and rewrite the gradient as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, (R_t - b)\right]$$
Wait. What is the baseline function? And how does subtracting it from Rt reduce the
variance? The purpose of the baseline is to reduce the variance in the return. Thus, if
the baseline b is a value that can give us the expected return from the state the agent
is in, then subtracting b in every step will reduce the variance in the return.
There are several choices for the baseline functions. We can choose any function
as a baseline function but the baseline function should not depend on our network
parameter. A simple baseline could be the average return of the sampled trajectories:
$$b = \frac{1}{N}\sum_{i=1}^{N} R(\tau^i)$$

Thus, subtracting this average return from the current return $R_t$ helps us to reduce the variance. As we can see, this baseline does not depend on the network parameter $\theta$; in general, we can use any function as the baseline as long as it does not depend on our network parameter $\theta$.

One of the most popular choices of baseline is the value function. We learned that the value function, or the value of a state, is the expected return an agent would obtain starting from that state and following the policy $\pi$. Thus, subtracting the value of the state (the expected return) from the current return $R_t$ reduces the variance. So, we can rewrite our gradient as:

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, \left(R_t - V(s_t)\right)\right]$$
Other than the value function, we can also use different baseline functions such as
the Q function, the advantage function, and more. We will learn more about them in
the next chapter.
But now the question is how can we learn the baseline function? Say we are using
the value function as the baseline function. How can we learn the optimal value
function? Just like we are approximating the policy, we can also approximate the
value function using another neural network parameterized by 𝜙𝜙.
That is, we use another network for approximating the value function (the value of a state) and we can call this network a value network. Okay, how can we train this value network?
Since the value of the state is a continuous value, we can train the network by minimizing the mean squared error (MSE). The MSE can be defined as the mean squared difference between the actual return $R_t$ and the predicted return $V_\phi(s_t)$.
Thus, the objective function of the value network can be defined as:

$$J(\phi) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\left(R_t - V_\phi(s_t)\right)^2$$

We can minimize this error using gradient descent and update the value network parameter as:

$$\phi = \phi - \alpha \nabla_\phi J(\phi)$$
Thus, in the policy gradient with the baseline method, we reduce the variance in the gradient updates by using a baseline function. The baseline can be any function as long as it does not depend on the network parameter $\theta$. Here, we use the value function as the baseline; we approximate it with a separate neural network parameterized by $\phi$, and we find the optimal value function by minimizing the MSE.
In a nutshell, in the policy gradient with the baseline function, we use two neural
networks:
Policy network parameterized by 𝜽𝜽: This finds the optimal policy by performing
gradient ascent:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, \left(R_t - V_\phi(s_t)\right)\right]$$

$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
Value network parameterized by 𝝓𝝓: This is used to correct the variance in the
gradient update by acting as a baseline, and it finds the optimal value of a state by
performing gradient descent:
$$J(\phi) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\left(R_t - V_\phi(s_t)\right)^2$$
$$\phi = \phi - \alpha \nabla_\phi J(\phi)$$
Note that the policy gradient with the baseline function is often referred to as the
REINFORCE with baseline method.
Now that we have seen how the policy gradient method with baseline works
by using a policy and a value network, in the next section we will look into the
algorithm to get more clarity.
7. Compute the gradients $\nabla_\phi J(\phi)$ and update the value network parameter $\phi$ using gradient descent as $\phi = \phi - \alpha \nabla_\phi J(\phi)$
8. Repeat steps 2 to 7 for several iterations
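To make the update concrete, here is a minimal sketch (not the book's code) of the gradient estimate with the value function baseline, assuming the per-step grad-log-probabilities, rewards-to-go, and predicted state values are already available as NumPy arrays:

import numpy as np

def baseline_policy_gradient(grad_log_probs, rewards_to_go, values):
    # grad_log_probs[i]: array (T_i, num_params) of grad_theta log pi_theta(a_t|s_t)
    # rewards_to_go[i]:  array (T_i,) of R_t
    # values[i]:         array (T_i,) of V_phi(s_t) predicted by the value network
    N = len(grad_log_probs)
    grad = np.zeros_like(grad_log_probs[0][0])
    for g, R, V in zip(grad_log_probs, rewards_to_go, values):
        advantage = np.asarray(R) - np.asarray(V)    # R_t - V_phi(s_t)
        grad += (g * advantage[:, None]).sum(axis=0)
    return grad / N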
Summary
We started off the chapter by learning that with value-based methods, we extract
the optimal policy from the optimal Q function (Q values). Then we learned that it
is difficult to compute the Q function when our action space is continuous. We can
discretize the action space; however, discretization is not always desirable, and it
leads to the loss of several important features and an action space with a huge set of
actions.
We also learned that in the policy gradient method, we select actions based on the
action probability distribution given by the network, and if we win the episode, that
is, if we get a high return, then we assign high probabilities to all the actions in the
episode, else we assign low probabilities to all the actions in the episode. Later, we
learned how to derive the policy gradient step by step, and then we looked into the algorithm of the policy gradient method in more detail.
Moving forward, we learned about the variance reduction methods such as reward-
to-go and the policy gradient method with the baseline function. In the policy
gradient method with the baseline function, we use two networks called the policy
and value network. The role of the policy network is to find the optimal policy, and
the role of the value network is to correct the gradient updates in the policy network
by estimating the value function.
In the next chapter, we will learn about another interesting set of algorithms called
the actor-critic methods.
Questions
Let's evaluate our understanding of the policy gradient method by answering the
following questions:
Further reading
For more information about the policy gradient, we can refer to the following paper:
11
Actor-Critic Methods – A2C and A3C
So far, we have covered two types of methods for learning the optimal policy. One
is the value-based method, and the other is the policy-based method. In the value-
based method, we use the Q function to extract the optimal policy. In the policy-
based method, we compute the optimal policy without using the Q function.
In this chapter, we will learn about another interesting method called the actor-critic
method for finding the optimal policy. The actor-critic method makes use of both the
value-based and policy-based methods. We will begin the chapter by understanding
what the actor-critic method is and how it makes use of value-based and policy-based
methods. We will acquire a basic understanding of actor-critic methods, and then
we will learn about them in detail.
Moving on, we will also learn how actor-critic differs from the policy gradient with
baseline method, and we will learn the algorithm of the actor-critic method in detail.
Next, we will understand what Advantage Actor-Critic (A2C) is, and how it makes
use of the advantage function.
At the end of the chapter, we will learn about one of the most popularly used actor-
critic algorithms, called Asynchronous Advantage Actor-Critic (A3C). We will
understand what A3C is and the details of how it works along with its architecture.
Let's begin the chapter by getting a basic understanding of the actor-critic method.
In this section, without going into further detail, first, let's acquire a basic
understanding of how the actor-critic method works and then, in the next section, we
will get into more detail and understand the math behind the actor-critic method.
Okay, so what actually are the actor and critic networks? How do they work together
and improve the policy? The actor network is basically the policy network, and
it finds the optimal policy using a policy gradient method. The critic network is
basically the value network, and it estimates the state value.
Thus, using its state value, the critic network evaluates the action produced by the
actor network and sends its feedback to the actor. Based on the critic's feedback,
the actor network then updates its parameter.
Thus, in the actor-critic method, we use two networks—the actor network (policy
network), which computes the policy, and the critic network (value network), which
evaluates the policy produced by the actor network by computing the value function
(state values). Isn't this similar to something we just learned in the previous chapter?
Yes! If you recall, it is similar to the policy gradient method with the baseline
(REINFORCE with baseline) we learned in the previous chapter. Similar to
REINFORCE with baseline, here also, we have an actor (policy network) and a critic
(value network) network. However, actor-critic is NOT the same as REINFORCE
with baseline. In the REINFORCE with baseline method, we learned that we use
a value network as the baseline and it helps to reduce the variance in the gradient
updates. In the actor-critic method as well, we use the critic to reduce variance in
the gradient updates of the actor, but it also helps to improve the policy iteratively
in an online fashion. The distinction between these two will be made clear in the
next section.
Now that we have a basic understanding of the actor-critic method, in the next
section, we will learn how the actor-critic method works in detail.
The fundamental difference between the REINFORCE with baseline method and the
actor-critic method is that in the REINFORCE with baseline method, we update the
parameter of the network at the end of an episode. But in the actor-critic method, we
update the parameter of the network at every step of the episode. But why do we
have to do this? What is the use of updating the network parameter at every step of
the episode? Let's explore this in further detail.
We can think of the REINFORCE with baseline method being similar to the Monte
Carlo (MC) method, which we covered in Chapter 4, Monte Carlo Methods, and the
actor-critic method being similar to the TD learning method, which we covered in
Chapter 5, Understanding Temporal Difference Learning. So, first, let's recap these two
methods.
We know that in REINFORCE with baseline, we generate complete trajectories and then compute the gradient as:

$$\nabla_\theta J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t|s_t)\, \left(R_t - V_\phi(s_t)\right)\right]$$

Then we update the network parameter using gradient ascent:

$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
Instead of generating the complete trajectory and then computing the return, can
we make use of bootstrapping, as we learned in TD learning? Yes! In the actor-critic
method, we approximate the return by just taking the immediate reward and the
discounted value of the next state as:
$$R \approx r + \gamma V(s')$$

Where r is the immediate reward and $\gamma V(s')$ is the discounted value of the next state. So, we can rewrite the policy gradient by replacing the return R with the bootstrap estimate $r + \gamma V(s')$, as shown here:

$$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right)$$
Now, we don't have to wait till the end of the episode to compute the return.
Instead, we bootstrap, compute the gradient, and update the network parameter
at every step of the episode.
The difference between how we compute the gradient and update the parameter
of the policy network in REINFORCE with baseline and the actor-critic method
is shown in Figure 11.2. As we can observe in REINFORCE with baseline, first we
generate complete episodes (trajectories), and then we update the parameter of
the network. Whereas, in the actor-critic method, we update the parameter of the
network at every step of the episode:
Figure 11.2: The difference between the REINFORCE with baseline and actor-critic methods
Okay, what about the critic network (value network)? How can we update the
parameter of the critic network? Similar to the actor network, we update the
parameter of the critic network at every step of the episode. The loss of the critic
network is the TD error, which is the difference between the target value of the state
and the value of the state predicted by the network. The target value of the state can
be computed as the sum of reward and the discounted value of the next state value.
Thus, the loss of the critic network is expressed as:
$$J(\phi) = r + \gamma V_\phi(s_t') - V_\phi(s_t)$$

Where $r + \gamma V_\phi(s_t')$ is the target value of the state and $V_\phi(s_t)$ is the predicted value of the state.
After computing the loss of the critic network, we compute the gradients $\nabla_\phi J(\phi)$ and update the parameter $\phi$ of the critic network at every step of the episode using gradient descent:

$$\phi = \phi - \alpha \nabla_\phi J(\phi)$$
Now that we have learned how the actor (policy network) and critic (value network)
work in the actor-critic method; let's look at the algorithm of the actor-critic method
in the next section for more clarity.
1. Initialize the actor network parameter $\theta$ and the critic network parameter $\phi$
2. For N number of episodes, repeat step 3
3. For each step in the episode, that is, for t = 0, ..., T – 1:
   1. Select an action using the policy: $a_t \sim \pi_\theta(s_t)$
   2. Take the action $a_t$ in the state $s_t$, observe the reward r, and move to the next state $s_t'$
   3. Compute the policy gradient:
      $$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right)$$
   4. Update the actor network parameter $\theta$ using gradient ascent:
      $$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
   5. Compute the loss of the critic network:
      $$J(\phi) = r + \gamma V_\phi(s_t') - V_\phi(s_t)$$
   6. Compute the gradients $\nabla_\phi J(\phi)$ and update the critic network parameter $\phi$ using gradient descent:
      $$\phi = \phi - \alpha \nabla_\phi J(\phi)$$
As we can observe from the preceding algorithm, the actor network (policy network)
parameter is being updated at every step of the episode. So, in each step of the
episode, we select an action based on the updated policy while the critic network
(value network) parameter is also getting updated at every step, and thus the critic
also improves at evaluating the actor network at every step of the episode. While
with the REINFORCE with baseline method, we only update the parameter of the
network after generating the complete episodes.
One more important difference we should note between REINFORCE with baseline and the actor-critic method is that in REINFORCE with baseline we use the full return of the trajectory, whereas in the actor-critic method we use the bootstrapped return.
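To summarize the mechanics, here is a minimal sketch (not the book's code) of a single actor-critic update performed at one time step; V, grad_log_pi, and grad_V stand in for quantities produced by the actor and critic networks, and a shared learning rate alpha is assumed:

def actor_critic_step(s, a, r, s_next, done, theta, phi,
                      V, grad_log_pi, grad_V, alpha=0.001, gamma=0.99):
    # bootstrapped target: r + gamma * V(s'), or just r if the episode ended
    target = r if done else r + gamma * V(s_next, phi)
    td_error = target - V(s, phi)

    # actor: gradient ascent on log pi(a|s) * TD error
    theta = theta + alpha * grad_log_pi(s, a, theta) * td_error

    # critic: gradient descent on (target - V(s))^2, treating the target as fixed
    phi = phi + alpha * td_error * grad_V(s, phi)
    return theta, phi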
In A2C, we compute the policy gradient with the advantage function. So, first, let's
see how to compute the advantage function. We know that the advantage function
is the difference between the Q function and the value function, that is, Q(s, a) – V(s),
so we can use two function approximators (neural networks), one for estimating the
Q function and the other for estimating the value function. Then, we can subtract the
values of these two networks to get the advantage value. But this will definitely not
be an optimal method and, computationally, it will be expensive.
We know that the Q value can be approximated using the bootstrap estimate, that is, $Q(s_t, a_t) \approx r + \gamma V_\phi(s_t')$. Substituting this Q value in the advantage function, equation (1), we can write the following:

$$A(s_t, a_t) = r + \gamma V_\phi(s_t') - V_\phi(s_t)$$

Thus, now we have the advantage function. We learned that in A2C, we compute the policy gradient with the advantage function. So, we can write this:

$$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right)$$
As we can observe, our policy gradient is now computed using the advantage function:

$$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t)\, A(s_t, a_t)$$

Now, compare the preceding equation with how we computed the gradient in the previous section. We can observe that both are essentially the same. Thus, the A2C method is the same as what we learned in the previous section.
In A3C, we will have two types of networks, one is a global network (global agent),
and the other is the worker network (worker agent). We will have many worker
agents, each worker agent uses a different exploration policy, and they learn in their
own copy of the environment and collect experience. Then, the experience obtained
from these worker agents is aggregated and sent to the global agent. The global
agent aggregates the learning.
Now that we have a very basic idea of how A3C works, let's go into more detail.
The three As
Before diving in, let's first learn what the three A's in A3C signify.
Asynchronous: Asynchronous implies the way A3C works. That is, instead of
having a single agent that tries to learn the optimal policy, here, we have multiple
agents that interact with the environment. Since we have multiple agents interacting
with the environment at the same time, we provide copies of the environment
to every agent so that each agent can then interact with their own copy of the
environment. So, all these multiple agents are called worker agents and we have
a separate agent called the global agent. All the worker agents report to the global
agent asynchronously and the global agent aggregates the learning.
Advantage: We have already learned about the advantage function, which is the difference between the Q function and the value function. A3C uses this advantage estimate when updating the actor, just as we saw for A2C.
Actor-critic: Each of the worker networks (worker agents) and the global network (global agent) basically follow an actor-critic architecture. That is, each of the agents consists of an actor network for estimating the policy and a critic network for evaluating the policy produced by the actor network.
Now, let us move on to the architecture of A3C and understand how A3C works in
detail.
As we can observe from the preceding figure, we have multiple worker agents and
each worker agent interacts with their own copies of the environment and collects
experience. We can also observe that each worker agent follows an actor-critic
architecture. So, the worker agents compute the actor network loss (policy loss)
and critic network loss (value loss).
In the previous section, we learned that our actor network is updated by computing
the policy gradient:
$$\nabla_\theta J(\theta) = \nabla_\theta \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right)$$

Thus, the loss (objective) of the actor network can be written as:

$$J(\theta) = \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right)$$
As we can observe, actor loss is the product of log probability of the action and the
TD error. Now, we add a new term to our actor loss called the entropy (measure of
randomness) of the policy and redefine the actor loss as:
$$J(\theta) = \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right) + \beta H(\pi)$$

Where $H(\pi)$ denotes the entropy of the policy. Adding the entropy of the policy promotes sufficient exploration, and the parameter $\beta$ is used to control the significance of the entropy.
After computing the losses of the actor and critic networks, worker agents compute
the gradients of the loss and then they send those gradients to the global agent. That
is, the worker agents compute the gradients and their gradients are asynchronously
accumulated to the global agent. The global agent updates their parameters using
the asynchronously received gradients from the worker agents. Then, the global
agent sends the updated parameter periodically to the worker agents, so now the
worker agents will get updated.
In this way, each worker agent computes loss, calculates gradients, and sends those
gradients to the global agent asynchronously. Thus, the global agent parameter is
updated by gradients received from the worker agents. Then, the global agent sends
the updated parameter to the worker agents periodically.
Since we have many worker agents interacting with their own copies of the
environment and aggregating the information to the global network, there will be
low to no correlation between the experiences.
1. The worker agent interacts with their own copies of the environment.
2. Each worker follows a different policy and collects the experience.
3. Next, the worker agents compute the losses of the actor and critic networks.
4. After computing the loss, they calculate gradients of the loss, and send
those gradients to the global agent asynchronously.
5. The global agent updates their parameters with the gradients received from
the worker agents.
6. Now, the updated parameter from the global agent will be sent to the
worker agents periodically.
We repeat the preceding steps for several iterations to find the optimal policy. To
get a clear understanding of how A3C works, in the next section, we will learn
how to implement it.
The code used in this section is adapted from the open source implementation
of A3C (https://github.com/stefanbo92/A3C-Continuous) provided by Stefan
Boschenriedter.
import warnings
warnings.filterwarnings('ignore')
import gym
import multiprocessing
import threading
import numpy as np
import os
import shutil
import matplotlib.pyplot as plt
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
env = gym.make('MountainCarContinuous-v0')
state_shape = env.observation_space.shape[0]
action_shape = env.action_space.shape[0]
Note that we created the continuous mountain car environment, and thus our
action space consists of continuous values. So, we get the bounds of our action space:
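A sketch of how the bounds could be obtained (the variable name action_bound matches its later use in the listing):

action_bound = [env.action_space.low, env.action_space.high]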
num_workers = multiprocessing.cpu_count()
num_episodes = 2000
num_timesteps = 200
global_net_scope = 'Global_Net'
Define the time step at which we want to update the global network:
update_global = 10
gamma = 0.90
beta = 0.01
log_dir = 'logs'
class ActorCritic(object):
    def __init__(self, scope, sess, globalAC=None):
        # the constructor signature is an assumption, reconstructed from how
        # the class is instantiated later in the listing
        self.sess = sess
        self.actor_optimizer = tf.train.RMSPropOptimizer(0.0001, name='RMSPropA')
        self.critic_optimizer = tf.train.RMSPropOptimizer(0.001, name='RMSPropC')
if scope == global_net_scope:
with tf.variable_scope(scope):
Build the global network (global agent) and get the actor and critic parameters:
else:
with tf.variable_scope(scope):
Build the worker network (worker agent) and get the mean and variance of the
action, the value of the state, and the actor and critic network parameters:
Compute the TD error, which is the difference between the target value of the state
and its predicted value:
with tf.name_scope('critic_loss'):
    self.critic_loss = tf.reduce_mean(tf.square(td_error))
Create a normal distribution based on the mean and variance of the action:
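A sketch of what this might look like in TF1 style, assuming the Gaussian's scale is the square root of the variance produced by the actor network:

normal_dist = tf.distributions.Normal(loc=mean, scale=tf.sqrt(variance))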
Now, let's define the actor network loss. We learned that the loss of the actor network is defined as:

$$J(\theta) = \log \pi_\theta(a_t|s_t)\left(r + \gamma V_\phi(s_t') - V_\phi(s_t)\right) + \beta H(\pi)$$
log_prob = normal_dist.log_prob(self.action_dist)
entropy_pi = normal_dist.entropy()
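A sketch of how these terms could be combined into the loss defined above, where td_error is the TD error computed earlier and beta weights the entropy bonus:

self.loss = log_prob * td_error + beta * entropy_pi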
self.actor_loss = tf.reduce_mean(-self.loss)
with tf.name_scope('select_action'):
    self.action = tf.clip_by_value(tf.squeeze(normal_dist.sample(1), axis=0), action_bound[0], action_bound[1])
Compute the gradients of the actor and critic network losses of the worker agent
(local agent):
with tf.name_scope('local_grad'):
    self.actor_grads = tf.gradients(self.actor_loss, self.actor_params)
    self.critic_grads = tf.gradients(self.critic_loss, self.critic_params)
with tf.name_scope('sync'):
After computing the gradients of the losses of the actor and critic networks, the
worker agents send (push) those gradients to the global agent:
with tf.name_scope('push'):
    self.update_actor_params = self.actor_optimizer.apply_gradients(zip(self.actor_grads, globalAC.actor_params))
    self.update_critic_params = self.critic_optimizer.apply_gradients(zip(self.critic_grads, globalAC.critic_params))
The global agent updates their parameters with the gradients received from the
worker agents (local agents). Then, the worker agents pull the updated parameters
from the global agent:
with tf.name_scope('pull'):
    self.pull_actor_params = [l_p.assign(g_p) for l_p, g_p in zip(self.actor_params, globalAC.actor_params)]
    self.pull_critic_params = [l_p.assign(g_p) for l_p, g_p in zip(self.critic_params, globalAC.critic_params)]
Define the actor network, which returns the mean and variance of the action:
with tf.variable_scope('actor'):
    l_a = tf.layers.dense(self.state, 200, tf.nn.relu, kernel_initializer=w_init, name='la')
    mean = tf.layers.dense(l_a, action_shape, tf.nn.tanh, kernel_initializer=w_init, name='mean')
    variance = tf.layers.dense(l_a, action_shape, tf.nn.softplus, kernel_initializer=w_init, name='variance')
Define the critic network, which returns the value of the state:
with tf.variable_scope('critic'):
    l_c = tf.layers.dense(self.state, 100, tf.nn.relu, kernel_initializer=w_init, name='lc')
    value = tf.layers.dense(l_c, 1, kernel_initializer=w_init, name='value')
actor_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/actor')
critic_params = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=scope + '/critic')
Return the mean and variance of the action produced by the actor network, the
value of the state computed by the critic network, and the parameters of the actor
and critic networks:
def pull_from_global(self):
    self.sess.run([self.pull_actor_params, self.pull_critic_params])
state = state[np.newaxis, :]
class Worker(object):
    def __init__(self, name, globalAC, sess):
        # the constructor signature is an assumption, reconstructed from how
        # Worker is instantiated later in the listing
We learned that each worker agent works with their own copies of the environment.
So, let's create a mountain car environment:
self.env = gym.make('MountainCarContinuous-v0').unwrapped
self.name = name
self.sess = sess
# assumption: each worker holds its own actor-critic network tied to the global agent
self.AC = ActorCritic(name, sess, globalAC)
def work(self):
global global_rewards, global_episodes
total_step = 1
When the global episodes are less than the number of episodes and the coordinator is
active:
state = self.env.reset()
Return = 0
for t in range(num_timesteps):
if self.name == 'W_0':
self.env.render()
action = self.AC.select_action(state)
Set done to True if we have reached the final step of the episode else set to False:
Return += reward
batch_states.append(state)
batch_actions.append(action)
batch_rewards.append((reward+8)/8)
Now, let's update the global network. If done is True, then set the value of the next
state to 0 else compute the value of the next state:
batch_target_value = []
batch_target_value.reverse()
feed_dict = {
    self.AC.state: batch_states,
    self.AC.action_dist: batch_actions,
    self.AC.target_value: batch_target_value,
}
self.AC.update_global(feed_dict)
Update the worker network by pulling the parameters from the global network:
self.AC.pull_from_global()
Update the state to the next state and increment the total step:
state = next_state
total_step += 1
if done:
    if len(global_rewards) < 5:
        global_rewards.append(Return)
    else:
        global_rewards.append(Return)
        global_rewards[-1] = np.mean(global_rewards[-5:])
    global_episodes += 1
    break
global_rewards = []
global_episodes = 0
sess = tf.Session()
with tf.device("/cpu:0"):
global_agent = ActorCritic(global_net_scope,sess)
worker_agents = []
for i in range(num_workers):
i_name = 'W_%i' % i
worker_agents.append(Worker(i_name, global_agent,sess))
coord = tf.train.Coordinator()
sess.run(tf.global_variables_initializer())
if os.path.exists(log_dir):
shutil.rmtree(log_dir)
tf.summary.FileWriter(log_dir, sess.graph)
worker_threads = []
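A sketch of the loop that starts the worker threads, assuming the standard threading pattern:

for worker in worker_agents:
    t = threading.Thread(target=worker.work)
    t.start()
    worker_threads.append(t)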
coord.join(worker_threads)
For a better understanding of the A3C architecture, let's take a look at the
computational graph of A3C in the next section.
Let's take a look at the architecture of the worker agent. As we can observe, our
worker agents follow the actor-critic architecture:
Figure 11.6: A computation graph of A3C with the W_0 node expanded
Now, let's examine the sync node. As Figure 11.7 shows, we have two operations in
the sync node, called push and pull:
Figure 11.7: A computation graph of A3C showing the push and pull operations of the sync node
After computing the gradients of the losses of the actor and critic networks, the
worker agent pushes those gradients to the global agent:
Figure 11.8: A computation graph of A3C—worker agents push their gradients to the global agent
The global agent updates their parameters with the gradients received from the
worker agents. Then, the worker agents pull the updated parameters from the
global agent:
Figure 11.9: A computation graph of A3C – worker agents pull updated parameters from the global agent
Now that we have learned how A3C works, in the next section, let's revisit the A2C
method.
A2C revisited
We can design our A2C algorithm with many worker agents, just like the A3C
algorithm. However, unlike A3C, A2C is a synchronous algorithm, meaning that in
A2C, we can have multiple worker agents, each interacting with their own copies of
the environment, and all the worker agents perform synchronous updates, unlike
A3C, where the worker agents perform asynchronous updates.
That is, in A2C, each worker agent interacts with the environment, computes losses,
and calculates gradients. However, it won't send those gradients to the global
network independently. Instead, it waits for all other worker agents to finish their
work and then updates the weights to the global network in a synchronous fashion.
Performing synchronous weight updates reduces the inconsistency introduced
by A3C.
Summary
We started the chapter by understanding what the actor-critic method is. We
learned that in the actor-critic method, the actor computes the optimal policy,
and the critic evaluates the policy computed by the actor network by estimating
the value function. Next, we learned how the actor-critic method differs from the
policy gradient method with the baseline.
We learned that in the policy gradient method with the baseline, first, we generate
complete episodes (trajectories), and then we update the parameter of the network.
Whereas, in the actor-critic method, we update the parameter of the network at
every step of the episode. Moving forward, we learned what the advantage actor-
critic algorithm is and how it uses the advantage function in the gradient update.
At the end of the chapter, we learned about another interesting actor-critic algorithm,
called asynchronous advantage actor-critic method. We learned that A3C consists
of several worker agents and one global agent. All the worker agents send their
gradients to the global agent asynchronously and then the global agent updates
their parameters with gradients received from the worker agents. After updating
the parameters, the global agent sends the updated parameters to the worker agents
periodically.
Questions
Let's assess our understanding of the actor-critic method by answering the following
questions:
Further reading
To learn more, refer to the following paper:
12
Learning DDPG, TD3, and SAC
In the previous chapter, we learned about interesting actor-critic methods, such
as Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic
(A3C). In this chapter, we will learn several state-of-the-art actor-critic methods.
We will start off the chapter by understanding one of the popular actor-critic
methods called Deep Deterministic Policy Gradient (DDPG). DDPG is used only
in continuous environments, that is, environments with a continuous action space.
We will understand what DDPG is and how it works in detail. We will also learn the
DDPG algorithm step by step.
Going forward, we will learn about the Twin Delayed Deep Deterministic
Policy Gradient (TD3). TD3 is an improvement over the DDPG algorithm and
includes several interesting features that solve the problems faced in DDPG. We
will understand the key features of TD3 in detail and also look into the algorithm
of TD3 step by step.
Finally, we will learn about another interesting actor-critic algorithm, called Soft
Actor-Critic (SAC). We will learn what SAC is and how it works using the entropy
term in the objective function. We will look into the actor and critic components of
SAC in detail and then learn the algorithm of SAC step by step.
DDPG uses a policy network as the actor and a deep Q network as the critic. One important difference between DDPG and the actor-critic algorithms we learned about in the previous chapter is that DDPG tries to learn a deterministic policy instead of a stochastic policy.
First, we will get an intuitive understanding of how DDPG works and then we will
look into the algorithm in detail.
An overview of DDPG
DDPG is an actor-critic method that takes advantage of both the policy-based
method and the value-based method. It uses a deterministic policy 𝜇𝜇 instead of
a stochastic policy 𝜋𝜋.
We learned that a deterministic policy tells the agent to perform one particular
action in a given state, meaning a deterministic policy maps the state to one
particular action:
$$a = \mu(s)$$
Whereas a stochastic policy maps the state to the probability distribution over the
action space:
$$a \sim \pi(a|s)$$
In a deterministic policy, whenever the agent visits the state, it always performs the
same particular action. But with a stochastic policy, instead of performing the same
action every time the agent visits the state, the agent performs a different action
each time based on a probability distribution over the action space.
Now, we will look into an overview of the actor and critic networks in the DDPG
algorithm.
Actor
The actor in DDPG is basically the policy network. The goal of the actor is to learn
the mapping between the state and action. That is, the role of the actor is to learn the
optimal policy that gives the maximum return. So, the actor uses the policy gradient
method to learn the optimal policy.
Critic
The critic is basically the value network. The goal of the critic is to evaluate the action
produced by the actor network. How does the critic network evaluate the action
produced by the actor network? Let's suppose we have a Q function; can we evaluate
an action using the Q function? Yes! First, let's take a little detour and recap the use
of the Q function.
We know that the Q function gives the expected return that an agent would obtain
starting from state s and performing an action a following a particular policy. The
expected return produced by the Q function is often called the Q value. Thus, given a
state and action, we obtain a Q value:
• If the Q value is high, then we can say that the action performed in that state
is a good action. That is, if the Q value is high, meaning the expected return
is high when we perform an action a in state s, we can say that the action a is
a good action.
• If the Q value is low, then we can say that the action performed in that state
is not a good action. That is, if the Q value is low, meaning the expected
return is low when we perform an action a in state s, we can say that the
action a is not a good action.
Okay, now how can the critic network evaluate an action produced by the actor
network based on the Q function (Q value)? Let's suppose the actor network
performs a down action in state A. So, now, the critic computes the Q value of
moving down in state A. If the Q value is high, then the critic network gives feedback
to the actor network that the action down is a good action in state A. If the Q value is
low, then the critic network gives feedback to the actor network that the down action
is not a good action in state A, and so the actor network tries to perform a different
action in state A.
Thus, with the Q function, the critic network can evaluate the action performed by
the actor network. But wait, how can the critic network learn the Q function? Because
only if it knows the Q function can it evaluate the action performed by the actor. So,
how does the critic network learn the Q function? Here is where we use the deep
Q network (DQN). We learned that with the DQN, we can use the neural network
to approximate the Q function. So, now, we use the DQN as the critic network to
compute the Q function.
DDPG components
Now that we have a basic understanding of how the DDPG algorithm works, let's
go into further detail. We will understand how exactly the actor and critic networks
work by looking at them separately.
Critic network
We learned that the critic network is basically the DQN and it uses the DQN to
estimate the Q value. Now, let's learn how the critic network uses the DQN to
estimate the Q value in more detail, along with a recap of the DQN.
The critic evaluates the action produced by the actor. Thus, the input to the critic
will be the state and also the action produced by the actor in that state, and the critic
returns the Q value of the given state-action pair, as shown Figure 12.1:
To approximate the Q value in the critic, we can use the deep neural network, and
if we use the deep neural network to approximate the Q value, then the network is
called the DQN. Since we are using the neural network to approximate the Q value
in the critic, we can represent the Q function with 𝑄𝑄𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠), where 𝜃𝜃 is the parameter
of the network.
Thus, in the critic network, we approximate the Q value using the DQN and the
parameter of the critic network is represented by 𝜃𝜃, as shown in Figure 12.2:
As we can observe from Figure 12.2, given state s and the action a produced by the
actor, the critic network returns the Q value.
Now, let's look at how to obtain the action a produced by the actor. We learned
that the actor is basically the policy network and it uses a policy gradient to learn
the optimal policy. In DDPG, we learn a deterministic policy instead of a stochastic
policy, so we can denote the policy with 𝜇𝜇 instead of 𝜋𝜋 . The parameter of the actor
network is represented by 𝜙𝜙. So, we can represent our parameterized policy as 𝜇𝜇𝜙𝜙.
Given a state s as the input, the actor network returns the action a to be performed
in that state:
$$a = \mu_\phi(s)$$
Thus, the critic network takes state s and action 𝑎𝑎 𝑎 𝑎𝑎𝜙𝜙 (𝑠𝑠𝑠 produced by the actor
network in that state as input and returns the Q value, as shown in Figure 12.3:
Okay, how can we train the critic network (DQN)? We generally train the network
by minimizing the loss as the difference between the target value and predicted
value. So, we can train the critic network by minimizing the loss as the difference
between the target Q value and the Q value predicted by the network. But how can
we obtain the target Q value? The target Q value is the optimal Q value and we can
obtain the optimal Q value using the Bellman equation.
We learned that the optimal Q function (Q value) can be obtained by using the Bellman optimality equation, as follows:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[R(s, a, s') + \gamma \max_{a'} Q^*(s', a')\right]$$

We know that $R(s, a, s')$ represents the immediate reward r we obtain while performing an action a in state s and moving to the next state $s'$, so we can just denote $R(s, a, s')$ with r:

$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$

In the preceding equation, we can remove the expectation; we will approximate it by sampling K transitions from the replay buffer and taking the average value (we will learn more about this in a while). So, we can express the target Q value as the sum of the immediate reward and the discounted maximum Q value of the next state-action pair, as shown here:

$$\text{target value} = r + \gamma \max_{a'} Q_\theta(s', a')$$
Thus, we can represent the loss function of the critic network as the difference between the target value (the optimal Bellman Q value) and the predicted value (the Q value predicted by the critic network):

$$L(\theta) = r + \gamma \max_{a'} Q_\theta(s', a') - Q_\theta(s, a)$$

Here, the action a is the action produced by the actor network, that is, $a = \mu_\phi(s)$.

Instead of using the loss as simply the difference between the target value and the predicted value, we can use the mean squared error as our loss function. We know that in the DQN, we use the replay buffer and store the transitions as $(s, a, r, s')$. So, we randomly sample a minibatch of K transitions from the replay buffer and train the network by minimizing the mean squared loss between the target value (the optimal Bellman Q value) and the predicted value (the Q value predicted by the critic network). Thus, our loss function is given as:

$$L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\left(r_i + \gamma \max_{a'} Q_\theta(s_i', a') - Q_\theta(s_i, a_i)\right)^2$$
From the preceding equation, we can observe that both the target and predicted Q
functions are parameterized by the same parameter 𝜃𝜃. This will cause instability in
the mean squared error and the network will learn poorly.
So, we introduce another neural network to learn the target value, and it is usually
referred to as the target critic network. The parameter of the target critic network is
represented by 𝜃𝜃𝜃 . Our main critic network, which is used to predict Q values, learns
the correct parameter 𝜃𝜃 using gradient descent. The target critic network parameter
𝜃𝜃𝜃 is updated by just copying the parameter of the main critic network 𝜃𝜃.
Thus, the loss function of the critic network can be written as:

$$L(\theta) = \frac{1}{K}\sum_{i=1}^{K}\left(r_i + \gamma \max_{a'} Q_{\theta'}(s_i', a') - Q_\theta(s_i, a_i)\right)^2 \qquad (1)$$

Remember that the action $a_i$ in the preceding equation is the action produced by the actor network, that is, $a_i = \mu_\phi(s_i)$.
There is a small problem in the target value computation in our loss function due to
the presence of the max term, as shown here:
The max term means that we compute the Q value of all possible actions 𝑎𝑎𝑎 in state
𝑠𝑠 ′ and select the action 𝑎𝑎𝑎 as the one that has the maximum Q value. But when the
action space is continuous, we cannot compute the Q value of all possible actions 𝑎𝑎𝑎
in state 𝑠𝑠 ′. So, we need to get rid of the max term in our loss function. How can we
do that?
Just as we use the target network in the critic, we can use a target actor network, and
the parameter of the target actor network is denoted by 𝜙𝜙𝜙. Now, instead of selecting
the action 𝑎𝑎𝑎 as the one that has the maximum Q value, we can generate an action 𝑎𝑎𝑎
using the target actor network, that is, 𝑎𝑎′ = 𝜇𝜇𝜙𝜙′ (𝑠𝑠 ′ ).
Thus, as shown in Figure 12.4, to compute the Q value of the next state-action pair
in the target, we feed state 𝑠𝑠 ′ and the action 𝑎𝑎𝑎 produced by the target actor network
parameterized by 𝜙𝜙𝜙 to the target critic network, and it returns the Q value of the
next state-action pair:
Thus, in our loss function, equation (1), we can remove the max term, and instead of $a'$, we can write $\mu_{\phi'}(s')$, as shown here:

$$L(\theta) = \frac{1}{K}\sum_{i}\left(r_i + \gamma Q_{\theta'}\left(s_i', \mu_{\phi'}(s_i')\right) - Q_\theta(s_i, a_i)\right)^2$$

Denoting this loss as our objective function $J(\theta)$, we have:

$$J(\theta) = \frac{1}{K}\sum_{i}\left(r_i + \gamma Q_{\theta'}\left(s_i', \mu_{\phi'}(s_i')\right) - Q_\theta(s_i, a_i)\right)^2$$
To reduce the clutter, we can denote the target value with y and write:

$$J(\theta) = \frac{1}{K}\sum_{i}\left(y_i - Q_\theta(s_i, a_i)\right)^2$$

Where $y_i$ is the target value of the critic, that is, $y_i = r_i + \gamma Q_{\theta'}\left(s_i', \mu_{\phi'}(s_i')\right)$, and the action $a_i$ is the action produced by the main actor network, that is, $a_i = \mu_\phi(s_i)$.
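As a small illustration (not the book's code), the target values for a sampled minibatch could be computed as follows, where target_actor and target_critic stand for the target networks and terminal-state masking is omitted for brevity:

def ddpg_critic_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
    # y_i = r_i + gamma * Q_theta'(s'_i, mu_phi'(s'_i))
    next_actions = target_actor(next_states)
    return rewards + gamma * target_critic(next_states, next_actions)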
To minimize the loss, we compute the gradients of the objective function $\nabla_\theta J(\theta)$ and update the main critic network parameter $\theta$ by performing gradient descent:

$$\theta = \theta - \alpha \nabla_\theta J(\theta)$$

Okay, what about the target critic network parameter $\theta'$? How can we update it? We can update the parameter of the target critic network by just copying the main critic network parameter $\theta$, as shown here:

$$\theta' = \tau\theta + (1 - \tau)\theta'$$

This is usually called soft replacement, and the value of $\tau$ is often set to 0.001.
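Soft replacement itself is just a weighted copy of the parameters; a minimal sketch:

def soft_update(target_params, main_params, tau=0.001):
    # theta' = tau * theta + (1 - tau) * theta'
    return [tau * m + (1 - tau) * t for m, t in zip(main_params, target_params)]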
Thus, we learned how the critic network uses the DQN to compute the Q value to
evaluate the action produced by the actor network. In the next section, we will learn
how the actor network learns the optimal policy.
Actor network
We have already learned that the actor network is the policy network and it uses the
policy gradient to compute the optimal policy. We also learned that we represent
the parameter of the actor network with 𝜙𝜙, and so the parameterized policy is
represented with 𝜇𝜇𝜙𝜙.
The actor network takes state s as an input and returns the action a:
$$a = \mu_\phi(s)$$
One important point that we may want to note down here is that we are using
a deterministic policy. Since we are using a deterministic policy, we need to take
care of the exploration-exploitation dilemma, because we know that a deterministic
policy always selects the same action and doesn't explore new actions, unlike a
stochastic policy, which selects different actions based on the probability distribution
over the action space.
Okay, how can we explore new actions while using a deterministic policy? Note that
DDPG is designed for an environment where the action space is continuous. Thus,
we are using a deterministic policy in the continuous action space.
Unlike the discrete action space, in the continuous action space, we have continuous
values. So, to explore new actions, we can just add some noise 𝒩𝒩 to the action
produced by the actor network since the action is a continuous value. We generate
this noise using a process called the Ornstein-Uhlenbeck random process. So, our
modified action can be represented as:
$$a = \mu_\phi(s) + \mathcal{N}$$
For example, say the action 𝜇𝜇𝜙𝜙 (𝑠𝑠) produced by the actor network is 13. Suppose the
noise 𝒩𝒩 is 0.1, then our action becomes a = 13+0.1 = 13.1.
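The Ornstein-Uhlenbeck process can be simulated with a simple discretized update; a minimal sketch with commonly used parameter values (these particular values are assumptions, not the book's):

import numpy as np

class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.x = np.ones(size) * mu
    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1); the noise drifts back toward mu
        dx = self.theta * (self.mu - self.x) + self.sigma * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x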
We learned that the critic network is represented by 𝑄𝑄𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠𝑠 and it evaluates the
action produced by the actor using the Q value. If the Q value is high, then the critic
tells the actor that it has produced a good action but when the Q value is low, then
the critic tells the actor that it has produced a bad action.
But wait! We learned that it is difficult to compute the Q value when the action space
is continuous. That is, when the action space is continuous, it is difficult to compute
the Q value of all possible actions in the state and take the maximum Q value. That
is why we resorted to the policy gradient method. But now, we are computing the
Q value with a continuous action space. How will this work?
Note that, here in DDPG, we are not computing the Q value of all possible state-
action pairs. We simply compute the Q value of state s and action a produced by
the actor network.
The goal of the actor is to make the critic tell that the action it has produced is a
good action. That is, the actor wants to get good feedback from the critic network.
When does the critic give good feedback to the actor? The critic gives good feedback
when the action produced by the actor has a maximum Q value. That is, if the action
produced by the actor has a maximum Q value, then the critic tells the actor that it
has produced a good action. So, the actor tries to generate an action in such a way
that it can maximize the Q value produced by the critic.
Thus, the objective function of the actor is to generate an action that maximizes the
Q value produced by the critic network. So, we can write the objective function of the
actor as $J(\phi) = Q_\theta(s, a)$, where the action $a = \mu_\phi(s)$. We can maximize this objective by performing gradient ascent and updating the actor network parameter as:

$$\phi = \phi + \alpha \nabla_\phi J(\phi)$$
Wait. Instead of updating the actor network parameter 𝜙𝜙 just for a single state 𝑠𝑠 , we
sample 𝐾𝐾 number of states from the replay buffer 𝒟𝒟 and update the parameter. So,
now our objective function becomes:
$$J(\phi) = \frac{1}{K}\sum_{i} Q_\theta(s_i, a)$$
Where the action $a = \mu_\phi(s_i)$. Maximizing the preceding objective function implies that the actor tries to generate actions in such a way that it maximizes the Q value over all the sampled states. We can maximize the objective function by performing gradient ascent and update the actor network parameter as:

$$\phi = \phi + \alpha \nabla_\phi J(\phi)$$
To summarize, the objective of the actor is to generate action in such a way that it
maximizes the Q value produced by the critic. So, we perform gradient ascent and
update the actor network parameter.
Okay, what about the parameter of the target actor network? How can we update
it? We can update the parameter of the target actor network by just copying the
parameter of the main actor network parameter 𝜙𝜙 by soft replacement, as shown
here:
$$\phi' = \tau\phi + (1 - \tau)\phi'$$
Now that we have understood how actor and critic networks work, let's get a good
understanding of what we have learned so far and how DDPG works exactly by
putting all the concepts together.
Note that DDPG is an actor-critic method, and so its parameters will be updated at
every step of the episode, unlike the policy gradient method, where we generate
complete episodes and then update the parameter. Okay, let's get started and
understand how DDPG works.
First, we initialize the main critic network parameter 𝜃𝜃 and the main actor network
parameter 𝜙𝜙 with random values. We learned that the target network parameter is
just a copy of the main network parameter. So, we initialize the target critic network
parameter 𝜃𝜃𝜃 by just copying the main critic network parameter 𝜃𝜃. Similarly, we
initialize the target actor network parameter 𝜙𝜙𝜙 by just copying the main actor
network parameter 𝜙𝜙. We also initialize the replay buffer 𝒟𝒟.
Now, for each step in the episode, first, we select an action, a, using the actor
network:
$a = \mu_\phi(s)$
However, instead of using the action a directly, to ensure exploration, we add some
noise 𝒩𝒩 , and so the action becomes:
$a = \mu_\phi(s) + \mathcal{N}$
Then, we perform the action a, move to the next state 𝑠𝑠 ′, and get the reward r. We
store this transition information in a replay buffer 𝒟𝒟.
Next, we randomly sample a minibatch of K transitions (s, a, r, s') from the replay
buffer. These K transitions will be used for updating both our critic and actor network.
First, let us compute the loss of the critic network. We learned that the loss function
of the critic network is:
$J(\theta) = \frac{1}{K}\sum_i \big(y_i - Q_\theta(s_i, a_i)\big)^2$

Where $y_i$ is the target value of the critic, that is, $y_i = r_i + \gamma Q_{\theta'}(s_i', \mu_{\phi'}(s_i'))$, and the action $a_i$ is the action produced by the actor network, that is, $a_i = \mu_\phi(s_i)$.

After computing the loss of the critic network, we compute the gradients $\nabla_\theta J(\theta)$ and update the critic network parameter θ using gradient descent:

$\theta = \theta - \alpha \nabla_\theta J(\theta)$
Now, let us update the actor network. We learned that the objective function of the
actor network is:
$J(\phi) = \frac{1}{K}\sum_i Q_\theta(s_i, a)$

Note that in the above equation, we are only using the state $s_i$ from the sampled K transitions (s, a, r, s′). The action a is selected by the actor network, $a = \mu_\phi(s_i)$. Maximizing the preceding objective function helps the actor to generate actions in such a way that they maximize the Q value produced by the critic. We do this by computing the gradients of our objective function, $\nabla_\phi J(\phi)$, and updating the actor network parameter φ using gradient ascent:

$\phi = \phi + \alpha \nabla_\phi J(\phi)$
And then, in the final step, we update the parameter of the target critic network θ′ and the parameter of the target actor network φ′ by soft replacement:

$\theta' = \tau\theta + (1 - \tau)\theta'$

$\phi' = \tau\phi + (1 - \tau)\phi'$
We repeat these steps for several episodes. Thus, for each step in the episode, we
update the parameter of our networks. Since the parameter gets updated at every
step, our policy will also be improved at every step in the episode.
To have a better understanding of how DDPG works, let's look into the DDPG
algorithm in the next section.
Algorithm – DDPG
The DDPG algorithm is given as follows:
1. Initialize the main critic network parameter θ and the main actor network parameter φ
2. Initialize the target critic network parameter θ′ by just copying the main critic network parameter θ
3. Initialize the target actor network parameter φ′ by just copying the main actor network parameter φ
4. Initialize the replay buffer 𝒟
5. For N number of episodes, repeat steps 6 and 7
6. Initialize an Ornstein-Uhlenbeck random process 𝒩 for action space exploration
7. For each step in the episode, that is, for t = 0, …, T – 1:
    1. Select action a based on the policy $\mu_\phi(s)$ and exploration noise, that is, $a = \mu_\phi(s) + \mathcal{N}$.
    2. Perform the selected action a, move to the next state s′, get the reward r, and store this transition information in the replay buffer 𝒟.
    3. Randomly sample a minibatch of K transitions from the replay buffer 𝒟.
    4. Compute the target value of the critic, that is, $y_i = r_i + \gamma Q_{\theta'}(s_i', \mu_{\phi'}(s_i'))$.
    5. Compute the loss of the critic network, $J(\theta) = \frac{1}{K}\sum_i (y_i - Q_\theta(s_i, a_i))^2$.
    6. Compute the gradient of the loss, $\nabla_\theta J(\theta)$, and update the critic network parameter using gradient descent, $\theta = \theta - \alpha \nabla_\theta J(\theta)$.
    7. Compute the gradient of the actor objective, $\nabla_\phi J(\phi)$, and update the actor network parameter using gradient ascent, $\phi = \phi + \alpha \nabla_\phi J(\phi)$.
    8. Update the target critic and target actor network parameters as $\theta' = \tau\theta + (1-\tau)\theta'$ and $\phi' = \tau\phi + (1-\tau)\phi'$.
Now, let's implement DDPG and train it in the Pendulum environment. First, import the necessary libraries:

import warnings
warnings.filterwarnings('ignore')

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import gym
env = gym.make("Pendulum-v0").unwrapped
state_shape = env.observation_space.shape[0]
action_shape = env.action_space.shape[0]
Note that the pendulum is a continuous environment, and thus our action space consists of continuous values. So, we get the bounds of our action space:

action_bound = [env.action_space.low, env.action_space.high]

Now, define the discount factor gamma, the soft replacement coefficient tau, the replay buffer size, and the batch size:

gamma = 0.9
tau = 0.001
replay_buffer = 10000
batch_size = 32
Now, let's define the DDPG class, where we will implement the DDPG algorithm:

class DDPG(object):
    def __init__(self, state_shape, action_shape, high_action_value):
        self.replay_buffer = np.zeros((replay_buffer, state_shape * 2 + action_shape + 1), dtype=np.float32)
        self.num_transitions = 0
        self.sess = tf.Session()
Then, initialize the state shape, action shape, and high action value, and define placeholders for the state, next state, and reward:

        self.state_shape, self.action_shape = state_shape, action_shape
        self.high_action_value = high_action_value

        self.state = tf.placeholder(tf.float32, [None, state_shape], 'state')
        self.next_state = tf.placeholder(tf.float32, [None, state_shape], 'next_state')
        self.reward = tf.placeholder(tf.float32, [None, 1], 'reward')
with tf.variable_scope('Actor'):
Define the main actor network, which is parameterized by 𝜙𝜙. The actor network
takes the state as an input and returns the action to be performed in that state:
self.actor = self.build_actor_network(self.state,
scope='main', trainable=True)
Define the target actor network that is parameterized by 𝜙𝜙𝜙. The target actor network
takes the next state as an input and returns the action to be performed in that state:
target_actor = self.build_actor_network(self.next_state,
scope='target', trainable=False)
with tf.variable_scope('Critic'):
Define the main critic network, which is parameterized by θ. The critic network takes the state and also the action produced by the actor in that state as input, and returns the Q value:

        critic = self.build_critic_network(self.state, self.actor, scope='main', trainable=True)
Define the target critic network, which is parameterized by 𝜃𝜃𝜃 . The target critic
network takes the next state and also the action produced by the target actor network
in that next state as an input and returns the Q value:
target_critic = self.build_critic_network(self.next_state,
target_actor, scope='target', trainable=False)
Next, get the parameters of the main and target networks of both the actor and the critic:

        self.main_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/main')
        self.target_actor_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Actor/target')
        self.main_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/main')
        self.target_critic_params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='Critic/target')
Perform the soft replacement: update the parameter of the target actor network as $\phi' = \tau\phi + (1 - \tau)\phi'$ and the parameter of the target critic network as $\theta' = \tau\theta + (1 - \tau)\theta'$:

        self.soft_replacement = [
            [tf.assign(t_a, tau * m_a + (1 - tau) * t_a),
             tf.assign(t_c, tau * m_c + (1 - tau) * t_c)]
            for m_a, t_a, m_c, t_c in zip(self.main_actor_params, self.target_actor_params,
                                          self.main_critic_params, self.target_critic_params)]
Compute the target Q value. We learned that the target Q value can be computed
as the sum of reward and the discounted Q value of the next state-action pair,
$y_i = r_i + \gamma Q_{\theta'}(s_i', \mu_{\phi'}(s_i'))$:
y = self.reward + gamma * target_critic
Now, let's compute the loss of the critic network. The loss of the critic network is the
mean squared error between the target Q value and the predicted Q value:
$J(\theta) = \frac{1}{K}\sum_i \big(y_i - Q_\theta(s_i, a_i)\big)^2$
MSE = tf.losses.mean_squared_error(labels=y,
predictions=critic)
Train the critic network by minimizing the mean squared error using the Adam
optimizer:
self.train_critic = tf.train.AdamOptimizer(0.01).minimize(MSE,
name="adam-ink", var_list = self.main_critic_params)
We learned that the objective function of the actor is to generate an action that
maximizes the Q value produced by the critic network, as shown here:
$J(\phi) = \frac{1}{K}\sum_i Q_\theta(s_i, a)$

Where the action $a = \mu_\phi(s_i)$, and we can maximize this objective by computing gradients and performing gradient ascent. However, it is a standard convention to perform minimization rather than maximization. So, we can convert the preceding maximization objective into a minimization objective by just adding a negative sign. Hence, we can define the actor network objective as:

$J(\phi) = -\frac{1}{K}\sum_i Q_\theta(s_i, a)$
Now, we can minimize the actor network objective by computing gradients and by
performing gradient descent. Thus, we can write:
actor_loss = -tf.reduce_mean(critic)
Train the actor network by minimizing the loss using the Adam optimizer:
self.train_actor = tf.train.AdamOptimizer(0.001).
minimize(actor_loss, var_list=self.main_actor_params)
self.sess.run(tf.global_variables_initializer())
Now, we generate a normal distribution with the mean as the action and the
standard deviation as the noise and we randomly select an action from this normal
distribution:
We need to make sure that our action does not fall outside the action bound. So, we clip the action so that it lies within the action bound, and then we return the action:
return action
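The surrounding lines of the select_action method are not shown above, so here is a minimal sketch of how the complete method might look; the exploration noise scale of 0.1 and the use of the action_bound list are assumptions, not details taken from the book's code:

    def select_action(self, state):
        # get the deterministic action from the actor network
        action = self.sess.run(self.actor, {self.state: state[np.newaxis, :]})[0]
        # sample around the action to ensure exploration (0.1 is an assumed noise scale)
        action = np.random.normal(action, 0.1)
        # clip the action so that it stays within the action bound
        action = np.clip(action, action_bound[0], action_bound[1])
        return action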
Now, let's define the train method, where we update the actor and critic networks. First, perform the soft replacement of the target network parameters:

    def train(self):
        self.sess.run(self.soft_replacement)
Randomly select indices from the replay buffer with the given batch size, and then select the batch of transitions from the replay buffer with the selected indices:

        indices = np.random.choice(replay_buffer, size=batch_size)
        batch_transition = self.replay_buffer[indices, :]

The following fragment belongs to the store_transition method, where each transition is packed into a single row and written into the replay buffer:

        trans = np.hstack((state, actor, [reward], next_state))
        self.replay_buffer[index, :] = trans
        self.num_transitions += 1

If the number of stored transitions exceeds the replay buffer size, we train the network, as sketched below.
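The full method definitions around these fragments are not shown above. Here is a minimal sketch of how the complete train and store_transition methods might look; the minibatch slicing follows the replay buffer layout used when storing transitions, while the method name store_transition and the feed for self.actor are assumptions rather than the book's exact code:

    def train(self):
        self.sess.run(self.soft_replacement)
        indices = np.random.choice(replay_buffer, size=batch_size)
        batch_transition = self.replay_buffer[indices, :]
        # split the sampled transitions back into (state, action, reward, next state)
        batch_state = batch_transition[:, :state_shape]
        batch_action = batch_transition[:, state_shape:state_shape + action_shape]
        batch_reward = batch_transition[:, -state_shape - 1:-state_shape]
        batch_next_state = batch_transition[:, -state_shape:]
        # update the actor, then the critic; feeding self.actor overrides the actor's
        # output tensor with the stored actions (an assumed detail of the original code)
        self.sess.run(self.train_actor, {self.state: batch_state})
        self.sess.run(self.train_critic, {self.state: batch_state,
                                          self.actor: batch_action,
                                          self.reward: batch_reward,
                                          self.next_state: batch_next_state})

    def store_transition(self, state, actor, reward, next_state):
        trans = np.hstack((state, actor, [reward], next_state))
        index = self.num_transitions % replay_buffer    # overwrite old transitions cyclically
        self.replay_buffer[index, :] = trans
        self.num_transitions += 1
        if self.num_transitions > replay_buffer:
            self.train()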
Now, create an instance of the DDPG class and set the number of episodes and the number of time steps per episode:

ddpg = DDPG(state_shape, action_shape, action_bound[1])

num_episodes = 300
num_timesteps = 500
for i in range(num_episodes):
state = env.reset()
Return = 0
for t in range(num_timesteps):
env.render()
action = ddpg.select_action(state)
        # step the environment and store the transition (the method name is assumed)
        next_state, reward, done, info = env.step(action)
        ddpg.store_transition(state, action, reward, next_state)

        Return += reward
if done:
break
state = next_state
    if i % 10 == 0:
        print("Episode:{}, Return: {}".format(i, Return))
By rendering the environment, we can observe how the agent learns to swing up the
pendulum:
Now that we have learned how DDPG works and how to implement it, in the next
section, we will learn about another interesting algorithm called twin delayed DDPG.
Twin delayed DDPG

One of the problems with DDPG is that the critic overestimates the target Q value.
This overestimation causes several issues. We learned that the policy is improved
based on the Q value given by the critic, but when the Q value has an approximation
error, it causes stability issues to our policy and the policy may converge to local
optima.
Thus, to combat this, TD3 proposes three important features, which are as follows:
• Clipped double Q learning: Instead of using one critic network, we use two
main critic networks to compute the Q value and also use two target critic
networks to compute the target value.
We compute two target Q values using two target critic networks and use the
minimum value of these two while computing the loss. This helps to prevent
overestimation of the target Q value. We will learn more about this in detail
in the next section.
• Delayed policy updates: In DDPG, we learned that we update the parameter
of both the actor (policy network) and critic (DQN) network at every step
of the episode. Unlike DDPG, here we delay updating the parameter of the
actor network.
That is, the critic network parameter is updated at every step of the episode,
but the actor network (policy network) parameter is delayed and updated
only after every two steps of the episode.

• Target policy smoothing: While computing the target value, we add a small amount of clipped noise to the target action. This smoothing acts as a regularizer and prevents the actor from exploiting sharp peaks in the Q value estimate. We will learn more about this in the upcoming sections.

First, we will understand how TD3 works intuitively, and then we will look at the algorithm in detail.
Clipped double Q learning

We learned that DQN overestimates the target Q value, and that double DQN resolves this by decoupling action selection from action evaluation; computing the target value in this way prevents the overestimation of the Q value in the DQN.

We learned that in DDPG, the critic network is a DQN, and so it also suffers from overestimation of the Q value in the target. Can we employ double Q learning in DDPG and solve the overestimation bias? Yes! But the problem is that in the actor-critic method, the policy and the target network parameters are updated slowly, and this does not help us remove the overestimation bias.
So, we will use a slightly different version of double Q learning called clipped double
Q learning. In clipped double Q learning, we use two target critic networks to
compute the Q value.
We use the two target critic networks and compute the two Q values and select the
minimum value out of these two to compute the target value. This helps to prevent
overestimation bias. Let's understand this in more detail.
If we need two target critic networks, then we also need two main critic networks. We know that the target network parameter is just a time-delayed copy of the main network parameter. So, we define two main critic networks with parameters θ₁ and θ₂ to compute the two Q values, $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$, respectively. We also define two target critic networks with parameters θ₁′ and θ₂′ to compute the two Q values of the next state-action pair in the target, $Q_{\theta_1'}(s', a')$ and $Q_{\theta_2'}(s', a')$, respectively. Let's understand this clearly, step by step.
To avoid overestimating the target value, in TD3 we first compute the Q value of the next state-action pair in the target using the first target critic network with parameter θ₁′, that is, $Q_{\theta_1'}(s', a')$, and then we compute the Q value of the next state-action pair in the target using the second target critic network with parameter θ₂′, that is, $Q_{\theta_2'}(s', a')$. Then, we use the minimum of these two Q values to compute the target value, as expressed here:

$y = r + \gamma \min_{j=1,2} Q_{\theta_j'}(s', a')$

Computing the target value in this way prevents overestimation of the Q value of the next state-action pair.
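In code, this amounts to one extra operation when building the target; this is only a sketch, where target_critic_1 and target_critic_2 are assumed names for the outputs of the two target critic networks (the DDPG code shown earlier defines only one):

# clipped double Q target: take the minimum of the two target critic outputs
y = self.reward + gamma * tf.minimum(target_critic_1, target_critic_2)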
Okay, we have computed the target value. How do we compute the loss and update the critic network parameters? We learned that we use two main critic networks, so first we compute the loss of the first main critic network, parameterized by θ₁:

$J(\theta_1) = \frac{1}{K}\sum_i \big(y_i - Q_{\theta_1}(s_i, a_i)\big)^2$

After computing the loss, we compute the gradients and update the parameter θ₁ using gradient descent as $\theta_1 = \theta_1 - \alpha \nabla_{\theta_1} J(\theta_1)$.

Next, we compute the loss of the second main critic network, parameterized by θ₂:

$J(\theta_2) = \frac{1}{K}\sum_i \big(y_i - Q_{\theta_2}(s_i, a_i)\big)^2$

After computing the loss, we compute the gradients and update the parameter θ₂ using gradient descent as $\theta_2 = \theta_2 - \alpha \nabla_{\theta_2} J(\theta_2)$.

In short, we can write the critic updates for both networks as:

$J(\theta_j) = \frac{1}{K}\sum_i \big(y_i - Q_{\theta_j}(s_i, a_i)\big)^2 \quad \text{for } j = 1, 2$

$\theta_j = \theta_j - \alpha \nabla_{\theta_j} J(\theta_j) \quad \text{for } j = 1, 2$

After updating the two main critic network parameters, θ₁ and θ₂, we update the two target critic network parameters, θ₁′ and θ₂′, by soft replacement, as shown here:

$\theta_j' = \tau\theta_j + (1 - \tau)\theta_j' \quad \text{for } j = 1, 2$
Delayed policy updates

When the critic network parameter is not good, it estimates incorrect Q values.
If the Q value estimated by the critic network is not correct, then the actor network
cannot update its parameter correctly. That is, we learned that the actor network learns
based on feedback from the critic network. This feedback is just the Q value. When
the critic network gives incorrect feedback (incorrect Q value), then the actor network
cannot learn the correct action and cannot update its parameter correctly.
Thus, to avoid this, we hold off on updating the actor network parameter for a while and update only the critic network, so that the critic learns to estimate correct Q values. That is, we update the critic network parameter at every step of the episode, but we delay updating the actor network parameter and update it only at specific steps of the episode, because we don't want our actor to learn from the critic's incorrect feedback.
In a nutshell, the critic network parameter is updated at every step of the episode,
but the actor network parameter update is delayed. We generally delay the update
by two steps.
Okay, in DDPG, we learned that the objective of the actor network (policy network)
is to maximize the Q value:
$J(\phi) = \frac{1}{K}\sum_i Q_\theta(s_i, a)$
The preceding objective of the actor network is the same in TD3 as well. That is, similar to DDPG, here the objective of the actor is to generate actions in such a way that it maximizes the Q value produced by the critic. But wait! Unlike DDPG, here we have two Q values, $Q_{\theta_1}(s, a)$ and $Q_{\theta_2}(s, a)$, since we use two critic networks with parameters θ₁ and θ₂, respectively. So which Q value should our actor network maximize? We can take either of the two, so we take $Q_{\theta_1}(s, a)$.

Thus, in TD3, the objective of the actor network is to maximize the Q value $Q_{\theta_1}(s, a)$, as shown here:

$J(\phi) = \frac{1}{K}\sum_i Q_{\theta_1}(s_i, a)$
Remember that in the above equation, the action a is selected by the actor network, $a = \mu_\phi(s_i)$. In order to maximize the objective function, we compute the gradients of our objective function, $\nabla_\phi J(\phi)$, and update the parameter of the network using gradient ascent:

$\phi = \phi + \alpha \nabla_\phi J(\phi)$
Now, instead of performing this parameter update of the actor network at every time step of the episode, we delay the updates and update the parameter only on every other step. Let t be the time step of the episode and d the number of time steps by which we want to delay the update (usually d is set to 2); then the actor is updated only when t mod d = 0.
Target policy smoothing

As we can notice, we compute the target value with the action ã generated by the target actor network, $\mu_{\phi'}(s')$. Instead of using the action given by the target actor network directly, we add some noise ε to the action and modify it to ã, as shown here:

$\tilde{a} = \mu_{\phi'}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$

Here, clipping the noise to the range −c to +c keeps the target action close to the actual action. Thus, our target value computation now becomes:

$y = r + \gamma \min_{j=1,2} Q_{\theta_j'}(s', \tilde{a})$
But why are we doing this? Why do we need to add noise to the action and use it to
compute the target value? Similar actions should have similar target values, right?
However, the DDPG method produces target values with high variance even for
similar actions. This is because deterministic policies overfit to the sharp peaks in
the value estimate. So, we can smooth out these peaks for similar actions by adding
some noise. Thus, target policy smoothing basically acts as a regularizer and reduces
the variance in the target values.
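A minimal NumPy sketch of target policy smoothing is shown here; sigma, c, and the example target actions are assumed values rather than the book's code:

import numpy as np

sigma, c, gamma = 0.2, 0.5, 0.99
target_action = np.array([[0.8], [1.2]])   # hypothetical target actor outputs for two next states

# sample Gaussian noise and clip it to [-c, c] before adding it to the target action
epsilon = np.clip(np.random.normal(0, sigma, target_action.shape), -c, c)
smoothed_action = target_action + epsilon
# the two target critics would then be evaluated at (s', smoothed_action) and the
# minimum used in the target: y = r + gamma * min(Q1'(s', a~), Q2'(s', a~))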
Now that we have understood the key features of the TD3 algorithm, let's get clarity
on what we have learned so far and how the TD3 algorithm works by putting all the
concepts together.
• The two main critic network parameters are represented by θ₁ and θ₂
• The two target critic network parameters are represented by θ₁′ and θ₂′
• The main actor network parameter is represented by φ
• The target actor network parameter is represented by φ′
TD3 is an actor-critic method, and so the parameters of TD3 will get updated at every
step of the episode, unlike the policy gradient method where we generate complete
episodes and then update the parameter. Now, let's get started and understand how
TD3 works.
First, we initialize the two main critic network parameters, 𝜃𝜃1 and 𝜃𝜃2, and the main
actor network parameter 𝜙𝜙 with random values. We know that the target network
parameter is just a copy of the main network parameter. So, we initialize the two
target critic network parameters 𝜃𝜃1′ and 𝜃𝜃2′ by just copying 𝜃𝜃1 and 𝜃𝜃2, respectively.
Similarly, we initialize the target actor network parameter 𝜙𝜙 ′ by just copying the
main actor network parameter 𝜙𝜙. We also initialize the replay buffer 𝒟𝒟.
Now, for each step in the episode, first, we select an action a using the actor network:
$a = \mu_\phi(s)$
But instead of using the action a directly, to ensure exploration, we add some noise ε, where $\epsilon \sim \mathcal{N}(0, \sigma)$. Thus, our action now becomes:

$a = \mu_\phi(s) + \epsilon$
Then, we perform the action a, move to the next state 𝑠𝑠 ′, and get the reward r. We
store this transition information in a replay buffer 𝒟𝒟.
Next, we randomly sample a minibatch of K transitions (s, a, r, s') from the replay buffer.
These K transitions will be used for updating both our critic and actor network.
First, let us compute the loss of the critic networks. We learned that the loss function
of the critic networks is:
$J(\theta_j) = \frac{1}{K}\sum_i \big(y_i - Q_{\theta_j}(s_i, a_i)\big)^2 \quad \text{for } j = 1, 2$
• The action $a_i$ is the action produced by the actor network, that is, $a_i = \mu_\phi(s_i)$
• $y_i$ is the target value of the critic, that is, $y_i = r_i + \gamma \min_{j=1,2} Q_{\theta_j'}(s_i', \tilde{a})$, and the action ã is the action produced by the target actor network with clipped noise, that is, $\tilde{a} = \mu_{\phi'}(s_i') + \epsilon$, where $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$

After computing the loss of the critic networks, we compute the gradients $\nabla_{\theta_j} J(\theta_j)$ and update the critic network parameters using gradient descent:

$\theta_j = \theta_j - \alpha \nabla_{\theta_j} J(\theta_j) \quad \text{for } j = 1, 2$
Now, let us update the actor network. We learned that the objective function of the
actor network is:
$J(\phi) = \frac{1}{K}\sum_i Q_{\theta_1}(s_i, a)$
Note that in the above equation, we are only using the state $s_i$ from the sampled K transitions (s, a, r, s′). The action a is selected by the actor network, $a = \mu_\phi(s_i)$. In order to maximize the objective function, we compute the gradients of our objective function, $\nabla_\phi J(\phi)$, and update the parameters of the network using gradient ascent:

$\phi = \phi + \alpha \nabla_\phi J(\phi)$
Instead of doing this parameter update of the actor network at every time step of the
episode, we delay the updates. Let t be the time step of the episode and d denotes the
number of time steps we want to delay the update by (usually d is set to 2); then we
can write the following:
1. If t mod d = 0, then:
    1. Compute the gradient of the objective function $\nabla_\phi J(\phi)$
    2. Update the actor network parameter using gradient ascent, $\phi = \phi + \alpha \nabla_\phi J(\phi)$

Finally, also on these delayed steps, we update the parameters of the target critic networks θ₁′ and θ₂′ and the parameter of the target actor network φ′ by soft replacement:

1. If t mod d = 0, then:
    1. Compute the gradient of the objective function $\nabla_\phi J(\phi)$ and update the actor network parameter using gradient ascent, $\phi = \phi + \alpha \nabla_\phi J(\phi)$
    2. Update the target critic network parameters and the target actor network parameter as $\theta_j' = \tau\theta_j + (1 - \tau)\theta_j'$ for j = 1, 2, and $\phi' = \tau\phi + (1 - \tau)\phi'$, respectively
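The update schedule can be summarized with a short, self-contained sketch; the three functions are placeholders standing in for the critic update, the actor update, and the soft target updates described above:

def train_critics(): pass      # both critics are updated at every step
def train_actor(): pass        # the actor update is delayed
def update_targets(): pass     # the target networks are updated along with the actor

d = 2      # policy delay
T = 200    # steps in the episode
for t in range(T):
    train_critics()
    if t % d == 0:
        train_actor()
        update_targets()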
We repeat the preceding steps for several episodes and improve the policy. To get a
better understanding of how TD3 works, let's look into the TD3 algorithm in the next
section.
Algorithm – TD3
The TD3 algorithm is very similar to the DDPG algorithm, except that it includes the three key features we learned about in the previous sections. So, before looking at the TD3 algorithm directly, you may want to revisit those key features.
1. Initialize the two main critic network parameters θ₁ and θ₂ and the main actor network parameter φ
2. Initialize the two target critic network parameters θ₁′ and θ₂′ by copying the main critic network parameters θ₁ and θ₂, respectively
3. Initialize the target actor network parameter φ′ by copying the main actor network parameter φ
4. Initialize the replay buffer 𝒟
5. For N number of episodes, repeat step 6
6. For each step in the episode, that is, for t = 0, …, T – 1:
    1. Select the action a based on the policy $\mu_\phi(s)$ with exploration noise ε, that is, $a = \mu_\phi(s) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma)$
    2. Perform the selected action a, move to the next state s′, get the reward r, and store the transition information in the replay buffer 𝒟
    3. Randomly sample a minibatch of K transitions from the replay buffer 𝒟
    4. Select the action ã for computing the target value, $\tilde{a} = \mu_{\phi'}(s_i') + \epsilon$, where $\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, +c)$
    5. Compute the target value of the critic, that is, $y_i = r_i + \gamma \min_{j=1,2} Q_{\theta_j'}(s_i', \tilde{a})$
    6. Compute the loss of the critic networks, $J(\theta_j) = \frac{1}{K}\sum_i (y_i - Q_{\theta_j}(s_i, a_i))^2$ for j = 1, 2
    7. Compute the gradients of the loss $\nabla_{\theta_j} J(\theta_j)$ and minimize the loss using gradient descent, $\theta_j = \theta_j - \alpha \nabla_{\theta_j} J(\theta_j)$ for j = 1, 2
    8. If t mod d = 0, then:
        1. Compute the gradient of the actor objective $\nabla_\phi J(\phi)$ and update the actor network parameter using gradient ascent, $\phi = \phi + \alpha \nabla_\phi J(\phi)$
        2. Update the target critic and target actor network parameters as $\theta_j' = \tau\theta_j + (1 - \tau)\theta_j'$ for j = 1, 2, and $\phi' = \tau\phi + (1 - \tau)\phi'$
Now that we have learned how TD3 works, in the next section, we will learn about
another interesting algorithm, called SAC.
Soft actor-critic
Now, we will look into another interesting actor-critic algorithm, called SAC. This
is an off-policy algorithm and it borrows several features from the TD3 algorithm.
But unlike TD3, it uses a stochastic policy 𝜋𝜋. SAC is based on the concept of entropy.
So first, let's understand what is meant by entropy. Entropy is a measure of the
randomness of a variable. It basically tells us the uncertainty or unpredictability of
the random variable and is denoted by ℋ.
If the random variable always gives the same value every time, then we can say that
its entropy is low because there is no randomness. But if the random variable gives
different values, then we can say that its entropy is high.
For example, consider throwing a die. If we get a different number every time the die is thrown, we can say that the entropy is high, because there is high uncertainty; we don't know which number will come up on the next throw. But if we get the same number, say 3, on every throw, then we can say that the entropy is low, since there is no randomness.
We know that the policy 𝜋𝜋 tells what action to perform in a given state. What
happens when the entropy of the policy ℋ(𝜋𝜋(⋅ |𝑠𝑠)) is high or low? If the entropy of
the policy is high, then this means that our policy performs different actions instead
of performing the same action every time. But if the entropy of the policy is low, then
this means that our policy performs the same action every time. As you may have
guessed, increasing the entropy of a policy promotes exploration, while decreasing
the entropy of the policy means less exploration.
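As a quick illustration (not from the book's code), here is the entropy of two discrete action distributions; the near-deterministic one has much lower entropy:

import numpy as np

def entropy(probs):
    return -np.sum(probs * np.log(probs))

print(entropy(np.array([0.25, 0.25, 0.25, 0.25])))   # ~1.386: uniform, high entropy
print(entropy(np.array([0.97, 0.01, 0.01, 0.01])))   # ~0.168: nearly deterministic, low entropy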
We know that, in reinforcement learning, our goal is to maximize the return. So, we can define our objective function as shown here:

$J(\pi) = \mathbb{E}_{\tau \sim \pi}[R(\tau)]$
We know that the return of the trajectory is just the sum of the rewards, that is:

$R(\tau) = \sum_{t=0}^{T-1} r_t$

So, we can rewrite our objective function by expanding the return as:

$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} r_t\right]$

Maximizing the preceding objective function maximizes the return. In the SAC method, we use a slightly modified version of the objective function with the entropy term, as shown here:

$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \big(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\big)\right]$
As we can see, our objective function now has two terms; one is the reward and the
other is the entropy of the policy. Thus, instead of maximizing only the reward, we
also maximize the entropy of a policy. But what is the point of this? Maximizing
the entropy of the policy allows us to explore new actions. But we don't want to
explore actions that give us a bad reward. Hence, maximizing entropy along with
maximizing reward means that we can explore new actions along with maintaining
maximum reward. The preceding objective function is often referred to as maximum
entropy reinforcement learning, or entropy regularized reinforcement learning.
Adding an entropy term is also often referred to as an entropy bonus.
Also, the term α in the objective function is called the temperature, and it sets the importance of the entropy term; in other words, it controls the amount of exploration. When α is high, we encourage more exploration in the policy, and when α is low, we reduce exploration.
Okay, now that we have a basic idea of SAC, let's get into some more details.
As in the other actor-critic methods we have covered, in SAC the actor uses the policy gradient to find the optimal policy and
the critic evaluates the policy produced by the actor. However, instead of using
only the Q function to evaluate the actor's policy, the critic uses both the Q function
and the value function. But why exactly do we need both the Q function and the
value function to evaluate the actor's policy? This will be explained in detail in the
upcoming sections.
So, in SAC, we have three networks, one actor network (policy network) to find
the optimal policy, and two critic networks—a value network and a Q network, to
compute the value function and the Q function, respectively, to evaluate the policy
produced by the actor.
Before moving on, let's look at the modified versions of the value function and the Q function with the entropy term.

We know that the value function (state value) is the expected return of the trajectory starting from state s, following a policy π. We learned that the return is the sum of the rewards of the trajectory, so we can expand the return and write the value function as:

$V(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} r_t \,\Big|\, s_0 = s\right]$

Now, we can rewrite the value function by adding the entropy term, as shown here:

$V(s) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \big(r_t + \alpha \mathcal{H}(\pi(\cdot|s_t))\big) \,\Big|\, s_0 = s\right]$
We know that the Q function (state-action value) is the expected return of the trajectory starting from state s and action a, following a policy π:

$Q(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} r_t \,\Big|\, s_0 = s, a_0 = a\right]$

Now, we can rewrite the Q function by adding the entropy term, as shown here:

$Q(s, a) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} r_t + \alpha \sum_{t=1}^{T-1} \mathcal{H}(\pi(\cdot|s_t)) \,\Big|\, s_0 = s, a_0 = a\right]$
The modified Bellman equation for the preceding Q function with the entropy term is given as:

$Q(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma V(s')\big] \qquad (2)$

Here, the value function can be computed using the relation between the Q function and the value function as:

$V(s) = \mathbb{E}_{a \sim \pi}\big[Q(s, a) - \alpha \log \pi(a|s)\big] \qquad (3)$
Components of SAC
Now that we have a basic idea of SAC, let's go into more detail and understand how
exactly each component of SAC works by looking at them separately.
Critic network
We learned that unlike other actor-critic methods we have seen earlier, the critic in
SAC uses both the value function and the Q function to evaluate the policy produced
by the actor network. But why is that?
In the previous algorithms, we used the critic network to compute the Q function for
evaluating the action produced by the actor. Also, the target Q value in the critic is
computed using the Bellman equation. We can do the same here. However, here we
have modified the Bellman equation of the Q function due to the entropy term, as
we learned in equation (2):

$Q(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma V(s')\big]$
From the preceding equation, we can observe that in order to compute the Q
function, first we need to compute the value function. So, we need to compute both
the Q function and the value function in order to evaluate the policy produced by the
actor. We can use a single network to approximate both the Q function and the value
function. However, instead of using a single network, we use two different networks,
the Q network to estimate the Q function, and the value network to estimate the
value function. Using two different networks to compute the Q function and value
function stabilizes the training.
First, we will learn how the value network works and then we will learn about the Q
network.
Value network
The value network is denoted by V, the parameter of the value network is denoted
by 𝜓𝜓, and the parameter of the target value network is denoted by 𝜓𝜓 ′.
Thus, 𝑉𝑉𝜓𝜓 (𝑠𝑠) implies that we approximate the value function (state value) using the
neural network parameterized by 𝜓𝜓. Okay, how can we train the value network? We
can train the network by minimizing the loss between the target state value and the
state value predicted by our network. How can we obtain the target state value? We
can use the value function given in equation (3) to compute the target state value.
We learned that according to equation (3), the value of the state is computed as:
[ 488 ]
Chapter 12
In the preceding equation, we can remove the expectation. We will approximate the
expectation by sampling K number of transitions from the replay buffer. So, we can
compute the target state value yv using the preceding equation as:
As we can observe in the preceding equation, for clipped double Q learning, we are
using the two main Q networks parameterized by 𝜃𝜃1 and 𝜃𝜃2, but in TD3, we used two
target Q networks parameterized by 𝜃𝜃1′ and 𝜃𝜃2′. Why is that?
Because here, we are computing the Q value of a state-action pair 𝑄𝑄(𝑠𝑠𝑠 𝑠𝑠) so we can
use the two main Q networks parameterized by 𝜃𝜃1 and 𝜃𝜃2, but in TD3, we compute
the Q value of the next state-action pair 𝑄𝑄(𝑠𝑠 ′ , 𝑎𝑎′ ), so we used the two target Q
networks parameterized by 𝜃𝜃1′ and 𝜃𝜃2′. Thus, here, we don't need target Q networks.
Now, we can define our objective function $J_V(\psi)$ of the value network as the mean squared difference between the target state value and the state value predicted by our network, as shown here:

$J_V(\psi) = \frac{1}{K}\sum_i \big(y_{v_i} - V_\psi(s_i)\big)^2$
Where K denotes the number of transitions we sample from the replay buffer.
We can calculate the gradients of our objective function and then update our main value network parameter ψ as:

$\psi = \psi - \lambda \nabla_\psi J_V(\psi)$

Note that we are using λ to denote the learning rate, since we are already using α to denote the temperature.

We can update the parameter of the target value network ψ′ using soft replacement:

$\psi' = \tau\psi + (1 - \tau)\psi'$
We will learn where exactly the target value network is used in the next section.
Q network
The Q network is denoted by Q and it is parameterized by θ. Thus, $Q_\theta(s, a)$ implies that we approximate the Q function using the neural network parameterized by θ.
How can we train the Q network? We can train the network by minimizing the loss
between the target Q value and the Q value predicted by the network. How can we
obtain the target Q value? Here is where we use the Bellman equation.
We learned that, according to the Bellman equation (2), the Q value can be computed as:

$Q(s, a) = \mathbb{E}_{s'}\big[r(s, a) + \gamma V(s')\big]$

We can remove the expectation in the preceding equation; we approximate it by sampling K transitions from the replay buffer. So, we can compute the target Q value $y_q$ using the preceding equation as:
$y_q = r + \gamma V(s')$

If we look at the preceding equation, we have the value of the next state, $V(s')$. In order to compute the value of the next state, we use the target value network parameterized by ψ′, so we can rewrite the preceding equation with the parameterized value function as shown here:

$y_q = r + \gamma V_{\psi'}(s')$

Now, we can define the objective function $J_Q(\theta)$ of the Q network as the mean squared difference between the target Q value and the Q value predicted by our network:

$J_Q(\theta) = \frac{1}{K}\sum_i \big(y_{q_i} - Q_\theta(s_i, a_i)\big)^2$
Where K denotes the number of transitions we sample from the replay buffer.
We use two Q networks to avoid overestimation bias, so we compute the loss of both of them. First, we compute the loss of the Q network parameterized by θ₁:

$J_Q(\theta_1) = \frac{1}{K}\sum_i \big(y_{q_i} - Q_{\theta_1}(s_i, a_i)\big)^2$

Then, we compute the gradients and update the parameter θ₁ using gradient descent as $\theta_1 = \theta_1 - \lambda \nabla_{\theta_1} J_Q(\theta_1)$.

Next, we compute the loss of the Q network parameterized by θ₂:

$J_Q(\theta_2) = \frac{1}{K}\sum_i \big(y_{q_i} - Q_{\theta_2}(s_i, a_i)\big)^2$

Then, we compute the gradients and update the parameter θ₂ using gradient descent as $\theta_2 = \theta_2 - \lambda \nabla_{\theta_2} J_Q(\theta_2)$.
In short, for j = 1, 2:

$J_Q(\theta_j) = \frac{1}{K}\sum_i \big(y_{q_i} - Q_{\theta_j}(s_i, a_i)\big)^2$

$\theta_j = \theta_j - \lambda \nabla_{\theta_j} J_Q(\theta_j)$
Actor network
The actor network (policy network) is parameterized by φ. Let's recall the objective function of the actor network that we learned in TD3:

$J(\phi) = \frac{1}{K}\sum_i Q_\theta(s_i, a)$

The preceding objective function means that the goal of the actor is to generate actions in such a way that they maximize the Q value computed by the critic.

The objective function of the actor network in SAC is the same as what we learned in TD3, except that here we use a stochastic policy $\pi_\phi(a|s)$ and we also maximize the entropy. So, we can write the objective function of the actor network in SAC as:

$J_\pi(\phi) = \frac{1}{K}\sum_i \big[Q_\theta(s_i, a) - \alpha \log \pi_\phi(a|s_i)\big]$
Now, how can we compute the derivative of the preceding objective function? Unlike TD3, here our action is sampled from a stochastic policy, and it is difficult to backpropagate through a sampling operation. So, we use the reparameterization trick, which guarantees that sampling from our policy is differentiable. Thus, we can rewrite our action as shown here:

$a = f_\phi(\epsilon; s)$

Here, ε is a noise vector sampled from a Gaussian distribution, independently of φ.
Now, our objective function with the reparameterized action becomes:

$J_\pi(\phi) = \frac{1}{K}\sum_i \big[Q_\theta(s_i, a) - \alpha \log \pi_\phi(a|s_i)\big]$
Note that in the preceding equation, our action is $a = f_\phi(\epsilon_i; s_i)$. Remember how we used two Q functions parameterized by θ₁ and θ₂ to avoid overestimation bias? Now, which Q function should we use in the preceding objective function? We can use either of them, so we use the Q function parameterized by θ₁ and write our final objective function as:

$J_\pi(\phi) = \frac{1}{K}\sum_i \big[Q_{\theta_1}(s_i, a) - \alpha \log \pi_\phi(a|s_i)\big]$
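Here is a minimal NumPy sketch of the reparameterization trick for a Gaussian policy; the actor outputs mu and log_std are hypothetical values, and the tanh squashing follows a common SAC implementation choice rather than anything shown in this chapter:

import numpy as np

mu, log_std = np.array([0.2]), np.array([-0.5])    # hypothetical actor outputs for one state
epsilon = np.random.randn(*mu.shape)               # noise sampled independently of phi
action = np.tanh(mu + np.exp(log_std) * epsilon)   # a = f_phi(epsilon; s), squashed to [-1, 1]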
Now that we have understood how the SAC algorithm works, let's recap what we
have learned so far and how the SAC algorithm works exactly by putting all the
concepts together.
SAC is an actor-critic method, and so the parameters of SAC will get updated at
every step of the episode. Now, let's get started and understand how SAC works.
First, we initialize the main network parameter of the value network 𝜓𝜓, two Q
network parameters 𝜃𝜃1 and 𝜃𝜃2, and the actor network parameter 𝜙𝜙. Next, we
initialize the target value network parameter 𝜓𝜓 ′ by just copying the main network
parameter 𝜓𝜓 and then we initialize the replay buffer 𝒟𝒟.
Now, for each step in the episode, first, we select an action a using the actor network:
$a \sim \pi_\phi(s)$
Then, we perform the action a, move to the next state 𝑠𝑠𝑠, and get the reward r. We
store this transition information in a replay buffer 𝒟𝒟.
Next, we randomly sample a minibatch of K transitions from the replay buffer. These
K transitions (s, a, r, s') are used for updating our value, Q, and actor network.
First, let us compute the loss of the value network. We learned that the loss function of the value network is:

$J_V(\psi) = \frac{1}{K}\sum_i \big(y_{v_i} - V_\psi(s_i)\big)^2$

Where $y_{v_i}$ is the target state value, $y_{v_i} = \min_{j=1,2} Q_{\theta_j}(s_i, a_i) - \alpha \log \pi_\phi(a_i|s_i)$. After computing the loss, we calculate the gradients and update the value network parameter using gradient descent, $\psi = \psi - \lambda \nabla_\psi J_V(\psi)$.
Now, we compute the loss of the Q networks. We learned that the loss function of the Q network is:

$J_Q(\theta_j) = \frac{1}{K}\sum_i \big(y_{q_i} - Q_{\theta_j}(s_i, a_i)\big)^2 \quad \text{for } j = 1, 2$

Where $y_{q_i}$ is the target Q value, and it is given as $y_{q_i} = r_i + \gamma V_{\psi'}(s_i')$.
After computing the loss, we calculate the gradients and update the parameters of the Q networks using gradient descent, $\theta_j = \theta_j - \lambda \nabla_{\theta_j} J_Q(\theta_j)$ for j = 1, 2.
Next, we update the actor network. We learned that the objective of the actor
network is:
$J_\pi(\phi) = \frac{1}{K}\sum_i \big[Q_{\theta_1}(s_i, a) - \alpha \log \pi_\phi(a|s_i)\big]$
Now, we calculate the gradients and update the parameter φ of the actor network using gradient ascent:

$\phi = \phi + \lambda \nabla_\phi J_\pi(\phi)$
Finally, in the end, we update the target value network parameter by soft
replacement, as shown here:
$\psi' = \tau\psi + (1 - \tau)\psi'$
We repeat the preceding steps for several episodes and improve the policy. To get
a better understanding of how SAC works, let's look into the SAC algorithm in the
next section.
Algorithm – SAC
The SAC algorithm is given as follows:
1. Initialize the main value network parameter 𝜓𝜓, the Q network parameters 𝜃𝜃1
and 𝜃𝜃2, and the actor network parameter 𝜙𝜙
2. Initialize the target value network 𝜓𝜓 ′ by just copying the main value network
parameter 𝜓𝜓
3. Initialize the replay buffer 𝒟𝒟
4. For N number of episodes, repeat step 5
5. For each step in the episode, that is, for t = 0, …, T – 1:
    1. Select an action a based on the policy $\pi_\phi(s)$, that is, $a \sim \pi_\phi(s)$
    2. Perform the selected action a, move to the next state s′, get the reward r, and store the transition information in the replay buffer 𝒟
    3. Randomly sample a minibatch of K transitions from the replay buffer
    4. Compute the target state value, $y_{v_i} = \min_{j=1,2} Q_{\theta_j}(s_i, a_i) - \alpha \log \pi_\phi(a_i|s_i)$
    5. Compute the loss of the value network, $J_V(\psi) = \frac{1}{K}\sum_i (y_{v_i} - V_\psi(s_i))^2$, and update the parameter using gradient descent, $\psi = \psi - \lambda \nabla_\psi J_V(\psi)$
    6. Compute the target Q value, $y_{q_i} = r_i + \gamma V_{\psi'}(s_i')$
    7. Compute the loss of the Q networks, $J_Q(\theta_j) = \frac{1}{K}\sum_i (y_{q_i} - Q_{\theta_j}(s_i, a_i))^2$ for j = 1, 2, and update the parameters using gradient descent, $\theta_j = \theta_j - \lambda \nabla_{\theta_j} J_Q(\theta_j)$
    8. Compute the objective of the actor network, $J_\pi(\phi) = \frac{1}{K}\sum_i [Q_{\theta_1}(s_i, a) - \alpha \log \pi_\phi(a|s_i)]$, and update the parameter using gradient ascent, $\phi = \phi + \lambda \nabla_\phi J_\pi(\phi)$
    9. Update the target value network parameter by soft replacement, $\psi' = \tau\psi + (1 - \tau)\psi'$
Summary
We started off the chapter by understanding the DDPG algorithm. We learned
that DDPG is an actor-critic algorithm where the actor estimates the policy using
policy gradient and the critic evaluates the policy produced by the actor using the
Q function. We learned how DDPG uses a deterministic policy and how it is used in
environments with a continuous action space.
Later, we looked into the actor and critic components of DDPG in detail and
understood how they work, before finally learning about the DDPG algorithm.
Moving on, we learned about the twin delayed DDPG, which is the successor to
DDPG and constitutes an improvement to the DDPG algorithm. We learned the key
features of TD3, including clipped double Q learning, delayed policy updates, and
target policy smoothing, in detail and finally, we looked into the TD3 algorithm.
At the end of the chapter, we learned about the SAC algorithm. We learned that,
unlike DDPG and TD3, the SAC method uses a stochastic policy. We also understood
how SAC works with the entropy bonus in the objective function, and we learned
what is meant by maximum entropy reinforcement learning.
In the next chapter, we will learn the state-of-the-art policy gradient algorithms such
as trust region policy optimization, proximal policy optimization, and actor-critic
using Kronecker-factored trust region.
Questions
Let's put our knowledge of actor-critic methods to the test. Try answering the
following questions:
Further reading
For more information, refer to the following papers:

• Continuous Control with Deep Reinforcement Learning by Timothy P. Lillicrap, et al.: https://arxiv.org/pdf/1509.02971.pdf
• Addressing Function Approximation Error in Actor-Critic Methods by Scott Fujimoto, Herke van Hoof, and David Meger: https://arxiv.org/pdf/1802.09477.pdf
• Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor by Tuomas Haarnoja, et al.: https://arxiv.org/pdf/1801.01290.pdf
13
TRPO, PPO, and ACKTR Methods
In this chapter, we will learn about two interesting state-of-the-art policy gradient algorithms: trust region policy optimization and proximal policy optimization. Both of these algorithms act as improvements to the policy gradient algorithm (REINFORCE with baseline) we learned about in Chapter 10, Policy Gradient Method.

We will start with Trust Region Policy Optimization (TRPO) and understand how it keeps policy updates within a trust region so that the policy improves reliably.
Moving on, we will learn about Proximal Policy Optimization (PPO). We will
understand how PPO works and how it acts as an improvement to the TRPO
algorithm in detail. We will also learn two types of PPO algorithm called PPO-
clipped and PPO-penalty.
At the end of the chapter, we will learn about an interesting actor-critic method
called the Actor-Critic using Kronecker-Factored Trust Region (ACKTR) method,
which uses Kronecker factorization to approximate the second-order derivative. We
will explore how ACKTR works and how it uses the trust region in its update rule.
Trust region policy optimization

We learned that in the policy gradient method, we update the network parameter using gradient ascent as:

$\theta = \theta + \alpha \nabla_\theta J(\theta)$
Where ∇𝜃𝜃 𝐽𝐽(𝜃𝜃) is the gradient and 𝛼𝛼 is known as the step size or learning rate. If the
step size is large then there will be a large policy update, and if it is small then there
will be a small update in the policy. How can we find an optimal step size? In the
policy gradient method, we keep the step size small and so on every iteration there
will be a small improvement in the policy.
But what happens if we take a large step on every iteration? Let's suppose we have
a policy 𝜋𝜋 parameterized by 𝜃𝜃. So, on every iteration, updating 𝜃𝜃 implies that we
are improving our policy. If the step size is large, then the policy on every iteration
varies greatly, meaning the old policy (the policy used in the previous iteration) and
the new policy (the policy used in the current iteration) vary greatly. Since we are
using a parametrized policy, it implies that if we make a large update (large step
size) then the parameter of the old policy and the new policy vary heavily, and this
leads to a problem called model collapse.
This is the reason that in the policy gradient method, instead of taking larger steps
and updating the parameter of our network, we take small steps and update the
parameter to keep the old policy and new policy close. But how can we improve this?
Can we take a larger step along with keeping the old and new policies close so that
it won't affect our model performance and also helps us to learn quickly? Yes, this
problem is solved by TRPO.
TRPO tries to make a large policy update while imposing a constraint that the old
policy and the new policy should not vary too much. Okay, what is this constraint?
But first, how can we measure and understand if the old policy and new policy are
changing greatly? Here is where we use a measure called the Kullback-Leibler
(KL) divergence. The KL divergence is ubiquitous in reinforcement learning. It tells
us how two probability distributions are different from each other. So, we can use
the KL divergence to understand if our old policy and new policy vary greatly or
not. TRPO adds a constraint that the KL divergence between the old policy and the
new policy should be less than or equal to some constant 𝛿𝛿. That is, when we make
a policy update, the old policy and the new policy should not vary more than some
constant. This constraint is called the trust region constraint.
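As a quick numerical illustration (not from the book), the KL divergence between the action distributions of an old and a new policy for a single state can be computed as follows; the probability values are made up:

import numpy as np

old_policy = np.array([0.5, 0.3, 0.2])
new_policy = np.array([0.4, 0.4, 0.2])
kl = np.sum(old_policy * np.log(old_policy / new_policy))
print(kl)   # a small value, meaning the two policies are close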
Thus, TRPO tries to make a large policy update while imposing the constraint that
the parameter of the old policy and the new policy should be within the trust region.
Note that in the policy gradient method, we use a parameterized policy. Thus,
keeping the parameter of the old policy and the new policy within the trust region
implies that the old and new policies are within the trust region.
TRPO guarantees monotonic policy improvement; that is, it guarantees that there
will always be a policy improvement on every iteration. This is the fundamental
idea behind the TRPO algorithm.
To understand how exactly TRPO works, we should understand the math behind
TRPO. TRPO has pretty heavy math. But worry not! It will be simple if we
understand the fundamental math concepts required to understand TRPO. So, before
diving into the TRPO algorithm, first, we will understand several essential math
concepts that are required to understand TRPO. Then we will learn how to design
a TRPO objective function with the trust region constraint, and finally, we will see
how to solve the TRPO objective function.
Math essentials
Before understanding how TRPO works, first, we will understand the following
important math concepts:
• The Taylor series
• The trust region method
• The conjugate gradient method
• Lagrange multipliers
• Importance sampling
The Taylor series

The Taylor series approximates a function as an infinite sum of terms computed from the function's derivatives at a single point a. The Taylor series of a function f(x) around the point a is given as:

$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n \qquad (1)$

So, for each term in the Taylor series, we calculate the nth order derivative of the function at a, divide it by n!, and multiply it by (x − a)^n.
Let's understand how exactly the Taylor series approximates a function with an example. Let's say we have the exponential function e^x, as shown in Figure 13.1:
Can we approximate the exponential function e^x using the Taylor series? We know that the Taylor series is given as:

$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}(x - a)^n$

Here, our function is:

$f(x) = e^x$

Say our function f(x) = e^x is centered at x = a. First, let's calculate the derivatives of the function up to the third order. The derivative of the exponential function is the function itself, so we can write:

$f'(a) = e^a$
$f''(a) = e^a$
$f'''(a) = e^a$
Substituting the preceding terms into equation (1), we can write:

$e^x = e^a + \frac{e^a}{1!}(x - a) + \frac{e^a}{2!}(x - a)^2 + \frac{e^a}{3!}(x - a)^3 + \cdots$

Let's suppose a = 0; then our equation becomes:

$e^x = e^0 + \frac{e^0}{1!}(x - 0) + \frac{e^0}{2!}(x - 0)^2 + \frac{e^0}{3!}(x - 0)^3 + \cdots$

We know that e^0 = 1; thus, the Taylor series of the exponential function is given as:

$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots \qquad (2)$
It implies that the sum of the terms on the right-hand side approximates the exponential function e^x. Let's understand this with the help of a plot. Let's take only the term up to the 0th order derivative from the Taylor series (equation 2), that is, $e^x \approx 1$, and plot it:

Figure 13.2: Taylor series approximation till the 0th order derivative

As we can observe from the preceding plot, taking only the 0th order term leaves us far away from the actual function e^x; that is, our approximation is not good. So, let's take the sum of terms up to the 1st order derivative from the Taylor series (equation 2), that is, $e^x \approx 1 + x$, and plot it:

Figure 13.3: Taylor series approximation till the 1st order derivative
As we can observe from the preceding plot, including the terms up to the 1st order derivative from the Taylor series gets us closer to the actual function e^x. So, let's take the sum of terms up to the 2nd order derivative, that is, $e^x \approx 1 + x + \frac{x^2}{2!}$, and plot it. As we can observe from the following plot, our approximation gets better and we move closer to the actual function e^x:

Figure 13.4: Taylor series approximation till the 2nd order derivative

Now, let's take the sum of terms up to the 3rd order derivative, that is, $e^x \approx 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!}$, and plot it:

Figure 13.5: Taylor series approximation till the 3rd order derivative
By looking at the preceding graph, we can understand that our approximation is far better after including the sum of terms up to the 3rd order derivative. As you might have guessed, adding more and more terms to the Taylor series makes our approximation of e^x better. Thus, using the Taylor series, we can approximate any function.
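We can verify this numerically with a short snippet (not from the book); the Taylor approximations of e^x around a = 0 approach the true value as we include more terms:

import math

x = 1.0
approximations = [sum(x**n / math.factorial(n) for n in range(order + 1))
                  for order in range(4)]
print(approximations)   # [1.0, 2.0, 2.5, 2.666...]
print(math.exp(x))      # 2.718...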
The Taylor polynomial up to the first degree is called the linear approximation. In the linear approximation, we calculate the Taylor series only up to the first-order derivative. Thus, the linear approximation (first-order) of the function f(x) around the point a can be given as:

$f(x) \approx f(a) + \frac{f'(a)}{1!}(x - a)$

We can denote the first-order derivative by $\nabla f(a)$, so we can just replace $f'(a)$ with $\nabla f(a)$ and rewrite the preceding equation as:

$f(x) \approx f(a) + \nabla f(a)(x - a)$

The Taylor polynomial up to the second degree is called the quadratic approximation. In the quadratic approximation, we calculate the Taylor series up to the second-order derivative. Thus, the quadratic approximation (second-order) of the function f(x) around the point a can be given as:

$f(x) \approx f(a) + \frac{f'(a)}{1!}(x - a) + \frac{f''(a)}{2!}(x - a)^2$

We can denote the first-order derivative by $\nabla f(a)$ and the second-order derivative by $\nabla^2 f(a)$; so, we can just replace $f'(a)$ with $\nabla f(a)$ and $f''(a)$ with $\nabla^2 f(a)$ and rewrite the preceding equation as:

$f(x) \approx f(a) + \nabla f(a)(x - a) + \frac{1}{2!}(x - a)^T \nabla^2 f(a)(x - a)$

A Hessian is a second-order derivative, so we can denote $\nabla^2 f(a)$ by H(a) and rewrite the preceding equation as:

$f(x) \approx f(a) + \nabla f(a)(x - a) + \frac{1}{2!}(x - a)^T H(a)(x - a)$

Thus, to summarize, a linear approximation of the function f(x) is given as:

$f(x) \approx f(a) + \nabla f(a)(x - a)$

And a quadratic approximation of the function f(x) is given as:

$f(x) \approx f(a) + \nabla f(a)(x - a) + \frac{1}{2!}(x - a)^T H(a)(x - a)$
The trust region method

Say we want to find the minimum of a function f(x), but minimizing f(x) directly is difficult. In that case, we can approximate f(x) using the Taylor series and minimize the approximation instead. Say we use the quadratic approximation; we learned that with the quadratic approximation, we calculate the Taylor series only up to the second-order derivative. Thus, the quadratic approximation (second-order) of the given function f(x) around the point a, which we denote by $\tilde{f}(x)$, can be given as:

$\tilde{f}(x) = f(a) + \nabla f(a)(x - a) + \frac{1}{2!}(x - a)^T H(a)(x - a)$

So, we can just use the approximated function $\tilde{f}(x)$ and compute the minimum value. But wait! What if our approximated function $\tilde{f}(x)$ is inaccurate at a particular point, say a*? If a* is the optimal point, then we miss out on finding the optimal value.

So, we introduce a new constraint called the trust region constraint. The trust region is the region where our actual function f(x) and our approximated function $\tilde{f}(x)$ are close together. So, our approximation will be accurate as long as the approximated function $\tilde{f}(x)$ stays within the trust region.

For instance, as shown in Figure 13.6, when our approximated function $\tilde{f}(x)$ is in the trust region, our approximation is accurate, since the approximated function $\tilde{f}(x)$ is close to the actual function f(x).
But when $\tilde{f}(x)$ is not in the trust region, our approximation will not be accurate, since the approximated function $\tilde{f}(x)$ is far from the actual function f(x).

Thus, we need to make sure that our approximated function stays in the trust region so that it remains close to the actual function.
The conjugate gradient method

The conjugate gradient method is used to solve a system of linear equations of the form:

$Ax = b$
Where A is the positive definite, square, and symmetric matrix, x is the vector we
want to find, and b is the known vector. Let's consider the following quadratic
function:
$f(x) = \frac{1}{2}x^T A x - b^T x + c$
When A is a symmetric positive definite matrix, finding the minimum of this function is equivalent to solving the system Ax = b. Just like gradient descent, conjugate gradient
descent also tries to find the minimum of the function; however, the search direction
of conjugate gradient descent will be different from gradient descent, and conjugate
gradient descent attains convergence in N iterations. Let's understand how conjugate
gradient descent differs from gradient descent with the help of a contour plot.
First, let's look at the contour plot of gradient descent. As we can see in the following
plot, in order to find the minimum value of a function, gradient descent takes several
search directions and we get a zigzag pattern of directions:
Unlike the gradient descent method, in conjugate gradient descent each search direction is conjugate (A-orthogonal) to the previous search directions, as shown in Figure 13.9:

So, using conjugate gradient descent, we can solve a system of the form Ax = b.
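Here is a minimal sketch of the conjugate gradient method for solving Ax = b; the matrix and vector are made-up examples, and a production implementation would use more careful stopping criteria:

import numpy as np

def conjugate_gradient(A, b, n_iters=10, tol=1e-10):
    x = np.zeros_like(b)
    r = b - A @ x              # residual
    d = r.copy()               # initial search direction
    for _ in range(n_iters):
        alpha = (r @ r) / (d @ A @ d)       # step size along d
        x = x + alpha * d
        r_new = r - alpha * (A @ d)
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)    # keeps the new direction A-conjugate
        d = r_new + beta * d
        r = r_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))   # close to np.linalg.solve(A, b)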
Lagrange multipliers
Let's say we have the function f(x) = x²: how do we find the minimum of the function? We can find the minimum of the function by finding a point where the gradient of the function is zero. The gradient of the function f(x) = x² is given as:

$\nabla f(x) = 2x$

When x = 0, the gradient of the function is zero; that is, $\nabla f(x) = 0$ when x = 0. So, we can say that the minimum of the function f(x) = x² is at x = 0. The problem we just saw is called an unconstrained optimization problem.
Now, suppose we want to minimize the function f(x) subject to a constraint g(x) = 0; this is called a constrained optimization problem. The Lagrange multiplier method tells us that, at the constrained minimum, the gradient of f is parallel to the gradient of g:

$\nabla f(x) = \lambda \nabla g(x)$

Where λ is known as the Lagrange multiplier. So, we can rewrite the preceding equation as:

$\nabla f(x) - \lambda \nabla g(x) = 0$

Solving the preceding equation implies that we find the minimum of the function f(x) while satisfying the constraint g(x). So, we can rewrite our objective function as the Lagrangian:

$L(x, \lambda) = f(x) - \lambda g(x)$
Let's understand this with one more example. Say we want to find the minimum of the function 3x² + 2y² subject to the constraint x² + y² ≤ 3.
We can rewrite our objective function with the constraint multiplied by the Lagrange multiplier as:

$L(x, y, \lambda) = 3x^2 + 2y^2 - \lambda(x^2 + y^2 - 3)$
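As a quick numerical check (not from the book), we can solve the same constrained problem with SciPy's general-purpose solver instead of working through the Lagrangian by hand:

import numpy as np
from scipy.optimize import minimize

objective = lambda v: 3 * v[0] ** 2 + 2 * v[1] ** 2
constraint = {'type': 'ineq', 'fun': lambda v: 3 - v[0] ** 2 - v[1] ** 2}   # x^2 + y^2 <= 3

result = minimize(objective, x0=np.array([1.0, 1.0]), constraints=[constraint])
print(result.x)   # close to (0, 0), which minimizes the objective and satisfies the constraint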
Importance sampling
Let's recap the importance sampling method we learned in Chapter 4, Monte Carlo Methods. Say we want to compute the expectation of a function f(x), where the value of x is sampled from the distribution p(x), that is, $x \sim p(x)$; we can write:

$\mathbb{E}_{x \sim p(x)}[f(x)] = \int_x f(x)\, p(x)\, dx$
Can we approximate the expectation of the function f(x)? We learned that using the Monte Carlo method, we can approximate the expectation as:

$\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{N}\sum_i f(x_i)$
That is, using the Monte Carlo method, we sample x from the distribution p(x) for N
times and compute the average of f(x) to approximate the expectation.
Instead of using the Monte Carlo method, we can also use importance sampling to approximate the expectation. In the importance sampling method, we estimate the expectation using a different distribution q(x); that is, instead of sampling x from p(x), we sample x from q(x) and correct for the mismatch:

$\mathbb{E}_{x \sim p(x)}[f(x)] = \int_x f(x)\, \frac{p(x)}{q(x)}\, q(x)\, dx$

Approximating this expectation by sampling x from q(x) N times, we get:

$\mathbb{E}_{x \sim p(x)}[f(x)] \approx \frac{1}{N}\sum_i f(x_i)\frac{p(x_i)}{q(x_i)}$

The ratio $\frac{p(x)}{q(x)}$ is called the importance sampling ratio or the importance correction.
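Here is a quick numerical illustration of importance sampling (not from the book); we estimate an expectation under p = N(0, 1) while sampling only from q = N(1, 1):

import numpy as np

f = lambda x: x ** 2
p = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)          # density of N(0, 1)
q = lambda x: np.exp(-(x - 1) ** 2 / 2) / np.sqrt(2 * np.pi)    # density of N(1, 1)

samples = np.random.normal(1.0, 1.0, size=100000)               # x ~ q(x)
estimate = np.mean(f(samples) * p(samples) / q(samples))
print(estimate)   # close to 1.0, the true value of E_{x~N(0,1)}[x^2]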
Now that we have understood the several important math prerequisites, we will
learn how the TRPO algorithm works in the next section.
Designing the TRPO objective function

This section will be pretty dense and optional. If you are not interested in the math, you can navigate directly to the section Solving the TRPO objective function, where we learn how to solve the TRPO objective function step by step.

Let us say we have a policy π; we can express the expected discounted return η following the policy π as follows:

$\eta(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right]$

The expected return of a new policy $\tilde{\pi}$ can then be expressed in terms of the old policy π and the expected discounted advantage $A_\pi$ of the old policy as:

$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \qquad (2)$

As we can notice from the preceding equation, the expected return following the new policy $\tilde{\pi}$, that is, $\eta(\tilde{\pi})$, is just the sum of the expected return following the old policy π, that is, $\eta(\pi)$, and the expected discounted advantage of the old policy $A_\pi$.
But, why are we using the advantage of the old policy? Because we are measuring
how good the new policy 𝜋𝜋𝜋 is with respect to the average performance of the old
policy 𝜋𝜋.
We can simplify equation (2) and replace the sum over time steps with a sum over states and actions, as shown here:

$\eta(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a|s)\, A_\pi(s, a) \qquad (3)$

Where $\rho_{\tilde{\pi}}(s)$ is the discounted visitation frequency of the new policy. We already learned that the expected return of the new policy $\eta(\tilde{\pi})$ is obtained by adding the expected return of the old policy $\eta(\pi)$ and the advantage of the old policy $A_\pi$.

In the preceding equation (3), if the advantage $A_\pi(s, a)$ is always positive, then it means that our policy is improving and we have a better $\eta(\tilde{\pi})$. That is, if the advantage $A_\pi(s, a)$ is always greater than or equal to 0, then we will always have an improvement in our policy. However, equation (3) is difficult to optimize, so we approximate $\eta(\tilde{\pi})$ by a local approximation $L_\pi(\tilde{\pi})$:

$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_{\pi}(s) \sum_a \tilde{\pi}(a|s)\, A_\pi(s, a) \qquad (4)$
As you may notice, unlike equation (3), in equation (4) we use $\rho_\pi(s)$ instead of $\rho_{\tilde{\pi}}(s)$. That is, we use the discounted visitation frequency of the old policy, $\rho_\pi(s)$, instead of that of the new policy, $\rho_{\tilde{\pi}}(s)$. But why do we do that? Because we already have trajectories sampled from the old policy, so it is easier to obtain $\rho_\pi(s)$ than $\rho_{\tilde{\pi}}(s)$.

Thus, $L_\pi(\tilde{\pi})$ is the local approximation of our objective $\eta(\tilde{\pi})$. We need to make sure that our local approximation is accurate. Remember how, in the The trust region method section, we learned that the local approximation of a function will be accurate if it is in the trust region? So, our local approximation $L_\pi(\tilde{\pi})$ will be accurate if it is in the trust region. Thus, while updating the values of $L_\pi(\tilde{\pi})$, we need to make sure that it remains in the trust region; that is, the policy updates should remain in the trust region.
So, when we update the old policy π to a new policy π̃, we just need to ensure that the new policy stays within the trust region. In order to do that, we have to measure how far our new policy is from the old policy, and we use the KL divergence to measure this.
Therefore, while updating the policy, we check the KL divergence between the policy
updates and make sure that our policy updates are within the trust region. To satisfy
this KL constraint, Kakade and Langford introduced a new policy updating scheme
called conservative policy iteration and derived the following lower bound:
$$\eta(\tilde{\pi}) \ge L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi}), \quad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$
As we can observe, in the preceding equation, we have the KL divergence as the
penalty term and C is the penalty coefficient.
Now, our surrogate objective function (4) along with the penalized KL term is
written as:
$$\underset{\tilde{\pi}}{\text{maximize}}\left[L_\pi(\tilde{\pi}) - C\, D_{KL}^{\max}(\pi, \tilde{\pi})\right] \qquad (5)$$
Maximizing the surrogate function L_π(π̃) − C D_KL^max(π, π̃) improves our true objective function η(π̃) and guarantees a monotonic improvement in the policy. The preceding objective function is known as the KL-penalized objective.
We parameterize the old policy as π_θold with parameters θ_old and the new policy as π_θ with parameters θ. So, we can rewrite our equation (5) in terms of the parameterized policies as shown here:

$$\underset{\theta}{\text{maximize}}\left[L(\pi_\theta) - C\, D_{KL}^{\max}(\pi_{\theta_{\text{old}}}, \pi_\theta)\right] \qquad (6)$$
As shown in the preceding equation, we are using the max KL divergence between the old and new policies, that is, D_KL^max(π_θold, π_θ). It is difficult to optimize our objective with the max KL term, so instead of the max KL divergence we can take the average KL divergence, D̄_KL(π_θold, π_θ), and rewrite our surrogate objective as:

$$\underset{\theta}{\text{maximize}}\left[L(\pi_\theta) - C\, \bar{D}_{KL}(\pi_{\theta_{\text{old}}}, \pi_\theta)\right] \qquad (7)$$
The issue with the preceding objective function is that when we substitute the value of the penalty coefficient C as C = 4εγ/(1 − γ)², it reduces the step size, and it takes us a lot of time to attain convergence. So, instead of adding the KL divergence as a penalty term, we can impose it as a hard constraint:

$$\underset{\theta}{\text{maximize}}\; L(\pi_\theta) \quad \text{subject to } \bar{D}_{KL}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \delta \qquad (8)$$
The preceding equation implies that we maximize our surrogate objective function L(π_θ) while maintaining the constraint that the KL divergence between the old policy π_θold and the new policy π_θ is less than or equal to a constant δ; this ensures that the old policy and the new policy do not vary very much. The preceding objective function is called the KL-constrained objective.
Sample-based estimation
In the previous section, we learned how to frame our objective function as a KL-
constrained objective with parameterized policies. In this section, we will learn how
to simplify our objective function.
$$\underset{\theta}{\text{maximize}} \sum_s \rho_{\pi_{\theta_{\text{old}}}}(s) \sum_a \pi_\theta(a|s)\, A_{\pi_{\theta_{\text{old}}}}(s, a) \quad \text{subject to } \bar{D}_{KL}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \delta \qquad (9)$$
Now we will see how we can simplify equation (9) by getting rid of the two
summations using sampling.
The first sum, Σ_s ρ_πθold(s), expresses the summation over the state visitation frequency; we can replace it by sampling states from the state visitation distribution, E_{s∼ρ(π_θold)}. Then, our equation becomes:

$$\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[\sum_a \pi_\theta(a|s)\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right]$$

Next, we replace the sum over actions, Σ_a π_θ(a|s), with an importance sampling estimator. Let q be the sampling distribution, and let a be sampled from q, that is, a ∼ q. Then, we can rewrite our preceding equation as:

$$\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\left[\frac{\pi_\theta(a|s)}{q(a|s)}\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right]$$

Using the old policy itself as the sampling distribution, that is, q = π_θold, we get:

$$\mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right]$$

Thus, our equation (9) becomes:

$$\underset{\theta}{\text{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{subject to } \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{KL}\left(\pi_{\theta_{\text{old}}}(\cdot|s)\,\|\,\pi_\theta(\cdot|s)\right)\right] \le \delta$$
In the next section, we will learn how to solve the preceding objective function to
find the optimal policy.
Solving the TRPO objective function
Our objective is:

$$\underset{\theta}{\text{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim \pi_{\theta_{\text{old}}}}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}\, A_{\pi_{\theta_{\text{old}}}}(s, a)\right] \quad \text{subject to } \bar{D}_{KL}(\pi_{\theta_{\text{old}}}, \pi_\theta) \le \delta$$
The preceding equation implies that we try to find the policy that gives the maximum return, along with the constraint that the KL divergence between the old and new policies should be less than or equal to δ. This KL constraint makes sure that our new policy is not too far away from the old policy.
For notational brevity, let us represent our objective with L(θ) and the KL constraint with D_KL(θ), and rewrite the preceding equation as:

$$\underset{\theta}{\text{maximize}}\; L(\theta) \quad \text{subject to } D_{KL}(\theta) \le \delta \qquad (10)$$
By maximizing our objective function L(θ), we can find the optimal policy. We can maximize the objective L(θ) by calculating gradients with respect to θ and updating the parameter using gradient ascent as:

$$\theta = \theta + \alpha \Delta\theta$$

Where Δθ is the search direction (gradient) and α is the backtracking coefficient. That is, to update the parameter θ, we perform the following two steps:
• First, we compute the search direction Δθ using the Taylor series approximation
• Next, we perform a line search in the computed search direction Δθ by finding the value of α using the backtracking line search method
We will learn what the backtracking coefficient is and how exactly the backtracking
line search method works in the Performing a line search in the search direction section.
Okay, but why do we have to perform these two steps? If you look at our objective function (10), we have a constrained optimization problem. Our constraint here is that while updating the parameter θ, we need to make sure that our parameter updates are within the trust region; that is, the KL divergence between the old and new parameters should be less than or equal to δ.
Thus, performing these two steps and updating our parameter helps us satisfy the
KL constraint and also guarantees monotonic improvement. Let's get into details and
learn how exactly the two steps work.
To better understand the upcoming steps, recap The Taylor series from the Math
essentials section.
The linear approximation of our objective function at a point θ_k is given as:

$$L(\theta) \approx L(\theta_k) + g^{T}(\theta - \theta_k) \qquad (11)$$

Where g denotes the gradient ∇_θ L(θ) evaluated at θ = θ_k. Similarly, the second-order approximation of our KL constraint at the point θ_k is given as:

$$D_{KL}(\theta_k \| \theta) = D_{KL}(\theta_k \| \theta_k) + \nabla_\theta D_{KL}(\theta_k \| \theta)(\theta - \theta_k) + \frac{1}{2}(\theta - \theta_k)^{T} H (\theta - \theta_k)$$

Where H is the second-order derivative, that is, H = ∇²_θ D_KL(θ_k‖θ). In the preceding equation, the first term D_KL(θ_k‖θ_k) is zero because the KL divergence between two identical distributions is zero, and the first-order term ∇_θ D_KL(θ_k‖θ)(θ − θ_k) is also zero when evaluated at θ = θ_k. Thus, the KL constraint reduces to:

$$D_{KL}(\theta_k \| \theta) = \frac{1}{2}(\theta - \theta_k)^{T} H (\theta - \theta_k) \qquad (12)$$
Substituting (11) and (12) into equation (10), we can write:

$$\underset{\theta}{\text{maximize}}\; g^{T}(\theta - \theta_k) \quad \text{subject to } \frac{1}{2}(\theta - \theta_k)^{T} H (\theta - \theta_k) \le \delta \qquad (13)$$
Thus, using the Lagrange multiplier λ, we can rewrite our objective function (13) as:

$$L(\theta, \lambda) = g^{T}(\theta - \theta_k) - \lambda\left(\frac{1}{2}(\theta - \theta_k)^{T} H (\theta - \theta_k) - \delta\right) \qquad (14)$$

For notational brevity, let s represent θ − θ_k, so we can rewrite equation (14) as:

$$L = g^{T} s - \lambda\left(\frac{1}{2} s^{T} H s - \delta\right) \qquad (15)$$
Our goal is to find the optimal parameter θ. So, we calculate the gradient of the preceding function with respect to s and set it to zero:

$$\frac{dL}{ds} = g - \lambda H s = 0$$

Thus, we can write:

$$g = \lambda H s$$

λ is just our Lagrange multiplier; it only rescales the search direction and does not change it, so we can drop it and write:

$$g = H s \qquad (17)$$

Thus, we can write:

$$s = H^{-1} g$$

However, computing the value of s directly in this way is not optimal. This is because, in the preceding equation, we have H⁻¹, which implies the inverse of the second-order derivative. Computing the second-order derivative and then its inverse is an expensive task. So, we need to find a better way to compute s; how do we do that?
$$g = H s$$

From the preceding equation, we can observe that it is in the form Ax = b. Thus, using the conjugate gradient method, we can approximate the value of s as:

$$s \approx H^{-1} g$$
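Here is a minimal NumPy sketch of the conjugate gradient method (an illustration, not the book's implementation). It approximates s = H⁻¹g using only Hessian-vector products, so H never has to be formed explicitly or inverted; the small H and g at the bottom are made-up values used just to check the result:

import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H s = g using only Hessian-vector products hvp(v) = H v."""
    s = np.zeros_like(g)
    r = g.copy()          # residual r = g - H s (s starts at zero)
    p = r.copy()          # current search direction
    r_dot = r.dot(r)
    for _ in range(iters):
        Hp = hvp(p)
        alpha = r_dot / (p.dot(Hp) + 1e-8)
        s += alpha * p
        r -= alpha * Hp
        r_dot_new = r.dot(r)
        if r_dot_new < tol:
            break
        p = r + (r_dot_new / r_dot) * p
        r_dot = r_dot_new
    return s

# Toy check with an explicit symmetric positive-definite H
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
s = conjugate_gradient(lambda v: H.dot(v), g)
print(s, np.linalg.solve(H, g))   # the two solutions should match closely

In TRPO, the Hessian-vector product H v is computed with automatic differentiation of the KL divergence, which is exactly why this matrix-free formulation is attractive.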
Thus, our update equation becomes:

$$\theta = \theta_k + \beta s \qquad (18)$$

Now that we have calculated the search direction s, we need to determine the learning rate β. We need to keep in mind that our update should be within the trust region, so while calculating the value of β, we need to maintain the KL constraint. Substituting θ − θ_k = βs into the KL constraint and rearranging the terms, we can write:

$$\frac{1}{2}(\beta s)^{T} H (\beta s) = \delta$$
We can solve the preceding equation for β as follows:

$$\frac{1}{2}\beta^{2} s^{T} H s = \delta$$

$$\beta^{2} s^{T} H s = 2\delta$$

$$\beta^{2} = \frac{2\delta}{s^{T} H s}$$

$$\beta = \sqrt{\frac{2\delta}{s^{T} H s}}$$
Thus, we can substitute the preceding value of the learning rate β into equation (18) and rewrite our parameter update as:

$$\theta = \theta_k + \sqrt{\frac{2\delta}{s^{T} H s}}\, s$$
Thus, we have computed the search direction s using the Taylor series approximation and the Lagrange multiplier.

Performing a line search in the search direction
We now perform a line search in the computed search direction and update the parameter as:

$$\theta = \theta_k + \alpha^{j} \sqrt{\frac{2\delta}{s^{T} H s}}\, s$$

Okay, what does this mean? What's that new parameter α doing there? It is called the backtracking coefficient, and its value ranges from 0 to 1. It helps us take the largest acceptable step to update our parameter. That is, we can set α to a high value and make a large update. However, we need to make sure that we are maximizing our objective, L(θ) ≥ 0, while also satisfying our constraint, D_KL ≤ δ.
The backtracking line search proceeds as follows:

1. For iterations j = 0, 1, 2, 3, ..., N:
   1. Compute the candidate update θ = θ_k + α^j √(2δ / (sᵀHs)) s
   2. If the candidate improves the surrogate objective and satisfies the KL constraint, accept it, that is, update θ = θ_k + α^j √(2δ / (sᵀHs)) s, and break

Because α lies between 0 and 1, larger values of j shrink the step, so we try the largest step first and back off until the update is acceptable. The resulting update rule is:

$$\theta = \theta_k + \alpha^{j} \sqrt{\frac{2\delta}{s^{T} H s}}\, s$$
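A minimal Python sketch of this procedure follows; the surrogate and kl callables, and the default α = 0.8, are assumptions for illustration rather than the book's code:

import numpy as np

def backtracking_line_search(theta_k, s, sHs, delta, surrogate, kl,
                             alpha=0.8, max_backtracks=10):
    """surrogate(theta): improvement of the surrogate objective for a candidate theta.
    kl(theta): KL divergence between the old policy and the candidate policy."""
    max_step = np.sqrt(2 * delta / sHs) * s
    for j in range(max_backtracks):
        theta_new = theta_k + (alpha ** j) * max_step
        # Accept the largest step that improves the objective and stays in the trust region
        if surrogate(theta_new) >= 0 and kl(theta_new) <= delta:
            return theta_new
    return theta_k   # no acceptable step found; keep the old parameters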
In the next section, we will learn how exactly the TRPO algorithm works by using
the preceding update rule.
Algorithm – TRPO
TRPO acts as an improvement to the policy gradient algorithm we learned in
Chapter 10, Policy Gradient Method. It ensures that we can take large steps and
update our parameter along with maintaining the constraint that our old policy
and the new policy should not vary very much. The TRPO update rule is given as:
$$\theta = \theta_k + \alpha^{j} \sqrt{\frac{2\delta}{s^{T} H s}}\, s$$
Now, let's look at the TRPO algorithm and see exactly how TRPO uses the preceding update rule to find the optimal policy. Before going ahead, let's recap how we computed the gradient in the policy gradient method. In the policy gradient method, we computed the gradient g as:
$$g \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\left(R_t - V_\phi(s_t)\right)\right]$$
Where R_t is the reward-to-go. The reward-to-go is the sum of the rewards of the trajectory starting from a state s and action a; it is expressed as:

$$R_t = \sum_{t'=t}^{T-1} r(s_{t'}, a_{t'})$$
Isn't the reward-to-go similar to something we learned about earlier? Yes! If you
recall, we learned that the Q function is the sum of rewards of the trajectory starting
from the state s and action a. So, we can just replace the reward-to-go with the Q
function and write our gradient as:
$$g \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\left(Q(s_t, a_t) - V_\phi(s_t)\right)\right]$$
In the preceding equation, we have a difference between the Q function and the
value function. We learned that the advantage function is the difference between the
Q function and the value function and hence we can rewrite our gradient with the
advantage function as:
$$g \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A_t\right]$$
Now, let's look at the TRPO algorithm itself. Remember that TRPO is a policy gradient method, so unlike actor-critic methods, here we first generate N trajectories and then update the parameters of the policy and value networks.
Within the algorithm, the policy network parameter is updated using the rule we derived earlier:

$$\theta = \theta + \alpha^{j} \sqrt{\frac{2\delta}{s^{T} H s}}\, s$$
Now that we have understood how TRPO works, in the next section, we will learn about another interesting algorithm called proximal policy optimization.

Proximal policy optimization
PPO improves upon the TRPO algorithm and is simple to implement. Similar to
TRPO, PPO ensures that the policy updates are in the trust region. But unlike TRPO,
PPO does not use any constraints in the objective function. Going forward, we will
learn how exactly PPO works and how PPO ensures that the policy updates are in
the trust region.
There are two types of PPO algorithm: PPO with a clipped objective (PPO-clipped) and PPO with a penalized objective (PPO-penalty). We will now look into these two types of PPO algorithm in detail.

PPO with a clipped objective
Let us take only the objective without the constraint and write the PPO objective function as:

$$L(\theta) = \mathbb{E}_t\left[r_t(\theta)\, A_t\right], \quad \text{where } r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$$

Here, r_t(θ) is the probability ratio between the new policy and the old policy, and A_t is the advantage.
If we update the policy using the preceding objective function then the policy
updates will not be in the trust region. So, to ensure that our policy updates are in
the trust region (that the new policy is not far from the old policy), we modify our
objective function by adding a new function called the clipping function and rewrite
our objective function as:
[ 526 ]
Chapter 13
When the advantage is positive, A_t > 0, it means that the corresponding action should be preferred over the average of all other actions. So, we can increase the value of r_t(θ) for that action so that it has a greater chance of being selected. However, while increasing the value of r_t(θ), we should not increase it so much that the new policy moves far away from the old policy. To prevent this, we clip r_t(θ) at 1 + ε.
Figure 13.10 shows how we increase the value of r_t(θ) when the advantage is positive and how we clip it at 1 + ε.
When the advantage is negative, A_t < 0, it means that the corresponding action should not be preferred over the average of all other actions. So, we can decrease the value of r_t(θ) for that action so that it has a lower chance of being selected. However, while decreasing the value of r_t(θ), we should not decrease it so much that the new policy moves far away from the old policy. To prevent that, we clip r_t(θ) at 1 − ε.
Figure 13.11 shows how we decrease the value of r_t(θ) when the advantage is negative and how we clip it at 1 − ε.
The value of ε is usually set to 0.1 or 0.2. Thus, we learned that the clipped objective keeps our policy updates close to the old policy by clipping at 1 + ε and 1 − ε based on the sign of the advantage. So, our final objective function takes the minimum value of the unclipped and clipped objectives:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) A_t,\; \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) A_t\right)\right]$$
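As a small illustration of how the clipping interacts with the min operation, here is a NumPy sketch that evaluates the clipped surrogate for a handful of made-up ratio and advantage values (the numbers are purely illustrative):

import numpy as np

def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    # Taking the elementwise minimum removes any incentive to push the ratio too far
    return np.mean(np.minimum(unclipped, clipped))

ratio = np.array([0.8, 1.0, 1.5, 0.5])        # r_t(theta) for four samples
advantage = np.array([1.0, -0.5, 2.0, -1.0])  # corresponding advantages A_t
print(ppo_clipped_objective(ratio, advantage))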
Algorithm – PPO-clipped
The steps involved in the PPO-clipped algorithm are given as follows:
Now, let's implement the PPO-clipped algorithm. First, import the necessary libraries:
import warnings
warnings.filterwarnings('ignore')
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import numpy as np
import matplotlib.pyplot as plt
import gym
env = gym.make('Pendulum-v0').unwrapped
state_shape = env.observation_space.shape[0]
action_shape = env.action_space.shape[0]
Note that the pendulum is a continuous environment and thus our action space
consists of continuous values. So, we get the bound of our action space:
epsilon = 0.2
class PPO(object):
def __init__(self):
self.sess = tf.Session()
Now, let's build the value network that returns the value of a state:
with tf.variable_scope('value'):
Define the advantage value as the difference between the Q value and the state
value:
self.value_loss = tf.reduce_mean(tf.square(self.advantage))
Train the value network by minimizing the loss using the Adam optimizer:
self.train_value_nw = tf.train.AdamOptimizer(0.002).minimize(self.value_loss)
Now, we obtain the new policy and its parameter from the policy network:
Obtain the old policy and its parameter from the policy network:
with tf.variable_scope('sample_action'):
self.sample_op = tf.squeeze(pi.sample(1), axis=0)
with tf.variable_scope('update_oldpi'):
self.update_oldpi_op = [oldp.assign(p) for p, oldp in zip(pi_params, oldpi_params)]
Now, let's define our surrogate objective function of the policy network:
with tf.variable_scope('loss'):
with tf.variable_scope('surrogate'):
Define the objective by multiplying the ratio r_t(θ) and the advantage value A_t:
Define the objective function with the clipped and unclipped objectives:
L = tf.reduce_mean(tf.minimum(objective, tf.clip_by_value(ratio, 1.-epsilon, 1.+epsilon)*self.advantage_ph))
Now, we can compute the gradient and maximize the objective function using
gradient ascent. However, instead of doing that, we can convert the preceding
maximization objective into the minimization objective by just adding a negative
sign. So, we can denote the loss of the policy network as:
self.policy_loss = -L
Train the policy network by minimizing the loss using the Adam optimizer:
with tf.variable_scope('train_policy'):
self.train_policy_nw = tf.train.AdamOptimizer(0.001).minimize(self.policy_loss)
self.sess.run(tf.global_variables_initializer())
self.sess.run(self.update_oldpi_op)
params = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=name)
return norm_dist, params
Sample an action from the normal distribution generated by the policy network:
We clip the action so that it lies within the action bounds and then we return the
action:
ppo = PPO()
num_episodes = 1000
num_timesteps = 200
gamma = 0.9
batch_size = 32
for i in range(num_episodes):
state = env.reset()
Initialize the lists for holding the states, actions, and rewards obtained in the episode, and initialize the return to 0:
episode_states, episode_actions, episode_rewards = [], [], []
Return = 0
for t in range(num_timesteps):
env.render()
action = ppo.select_action(state)
Perform the selected action and obtain the next state and the reward:
next_state, reward, done, info = env.step(action)
episode_states.append(state)
episode_actions.append(action)
episode_rewards.append((reward+8)/8)
state = next_state
Return += reward
If we reached the batch size or if we reached the final step of the episode:
v_s_ = ppo.get_state_value(next_state)
discounted_r = []
for reward in episode_rewards[::-1]:
v_s_ = reward + gamma * v_s_
discounted_r.append(v_s_)
discounted_r.reverse()
if i %10 ==0:
print("Episode:{}, Return: {}".format(i,Return))
Now that we have learned how PPO with a clipped objective works and how to
implement it, in the next section we will learn about another interesting type of PPO
algorithm called PPO with a penalized objective.
PPO with a penalized objective
In the PPO-penalty method, instead of clipping the ratio, we add the KL divergence between the old policy and the new policy as a penalty term to the objective and adapt the penalty coefficient β over the course of training. Let d = KL[π_θold(∙|s_t), π_θ(∙|s_t)] and let δ be the target KL divergence; we then increase or decrease the value of β adaptively depending on whether d overshoots or undershoots the target δ. We can understand how exactly this works by looking into the PPO-penalty algorithm in the next section.
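As a sketch of how β can be adapted, the following follows the rule proposed in the PPO paper (the 1.5 threshold and the factor of 2 come from that paper; they are not taken from the preceding text):

def update_penalty_coefficient(beta, d, delta):
    """d: measured KL divergence of the last update; delta: target KL divergence."""
    if d < delta / 1.5:
        beta = beta / 2.0   # KL well below target: penalize less, allow bigger updates
    elif d > delta * 1.5:
        beta = beta * 2.0   # KL exceeded target: penalize more, keep updates smaller
    return beta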
Algorithm – PPO-penalty
The steps involved in the PPO-penalty algorithm are:
1. Initialize the policy network parameter θ and the value network parameter ϕ, and initialize the penalty coefficient β₁ and the target KL divergence δ
2. For iterations i = 1, 2, ..., N:
In the next section, we will learn another interesting algorithm called ACKTR.
ACKTR
We know that the actor-critic architecture consists of the actor and critic networks,
where the role of the actor is to produce a policy and the role of the critic is to
evaluate the policy produced by the actor network. We learned that in the actor
network (policy network), we compute gradients and update the parameter of the
actor network using gradient ascent:
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
Instead of updating our actor network parameter using the preceding update rule, we can also update it by computing the natural gradient as:

$$\theta = \theta + \alpha F^{-1} \nabla_\theta J(\theta) \qquad (21)$$

Where F is called the Fisher information matrix. Thus, the natural gradient is just the product of the inverse of the Fisher information matrix and the standard gradient:

$$\Delta\theta = F^{-1} \nabla_\theta J(\theta)$$
The benefit of the natural gradient is that it guarantees a monotonic improvement in the policy. However, updating the actor network (policy network) parameter with the preceding update rule is computationally expensive, because computing the Fisher information matrix and then taking its inverse is costly. So, to avoid this tedious computation, we can approximate the value of F⁻¹ using a Kronecker-factored approximation. Once we approximate F⁻¹ in this way, we can update our policy network parameter using the natural gradient update rule given in equation (21), and while updating the policy network parameter, we also ensure that the policy updates are in the trust region so that the new policy is not far from the old policy. This is the main idea behind the ACKTR algorithm.
Now that we have a basic understanding of what ACKTR is, let us understand how this works exactly in detail. First, we will understand what Kronecker factorization is, then we will learn how it is used in the actor-critic setting, and later we will learn how to incorporate the trust region in the policy updates.
Now that we have a basic understanding of what ACKTR is, let us understand how
this works exactly in detail. First, we will understand what Kronecker factorization
is, then we will learn how it is used in the actor-critic setting, and later we will learn
how to incorporate the trust region in the policy updates.
Before going ahead, let's learn several math concepts that are required to understand
ACKTR.
Math essentials
To understand how Kronecker factorization works, we will learn the following
important concepts:
• Block matrix
• Block diagonal matrix
• The Kronecker product
• The vec operator
• Properties of the Kronecker product
Block matrix
A block matrix is defined as a matrix that can be broken down into submatrices
called blocks, or we can say a block matrix is formed by a set of submatrices or
blocks. For instance, let's consider a block matrix A as shown here:
$$A = \begin{bmatrix} 1 & 2 & 4 & 5 \\ 3 & 4 & 6 & 7 \\ 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 1 \end{bmatrix}$$

The matrix A can be broken into four 2 × 2 submatrices as shown here:

$$A_1 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\quad A_2 = \begin{bmatrix} 4 & 5 \\ 6 & 7 \end{bmatrix},\quad A_3 = \begin{bmatrix} 1 & 1 \\ 1 & 2 \end{bmatrix},\quad A_4 = \begin{bmatrix} 1 & 1 \\ 2 & 1 \end{bmatrix}$$

Now, we can simply write our block matrix A as:

$$A = \begin{bmatrix} A_1 & A_2 \\ A_3 & A_4 \end{bmatrix}$$
Block diagonal matrix
A block diagonal matrix is a block matrix whose main-diagonal blocks are square matrices and whose off-diagonal blocks are zero matrices:

$$A = \begin{bmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_n \end{bmatrix}$$

Where the diagonal blocks A₁, A₂, ⋯, Aₙ are square matrices. For instance, consider the following block diagonal matrix A:

$$A = \begin{bmatrix} 1 & 2 & 0 & 0 & 0 & 0 \\ 3 & 4 & 0 & 0 & 0 & 0 \\ 0 & 0 & 4 & 5 & 0 & 0 \\ 0 & 0 & 6 & 7 & 0 & 0 \\ 0 & 0 & 0 & 0 & 7 & 8 \\ 0 & 0 & 0 & 0 & 9 & 0 \end{bmatrix}$$

As we can see, the diagonal blocks are square matrices and the off-diagonal elements are set to zero:

$$A_1 = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\quad A_2 = \begin{bmatrix} 4 & 5 \\ 6 & 7 \end{bmatrix},\quad A_3 = \begin{bmatrix} 7 & 8 \\ 9 & 0 \end{bmatrix}$$

Now, we can simply denote our block diagonal matrix A as:

$$A = \begin{bmatrix} A_1 & 0 & 0 \\ 0 & A_2 & 0 \\ 0 & 0 & A_3 \end{bmatrix}$$
The Kronecker product
The Kronecker product of two matrices A and B, denoted A ⊗ B, multiplies every element of A by the whole matrix B and arranges the results in a block matrix. For instance, consider the following two matrices:

$$A = \begin{bmatrix} 1 & 1 \\ 2 & 0 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 2 & 4 \\ 1 & 3 \end{bmatrix}$$

Then the Kronecker product of matrices A and B is given as:

$$A \otimes B = \begin{bmatrix} 1 \cdot B & 1 \cdot B \\ 2 \cdot B & 0 \cdot B \end{bmatrix} = \begin{bmatrix} 2 & 4 & 2 & 4 \\ 1 & 3 & 1 & 3 \\ 4 & 8 & 0 & 0 \\ 2 & 6 & 0 & 0 \end{bmatrix}$$
The vec operator
The vec operator converts a matrix into a vector by stacking its columns one below the other. For instance, consider the following matrix A:

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$$

Applying the vec operator on A stacks all the columns of the matrix one below the other as follows:

$$\text{vec}(A) = \begin{bmatrix} 1 \\ 3 \\ 2 \\ 4 \end{bmatrix}$$
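The following NumPy snippet verifies both operations on the matrices from the examples above; np.kron computes the Kronecker product, and a column-major reshape implements the vec operator:

import numpy as np

A = np.array([[1, 1],
              [2, 0]])
B = np.array([[2, 4],
              [1, 3]])
print(np.kron(A, B))              # the Kronecker product A ⊗ B

C = np.array([[1, 2],
              [3, 4]])
print(C.reshape(-1, order='F'))   # vec(C): stack the columns -> [1 3 2 4]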
Now that we have learned several important concepts, let's understand what
Kronecker factorization is.
Let's learn how we approximate F⁻¹ using Kronecker factors. Say our network has layers 1, 2, ..., l, ..., L and the weights of the network are represented by θ. Thus, θ₁, θ₂, …, θ_l, …, θ_L denote the weights of layers 1, 2, …, l, …, L, respectively. Let p(y|x) denote the output distribution of the network, and we will use the negative log-likelihood as the loss function J:

$$J = \mathbb{E}\left[-\log p(y|x)\right]$$

The Fisher information matrix F can then be approximated by a block diagonal matrix with one block per layer, F ≈ diag(F₁, F₂, …, F_L).
As we can observe, each block F1 to FL contains the derivatives of loss J with respect
to the weights of the corresponding layer. Okay, how can we compute each block?
That is, how can the values in the preceding block diagonal matrix be computed?
To understand this, let's just take one block, say, Fl, and learn how it is computed.
Let's take a layer l. Let a be the input activation vector, let θ_l be the weights of the layer, and let s be the output pre-activation vector, which is sent to the next layer l + 1.
We know that in a neural network, we multiply the activation vector by the weights and send the result to the next layer; so, s is a function of θ_l and a. The Fisher block for layer l is defined as:

$$F_l = \mathbb{E}\left[\text{vec}\{\nabla_{\theta_l} J\}\, \text{vec}\{\nabla_{\theta_l} J\}^{T}\right] \qquad (23)$$

In the preceding equation, ∇_{θ_l} J denotes the gradient of the loss with respect to the weights of layer l. From (22), this gradient can be written as the product of the layer's input activation vector a and the gradient of the loss with respect to the pre-activation vector s, which is what gives the block F_l a Kronecker-product structure; the update rule for the weights θ_l of layer l then uses the Kronecker-factored approximation of F_l⁻¹.
Thus, we have learned how to approximate the natural gradient using Kronecker
factors. In the next section, we will learn how to apply this in an actor-critic setting.
K-FAC in actor-critic
We know that in the actor-critic method, we have actor and critic networks. The role
of the actor is to produce the policy and the role of the critic is to evaluate the policy
produced by the actor network.
First, let's take a look at the actor network. In the actor network, our goal is to find
the optimal policy. So, we try to find the optimal parameter 𝜃𝜃 with which we can
obtain the optimal policy. We compute gradients and update the parameter of the
actor network using gradient ascent:
$$\theta = \theta + \alpha \nabla_\theta J(\theta)$$
Instead of updating the actor network parameter using the preceding update rule, we can also update the parameter of the actor network by computing the natural gradient as:

$$\theta = \theta + \alpha F^{-1} \nabla_\theta J(\theta)$$

Where the Fisher information matrix F is given as:

$$F = \mathbb{E}_{p(\tau)}\left[\nabla_\theta \log \pi(a_t|s_t)\left(\nabla_\theta \log \pi(a_t|s_t)\right)^{T}\right]$$
Now, just as we learned in the previous section, we can approximate the Fisher
information matrix as a block diagonal matrix where each block contains the
derivatives, and then we can approximate each block as the Kronecker product of
two matrices.
We then update the actor network parameter as:

$$\theta = \theta + \alpha \Delta\theta$$

Where the value of Δθ is computed using the Kronecker-factored approximation of F⁻¹ described in the previous section.
Now, let's look at the critic network. We know that the critic evaluates the policy
produced by the actor network by estimating the Q function. So, we train the critic
by minimizing the mean squared error between the target value and predicted value.
We minimize the loss using gradient descent and update the critic network parameter ϕ as:

$$\phi = \phi - \alpha \nabla_\phi J(\phi)$$

Where ∇_ϕ J(ϕ) is the standard first-order gradient.
Instead of using the first-order gradient, can we use the second-order gradient and
update the critic network parameter 𝜙𝜙, similar to what we did with the actor? Yes, in
settings like least squares (MSE), we can use an algorithm called the Gauss-Newton
method for finding the second-order derivative. You can learn more about the
Gauss-Newton method here: http://www.seas.ucla.edu/~vandenbe/236C/lectures/gn.pdf. Let's represent our error as e(ϕ) = target − predicted. According to the Gauss-Newton method, the update rule for updating the critic network parameter ϕ is given as:
Instead of applying K-FAC to the actor and critic separately, we can also apply
them in a shared mode. As specified in the paper Scalable trust-region method for
deep reinforcement learning using Kronecker-factored approximation by Yuhuai
Wu, Elman Mansimov, Shun Liao, Roger Grosse, Jimmy Ba (https://arxiv.org/
pdf/1708.05144.pdf), "We can have a single architecture where both actor and critic share
the lower layer representations but they have different output layers."
In a nutshell, in the ACKTR method, we update the parameters of the actor and critic
networks by computing the second-order derivatives. Since computing the second-
order derivative is an expensive task, we use a method called Kronecker-factored
approximation to approximate the second-order derivative.
In the next section, we will learn how to incorporate the trust region into our update
rule so that our new and old policy updates will not be too far apart.
In ACKTR, the step size is set to:

$$\alpha = \min\left(\alpha_{\max},\; \sqrt{\frac{2\delta}{\Delta\theta^{T} \hat{F}\, \Delta\theta}}\right)$$

Where α_max and the trust region radius δ are hyperparameters, as mentioned in the ACKTR paper (refer to the Further reading section). Updating our network parameters with this step size ensures that our policy updates are in the trust region.
Summary
We started off the chapter by understanding what TRPO is and how it acts as an
improvement to the policy gradient algorithm. We learned that when the new policy and the old policy vary greatly, it causes model collapse.
So in TRPO, we make a policy update while imposing the constraint that the
parameters of the old and new policies should stay within the trust region. We also
learned that TRPO guarantees monotonic policy improvement; that is, it guarantees
that there will always be a policy improvement on every iteration.
Later, we learned about the PPO algorithm, which acts as an improvement to the
TRPO algorithm. We learned about two types of PPO algorithm: PPO-clipped and
PPO-penalty. In the PPO-clipped method, in order to ensure that the policy updates
are in the trust region, PPO adds a new function called the clipping function that
ensures the new and old policies are not far away from each other. In the PPO-
penalty method, we modify our objective function by converting the KL constraint
term to a penalty term and update the penalty coefficient adaptively during training
by ensuring that the policy updates are in the trust region.
At the end of the chapter, we learned about ACKTR. In the ACKTR method, we
update the parameters of the actor and critic networks by computing the second-
order derivative. Since computing the second-order derivative is an expensive task,
we use a method called Kronecker-factored approximation to approximate the
second-order derivatives, and while updating the policy network parameter, we also
ensure that the policy updates are in the trust region so that the new policy is not far
from the old policy.
Questions
Let's evaluate our understanding of the algorithms we learned in this chapter. Try
answering the following questions:
Further reading
For more information, refer to the following papers:
Chapter 14: Distributional Reinforcement Learning
In this chapter, we will learn about distributional reinforcement learning. We will
begin the chapter by understanding what exactly distributional reinforcement
learning is and why it is useful. Next, we will learn about one of the most popular
distributional reinforcement learning algorithms called categorical DQN. We
will understand what a categorical DQN is and how it differs from the DQN we
learned in Chapter 9, Deep Q Networks and Its Variants, and then we will explore the
categorical DQN algorithm in detail.
At the end of the chapter, we will learn about the policy gradient algorithm called
the Distributed Distributional Deep Deterministic Policy Gradient (D4PG).
We will learn what D4PG is and how it differs from the DDPG algorithm we covered in Chapter 12, Learning DDPG, TD3, and SAC.
We learned that the Q value is the expected return an agent would obtain when starting from state s and performing an action a following the policy π:

$$Q^{\pi}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\, a_0 = a\right]$$
But there is a small problem in computing the Q value in this manner because the
Q value is just an expectation of the return, and the expectation does not include the
intrinsic randomness. Let's understand exactly what this means with an example.
Let's suppose we want to drive from work to home and we have two routes A and B.
Now, we have to decide which route is better, that is, which route helps us to reach
home in the minimum amount of time. To find out which route is better, we can
calculate the Q values and select the route that has the maximum Q value, that is,
the route that gives us the maximum expected return.
Say the Q value of choosing route A is Q(s, A) = 31, and the Q value of choosing
route B is Q(s, B) = 28. Since the Q value (the expected return of route A) is higher,
we can choose route A to travel home. But are we missing something here? Instead
of viewing the Q value as an expectation over a return, can we directly look into the
distribution of return and make a better decision?
Yes!
But first, let's take a look at the distribution of route A and route B and understand
which route is best. The following plot shows the distribution of route A. It tells us
with 70% probability we reach home in 10 minutes, and with 30% probability we
reach home in 80 minutes. That is, if we choose route A we usually reach home in
10 minutes but when there is heavy traffic we reach home in 80 minutes:
Figure 14.2 shows the distribution of route B. It tells us that with 80% probability we
reach home in 20 minutes and with 20% probability we reach home in 60 minutes.
That is, if we choose route B we usually reach home in 20 minutes but when there is
heavy traffic we reach home in 60 minutes:
After looking at these two distributions, it makes more sense to choose route B
instead of choosing route A. With route B, even in the worst case, that is, even when
there is heavy traffic, we can reach home in 60 minutes. But with route A, when there
is heavy traffic, we reach home in 80 minutes. So, it is a wise decision to choose route
B rather than A.
Similarly, by looking at the return distributions of route A and route B rather than just their expected returns, we gain information that we would miss if we acted only on the maximum expected return, that is, the maximum Q value. So, instead of using the expected return to select an action, we use the distribution of the return and then select the optimal action based on that distribution.
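As a quick sanity check on the numbers used in this example, the following snippet computes the expected travel time of each route from its distribution; the results match the Q values of 31 and 28 quoted earlier:

import numpy as np

# Travel times (minutes) and their probabilities for each route, from the example above
route_A_times = np.array([10, 80])
route_A_probs = np.array([0.7, 0.3])
route_B_times = np.array([20, 60])
route_B_probs = np.array([0.8, 0.2])

print(np.dot(route_A_times, route_A_probs))   # 31.0 on average for route A
print(np.dot(route_B_times, route_B_probs))   # 28.0 on average for route B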
Categorical DQN
In the last section, we learned why it is more beneficial to choose an action based
on the distribution of return than to choose an action based on the Q value, which
is just the expected return. In this section, we will understand how to compute the
distribution of return using an algorithm called categorical DQN.
The distribution of return is often called the value distribution or return distribution.
Let Z be the random variable and Z(s, a) denote the value distribution of a state s
and an action a. We know that the Q function is represented by Q(s, a) and it gives
the value of a state-action pair. Similarly, now we have Z(s, a) and it gives the value
distribution (return distribution) of the state-action pair.
Okay, how can we compute Z(s, a)? First, let's recollect how we compute Q(s, a).
Let's understand the difference between the DQN and categorical DQN with an
example. Suppose we are in the state s and say our action space has two actions
a and b. Now, as shown in Figure 14.3, given the state s as an input to the DQN, it
returns the Q value of all the actions, then we select the action that has the maximum
Q value, whereas in the categorical DQN, given the state s as an input, it returns the
value distribution of all the actions, then we select the action based on this value
distribution:
Okay, how can we train the network? In DQN, we learned that we train the network
by minimizing the loss between the target Q value and the Q value predicted by the
network. We learned that the target Q value is obtained by the Bellman optimality
equation. Thus, we minimize the loss between the target value (the optimal Bellman
Q value) and the predicted value (the Q value predicted by the network) and train
the network.
Similarly, in categorical DQN, we train the network by minimizing the loss between
the target value distribution and the value distribution predicted by the network.
Okay, how can we obtain the target value distribution? In DQN, we obtained the
target Q value using the Bellman equation; similarly in categorical DQN, we can
obtain the target value distribution using the distributional Bellman equation.
What's the distributional Bellman equation? First, let's recall the Bellman equation
before learning about the distributional Bellman equation.
We learned that the Bellman equation for the Q function Q(s, a) is given as:
Similarly, the Bellman equation for the value distribution Z(s, a) is given as:
Okay, what loss function should we use? In DQN, we use the mean squared
error (MSE) as our loss function. Unlike a DQN, we cannot use the MSE as the
loss function in the categorical DQN because in categorical DQN, we predict
the probability distribution and not the Q value. Since we are dealing with the
distribution we use the cross entropy loss as our loss function. Thus, in categorical
DQN, we train the network by minimizing the cross entropy loss between the
target value distribution and the value distribution predicted by the network.
Now that we have understood what a categorical DQN is and how it differs from a
DQN, in the next section we will learn how exactly the categorical DQN predicts the
value distribution.
The horizontal axis values are called support or atoms and the vertical axis values
are the probability. We denote the support by Z and the probability by P. In order to
predict the value distribution, along with the state, our network takes the support of
the distribution as input and it returns the probability of each value in the support.
So, now, we will see how to compute the support of the distribution. To compute the support, first we need to decide the number of support values N, the minimum value of the support V_min, and the maximum value of the support V_max. Given the number of support values N, we divide the range from V_min to V_max into N equally spaced values.
Let's understand this with an example. Say the number of support values N = 5, the minimum value of the support V_min = 2, and the maximum value of the support V_max = 10. Now, how can we find the values of the support? To find the values of the support, first we compute the step size, Δz, as:

$$\Delta z = \frac{V_{\max} - V_{\min}}{N - 1}$$
$$\Delta z = \frac{10 - 2}{5 - 1} = \frac{8}{4} = 2$$

Now, to compute the values of the support, we start with the minimum value of the support, V_min, and keep adding Δz until we reach the number of support values N. In our example, we start with V_min = 2 and repeatedly add Δz = 2, so the support values become [2, 4, 6, 8, 10].
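The same computation in NumPy, using the values from this example, looks as follows:

import numpy as np

# N = 5 atoms between v_min = 2 and v_max = 10
N, v_min, v_max = 5, 2.0, 10.0
delta_z = (v_max - v_min) / (N - 1)
support = v_min + delta_z * np.arange(N)   # equivalently: np.linspace(v_min, v_max, N)
print(delta_z, support)                    # 2.0 [ 2.  4.  6.  8. 10.]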
Okay, we have learned how to compute the support of the distribution, now how
does the neural network take this support as input and return the probabilities?
In order to predict the value distribution, along with the state, we also need to
give the support of the distribution as input and then the network returns the
probabilities of our value distribution as output. Let's understand this with an
example. Say we are in a state s and we have two actions to perform in this state,
and let the actions be up and down. Say our calculated support values are z1, z2,
and z3.
As Figure 14.5 shows, along with giving the state s as input to the network, we
also give the support of our distribution z1, z2, and z3. Then our network returns
the probabilities pi(s, a) of the given support for the distribution of action up and
distribution of action down:
The authors of the categorical DQN paper (see the Further reading section for more
details) suggest that it will be efficient to set the number of support N as 51, and so
the categorical DQN is also known as the C51 algorithm. Thus, we have learned how
categorical DQN predicts the value distribution. In the next section, we will learn
how to select the action based on this predicted value distribution.
We generally select an action based on the Q value, that is, we usually select the
action that has the maximum Q value. But now we don't have a Q value; instead,
we have a value distribution. How can we select an action based on the value
distribution?
First, we will extract the Q value from the value distribution and then we select the action that has the maximum Q value. Okay, how can we extract the Q value? We can compute the Q value by just taking the expectation of the value distribution. The expectation of the distribution is given as the sum of the support values z_i multiplied by their corresponding probabilities p_i. So the Q value extracted from the value distribution Z is given as:

$$Q(s, a) = \sum_{i} z_i\, p_i(s, a)$$

After computing the Q value, we select the best action as the one that has the maximum Q value:

$$a^{*} = \arg\max_{a} Q(s, a)$$
Let's understand how this works exactly. Suppose we are in the state s and say we have two actions in this state; let the actions be up and down. First, we need to compute the support. Let the number of support values N = 3, the minimum value of the support V_min = 2, and the maximum value of the support V_max = 4. Then, our computed support values will be [2, 3, 4].
Now, along with the state s, we feed the support, and the categorical DQN returns the probabilities p_i(s, a) of the given support for the value distribution of the action up and the distribution of the action down as shown here:
Now, how can we select the best action, based on these two value distributions?
First, we will extract the Q value from the value distributions and then we select
the action that has the maximum Q value.
We learned that the Q value can be extracted from the value distribution as the sum
of support multiplied by their probabilities:
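As a concrete sketch, suppose the network predicted the following probabilities over the support [2, 3, 4] for the two actions (these probabilities are illustrative assumptions, not values from the preceding example); the Q values are then the probability-weighted sums of the support, and we pick the action with the larger Q value:

import numpy as np

support = np.array([2.0, 3.0, 4.0])

# Illustrative predicted probabilities for each action
p_up = np.array([0.1, 0.3, 0.6])
p_down = np.array([0.5, 0.3, 0.2])

q_up = np.dot(support, p_up)        # 0.2 + 0.9 + 2.4 = 3.5
q_down = np.dot(support, p_down)    # 1.0 + 0.9 + 0.8 = 2.7
best_action = 'up' if q_up > q_down else 'down'
print(q_up, q_down, best_action)    # 3.5 2.7 up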
Wait! What makes the categorical DQN special, then? Because, just like DQN, we are selecting the action based on the Q value at the end. One important point we have to note is that in DQN, we compute the Q value as the expectation of the return directly, but in categorical DQN, we first learn the return distribution and then compute the Q value as the expectation of that return distribution, which captures the intrinsic randomness.
We have learned that the categorical DQN outputs the value distribution of all
the actions in the given state and then we extract the Q value from the value
distribution and select the action that has the maximum Q value as the best action.
But the question is how exactly does our categorical DQN learn? How do we train
the categorical DQN to predict the accurate value distribution? Let's discuss this in
the next section.
After computing the target distribution, we train the network by minimizing the
cross entropy loss between the target value distribution and the predicted value
distribution. One important point we need to note here is that we can apply the
cross entropy loss between any two distributions only when their supports are
equal; when their supports are not equal we cannot apply the cross entropy loss.
For instance, Figure 14.7 shows the support of both the target and predicted
distribution is the same, (1,2,3,4). Thus, in this case, we can apply the cross
entropy loss:
In Figure 14.8, we can see that the target distribution support (1,3,4,5) and the
predicted distribution support (1,2,3,4) are different, so in this case, we cannot
apply the cross entropy loss.
So, when the support of the target and prediction distribution is different, we
perform a special step called the projection step using which we can make the
support of the target and prediction distribution equal. Once we make the support
of the target and prediction distribution equal then we can apply the cross
entropy loss.
In the next section, we will learn how exactly the projection works and how it
makes the support of the target and prediction distribution equal.
Projection step
Let's understand how exactly the projection step works with an example. Suppose
the input support is z = [1, 2].
Let the probability of predicted distribution be p = [0.5, 0.5]. Figure 14.9 shows the
predicted distribution:
Let the probability of the target distribution be p = [0.3, 0.7]. Let the reward r = 0.1 and the discount factor γ = 0.9. The target distribution support values are computed as τ̂z_j = r_t + γ_t z_j, so we can write:

$$\hat{\tau} z_1 = r + \gamma z_1 = 0.1 + (0.9 \times 1) = 1.0$$

$$\hat{\tau} z_2 = r + \gamma z_2 = 0.1 + (0.9 \times 2) = 1.9$$
Thus, the target distribution becomes:
As we can observe from the preceding plots, the supports of the predicted
distribution and target distribution are different. The predicted distribution has the
support [1, 2] while the target distribution has the support [1, 1.9], so in this case,
we cannot apply the cross entropy loss.
Now, using the projection step we can convert the support of our target distribution
to be the same support as the predicted distribution. Once the supports of the
predicted and target distribution are the same then we can apply the cross entropy
loss.
Okay, what's that projection step exactly? How can we apply it and convert the
support of the target distribution to match the support of the predicted distribution?
Let's understand this with the same example. As the following shows, we have the
target distribution support [1, 1.9] and we need to make it equal to the predicted
distribution support [1, 2], how can we do that?
So, what we can do is that we can distribute the probability 0.7 from the support 1.9
to the support 1 and 2:
Okay, but how can we distribute the probabilities from the support 1.9 to the
support 1 and 2? Should it be an equal distribution? Of course not. Since 2 is closer
to 1.9, we distribute more probability to 2 and less to 1.
As shown in Figure 14.13, from 0.7, we will distribute 0.63 to support 2 and 0.07 to
support 1.
From Figure 14.14, we can see that support of the target distribution is changed from
[1, 1.9] to [1, 2] and now it matches the support of the predicted distribution. This
step is called the projection step.
What we have just worked through is a simple example; consider a case where the target and predicted distribution supports differ a lot. In such a case, we cannot manually determine how much probability we have to distribute across the supports to make them equal. So, we introduce a set of steps to perform the projection, as the following shows. After performing these steps, our target distribution support will match our predicted distribution support, with the probabilities distributed across the supports.
First, we initialize an array m with its shape as the number of support with zero
values. The m denotes the distributed probability of the target distribution after the
projection step.
Then, for each support value j = 0, 1, ..., N − 1:

1. Compute the target support value: τ̂z_j = [r_t + γ_t z_j], clipped to the range [V_min, V_max]
2. Compute the value of b: b_j = (τ̂z_j − V_min) / Δz
3. Compute the lower bound and the upper bound: l = ⌊b_j⌋, u = ⌈b_j⌉
4. Distribute the probability on the lower bound: m_l = m_l + p_j(u − b_j)
5. Distribute the probability on the upper bound: m_u = m_u + p_j(b_j − l)

Understanding how exactly these projection steps work is a little tricky! So, let's understand this by considering the same example we used earlier. Let z = [1, 2], N = 2, V_min = 1, and V_max = 2.
Let the probability of predicted distribution be p = [0.5, 0.5]. Figure 14.15 shows the
predicted distribution:
Let the probability of the target distribution be p = [0.3, 0.7]. Let the reward r = 0.1 and the discount factor γ = 0.9, and we know τ̂z_j = r_t + γ_t z_j; thus, the target distribution becomes:
From Figure 14.16, we can infer that support in target distribution is different from
the predicted distribution. Now, we will learn how to perform the projection using
the preceding steps.
First, we initialize an array m with its shape as the number of support with zero
values. Thus, m = [0, 0].
Iteration j = 0:
1. Compute the target support value: τ̂z_0 = min(2, max(1, 0.1 + 0.9 × 1)) = 1.0
2. Compute the value of b: b_0 = (1.0 − 1)/1 = 0
3. Compute the lower bound and the upper bound: l = ⌊0⌋ = 0, u = ⌈0⌉ = 0
4. Distribute the probability on the lower bound: m_0 = m_0 + 0.3 × (0 − 0) = 0
5. Distribute the probability on the upper bound: m_0 = m_0 + 0.3 × (0 − 0) = 0

Iteration j = 1:
1. Compute the target support value: τ̂z_1 = min(2, max(1, 0.1 + 0.9 × 2)) = 1.9
2. Compute the value of b: b_1 = (1.9 − 1)/1 = 0.9
3. Compute the lower bound and the upper bound: l = ⌊0.9⌋ = 0, u = ⌈0.9⌉ = 1
4. Distribute the probability on the lower bound: m_0 = m_0 + 0.7 × (1 − 0.9) = 0.07
5. Distribute the probability on the upper bound: m_1 = m_1 + 0.7 × (0.9 − 0) = 0.63
After the second iteration, the value of m becomes [0.07, 0.63]. The number of iterations equals the length of our support. Since the length of our support is 2, we stop here, and the value of m becomes our new distributed probability for the modified support, as Figure 14.17 shows.
The following snippet will give us more clarity on how exactly the projection step works:
m = np.zeros(num_support)
for j in range(num_support):
    # Clipped target support value and its fractional index b_j on the original support
    Tz = min(v_max, max(v_min, r + gamma * z[j]))
    bj = (Tz - v_min) / delta_z
    # Lower and upper neighboring support indices
    l, u = math.floor(bj), math.ceil(bj)
    pj = p[j]
    # Split the probability p_j between the two neighboring supports
    m[int(l)] += pj * (u - bj)
    m[int(u)] += pj * (bj - l)
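Running essentially the same snippet on the worked example from this section (z = [1, 2], target probabilities [0.3, 0.7], r = 0.1, γ = 0.9) reproduces the projected probabilities we computed by hand:

import math
import numpy as np

# Values from the worked example in this section
z = [1.0, 2.0]
p = [0.3, 0.7]          # probabilities of the target distribution before projection
r, gamma = 0.1, 0.9
v_min, v_max = 1.0, 2.0
num_support = 2
delta_z = (v_max - v_min) / (num_support - 1)

m = np.zeros(num_support)
for j in range(num_support):
    Tz = min(v_max, max(v_min, r + gamma * z[j]))
    bj = (Tz - v_min) / delta_z
    l, u = math.floor(bj), math.ceil(bj)
    pj = p[j]
    m[int(l)] += pj * (u - bj)
    m[int(u)] += pj * (bj - l)

print(m)   # approximately [0.07, 0.63], matching the hand computation

Note that when b_j lands exactly on a support index (so l = u), both (u − b_j) and (b_j − l) are zero, which is why the 0.3 at τ̂z_0 = 1.0 is not redistributed by this simplified snippet.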
Now that we have understood how to compute the target value distribution and how we can make the support of the target value distribution equal to the support of the predicted value distribution using the projection step, we will learn how to compute the cross entropy loss. The cross entropy loss is given as:

$$\text{Cross entropy} = -\sum_{i} y_i \log \hat{y}_i$$

Where y is the actual value and ŷ is the predicted value. Thus, we can write:

$$\text{Cross entropy loss} = -\sum_{i} m_i \log p_i(s, a)$$
Where m is the target probabilities from the target value distribution and p(s, a) is the
predicted probabilities from the predicted value distribution. We train our network
by minimizing the cross entropy loss.
Thus, using a categorical DQN, we select the action based on the distribution of
the return (value distribution). In the next section, we will put all these concepts
together and see how a categorical DQN works.
Now, for each step in the episode, we feed the state of the environment and the support values to the main categorical DQN parameterized by θ. The main network takes the support and the state of the environment as input and returns the probability value for each support. Then the Q value can be computed as the sum of the support multiplied by their probabilities:

$$Q(s, a) = \sum_{i} z_i\, p_i(s, a)$$

After computing the Q value of all the actions in the state, we select the best action in the state s as the one that has the maximum Q value:

$$a^{*} = \arg\max_{a} Q(s, a)$$
However, instead of selecting the action that has the maximum Q value all the
time, we select the action using the epsilon-greedy policy. With the epsilon-greedy
policy, we select a random action with probability epsilon and with the probability
1-epsilon, we select the best action that has the maximum Q value. We perform the
selected action, move to the next state, obtain the reward, and store this transition information (s, a, r, s′) in the replay buffer D.
Now, we sample a transition (s, a, r, s′) from the replay buffer D and feed the next state s′ and the support values to the target categorical DQN parameterized by θ′. The target network takes the support and the next state s′ as input and returns the probability value for each support.
Then the Q value can be computed as the sum of the support multiplied by their probabilities:

$$Q(s', a) = \sum_{i} z_i\, p_i(s', a)$$

After computing the Q value of all next state-action pairs, we select the best action in the state s′ as the one that has the maximum Q value:

$$a^{*} = \arg\max_{a} Q(s', a)$$
Now, we perform the projection step. Recall that m denotes the distributed probability of the target distribution after the projection step. For each support value j:

1. Compute the target support value: τ̂z_j = [r_t + γ_t z_j], clipped to the range [V_min, V_max]
2. Compute the value of b: b_j = (τ̂z_j − V_min) / Δz
3. Compute the lower bound and the upper bound: l = ⌊b_j⌋, u = ⌈b_j⌉
4. Distribute the probability on the lower bound: m_l = m_l + p_j(s′, a*)(u − b_j)
5. Distribute the probability on the upper bound: m_u = m_u + p_j(s′, a*)(b_j − l)
After performing the projection step, compute the cross entropy loss:

$$\text{Cross entropy loss} = -\sum_{i} m_i \log p_i(s, a)$$
Where m is the target probabilities from the target value distribution and p(s, a) is the
predicted probabilities from the predicted value distribution. We train our network
by minimizing the cross entropy loss.
We don't update the target network parameter θ′ in every time step. We freeze the target network parameter θ′ for several time steps, and then we copy the main network parameter θ to the target network parameter θ′. We keep repeating the preceding steps for several episodes to approximate the optimal value distribution.
To give us a more detailed understanding, the categorical DQN algorithm is given
in the next section.
3. Perform the selected action, move to the next state s′, and obtain the reward r
4. Store the transition information in the replay buffer D
5. Randomly sample a transition from the replay buffer D
6. Feed the next state s′ and the support values to the target categorical DQN parameterized by θ′ and get the probability value for each support. Then compute the Q value as Q(s′, a) = Σ_i z_i p_i(s′, a)
7. After computing the Q value, select the best action in the state s′ as the one that has the maximum Q value, a* = arg max_a Q(s′, a)
8. Initialize the array m with zero values, with its shape as the number of support
9. For j in the range of the number of support:
   1. Compute the target support value: τ̂z_j = [r_t + γ_t z_j], clipped to the range [V_min, V_max]
   2. Compute the value of b: b_j = (τ̂z_j − V_min) / Δz
Now, let's implement the categorical DQN algorithm. First, import the necessary libraries:
import numpy as np
import random
from collections import deque
import math
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
import gym
from tensorflow.python.framework import ops
v_min = 0
v_max = 1000
atoms = 51
gamma = 0.99
batch_size = 64
Set the time step at which we want to update the target network:
update_target_net = 50
epsilon = 0.5
buffer_length = 20000
replay_buffer = deque(maxlen=buffer_length)
def sample_transitions(batch_size):
    batch = np.random.permutation(len(replay_buffer))[:batch_size]
    trans = np.array(replay_buffer)[batch]
    return trans
For a clear understanding, let's take a look into the code line by line:
class Categorical_DQN():
def __init__(self,env):
self.sess = tf.InteractiveSession()
self.v_max = v_max
self.v_min = v_min
self.atoms = atoms
self.epsilon = epsilon
self.state_shape = env.observation_space.shape
self.action_shape = env.action_space.n
self.time_step = 0
target_state_shape = [1]
target_state_shape.extend(self.state_shape)
self.state_ph = tf.placeholder(tf.float32,target_state_shape)
self.action_ph = tf.placeholder(tf.int32,[1,1])
Define the placeholder for the m value (the distributed probability of the target
distribution):
self.m_ph = tf.placeholder(tf.float32,[self.atoms])
self.build_categorical_DQN()
self.sess.run(tf.global_variables_initializer())
with tf.variable_scope('conv1'):
    conv1 = conv(state, [5, 5, 3, 6], [6], [1, 2, 2, 1], weights, bias)
with tf.variable_scope('conv2'):
    conv2 = conv(conv1, [3, 3, 6, 12], [12], [1, 2, 2, 1], weights, bias)
Flatten the feature maps obtained as a result of the second convolutional layer:
with tf.variable_scope('flatten'):
    flatten = tf.layers.flatten(conv2)
with tf.variable_scope('dense1'):
    dense1 = dense(flatten, units_1, [units_1], weights, bias)
with tf.variable_scope('dense2'):
    dense2 = dense(dense1, units_2, [units_2], weights, bias)
with tf.variable_scope('concat'):
    concatenated = tf.concat([dense2, tf.cast(action, tf.float32)], 1)
Define the third layer and apply the softmax function to the result of the third layer to obtain the probabilities for each of the atoms:
with tf.variable_scope('dense3'):
    dense3 = dense(concatenated, self.atoms, [self.atoms], weights, bias)
return tf.nn.softmax(dense3)
Now, let's define a function called build_categorical_DQN for building the main and target categorical DQNs:
def build_categorical_DQN(self):
    with tf.variable_scope('main_net'):
        name = ['main_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
        weights = tf.random_uniform_initializer(-0.1, 0.1)
        bias = tf.constant_initializer(0.1)
        self.main_p = self.build_network(self.state_ph, self.action_ph, name, 24, 24, weights, bias)
    with tf.variable_scope('target_net'):
        name = ['target_net_params', tf.GraphKeys.GLOBAL_VARIABLES]
        weights = tf.random_uniform_initializer(-0.1, 0.1)
        bias = tf.constant_initializer(0.1)
        self.target_p = self.build_network(self.state_ph, self.action_ph, name, 24, 24, weights, bias)
Compute the main Q value with the probabilities obtained from the main categorical DQN as Q(s, a) = Σ_i z_i p_i(s, a):
    self.main_Q = tf.reduce_sum(self.main_p * self.z)
Similarly, compute the target Q value with the probabilities obtained from the target categorical DQN as Q(s′, a) = Σ_i z_i p_i(s′, a):
    self.target_Q = tf.reduce_sum(self.target_p * self.z)
Define the cross entropy loss as Cross entropy = −Σ_i m_i log p_i(s, a):
    self.cross_entropy_loss = -tf.reduce_sum(self.m_ph * tf.log(self.main_p))
Define the optimizer and minimize the cross entropy loss using the Adam optimizer:
    self.optimizer = tf.train.AdamOptimizer(0.01).minimize(self.cross_entropy_loss)
    main_net_params = tf.get_collection("main_net_params")
    target_net_params = tf.get_collection('target_net_params')
Define the update_target_net operation for updating the target network parameters by copying the parameters of the main network:
def train(self, s, r, action, s_, gamma):
    self.time_step += 1
Compute the target Q value for every action in the next state:
    list_q_ = [self.sess.run(self.target_Q, feed_dict={self.state_ph: [s_], self.action_ph: [[a]]}) for a in range(self.action_shape)]
Select the next state action a′ as the one that has the maximum Q value:
    a_ = tf.argmax(list_q_).eval()
Initialize an array m with its shape as the number of support with zero values. m denotes the distributed probability of the target distribution after the projection step:
    m = np.zeros(self.atoms)
[ 580 ]
Chapter 14
Get the probability for each atom using the target categorical DQN:
p = self.sess.run(self.target_p,feed_dict = {self.state_
ph:[s_],self.action_ph:[[a_]]})[0]
for j in range(self.atoms):
Tz = min(self.v_max,max(self.v_min,r+gamma * self.z[j]))
bj = (Tz - self.v_min) / self.delta_z
l,u = math.floor(bj),math.ceil(bj)
pj = p[j]
m[int(l)] += pj * (u - bj)
m[int(u)] += pj * (bj - l)
self.sess.run(self.optimizer,feed_dict={self.state_ph:[s] ,
self.action_ph:[action], self.m_ph: m })
Update the target network parameters by copying the main network parameters:
if self.time_step % update_target_net == 0:
self.sess.run(self.update_target_net)
def select_action(self,s):
We generate a random number, and if the number is less than epsilon we select the
random action, else we select the action that has the maximum Q value:
[ 581 ]
Distributional Reinforcement Learning
env = gym.make("Tennis-v0")
agent = Categorical_DQN(env)
num_episodes = 800
for i in range(num_episodes):
done = False
Return = 0
state = env.reset()
env.render()
[ 582 ]
Chapter 14
Select an action:
action = agent.select_action(state)
If the length of the replay buffer is greater than or equal to the buffer size then start
training the network by sampling transitions from the replay buffer:
state = next_state
Now that we have learned how a categorical DQN works and how to implement it,
in the next section, we will learn about another interesting algorithm.
[ 583 ]
Distributional Reinforcement Learning
Math essentials
Before going ahead, let's recap two important concepts that we use in QR-DQN:
• Quantile
• Inverse cumulative distribution function (Inverse CDF)
Quantile
When we divide our distribution into equal areas of probability, they are called
quantiles. For instance, as Figure 14.18 shows, we have divided our distribution into
two equal areas of probabilities and we have two quantiles with 50% probability
each:
[ 584 ]
Chapter 14
In the preceding plot, 𝜏𝜏𝑖𝑖 represents the cumulative probability, that is, 𝜏𝜏𝑖𝑖 = 𝐹𝐹𝐹𝐹𝐹𝑖𝑖 ).
Say i =1, then 𝜏𝜏1 = 𝐹𝐹(1) = 0.3.
The CDF takes x as an input and returns the cumulative probability 𝜏𝜏 . Hence, we
can write:
𝜏𝜏 𝜏 𝜏𝜏𝜏𝜏𝜏𝜏
Say x = 2, then we get 𝜏𝜏 𝜏 𝜏𝜏𝜏.
Now, we will look at the inverse CDF. Inverse CDF, as the name suggests, is the
inverse of the CDF. That is, in CDF, given the support x, we obtain the cumulative
probability 𝜏𝜏, whereas in inverse CDF, given the cumulative probability 𝜏𝜏, we obtain
the support x. Inverse CDF can be expressed as:
𝑥𝑥 𝑥 𝑥𝑥 −1 (𝜏𝜏𝜏
[ 585 ]
Distributional Reinforcement Learning
As shown in Figure 14.20, given the cumulative probability 𝜏𝜏, we obtain the
support x.
We have learned that the quantiles are equally divided probabilities. As Figure 14.20
shows, we have three quantiles q1 to q3 with equally divided probabilities and the
quantile values are [0.3,0.6,1.0], which are just our cumulative probabilities. Hence,
we can say that the inverse CDF (quantile function) helps us to obtain the value
of support given the equally divided probabilities. Note that in inverse CDF, the
support should always be increasing as it is based on the cumulative probability.
Now that we have learned what the quantile function is, we will gain an
understanding of how we can make use of the quantile function in the distributional
RL setting using an algorithm called QR-DQN.
Understanding QR-DQN
In categorical DQN (C51), we learned that in order to predict the value distribution,
the network takes the support of the distribution as input and returns the
probabilities.
[ 586 ]
Chapter 14
To compute the support, we also need to decide the number of support N, the
minimum value of support 𝑉𝑉min, and the maximum value of support 𝑉𝑉max .
If you recollect in C51, our support values are equally spaced at fixed locations
(𝑧𝑧1 , 𝑧𝑧2 , … , 𝑧𝑧𝑛𝑛 ) and we feed this equally spaced support as input and obtained the
non-uniform probabilities (𝑝𝑝1 , 𝑝𝑝2 , … , 𝑝𝑝𝑛𝑛 ). As Figure 14.21 shows, in C51, we feed the
equally spaced support (𝑧𝑧1 , 𝑧𝑧2 , 𝑧𝑧3 , 𝑧𝑧4 ) as input to the network along with the state(s)
and obtain the non-uniform probabilities (𝑝𝑝1 , 𝑝𝑝2 , 𝑝𝑝3 , 𝑝𝑝4 ) as output:
QR-DQN can be viewed just as the opposite of C51. In QR-DQN, to estimate the
value distribution, we feed the uniform probabilities (𝑝𝑝1 , 𝑝𝑝2 , … , 𝑝𝑝𝑛𝑛 ) and the network
outputs the supports at variable locations (𝑧𝑧1 , 𝑧𝑧2 , … , 𝑧𝑧𝑛𝑛 ). As shown in the following
figure, we feed the uniform probabilities (𝑝𝑝1 , 𝑝𝑝2 , 𝑝𝑝3 , 𝑝𝑝4 ) as input to the network along
with the state(s) and obtain the support (𝑧𝑧1 , 𝑧𝑧2 , 𝑧𝑧3 , 𝑧𝑧4 ) placed at variable locations as
output:
Thus, from the two preceding figures we can observe that, in a categorical DQN,
along with the state, we feed the fixed support at equally spaced intervals as input
to the network and it returns the non-uniform probabilities, whereas in a QR-DQN,
along with the state, we feed the fixed uniform probabilities as input to the network
and it returns the support at variable locations (unequally spaced support).
[ 587 ]
Distributional Reinforcement Learning
Okay, but what's the use of this? How does a QR-DQN work exactly? Let's explore
this in detail.
We understood that a QR-DQN takes the uniform probabilities as input and returns
the support values for estimating the value distribution. Can we make use of the
quantile function to estimate the value distribution? Yes! We learned that the
quantile function helps us to obtain the values of support given the equally divided
probabilities. Thus, in QR-DQN, we estimate the value distribution by estimating the
quantile function.
𝑧𝑧 𝑧 𝑧𝑧 −1 (𝜏𝜏𝜏
Where z is the support and 𝜏𝜏 is the equally divided cumulative probability. Thus,
we can obtain the support z given 𝜏𝜏.
Let N be the number of quantiles, then the probability can be obtained as:
1
𝑝𝑝𝑖𝑖 = for 𝑖𝑖 𝑖𝑖𝑖 𝑖𝑖 𝑖 𝑖 𝑖𝑖
𝑁𝑁
For example, if N = 4, then p = [0.25, 0.25. 0.25, 0.25]. If N = 5, then p = [0.20, 0.20,
0.20, 0.20, 0.20].
𝑖𝑖
𝜏𝜏𝑖𝑖 = for 𝑖𝑖 𝑖 𝑖𝑖 𝑖 𝑖 𝑖𝑖
𝑁𝑁
For example, if N = 4, then 𝜏𝜏 𝜏 𝜏𝜏𝜏25, 0.50, 0.75, 1.0]. If N = 5, then
𝜏𝜏 𝜏 𝜏𝜏𝜏𝜏𝜏 𝜏𝜏𝜏𝜏 𝜏𝜏𝜏𝜏 𝜏𝜏𝜏𝜏 𝜏𝜏𝜏𝜏.
We just feed this equally divided cumulative probability 𝜏𝜏 (quantile values) as
input to the QR-DQN and it returns the support value. That is, we have learned
that the QR-DQN estimates the value distribution as the quantile function, so we
just feed the 𝜏𝜏 and obtain the support values z of the value distribution.
Let's understand this with a simple example. Say we are in a state s and we have
two possible actions up and down to perform in the state. As shown in the following
figure, along with giving the state s as input to the network, we also feed the quantile
value 𝜏𝜏, which is just the equally divided cumulative probability. Then our network
returns the support for the distribution of action up and the distribution of action
down:
[ 588 ]
Chapter 14
If you recollect, in C51, we computed the probability p(s, a) for the given state and
action, whereas here in QR-DQN, we compute the support z(s, a) for the given state
and action.
Similarly, we can also compute the target value distribution using the quantile
function. Then we train our network by minimizing the distance between the
predicted quantile and the target quantile distribution.
Still, the fundamental question is why are we doing this? How it is more beneficial
than C51? There are several advantages of quantile regression DQN over categorical
DQN. In quantile regression DQN:
• We don't have to choose the number of supports and the bounds of support,
which is 𝑉𝑉min and 𝑉𝑉max .
• There are no limitations on the bounds of support, thus the range of returns
can vary across states.
• We can also get rid of the projection step that we performed in the C51 to
match the supports of the target and predicted distribution.
[ 589 ]
Distributional Reinforcement Learning
Okay, what exactly is the p-Wasserstein distance? The p-Wasserstein distance, Wp, is
characterized as the Lp metric on inverse CDF. Say we have two distributions U and
V, then the p-Wasserstein metric between these two distributions is given as:
1
1 𝑝𝑝
𝑝𝑝
𝑊𝑊𝑝𝑝 (𝑈𝑈𝑈 𝑈𝑈) = (∫ |𝐹𝐹−1 −1
𝑉𝑉 (𝜔𝜔) − 𝐹𝐹𝑈𝑈 (𝜔𝜔𝜔𝜔 𝑑𝑑𝑑𝑑)
0
Where 𝐹𝐹−1 −1
𝑈𝑈 (𝜔𝜔𝜔 and 𝐹𝐹𝑉𝑉 (𝜔𝜔𝜔 denote the inverse CDF of the distributions U and V
respectively. Thus, minimizing the distance between two inverse CDFs implies that
we minimize the Wasserstein distance.
The authors of the QR-DQN paper (see the Further reading section for more details)
also highlighted that instead of computing the support for the quantile values 𝜏𝜏,
they suggest using the quantile midpoint values 𝜏𝜏̂ . The quantile midpoint can be
computed as:
𝜏𝜏𝑖𝑖𝑖𝑖 + 𝜏𝜏𝑖𝑖
𝜏𝜏𝜏𝑖𝑖 =
2
That is, the value of the support z can be obtained using quantile midpoint values as
𝑧𝑧 𝑧 𝑧 𝑧𝑧 −1 (𝜏𝜏𝜏 𝜏 instead of obtaining support using the quantile values as 𝑧𝑧 𝑧 𝑧 𝑧𝑧 −1 (𝜏𝜏𝜏.
But why the quantile midpoint? The quantile midpoint acts as a unique minimizer,
that is, the Wasserstein distance between two inverse CDFs will be less when we
use quantile midpoint values 𝜏𝜏̂ instead of quantile values 𝜏𝜏. Since we are trying to
minimize the Wasserstein distance between the target and predicted distribution,
we can use quantile midpoints 𝜏𝜏̂ so that the distance between them will be less. For
instance, as Figure 14.24 shows, the Wasserstein distance is less when we use the
quantile midpoint values 𝜏𝜏̂ instead of quantile values 𝜏𝜏:
[ 590 ]
Chapter 14
Source (https://arxiv.org/pdf/1710.10044.pdf)
Action selection
Action selection in QR-DQN is just the same as in C51. First, we extract Q value from
the predicted value distribution and then we select the action as the one that has
the maximum Q value. We can extract the Q value by just taking the expectation of
the value distribution. The expectation of distribution is given as a sum of support
multiplied by their corresponding probability.
Where pi(s, a) is the probability given by the network for state s and action a and zi is
the support.
[ 591 ]
Distributional Reinforcement Learning
Whereas in a QR-DQN, our network outputs the support instead of the probability.
So, the Q value in the QR-DQN can be computed as:
Where zi(s, a) is the support given by the network for state s and action a and pi is
the probability.
After computing the Q value, we select the action that has the maximum Q value.
For instance, let's say, we have a state s and two actions in the state, let them be up
and down. The Q value for action up in the state s is computed as:
After computing the Q value, we select the optimal action as the one that has the
maximum Q value:
Now that we have learned how to select actions in QR-DQN, in the next section, we
will look into the loss function of QR-DQN.
Loss function
In C51, we used cross entropy loss as our loss function because our network predicts
the probability of the value distribution. So we used the cross entropy loss to
minimize the probabilities between the target and predicted distribution. But in QR-
DQN, we predict the support of the distribution instead of the probabilities. That is,
in QR-DQN, we feed the probabilities as input and predict the support as output. So,
how can we define the loss function for a QR-DQN?
We can use the quantile regression loss to minimize the distance between the target
support and the predicted support. But first, let's understand how to calculate the
target support value.
[ 592 ]
Chapter 14
Before going ahead, let's recall how we compute the target value in a DQN. In DQN,
we use the Bellman equation and compute the target value as:
𝑦𝑦 𝑦 𝑦𝑦 𝑦 𝑦𝑦 max
′
𝑄𝑄 𝑄𝑄𝑄 ′ , 𝑎𝑎′ )
𝑎𝑎
In the preceding equation, we select action 𝑎𝑎′ by taking the maximum Q value over
all possible next state-action pairs.
Similarly, in QR-DQN, to compute the target value, we can use the distributional
Bellman equation. The distributional Bellman equation can be given as:
Let's say the target support value is [1, 5, 10, 15, 20] and the predicted support
value is [100, 5, 10, 15, 20]. As we can see, our predicted support has a very high
value in the initial quantile and then it is decreasing. In the inverse CDF section, we
learned that support should always be increasing as it is based on the cumulative
probability. But if you look at the predicted values the support starts from 100 and
then decreases.
Let's consider another case. Suppose the target support value is [1, 5, 10, 15, 20] and
the predicted support value is [1, 5, 10, 15, 4]. As we can see, our predicted support
value is increasing from the initial quantile and then it is decreasing to 4 in the final
quantile. But this should not happen. Since we are using inverse CDF, our support
values should always be increasing.
[ 593 ]
Distributional Reinforcement Learning
Thus, we need to make sure that our support should be increasing and not
decreasing. So, if the initial quantile values are overestimated with high values and
if the later quantile values are underestimated with low values, we can penalize
them. That is, we multiply the overestimated value by 𝜏𝜏 and the underestimated
value by (𝜏𝜏 𝜏 𝜏𝜏. Okay, how can we determine if the value is overestimated or
underestimated?
First, we compute the difference between the target and the predicted value. Let u
be the difference between the target support value and the predicted support value.
Then, if the value of u is less than 0, we multiply u by (𝜏𝜏 𝜏 𝜏𝜏, else we multiply u by
𝜏𝜏. This is known as quantile regression loss.
But the problem with quantile regression loss is that it will not be smooth at 0 and
it makes the gradient stay constant. So, instead of using quantile regression loss, we
use a new modified version of loss called quantile Huber loss.
To understand how exactly quantile Huber loss works, first, let's look into the Huber
loss. Let's denote the difference between our actual and predicted values as u. Then
the Huber loss ℒ𝜅𝜅 (𝑢𝑢) can be given as:
#absolute value of u
abs_u = abs(u)
[ 594 ]
Chapter 14
#Loss is the quadratic loss where the absolute value is less than
kappa
#else it is linear loss
return loss
Now that we have understood what the Huber loss ℒ𝜅𝜅 (𝑢𝑢) is, let's look into the
quantile Huber loss. In the quantile Huber loss, when the value of u (the difference
between target and predicted support) is less than 0, then we multiply the Huber loss
ℒ𝜅𝜅 (𝑢𝑢) by 1 − 𝜏𝜏, and when the value of u is greater than or equal to 0, we multiply the
Huber loss ℒ𝜅𝜅 (𝑢𝑢) by 𝜏𝜏.
Now that we have understood how a QR-DQN works, in the next section, we will
look into another interesting algorithm called D4PG.
D4PG works just like DDPG but in the critic network, instead of using a DQN for
estimating the Q function, we can use our distributional DQN to estimate the value
distribution. That is, in the previous sections, we have learned several distributional
DQN algorithms, such as C51 and QR-DQN. So, in the critic network, instead of
using a regular DQN, we can use any distributional DQN algorithm, say C51.
[ 595 ]
Distributional Reinforcement Learning
Apart from this, D4PG also proposes several changes to the DDPG architecture. So,
we will get into the details and learn how exactly D4PG differs from DDPG. Before
going ahead, let's be clear with the notation:
Now, we will understand how exactly the critic and actor network in D4PG works.
Critic network
In DDPG, we learned that we use the critic network to estimate the Q function. Thus,
given a state and action, the critic network estimates the Q function as 𝑄𝑄𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠𝑠. To
train the critic network we minimize the MSE between the target Q value given by
the Bellman optimality equation and the Q value predicted by the network.
Once we compute the target value, we compute the loss as the MSE between the
target value and the predicted value as:
1 2
𝐽𝐽(𝜃𝜃) = ∑(𝑦𝑦𝑖𝑖 − 𝑄𝑄𝜃𝜃 (𝑠𝑠𝑖𝑖 , 𝑎𝑎𝑖𝑖 ))
𝐾𝐾
𝑖𝑖
Where K denotes the number of transitions randomly sampled from the replay
buffer. After computing the loss, we compute the gradients ∇𝜃𝜃 𝐽𝐽(𝜃𝜃) and update the
critic network parameter using gradient descent:
𝜃𝜃 𝜃 𝜃𝜃 𝜃 𝜃𝜃𝜃𝜃𝜃 𝐽𝐽(𝜃𝜃)
[ 596 ]
Chapter 14
Now, let's talk about the critic in D4PG. As we learned in D4PG, we use the
distributional DQN to estimate the Q value. Thus, given a state and action, the critic
network estimates the value distribution as 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠𝑠.
To train the critic network, we minimize the distance between the target value
distribution given by the distributional Bellman equation and the value distribution
predicted by the network.
𝑁𝑁𝑁𝑁
𝑦𝑦𝑖𝑖 = ( ∑ 𝛾𝛾𝑛𝑛 𝑟𝑟𝑖𝑖𝑖𝑖𝑖 ) + 𝛾𝛾𝑁𝑁 𝑍𝑍𝜃𝜃′ (𝑠𝑠𝑖𝑖′ +𝑁𝑁 , 𝜇𝜇𝜃𝜃′ (𝑠𝑠′𝑖𝑖𝑖𝑖𝑖 ))
𝑛𝑛𝑛𝑛
Where N is the length of the transition, which we sample from the replay buffer.
After computing the target value distribution, we can compute the distance between
the target value distribution and the predicted value distribution as:
1
𝐽𝐽(𝜃𝜃) = ∑ 𝑑𝑑(𝑦𝑦𝑖𝑖 , 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠))
𝐾𝐾
𝑖𝑖
Where d denotes any distance measure for measuring the distance between two
distributions. Say we are using C51, then d denotes the cross entropy and K denotes
the number of transitions sampled from the replay buffer. After computing the loss,
we calculate the gradients and update the critic network parameter. The gradients
can be computed as:
1
∇𝜃𝜃 𝐽𝐽(𝜃𝜃) = ∑ ∇𝜃𝜃 𝑑𝑑 (𝑦𝑦𝑖𝑖 , 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠))
𝐾𝐾
𝑖𝑖
[ 597 ]
Distributional Reinforcement Learning
1
∇𝜃𝜃 𝐽𝐽(𝜃𝜃) = ∑ ∇𝜃𝜃 (𝑅𝑅𝑅𝑅𝑖𝑖 )−1 𝑑𝑑 (𝑦𝑦𝑖𝑖 , 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠))
𝐾𝐾
𝑖𝑖
After computing the gradient, we can update the critic network parameter using
gradient descent as 𝜃𝜃 𝜃 𝜃𝜃 𝜃 𝜃𝜃𝜃𝜃𝜃 𝐽𝐽(𝜃𝜃). Now that we have understood how the critic
network works in D4PG, let's look into the actor network in the next section.
Actor network
First, let's quickly recap how the actor network in DDPG works. In DDPG, we
learned that the actor network takes the state as input and returns the action:
𝑎𝑎 𝑎 𝑎𝑎𝜙𝜙 (𝑠𝑠)
Note that we are using the deterministic policy in the continuous action space,
and to explore new actions we just add some noise 𝒩𝒩 to the action produced by
the actor network since the action is a continuous value.
𝑎𝑎 𝑎 𝑎𝑎𝜙𝜙 (𝑠𝑠) + 𝒩𝒩
Thus, the objective function of the actor is to generate an action that maximizes
the Q value produced by the citric network:
1
𝐽𝐽(𝜙𝜙) = ∑ 𝑄𝑄𝜃𝜃 (𝑠𝑠𝑖𝑖 , 𝑎𝑎𝑎
𝐾𝐾
𝑖𝑖
[ 598 ]
Chapter 14
We learned that to maximize the objective, we compute the gradients of our objective
function ∇𝜙𝜙 𝐽𝐽(𝜙𝜙) and update the actor network parameter by performing gradient
ascent.
Now let's come to D4PG. In D4PG we perform the same steps with a little difference.
Note that here we are not using the Q function in the critic. Instead, we are
computing the value distribution and thus our objective function becomes:
1
𝐽𝐽(𝜙𝜙) = ∑ 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑖𝑖 , 𝑎𝑎𝑎
𝐾𝐾
𝑖𝑖
Where the action, 𝑎𝑎 𝑎 𝑎𝑎𝜙𝜙 (𝑠𝑠𝑖𝑖 ) and just like we saw in DDPG, to maximize the
objective, first, we compute the gradients of our objective function ∇𝜙𝜙 𝐽𝐽(𝜙𝜙). After
computing the gradients we update the actor network parameter by performing
gradient ascent:
𝜙𝜙 𝜙 𝜙𝜙 𝜙 𝜙𝜙𝜙𝜙𝜙 𝐽𝐽(𝜙𝜙)
We learned that D4PG is a distributed algorithm, meaning that instead of using one
actor, we use L number of actors, each of which acts parallel and is independent of
the environment, collects experience, and stores the experience in the replay buffer.
Then we update the network parameter to the actors periodically.
1. We use the distributional DQN in the critic network instead of using the
regular DQN to estimate the Q values.
2. We calculate N-step returns in the target instead of calculating the one-step
return.
3. We use a prioritized experience replay and add importance to the gradient
update in the critic network.
4. Instead of using one actor, we use L independent actors, each of which acts
in parallel, collects experience, and stores the experience in the replay buffer.
Now that we have understood how D4PG works, putting together all the concepts
we have learned, let's look into the algorithm of D4PG in the next section.
[ 599 ]
Distributional Reinforcement Learning
Algorithm – D4PG
Let 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 denote the time steps at which we want to update the target critic and
actor network parameters. We set 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 = 2, which states that we update the target
critic network and target actor network parameter for every 2 steps of the episode.
Similarly, let 𝑡𝑡𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 denote the time steps at which we want to replicate the network
weights to the L actors. We set 𝑡𝑡𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = 2, which states that we replicate the network
weights to the actors on every 2 steps of the episode.
𝑁𝑁𝑁𝑁
𝑦𝑦𝑖𝑖 = ( ∑ 𝛾𝛾𝑛𝑛 𝑟𝑟𝑖𝑖𝑖𝑖𝑖 ) + 𝛾𝛾𝑁𝑁 𝑍𝑍𝜃𝜃′ (𝑠𝑠𝑖𝑖′ +𝑁𝑁 , 𝜇𝜇𝜃𝜃′ (𝑠𝑠′𝑖𝑖𝑖𝑖𝑖 ))
𝑛𝑛𝑛𝑛
3. Compute the loss of the critic network and calculate the gradient as
1
∇𝜃𝜃 𝐽𝐽(𝜃𝜃) = ∑ ∇𝜃𝜃 (𝑅𝑅𝑅𝑅𝑖𝑖 )−1 𝑑𝑑 (𝑦𝑦𝑖𝑖 , 𝑍𝑍𝜃𝜃 (𝑠𝑠𝑠 𝑠𝑠))
𝐾𝐾
𝑖𝑖
[ 600 ]
Chapter 14
Update the target critic and target actor network parameter using
soft replacement as 𝜃𝜃 ′ = 𝜏𝜏𝜏𝜏 + (1 − 𝜏𝜏)𝜃𝜃𝜃 and 𝜙𝜙 ′ = 𝜏𝜏𝜏𝜏 + (1 − 𝜏𝜏)𝜙𝜙 ′
respectively
8. If t mod 𝑡𝑡𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = 0, then:
Replicate the network weights to the actors
1. Select action a based on the policy 𝜇𝜇𝜙𝜙 (𝑠𝑠) and exploration noise, that is,
𝑎𝑎 𝑎 𝑎𝑎𝜙𝜙 (𝑠𝑠) + 𝒩𝒩
2. Perform the selected action a, move to the next state 𝑠𝑠 ′, get the reward r,
and store the transition information in the replay buffer 𝒟𝒟
3. Repeat steps 1 to 2 until the learner finishes
Summary
We started the chapter by understanding how distributional reinforcement learning
works. We learned that in distributional reinforcement learning, instead of selecting
an action based on the expected return, we select the action based on the distribution
of return, which is often called the value distribution or return distribution.
Next, we learned about the categorical DQN algorithm, also known as C51, where
we feed the state and support of the distribution as the input and the network
returns the probabilities of the value distribution. We also learned how the projection
step matches the support of the target and predicted the value distribution so that
we can apply the cross entropy loss.
Going ahead, we learned about quantile regression DQNs, where we feed the state
and also the equally divided cumulative probabilities 𝜏𝜏 as input to the network and
it returns the support value of the distribution.
At the end of the chapter, we learned about how D4PG works, and we also learned
how it varies from DDPG.
[ 601 ]
Distributional Reinforcement Learning
Questions
Let's test our knowledge of distributional reinforcement learning by answering the
following questions:
Further reading
For more information, refer to the following papers:
[ 602 ]
Imitation Learning
15
and Inverse RL
Learning from demonstration is often called imitation learning. In the imitation
learning setting, we have expert demonstrations and train our agent to mimic those
expert demonstrations. Learning from demonstrations has many benefits, including
helping an agent to learn more quickly. There are several approaches to perform
imitation learning, and two of them are supervised imitation learning and Inverse
Reinforcement Learning (IRL).
First, we will understand how we can perform imitation learning using supervised
learning, and then we will learn about an algorithm called Dataset Aggregation
(DAgger). Next, we will learn how to use demonstration data in a DQN using an
algorithm called Deep Q Learning from Demonstrations (DQfD).
Moving on, we will learn about IRL and how it differs from reinforcement learning.
We will learn about one of the most popular IRL algorithms called maximum
entropy IRL. Toward the end of the chapter, we will understand how Generative
Adversarial Imitation Learning (GAIL) works.
[ 603 ]
Imitation Learning and Inverse RL
Let's begin our chapter by understanding how supervised imitation learning works.
We can train an agent to mimic the actions performed by the expert in various
respective states. Thus, we can view expert demonstrations as training data used to
train our agent. The fundamental idea of imitation learning is to imitate (learn) the
behavior of an expert.
One of the simplest and most naive ways to perform imitation learning is to treat
the imitation learning task as a supervised learning task. First, we collect a set of
expert demonstrations, and then we train a classifier to perform the same action
performed by the expert in the respective states. We can view this as a big multiclass
classification problem and train our agent to perform the action performed by the
expert in the respective states.
Our goal is to minimize the loss 𝐿𝐿𝐿𝐿𝐿 ∗ , 𝜋𝜋𝜃𝜃 (𝑠𝑠𝑠𝑠 where 𝑎𝑎 ∗ is the expert action and 𝜋𝜋𝜃𝜃 (𝑠𝑠𝑠
denotes the action performed by our agent.
However, there exist several challenges and drawbacks with this method. The
knowledge of the agent is limited only to the expert demonstrations (training
data), so if the agent comes across a new state that is not present in the expert
demonstrations, then the agent will not know what action to perform in that state.
Say, we train an agent to drive a car using supervised imitation learning and let the
agent perform in the real world. If the training data has no state where the agent
encounters a traffic signal, then our agent will have no clue about the traffic signal.
[ 604 ]
Chapter 15
Also, the accuracy of the agent is highly dependent on the knowledge of the expert.
If the expert demonstrations are poor or not optimal, then the agent cannot learn
correct actions or the optimal policy.
DAgger
DAgger is one of the most-used imitation learning algorithms. Let's understand how
DAgger works with an example. Let's revisit our example of training an agent to
drive a car. First, we initialize an empty dataset 𝒟𝒟.
In the first iteration, we start off with some policy 𝜋𝜋1 to drive the car. Thus, we
generate a trajectory 𝜏𝜏 using the policy 𝜋𝜋1. We know that the trajectory consists of
a sequence of states and actions—that is, states visited by our policy 𝜋𝜋1 and actions
made in those states using our policy 𝜋𝜋1. Now, we create a new dataset 𝒟𝒟1 by taking
only the states visited by our policy 𝜋𝜋1 and we use an expert to provide the actions
for those states. That is, we take all the states from the trajectory and ask the expert to
provide actions for those states.
Now, we combine the new dataset 𝒟𝒟1 with our initialized empty dataset 𝒟𝒟 and
update 𝒟𝒟 as:
𝒟𝒟 𝒟 𝒟𝒟𝒟𝒟 𝒟 𝒟 𝒟𝒟1
Next, we train a classifier on this updated dataset 𝒟𝒟 and learn a new policy 𝜋𝜋2.
In the second iteration, we use the new policy 𝜋𝜋2 to generate trajectories, create a
new dataset 𝒟𝒟2 by taking only the states visited by the new policy 𝜋𝜋2, and ask the
expert to provide the actions for those states.
𝒟𝒟 𝒟 𝒟𝒟𝒟𝒟 𝒟 𝒟 𝒟𝒟2
Next, we train a classifier on this updated dataset 𝒟𝒟 and learn a new policy 𝜋𝜋3.
In the third iteration, we use the new policy 𝜋𝜋3 to generate trajectories and create a
new dataset 𝒟𝒟3 by taking only the states visited by the new policy 𝜋𝜋3, and then we
ask the expert to provide the actions for those states.
[ 605 ]
Imitation Learning and Inverse RL
𝒟𝒟 𝒟 𝒟𝒟𝒟𝒟 𝒟 𝒟 𝒟𝒟3
Next, we train a classifier on this updated dataset 𝒟𝒟 and learn a new policy 𝜋𝜋4 . In
this way, DAgger works in a series of iterations until it finds the optimal policy.
Now that we have a basic understanding of Dagger; let's go into more detail and
learn how DAgger finds the optimal policy.
Understanding DAgger
Let's suppose we have a human expert, and let's denote the expert policy with 𝜋𝜋𝐸𝐸.
We initialize an empty dataset 𝒟𝒟 and also a novice policy 𝜋𝜋𝜋1.
Iteration 1:
𝛽𝛽𝑖𝑖 = 𝑝𝑝𝑖𝑖𝑖𝑖
The value of p is chosen between 0.1 and 0.9. Since we are in iteration 1, substituting
i = 1, we can write:
𝛽𝛽1 = 𝑝𝑝0 = 1
Thus, substituting 𝛽𝛽1 = 1 in equation (1), we can write:
𝜋𝜋1 = 𝜋𝜋𝐸𝐸
As we can observe, in the first iteration, the policy 𝜋𝜋1 is just an expert policy 𝜋𝜋𝐸𝐸 .
Now, we use this policy 𝜋𝜋1 and generate trajectories. Next, we create a new dataset
𝒟𝒟1 by collecting all the states visited by our policy 𝜋𝜋1 and ask the expert to provide
actions of those states. So, our dataset will consist of 𝒟𝒟1 = {(𝑠𝑠𝑠 𝑠𝑠𝐸𝐸 (𝑠𝑠𝑠𝑠𝑠.
Now, we combine the dataset 𝒟𝒟1 with our initialized empty dataset 𝒟𝒟 and update
𝒟𝒟 as:
𝒟𝒟 𝒟 𝒟𝒟𝒟𝒟 𝒟 𝒟 𝒟𝒟1
[ 606 ]
Chapter 15
Now that we have an updated dataset 𝒟𝒟, we train a classifier on this new dataset
and extract a new policy. Let the new policy be 𝜋𝜋𝜋2.
Iteration 2:
Now, we use this policy 𝜋𝜋2 and generate trajectories. Next, we create a new dataset
𝒟𝒟2 by collecting all the states visited by our policy 𝜋𝜋2 and ask the expert to provide
actions of those states. So, our dataset will consist of 𝒟𝒟2 = {(𝑠𝑠𝑠 𝑠𝑠𝐸𝐸 (𝑠𝑠𝑠𝑠𝑠.
𝒟𝒟 𝒟 𝒟𝒟𝒟𝒟 𝒟 𝒟 𝒟𝒟2
Now that we have an updated dataset 𝒟𝒟, we train a classifier on this new dataset
and extract a new policy. Let that new policy be 𝜋𝜋𝜋3.
We repeat these steps for several iterations to obtain the optimal policy. As we can
observe in each iteration, we aggregate our dataset 𝒟𝒟 and train a classifier to obtain
the new policy. Notice that the value of 𝛽𝛽 is decaying exponentially. This makes
sense as over a series of iterations, our policy will become better and so we can
reduce the importance of the expert policy.
Now that we have understood how DAgger works, in the next section, we will look
into the algorithm of DAgger for a better understanding.
Algorithm – DAgger
The algorithm of DAgger is given as follows:
[ 607 ]
Imitation Learning and Inverse RL
Now that we have learned the DAgger algorithm, in the next section, we will learn
about DQfD.
In the previous chapters, we have learned about several types of DQN. We started
off with vanilla DQN, and then we explored various improvements to the DQN,
such as double DQN, dueling DQN, prioritized experience replay, and more. In
all these methods, the agent tries to learn from scratch by interacting with the
environment. The agent interacts with the environment and stores their interaction
experience in a buffer called a replay buffer and learns based on their experience.
In order for the agent to perform better, it has to gather a lot of experience from the
environment, add it to the replay buffer, and train itself. However, this method costs
us a lot of training time. In all the previous methods we have learned so far, we have
trained our agent in a simulator, so the agent gathers experience in the simulator
environment to perform better. To learn the optimal policy, the agent has to perform
a lot of interactions with the environment, and some of these interactions give the
agent a very bad reward. This is tolerable in a simulator environment. But how can
we train the agent in a real-world environment? We can't train the agent by directly
interacting with the real-world environment and by making a lot of bad actions in
the real-world environment.
So, in those cases, we can train the agent in a simulator that corresponds to the
particular real-world environment. But the problem is that it is hard to find an
accurate simulator corresponding to the real-world environment for most use cases.
However, we can easily obtain expert demonstrations.
For instance, let's suppose we want to train an agent to play chess. Let's assume we
don't find an accurate simulator to train the agent to play chess. But we can easily
obtain good expert demonstrations of an expert playing chess.
[ 608 ]
Chapter 15
Now, can we make use of these expert demonstrations and train our agent? Yes!
Instead of learning from scratch by interacting with the environment, if we add the
expert demonstrations directly to the replay buffer and pre-train our agent based on
these expert demonstrations, then the agent can learn better and faster.
This is the fundamental idea behind DQfD. We fill the replay buffer with expert
demonstrations and pre-train the agent. Note that these expert demonstrations are
used only for pre-training the agent. Once the agent is pre-trained, the agent will
interact with the environment and gather more experience and make use of it for
learning. Thus DQfD consists of two phases, which are pre-training and training.
First, we pre-train the agent based on the expert demonstrations, and then we train
the agent by interacting with the environment. When the agent interacts with the
environment, it collects some experience, and the agent's experience (self-generated
data) also gets added to the replay buffer. The agent makes use of both the expert
demonstrations and also the self-generated data for learning. We use a prioritized
experience replay buffer and give more priority to the expert demonstrations than
the self-generated data. Now that we have a basic understanding of DQfD, let's go
into detail and learn how exactly it works.
Phases of DQfD
DQfD consists of two phases:
• Pre-training phase
• Training phase
Pre-training phase
In the pre-training phase, the agent does not interact with the environment. We
directly add the expert demonstrations to the replay buffer and the agent learns by
sampling the expert demonstrations from the replay buffer.
The agent learns from expert demonstrations by minimizing the loss J(Q) using
gradient descent. However, pre-training with expert demonstrations alone is not
sufficient for the agent to perform better because the expert demonstrations will
not contain all possible transitions. But the pretraining with expert demonstrations
acts as a good starting point to train our agent. Once the agent is pre-trained with
demonstrations, then during the training phase, the agent will perform better actions
in the environment from the initial iteration itself instead of performing random
actions, and so the agent can learn quickly.
[ 609 ]
Imitation Learning and Inverse RL
Training phase
Once the agent is pre-trained, we start the training phase, where the agent
interacts with the environment and learns based on its experience. Since the agent
has already learned some useful information from the expert demonstrations in the
pre-training phase, it will not perform random actions in the environment.
During the training phase, the agent interacts with the environment and stores its
transition information (experience) in the replay buffer. We learned that our replay
buffer will be pre-filled with the expert demonstrations data. So, now, our replay
buffer will consist of a mixture of both expert demonstrations and the agent's
experience (self-generated data). We sample a minibatch of experience from the
replay buffer and train the agent. Note that here we use a prioritized replay buffer,
so while sampling, we give more priority to the expert demonstrations than the
agent-generated data. In this way, we train the agent by sampling experience from
the replay buffer and minimize the loss using gradient descent.
We learned that the agent interacts with the environment and stores the experience
in the replay buffer. If the replay buffer is full, then we overwrite the buffer with
new transition information generated by the agent. However, we won't overwrite
the expert demonstrations. So, the expert demonstrations will always remain in the
replay buffer so that the agent can make use of expert demonstrations for learning.
Thus, we have learned how to pre-train and train an agent with expert
demonstrations. In the next section, we will learn about the loss function of DQfD.
Double DQN loss – 𝐽𝐽𝐷𝐷𝐷𝐷 (𝑄𝑄𝑄 represents the 1-step double DQN loss.
N-step double DQN loss – 𝐽𝐽𝑛𝑛 (𝑄𝑄𝑄 represents the n-step double DQN loss.
[ 610 ]
Chapter 15
Supervised classification loss – 𝐽𝐽𝐸𝐸 (𝑄𝑄𝑄 represents the supervised classification loss. It
is expressed as:
Where:
• aE is the action taken by the expert.
• l(aE, a) is known as the margin function or margin loss. It will be 0 when the
action taken is equal to the expert action a = aE; else, it is positive.
L2 regularization loss – 𝐽𝐽𝐿𝐿𝐿 (𝑄𝑄𝑄 represents the L2 regularization loss. It prevents the
agent from overfitting to the demonstration data.
Thus, the final loss function will be the sum of all the preceding four losses:
𝐽𝐽(𝑄𝑄) = 𝐽𝐽𝐷𝐷𝐷𝐷 (𝑄𝑄) + 𝜆𝜆1 𝐽𝐽𝑛𝑛 (𝑄𝑄) + 𝜆𝜆2 𝐽𝐽𝐸𝐸 (𝑄𝑄) + 𝜆𝜆3 𝐽𝐽𝐿𝐿𝐿 (𝑄𝑄)
Where the value of 𝜆𝜆 acts as a weighting factor and helps us to control the
importance of the respective loss.
Now that we have learned how DQfD works, we will look into the algorithm of
DQfD in the next section.
Algorithm – DQfD
The algorithm of DQfD is given as follows:
[ 611 ]
Imitation Learning and Inverse RL
Consider designing the reward function for tasks such as an agent learning to walk,
self-driving cars, and so on. In these cases, designing the reward function is not that
handy and involves assigning rewards to a variety of agent behaviors. For instance,
consider designing the reward function for an agent learning to drive a car. In this
case, we need to assign a reward for every behavior of the agent. For example, we
can assign a high reward if the agent follows the traffic signal, avoids pedestrians,
doesn't hit any objects, and so on. But designing the reward function in this way
is not optimal, and there is also a good chance that we might miss out on several
behaviors of an agent.
Okay, now the question is can we learn the reward function? Yes! If we have
expert demonstrations, then we can learn the reward function from the expert
demonstrations. But how can we do that exactly? Here is where IRL helps us.
As the name suggests, IRL is the inverse of reinforcement learning.
[ 612 ]
Chapter 15
In RL, we try to find the optimal policy given the reward function, but in IRL, we
try to learn the reward function given the expert demonstrations. Once we have
derived the reward function from the expert demonstrations using IRL, we can
use the reward function to train our agent to learn the optimal policy using any
reinforcement learning algorithm.
IRL consists of several interesting algorithms. In the next section, we will learn one
of the most popular IRL algorithms, called maximum entropy IRL.
Key terms
Feature vector – We can represent the state by a feature vector f. Let's say we have a
state s; its feature vector can then be defined as fs.
Feature count – Say we have a trajectory 𝜏𝜏; the feature count of the trajectory is then
defined as the sum of the feature vectors of all the states in the trajectory:
Reward function – The reward function can be defined as the linear combination
of the features, that is, the sum of feature vectors multiplied by a weight 𝜃𝜃:
[ 613 ]
Imitation Learning and Inverse RL
We know that the feature count is the sum of feature vectors of all the states in the
trajectory, so from (2), we can rewrite the preceding equation as:
We have already learned that the reward function is given as 𝑅𝑅𝜃𝜃 (𝜏𝜏) = 𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 . Finding
the optimal parameter 𝜃𝜃 helps us to learn the correct reward function. So, we will
sample a trajectory 𝜏𝜏 from the expert demonstrations 𝒟𝒟 and try to find the reward
function by finding the optimal parameter 𝜃𝜃.
1
𝐿𝐿(𝜃𝜃) = log 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝
𝑀𝑀
Where M denotes the number of demonstrations.
[ 614 ]
Chapter 15
1 exp(𝑅𝑅𝜃𝜃 (𝜏𝜏))
𝐿𝐿(𝜃𝜃) = log ( )
𝑀𝑀 𝑧𝑧
Based on the log rule, log 𝑎𝑎𝑎𝑎𝑎 𝑎 log 𝑎𝑎 𝑎 log 𝑏𝑏𝑏, we can write:
1
𝐿𝐿(𝜃𝜃) = (log exp(𝑅𝑅𝜃𝜃 (𝜏𝜏)) − log 𝑧𝑧𝑧
𝑀𝑀
The logarithmic and exponential terms cancel each other out, so the preceding
equation becomes:
1
𝐿𝐿(𝜃𝜃) = (𝑅𝑅 (𝜏𝜏) − log 𝑧𝑧𝑧
𝑀𝑀 𝜃𝜃
We know that 𝑧𝑧 𝑧 𝑧 𝑧𝑧𝑧𝑧𝑧𝑧𝑧𝜃𝜃 (𝜏𝜏𝜏𝜏; substituting the value of z, we can rewrite the
𝜏𝜏
preceding equation as:
1
𝐿𝐿(𝜃𝜃) = (𝑅𝑅 (𝜏𝜏) − log ∑ exp(𝑅𝑅𝜃𝜃 (𝜏𝜏)))
𝑀𝑀 𝜃𝜃
𝜏𝜏
We know that 𝑅𝑅𝜃𝜃 (𝜏𝜏) = 𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 ; substituting the value of 𝑅𝑅𝜃𝜃 (𝜏𝜏), our final simplified
objective function is given as:
1
𝐿𝐿(𝜃𝜃) = (𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 − log ∑ exp( 𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 ))
𝑀𝑀
𝜏𝜏
To find the optimal parameter 𝜃𝜃, we compute the gradient of the preceding objective
function ∇𝜃𝜃 𝐿𝐿𝐿𝐿𝐿𝐿 and update the value of 𝜃𝜃 as 𝜃𝜃 𝜃 𝜃𝜃 𝜃 𝜃𝜃𝜃𝜃𝜃 𝐿𝐿𝐿𝐿𝐿𝐿. In the next section,
we will learn how to compute the gradient ∇𝜃𝜃 𝐿𝐿𝐿𝐿𝐿𝐿.
1
𝐿𝐿(𝜃𝜃) = (𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 − log ∑ exp ( 𝜃𝜃 𝑇𝑇 𝑓𝑓𝜏𝜏 ))
𝑀𝑀
𝜏𝜏
[ 615 ]
Imitation Learning and Inverse RL
Now, we compute the gradient of the objective function with respect to 𝜃𝜃. After
computation, our gradient is given as:
1
∇𝜃𝜃 𝐿𝐿(𝜃𝜃) = (∑ 𝑓𝑓𝜏𝜏 − ∑ 𝑝𝑝(𝜏𝜏|𝜃𝜃)𝑓𝑓𝜏𝜏 )
𝑀𝑀
𝜏𝜏 𝜏𝜏
1 1
∇𝜃𝜃 𝐿𝐿(𝜃𝜃) = ∑ 𝑓𝑓𝜏𝜏 − ∑ 𝑝𝑝(𝜏𝜏|𝜃𝜃)𝑓𝑓𝜏𝜏
𝑀𝑀 𝑀𝑀
𝜏𝜏 𝜏𝜏
The average of the feature count is just the feature expectation 𝑓𝑓̃ , so we can
1
substitute ∑ 𝑓𝑓𝜏𝜏 = 𝑓𝑓̃ and rewrite the preceding equation as:
𝑀𝑀
𝜏𝜏
1
∇𝜃𝜃 𝐿𝐿(𝜃𝜃) = 𝑓𝑓̃ − ∑ 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 𝑓𝑓𝜏𝜏
𝑀𝑀
𝜏𝜏
We can rewrite the preceding equation by combining all the states of the trajectories
as:
Thus, using the preceding equation, we compute gradients and update the
parameter 𝜃𝜃. If you look at the preceding equation, we can easily compute the first
term, which is just the feature expectation 𝑓𝑓̃, but what about the 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 in the second
term? 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝 is called the state visitation frequency and it represents the probability
of being in a given state. Okay, how can we compute 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝?
If we have a policy 𝜋𝜋, then we can use the policy to compute the state visitation
frequency. But we don't have any policy yet. So, we can use a dynamic programming
method, say, value iteration, to compute the policy. However, in order to compute
the policy using the value iteration method, we require a reward function. So, we
just feed our reward function 𝑅𝑅𝜃𝜃 (𝜏𝜏𝜏 and extract the policy using the value iteration.
Then, using the extracted policy, we compute the state visitation frequency.
The steps involved in computing the state visitation frequency using the policy 𝜋𝜋 are
as follows:
1. Let the probability of visiting a state s at a time t be 𝜇𝜇𝑡𝑡 (𝑠𝑠𝑠. We can write
the probability of visiting the initial state s1 at a first time step t = 1 as:
𝜇𝜇1 (𝑠𝑠) = 𝑝𝑝𝑝𝑝𝑝𝑝𝑝𝑝1 )
[ 616 ]
Chapter 15
Thus, over a series of iterations, we will find an optimal parameter 𝜃𝜃. Once we have
𝜃𝜃, we can use it to define the correct reward function 𝑅𝑅𝜃𝜃 (𝜏𝜏𝜏. In the next section, we
will learn about GAIL.
In a GAN, we have two networks: one is the generator and the other is the
discriminator. The role of the generator is to generate new data points by learning
the distribution of the input dataset. The role of the discriminator is to classify
whether a given data point is generated by the generator (learned distribution) or
whether it is from the real data distribution.
[ 617 ]
Imitation Learning and Inverse RL
Minimizing the loss function of a GAN implies minimizing the Jensen Shannon (JS)
divergence between the real data distribution and the fake data distribution (learned
distribution). The JS divergence is used to measure how two probability distributions
differ from each other. Thus, when the JS divergence between the real and fake
distributions is zero, it means that the real and fake data distributions are equal, that
is, our generator network has successfully learned the real distribution.
Now, let's learn how to make use of GANs in an IRL setting. First, let's introduce a
new term called occupancy measure. It is defined as the distribution of the states
and actions that our agent comes across while exploring the environment with
some policy 𝜋𝜋. In simple terms, it is basically the distribution of state-action pairs
following a policy 𝜋𝜋. The occupancy measure of a policy 𝜋𝜋 is denoted by 𝜌𝜌𝜋𝜋 .
In the imitation learning setting, we have an expert policy, and let's denote the expert
policy by 𝜋𝜋𝐸𝐸. Similarly, let's denote the agent's policy by 𝜋𝜋𝜃𝜃. Now, our goal is to make
our agent learn the expert policy. How can we do that? If we make the occupancy
measure of the expert policy and the agent policy equal, then it implies that our
agent has successfully learned the expert policy. That is, the occupancy measure
is the distribution of the state-action pairs following a policy. If we can make the
distribution of state-action pairs of the agent's policy equal to the distribution of
state-action pairs of the expert's policy, then it means that our agent has learned the
expert policy. Let's explore how we can do this using GANs.
We can perceive the occupancy measure of the expert policy as the real data
distribution and the occupancy measure of the agent policy as the fake data
distribution. Thus, minimizing the JS divergence between the occupancy measure
of the expert policy 𝜌𝜌𝜋𝜋𝜋𝜋 and the occupancy measure of the agent policy 𝜌𝜌𝜋𝜋𝜋𝜋 implies
that the agent will learn the expert policy.
With GANs, we know that the role of the generator is to generate new data points
by learning the distribution of a given dataset. Similarly, in GAIL, the role of the
generator is to generate a new policy by learning the distribution (occupancy
measure) of the expert policy. The role of the discriminator is to classify whether
the given policy is the expert policy or the agent policy.
With GANs, we know that, for a generator, the optimal discriminator is the one that
is not able to distinguish between the real and fake data distributions; similarly, in
GAIL, the optimal discriminator is the one that is unable to distinguish whether
the generated state-action pair is from the agent policy or the expert policy.
[ 618 ]
Chapter 15
To make ourselves clear, let's understand the terms we use in GAIL by relating to
GAN terminology:
In a nutshell, we use the generator to generate the state-action pair in a way that the
discriminator is not able to distinguish whether the state-action pair is generated
using the expert policy or the agent policy. Both the generator and discriminator are
neural networks. We train the generator to generate a policy similar to the expert
policy using TRPO. The discriminator is a classifier, and it is optimized using Adam.
Thus, we can define the objective function of GAIL as:
max min 𝔼𝔼𝜋𝜋𝜃𝜃 [log(𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠))] + 𝔼𝔼𝜋𝜋𝐸𝐸 [log(1 − 𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠𝑠𝑠𝑠
𝜃𝜃 𝜔𝜔
Now that we have an understanding of how GAIL works, let's get into more detail
and learn how the preceding equation is derived.
Formulation of GAIL
In this section, we explore the math of GAIL and see how it works. You can skip this
section if you are not interested in math. We know that in reinforcement learning,
our objective is to find the optimal policy that gives the maximum reward. It can
be expressed as:
We can rewrite our objective function by adding the entropy of a policy as shown
here:
The preceding equation tells us that we can maximize the entropy of the policy
while also maximizing the reward. Instead of defining the objective function in terms
of the reward, we can also define the objective function in terms of cost.
[ 619 ]
Imitation Learning and Inverse RL
That is, we can define our RL objective function in terms of cost, as our objective is
to find an optimal policy that minimizes the cost; this can be expressed as:
Where c is the cost. Thus, given the cost function, our goal is to find the optimal
policy that minimizes the cost.
Now, let's talk about IRL. We learned that in IRL, our objective is to find the reward
function from the given set of expert demonstrations. We can also define the
objective of IRL in terms of cost instead of reward. That is, we can define our IRL
objective function in terms of cost, as our objective is to find the cost function under
which the expert demonstration is optimal. The objective can be expressed using
maximum causal entropy IRL as:
𝐼𝐼𝐼𝐼𝐼𝐼(𝜋𝜋𝐸𝐸 ) = arg max (min − 𝐻𝐻(𝜋𝜋) + 𝔼𝔼𝜋𝜋 [𝑐𝑐(𝑠𝑠𝑠 𝑠𝑠)]) − 𝔼𝔼𝜋𝜋𝐸𝐸 [𝑐𝑐(𝑠𝑠𝑠 𝑠𝑠)]
𝑐𝑐 𝜋𝜋
What does the preceding equation imply? In the IRL setting, our goal is to learn
the cost function given the expert demonstrations (expert policy). We know that
the expert policy performs better than the other policy, so we try to learn the cost
function c, which assigns low cost to the expert policy and high cost to other policies.
Thus, the preceding objective function implies that we try to find a cost function that
assigns low cost to the expert policy and high cost to other policies.
IRL𝜓𝜓 (𝜋𝜋𝐸𝐸 ) = arg max − 𝜓𝜓(𝑐𝑐) + (min − 𝐻𝐻(𝜋𝜋) + 𝔼𝔼𝜋𝜋 [𝑐𝑐(𝑠𝑠𝑠 𝑠𝑠)]) − 𝔼𝔼𝜋𝜋𝐸𝐸 [𝑐𝑐(𝑠𝑠𝑠 𝑠𝑠)] (7)
𝑐𝑐 𝜋𝜋
From equation (6), we learned that in a reinforcement learning setting, given a cost,
we obtain the optimal policy, and from (7), we learned that in an IRL setting, given
an expert policy (expert demonstration), we obtain the cost. Thus, from (6) and (7),
we can observe that the output of IRL can be sent as an input to the RL. That is, IRL
results in the cost function and we can use this cost function as an input in RL to
learn the optimal policy. Thus, we can write RL(IRL(𝜋𝜋𝐸𝐸 )), which implies that the
result of IRL is fed as an input to the RL.
[ 620 ]
Chapter 15
To learn how exactly the equation (8) is derived, you can refer to the GAIL paper
given in the Further reading section at the end of the chapter. The objective function
(equation (8)) implies that we try to find the optimal policy whose occupancy
measure is close to the occupancy measure of the expert policy. The occupancy
⋆
measure between the agent policy and the expert policy is measured by 𝜓𝜓 .
⋆
There are several choices for the regularizer 𝜓𝜓 . We use a generative adversarial
⋆
regularizer 𝜓𝜓𝐺𝐺𝐺𝐺 and write our equation as:
⋆
RL ∘ IRL𝜓𝜓 (𝜋𝜋𝐸𝐸 ) = arg min − 𝐻𝐻(𝜋𝜋) + 𝜓𝜓𝐺𝐺𝐺𝐺 (𝜌𝜌𝜋𝜋 − 𝜌𝜌𝜋𝜋𝐸𝐸 ) (9)
𝜋𝜋
⋆
Thus, minimizing 𝜓𝜓𝐺𝐺𝐺𝐺 (𝜌𝜌𝜋𝜋 − 𝜌𝜌𝜋𝜋𝐸𝐸 ) basically implies that we minimize the JS
divergence between the occupancy measure of the agent policy 𝜌𝜌𝜋𝜋 and the expert
policy 𝜌𝜌𝜋𝜋𝜋𝜋. Thus, we can rewrite the RHS of the equation (9) as:
⋆
min − 𝜆𝜆𝜆𝜆(𝜋𝜋) + 𝜓𝜓𝐺𝐺𝐺𝐺 (𝜌𝜌𝜋𝜋 − 𝜌𝜌𝜋𝜋𝐸𝐸 ) = −𝜆𝜆𝜆𝜆(𝜋𝜋) + 𝐷𝐷𝐽𝐽𝐽𝐽 (𝜌𝜌𝜋𝜋 , 𝜌𝜌𝜋𝜋𝐸𝐸 )
𝜋𝜋
⋆
min 𝜓𝜓𝐺𝐺𝐺𝐺 (𝜌𝜌𝜋𝜋 − 𝜌𝜌𝜋𝜋𝐸𝐸 ) − 𝜆𝜆𝜆𝜆(𝜋𝜋) = 𝐷𝐷𝐽𝐽𝐽𝐽 (𝜌𝜌𝜋𝜋 , 𝜌𝜌𝜋𝜋𝐸𝐸 ) − 𝜆𝜆𝜆𝜆(𝜋𝜋)
𝜋𝜋
Where 𝜆𝜆 is just the policy regularizer. We know that the JS divergence between the
occupancy measure of the agent policy 𝜌𝜌𝜋𝜋 and the expert policy 𝜌𝜌𝜋𝜋𝜋𝜋 is minimized
using the GAN, so we can just replace 𝐷𝐷𝐽𝐽𝐽𝐽 (𝜌𝜌𝜋𝜋 , 𝜌𝜌𝜋𝜋 ) in the preceding equation with
𝐸𝐸
the GAN objective function as:
⋆
min 𝜓𝜓𝐺𝐺𝐺𝐺 (𝜌𝜌𝜋𝜋 − 𝜌𝜌𝜋𝜋𝐸𝐸 ) − 𝜆𝜆𝜆𝜆(𝜋𝜋) = max min 𝔼𝔼𝜋𝜋𝜃𝜃 [log(𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠))] + 𝔼𝔼𝜋𝜋𝐸𝐸 [log(1 − 𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠))] − 𝜆𝜆𝜆𝜆(𝜋𝜋) (10)
𝜋𝜋 𝜃𝜃 𝜔𝜔
RL ∘ IRL𝜓𝜓 (𝜋𝜋𝐸𝐸 ) = max min 𝔼𝔼𝜋𝜋𝜃𝜃 [log(𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠))] + 𝔼𝔼𝜋𝜋𝐸𝐸 [log(1 − 𝐷𝐷𝜔𝜔 (𝑠𝑠𝑠 𝑠𝑠))] − 𝜆𝜆𝜆𝜆(𝜋𝜋) (11)
𝜃𝜃 𝜔𝜔
[ 621 ]
Imitation Learning and Inverse RL
The objective equation implies that we can find the optimal policy by minimizing
the occupancy measure of the expert policy and the agent policy, and we minimize
that using GANs.
The role of the generator is to generate a policy by learning the occupancy measure
of the expert policy, and the role of the discriminator is to classify whether the
generated policy is from the expert policy or the agent policy. So, we train the
generator using TRPO and the discriminator is basically a neural network that
tells us whether the policy generated by the generator is the expert policy or the
agent policy.
Summary
We started the chapter by understanding what imitation learning is and how
supervised imitation learning works. Next, we learned about the DAgger algorithm,
where we aggregate the dataset obtained over a series of iterations and learn the
optimal policy.
After looking at DAgger, we learned about DQfD, where we prefill the replay buffer
with expert demonstrations and pre-train the agent with expert demonstrations
before the training phase.
At the end of the chapter, we learned about GAIL, where we used GANs to learn
the optimal policy. In the next chapter, we will explore a reinforcement learning
library called Stable Baselines.
[ 622 ]
Chapter 15
Questions
Let's assess our understanding of imitation learning and IRL. Try answering the
following questions:
Further reading
For more information, refer to the following papers:
[ 623 ]
Deep Reinforcement
16
Learning with Stable
Baselines
So far, we have learned various deep reinforcement learning (RL) algorithms.
Wouldn't it be nice if we had a library to easily implement a deep RL algorithm?
Yes! There are various libraries available to easily build a deep RL algorithm.
One such popular deep RL library is OpenAI Baselines. OpenAI Baselines provides
an efficient implementation of many deep RL algorithms, which makes them easier
to use. However, OpenAI Baselines does not provide good documentation. So, we
will look at the fork of OpenAI Baselines called Stable Baselines.
Let's start off the chapter by installing Stable Baselines, and then we will learn
how to create our first agent using the library. Next, we will learn about vectorized
environments. Going further, we will learn to implement several deep RL algorithms
using Stable Baselines along with exploring various functionalities of baselines.
[ 625 ]
Deep Reinforcement Learning with Stable Baselines
Several deep RL algorithms require MPI to run, so, let's install MPI:
Note that currently, Stable Baselines works only with TensorFlow version 1.x. So,
make sure you are running the Stable Baselines experiment with TensorFlow 1.x.
Now that we have installed Stable Baselines, let's see how to create our first agent
using it.
import gym
from stable_baselines import DQN
env = gym.make('MountainCar-v0')
Now, let's instantiate our agent. As we can observe in the following code, we are
passing MlpPolicy, which implies that our network is a multilayer perceptron:
Now, let's train the agent by specifying the number of time steps we want to train:
agent.learn(total_timesteps=25000)
In the following code, agent is the trained agent, agent.get_env() gets the
environment we trained our agent in, and n_eval_episodes represents the number
of episodes we need to evaluate our agent:
agent.save("DQN_mountain_car_agent")
agent = DQN.load("DQN_mountain_car_agent")
[ 627 ]
Deep Reinforcement Learning with Stable Baselines
state = env.reset()
for t in range(5000):
Predict the action to perform in the given state using our trained agent:
action, _ = agent.predict(state)
state = next_state
env.render()
Now, we can see how our trained agent performs in the environment:
[ 628 ]
Chapter 16
for t in range(5000):
action, _ = agent.predict(state)
next_state, reward, done, info = env.step(action)
state = next_state
env.render()
Now that we have a basic idea of how to use Stable Baselines, let's explore it in detail.
Vectorized environments
One of the very interesting and useful features of Stable Baselines is that we can train
our agent in multiple independent environments either in separate processes (using
SubprocVecEnv) or in the same process (using DummyVecEnv).
[ 629 ]
Deep Reinforcement Learning with Stable Baselines
For example, say we are training our agent in a cart pole balancing environment –
instead of training our agent only in a single cart pole balancing environment, we
can train our agent in the multiple cart pole balancing environments.
We generally train our agent in a single environment per step but now we can train
our agent in multiple environments per step. This helps our agent to learn more
quickly. Now, our state, action, reward, and done will be in the form of a vector since
we are training our agent in multiple environments. So, we call this a vectorized
environment.
• SubprocVecEnv
• DummyVecEnv
SubprocVecEnv
In the subproc vectorized environment, we run each environment in a separate
process (taking advantage of multiprocessing). Now, let's see how to create the
subproc vectorized environment.
env_name = 'Pendulum-v0'
num_process = 2
env = SubprocVecEnv([make_env(env_name, i) for i in range(num_
process)])
[ 630 ]
Chapter 16
DummyVecEnv
In the dummy vectorized environment, we run each environment in sequence on the
current Python process. It does not support multiprocessing. Now, let's see how to
create the dummy vectorized environment.
env_name = 'Pendulum-v0'
env = DummyVecEnv([lambda: gym.make(env_name)])
Now that we have learned to train the agent in multiple independent environments
using vectorized environments, in the next section, we will see how to integrate
custom environments into Stable Baselines.
Suppose the name of our custom environment is CustomEnv. First, we instantiate our
custom environment as follows:
env = CustomEnv()
That's it. In the next section, let's learn how to play Atari games using a DQN and its
variants.
[ 631 ]
Deep Reinforcement Learning with Stable Baselines
Since we are dealing with Atari games, we can use a convolutional neural network
instead of a vanilla neural network. So, we use CnnPolicy:
We learned that we preprocess the game screen before feeding it to the agent. With
Stable Baselines, we don't have to preprocess manually; instead, we can make use
of the make_atari module, which takes care of preprocessing the game screen:
Now, let's create an Atari game environment. Let's create the Ice Hockey game
environment:
env = make_atari('IceHockeyNoFrameskip-v4')
agent.learn(total_timesteps=25000)
After training the agent, we can have a look at how our trained agent performs in the
environment:
state = env.reset()
while True:
action, _ = agent.predict(state)
next_state, reward, done, info = env.step(action)
state = next_state
env.render()
[ 632 ]
Chapter 16
The preceding code displays how our trained agent plays the ice hockey game:
Now, while instantiating our agent, we just need to pass the keyword arguments:
agent.learn(total_timesteps=25000)
That's it! Now we have the dueling double DQN with prioritized experience replay.
In the next section, we will learn how to play the lunar lander game using the
Advantage Actor-Critic Algorithm (A2C).
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.common.evaluation import evaluate_policy
from stable_baselines import A2C
env = gym.make('LunarLander-v2')
Let's use the dummy vectorized environment. We learned that in the dummy
vectorized environment, we run each environment in the same process:
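The wrapping and agent-creation code is not shown in this excerpt; a minimal sketch, assuming the A2C class and MlpPolicy imported above, would be:
env = DummyVecEnv([lambda: gym.make('LunarLander-v2')])
agent = A2C(MlpPolicy, env, verbose=1)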
agent.learn(total_timesteps=25000)
After training, we can evaluate our agent by looking at the mean rewards:
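A sketch of such an evaluation, assuming the evaluate_policy helper imported earlier (the number of evaluation episodes is arbitrary):
mean_reward, std_reward = evaluate_policy(agent, agent.get_env(), n_eval_episodes=10)
print(mean_reward)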
We can also have a look at how our trained agent performs in the environment:
state = env.reset()
while True:
    action, _states = agent.predict(state)
    next_state, reward, done, info = env.step(action)
    state = next_state
    env.render()
The preceding code will show how well our trained agent lands on the landing pad:
Now, we can define our custom policy (custom network) as shown in the following snippet. As we can observe in the following code, we are passing net_arch=[dict(pi=[128, 128, 128], vf=[128, 128, 128])], which specifies our network architecture. pi represents the architecture of the policy network and vf represents the architecture of the value network:
from stable_baselines.common.policies import FeedForwardPolicy

class CustomPolicy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomPolicy, self).__init__(*args, **kwargs,
                                           net_arch=[dict(pi=[128, 128, 128],
                                                          vf=[128, 128, 128])],
                                           feature_extraction="mlp")
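The agent-creation step is not shown in this excerpt; a minimal sketch, again using A2C on the lunar lander environment from the previous section, would be:
agent = A2C(CustomPolicy, env, verbose=1)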
agent.learn(total_timesteps=25000)
That's it. Similarly, we can create our own custom network. In the next section,
let's learn how to perform the inverted pendulum swing-up task using the Deep
Deterministic Policy Gradient (DDPG) algorithm.
import gym
import numpy as np
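The remaining imports are not shown in this excerpt. Assuming a recent Stable Baselines version (older versions expose the noise classes under stable_baselines.ddpg.noise instead), they would look like this:
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines.common.noise import OrnsteinUhlenbeckActionNoise
from stable_baselines import DDPG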
env = gym.make('Pendulum-v0')
n_actions = env.action_space.shape[-1]
We know that in DDPG, instead of selecting the action directly, we add some noise
using the Ornstein-Uhlenbeck process to ensure exploration. So, we create the action
noise as follows:
action_noise = OrnsteinUhlenbeckActionNoise(mean=np.zeros(n_actions),
                                            sigma=float(0.5) * np.ones(n_actions))
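The agent-creation line is not shown in this excerpt; a sketch, assuming the DDPG class and MlpPolicy imported above:
agent = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1)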
agent.learn(total_timesteps=25000)
After training the agent, we can also look at how our trained agent swings up the
pendulum by rendering the environment. Can we also look at the computational
graph of DDPG? Yes! In the next section, we will learn how to do that.
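To log the graph, we can re-create the agent with a TensorBoard log directory before training; a sketch (the directory name 'logs' is an arbitrary choice):
agent = DDPG(MlpPolicy, env, action_noise=action_noise, verbose=1,
             tensorboard_log='logs')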
agent.learn(total_timesteps=25000)
After training, open the terminal and type the following command to run
TensorBoard:
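Assuming the log directory used above, the command would be:
tensorboard --logdir logs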
As we can observe, we can now see the computational graph of the DDPG model
(agent):
From Figure 16.4, we can understand how the DDPG computational graph is
generated just as we learned in Chapter 12, Learning DDPG, TD3, and SAC.
Now, let's expand and look into the model node for more clarity:
As we can observe from Figure 16.5, our model includes the policy (actor) and Q (critic) networks.
Now that we have learned how to use Stable Baselines to implement DDPG for
the inverted pendulum swing-up task, in the next section we will learn how to
implement TRPO using Stable Baselines.
Setting up the MuJoCo environment
If you are using Linux, you can download the zip file named mujoco200 linux. After downloading the zip file, unzip it and rename the extracted folder to mujoco200. Now, copy the mujoco200 folder into the .mujoco folder in your home directory.
As Figure 16.7 shows, now in our home directory, we have a .mujoco folder, and
inside the .mujoco folder, we have a mujoco200 folder:
To register, we also need the computer id. As Figure 16.8 shows, to the right of the Computer id field, we have the names of the different platforms. Now, just click on your operating system and you will obtain the relevant getid executable file. For instance, if you are using Linux, you will obtain a file named getid_linux.
After downloading the getid_linux file, run the following command on your
terminal:
chmod +x getid_linux
./getid_linux
The preceding command will display your computer id. After getting the computer
id, fill in the form and register to obtain a license. Once you click on the Submit
button, you will get an email from Roboti LLC Licensing.
From the email, download the file named mjkey.txt. Next, place the mjkey.txt file in
the .mujoco folder. As Figure 16.9 shows, now our .mujoco hidden folder contains the
mjkey.txt file and a folder named mujoco200:
Next, open your terminal and run the following command to edit the bashrc file:
nano ~/.bashrc
Copy the following line to the bashrc file and make sure to replace the username text
with your own username:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/username/.mujoco/mujoco200/bin
Next, save the file and exit the nano editor. Now, run the following command on
your terminal:
source ~/.bashrc
Well done! We are almost there. Now, clone the MuJoCo GitHub repository:
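The clone command itself is not shown in this excerpt; it is typically:
git clone https://github.com/openai/mujoco-py.git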
cd mujoco-py
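Inside the cloned repository, we then install the package. One common way to do this (an assumption here; check the mujoco-py README for the exact steps for your version) is:
pip3 install -r requirements.txt
python3 setup.py install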
To test the successful installation of MuJoCo, let's run a Humanoid agent by taking
a random action in the environment. So, create the following Python file named
mujoco_test.py with the following code:
import gym

env = gym.make('Humanoid-v2')
env.reset()
for t in range(1000):
    env.render()
    env.step(env.action_space.sample())
env.close()
Now, open the terminal and run the file:
python mujoco_test.py
The preceding code will render the Humanoid environment as Figure 16.10 shows:
Now that we have successfully installed MuJoCo, let's start implementing TRPO to
train our agent to walk in the next section.
Implementing TRPO
Import the necessary libraries:
import gym
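Only the first import is shown in this excerpt. A minimal sketch of the remaining setup, assuming Stable Baselines' TRPO implementation and the MuJoCo Humanoid environment used earlier (the exact environment used in the book may differ), would be:
from stable_baselines.common.policies import MlpPolicy
from stable_baselines import TRPO

env = gym.make('Humanoid-v2')
agent = TRPO(MlpPolicy, env, verbose=1)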
agent.learn(total_timesteps=250000)
After training the agent, we can see how our trained agent learned to walk by
rendering the environment:
state = env.reset()
while True:
    action, _ = agent.predict(state)
    next_state, reward, done, info = env.step(action)
    state = next_state
    env.render()
Save the whole code used in this section in a Python file called trpo.py and then
open the terminal and run the file:
python trpo.py
We can see how our trained agent learned to walk in Figure 16.11:
Always use the terminal to run the program that uses the MuJoCo
environment.
That's it. In the next section, we will learn how to record our trained agent's actions
as a video.
Recording the video
Note that to record the video, we need the ffmpeg package installed on our machine. If it is not installed, install it using the following commands:
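The commands are not shown in this excerpt; on Ubuntu, for example, they would be:
sudo apt-get update
sudo apt-get install ffmpeg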
Select actions in the environment using our trained agent where the number of time
steps is set to the video length:
state = env.reset()
for t in range(video_length):
    action, _ = agent.predict(state)
    next_state, reward, done, info = env.step(action)
    state = next_state
env.close()
That's it! Now, let's call our record_video function. Note that we are passing the
environment name, our trained agent, the length of the video, and the name of our
video file:
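The definition and call of record_video are not shown in this excerpt. A minimal sketch of what such a helper and its call might look like, assuming Stable Baselines' VecVideoRecorder wrapper (the function name, argument names, and environment name here are illustrative):
from stable_baselines.common.vec_env import DummyVecEnv, VecVideoRecorder

def record_video(env_name, agent, video_length=500, prefix='', video_folder='videos/'):
    # Wrap the environment so that the rendered frames are written to a video file
    eval_env = DummyVecEnv([lambda: gym.make(env_name)])
    eval_env = VecVideoRecorder(eval_env, video_folder,
                                record_video_trigger=lambda step: step == 0,
                                video_length=video_length,
                                name_prefix=prefix)
    state = eval_env.reset()
    for t in range(video_length):
        action, _ = agent.predict(state)
        state, reward, done, info = eval_env.step(action)
    eval_env.close()

record_video('Humanoid-v2', agent, video_length=500, prefix='trpo_agent')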
In this way, we can record our trained agent's action. In the next section, we will
learn how to implement PPO using Stable Baselines.
Implementing PPO
First, import the necessary libraries:
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize
from stable_baselines import PPO2
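The environment-creation code is not shown in this excerpt. Since the trained agent is later referred to as a cheetah bot, a reasonable sketch (the environment name is an assumption) is:
env = DummyVecEnv([lambda: gym.make('HalfCheetah-v2')])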
env = VecNormalize(env, norm_obs=True)
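The agent-creation line is also not shown; a sketch using the PPO2 class imported above:
agent = PPO2(MlpPolicy, env, verbose=1)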
agent.learn(total_timesteps=250000)
After training, we can see how our trained cheetah bot learned to run by rendering
the environment:
state = env.reset()
while True:
    action, _ = agent.predict(state)
    next_state, reward, done, info = env.step(action)
    state = next_state
    env.render()
Save the whole code used in this section in a Python file called ppo.py and then open
the terminal and run the file:
python ppo.py
We can see how our trained cheetah bot learned to run, as Figure 16.13 shows:
Making a GIF of a trained agent
We can also save our trained agent's behavior as a GIF. First, import the required libraries and initialize an empty list to hold the frames:
import imageio
import numpy as np

images = []
Initialize the state by resetting the environment, where agent is the agent we trained
in the previous section:
state = agent.env.reset()
img = agent.env.render(mode='rgb_array')
for i in range(500):
    images.append(img)
    action, _ = agent.predict(state)
    next_state, reward, done, info = agent.env.step(action)
    state = next_state
    img = agent.env.render(mode='rgb_array')
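The line that actually writes the GIF is not shown in this excerpt; a sketch using imageio (the frame rate is an assumption):
imageio.mimsave('HalfCheetah.gif',
                [np.array(img) for img in images if img is not None],
                fps=29)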
Now, we will have a new file called HalfCheetah.gif, as Figure 16.14 shows:
In this way, we can obtain a GIF of our trained agent. In the next section, we will
learn how to implement GAIL using Stable Baselines.
Implementing GAIL
In this section, let's explore how to implement Generative Adversarial Imitation
Learning (GAIL) with Stable Baselines. In Chapter 15, Imitation Learning and Inverse
RL, we learned that we use the generator to generate the state-action pair in a way
that the discriminator is not able to distinguish whether the state-action pair is
generated using the expert policy or the agent policy. We train the generator to
generate a policy similar to an expert policy using TRPO, while the discriminator
is a classifier and it is optimized using Adam.
import gym
from stable_baselines import GAIL, TD3
from stable_baselines.gail import ExpertDataset, generate_expert_traj
Instantiate the GAIL agent with the expert dataset (expert trajectories):
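The expert-data generation and agent creation are not shown in this excerpt. A minimal sketch, assuming the stable_baselines.gail helpers imported above and using a TD3 agent on Pendulum-v0 as the expert (the file names and hyperparameters are illustrative):
# Train an expert agent and save its trajectories to 'expert_pendulum.npz'
expert_agent = TD3('MlpPolicy', 'Pendulum-v0', verbose=1)
generate_expert_traj(expert_agent, 'expert_pendulum', n_timesteps=int(1e5), n_episodes=10)

# Load the expert trajectories and instantiate the GAIL agent on the same environment
dataset = ExpertDataset(expert_path='expert_pendulum.npz', traj_limitation=10, verbose=1)
agent = GAIL('MlpPolicy', 'Pendulum-v0', dataset, verbose=1)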
agent.learn(total_timesteps=25000)
After training, we can also render the environment and see how our trained agent
performs in the environment. That's it, implementing GAIL using Stable Baselines
is that simple.
Summary
We started the chapter by understanding what Stable Baselines is and how to install
it. Then, we learned to create our first agent with Stable Baselines using a DQN. We
also learned how to save and load an agent. Next, we learned how to create multiple
independent environments using vectorization. We also learned about the two types of vectorized environments, SubprocVecEnv and DummyVecEnv.
Later, we learned how to implement a DQN and its variants to play Atari games
using Stable Baselines. Next, we learned how to implement A2C and also how to
create a custom policy network. Moving on, we learned how to implement DDPG
and also how to view the computational graph in TensorBoard.
Going further, we learned how to set up the MuJoCo environment and how to
train an agent to walk using TRPO. We also learned how to record a video of a
trained agent. Next, we learned how to implement PPO and how to make a GIF of
a trained agent. At the end of the chapter, we learned how to implement generative
adversarial imitation learning using Stable Baselines.
Questions
Let's put our knowledge of Stable Baselines to the test. Try answering the following
questions:
Further reading
To learn more, check the following resource:
17
Reinforcement Learning Frontiers
Congratulations! You have made it to the final chapter. We have come a long way.
We started off with the fundamentals of reinforcement learning and gradually
we learned about the state-of-the-art deep reinforcement learning algorithms.
In this chapter, we will look at some exciting and promising research trends in
reinforcement learning. We will start the chapter by learning what meta learning
is and how it differs from other learning paradigms. Then, we will learn about one
of the most used meta-learning algorithms, called Model-Agnostic Meta Learning
(MAML).
We will understand MAML in detail, and then we will see how to apply it in a
reinforcement learning setting. Following this, we will learn about hierarchical
reinforcement learning, and we will look into a popular hierarchical reinforcement learning algorithm called MAXQ value function decomposition. Finally, we will get an overview of Imagination Augmented Agents (I2As).
Meta learning
Meta learning is one of the most promising and trending research areas in the field
of artificial intelligence. It is believed to be a stepping stone for attaining Artificial
General Intelligence (AGI). What is meta learning? And why do we need meta
learning? To answer these questions, let's revisit how deep learning works.
We know that in deep learning, we train a deep neural network to perform a task.
But the problem with deep neural networks is that we need to have a large training
dataset to train our network, as it will fail to learn when we have only a few data
points.
Let's say we trained a deep learning model to perform task A. Suppose we have a
new task B, which is closely related to task A. Although task B is closely related to
task A, we can't use the model we trained for task A to perform task B. We need to
train a new model from scratch for task B. So, for each task, we need to train a new
model from scratch although they might be related. But is this really true AI? Not
really. How do we humans learn? We generalize our learning to multiple concepts
and learn from there. But current learning algorithms master only one task. So, here
is where meta learning comes in.
Meta learning produces a versatile AI model that can learn to perform various tasks
without having to be trained from scratch. We train our meta-learning model on
various related tasks with few data points, so for a new related task, it can make use
of the learning achieved in previous tasks. Many researchers and scientists believe
that meta learning can get us closer to achieving AGI. Learning to learn is the key
focus of meta learning. We will understand how exactly meta learning works by
looking at a popular meta learning algorithm called MAML in the next section.
Model-agnostic meta learning
The basic idea of MAML is to find a better initial model parameter so that, starting from this good initial parameter, the model can learn quickly on new tasks with fewer gradient steps. So, what do we mean by that? Let's say we are performing a classification task
using a neural network. How do we train the network? We start off by initializing
random weights and train the network by minimizing the loss. How do we minimize
the loss? We minimize the loss using gradient descent. Okay, but how do we use
gradient descent to minimize the loss? We use gradient descent to find the optimal
weights that will give us the minimal loss. We take multiple gradient steps to find
the optimal weights so that we can reach convergence.
In MAML, we try to find these optimal weights by learning from the distribution
of similar tasks. So, for a new task, we don't have to start with randomly initialized
weights; instead, we can start with optimal weights, which will take fewer gradient
steps to reach convergence and doesn't require more data points for training.
Let's understand how MAML works in simple terms. Let's suppose we have three
related tasks: T1, T2, and T3.
First, we randomly initialize our model parameter (weight), θ. We train our network on task T1. Then, we try to minimize the loss L by gradient descent. We minimize the loss by finding the optimal parameter. Let θ'₁ be the optimal parameter for the task T1. Similarly, for tasks T2 and T3, we will start off with a randomly initialized model parameter θ and minimize the loss by finding the optimal parameters by gradient descent. Let θ'₂ and θ'₃ be the optimal parameters for tasks T2 and T3, respectively.
As we can see in the following figure, we start off each task with the randomly initialized parameter θ and minimize the loss by finding the optimal parameters θ'₁, θ'₂, and θ'₃ for the tasks T1, T2, and T3, respectively:
However, instead of initializing θ in a random position, that is, with random values, if we initialize θ in a position that is common to all three tasks, then we don't need to take as many gradient steps and it will take us less time to train. MAML tries to do exactly this. MAML tries to find this optimal parameter θ that is common to many of the related tasks, so we can train on a new task relatively quickly with only a few data points and without having to take many gradient steps.
As Figure 17.2 shows, we shift θ to a position that is common to all the different optimal θ' values:
So, for a new related task, say T4, we don't have to start with a randomly initialized parameter θ. Instead, we can start with the optimal θ value (the shifted θ) so that it will take fewer gradient steps to attain convergence.
Thus, in MAML, we try to find this optimal θ value that is common to related tasks to help us learn from fewer data points and minimize our training time. MAML is model-agnostic, meaning that we can apply it to any model that is trainable with gradient descent. But how exactly does MAML work? How do we shift the model parameters to an optimal position? Now that we have a basic understanding of MAML, we will address all these questions in the next section.
Understanding MAML
Suppose we have a model f parameterized by θ, that is, f_θ, and we have a distribution over tasks, p(T). First, we initialize our parameter θ with some random values. Next, we sample a batch of tasks Tᵢ from the distribution over tasks, that is, Tᵢ ~ p(T). Let's say we have sampled five tasks: T1, T2, T3, T4, T5.
Now, for each task Tᵢ, we sample k data points and train the model f parameterized by θ, that is, f_θ. We train the model by computing the loss L_Tᵢ(f_θ), and we minimize the loss using gradient descent and find the optimal parameter θ'ᵢ. The parameter update rule using gradient descent is given as:

θ'ᵢ = θ − α ∇_θ L_Tᵢ(f_θ)     ...(1)

So, after the preceding parameter update using gradient descent, we will have optimal parameters for all five tasks that we have sampled. That is, for the tasks T1, T2, T3, T4, T5, we will have the optimal parameters θ'₁, θ'₂, θ'₃, θ'₄, θ'₅, respectively.
Now, before the next iteration, we perform a meta update or meta optimization. That is, in the previous step, we found the optimal parameter θ'ᵢ by training on each of the tasks Tᵢ. Now we take a new set of tasks, and for each of these new tasks Tᵢ, we don't have to start from the random position θ; instead, we can start from the optimal position θ'ᵢ to train the model.
That is, for each of the new tasks Tᵢ, instead of using the randomly initialized parameter θ, we use the optimal parameter θ'ᵢ. This implies that we train the model f parameterized by θ'ᵢ, that is, f_θ'ᵢ, instead of f_θ. Then, we calculate the loss L_Tᵢ(f_θ'ᵢ), compute the gradients, and update the parameter θ. This moves our randomly initialized parameter θ to an optimal position where we don't have to take many gradient steps. This step is called a meta update, meta optimization, or meta training. It can be expressed as follows:

θ = θ − β ∇_θ Σ_{Tᵢ~p(T)} L_Tᵢ(f_θ'ᵢ)     ...(2)

If you look at the meta update equation (2) closely, you can see that we update our model parameter θ by merely taking an average of the gradients of each new task Tᵢ with the model f parameterized by θ'ᵢ.
Figure 17.3 helps us to understand the MAML algorithm better. As we can observe, the MAML algorithm has two loops: an inner loop where we find the optimal parameter θ'ᵢ for each task Tᵢ using the model f parameterized by the initial parameter θ, that is, f_θ, and an outer loop where we use the model f parameterized by the optimal parameter θ'ᵢ obtained in the previous step, that is, f_θ'ᵢ, train the model on the new set of tasks, calculate the loss, compute the gradient of the loss, and update the randomly initialized model parameter θ:
Note that we should not use the same set of tasks we used to find the optimal parameter θ'ᵢ when updating the model parameter θ in the outer loop.
In a nutshell, in MAML, we sample a batch of tasks and, for each task Tᵢ in the batch, we minimize the loss using gradient descent and get the optimal parameter θ'ᵢ. Then, we update our randomly initialized model parameter θ by calculating the gradients for each new task Tᵢ with the model parameterized as f_θ'ᵢ.
Still not clear how exactly MAML works? Worry not! Let's look in even more detail
at the steps and understand how MAML works in a supervised learning setting in
the next section.
MAML in a supervised learning setting
If we are performing regression, then we can use the mean squared error as our loss function:

L_Tᵢ(f_θ) = Σ_{(xⱼ, yⱼ)~Tᵢ} || f_θ(xⱼ) − yⱼ ||²

If we are performing classification, then we can use the cross-entropy loss as our loss function:

L_Tᵢ(f_θ) = Σ_{(xⱼ, yⱼ)~Tᵢ} yⱼ log f_θ(xⱼ) + (1 − yⱼ) log(1 − f_θ(xⱼ))
Now, let's see step by step how exactly MAML is used in supervised learning. Suppose we have a model f parameterized by θ and a distribution over tasks, p(T). First, we randomly initialize the model parameter θ.
Next, we sample a batch of tasks Tᵢ from the distribution of tasks, that is, Tᵢ ~ p(T). Let's say we have sampled three tasks; then, we have T1, T2, T3.
Inner loop: For each task Ti, we sample k data points and prepare our training and
test datasets:
Now, we train the model f_θ on the training dataset Dᵢ^train, calculate the loss, minimize the loss using gradient descent, and get the optimal parameter θ'ᵢ as θ'ᵢ = θ − α ∇_θ L_Tᵢ(f_θ).
That is, for each of the tasks Tᵢ, we sample k data points and prepare Dᵢ^train and Dᵢ^test. Next, we minimize the loss on the training dataset Dᵢ^train and get the optimal parameter θ'ᵢ. As we sampled three tasks, we will have three optimal parameters, θ'₁, θ'₂, θ'₃.
Outer loop: Now, we perform meta optimization on the test set (meta-training set); that is, we try to minimize the loss on the test set Dᵢ^test. Here, we parameterize our model f by the optimal parameter θ'ᵢ calculated in the previous step. So, we compute the loss of the model L_Tᵢ(f_θ'ᵢ) and the gradients of the loss, and update our randomly initialized parameter θ using our test dataset (meta-training dataset) as:

θ = θ − β ∇_θ Σ_{Tᵢ~p(T)} L_Tᵢ(f_θ'ᵢ)

We repeat the preceding steps for several iterations to find the optimal parameter. For a clear understanding of how MAML works in supervised learning, let's look into the algorithm in the next section.
4. Now, minimize the loss on the test set Dᵢ^test. Parameterize the model f with the optimal parameter θ'ᵢ calculated in the previous step and compute the loss L_Tᵢ(f_θ'ᵢ). Calculate the gradients of the loss and update our randomly initialized parameter θ using our test (meta-training) dataset as:

θ = θ − β ∇_θ Σ_{Tᵢ~p(T)} L_Tᵢ(f_θ'ᵢ)
The following figure gives us an overview of how the MAML algorithm works in
supervised learning:
Now that we have learned how to use MAML in a supervised learning setting, in
the next section, we will see how to use MAML in a reinforcement learning setting.
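Before moving to the reinforcement learning setting, here is a minimal sketch of the two-loop MAML update in plain Python/NumPy, using the first-order approximation of the meta gradient (the loss_grad function and the task objects with a sample method are hypothetical placeholders, not code from this book):
import numpy as np

def maml_step(theta, tasks, loss_grad, alpha=0.01, beta=0.001, k=10):
    # Inner loop: adapt a copy of theta to each task with one gradient step
    adapted = []
    for task in tasks:
        train_batch = task.sample(k)                 # D_i^train
        theta_i = theta - alpha * loss_grad(theta, train_batch)
        adapted.append(theta_i)

    # Outer loop (meta update): evaluate each adapted parameter on fresh data
    # from the same task and update the shared initialization theta
    meta_grad = np.zeros_like(theta)
    for task, theta_i in zip(tasks, adapted):
        test_batch = task.sample(k)                  # D_i^test
        meta_grad += loss_grad(theta_i, test_batch)  # first-order approximation
    return theta - beta * meta_grad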
MAML in a reinforcement learning setting
We can apply MAML to any algorithm that can be trained with gradient descent. For instance, let's take the policy gradient method. In the policy gradient method, we use a neural network parameterized by θ to find the optimal policy, and we train our network using gradient descent. So, we can apply the MAML algorithm to the policy gradient method.
Let's say we have a model (policy network) f parameterized by a parameter θ. The model (policy network) f tries to find the optimal policy by learning the optimal parameter θ. Suppose we have a distribution over tasks, p(T). First, we randomly initialize the model parameter θ.
Next, we sample a batch of tasks Tᵢ from the distribution of tasks, that is, Tᵢ ~ p(T). Let's say we have sampled three tasks; then, we have T1, T2, T3.
Inner loop: For each task Tᵢ, we prepare our training dataset Dᵢ^train. Okay, how can we create the training dataset in a reinforcement learning setting?
We have the model (policy network) f_θ. So, we generate k trajectories using our model f_θ. We know that a trajectory consists of a sequence of state-action pairs. So, we have:

Dᵢ^train = {τ₁, τ₂, ..., τ_k}, where each trajectory τ is sampled using f_θ

That is, for each of the tasks Tᵢ, we sample k trajectories and prepare the training dataset Dᵢ^train. Next, we minimize the loss on the training dataset Dᵢ^train and get the optimal parameter θ'ᵢ. As we sampled three tasks, we will have three optimal parameters, θ'₁, θ'₂, θ'₃.
We also need the test dataset Dᵢ^test, which we use in the outer loop. How do we prepare our test dataset? Now, we use our model f parameterized by the optimal parameter θ'ᵢ, that is, f_θ'ᵢ, and generate k trajectories. So, we have:

Dᵢ^test = {τ₁, τ₂, ..., τ_k}, where each trajectory τ is sampled using f_θ'ᵢ
Outer loop: Now, we perform meta optimization on the test (meta-training) dataset; that is, we try to minimize the loss on the test dataset Dᵢ^test. Here, we parameterize our model f by the optimal parameter θ'ᵢ calculated in the previous step. So, we compute the loss of the model L_Tᵢ(f_θ'ᵢ) and the gradients of the loss, and update our randomly initialized parameter θ using our test (meta-training) dataset as:

θ = θ − β ∇_θ Σ_{Tᵢ~p(T)} L_Tᵢ(f_θ'ᵢ)

We repeat the preceding step for several iterations to find the optimal parameter. For a clear understanding of how MAML works in reinforcement learning, let's look into the algorithm in the next section.
That's it! Meta learning is a growing field of research. Now that we have a basic
idea of meta learning, you can explore more about meta learning and see how meta
learning is used in reinforcement learning. In the next section, we will learn about
hierarchical reinforcement learning.
Hierarchical reinforcement learning
In hierarchical reinforcement learning (HRL), we decompose a large problem into a hierarchy of smaller subproblems (subtasks) and solve those subtasks to solve the original problem. There are different methods used in HRL, such as state-space decomposition, state abstraction, and temporal abstraction. In state-space decomposition, we decompose the state space into different subspaces and try to solve the problem in a smaller subspace. Breaking down the state space also allows faster exploration, as the agent does not have to explore the entire state space. In state abstraction, the agent ignores the variables that are irrelevant to achieving the current subtasks in the current state space. In temporal abstraction, action sequences and action sets are grouped, which divides a single step into multiple steps.
We will now look into one of the most commonly used algorithms in HRL, called
MAXQ value function decomposition.
MAXQ value function decomposition
Let's suppose our agent is driving a taxi. As Figure 17.5 shows, the tiny, yellow-colored rectangle is the taxi driven by our agent. The letters (R, G, Y, B) represent the four different locations. The agent has to pick up a passenger at one location and drop them off at another location. The agent receives +20 points as a reward for a successful drop-off and -1 point for every time step it takes. The agent also receives -10 points for illegal pickups and drop-offs. So the goal of our agent is to learn to pick up and drop off passengers at the correct locations in as little time as possible without making illegal pickups or drop-offs.
Now we break the goal of our agent into four subtasks as follows:
• Navigate: In the Navigate subtask, the goal of our agent is to drive the taxi
from the current location to one of the target locations. The Navigate(t)
subtask will use the four primitive actions: north, south, east, and west.
• Get: In the Get subtask, the goal of our agent is to drive the taxi from its
current location to the passenger's location and pick up the passenger.
• Put: In the Put subtask, the goal of our agent is to drive the taxi from its
current location to the passenger's destination and drop off the passenger.
• Root: Root is the whole task.
We can represent all these subtasks in a directed acyclic graph called a task graph, as
Figure 17.6 shows:
As we can observe from the preceding figure, all the subtasks are arranged
hierarchically. Each node represents a subtask or primitive action and each edge
connects in such a way that a subtask can call its child subtask. As shown, the
Navigate(t) subtask has four primitive actions: East, West, North, and South.
The Get subtask has a Pickup primitive action and a Navigate(t) subtask. Similarly,
the Put subtask has a Putdown (drop) primitive action and Navigate(t) subtask.
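To make the hierarchy concrete, here is a minimal sketch of the taxi task graph expressed as a plain Python dictionary (an illustration only, not code from the book):
# Each subtask maps to the child subtasks or primitive actions it can invoke
task_graph = {
    'Root':        ['Get', 'Put'],
    'Get':         ['Pickup', 'Navigate(t)'],
    'Put':         ['Putdown', 'Navigate(t)'],
    'Navigate(t)': ['North', 'South', 'East', 'West'],
    # Primitive actions have no children
    'Pickup': [], 'Putdown': [],
    'North': [], 'South': [], 'East': [], 'West': [],
}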
As we can observe from Figure 17.7, our redesigned graph contains two special types
of nodes: max nodes and Q nodes. The max nodes define the subtasks in the task
decomposition and the Q nodes define the actions that are available for each subtask.
Thus, in this section, we got a basic idea of MAXQ value function decomposition. In the next section, we will learn about Imagination Augmented Agents (I2As).
Imagination augmented agents
As we can observe from Figure 17.8, the I2A architecture has both model-based and model-free paths. Thus, the action the agent takes is the result of both the model-based and model-free paths. In the model-based path, we have rollout encoders.
These rollout encoders are where the agent performs imagination tasks, so let's take a closer look at them. Figure 17.9 shows a single rollout encoder:
From Figure 17.9, we can observe that a rollout encoder has two layers: the imagine future layer and the encoder layer. The imagine future layer is where the imagination happens; it consists of imagination cores. When we feed the state s_t to the imagination core, we get the next state ŝ_{t+1} and the reward r̂_{t+1}, and when we feed this next state ŝ_{t+1} to the next imagination core, we get the next state ŝ_{t+2} and reward r̂_{t+2}. If we repeat this for n steps, we get a rollout, which is basically a sequence of imagined states and rewards, and then we use an encoder such as Long Short-Term Memory (LSTM) to encode this rollout. As a result, we get a rollout encoding, which is an embedding describing the imagined future path. We will have multiple rollout encoders for different imagined future paths, and we use an aggregator to aggregate these rollout encodings.
Okay, but how exactly does the imagination happen in the imagination core? What is actually in the imagination core? Figure 17.10 shows a single imagination core:
As we can observe from Figure 17.10, the imagination core consists of a policy network and an environment model. The environment model learns from all the actions that the agent has performed so far. It takes information about the state ŝ_t, imagines all the possible futures considering its experience, and chooses the action â_t that gives a high reward.
Figure 17.11 shows the complete architecture of I2As with all components expanded:
Have you played Sokoban before? Sokoban is a classic puzzle game where the player
has to push boxes to a target location. The rules of the game are very simple: boxes
can only be pushed and cannot be pulled. If we push a box in the wrong direction
then the puzzle becomes unsolvable:
The I2A architecture provides good results in these kinds of environments, where
the agent has to plan in advance before taking an action. The authors of the paper
tested I2A performance on Sokoban and achieved great results.
Summary
We started this chapter by understanding what meta learning is. We learned that
with meta learning, we train our model on various related tasks with a few data
points, such that for a new related task, our model can make use of the learning
obtained from the previous tasks.
Moving on, we learned about HRL, where we decompose large problems into small
subproblems in a hierarchy. We also looked into the different methods used in HRL,
such as state-space decomposition, state abstraction, and temporal abstraction. Next,
we got an overview of MAXQ value function decomposition, where we decompose
the value function into a set of value functions for each of the subtasks.
At the end of the chapter, we learned about I2As, which are augmented with imagination. Before taking any action in an environment, the agent imagines the consequences of taking the action, and if it expects the action to provide a good reward, it will perform the action.
Questions
Let's test the knowledge you gained in this chapter; try answering the following
questions:
Further reading
For more information, we can refer to the following papers:
Appendix 1 – Reinforcement Learning Algorithms
Let's have a look at all the reinforcement learning algorithms we have learned about
in this book.
Value Iteration
The algorithm of value iteration is given as follows:
Policy Iteration
The algorithm of policy iteration is given as follows:
First-Visit MC Prediction
The algorithm of first-visit MC prediction is given as follows:
1. Let total_return(s) be the sum of the return of a state across several episodes and N(s) be the counter, that is, the number of times a state is visited across several episodes. Initialize total_return(s) and N(s) as zero for all the states. The policy π is given as input.
2. For M number of iterations:
    1. Generate an episode using the policy π
    2. Store all rewards obtained in the episode in a list called rewards
    3. For each step t in the episode:
3. Compute the value of a state by just taking the average, that is:

V(s) = total_return(s) / N(s)
Every-Visit MC Prediction
The algorithm of every-visit MC prediction is given as follows:
1. Let total_return(s) be the sum of the return of a state across several episodes and N(s) be the counter, that is, the number of times a state is visited across several episodes. Initialize total_return(s) and N(s) as zero for all the states. The policy π is given as input.
2. For M number of iterations:
    1. Generate an episode using the policy π
    2. Store all the rewards obtained in the episode in a list called rewards
    3. For each step t in the episode:
3. Compute the value of a state by just taking the average, that is:

V(s) = total_return(s) / N(s)
3. Compute the Q function (Q value) by just taking the average, that is:

Q(s, a) = total_return(s, a) / N(s, a)
MC Control Method
The algorithm for the MC control method is given as follows:
Off-Policy MC Control
The algorithm for the off-policy MC control method is given as follows:
1. Initialize the Q function Q(s, a) with random values, set the behavior policy b to be epsilon-greedy, set the target policy π to be the greedy policy, and initialize the cumulative weights as C(s, a) = 0
2. For M number of episodes:
    1. Generate an episode using the behavior policy b
    2. Initialize the return R to 0 and the weight W to 1
    3. For each step t in the episode, t = T – 1, T – 2, . . . , 0:
TD Prediction
The algorithm for the TD prediction method is given as follows:
1. Initialize the value function V(s) with random values. A policy π is given.
2. For each episode:
    1. Perform the action a, move to the new state s′, and observe the reward r
    2. In the state s′, select the action a′ using the epsilon-greedy policy
Deep Q Learning
The algorithm for deep Q learning is given as follows:
Double DQN
The algorithm for double DQN is given as follows:
1. The worker agent interacts with its own copies of the environment
2. Each worker follows a different policy and collects the experience
3. Next, the worker agents compute the loss of the actor and critic networks
4. After computing the loss, they calculate the gradients of the loss and send those gradients to the global agent asynchronously
5. The global agent updates its parameter with the gradients received from the
worker agents
6. Now, the updated parameter from the global agent will be sent to the worker
agents periodically
DDPG
The algorithm for DDPG is given as follows:
1. Initialize the main critic network parameter θ and the main actor network parameter ϕ
2. Initialize the target critic network parameter θ′ by just copying the main critic network parameter θ
3. Initialize the target actor network parameter ϕ′ by just copying the main actor network parameter ϕ
4. Initialize the replay buffer D
1. Select action a based on the policy μ_ϕ(s) and exploration noise, that is, a = μ_ϕ(s) + N
2. Perform the selected action a, move to the next state s′, get the reward r, and store this transition information in the replay buffer D
3. Randomly sample a minibatch of K transitions from the replay buffer D
4. Compute the target value of the critic, that is, yᵢ = rᵢ + γ Q_θ′(s′ᵢ, μ_ϕ′(s′ᵢ))
5. Compute the loss of the critic network, J(θ) = (1/K) Σᵢ (yᵢ − Q_θ(sᵢ, aᵢ))²
6. Compute the gradient of the loss, ∇_θ J(θ), and update the critic network parameter using gradient descent, θ = θ − λ ∇_θ J(θ)
7. Compute the gradient of the actor objective, ∇_ϕ J(ϕ), and update the actor network parameter using gradient ascent, ϕ = ϕ + λ ∇_ϕ J(ϕ)
8. Update the target critic and target actor network parameters, θ′ = τθ + (1 − τ)θ′ and ϕ′ = τϕ + (1 − τ)ϕ′
Twin Delayed DDPG (TD3)
The algorithm for TD3 is given as follows:
1. Initialize the two main critic network parameters, θ₁ and θ₂, and the main actor network parameter ϕ
2. Initialize the two target critic network parameters, θ′₁ and θ′₂, by copying the main critic network parameters θ₁ and θ₂, respectively
3. Initialize the target actor network parameter ϕ′ by copying the main actor network parameter ϕ
4. Initialize the replay buffer D
5. For N number of episodes, repeat step 6
1. Select action a based on the policy μ_ϕ(s) and with exploration noise ε, that is, a = μ_ϕ(s) + ε, where ε ~ N(0, σ)
2. Perform the selected action a, move to the next state s′, get the reward r, and store the transition information in the replay buffer D
3. Randomly sample a minibatch of K transitions from the replay buffer D
4. Select the action a′ for computing the target value, a′ = μ_ϕ′(s′) + ε, where ε ~ clip(N(0, σ), −c, c)
5. Compute the target value of the critic, that is, yᵢ = rᵢ + γ min_{j=1,2} Q_θ′ⱼ(s′ᵢ, a′)
Soft Actor-Critic
The algorithm for Soft Actor-Critic (SAC) is given as follows:
1. Initialize the main value network parameter ψ, the Q network parameters θ₁ and θ₂, and the actor network parameter ϕ
2. Initialize the target value network parameter ψ′ by just copying the main value network parameter ψ
3. Initialize the replay buffer D
1. Select action a based on the policy π_ϕ(s), that is, a ~ π_ϕ(s)
2. Perform the selected action a, move to the next state s′, get the reward r, and store the transition information in the replay buffer D
3. Randomly sample a minibatch of K transitions from the replay buffer
4. Compute the target state value, yᵥᵢ = min_{j=1,2} Q_θⱼ(sᵢ, aᵢ) − α log π_ϕ(aᵢ|sᵢ)
5. Compute the loss of the value network, J_V(ψ) = (1/K) Σᵢ (yᵥᵢ − V_ψ(sᵢ))², and update the parameter using gradient descent, ψ = ψ − λ ∇_ψ J(ψ)
6. Compute the target Q value, y_qᵢ = rᵢ + γ V_ψ′(s′ᵢ)
7. Compute the loss of the Q networks, J_Q(θⱼ) = (1/K) Σᵢ (y_qᵢ − Q_θⱼ(sᵢ, aᵢ))² for j = 1, 2, and update the parameters using gradient descent, θⱼ = θⱼ − λ ∇_θⱼ J(θⱼ) for j = 1, 2
8. Compute the gradients of the actor objective function, ∇_ϕ J(ϕ), and update the parameter using gradient ascent, ϕ = ϕ + λ ∇_ϕ J(ϕ)
9. Update the target value network parameter, ψ′ = τψ + (1 − τ)ψ′
PPO-Clipped
The algorithm for the PPO-clipped method is given as follows:
PPO-Penalty
The algorithm for the PPO-penalty method is given as follows:
Categorical DQN
The algorithm for a categorical DQN is given as follows:
10. Compute the cross-entropy loss, Cross Entropy = − Σᵢ mᵢ log pᵢ(s, a)
11. Minimize the loss using gradient descent and update the parameter of the main network
12. Freeze the target network parameter θ′ for several time steps and then update it by just copying the main network parameter θ
Distributed Distributional DDPG (D4PG)
The algorithm for D4PG is given as follows:
1. Initialize the critic network parameter θ and the actor network parameter ϕ
2. Initialize the target critic network parameter θ′ and the target actor network parameter ϕ′ by copying from θ and ϕ, respectively
3. Initialize the replay buffer D
4. Launch L number of actors
5. For N number of episodes, repeat step 6
6. For each step in the episode, that is, for t = 0, . . ., T – 1:
    3. Compute the loss of the critic network and calculate its gradient as ∇_θ J(θ) = (1/K) Σᵢ ∇_θ (R pᵢ)⁻¹ d(yᵢ, Z_θ(s, a))
1. Select action a based on the policy μ_ϕ(s) and exploration noise, that is, a = μ_ϕ(s) + N
2. Perform the selected action a, move to the next state s′, get the reward r, and store the transition information in the replay buffer D
3. Repeat steps 1 and 2 until the learner finishes
DAgger
The algorithm for DAgger is given as follows:
1. Select an action
2. Perform the selected action, move to the next state, observe the reward, and store this transition information in the replay buffer D
3. Sample a minibatch of experience from the replay buffer D with prioritization
4. Compute the loss J(Q)
5. Update the parameter of the network using gradient descent
6. If t mod d = 0:
4. Now, we minimize the loss on the test dataset Dᵢ^test. Parameterize the model f with the optimal parameter θ'ᵢ calculated in the previous step and compute the loss L_Tᵢ(f_θ'ᵢ). Calculate the gradients of the loss and update our randomly initialized parameter θ using our test (meta-training) dataset:

θ = θ − β ∇_θ Σ_{Tᵢ~p(T)} L_Tᵢ(f_θ'ᵢ)
Appendix 2 – Assessments
The following are the answers to the questions mentioned at the end of each chapter.
Chapter 1 – Fundamentals
of Reinforcement Learning
1. In supervised and unsupervised learning, the model (agent) learns based
on the given training dataset, whereas, in reinforcement learning (RL),
the agent learns by directly interacting with the environment. Thus RL
is essentially an interaction between the agent and its environment.
2. The environment is the world of the agent. The agent stays within the
environment. For instance, in the chess game, the chessboard is the
environment since the chess player (agent) learns to play chess within the
chessboard (environment). Similarly, in the Super Mario Bros game, the
world of Mario is called the environment.
3. The deterministic policy maps the state to one particular action, whereas the
stochastic policy maps the state to the probability distribution over an action
space.
4. The agent interacts with the environment by performing actions, starting
from the initial state until they reach the final state. This agent-environment
interaction starting from the initial state until the final state is called an
episode.
5. The discount factor helps us prevent the return from growing up to infinity by deciding how much importance we give to future rewards relative to immediate rewards.
6. The value function (value of a state) is the expected return of the trajectory
starting from that state whereas the Q function (the Q value of a state-action
pair) is the expected return of the trajectory starting from that state and
action.
2. The Bellman expectation equation gives the Bellman value and Q functions, whereas the Bellman optimality equation gives the optimal Bellman value and Q functions.
3. The value function can be derived from the Q function as V*(s) = max_a Q*(s, a).
4. The Q function can be derived from the value function as Q*(s, a) = Σ_{s′} P(s′|s, a)[R(s, a, s′) + γ V*(s′)].
5. In the value iteration method, we perform the following steps:
    1. Compute the optimal value function by taking the maximum over the Q function, that is, V*(s) = max_a Q*(s, a)
    2. Extract the optimal policy from the computed optimal value function
7. In the value iteration method, first, we compute the optimal value function by taking the maximum over the Q function iteratively. Once we find the optimal value function, we use it to extract the optimal policy. In the policy iteration method, we try to compute the optimal value function using the policy iteratively. Once we have found the optimal value function, the policy that was used to create the optimal value function is extracted as the optimal policy.
3. The TD error can be defined as the difference between the target value and the predicted value.
4. The TD learning update rule is given as V(s) = V(s) + α(r + γ V(s′) − V(s)).
5. In a TD prediction task, given a policy, we estimate the value function using the given policy. So, we can say what expected return an agent can obtain in each state if it acts according to the given policy.
6. SARSA is an on-policy TD control algorithm and it stands for State-Action-Reward-State-Action. The update rule for computing the Q function using SARSA is given as Q(s, a) = Q(s, a) + α(r + γ Q(s′, a′) − Q(s, a)).
7. SARSA is an on-policy algorithm, meaning that we use a single epsilon-greedy policy both for selecting an action in the environment and for computing the Q value of the next state-action pair, whereas Q learning is an off-policy algorithm, meaning that we use an epsilon-greedy policy for selecting an action in the environment, but to compute the Q value of the next state-action pair we use a greedy policy.
4. The upper confidence bound is computed as UCB(a) = Q(a) + √(2 log(t) / N(a)).
5. When the value of α is higher than β, then we will have a high probability closer to 1 than 0.
7. The LSTM layer is in the DQN so that we can retain information about the
past states as long as it is required. Retaining information about the past
states helps us when we have the problem of Partially Observable Markov
Decision Processes (POMDPs).
2. In the actor-critic method, the actor computes the optimal policy and the
critic evaluates the policy computed by the actor network by estimating the
value function.
3. In the policy gradient method with baseline, first, we generate complete
episodes (trajectories), and then we update the parameter of the network,
whereas, in the actor-critic method, we update the network parameter at
every step of the episode.
4. In the actor network, we compute the gradient as ∇_θ J(θ) = ∇_θ log π_θ(aₜ|sₜ)(r + γ V_ϕ(s′ₜ) − V_ϕ(sₜ)).
5. In advantage actor-critic (A2C), we compute the policy gradient with the
advantage function and the advantage function is the difference between
the Q function and the value function, that is, Q(s, a) – V(s).
6. The word asynchronous implies the way A3C works. That is, instead of
having a single agent that tries to learn the optimal policy, here, we have
multiple agents that interact with the environment. Since we have multiple
agents interacting with the environment at the same time, we provide copies
of the environment to every agent so that each agent can interact with its
own copy of the environment. So, all these multiple agents are called worker
agents and we have a separate agent called the global agent. All the worker
agents report to the global agent asynchronously and the global agent
aggregates the learning.
7. In A2C, we can have multiple worker agents, each interacting with its
own copies of the environment, and all the worker agents perform the
synchronous updates, unlike A3C where the worker agents perform
asynchronous updates.
4. Instead of using one critic network, we use two main critic networks for computing the Q value and two target critic networks for computing the target value. We compute two target Q values using the two target critic networks and use the minimum of the two while computing the loss. This helps prevent overestimation of the target Q value.
5. The DDPG method produces different target values even for the same action, so the variance of the target value will be high even for the same action; we reduce this variance by adding some noise to the target action.
6. In the SAC method, we use a slightly modified version of the objective function with an entropy term, J(ϕ) = E_{τ~π_ϕ}[Σ_{t=0}^{T−1}(rₜ + α H(π(·|sₜ)))], and it is often called maximum entropy RL or entropy-regularized RL.
4. The update rule of TRPO is given as θ = θ + α^j √(2δ / (sᵀ H s)) s.
Chapter 14 – Distributional Reinforcement Learning
1. In a distributional RL, instead of selecting an action based on the expected
return, we select the action based on the distribution of the return, which is
often called the value distribution or return distribution.
2. In categorical DQN, we feed the state and support of the distribution as the
input and the network returns the probabilities of the value distribution.
3. The authors of the categorical DQN suggest that it will be efficient to choose
the number of support N as 51 and so the categorical DQN is also known as
the C51 algorithm.
4. The inverse CDF is also known as the quantile function. As the name suggests, it is the inverse of the cumulative distribution function. That is, in the CDF, given the support x, we obtain the cumulative probability τ, whereas in the inverse CDF, given the cumulative probability τ, we obtain the support x.
5. In a categorical DQN, along with the state, we feed the fixed support at
equally spaced intervals as an input to the network, and it returns the non-
uniform probabilities. However, in a QR-DQN, along with the state, we feed
the fixed uniform probabilities as an input to the network and it returns the
support at variable locations (unequally spaced support).
6. The D4PG is similar to DDPG except for the following:
4. The meta-training set basically acts as a training set in the outer loop and is used to update the randomly initialized model parameter θ.
Index
A about 429, 430
designing 448
ACKTR, math concepts 540 Advantage Actor Critic 690
block diagonal matrix 540, 541 algorithm 689
block matrix 540 Advantage Actor Critic Algorithm (A2C) 634
Kronecker product 542 custom network, creating 635
Kronecker product, properties 543 lunar lander environment, using 634, 635
vec operator 543 advantage function 384
actions 2, 14, 39 advertisement banner
action space 18, 40, 73, 74 bandit test, executing 257, 259
activation function 265, 267 dataset, creating 256
exploring 267 epsilon-greedy method, defining 257
Rectified Linear Unit (ReLU) finding, with bandits 255
function 269, 270 variables, initializing 256
sigmoid function 268 agent 2, 39
softmax function 270, 271 creating, with Stable Baselines 626, 627
tanh function 269 training, with TRPO 639
activation map 300 algorithmic environment
Actor-Critic 431 about 81
actor critic algorithm 428, 429 reference link 81
actor critic class Anaconda
action, selecting 441 download link 44
defining 436 installing 44, 45
global network, updating 440 Artificial General Intelligence (AGI) 656
init method, defining 436-439 Artificial Neural Networks (ANNs)
network, building 440 about 266
worker network, updating 441 forward propagation 271-273
actor critic method hidden layer 267
K-FAC, applying 546-548 input layer 267
overview 424, 425 layers 266
working 425-427 learning 274-281
Actor Critic using Kronecker-Factored Trust output layer 267
Region (ACKTR) 538, 539 artificial neurons 264-266
actor network 598, 599 Asynchronous 431
Advantage 431 Asynchronous Advantage Actor Critic
Advantage Actor Critic (A2C) (A3C) 430, 431
[ 721 ]
algorithm 690 bootstrapped sequential updates 391
architecture 432, 433 Box2D environment 79
used, for implementing mountain car
climbing 434 C
Atari game environment 69
agent, creating 75-77 cart pole balancing
Deterministic environment 70 discounted and normalized reward,
frame, skipping 71 computing 413
game recording 77, 78 policy network, building 413, 414
reference link 71 policy network, training 415-417
state and action space 71 with policy gradient 412
Atari games categorical DQN 555, 556
playing, with categorical DQN 574 action, selecting on value
playing, with DQN 632 distribution 559-562
playing, with DQN variants 632 algorithm 573, 574
Atari games, playing with DQN implementing 571, 572
about 368 projection step, working 564-570
DQN class, defining 371 replay buffer, defining 575
game screen, preprocessing 370 training 562, 563
used, for playing Atari games 574
B value distribution, predicting 557-559
variables, defining 575
backpropagate 276 categorical DQN algorithm 695, 696
backpropagation 277 categorical DQN class
computing 334 action, selecting 581
bandit environment deep network, building 578-580
creating, with Gym toolkit 228, 229 defining 576
bandits init method, defining 576, 577
used, for finding advertisement banner 255 network, training 582, 583
behavior policy 184 train function, defining 580, 581
Bellman backup 87, 91 categorical policy 21
Bellman equation 86 cheetah bot
of Q function 90-92 training, to execute PPO 648
of value function 86-90 classic control environment
Bellman expectation equation 90 about 62, 63
Bellman optimality equation 93, 94 action space 66
biological neurons 264, 265 Cart-Pole balancing, with random
block diagonal matrix 540, 541 policy 67, 68
block matrix 540 reference link 69
Boltzmann exploration 234 state space 64-66
Bonjour Keras 348 clipped double Q learning 474
functional model, defining 349, 350 components, DDPG
model, compiling 350, 351 actor network 459, 461
model, defining 348 critic network 454-458
model, evaluating 351 components, SAC
model, training 351 actor network 492, 493
sequential model, defining 349 critic network 487, 488
[ 722 ]
computational graph 320-322
  viewing, in TensorBoard 637-639
  visualizing 446-448
conjugate gradient method 508, 509
constants 324
contextual bandits 259, 260
continuous action space 19
continuous environment 37
continuous task 25, 40
control task 135
ConvNet 295
convolutional layer 297-302
  padding 303, 304
  stride 302
convolutional neural network (CNN) 295-368
  architecture 306
convolution operation 300
cost function 273
critic network 596, 597, 598
cumulative distribution function (CDF) 584
custom environments
  integrating 631

D

DAgger algorithm 698
Dataset Aggregation (DAgger) 605, 606
  algorithm 607
  defining 606, 607
DDPG class
  action, selecting 469
  actor network, building 471
  critic network, building 471
  defining 465
  init method, defining 465-468
  network, training 472, 473
  train function, defining 469, 470
  transitions, storing 470
Deep Deterministic Policy Gradient (DDPG) 452, 595, 636, 690
  actor 453
  actor network 598, 599
  algorithm 463, 690
  algorithm, implementing 464
  components 454
  computational graph, viewing in TensorBoard 637-639
  critic 453, 454
  critic network 596-598
  implementing 461, 462
  overview 452, 453
  pendulum environment, creating with Gym 464
  used, for swinging up pendulum 636
  variables, defining 464
deep neural network 267
deep Q learning algorithm 686, 687
Deep Q Learning from Demonstrations (DQfD) 608, 609, 698
  algorithm 611, 612, 698, 699
  loss function 610, 611
Deep Q Network (DQN) 356, 357, 358, 454, 455, 626
  architecture 368, 369
  Atari games, playing with 368
  building 372
  loss function 361-363
  replay buffer 358-360
  target network 364-366
  training 375, 376
  used, for playing Atari games 632
  with prioritized experience replay 380, 381
  working 366, 367
  working with 369
deep recurrent Q network (DRQN) 388, 389
  architecture 389, 390, 391
Deep Reinforcement Learning (DRL) 38
delayed policy updates 474
dendrite 264
deterministic environment 36, 41
deterministic policy 20
direct dependency 321
discount factor 27, 40
  setting, to 0 28
  setting, to 1 29
discrete action space 18
discrete environment 37
discriminator loss function
  about 314
  final term 316
  first term 314, 315
  second term 315, 316
Distributed Distributional Deep Deterministic Policy Gradient (D4PG) 595, 697
  algorithm 600, 601, 697, 698
distributional Bellman equation 556
distributional reinforcement learning
  need for 552-554
double DQN 377-379
  algorithm 380, 687
double DQN loss 610
downsampling operation 304
DQfD, phases 609
  pre-training phase 609
  training phase 610
DQN algorithm 367, 368
DQN class
  defining 371
  epsilon-greedy policy, defining 373
  init method, defining 371
  target network, updating 374
  training, defining 374
  transition, storing 373
DQN variants
  implementing 633, 634
  used, for playing Atari games 632
DQN, with prioritized experience replay
  bias, correcting 383
dueling DQN 384, 385
  architecture 386, 387, 388
dynamic programming (DP) 192, 221
  about 97
  advantages 192
  applying, to environments 129, 130
  disadvantages 192
  policy iteration 115-118
  value iteration 97, 98
  versus Monte Carlo (MC) method 221
  versus TD control method 221

E

eager execution 343
entropy regularized reinforcement learning 485
environment 2, 36, 39
  categories 79
  synopsis 82
episode 23, 24, 40
  generating 58-61
  generating, in gym environment 56
episodic environment 38
episodic task 25, 40
epsilon-greedy 230, 231
  implementing 232-234
epsilon-greedy policy 683, 684
every-visit MC prediction algorithm 681
every-visit Monte Carlo
  episode, generating 161, 162
  policy, defining 160, 161
  value function, computing 162-166
  with blackjack game 160
expectation 16, 17
exploration-exploitation dilemma 177
exploration strategies 229
  epsilon-greedy 230, 231
  softmax exploration 234-238
  Thompson sampling (TS) 245-252
  upper confidence bound (UCB) 240-243

F

feature map 300
feed dictionaries 324, 325
feedforward network
  versus Recurrent Neural Networks (RNNs) 287, 288
filter 298
filter matrix 297
finite horizon 25
first-visit MC prediction algorithm 680
first-visit Monte Carlo
  with blackjack game 166, 167
forget gate 295
forward propagation 333
  about 273
  in ANNs 271-273
frozen lake problem
  solving, with policy iteration 125
  solving, with value iteration 107-110
frozen lake problem, solving with policy iteration 125
  policy, extracting from value function 127-129
  value function, computing 125-127
frozen lake problem, solving with value iteration
  about 107-110
  optimal policy, extracting from optimal value function 112-114
  optimal value function, computing 110-112
fully connected layer 305, 306
functional model 349, 350

G

gates 294
Gaussian policy 22
Gauss-Newton method
  reference link 548
Generative Adversarial Imitation Learning (GAIL) 617-619, 651
  formulation 619-621
  implementing 651
Generative Adversarial Networks (GANs) 307-309, 617
  architecture 313
  discriminative model 310
  generative model 309, 310
  generator and discriminative model, learning 311, 312
  loss function, examining 314
generative model, types
  explicit density model 309
  implicit density model 309
generator loss function 316
gradient descent 274
gym environment
  action selection 56-58
  Atari game environment 69, 70
  classic control environment 62, 63
  creating 47-49
  episode, generating 56-61
  exploring 50, 62
Gym toolkit
  error fixes 46
  installing 45
  used, for creating bandit environment 228

H

handwritten digit classification, using TensorFlow 330
  accuracy, computing 334, 335
  dataset, loading 330, 331
  forward propagation 333
  graphs, visualizing in TensorBoard 338-342
  loss and backpropagation, computing 334
  model, training 336, 337
  number of neurons, defining in each layer 331
  placeholders, defining 332, 333
  required libraries, importing 330
  summary, creating 335, 336
hidden layer 267
hierarchical reinforcement learning (HRL) 668
horizon 25, 40
hyperbolic tangent (tanh) 269

I

Imagination Augmented Agents (I2As) 672-676
importance sampling method 511
incremental mean updates 167
indirect dependency 322
infinite horizon 26
init method
  defining 441-444
inner loop 661
input gate 295
input layer 267
inverse cumulative distribution function (inverse CDF) 584-586
Inverse Reinforcement Learning (IRL) 612
  maximum entropy IRL 613

J

Jensen Shannon (JS) divergence 618

K

k-armed bandit 226
Keras 348
kernel 298
key features, TD3
  clipped double Q learning 475-477
  delayed policy updates 477-479
  target policy smoothing 479, 480
KL-constrained objective 515
KL penalized objective 514
Kronecker-Factored Approximate Curvature (K-FAC) 543-546
  applying, in actor critic method 546-548
Kronecker product 542
  properties 543
Kullback-Leibler (KL) 501

L

L2 regularization loss 611
lagrange multipliers 509-511
large discount factor 28
learning rate 276
linear approximation 506, 518
logistic function 268
Long Short-Term Memory (LSTM) 292, 673
Long Short-Term Memory Recurrent Neural Network (LSTM RNN) 389
loss function 273, 361-363
  computing 334
  examining 314
LSTM cell
  working 293-295
lunar lander environment
  with A2C 634, 635

M

machine
  setting up 44
  used, for installing Anaconda 44, 45
  used, for installing gym toolkit 45
Machine Learning (ML) 1
  versus Reinforcement Learning (RL) 9, 10
MAML algorithm
  in reinforcement learning setting 700
Markov chain 11, 12
Markov Decision Process (MDP) 10, 13, 50
  actions 51
  reward function 52-55
  states 50
  transition probability 52-55
Markov property 11
Markov Reward Process (MRP) 13
maximum entropy inverse reinforcement learning algorithm 699
maximum entropy IRL 613
  algorithm 617
  gradient, computing 615, 616
  key terms 613, 614
  working 614, 615
maximum entropy reinforcement learning 485
MAXQ value function decomposition 668-671
MC Control algorithm
  with epsilon-greedy policy 178, 179
MC control method algorithm 682
MC prediction algorithm 681, 682
mean squared error (MSE) 361, 419, 556
meta reinforcement learning 656
MNIST digit classification
  TensorFlow 2.0, using 351, 352
Model Agnostic Meta Learning (MAML) 657-662
  algorithm, in reinforcement learning setting 667
  algorithm, in supervised learning setting 664, 665
  in reinforcement learning setting 665-667
  in supervised learning setting 663, 664
model-based learning 36, 40
model-free learning 36, 40
Monte Carlo (MC) method 134, 221
  about 192, 425
  advantages 192, 193
  control algorithm 172-174
  control task 170-172
  disadvantages 193
  incremental mean updates 167
  off-policy control method 184-188
  on-policy control method 174
  prediction algorithm 140-144
  prediction algorithm, types 144
  prediction task, performing 136-140
  Q function, predicting 168-170
  tasks 188, 189
  versus dynamic programming (DP) 221
  versus TD control method 221
Monte Carlo (MC) prediction method
  blackjack environment, in Gym library 158, 159
  blackjack game 147-158
  every-visit Monte Carlo, with blackjack game 160
  first-visit Monte Carlo, with blackjack game 166, 167
  implementing 147
Monte Carlo policy gradient 407
mountain car environment
  creating 435
Mujoco environment
  about 80
  reference link 80
multi-agent environment 38
multi-armed bandit (MAB) 226-228
  applications 254, 255
  issues 226
  bandit environment, creating with Gym toolkit 228, 229
  exploration strategies 229
Multi-joint dynamics with contact (MuJoCo) 80, 639
  environment, installing 640-643

N

O

on-policy control method
  implementing 179, 180
  optimal policy, computing 181-184
  optimal policy, computing with SARSA 211-213
  with epsilon-greedy policy 176-178
  with exploration 174-176
on-policy MC control algorithm
  with epsilon-greedy policy 683, 684
  with starts method 683
on-policy TD control algorithm
  with SARSA 685
optimal policy
  computing 212, 213
  computing, with Q learning 218-220
  computing, with SARSA 211
Ornstein-Uhlenbeck random process 459
outer loop 661
output gate 295
output layer 267
P

PPO algorithm
  PPO-clipped method 525
  PPO-penalty method 525
PPO class
  action, selecting 534
  defining 530
  init method, defining 530-532
  policy network, building 533, 534
  state value, computing 534
  train function, defining 533
PPO-clipped algorithm 528
PPO-clipped method 525-528
  Gym environment, creating 529, 530
  implementing 529
  network, training 535-537
  PPO class, defining 530
PPO-clipped method algorithm 694
PPO-penalty algorithm 538
PPO-penalty method 537
  algorithm 695
prediction algorithm, Monte Carlo (MC) method
  every-visit Monte Carlo 146
  first-visit Monte Carlo 145
prediction task 135
prioritization, types
  proportional prioritization 382
  rank-based prioritization 383
prioritized experience replay
  DQN, using with 380, 381
probability density function (pdf) 404
probability mass function (pmf) 404
proportional prioritization 382
Proximal Policy Optimization (PPO) 524, 648
  used, for training cheetah bot 648
pull_from_global function 441

Q

Q function 33-35, 40, 681
Q learning 213-217, 686
  used, for computing optimal policy 218-220
  versus SARSA 220
Q network 358, 490, 491
QR-DQN, math essentials 584
  inverse CDF 584-586
  quantile 584
quadratic approximation 506, 507, 518
quantile 584
Quantile regression DQN (QR-DQN) 583-591
  action selection 591, 592
  loss function 592-595
quantile regression loss 594

R

rank-based prioritization 383
recording directory 78
Rectified Linear Unit (ReLU) function 269, 270
Recurrent Neural Networks (RNNs) 285, 286
  backpropagating, through time 290-292
  forward propagation 288-290
  versus feedforward network 287, 288
red, green, and blue (RGB) 296
REINFORCE algorithm
  with baseline 689
Reinforcement Learning (RL) 1
  action 2
  action space 18
  agent 2
  algorithm 679
  applications 38, 39
  basic idea 3, 4
  continuous tasks 25
  discount factor 27
  environment 2
  episode 23, 24
  episodic tasks 25
  expectation 16, 17
  horizon 25
  math essentials 16
  policy 19
  Q function 33-35
  return 26
  reward 3
  state 2
  value function 29-33
  versus Machine Learning (ML) 9, 10
REINFORCE method 408
  with baseline function 420
REINFORCE policy gradient algorithm 688
replay buffer 358-360
return 26, 40
return distribution 555
reward 3, 39
reward function 15, 16
reward-to-go 411
right-hand side (RHS) 406
RL agent
  in grid world 5-9
RL algorithm
  steps 5
robotics environment
  about 80, 81
  reference link 81

S

same padding 303
SARSA 685
select_action function 441
sequential model 349
sigmoid function 268
single-agent environment 38
small discount factor 27
Soft Actor-Critic (SAC) 484, 485, 692
  algorithm 495, 496, 692
  components 487
  defining 486
  implementing 493-495
  state-action value function, with entropy term 486, 487
  state value function, with entropy term 486, 487
softmax exploration 234-237
  implementing 238, 239, 240
softmax function 270, 271
Stable Baselines
  installing 626
  used, for creating agent 626, 627
starts method
  exploring 683
state 2, 39
State-Action-Reward-State-Action (SARSA) 206
  used, for computing optimal policy 211-213
  versus Q learning 221
states 14
state space 71, 73
stochastic environment 37, 41
stochastic policy 20, 21
  categorical policy 21
  Gaussian policy 22
stride 302
SubprocVecEnv 630
subsampling operation 304
supervised classification loss 611
supervised imitation learning 604, 605
synapse 264

T

tanh function 269
target network 364-366
target policy 184
target policy smoothing 475
Taylor series 502-506
TD control method 206
  on-policy control method 206-211
  versus dynamic programming (DP) 221
  versus Monte Carlo (MC) method 221
TD learning algorithm 192
TD prediction algorithm 196-202
  value of states, computing 203, 204
  value of states, evaluating 204, 205
  value of states, predicting 202
TD prediction method 193-195
  algorithm 685
temporal difference (TD) 221
TensorBoard 325, 326, 327
  computational graph, viewing 637-639
  graphs, visualizing in 338-342
  name scope, creating 327-329
TensorFlow 320
  math operations in 344-347
  using, for handwritten digit classification 330
TensorFlow 2.0 348
  using, for MNIST digit classification 351, 352
TensorFlow session 322, 323
Thompson sampling (TS) 245-252
  implementing 252-254
total loss function 317
toy text environment
  about 81
  reference link 81
trained agent
  evaluating 627
  GIF, creating 649, 650
  implementing 629
  loading 627
  storing 627
  viewing 628
transfer function 265-267
transition probability 14
TRPO algorithm 522, 523, 524
TRPO, math concepts 501
  conjugate gradient method 508, 509
  importance sampling method 511
  lagrange multipliers 509-511
  Taylor series 502-506
  trust region method 507, 508
TRPO objective function
  designing 512-514
  line search, performing in search direction 521, 522
  policies, parameterizing 514, 515
  sample-based estimation 515, 516
  search direction, computing 517-521
  solving 516, 517
trust region method 507, 508
Trust Region Policy Optimization (TRPO) 500, 501, 639, 693
  algorithm 693
  implementing 643, 645
  incorporating 549
  used, for training agent 639
Twin Delayed DDPG (TD3) 473, 474, 691
  algorithm 482-484, 691, 692
  implementing 480-482
  key features 474, 475

U

UCB function 244
update_global function 440
upper confidence bound (UCB) 240-243
  implementing 243-245

V

valid padding 304
value distribution 555
value function 29-33, 40
  and Q function, relationship 95, 96
  Bellman equation 86-92
value iteration
  algorithm 99, 679
  used, for solving frozen lake problem 107-110
  optimal policy, extracting from optimal value function 105-107
  optimal value function, computing 100-104
value network 488-490
vanishing gradients 292
variables 323
  defining 435, 436
variance reduction methods 408
  cart pole balancing, with policy gradient 412
  policy gradient, with baseline 417-419, 420
  policy gradient, with reward-to-go 409-411
vec operator 543
vectorized environments 629, 630
vectorized environments, types
  DummyVecEnv 631
  SubprocVecEnv 630
video
  recording 646, 647

W

worker class
  defining 441

Z

zero padding 303