Reinforcement Learning
Markov Decision Processes and Dynamic Programming
Natnael Argaw
Reinforcement learning
1
Lecture by Natnael Argaw (PhD). Original slides copyright Diana Borsa.
Background
Sutton & Barto 2018, Chapter 3 + 4
2
Recap
► Reinforcement learning is the science of learning to make decisions
► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and
consequences
► Decisions affect the reward, the agent state, and environment state
3
This Lecture
► Last lecture: multiple actions, but only one state—no model
► This lecture:
► Formalise the problem with full sequential structure
► Discuss a first class of solution methods, which assume the true model is given
► These methods are called dynamic programming
► Next lectures: use similar ideas, but with sampling instead of the true model
4
Formalising the RL interaction
5
Formalising the RL interface
► We will discuss a mathematical formulation of the agent-environment
interaction
► This is called a Markov Decision Process (MDP)
► Enables us to talk clearly about the objective and how to achieve it
6
MDPs: A simplifying assumption
► For now, assume the environment is fully observable:
⇒ the current observation contains all relevant information
► Note: Almost all RL problems can be formalised as MDPs,
e.g.,
► Optimal control primarily deals with continuous MDPs
► Partially observable problems can be converted into MDPs
► Bandits are MDPs with one state
7
Markov Decision Process
Definition (Markov Decision Process, Sutton & Barto 2018)
A Markov Decision Process is a tuple (S, A, p, γ), where
► S is the set of all possible states
► A is the set of all possible actions (e.g., motor controls)
► p(r, s′ | s, a) is the joint probability of a reward r and next state s′, given a state s and action a
► γ ∈ [0, 1] is a discount factor that trades off later rewards against earlier ones
Observations:
► p defines the dynamics of the problem
► Sometimes it is useful to marginalise out, to obtain the state-transition probabilities or the expected reward:
p(s′ | s, a) = Σr p(r, s′ | s, a),    r(s, a) = E[R | S = s, A = a] = Σr Σs′ r p(r, s′ | s, a)
8
Markov Decision Process: Alternative Definition
Definition (Markov Decision Process)
A Markov Decision Process is a tuple (S, A, p, r, γ), where
► S is the set of all possible states
► A is the set of all possible actions (e.g., motor controls)
► p(s′ | s, a) is the probability of transitioning to s′, given a state s and action a
► r : S × A → R is the expected reward achieved on a transition starting in (s, a):
r(s, a) = E[R | S = s, A = a]
► γ ∈ [0, 1] is a discount factor that trades off later rewards against earlier ones
Note: The two formulations are equivalent; no additional assumptions are needed with respect to the previous definition.
9
Markov Property: The future is independent of the past given the
present
Definition (Markov Property)
Consider a sequence of random variables, {St}t∈N, indexed by time. A state s has the Markov property when, for all states s′ ∈ S,
p(St+1 = s′ | St = s) = p(St+1 = s′ | St = s, Ht−1 = ht−1)
for all possible histories ht−1 = {S1, . . . , St−1, A1, . . . , At−1, R1, . . . , Rt−1}
In a Markov Decision Process all states are assumed to have the Markov property.
► The state captures all relevant information from the history.
► Once the state is known, the history may be thrown away.
► The state is a sufficient statistic of the past.
10
Markov Property in an MDP: Test your understanding
In a Markov Decision Process all states are assumed to have the Markov property.
Q: In an MDP this property implies: (Which of the following statements are true?)
11
Example: cleaning robot
► Consider a robot that collects empty soda cans
► Two states: high battery charge or low battery charge
► Actions: {wait, search} in high, {wait, search, recharge} in low
► Dynamics may be stochastic
► p(St+1 = high | St = high, At = search) = α
► p(St+1 = low | St = high, At = search) = 1 − α
► Reward could be expected number of collected cans (deterministic), or
actual number of collected cans (stochastic)
Reference: Sutton and Barto, Chapter 3, pg 52-53.
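As an illustration, these dynamics can be written down as a small table of (probability, next state, reward) triples. The sketch below follows the recycling-robot transitions from Sutton & Barto; the specific values of α, β and the rewards are illustrative placeholders, not taken from the slides. The same dictionary format is reused in the dynamic-programming sketches later in the lecture.

```python
# Hedged sketch of tabular dynamics p(s', r | s, a) for the can-collecting robot.
# alpha, beta and the reward values are illustrative placeholders.
alpha, beta = 0.8, 0.6           # assumed probabilities of keeping the current charge level
r_search, r_wait = 2.0, 1.0      # assumed expected numbers of cans collected

# p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],  # depleted: rescued, penalty
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

# Sanity check: outgoing probabilities for every (s, a) pair sum to one.
for (s, a), outcomes in p.items():
    assert abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
```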
12
Example: robot MDP
13
Example: robot MDP
14
Formalising the objective
15
Returns
► Acting in an MDP results in immediate rewards Rt+1, which accumulate into returns Gt:
► Undiscounted return (episodic / finite-horizon problem):
Gt = Rt+1 + Rt+2 + · · · + RT
Note: These are random variables that depend on the MDP and the policy
16
Discounted Return
► Discounted return Gt for infinite horizon T → ∞:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + · · · = Σk≥0 γ^k Rt+k+1
► The discount γ ∈ [0, 1] determines the present value of future rewards
► The marginal value of receiving reward R after k + 1 time-steps is γ^k R
► For γ < 1, immediate rewards are more important than delayed rewards
► γ close to 0 leads to "myopic" evaluation
► γ close to 1 leads to "far-sighted" evaluation
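To make the effect of γ concrete, here is a small sketch that evaluates Gt for a hand-picked reward sequence (the rewards are made up for illustration):

```python
# Discounted return G_t = sum_k gamma**k * R_{t+k+1} for a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]           # illustrative rewards R_{t+1}, ..., R_{t+5}
print(discounted_return(rewards, gamma=0.9))   # ~2.12: "far-sighted", later rewards still count
print(discounted_return(rewards, gamma=0.0))   # 0.0: "myopic", only R_{t+1} matters
```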
17
Why discount?
Most Markov decision processes are discounted. Why?
► Problem specification:
► Immediate rewards may actually be more valuable (e.g., consider earning
interest)
► Animal/human behaviour shows preference for immediate reward
► Solution side:
► Mathematically convenient to discount rewards
► Avoids infinite returns in cyclic Markov processes
► The way to think about it: reward and discount together determine the
goal
18
Policies
Goal of an RL agent
To find a behaviour policy that maximises the (expected) return Gt
► A policy is a mapping π : S × A → [0, 1] that, for every state s, assigns to each action a ∈ A the probability π(a|s) of taking that action in state s.
► For deterministic policies, we sometimes use the notation at = π(st ) to
denote the action taken by the policy.
19
Value Functions
► The value function v (s) gives the long-term value of state s
vπ (s) = E [Gt | St = s, π]
► We can define (state-)action values:
qπ (s, a) = E [Gt | St = s, At = a, π]
► (Connection between them) Note that:
vπ(s) = Σa π(a|s) qπ(s, a) = E[qπ(St, At) | St = s, π]
20
Optimal Value Function
Definition (Optimal value functions)
The optimal state-value function v∗(s) is the maximum value function over all policies:
v∗(s) = maxπ vπ(s)
The optimal action-value function q∗(s, a) is the maximum action-value function over all policies:
q∗(s, a) = maxπ qπ(s, a)
► The optimal value function specifies the best possible performance in the MDP
► An MDP is “solved” when we know the optimal value function
21
Optimal Policy
Define a partial ordering over policies
π ≥ π′ ⇐⇒ vπ(s) ≥ vπ′(s), ∀s
Theorem (Optimal Policies)
For any Markov decision process
► There exists an optimal policy π∗ that is better than or equal to all other policies,
π∗ ≥ π, ∀π
(There can be more than one such optimal policy.)
► All optimal policies achieve the optimal value function, vπ∗ (s) = v ∗(s)
► All optimal policies achieve the optimal action-value function, qπ∗ (s, a) = q∗(s, a)
22
Finding an Optimal Policy
An optimal policy can be found by maximising over q∗(s, a):
π∗(a|s) = 1 if a = argmaxa′∈A q∗(s, a′), and 0 otherwise
Observations:
► There is always a deterministic optimal policy for any MDP
► If we know q∗(s, a), we immediately have the optimal policy
► There can be multiple optimal policies
► If multiple actions maximise q∗(s, ·), we can pick any of them (including stochastically)
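A minimal sketch of greedy policy extraction from a q∗ table; the 3-state, 2-action values below are made up for illustration:

```python
import numpy as np

# Greedy policy extraction: pi*(s) = argmax_a q*(s, a).
q_star = np.array([[1.0, 0.5],
                   [0.2, 0.2],    # tie: any maximiser (or any mixture of them) is optimal
                   [0.0, 3.0]])

greedy_actions = q_star.argmax(axis=1)   # argmax picks the first maximiser on ties
print(greedy_actions)                    # [0 0 1]
```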
23
Bellman Equations
24
Value Function
► The value function v (s) gives the long-term value of state s
vπ (s) = E [Gt | St = s, π]
► It can be defined recursively:
vπ(s) = E[Rt+1 + γGt+1 | St = s, π]
= E[Rt+1 + γvπ(St+1) | St = s, π]
= Σa π(a|s) Σr Σs′ p(r, s′ | s, a) (r + γvπ(s′))
► The final step writes out the expectation explicitly
25
Action values
► We can define state-action values
qπ (s, a) = E [Gt | St = s, At = a, π]
► This implies
qπ(s, a) = E[Rt+1 + γvπ(St+1) | St = s, At = a] = Σr Σs′ p(r, s′ | s, a) (r + γvπ(s′))
► Note that
vπ(s) = Σa π(a|s) qπ(s, a)
26
Bellman Equations
Theorem (Bellman Expectation Equations)
Given an MDP, M = ⟨S, A, p, r, γ⟩, for any policy π, the value functions obey the following expectation equations:
vπ(s) = Σa π(a|s) [ r(s, a) + γ Σs′ p(s′ | s, a) vπ(s′) ]
qπ(s, a) = r(s, a) + γ Σs′ p(s′ | s, a) Σa′ π(a′|s′) qπ(s′, a′)
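Since the Bellman expectation equation for vπ is linear in vπ, a small tabular MDP can be evaluated exactly by solving the linear system vπ = rπ + γ Pπ vπ. The sketch below uses a made-up 2-state MDP (Pπ and rπ are assumptions for illustration):

```python
import numpy as np

# Exact policy evaluation: solve v = r_pi + gamma * P_pi v, i.e. (I - gamma P_pi) v = r_pi.
gamma = 0.9
P_pi = np.array([[0.7, 0.3],
                 [0.4, 0.6]])    # state-to-state transition matrix under the policy pi
r_pi = np.array([1.0, -1.0])     # expected immediate reward in each state under pi

v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_pi)                      # exact v_pi for this illustrative MDP
```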
27
The Bellman Optimality Equations
Theorem (Bellman Optimality Equations)
Given an MDP, M = ⟨S, A, p, r, γ⟩, the optimal value functions obey the following equations:
v∗(s) = maxa [ r(s, a) + γ Σs′ p(s′ | s, a) v∗(s′) ]
q∗(s, a) = r(s, a) + γ Σs′ p(s′ | s, a) maxa′ q∗(s′, a′)
There can be no policy with a higher value than v∗(s) = maxπ vπ(s), ∀s
28
Some intuition
(Reminder) Greedy on v ∗ = Optimal Policy
► An optimal policy can be found by maximising over q∗(s, a):
π∗(a|s) = 1 if a = argmaxa′∈A q∗(s, a′), and 0 otherwise
29
Solving RL problems using the Bellman
Equations
30
Problems in RL
► Problem 1: Estimating vπ or qπ is called policy evaluation or, simply, prediction
► Given a policy, what is my expected return under that behaviour?
► Given this treatment protocol/trading strategy, what is my expected return?
► Problem 2: Estimating v∗ or q∗ is sometimes called control, because these values can be used for policy optimisation
► What is the optimal way of behaving? What is the optimal value function?
► What is the optimal treatment? What is the optimal control policy to
minimise time, fuel consumption, etc?
31
Exercise:
► Consider the following MDP:
► The actions have a 0.9 probability of success and a 0.1 probability of remaining in the same state
► Rt = 0 for all transitions that end up in S0, and Rt = −1 for all other transitions
32
Exercise: (pause to work this out)
► The actions have a 0.9 probability of success and a 0.1 probability of remaining in the same state
► Rt = 0 for all transitions that end up in S0, and Rt = −1 for all other transitions
► Q: Evaluation problems (Consider a discount γ = 0.9)
► What is vπ for π(s) = a1(→), ∀s?
► What is vπ for the uniformly random policy?
► Same policy evaluation problems for γ = 0.0? (What do you notice?)
33
A solution
34
What is vπ for π(s) = a1(→), ∀s?
35
Solving the Bellman Optimality Equation
► The Bellman optimality equation is non-linear
► Many iterative solution methods:
► Using models / dynamic programming
► Value iteration
► Policy iteration
► Using samples
► Monte Carlo
► Q-learning
► Sarsa
36
Dynamic Programming
37
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to
shield the Air Force from the fact that I was really doing mathematics.
What title, what name, could I choose? I was interested in planning, in
decision making, in thinking. But planning is not a good word for various
reasons. I decided to use the word ‘programming.’ I wanted to get
across the idea that this was dynamic, this was time-varying—I thought,
let’s kill two birds with one stone. Let’s take a word that has a precise
meaning, namely dynamic, in the classical physical sense. It also is
impossible to use the word, dynamic, in a pejorative sense. Try thinking
of some combination that will possibly give it a pejorative meaning. It’s
impossible. Thus, I thought dynamic programming was a good name. It
was something not even a Congressman could object to. So I used it as
an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness)
38
Dynamic programming
Dynamic programming refers to a collection of algorithms that can be
used to compute optimal policies given a perfect model of the
environment as a Markov decision process (MDP).
Sutton & Barto 2018
► We will discuss several dynamic programming methods to solve MDPs
► All such methods consist of two important parts:
policy evaluation and policy improvement
39
Policy evaluation
► We start by discussing how to estimate
vπ(s) = E[Rt+1 + γvπ(St+1) | St = s, π]
► Idea: turn this equality into an update
Algorithm
► First, initialise v0, e.g., to zero
► Then, iterate
∀s : vk+1(s) ← E[Rt+1 + γvk(St+1) | St = s, π]
► Stopping: whenever vk+1(s) = vk (s), for all s, we must have found vπ
► Q: Does this algorithm always converge?
Answer: Yes, under appropriate conditions (e.g., γ < 1).
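A sketch of this iterative scheme in code, assuming the hypothetical tabular format p[(s, a)] = [(probability, next_state, reward), ...] introduced with the robot example, and a stochastic policy stored as pi[s] = {action: probability}:

```python
# Iterative policy evaluation (synchronous sweeps), as a sketch.
def policy_evaluation(states, p, pi, gamma, theta=1e-8):
    v = {s: 0.0 for s in states}                      # v_0 initialised to zero
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            # v_{k+1}(s) = sum_a pi(a|s) sum_{s', r} p(r, s'|s, a) (r + gamma v_k(s'))
            v_new[s] = sum(pi_a * prob * (r + gamma * v[s_next])
                           for a, pi_a in pi[s].items()
                           for prob, s_next, r in p[(s, a)])
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:                             # values stopped changing: v ~ v_pi
            return v
```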
40
Example: Policy evaluation
41
Policy evaluation
42
Policy evaluation
43
Policy evaluation + Greedy Improvement
44
Policy evaluation + Greedy Improvement
45
Policy Improvement
► The example already shows we can use evaluation to then improve our policy
► In fact, just being greedy with respect to the values of the random policy
sufficed! (That is not true in general)
Algorithm
Iterate, using
πnew(s) = argmaxa qπ(s, a) = argmaxa E[Rt+1 + γvπ(St+1) | St = s, At = a]
Then, evaluate πnew and repeat
► Claim: One can show that vπnew(s) ≥ vπ(s), for all s
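A sketch of one greedy improvement step under the same hypothetical data format; actions(s) is assumed to return the legal actions in state s:

```python
# Greedy policy improvement with respect to v: pi_new(s) = argmax_a E[R + gamma v(S') | s, a].
def greedy_improvement(states, actions, p, v, gamma):
    pi_new = {}
    for s in states:
        q = {a: sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[(s, a)])
             for a in actions(s)}
        best = max(q, key=q.get)          # any maximiser would do
        pi_new[s] = {best: 1.0}           # deterministic greedy policy
    return pi_new
```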
46
Policy Iteration
Policy evaluation: estimate vπ
Policy improvement: generate π′ ≥ π
47
Policy Iteration
► Does policy evaluation need to converge to vπ ?
► Or should we stop when we are ‘close’?
(E.g., with a threshold on the change to the values)
► Or simply stop after k iterations of iterative policy evaluation?
► In the small gridworld, k = 3 was sufficient to reach the optimal policy
► Extreme case: why not update the policy after every iteration, i.e., stop after k = 1?
► This is equivalent to value iteration
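Putting evaluation and greedy improvement together gives policy iteration; this sketch reuses the policy_evaluation and greedy_improvement functions from the earlier sketches and stops when the greedy policy no longer changes:

```python
# Policy iteration: alternate evaluation and greedy improvement until the policy is stable.
def policy_iteration(states, actions, p, gamma):
    pi = {s: {a: 1.0 / len(actions(s)) for a in actions(s)} for s in states}  # uniform start
    while True:
        v = policy_evaluation(states, p, pi, gamma)
        pi_new = greedy_improvement(states, actions, p, v, gamma)
        if pi_new == pi:                  # stable greedy policy => optimal policy
            return pi, v
        pi = pi_new
```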
48
Value Iteration
► We could take the Bellman optimality equation, and turn that into an
update
► This is equivalent to policy iteration, with k = 1 step of policy evaluation
between each two (greedy) policy improvement steps
Algorithm: Value Iteration
► Initialise v0
► Update: vk+1(s) ← maxa E[Rt+1 + γvk(St+1) | St = s, At = a]
► Stopping: whenever vk+1(s) = vk (s), for all s, we must have found v ∗
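A sketch of value iteration under the same hypothetical dynamics format; in practice the loop is stopped when the values change by less than a small threshold rather than when they are exactly equal:

```python
# Value iteration: v_{k+1}(s) = max_a E[R_{t+1} + gamma v_k(S_{t+1}) | S_t = s, A_t = a].
def value_iteration(states, actions, p, gamma, theta=1e-8):
    v = {s: 0.0 for s in states}
    while True:
        v_new = {s: max(sum(prob * (r + gamma * v[s_next])
                            for prob, s_next, r in p[(s, a)])
                        for a in actions(s))
                 for s in states}
        if max(abs(v_new[s] - v[s]) for s in states) < theta:
            return v_new
        v = v_new
```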
49
Example: Shortest Path
(4×4 gridworld; g is the terminal goal state; every step has reward −1; V1 is the initial value function and each subsequent Vk is obtained by one more sweep of value iteration)

Problem        V1             V2              V3
g  .  .  .     0  0  0  0     0 -1 -1 -1      0 -1 -2 -2
.  .  .  .     0  0  0  0    -1 -1 -1 -1     -1 -2 -2 -2
.  .  .  .     0  0  0  0    -1 -1 -1 -1     -2 -2 -2 -2
.  .  .  .     0  0  0  0    -1 -1 -1 -1     -2 -2 -2 -2

V4             V5             V6              V7
 0 -1 -2 -3     0 -1 -2 -3     0 -1 -2 -3      0 -1 -2 -3
-1 -2 -3 -3    -1 -2 -3 -4    -1 -2 -3 -4     -1 -2 -3 -4
-2 -3 -3 -3    -2 -3 -4 -4    -2 -3 -4 -5     -2 -3 -4 -5
-3 -3 -3 -3    -3 -4 -4 -4    -3 -4 -5 -5     -3 -4 -5 -6
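The grids above can be reproduced with a few lines of value iteration, assuming the usual rules for this example: a 4×4 grid, the top-left cell is a terminal goal, every step costs −1, γ = 1, and moves that would leave the grid keep the agent in place (these rules are assumptions consistent with the numbers shown, not stated on the slide):

```python
import numpy as np

N, GOAL = 4, (0, 0)                        # 4x4 grid, goal in the top-left corner
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

v = np.zeros((N, N))                       # V1: initial value function
print("V1:\n", v)
for k in range(2, 8):                      # compute V2 ... V7
    v_new = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            if (i, j) == GOAL:
                continue                   # terminal goal state keeps value 0
            v_new[i, j] = max(             # v_{k+1}(s) = max_a (-1 + v_k(s'))
                -1 + v[min(max(i + di, 0), N - 1), min(max(j + dj, 0), N - 1)]
                for di, dj in moves)
    v = v_new
    print(f"V{k}:\n{v}")
```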
50
Synchronous Dynamic Programming Algorithms
Problem      Bellman Equation                                              Algorithm
Prediction   Bellman Expectation Equation                                  Iterative Policy Evaluation
Control      Bellman Expectation Equation + (Greedy) Policy Improvement    Policy Iteration
Control      Bellman Optimality Equation                                   Value Iteration
Observations:
► Algorithms are based on the state-value function vπ(s) or v∗(s) ⇒ complexity O(|A| |S|²) per iteration, for |A| actions and |S| states
► Could also apply to the action-value function qπ(s, a) or q∗(s, a) ⇒ complexity O(|A|² |S|²) per iteration
51
Extensions to Dynamic Programming
52
Asynchronous Dynamic Programming
► DP methods described so far used synchronous updates (all states in
parallel)
► Asynchronous DP
► backs up states individually, in any order
► can significantly reduce computation
► guaranteed to converge if all states continue to be selected
53
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
► In-place dynamic programming: no additional data structures or memory
► Prioritised sweeping
► Real-time dynamic programming
54
In-Place Dynamic Programming
► Synchronous value iteration stores two copies of the value function
► In-place value iteration only stores one copy of the value function:
for all s in S : v(s) ← maxa E[Rt+1 + γv(St+1) | St = s, At = a]
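A sketch of the in-place variant under the same hypothetical dynamics format; the only difference from the synchronous version is that there is a single value table, so later updates within a sweep already see the new values:

```python
# In-place value iteration: one value table, overwritten while sweeping the states.
def in_place_value_iteration(states, actions, p, gamma, sweeps):
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:                  # any sweep order is allowed
            v[s] = max(sum(prob * (r + gamma * v[s_next])
                           for prob, s_next, r in p[(s, a)])
                       for a in actions(s))
    return v
```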
55
Prioritised Sweeping
► Use the magnitude of the Bellman error to guide state selection, e.g.
|maxa E[Rt+1 + γv(St+1) | St = s, At = a] − v(s)|
► Backup the state with the largest remaining Bellman error
► Update Bellman error of affected states after each backup
► Requires knowledge of reverse dynamics (predecessor
states)
► Can be implemented efficiently by maintaining a priority
queue
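One possible way to realise this, sketched under the same hypothetical dynamics format: keep a priority queue keyed by the magnitude of the Bellman error, back up the most urgent state, and re-queue its predecessors (the "reverse dynamics"). Stale queue entries are tolerated in this sketch because the error is recomputed at backup time.

```python
import heapq

# Prioritised sweeping sketch: back up the state with the largest Bellman error first.
def prioritised_sweeping(states, actions, p, gamma, n_backups):
    v = {s: 0.0 for s in states}
    preds = {s: set() for s in states}              # reverse dynamics: predecessors of each state
    for (s, a), outcomes in p.items():
        for _, s_next, _ in outcomes:
            preds[s_next].add(s)

    def bellman_error(s):
        best = max(sum(prob * (r + gamma * v[s_next])
                       for prob, s_next, r in p[(s, a)])
                   for a in actions(s))
        return best - v[s]

    queue = [(-abs(bellman_error(s)), s) for s in states]
    heapq.heapify(queue)
    for _ in range(n_backups):
        _, s = heapq.heappop(queue)                 # state with (approximately) largest error
        v[s] += bellman_error(s)                    # back it up: v(s) <- max_a E[R + gamma v(S')]
        for s_prev in preds[s]:                     # predecessors' errors may have changed
            heapq.heappush(queue, (-abs(bellman_error(s_prev)), s_prev))
    return v
```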
56
Real-Time Dynamic Programming
► Idea: only update states that are relevant to agent
► E.g., if the agent is in state St, update that state's value, or the values of states it expects to visit soon
57
Full-Width Backups
► Standard DP uses full-width backups
► For each backup (sync or async)
► Every successor state and action is considered
► Using the true model of transitions and the reward function
► DP is effective for medium-sized problems (millions of states)
► For large problems DP suffers from the curse of dimensionality
► Number of states n = |S| grows exponentially with number of
state variables
► Even one full backup can be too expensive
58
Sample Backups
► In subsequent lectures we will consider sample backups
► Using sample rewards and sample transitions ⟨s, a, r, s′⟩
(instead of the reward function r and transition dynamics p)
► Advantages
► Model-free: no advance knowledge of the MDP required
► Breaks the curse of dimensionality through sampling
► Cost of a backup is constant, independent of n = |S|
59
Summary
60
What have we covered today?
► Markov Decision Processes
► Objectives in an MDP: different notion of return
► Value functions: expected returns, conditioned on state (and action)
► Optimality principles in MDPs: optimal value functions and optimal policies
► Bellman Equations
► Two classes of problems in RL: evaluation and control
► How to compute vπ (aka solve an evaluation/prediction problem)
► How to compute the optimal value function via dynamic programming:
► Policy Iteration
► Value Iteration
61
Questions?
The only stupid question is the one you were afraid to ask but never did.
– Rich Sutton
For questions that may arise during this lecture please use Moodle and/or the next
Q&A session.
62