
Reinforcement Learning

Markov Decision Processes and Dynamic Programming

Natnael Argaw

Reinforcement learning

1
Lecture by Natnael Argaw (PhD). Original slides © Diana Borsa
Background

Sutton & Barto 2018, Chapter 3 + 4

2
Recap

► Reinforcement learning is the science of learning to make decisions


► Agents can learn a policy, value function and/or a model
► The general problem involves taking into account time and
consequences
► Decisions affect the reward, the agent state, and environment state
3
This Lecture

► Last lecture: multiple actions, but only one state—no model


► This lecture:
► Formalise the problem with full sequential structure
► Discuss first class of solution methods which assume true model is
given
► These methods are called dynamic programming
► Next lectures: use similar ideas, but use sampling instead of true
model

4
Formalising the RL interaction

5
Formalising the RL interface

► We will discuss a mathematical formulation of the agent-environment


interaction
► This is called a Markov Decision Process (MDP)
► Enables us to talk clearly about the objective and how to achieve it

6
MDPs: A simplifying assumption

► For now, assume the environment is fully observable:


⇒ the current observation contains all relevant information

► Note: Almost all RL problems can be formalised as MDPs,


e.g.,
► Optimal control primarily deals with continuous MDPs
► Partially observable problems can be converted into MDPs
► Bandits are MDPs with one state

7
Markov Decision Process
Definition (Markov Decision Process - Sutton & Barto 2018 )
A Markov Decision Process is a tuple (S, A, p, γ), where
► S is the set of all possible states
► A is the set of all possible actions (e.g., motor controls)
► p(r, s′ | s, a) is the joint probability of a reward r and next state s′, given a state s and action a
► γ ∈ [0, 1] is a discount factor that trades off later rewards against earlier ones

Observations:
► p defines the dynamics of the problem
► Sometimes it is useful to marginalise out the state transitions or expected
reward:
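p(s′ | s, a) = Σ_r p(r, s′ | s, a) ,   r(s, a) = E[R | s, a] = Σ_r r Σ_{s′} p(r, s′ | s, a)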

8
Markov Decision Process: Alternative Definition

Definition (Markov Decision Process)


A Markov Decision Process is a tuple (S, A, p, r, γ), where
► S is the set of all possible states
► A is the set of all possible actions (e.g., motor controls)
► p(s′ | s, a) is the probability of transitioning to s′, given a state s and action a
► r : S × A → R is the expected reward on a transition starting in (s, a):
r(s, a) = E[R | s, a]
► γ ∈ [0, 1] is a discount factor that trades off later rewards against earlier ones

Note: These are equivalent formulations; no additional assumptions are made w.r.t. the previous definition.
9
Markov Property: The future is independent of the past given the
present
Definition (Markov Property)
Consider a sequence of random variables, {St}t∈N, indexed by time. A state s has the Markov property when, for all states s′ ∈ S,

p(St+1 = s′ | St = s) = p(St+1 = s′ | St = s, ht−1)

for all possible histories ht−1 = {S1, . . . , St−1, A1, . . . , At−1, R1, . . . , Rt−1}

In a Markov Decision Process all states are assumed to have the Markov property.
► The state captures all relevant information from the history.
► Once the state is known, the history may be thrown away.
► The state is a sufficient statistic of the past.

10
Markov Property in an MDP: Test your understanding

In a Markov Decision Process all states are assumed to have the Markov property.

Q: In an MDP this property implies: (Which of the following statements are true?)

11
Example: cleaning robot

► Consider a robot that collects empty soda cans


► Two states: high battery charge or low battery charge
► Actions: {wait, search} in high, {wait, search, recharge} in low
► Dynamics may be stochastic
► p(St+1 = high | St = high, At = search) = α
► p(St+1 = low | St = high, At = search) = 1 − α
► Reward could be expected number of collected cans (deterministic), or
actual number of collected cans (stochastic)

Reference: Sutton and Barto, Chapter 3, pg 52-53.
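To make the dynamics concrete, here is a minimal sketch of how p(s′, r | s, a) for this robot might be encoded as a lookup table in Python. The values of α, β, the rewards r_search, r_wait, and the −3 "rescue" penalty are assumptions in the spirit of the Sutton & Barto recycling-robot example, not values given on this slide.

```python
# A sketch of the recycling-robot dynamics as a lookup table.
# Keys: (state, action); values: list of (probability, next_state, reward) triples.
# alpha, beta, r_search, r_wait and the -3 "rescue" penalty are assumed parameters.
alpha, beta = 0.8, 0.6
r_search, r_wait = 2.0, 1.0

p = {
    ("high", "search"):  [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):    [(1.0, "high", r_wait)],
    ("low", "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],
    ("low", "wait"):     [(1.0, "low", r_wait)],
    ("low", "recharge"): [(1.0, "high", 0.0)],
}

actions = {"high": ["search", "wait"], "low": ["search", "wait", "recharge"]}

# Expected reward r(s, a), obtained by marginalising over next states and rewards:
def expected_reward(s, a):
    return sum(prob * r for prob, s_next, r in p[(s, a)])
```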

12
Example: robot MDP

13
Example: robot MDP

14
Formalising the objective

15
Returns
► Acting in an MDP results in immediate rewards Rt, which lead to returns Gt :
► Undiscounted return (episodic / finite-horizon problem)
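Gt = Rt+1 + Rt+2 + · · · + RT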

Note: These are random variables that depend on the MDP and on the policy

16
Discounted Return

► Discounted returns Gt for infinite horizon T → ∞:
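Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + · · · = Σ_{k=0}^{∞} γ^k Rt+k+1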

► The discount γ ∈ [0, 1] is the present value of future rewards

► The value of receiving reward R after k + 1 time-steps is γ^k R
► For γ < 1, immediate rewards are more important than delayed rewards
► γ close to 0 leads to "myopic" evaluation
► γ close to 1 leads to "far-sighted" evaluation

17
Why discount?

Most Markov decision processes are discounted. Why?


► Problem specification:
► Immediate rewards may actually be more valuable (e.g., consider earning
interest)
► Animal/human behaviour shows preference for immediate reward
► Solution side:
► Mathematically convenient to discount rewards
► Avoids infinite returns in cyclic Markov processes
► The way to think about it: reward and discount together determine the
goal

18
Policies

Goal of an RL agent
To find a behaviour policy that maximises the (expected) return Gt

► A policy is a mapping π : S × A → [0, 1] that, for every state s, assigns to each action a ∈ A the probability π(a|s) of taking that action in state s.
► For deterministic policies, we sometimes use the notation at = π(st ) to
denote the action taken by the policy.
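As a minimal sketch, a stochastic policy over a small, finite MDP can be stored as a table of probabilities π(a|s); the states and probabilities below are placeholders.

```python
import random

# A tabular stochastic policy: pi[s][a] = probability of taking action a in state s.
pi = {
    "high": {"search": 0.7, "wait": 0.3},
    "low":  {"search": 0.1, "wait": 0.4, "recharge": 0.5},
}

def sample_action(pi, s):
    """Draw an action a ~ pi(.|s)."""
    acts, probs = zip(*pi[s].items())
    return random.choices(acts, weights=probs, k=1)[0]

# A deterministic policy is the special case where one action has probability 1;
# it can equivalently be written as a mapping s -> a, i.e. a_t = pi(s_t).
```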

19
Value Functions

► The value function vπ(s) gives the long-term value of state s under policy π

vπ (s) = E [Gt | St = s, π]

► We can define (state-)action values:

qπ (s, a) = E [Gt | St = s, At = a, π]

► (Connection between them) Note that:
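vπ(s) = Σ_a π(a|s) qπ(s, a) = E[qπ(St, At) | St = s, π]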

20
Optimal Value Function

Definition (Optimal value functions)


The optimal state-value function v ∗(s) is the maximum value function over all
policies
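v∗(s) = max_π vπ(s)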

The optimal action-value function q∗(s, a) is the maximum action-value function


over all policies
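q∗(s, a) = max_π qπ(s, a)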

► The optimal value function specifies the best possible performance in the MDP
► An MDP is “solved” when we know the optimal value function

21
Optimal Policy

Define a partial ordering over policies

π ≥ π′ ⇐⇒ vπ(s) ≥ vπ′(s), ∀s

Theorem (Optimal Policies)


For any Markov decision process
► There exists an optimal policy π∗ that is better than or equal to all other policies,
π∗ ≥ π, ∀π
(There can be more than one such optimal policy.)
► All optimal policies achieve the optimal value function, vπ∗ (s) = v ∗(s)
► All optimal policies achieve the optimal action-value function, qπ∗ (s, a) = q∗(s, a)

22
Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a),
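π∗(s) = argmax_a q∗(s, a)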

Observations:
► There is always a deterministic optimal policy for any MDP
► If we know q∗(s, a), we immediately have the optimal policy
► There can be multiple optimal policies
► If multiple actions maximize q∗(s, ·), we can also just pick any of
these (including stochastically)

23
Bellman Equations

24
Value Function

► The value function vπ(s) gives the long-term value of state s under policy π

vπ (s) = E [Gt | St = s, π]

► It can be defined recursively:
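vπ(s) = E[Gt | St = s, π]
      = E[Rt+1 + γ Gt+1 | St = s, π]
      = E[Rt+1 + γ vπ(St+1) | St = s, π]
      = Σ_a π(a|s) Σ_{s′,r} p(r, s′ | s, a) ( r + γ vπ(s′) )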

► The final step writes out the expectation explicitly

25
Action values
► We can define state-action values

qπ (s, a) = E [Gt | St = s, At = a, π]

► This implies
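qπ(s, a) = E[Rt+1 + γ Gt+1 | St = s, At = a, π] = E[Rt+1 + γ vπ(St+1) | St = s, At = a]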

► Note that
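vπ(s) = E[qπ(St, At) | St = s, π] = Σ_a π(a|s) qπ(s, a)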

26
Bellman Equations

Theorem (Bellman Expectation Equations)


Given an MDP, M = ⟨S, A, p, r , γ⟩, for any policy π, the value functions obey
the following expectation equations:
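vπ(s) = Σ_a π(a|s) ( r(s, a) + γ Σ_{s′} p(s′ | s, a) vπ(s′) )
qπ(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) Σ_{a′} π(a′ | s′) qπ(s′, a′)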

27
The Bellman Optimality Equations

Theorem (Bellman Optimality Equations)


Given an MDP, M = ⟨S, A, p, r, γ⟩, the optimal value functions obey the following optimality equations:
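v∗(s) = max_a ( r(s, a) + γ Σ_{s′} p(s′ | s, a) v∗(s′) )
q∗(s, a) = r(s, a) + γ Σ_{s′} p(s′ | s, a) max_{a′} q∗(s′, a′)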

There can be no policy with a higher value than v∗(s) = maxπ vπ (s), ∀s

28
Some intuition
(Reminder) Greedy on v ∗ = Optimal Policy
► An optimal policy can be found by maximising over q∗(s, a),
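π∗(s) = argmax_a q∗(s, a) = argmax_a ( r(s, a) + γ Σ_{s′} p(s′ | s, a) v∗(s′) )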

29
Solving RL problems using the Bellman
Equations

30
Problems in RL

► Pb1: Estimating vπ or qπ is called policy evaluation or, simply, prediction


► Given a policy, what is my expected return under that behaviour?
► Given this treatment protocol/trading strategy, what is my expected return?

► Pb2 : Estimating v∗ or q∗ is sometimes called control, because these can be


used for policy optimisation
► What is the optimal way of behaving? What is the optimal value function?
► What is the optimal treatment? What is the optimal control policy to
minimise time, fuel consumption, etc?

31
Exercise:

► Consider the following MDP:

► The actions have a 0.9 probability of success and a 0.1 probability of remaining in the same state
► Rt = 0 for all transitions that end up in S0, and Rt = −1 for all other transitions

32
Exercise: (pause to work this out)

► The actions have a 0.9 probability of success and a 0.1 probability of remaining in the same state
► Rt = 0 for all transitions that end up in S0, and Rt = −1 for all other transitions

► Q: Evaluation problems (Consider a discount γ = 0.9)


► What is vπ for π(s) = a1(→), ∀s?
► What is vπ for the uniformly random policy?
► Same policy evaluation problems for γ = 0.0? (What do you notice?)

33
A solution

34
What is vπ for π(s) = a1(→), ∀s?

35
Solving the Bellman Optimality Equation

► The Bellman optimality equation is non-linear

► Many iterative solution methods:


► Using models / dynamic programming
► Value iteration
► Policy iteration
► Using samples
► Monte Carlo
► Q-learning
► Sarsa

36
Dynamic Programming

37
Dynamic Programming
The 1950s were not good years for mathematical research. I felt I had to
shield the Air Force from the fact that I was really doing mathematics.
What title, what name, could I choose? I was interested in planning, in
decision making, in thinking. But planning is not a good word for various
reasons. I decided to use the word ‘programming.’ I wanted to get
across the idea that this was dynamic, this was time-varying—I thought,
let’s kill two birds with one stone. Let’s take a word that has a precise
meaning, namely dynamic, in the classical physical sense. It also is
impossible to use the word, dynamic, in a pejorative sense. Try thinking
of some combination that will possibly give it a pejorative meaning. It’s
impossible. Thus, I thought dynamic programming was a good name. It
was something not even a Congressman could object to. So I used it as
an umbrella for my activities.
– Richard Bellman
(slightly paraphrased for conciseness)
38
Dynamic programming

Dynamic programming refers to a collection of algorithms that can be


used to compute optimal policies given a perfect model of the
environment as a Markov decision process (MDP).

Sutton & Barto 2018


► We will discuss several dynamic programming methods to solve MDPs
► All such methods consist of two important parts:
policy evaluation and policy improvement

39
Policy evaluation

► We start by discussing how to estimate

vπ(s) = E[Rt+1 + γ vπ(St+1) | St = s, π]

► Idea: turn this equality into an update


Algorithm
► First, initialise v0, e.g., to zero
► Then, iterate
∀s : vk+1(s) ← E[Rt+1 + γ vk(St+1) | St = s, π]

► Stopping: whenever vk+1(s) = vk (s), for all s, we must have found vπ


► Q: Does this algorithm always converge?
Answer : Yes, under appropriate conditions (e.g., γ < 1).
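The update above translates directly into code. Below is a minimal sketch assuming the MDP is given as the tables used in the robot example earlier: p[(s, a)] lists (probability, next_state, reward) triples, actions[s] lists the available actions, and pi[s][a] is the policy being evaluated.

```python
def policy_evaluation(states, actions, p, pi, gamma, theta=1e-8):
    """Iterative policy evaluation: sweep the expectation update until the values stop changing."""
    v = {s: 0.0 for s in states}              # initialise v0, e.g. to zero
    while True:
        delta, v_new = 0.0, {}
        for s in states:                      # synchronous sweep over all states
            v_new[s] = sum(
                pi[s][a] * sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:                     # values (almost) stopped changing: v ~ v_pi
            return v
```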

40
Example: Policy evaluation

41
Policy evaluation

42
Policy evaluation

43
Policy evaluation + Greedy Improvement

44
Policy evaluation + Greedy Improvement

45
Policy Improvement
► The example already shows we can use evaluation to then improve our policy
► In fact, just being greedy with respect to the values of the random policy
sufficed! (That is not true in general)

Algorithm
Iterate, using
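πnew(s) = argmax_a qπ(s, a) = argmax_a E[Rt+1 + γ vπ(St+1) | St = s, At = a]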

Then, evaluate πnew and repeat

► Claim: One can show that vπnew (s) ≥ vπ (s), for all s
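A sketch of the greedy improvement step, under the same assumed table encoding; alternating it with the policy_evaluation sketch above gives policy iteration.

```python
def greedy_improvement(states, actions, p, v, gamma):
    """Return the deterministic policy that is greedy with respect to v."""
    pi_new = {}
    for s in states:
        q = {a: sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
             for a in actions[s]}
        best = max(q, key=q.get)              # arg max over actions
        pi_new[s] = {a: (1.0 if a == best else 0.0) for a in actions[s]}
    return pi_new

def policy_iteration(states, actions, p, pi, gamma):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    while True:
        v = policy_evaluation(states, actions, p, pi, gamma)
        pi_new = greedy_improvement(states, actions, p, v, gamma)
        if pi_new == pi:
            return pi, v
        pi = pi_new
```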
46
Policy Iteration

Policy evaluation Estimate vπ


Policy improvement Generate π′ ≥ π

47
Policy Iteration

► Does policy evaluation need to converge to vπ ?

► Or should we stop when we are ‘close’?


(E.g., with a threshold on the change to the values)
► Or simply stop after k iterations of iterative policy evaluation?
► In the small gridworld k = 3 was sufficient to achieve optimal policy

► Extreme: Why not update policy every iteration — i.e. stop after k = 1?
► This is equivalent to value iteration

48
Value Iteration

► We could take the Bellman optimality equation, and turn that into an
update

► This is equivalent to policy iteration, with k = 1 step of policy evaluation


between each two (greedy) policy improvement steps

Algorithm: Value Iteration

► Initialise v0
► Update: vk+1(s) ← max_a E[Rt+1 + γ vk(St+1) | St = s, At = a]
► Stopping: whenever vk+1(s) = vk (s), for all s, we must have found v ∗
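A minimal sketch of value iteration in the same assumed table encoding of the model:

```python
def value_iteration(states, actions, p, gamma, theta=1e-8):
    """Apply the Bellman optimality update to every state until convergence."""
    v = {s: 0.0 for s in states}              # initialise v0
    while True:
        delta, v_new = 0.0, {}
        for s in states:
            v_new[s] = max(                   # max over actions of the one-step lookahead
                sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                for a in actions[s]
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            return v                          # v is (approximately) v*
```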

49
Example: Shortest Path
Problem: a 4×4 gridworld, goal state g in the top-left corner, reward −1 per step (0 at the goal), undiscounted.

V1:
 0  0  0  0
 0  0  0  0
 0  0  0  0
 0  0  0  0

V2:
 0 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1
-1 -1 -1 -1

V3:
 0 -1 -2 -2
-1 -2 -2 -2
-2 -2 -2 -2
-2 -2 -2 -2

V4:
 0 -1 -2 -3
-1 -2 -3 -3
-2 -3 -3 -3
-3 -3 -3 -3

V5:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -4
-3 -4 -4 -4

V6:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -5

V7:
 0 -1 -2 -3
-1 -2 -3 -4
-2 -3 -4 -5
-3 -4 -5 -6

50
Synchronous Dynamic Programming Algorithms

Problem      Bellman Equation                                               Algorithm
Prediction   Bellman Expectation Equation                                   Iterative Policy Evaluation
Control      Bellman Expectation Equation + (Greedy) Policy Improvement     Policy Iteration
Control      Bellman Optimality Equation                                    Value Iteration

Observations:
► Algorithms are based on the state-value function vπ(s) or v∗(s) ⇒ complexity O(|A||S|²) per iteration, for |A| actions and |S| states
► Could also apply to the action-value function qπ(s, a) or q∗(s, a) ⇒ complexity O(|A|²|S|²) per iteration
51
Extensions to Dynamic Programming

52
Asynchronous Dynamic Programming

► DP methods described so far used synchronous updates (all states in


parallel)

► Asynchronous DP
► backs up states individually, in any order
► can significantly reduce computation
► guaranteed to converge if all states continue to be selected

53
Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:


► In-place dynamic programming (no additional data structures or memory)
► Prioritised sweeping
► Real-time dynamic programming

54
In-Place Dynamic Programming

► Synchronous value iteration stores two copies of the value function:

for all s ∈ S : v_new(s) ← max_a E[Rt+1 + γ v_old(St+1) | St = s], then v_old ← v_new

► In-place value iteration stores only one copy of the value function:

for all s ∈ S : v(s) ← max_a E[Rt+1 + γ v(St+1) | St = s]

55
Prioritised Sweeping

► Use the magnitude of the Bellman error to guide state selection, e.g.

|max_a E[Rt+1 + γ v(St+1) | St = s, At = a] − v(s)|

► Backup the state with the largest remaining Bellman error


► Update Bellman error of affected states after each backup
► Requires knowledge of reverse dynamics (predecessor
states)
► Can be implemented efficiently by maintaining a priority
queue
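A rough sketch of this idea with a priority queue, again assuming the table encoding of the model used earlier (the predecessor sets provide the reverse dynamics):

```python
import heapq

def prioritised_sweeping(states, actions, p, gamma, n_updates=10000, theta=1e-8):
    """Back up the state with the largest Bellman error first (a sketch, not a full agent)."""
    v = {s: 0.0 for s in states}

    def backup(s):                            # one full-width Bellman optimality backup
        return max(sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[(s, a)])
                   for a in actions[s])

    # Reverse dynamics: which states can transition into s2?
    predecessors = {s: set() for s in states}
    for s in states:
        for a in actions[s]:
            for prob, s2, _ in p[(s, a)]:
                if prob > 0:
                    predecessors[s2].add(s)

    # Priority queue ordered by (negative) Bellman error magnitude.
    queue = [(-abs(backup(s) - v[s]), s) for s in states]
    heapq.heapify(queue)

    for _ in range(n_updates):
        if not queue:
            break
        neg_err, s = heapq.heappop(queue)
        if -neg_err < theta:                  # largest remaining error is negligible
            break
        v[s] = backup(s)
        for sp in predecessors[s]:            # re-prioritise states affected by this backup
            err = abs(backup(sp) - v[sp])
            if err > theta:
                heapq.heappush(queue, (-err, sp))
    return v
```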

56
Real-Time Dynamic Programming

► Idea: only update states that are relevant to agent


► E.g., if the agent is in state St , update that state value, or states that it
expects to be in soon

57
Full-Width Backups

► Standard DP uses full-width backups

► For each backup (sync or async)
► Every successor state and action is considered
► Using the true model of transitions and the reward function

► DP is effective for medium-sized problems (millions of states)

► For large problems DP suffers from the curse of dimensionality
► The number of states n = |S| grows exponentially with the number of state variables
► Even one full backup can be too expensive

(Backup diagram: root state s with value vk+1(s), expanding over actions a, rewards r, and successor states s′.)

58
Sample Backups

► In subsequent lectures we will consider sample backups

► Using sample rewards and sample transitions ⟨s, a, r, s′⟩
(instead of the reward function r and transition dynamics p)

► Advantages
► Model-free: no advance knowledge of the MDP required
► Breaks the curse of dimensionality through sampling
► Cost of a backup is constant, independent of n = |S|

59
Summary

60
What have we covered today?

► Markov Decision Processes


► Objectives in an MDP: different notion of return
► Value functions - expected returns, conditioned on state (and action)
► Optimality principles in MDPs: optimal value functions and optimal policies
► Bellman Equations
► Two classes of problems in RL: evaluation and control
► How to compute vπ (aka solve an evaluation/prediction problem)
► How to compute the optimal value function via dynamic programming:
► Policy Iteration
► Value Iteration

61
Questions?

The only stupid question is the one you were afraid to ask but never did.
-Rich Sutton

For questions that may arise during this lecture please use Moodle and/or the next
Q&A session.

62
