
Master MVA: Reinforcement Learning Lecture: 2

Markov Decision Processes and Dynamic Programming

Lecturer: Alessandro Lazaric http://researchers.lille.inria.fr/∼lazaric/Webpage/Teaching.html

Objectives of the lecture

1. Understand: Markov decision processes, Bellman equations and Bellman operators.

2. Use: dynamic programming algorithms.

1 The Markov Decision Process

1.1 Definitions

Definition 1 (Markov chain). Let the state space X be a compact subset of a Euclidean space. The discrete-time dynamic system (xt)t∈N ∈ X is a Markov chain if

P(xt+1 = x | xt , xt−1 , . . . , x0 ) = P(xt+1 = x | xt ), (1)

so that all the information needed to predict (in probability) the future is contained in the current state
(Markov property). Given an initial state x0 ∈ X, a Markov chain is defined by the transition proba-
bility p such that

p(y|x) = P(xt+1 = y|xt = x). (2)

Remark : notice that in some cases we can turn a higher-order Markov process into a Markov process by
including the past as a new state variable. For instance, in the control of an inverted pendulum, the state
that can be observed is only the angular position θt . In this case the system is non-Markov since the next
position depends on the previous position but also on the angular speed, which is defined as dθt = θt − θt−1
(as a first approximation). Thus, if we define the state space as xt = (θt , dθt ) we obtain a Markov chain.

Definition 2 (Markov decision process [Bellman, 1957, Howard, 1960, Fleming and Rishel, 1975, Puterman, 1994,
Bertsekas and Tsitsiklis, 1996]). A Markov decision process is defined as a tuple M = (X, A, p, r) where

• X is the state space (finite, countable, continuous),1

• A is the action space (finite, countable, continuous),


1 In most of our lectures it can be considered as finite, such that |X| = N.


• p(y|x, a) is the transition probability (i.e., environment dynamics) such that for any x ∈ X, y ∈ X,
and a ∈ A
p(y|x, a) = P(xt+1 = y|xt = x, at = a),

is the probability of observing the next state y when action a is taken in x,

• r(x, a, y) is the reward (reinforcement) obtained when taking action a in state x and observing a transition to state y.2

Definition 3 (Policy). At time t ∈ N a decision rule πt is a mapping from states to actions, in particular
it can be

• Deterministic: πt : X → A, πt (x) denotes the action chosen at state x at time t,

• Stochastic: πt : X → ∆(A), πt (a|x) denotes the probability of taking action a at state x at time t.

A policy (strategy, plan) is a sequence of decision rules. In particular, we have

• Non-stationary: π = (π0 , π1 , π2 , . . . ),

• Stationary (Markovian): π = (π, π, π, . . . ).

Remark: an MDP M together with a deterministic (or stochastic) stationary policy π forms a dynamic process (xt)t∈N (obtained by taking the actions at = π(xt)) which corresponds to a Markov chain on the state space X with transition probability p(y|x) = p(y|x, π(x)).
Time horizons

• Finite time horizon T : the problem is characterized by a deadline at time T (e.g., the end of the course)
and the agent only focuses on the sum of the rewards up to that time.

• Infinite time horizon with discount: the problem never terminates but rewards which are closer in time
receive a higher importance.

• Infinite time horizon with absorbing (terminal) state: the horizon is not fixed in advance, but the agent will eventually reach a terminal (absorbing) state.

• Infinite time horizon with average reward : the problem never terminates but the agent only focuses on
the average of the rewards.

Definition 4 (The state value function). Depending on the time horizon we consider we have

• Finite time horizon T:

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],   (3)

where R is a reward function for the final state.


2 Most of the time we will use either r(x, a) (which can be considered as the expected value of r(x, a, y)) or r(x) as a state-only function.

• Infinite time horizon with discount:

V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],   (4)

where 0 ≤ γ < 1 is a discount factor (i.e., a small γ puts the focus on short-term rewards, a large γ on long-term rewards).3
• Infinite time horizon with absorbing (terminal) states:

V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],   (5)

where T is the first (random) time when the agent reaches an absorbing state.
• Infinite time horizon with average reward:

V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].   (6)

Remark: the previous expectations are taken over all the possible stochastic trajectories. More precisely, consider an MDP M = (X, A, p, r): if an agent follows a non-stationary policy π from the state x_0, then it observes the random sequence of states (x_0, x_1, x_2, . . .) and the corresponding sequence of rewards (r_0, r_1, r_2, . . .) where r_t = r(x_t, π_t(x_t)). The corresponding value function (in the infinite horizon discounted setting) is then defined as

V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π_t(x_t)) | x_0 = x; π ],

where each x_t ∼ p(·|x_{t−1}, a_{t−1} = π_{t−1}(x_{t−1})) is a random realization from the transition probability of the MDP.
Definition 5 (Optimal policy and optimal value function). The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π∈Π} V^π

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Remark : π ∗ ∈ arg max(·) and not π ∗ = arg max(·) because an MDP may admit more than one optimal
policy.
Besides the state value function, we can also introduce an alternative formulation, the state-action value function. For infinite horizon discounted problems we have the following definition (similar definitions hold for the other settings).

Definition 6 (The state-action value function). In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is defined as

Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is

Q*(x, a) = max_π Q^π(x, a).
3 Mathematical interpretation: for any γ ∈ [0, 1) the series always converge (for bounded rewards).

The relationships between the V-function and the Q-function are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y),
V^π(x) = Q^π(x, π(x)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y),
V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).

Finally, from an optimal Q-function we can deduce the optimal policy as π ∗ (x) ∈ arg maxa∈A Q∗ (x, a).
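These identities translate directly into array operations. As an illustration (not part of the original notes), here is a minimal Python sketch assuming a tabular Q-function stored as an N × |A| numpy array Q[x, a] and a deterministic policy stored as an integer array pi of length N:

    import numpy as np

    def v_from_q(Q, pi):
        """V^pi(x) = Q^pi(x, pi(x)) for a deterministic policy pi."""
        return Q[np.arange(Q.shape[0]), pi]

    def greedy_from_q(Q):
        """pi*(x) in argmax_a Q*(x, a): one greedy action per state."""
        return Q.argmax(axis=1)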

1.2 Examples

Example 1 (the MVA student dilemma).

[Figure: the MVA student dilemma modeled as an MDP with seven states x1, . . . , x7. In each non-terminal state the student chooses between Rest and Work; the arcs carry the transition probabilities (from 0.1 to 0.9) and the rewards r ∈ {0, 1, −1, −10, 100, −1000}. States x5, x6, x7 are terminal.]

• Model: all the transitions are Markov, since the transition probability only depends on the previous state. The states x5, x6, x7 are terminal (absorbing) states.

• Setting: infinite horizon with terminal states.

• Objective: find the policy that maximizes the expected sum of rewards before achieving a terminal
state.

• Problem 1 : if a student knows the transition probabilities and the rewards, how does he/she compute
the optimal strategy?

Solution:

[Figure: the same MDP annotated with the optimal values V1 = 88.3, V2 = 88.3, V3 = 86.9, V4 = 88.9, V5 = −10, V6 = 100, V7 = −1000.]

V5 = −10, V6 = 100, V7 = −1000,
V4 = −10 + 0.9 V6 + 0.1 V4 ≈ 88.9,
V3 = −1 + 0.5 V4 + 0.5 V3 ≈ 86.9,
V2 = 1 + 0.7 V3 + 0.3 V1,
V1 = max{0.5 V2 + 0.5 V1, 0.5 V3 + 0.5 V1},
V1 = V2 = 88.3.
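As a quick numerical sanity check (not in the original notes), the fixed-point system above can be iterated until convergence; a minimal Python sketch:

    # Iterate the four equations above as a fixed point (terminal values are fixed).
    V1 = V2 = V3 = V4 = 0.0
    V5, V6, V7 = -10.0, 100.0, -1000.0
    for _ in range(200):
        V4 = -10 + 0.9 * V6 + 0.1 * V4
        V3 = -1 + 0.5 * V4 + 0.5 * V3
        V2 = 1 + 0.7 * V3 + 0.3 * V1
        V1 = max(0.5 * V2 + 0.5 * V1, 0.5 * V3 + 0.5 * V1)
    print(round(V1, 1), round(V2, 1), round(V3, 1), round(V4, 1))  # 88.3 88.3 86.9 88.9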

• Problem 2 : what if the student doesn’t know the MDP?

Example 2 (The retail store management problem).


At each month t, a store contains xt items of a specific good and the demand for that good is Dt. At the end of each month the manager of the store can order at more items from his supplier. Furthermore, we know that

• The cost of maintaining an inventory of x is h(x).


• The cost to order a items is C(a).
• The income for selling q items is f (q).
• If the demand D is bigger than the available inventory x, the customers that cannot be served will
leave and move to another store.
• The income of the remaining inventory at the end of the year is g(x).
• Constraint: the store has a maximum capacity M .
• Objective: the performance of the manager is evaluated as the profit obtained over a horizon of one year (T = 12 months).

Problem: How do we formalize the problem?


Solution: we define the following simplified model using the MDP formalism.

• State space: x ∈ X = {0, 1, . . . , M }.


• Action space: it is not possible to order more items than the capacity of the store, so the action space should depend on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
• Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.
Problem: the dynamics should be Markov and stationary. The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d.
• Reward : rt = −C(at ) − h(xt + at ) + f ([xt + at − xt+1 ]+ ).
• Objective function: E[ Σ_{t=1}^{T−1} r_t + g(x_T) ].
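To make the model concrete, here is a minimal Python sketch (not from the notes) of one monthly transition; the cost/income functions h, C, f and the demand distribution are placeholders, and the terminal income g is omitted:

    import numpy as np

    def step(x, a, rng, M=20, h=lambda s: 0.1 * s, C=lambda a: 2.0 * a, f=lambda q: 5.0 * q):
        """Order a items in state x, observe the demand D, return (x_next, reward)."""
        assert 0 <= a <= M - x                # cannot exceed the store capacity M
        D = rng.poisson(5)                    # hypothetical demand distribution D_t ~ D (i.i.d.)
        x_next = max(x + a - D, 0)            # x_{t+1} = [x_t + a_t - D_t]^+
        sold = x + a - x_next                 # [x_t + a_t - x_{t+1}]^+
        return x_next, -C(a) - h(x + a) + f(sold)

    rng = np.random.default_rng(0)
    x, profit = 5, 0.0
    for t in range(12):                       # one year, T = 12 months
        x, r = step(x, a=min(5, 20 - x), rng=rng)
        profit += r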

Example 3 (The parking problem).


A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places in front of the restaurant; each place is available with some probability, parking at place t yields reward t, and leaving without parking yields reward 0.]

We know that

• The driver cannot see whether a place is available unless he/she is in front of it.
• There are P places.
• At each place i the driver can either move to the next place or park (if the place is available).
• The closer the parking place is to the restaurant, the higher the satisfaction.
• If the driver doesn’t park anywhere, then he/she leaves the restaurant and has to find another one.
• Objective: maximize the satisfaction.

Problem: How do we formalize the problem?


Solution: we define the following simplified model using the MDP formalism.

• State space: x ∈ X = {(1, T ), (1, A), (2, T ), (2, A), . . . , (P, T ), (P, A), parked, left}. Each of the P places
can be (A)vailable or (T )aken. Whenever the driver parks, he reaches the state parked, while if he
never parks then the state becomes left. Both these states are terminal states.
• Action space: A(x) = {park, continue} if x = (·, A) or A(x) = {continue} if x = (·, T ).
• Dynamics: (see graph). We assume that the probability that place i is available is ρ(i).
• Reward: for any i ∈ [1, P] we have r((i, A), park, parked) = i and r = 0 otherwise.

• Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when a terminal state is reached.

[Figure: the parking MDP. At each place t the driver finds it available with probability p(t) or taken with probability 1 − p(t); from an available place the actions are park (reward t, leading to the terminal state parked) or continue, from a taken place the only action is continue; passing the last place leads to the terminal state left with reward 0.]

Example 4 (The tetris game).


Model:
• State space: the configuration of the wall and the next piece, plus a terminal state reached when the wall exceeds the maximum height.
• Action space: position and orientation of the current piece in the wall.
• Dynamics: the new configuration of the wall and a new random piece.
• Reward: number of deleted rows.
• Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when a terminal state is reached. (Remark: it has been proved that the game eventually terminates with probability 1 for any playing strategy.)
Problem: Compute the optimal strategy.
Solution: unknown! The state space is huge: |X| ≈ 10^61 for a problem with maximum height 20, width 10, and 7 different pieces.

1.3 Finite Horizon Problems

Proposition 1. For any horizon T and any (non-stationary) policy π = (π_0, . . . , π_{T−1}), the state value function at a state x ∈ X at time t ∈ {0, . . . , T − 1} satisfies the Bellman equation:

V^π(t, x) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t + 1, y),   (7)

with the terminal condition V^π(T, x) = R(x).

Proof. Recalling equation 3 and Def. 5, we have that

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
(a)= r(x, π_t(x)) + E_{x_{t+1}, . . . , x_{T−1}}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
= r(x, π_t(x)) + Σ_{y∈X} P[x_{t+1} = y | x_t = x; a_t = π_t(x_t)] E_{x_{t+2}, . . . , x_{T−1}}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]
(b)= r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) E[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]
(c)= r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t + 1, y),

where

(a) The expectation is conditioned on xt = x.

(b) Application of the law of total expectation.

(c) Definition of the MDP dynamics p and of the value function.

Bellman’s Principle of Optimality [Bellman, 1957]:

“An optimal policy has the property that, whatever the initial state and the initial decision are,
the remaining decisions must constitute an optimal policy with regard to the state resulting from
the first decision.”

Proposition 2. The optimal value function V*(t, x) (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(t, x) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ],  with 0 ≤ t < T,   (8)
V*(T, x) = R(x),

and the policy

π*_t(x) ∈ arg max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ],  with 0 ≤ t < T,

is an optimal policy.

Proof. By definition we have V*(T, x) = R(x); then V* is defined by backward propagation for any t < T. Any policy π applied from time t < T on at state x can be written as π = (a, π′), where a ∈ A is the action taken at time t in x and π′ = (π_{t+1}, . . . , π_{T−1}). Then we can show that

V*(t, x) (a)= max_π E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]
(b)= max_{(a,π′)} [ r(x, a) + Σ_{y∈X} p(y|x, a) V^{π′}(t + 1, y) ]
(c)= max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) max_{π′} V^{π′}(t + 1, y) ]
(d)= max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ],

where

(a) : definition of the value function from equation 3,

(b) : decomposition of the policy π = (a, π′) and recursive definition of the value function,

(c) : follows from

– the trivial inequality

max_{π′} Σ_y p(y|x, a) V^{π′}(t + 1, y) ≤ Σ_y p(y|x, a) max_{π′} V^{π′}(t + 1, y),

– and, letting π̄ = (π̄_{t+1}, . . .) be a policy such that π̄_{t+1}(y) = arg max_{b∈A} max_{(π_{t+2}, ...)} V^{(b, π_{t+2}, ...)}(t + 1, y),

Σ_y p(y|x, a) max_{π′} V^{π′}(t + 1, y) = Σ_y p(y|x, a) V^{π̄}(t + 1, y) ≤ max_{π′} Σ_y p(y|x, a) V^{π′}(t + 1, y).

(d) : definition of optimal value function.

Finally, the optimal policy simply takes, at each time step, the action for which the maximum is attained.

Remark. The previous Bellman equations can be easily extended to the case of state-action value functions.

Parking example. Let V*(p, A) (resp. V*(p, T)) be the optimal value function when at position p and the place is available (resp. taken). The optimal value function can be constructed using the optimal Bellman equations as:

V*(P, A) = max{P, 0} = P,   V*(P, T) = 0,
V*(P − 1, A) = max{P − 1, ρ(P) V*(P, A) + (1 − ρ(P)) V*(P, T)},
V*(p, A) = max{p, ρ(p + 1) V*(p + 1, A) + (1 − ρ(p + 1)) V*(p + 1, T)},
V*(p, T) = ρ(p + 1) V*(p + 1, A) + (1 − ρ(p + 1)) V*(p + 1, T),

where the max corresponds to the two possible choices (park or continue) and the corresponding optimal policy can be derived as the arg max of the previous maxima.
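A minimal backward-induction sketch of these equations (not part of the notes), assuming a hypothetical availability profile ρ:

    def parking_values(P, rho):
        """Backward induction: return V*(p, A) and V*(p, T) for p = 1..P (index 0 unused)."""
        V_A = [0.0] * (P + 1)   # V_A[p] = V*(p, Available)
        V_T = [0.0] * (P + 1)   # V_T[p] = V*(p, Taken)
        V_A[P], V_T[P] = float(P), 0.0          # last place: park if available, else leave (reward 0)
        for p in range(P - 1, 0, -1):
            cont = rho(p + 1) * V_A[p + 1] + (1 - rho(p + 1)) * V_T[p + 1]
            V_A[p] = max(p, cont)               # park (reward p) or continue
            V_T[p] = cont                       # place taken: the only option is to continue
        return V_A, V_T

    V_A, V_T = parking_values(P=10, rho=lambda p: 0.3)   # hypothetical rho(p) = 0.3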

1.4 Discounted Infinite Horizon Problems

1.4.1 Bellman Equations

Proposition 3. For any stationary policy π = (π, π, . . .), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).   (9)

Proof. For any policy π,

V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
= r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).

Proposition 4. The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].   (10)

Proof. For any policy π = (a, π′) (possibly non-stationary),

V*(x) (a)= max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
(b)= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ]
(c)= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ]   (11)
(d)= max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].

where

(a) : definition of value function from equation 4,



(b) : decomposition of the policy π = (a, π′) and recursive definition of the value function,
(c) : follows from
– the trivial inequality

max_{π′} Σ_y p(y|x, a) V^{π′}(y) ≤ Σ_y p(y|x, a) max_{π′} V^{π′}(y),

– and, letting π̄(y) = arg max_{π′} V^{π′}(y),

Σ_y p(y|x, a) max_{π′} V^{π′}(y) = Σ_y p(y|x, a) V^{π̄}(y) ≤ max_{π′} Σ_y p(y|x, a) V^{π′}(y).

(d) : definition of optimal value function.

The equivalent Bellman equations for state-action value functions are:

Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) Q^π(y, π(y)),
Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} Q*(y, b).

1.4.2 Bellman Operators

Notation. W.l.o.g. from this moment on we consider a discrete state space |X| = N , so that for any policy
π, V π is a vector of size N (i.e., V π ∈ RN ).
Definition 7. For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is defined as

T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),   (12)

and the optimal Bellman operator (or dynamic programming operator) is defined as

T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].   (13)
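As an illustration (not from the notes), both operators can be written as short numpy operations on a tabular MDP, assuming the layout R[x, a] = r(x, a) and P[a, x, y] = p(y|x, a):

    import numpy as np

    def bellman_pi(W, R, P, pi, gamma):
        """(T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)."""
        idx = np.arange(W.shape[0])
        return R[idx, pi] + gamma * P[pi, idx, :] @ W

    def bellman_opt(W, R, P, gamma):
        """(T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]."""
        Q = R + gamma * (P @ W).T        # Q[x, a]
        return Q.max(axis=1)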

Proposition 5. The Bellman operators enjoy a number of properties:

1. Monotonicity: for any W1 , W2 ∈ RN , if W1 ≤ W2 component-wise, then

T π W1 ≤ T π W2 ,
T W1 ≤ T W2 .

2. Offset: for any scalar c ∈ R,

T π (W + cIN ) = T π W + γcIN ,
T (W + cIN ) = T W + γcIN ,

3. Contraction in L∞ -norm: for any W1 , W2 ∈ RN

||T π W1 − T π W2 ||∞ ≤ γ||W1 − W2 ||∞ ,


||T W1 − T W2 ||∞ ≤ γ||W1 − W2 ||∞ .

4. Fixed point: For any policy π

V π is the unique fixed point of T π ,


V ∗ is the unique fixed point of T .

Furthermore, for any W ∈ R^N and any stationary policy π,

lim_{k→∞} (T^π)^k W = V^π,
lim_{k→∞} (T)^k W = V*.

Proof. Monotonicity (1) and offset (2) directly follow from the definitions.
The contraction property (3) holds since for any x ∈ X we have

|T W1(x) − T W2(x)| = | max_a [ r(x, a) + γ Σ_y p(y|x, a) W1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W2(y) ] |
(a)≤ max_a | [ r(x, a) + γ Σ_y p(y|x, a) W1(y) ] − [ r(x, a) + γ Σ_y p(y|x, a) W2(y) ] |
= γ max_a Σ_y p(y|x, a) |W1(y) − W2(y)|
≤ γ ||W1 − W2||_∞ max_a Σ_y p(y|x, a) = γ ||W1 − W2||_∞,

where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).
The fixed point property follows from the Bellman equations (eq. 9 and 10) and the Banach fixed point theorem (see Proposition 11 in Appendix B). The property regarding the convergence of the iterated Bellman operators is actually established in the proof of the Banach theorem.

Remark 1 (optimal policy): Any stationary policy π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal. In fact, from the definition of π* we have that T^{π*} V* = T V* = V*, where the first equality follows from the fact that the optimal Bellman operator coincides with the action taken by π* and the second from the fixed point property of T. Furthermore, by the fixed point property of T^{π*} we have that V^{π*} is the fixed point of T^{π*}, which is also unique. Then V^{π*} = V*, thus implying that π* is an optimal policy.
Remark 2 (value/policy iteration): most of the dynamic programming algorithms studied in Section 2 will
heavily rely on property (4).
Bellman operators for Q-functions.

Both the Bellman operator T^π and the optimal Bellman operator T can be extended to Q-functions. Thus for any function4 W : X × A → R, we have

T^π W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) W(y, π(y)),
T W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} W(y, b).

This allows us to extend all the properties in Proposition 5, notably that Q^π (resp. Q*) is the fixed point of T^π (resp. T).

1.5 Undiscounted Infinite Horizon Problems

1.5.1 Bellman equations

Recall the definition of value function for infinite horizon problems with absorbing states in eq. 5:
V^π(x) = E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the agent reaches an absorbing state. The equivalence follows from the fact that we assume that, once the absorbing state is reached, the system stays in that state indefinitely with a constant reward of 0.
Definition 8. A stationary policy π is proper if there exists an integer n ∈ N such that from any initial state x ∈ X the probability of reaching the terminal state (denoted by x̄) after n steps is strictly positive. That is,

ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.

Proposition 6. For any proper policy π with parameter ρ_π after n steps, the value function is bounded as

||V^π||_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}.

Proof. By Def. 8 we have

P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) × P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π^2.

Then for any t ∈ N,

P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋},

which implies that eventually the terminal state x̄ is reached with probability 1. Then

||V^π||_∞ = max_{x∈X} E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] ≤ r_max Σ_{t≥0} P(x_t ≠ x̄ | x_0 = x, π) ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}.

4 It is simpler to state W as a function rather than as a vector, as done for value functions.

1.5.2 Bellman operators

Assumption. There exists at least one proper policy and for any non-proper policy π there exists at least
one state x where the corresponding value function is negatively unbounded, i.e., V π (x) = −∞, which
corresponds to the existence of a cycle with only negative rewards.

Proposition 7. [Bertsekas and Tsitsiklis, 1996] Under the previous assumption, the optimal value function is bounded, i.e., ||V*||_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N

T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].

Furthermore, we have that V* = lim_{k→∞} (T)^k W.

Proposition 8. Let all the policies π be proper. Then there exist a vector µ ∈ R^N with µ > 0 and a scalar β < 1 such that, ∀x ∈ X, ∀a ∈ A,

Σ_y p(y|x, a) µ(y) ≤ β µ(x).

It then follows that both operators T and T^π are contractions in the weighted norm L_{∞,µ}, that is

||T W1 − T W2||_{∞,µ} ≤ β ||W1 − W2||_{∞,µ}.

Proof. Let µ(x) be the maximum (over all policies) of the expected time to reach the terminal state from x. This can be cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution to the dynamic programming equation

µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y).

Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore,

Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x),

for

β = max_x (µ(x) − 1)/µ(x) < 1.

From this definition of µ and β we obtain the contraction property of T (and similarly of T^π) in the norm L_{∞,µ}:

||T W1 − T W2||_{∞,µ} = max_x |T W1(x) − T W2(x)| / µ(x)
≤ max_{x,a} Σ_y p(y|x, a) |W1(y) − W2(y)| / µ(x)
≤ max_{x,a} [ Σ_y p(y|x, a) µ(y) / µ(x) ] ||W1 − W2||_{∞,µ}
≤ β ||W1 − W2||_{∞,µ}.

2 Dynamic Programming

2.1 Value Iteration (VI)

The idea. We build a sequence of value functions. Let V0 be any vector in RN , then iterate the application
of the optimal Bellman operator so that given Vk at iteration k we compute

Vk+1 = T Vk .

From Proposition 5 we have that lim_{k→∞} Vk = V*, thus repeated applications of the Bellman operator make the initial guess V0 closer and closer to the actual optimal value function. At any iteration K, the
algorithm returns the greedy policy
π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
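A minimal sketch of the full procedure (not part of the notes), assuming a tabular MDP stored as R[x, a] and P[a, x, y]:

    import numpy as np

    def value_iteration(R, P, gamma, eps=1e-6, max_iter=10_000):
        """Iterate V_{k+1} = T V_k and return (V_K, greedy policy pi_K)."""
        N, A = R.shape
        V = np.zeros(N)
        for _ in range(max_iter):
            Q = R + gamma * (P @ V).T       # Q[x, a] = r(x, a) + gamma * sum_y p(y|x, a) V(y)
            V_new = Q.max(axis=1)           # apply the optimal Bellman operator T
            if np.max(np.abs(V_new - V)) < eps:
                V = V_new
                break
            V = V_new
        policy = (R + gamma * (P @ V).T).argmax(axis=1)   # greedy policy pi_K
        return V, policy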

Guarantees. Using the contraction property of the Bellman operator we obtain the convergence of Vk to V*. In fact,

||V_{k+1} − V*||_∞ = ||T Vk − T V*||_∞ ≤ γ ||Vk − V*||_∞ ≤ γ^{k+1} ||V0 − V*||_∞ → 0.

This also provides the convergence rate of VI. Let ε > 0 be a desired level of accuracy in approximating V* and ||r||_∞ ≤ r_max; then after at most

K = log(r_max/ε) / log(1/γ)

iterations VI returns a value function V_K such that ||V_K − V*||_∞ < ε. In fact, after K iterations we have

||V_K − V*||_∞ < γ^{K+1} ||V0 − V*||_∞ ≤ γ^{K+1} r_max ≤ ε.

Computational complexity. One application of the optimal Bellman operator takes O(N 2 A) operations.
Remark: Notice that the previous guarantee is for the value function returned by VI and not the corre-
sponding policy πK .
Q-iteration. Exactly the same algorithm can be applied to Q-functions using the corresponding optimal
Bellman operator.
Implementation. There exist several implementations depending on the order used to update the different
components of Vk (i.e., the states). For instance, in asynchronous VI, at each iteration k, one single state xk
is chosen and only the corresponding component of Vk is updated, i.e., Vk+1 (xk ) = T Vk (xk ), while all the
other states remain unchanged. If all the states are selected infinitely often, then Vk → V ∗ .

Pros: each iteration is very computationally efficient.

Cons: convergence is only asymptotic.



2.2 Policy Iteration (PI)

Notation. Further extending the vector notation, for any policy π we introduce the reward vector r^π ∈ R^N as r^π(x) = r(x, π(x)) and the transition matrix P^π ∈ R^{N×N} as P^π(x, y) = p(y|x, π(x)).
The idea. We build a sequence of policies. Let π0 be any stationary policy. At each iteration k we perform
the two following steps:

1. Policy evaluation: given πk, compute V^{πk}.

2. Policy improvement: we compute the greedy policy π_{k+1} from V^{πk} as:

π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{πk}(y) ].

The iterations continue until V πk = V πk+1 .


Remark: Notice πk+1 is called greedy since it is maximizing the current estimate of the future rewards, which
corresponds to the application of the optimal Bellman operator, since T πk+1 V πk = T V πk .
Remark: The PI algorithm is often seen as an actor-critic algorithm.
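A minimal sketch of the two steps (not from the notes), using the exact policy evaluation discussed further below and stopping when the greedy policy no longer changes; the assumed layout is R[x, a] and P[a, x, y]:

    import numpy as np

    def policy_iteration(R, P, gamma):
        N, A = R.shape
        pi = np.zeros(N, dtype=int)                  # arbitrary initial policy pi_0
        while True:
            # Policy evaluation: solve (I - gamma * P^pi) V = r^pi
            P_pi = P[pi, np.arange(N), :]            # P^pi[x, y] = p(y|x, pi(x))
            r_pi = R[np.arange(N), pi]               # r^pi[x]   = r(x, pi(x))
            V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
            # Policy improvement: greedy policy w.r.t. V^{pi_k}
            pi_new = (R + gamma * (P @ V).T).argmax(axis=1)
            if np.array_equal(pi_new, pi):
                return V, pi
            pi = pi_new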
Guarantees.

Proposition 9. The policy iteration algorithm generates a sequence of policies with non-decreasing performance (i.e., V^{π_{k+1}} ≥ V^{πk}) and it converges to π* in a finite number of iterations.

Proof. From the definition of the operators T , T πk , T πk+1 and of the greedy policy πk+1 , we have that

V πk = T πk V πk ≤ T V πk = T πk+1 V πk , (14)

and from the monotonicity property of T^{π_{k+1}}, it follows that

V^{πk} ≤ T^{π_{k+1}} V^{πk},
T^{π_{k+1}} V^{πk} ≤ (T^{π_{k+1}})^2 V^{πk},
. . .
(T^{π_{k+1}})^{n−1} V^{πk} ≤ (T^{π_{k+1}})^n V^{πk},
. . .

Joining all the inequalities in the chain we obtain

V^{πk} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{πk} = V^{π_{k+1}}.

Then (V πk )k is a non-decreasing sequence. Furthermore, in a finite MDP we have a finite number of policies,
then the termination condition is always met for a specific k. Thus eq. 14 holds with an equality and we
obtain
V πk = T V πk
and V πk = V ∗ which implies that πk is an optimal policy.

Remark: More recent and refined proofs of the convergence rate of PI are available.
Policy evaluation. At each iteration k the value function V πk corresponding to the current policy πk is
computed. There are a number of different approaches to compute V πk .

• Direct computation. From the vector notation introduced before and the Bellman equation in Proposition 3 we have that for any policy π:

V π = rπ + γP π V π .

By rearranging the terms we write the previous equation as

(I − γP π )V π = rπ .

Since P^π is a stochastic matrix, all its eigenvalues have modulus at most 1. Thus the eigenvalues of the matrix (I − γP^π) have modulus at least 1 − γ > 0, which guarantees that it is invertible. Then we can compute V^π as

V π = (I − γP π )−1 rπ .

The inversion can be done with different methods. For instance, the Gauss-Jordan elimination algo-
rithm has a complexity O(N 3 ), which can be lowered to O(N 2.807 ) using Strassen’s algorithm.

• Iterative policy evaluation. For any policy π, from the properties of the policy Bellman operator T^π, for any V0 we have that lim_{n→∞} (T^π)^n V0 = V^π. Thus an approximation of V^π can be computed by iterating the application of T^π for n steps. In particular, in order to achieve an ε-approximation of V^π we need O(N^2 log(1/ε)/log(1/γ)) steps.

• Monte-Carlo simulation. In each state x, we simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π, where for any i = 1, . . . , n, x_0^i = x and x_{t+1}^i ∼ p(·|x_t^i, π(x_t^i)). We compute

V̂^π(x) ≈ (1/n) Σ_{i=1}^{n} Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).

The approximation error is of order O(1/√n) (see the sketch after this list).

• Temporal-difference (TD) see next lecture.
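A minimal Monte-Carlo evaluation sketch (not from the notes); sample_next is a hypothetical sampler of p(·|x, a), R[x, a] the reward array, pi a deterministic policy, and the discounted sum is truncated at a finite horizon:

    import numpy as np

    def mc_evaluate(x0, pi, R, sample_next, gamma, n=1000, horizon=200, seed=0):
        """Average of n truncated discounted returns from state x0 under policy pi."""
        rng = np.random.default_rng(seed)
        returns = np.empty(n)
        for i in range(n):
            x, G = x0, 0.0
            for t in range(horizon):
                a = pi[x]
                G += (gamma ** t) * R[x, a]
                x = sample_next(x, a, rng)      # x_{t+1} ~ p(.|x_t, pi(x_t))
            returns[i] = G
        return returns.mean()                   # error of order O(1/sqrt(n))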

Policy improvement. The computation of πk+1 requires the computation of an expectation w.r.t. the
next state, which might be as expensive as O(N |A|). If we move to policy iteration of Q-functions, then the
policy improvement step simplifies to

π_{k+1}(x) ∈ arg max_{a∈A} Q^{πk}(x, a),

which has a computational cost of O(|A|). Furthermore, this allows computing the greedy policy even when the dynamics p is not known.

Pros: converge in a finite number of iterations (often small in practice).

Cons: each iteration requires a full policy evaluation and it might be expensive.

2.3 Linear Programming (LP)

Geometric interpretation of PI. We first provide a different interpretation of PI, as a Newton method trying to find the zero of the Bellman residual operator. Let B = T − I be the Bellman residual operator. From the definition of V* as the fixed point of T, it follows that BV* = 0. Thus, computing V* as the fixed point of T is equivalent to computing the zero of B. We first prove the following proposition.

Proposition 10. Let B = T − I be the Bellman residual operator and B′ its derivative. Then the sequence of policies πk generated by PI satisfies

V^{π_{k+1}} = V^{πk} − (γP^{π_{k+1}} − I)^{−1} [T^{π_{k+1}} V^{πk} − V^{πk}]
           = V^{πk} − [B′]^{−1} B V^{πk},

which coincides with the standard formulation of Newton's method.

Proof. By the definition of V^{πk} and of the Bellman operators we have

V^{π_{k+1}} = (I − γP^{π_{k+1}})^{−1} r^{π_{k+1}} − V^{πk} + V^{πk}
            = V^{πk} + (I − γP^{π_{k+1}})^{−1} [r^{π_{k+1}} + (γP^{π_{k+1}} − I) V^{πk}]
            = V^{πk} + (I − γP^{π_{k+1}})^{−1} [T^{π_{k+1}} V^{πk} − V^{πk}]
            = V^{πk} − (γP^{π_{k+1}} − I)^{−1} [T^{π_{k+1}} V^{πk} − V^{πk}].

[Figure: geometric interpretation of PI as Newton's method. The convex map V ↦ T V − V and the linear maps V ↦ T^{π_k} V − V are plotted; the successive value functions V^{π_1}, V^{π_2}, V^{π_3}, . . . converge to the zero V = V*.]

Following the geometric interpretation, V^{πk} is the zero of the linear operator T^{πk} − I. Since the map V ↦ T V − V = max_π (T^π V − V) is convex, Newton's method is guaranteed to converge for any V0 such that T V0 − V0 ≥ 0.

Linear Programming. The previous intuition is at the basis of the observation that V* is the smallest vector V such that V ≥ T V. In fact,

V ≥ T V  =⇒  V ≥ lim_{k→∞} (T)^k V = V*.

Thus, we can compute V* as the solution to the linear program:

• Min Σ_x V(x),

• Subject to (a finite number of linear inequalities which define a polyhedron in R^N):

V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y),  ∀x ∈ X, ∀a ∈ A,

which contains N variables and N × |A| constraints.
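A minimal sketch of this LP with scipy (not part of the notes), again assuming the layout R[x, a] and P[a, x, y]:

    import numpy as np
    from scipy.optimize import linprog

    def lp_value(R, P, gamma):
        """Solve min sum_x V(x) s.t. V(x) >= r(x, a) + gamma * sum_y p(y|x, a) V(y)."""
        N, A = R.shape
        c = np.ones(N)                                   # objective: sum_x V(x)
        # Rewrite each constraint as (gamma * P_a - I) V <= -r(., a) and stack over actions.
        A_ub = np.vstack([gamma * P[a] - np.eye(N) for a in range(A)])
        b_ub = np.concatenate([-R[:, a] for a in range(A)])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N)
        return res.x                                     # the optimal value function V*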



A Probability Theory

Definition 9 (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).
Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is

P(X ∈ A | Y ∈ B) = ( ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy ) / ( ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy ).

Definition 10 (Law of total expectation). Given a function f and two random variables X, Y we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].

B Norms, Contractions, and Banach’s Fixed-Point Theorem

Definition 11. Given a vector space V ⊆ R^d, a function f : V → R^+_0 is a norm if and only if

• If f (v) = 0 for some v ∈ V, then v = 0.

• For any λ ∈ R, v ∈ V, f (λv) = |λ|f (v).

• Triangle inequality: For any v, u ∈ V, f (v + u) ≤ f (v) + f (u) .

A list of useful norms.

• L_p-norm:
||v||_p = ( Σ_{i=1}^{d} |v_i|^p )^{1/p}.

• L_∞-norm:
||v||_∞ = max_{1≤i≤d} |v_i|.

• L_{µ,p}-norm:
||v||_{µ,p} = ( Σ_{i=1}^{d} |v_i|^p / µ_i )^{1/p}.

• L_{µ,∞}-norm:
||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.

• L_{2,P}-matrix norm (P is a positive definite matrix):
||v||^2_P = v^⊤ P v.

Definition 12. A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm || · || to v ∈ V if

lim_{n→∞} ||v_n − v|| = 0.

Definition 13. A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if

lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition 14. A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is
convergent in the norm of the space.

Definition 15. An operator T : V → V is L-Lipschitz if for any v, u ∈ V

||T v − T u|| ≤ L||u − v||.

If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction.


If T is Lipschitz then it is also continuous, that is

if vn →||·|| v then T vn →||·|| T v.

Definition 16. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

Proposition 11. Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then

1. T admits a unique fixed point v.

2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n →_{||·||} v with a geometric convergence rate:

||v_n − v|| ≤ γ^n ||v_0 − v||.

Proof. We first derive the fundamental contraction property. For any v, u ∈ V:

||v − u|| = ||v − T v + T v − T u + T u − u||
(a)≤ ||v − T v|| + ||T v − T u|| + ||T u − u||
(b)≤ ||v − T v|| + γ ||v − u|| + ||T u − u||,

where (a) follows from the triangle inequality and (b) from the contraction property of T. Rearranging the terms we obtain:

||v − u|| ≤ ( ||v − T v|| + ||T u − u|| ) / (1 − γ).   (15)

If v and u are both fixed points then ||T v − v|| = 0 and ||T u − u|| = 0, thus from the previous inequality
||v − u|| ≤ 0 which implies v = u. Then T admits at most one fixed point.

Let v0 ∈ V, for any n, m ∈ N an iterative application of eq. 15 gives


||T^n v_0 − T^m v_0|| ≤ ( ||T^n v_0 − T T^n v_0|| + ||T T^m v_0 − T^m v_0|| ) / (1 − γ)
= ( ||T^n v_0 − T^n T v_0|| + ||T^m T v_0 − T^m v_0|| ) / (1 − γ)
(a)≤ ( γ^n ||v_0 − T v_0|| + γ^m ||T v_0 − v_0|| ) / (1 − γ)
= ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0||,

where in (a) we used the fact that T^n is a γ^n-contraction. Since γ < 1 we have that

lim_{n→∞} sup_{m≥n} ||T^n v_0 − T^m v_0|| ≤ lim_{n→∞} sup_{m≥n} ( (γ^n + γ^m) / (1 − γ) ) ||v_0 − T v_0|| = 0,

which implies that {T^n v_0}_n is a Cauchy sequence. Since V is complete by assumption, {T^n v_0}_n converges to a vector v. Recalling that v_{n+1} = T v_n and taking the limit on both sides we obtain

(a) lim_{n→∞} v_{n+1} = lim_{n→∞} T^{n+1} v_0 = v,
(b) lim_{n→∞} T v_n = T lim_{n→∞} v_n = T v,

where (a) follows from the definition of v_{n+1} and the fact that {T^n v_0}_n is a convergent Cauchy sequence, while (b) follows from the fact that a contraction operator T is also continuous. Joining the two equalities we obtain v = T v, which is the definition of a fixed point.

C Linear Algebra

Eigenvalues of a matrix (1). Given a square matrix A ∈ RN ×N , a vector v ∈ RN and a scalar λ are
eigenvector and eigenvalue of the matrix if
Av = λv.

Eigenvalues of a matrix (2). Given a square matrix A ∈ R^{N×N} with eigenvalues {λ_i}_{i=1}^{N}, the matrix (I − αA) has eigenvalues {µ_i = 1 − αλ_i}_{i=1}^{N}.
Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if

1. all entries are non-negative: ∀i, j, [P]_{i,j} ≥ 0,

2. all the rows sum to one: ∀i, Σ_{j=1}^{N} [P]_{i,j} = 1.

All the eigenvalues of a stochastic matrix are bounded by 1 in modulus, i.e., ∀i, |λ_i| ≤ 1.


Matrix inversion. A square matrix A ∈ R^{N×N} can be inverted if and only if ∀i, λ_i ≠ 0.

References
[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, N.J.

[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming.
Athena Scientific, Belmont, MA.
[Fleming and Rishel, 1975] Fleming, W. and Rishel, R. (1975). Deterministic and stochastic optimal control.
Applications of Mathematics, 1, Springer-Verlag, Berlin New York.
[Howard, 1960] Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA.

[Puterman, 1994] Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, USA.
