Stochastic DP
1.1 Definitions
Definition 1 (Markov chain). Let the state space X be a compact subset of a Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if

P(x_{t+1} = y | x_t, x_{t−1}, . . . , x_0) = P(x_{t+1} = y | x_t),

so that all the information needed to predict (in probability) the future is contained in the current state (Markov property). Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p such that

p(y|x) = P(x_{t+1} = y | x_t = x).
Remark: notice that in some cases we can turn a higher-order Markov process into a Markov process by including the past in the state variable. For instance, in the control of an inverted pendulum, the only quantity that can be observed is the angular position θ_t. In this case the system is non-Markov, since the next position depends on the current position but also on the angular speed, defined as dθ_t = θ_t − θ_{t−1} (as a first approximation). Thus, if we define the state as x_t = (θ_t, dθ_t) we obtain a Markov chain.
Definition 2 (Markov decision process [Bellman, 1957, Howard, 1960, Fleming and Rishel, 1975, Puterman, 1994, Bertsekas and Tsitsiklis, 1996]). A Markov decision process is defined as a tuple M = (X, A, p, r) where

• X is the state space,
• A is the action space,
• p(y|x, a) is the transition probability (i.e., the environment dynamics) such that for any x ∈ X, y ∈ X, and a ∈ A,

p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),

• r(x, a, y) is the reinforcement (reward) obtained when taking action a in state x and a transition to state y is observed. In the following we use r(x, a) = Σ_y p(y|x, a) r(x, a, y) to denote the expected immediate reward.
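For concreteness, a finite MDP can be stored as plain arrays. The following minimal sketch (in Python, with hypothetical numbers not taken from any example in these notes) uses the convention P[x, a, y] = p(y|x, a) and R[x, a] = r(x, a) that the later sketches also assume.

```python
import numpy as np

# A tiny MDP with N = 2 states and |A| = 2 actions (hypothetical numbers).
# P[x, a, y] = p(y|x, a): probability of moving from state x to state y under action a.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
# R[x, a] = r(x, a): expected reward for taking action a in state x,
# i.e. r(x, a) = sum_y p(y|x, a) r(x, a, y).
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

assert np.allclose(P.sum(axis=2), 1.0)   # each p(.|x, a) is a probability distribution over y
```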
Definition 3 (Policy). At time t ∈ N a decision rule π_t is a mapping from states to actions. In particular, it can be

• Deterministic: π_t : X → A, where π_t(x) denotes the action taken at state x at time t,
• Stochastic: π_t : X → ∆(A), where π_t(a|x) denotes the probability of taking action a at state x at time t.

A policy can be

• Non-stationary: π = (π_0, π_1, π_2, . . . ),
• Stationary: π = (π, π, π, . . . ), in which case we simply write π for the decision rule applied at every time step.
Remark: an MDP M together with a deterministic (or stochastic) stationary policy π forms a dynamic process (x_t)_{t∈N} (obtained by taking the actions a_t = π(x_t)) which corresponds to a Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).
Time horizons
• Finite time horizon T : the problem is characterized by a deadline at time T (e.g., the end of the course)
and the agent only focuses on the sum of the rewards up to that time.
• Infinite time horizon with discount: the problem never terminates but rewards which are closer in time
receive a higher importance.
• Infinite time horizon with absorbing (terminal) state: the problem never terminates, but the agent will eventually reach a terminal state.
• Infinite time horizon with average reward : the problem never terminates but the agent only focuses on
the average of the rewards.
Definition 4 (The state value function). Depending on the time horizon we consider, we have:

• Finite time horizon T:

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],   (3)

where R : X → R is a reward function for the final state.

• Infinite time horizon with discount:

V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],   (4)

where 0 ≤ γ < 1 is a discount factor (i.e., small = focus on short-term rewards, big = focus on long-term rewards).
• Infinite time horizon with absorbing (terminal) states:

V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],   (5)

where T is the first (random) time when the agent reaches an absorbing state.

• Infinite time horizon with average reward:

V^π(x) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].   (6)
Remark: the previous expectations refer to all the possible stochastic trajectories. More precisely, let us consider an MDP M = (X, A, p, r). If an agent follows a non-stationary policy π from the state x_0, then it observes the random sequence of states (x_0, x_1, x_2, . . .) and the corresponding sequence of rewards (r_0, r_1, r_2, . . .), where r_t = r(x_t, π_t(x_t)). The corresponding value function (in the infinite horizon discounted setting) is then defined as

V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],

where each x_{t+1} ∼ p(·|x_t, a_t = π(x_t)) is a random realization from the transition probability of the MDP.
Definition 5 (Optimal policy and optimal value function). The solution to an MDP is an optimal policy π* satisfying

π* ∈ arg max_{π∈Π} V^π(x)

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Remark: we write π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.
Besides the state value function, we can also introduce an alternative formulation, the state-action value function. For infinite horizon discounted problems we have the following definition (similar definitions hold for the other settings).

Definition 6 (The state-action value function). In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is defined as

Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is Q*(x, a) = max_π Q^π(x, a).
Finally, from an optimal Q-function we can deduce the optimal policy as π ∗ (x) ∈ arg maxa∈A Q∗ (x, a).
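As a small illustration of this last identity, if a Q-function is stored as an |X| × |A| array (the array Q below is hypothetical), the greedy policy is just an arg max over the action dimension:

```python
import numpy as np

Q = np.array([[0.0, 1.5],    # hypothetical Q-values, stored as Q[x, a]
              [2.0, 0.3]])

greedy_policy = Q.argmax(axis=1)   # pi*(x) in argmax_a Q*(x, a); ties broken arbitrarily
print(greedy_policy)               # [1 0]
```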
1.2 Examples
[Figure: an example MDP with states 1–7 and actions Rest and Work. States 5, 6 and 7 are terminal, reached with rewards r = −10, r = 100 and r = −1000 respectively; the transitions among states 1–4 are stochastic (probabilities between 0.1 and 0.9), with rewards r = 0, r = 1, r = −1 and r = −10.]
• Model: all the transitions are Markov, since the transition probabilities depend only on the current state. The states x_5, x_6, x_7 are terminal (absorbing) states.
• Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.
• Problem 1: if a student knows the transition probabilities and the rewards, how does he/she compute the optimal strategy?
Solution:
[Figure: the same MDP annotated with the computed state values: V_1 = 88.3, V_2 = 88.3, V_3 = 86.9, V_4 = 88.9 for the non-terminal states, and V_5 = −10, V_6 = 100, V_7 = −1000 for the terminal states.]
[Figure: the parking problem. Places 1, 2, . . . , T are laid out in a row ending at the restaurant; place t is free with probability p(t), parking at place t gives reward t, and leaving without parking gives reward 0.]
We know that
• The driver cannot see whether a place is available unless he/she is in front of it.
• There are P places.
• At each place i the driver can either move to the next place or park (if the place is available).
• The closer to the restaurant the parking, the higher the satisfaction.
• If the driver doesn’t park anywhere, then he/she leaves and has to find another restaurant.
• Objective: maximize the satisfaction.
• State space: x ∈ X = {(1, T ), (1, A), (2, T ), (2, A), . . . , (P, T ), (P, A), parked, left}. Each of the P places
can be (A)vailable or (T )aken. Whenever the driver parks, he reaches the state parked, while if he
never parks then the state becomes left. Both these states are terminal states.
• Action space: A(x) = {park, continue} if x = (·, A) or A(x) = {continue} if x = (·, T ).
• Dynamics: (see graph). We assume that the probability of a place i to be available is ρ(i).
• Reward: for any i ∈ [1, P] we have r((i, A), park, parked) = i, and the reward is 0 otherwise.
• Objective function: E[ Σ_{t=1}^{T} r_t ], with T the random time when a terminal state is reached.
[Figure: transition graph of the parking MDP. For each place 1, 2, . . . , T there are two states, available (L) or occupied (O), reached with probabilities p(i) and 1 − p(i); the action Cont(inue) moves the driver from place i to place i + 1, and continuing past the last place gives reward 0.]
Proposition 1. For any horizon T and a (non-stationary) policy π = (π_0, . . . , π_{T−1}), the state value function at a state x ∈ X at time t ∈ {0, . . . , T},

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],   (7)

satisfies the Bellman equation

V^π(t, x) = r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t + 1, y),

with the terminal condition V^π(T, x) = R(x).
Proof. Conditioning on the first transition from x, we have

V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]

(a)= r(x, π_t(x)) + Σ_{y∈X} P[x_{t+1} = y | x_t = x; a_t = π_t(x)] E_{x_{t+2},...,x_T}[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]

(b)= r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) E[ Σ_{s=t+1}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_{t+1} = y; π ]

(c)= r(x, π_t(x)) + Σ_{y∈X} p(y|x, π_t(x)) V^π(t + 1, y),

where (a) follows from the law of total expectation and the Markov property, (b) from the definition of the transition probabilities, and (c) from the definition of the value function at time t + 1.

This recursive structure is at the heart of Bellman's principle of optimality [Bellman, 1957]:

“An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”
Proposition 2. The optimal value function V*(t, x) (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(t, x) = max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ],  for 0 ≤ t < T,   (8)

V*(T, x) = R(x).

Furthermore, any policy π* = (π*_0, . . . , π*_{T−1}) such that

π*_t(x) ∈ arg max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ]

is an optimal policy.
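A minimal sketch of how equation (8) can be solved by backward induction, assuming the array convention P[x, a, y] = p(y|x, a), R[x, a] = r(x, a) introduced earlier and a hypothetical terminal reward array R_T:

```python
import numpy as np

def backward_induction(P, R, R_T, T):
    """Solve the optimal Bellman equation (8) backwards in time.

    P[x, a, y] = p(y|x, a), R[x, a] = r(x, a), R_T[x] = R(x) (terminal reward), T = horizon.
    Returns V[t, x] = V*(t, x) and a non-stationary optimal policy pi[t, x].
    """
    N, A, _ = P.shape
    V = np.zeros((T + 1, N))
    pi = np.zeros((T, N), dtype=int)
    V[T] = R_T                                   # V*(T, x) = R(x)
    for t in range(T - 1, -1, -1):
        Q_t = R + P @ V[t + 1]                   # Q_t[x, a] = r(x, a) + sum_y p(y|x, a) V*(t+1, y)
        V[t] = Q_t.max(axis=1)
        pi[t] = Q_t.argmax(axis=1)               # pi*_t(x) attains the maximum
    return V, pi
```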
Proof. By definition we have V*(T, x) = R(x); then V* is defined by backward propagation for any t < T. Any policy π applied from time t < T at state x can be written as π = (a, π′), where a ∈ A is the action taken at time t in x and π′ = (π_{t+1}, . . . , π_{T−1}). Then we can show that

V*(t, x) (a)= max_π E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ]

(b)= max_{(a,π′)} [ r(x, a) + Σ_{y∈X} p(y|x, a) V^{π′}(t + 1, y) ]

(c)= max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) max_{π′} V^{π′}(t + 1, y) ]

(d)= max_{a∈A} [ r(x, a) + Σ_{y∈X} p(y|x, a) V*(t + 1, y) ],
where

(a): definition of the optimal value function,
(b): decomposition of the policy π = (a, π′) and recursive definition of the value function,
(c): follows from the two following steps:

– the trivial inequality

max_{π′} Σ_y p(y|x, a) V^{π′}(t + 1, y) ≤ Σ_y p(y|x, a) max_{π′} V^{π′}(t + 1, y);

– the reverse inequality: let π̄ = (π̄_{t+1}, . . . ) be a policy such that π̄_{t+1}(y) = arg max_{b∈A} max_{(π_{t+2},...)} V^{(b, π_{t+2}, ...)}(t + 1, y). Then

Σ_y p(y|x, a) max_{π′} V^{π′}(t + 1, y) = Σ_y p(y|x, a) V^{π̄}(t + 1, y) ≤ max_{π′} Σ_y p(y|x, a) V^{π′}(t + 1, y);

(d): definition of the optimal value function V*(t + 1, ·).
Finally, the optimal policy simply takes, at each time step and in each state, an action for which the maximum is attained.
Remark. The previous Bellman equations can be easily extended to the case of state-action value functions.
Parking example. Let V*(p, A) (resp. V*(p, T)) be the optimal value function when the driver is at position p and the place is available (resp. taken). The optimal value function can be constructed using the optimal Bellman equations, proceeding backwards from the last place:

V*(P, A) = max{ P, 0 },   V*(P, T) = 0,

and for p < P,

V*(p, A) = max{ p, ρ(p + 1) V*(p + 1, A) + (1 − ρ(p + 1)) V*(p + 1, T) },
V*(p, T) = ρ(p + 1) V*(p + 1, A) + (1 − ρ(p + 1)) V*(p + 1, T),

where the max corresponds to the two possible choices (park, with reward p, or continue) and the corresponding optimal policy can be derived as the arg max of the previous maxima.
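A minimal sketch of this backward recursion, assuming, as above, that place i is free with probability ρ(i) and that parking at place i gives reward i; the number of places and the values of ρ below are arbitrary illustration choices:

```python
import numpy as np

P_places = 10                                 # number of places (hypothetical)
rho = np.linspace(0.9, 0.1, P_places)         # rho[i-1] = prob. that place i is available (hypothetical)

V_A = np.zeros(P_places)                      # V_A[i-1] = V*(i, Available)
V_T = np.zeros(P_places)                      # V_T[i-1] = V*(i, Taken)
V_A[-1] = P_places                            # last place: park (reward P) or leave (reward 0)
V_T[-1] = 0.0                                 # last place taken: the driver leaves, reward 0

for i in range(P_places - 2, -1, -1):         # backwards over places P-1, ..., 1 (0-indexed)
    cont = rho[i + 1] * V_A[i + 1] + (1 - rho[i + 1]) * V_T[i + 1]
    V_A[i] = max(i + 1, cont)                 # park (reward i+1, i.e. the 1-indexed place) or continue
    V_T[i] = cont                             # place taken: the only action is to continue

print(V_A[0], V_T[0])                         # optimal values at the first place
```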
Proposition 3. In the discounted infinite horizon setting, for any stationary policy π = (π, π, . . . ), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).   (9)
Proposition 4. The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].   (10)

Proof. As in the finite-horizon case, any policy applied at state x can be decomposed as π = (a, π′), where a is the first action and π′ the policy followed thereafter, so that

V*(x) (a)= max_π V^π(x) (b)= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ] (c)= max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ] (d)= max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ],

where
(b): decomposition of the policy π = (a, π′) and recursive definition of the value function,
(c): follows from

– the trivial inequality

max_{π′} Σ_y p(y|x, a) V^{π′}(y) ≤ Σ_y p(y|x, a) max_{π′} V^{π′}(y);

– the reverse inequality: let π̄(y) = arg max_{π′} V^{π′}(y). Then

Σ_y p(y|x, a) max_{π′} V^{π′}(y) = Σ_y p(y|x, a) V^{π̄}(y) ≤ max_{π′} Σ_y p(y|x, a) V^{π′}(y).
Notation. W.l.o.g. from this moment on we consider a discrete state space |X| = N , so that for any policy
π, V π is a vector of size N (i.e., V π ∈ RN ).
Definition 7. For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is defined as

T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),   (12)

and the optimal Bellman operator (or dynamic programming operator) is defined as

T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].   (13)
Proposition 5. The Bellman operators enjoy the following properties:

1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then

T^π W_1 ≤ T^π W_2,
T W_1 ≤ T W_2.

2. Offset: for any scalar c ∈ R,

T^π(W + c I_N) = T^π W + γ c I_N,
T(W + c I_N) = T W + γ c I_N.

3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,

||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,
||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: V^π is the unique fixed point of T^π and V* is the unique fixed point of T. Moreover, for any W ∈ R^N,

lim_{k→∞} (T^π)^k W = V^π,
lim_{k→∞} (T)^k W = V*.
Proof. Monotonicity (1) and offset (2) directly follow from the definitions.
The contraction property (3) holds since for any x ∈ X we have

|T W_1(x) − T W_2(x)| = | max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W_2(y) ] |

(a)≤ max_a | γ Σ_y p(y|x, a) (W_1(y) − W_2(y)) |

≤ γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)|

≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).
The fixed point property (4) follows from the Bellman equations (eq. 9 and 10) and the Banach fixed point theorem (see Proposition 11 in the appendix); the convergence of the iterates (T^π)^k W and (T)^k W is actually part of the proof of the Banach theorem.
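The following sketch implements the two operators of Definition 7 on a small randomly generated (hypothetical) MDP and numerically illustrates the γ-contraction property just proven:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, gamma = 4, 3, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)   # hypothetical p(y|x,a)
R = rng.random((N, A))                                          # hypothetical r(x,a)
pi = rng.integers(A, size=N)                                    # a deterministic stationary policy

def T_pi(W):
    # (T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)   -- eq. (12)
    return R[np.arange(N), pi] + gamma * P[np.arange(N), pi] @ W

def T(W):
    # (T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]    -- eq. (13)
    return (R + gamma * P @ W).max(axis=1)

W1, W2 = rng.random(N), rng.random(N)
lhs = np.abs(T(W1) - T(W2)).max()          # ||T W1 - T W2||_inf
rhs = gamma * np.abs(W1 - W2).max()        # gamma ||W1 - W2||_inf
assert lhs <= rhs + 1e-12                  # the contraction property of Proposition 5
```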
Remark 1 (optimal policy): Any stationary policy π*(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ] is optimal. In fact, from the definition of π* we have that T^{π*} V* = T V* = V*, where the first equality follows from the fact that the optimal Bellman operator coincides with the Bellman operator of the action taken by π*, and the second from the fixed point property of T. Furthermore, by the fixed point property of T^{π*} we have that V^{π*} is the unique fixed point of T^{π*}. Then V^{π*} = V*, thus implying that π* is an optimal policy.
Remark 2 (value/policy iteration): most of the dynamic programming algorithms studied in Section 2 will
heavily rely on property (4).
Bellman operators for Q-functions.
Both the Bellman operator T^π and the optimal Bellman operator T can be extended to Q-functions. For any function W : X × A → R (it is simpler to treat W as a function rather than as a vector, as done for value functions), we have

T^π W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) W(y, π(y)),

T W(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) max_{b∈A} W(y, b).
This also allows us to extend all the properties in Proposition 5, notably that Q^π (resp. Q*) is the fixed point of T^π (resp. T).
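As a small illustration of this fixed point property, iterating the optimal Bellman operator for Q-functions on a randomly generated (hypothetical) MDP converges to Q*, from which a greedy policy can be read off directly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, gamma = 4, 2, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)   # hypothetical p(y|x,a)
R = rng.random((N, A))                                          # hypothetical r(x,a)

Q = np.zeros((N, A))
for _ in range(500):
    # (T Q)(x, a) = r(x, a) + gamma * sum_y p(y|x, a) * max_b Q(y, b)
    Q = R + gamma * P @ Q.max(axis=1)

pi_star = Q.argmax(axis=1)   # greedy policy pi*(x) in argmax_a Q*(x, a)
```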
Recall the definition of the value function for infinite horizon problems with absorbing states in eq. 5:

V^π(x) = E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the agent reaches an absorbing state. The equivalence follows from the fact that we assume that, once the absorbing state is reached, the system stays in that state indefinitely with a constant reward of 0.
Definition 8. A stationary policy π is proper if there exists an integer n ∈ N such that from any initial state x ∈ X the probability of reaching the terminal state (denoted by x̄) after n steps is strictly positive. That is,

ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.
Proposition 6. For any proper policy π with parameter ρ_π after n steps, the value function is bounded as

||V^π||_∞ ≤ r_max Σ_{t≥0} ρ_π^{⌊t/n⌋}.
Assumption. There exists at least one proper policy and for any non-proper policy π there exists at least
one state x where the corresponding value function is negatively unbounded, i.e., V π (x) = −∞, which
corresponds to the existence of a cycle with only negative rewards.
Proposition 7. [Bertsekas and Tsitsiklis, 1996] Under the previous assumption, the optimal value function is bounded, i.e., ||V*||_∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T such that for any vector W ∈ R^N

T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].
Proposition 8. Let all the policies π be proper. Then there exist a vector µ ∈ R^N with µ > 0 and a scalar β < 1 such that, ∀x, y ∈ X, ∀a ∈ A,

Σ_y p(y|x, a) µ(y) ≤ β µ(x).

It then follows that both operators T and T^π are contractions in the weighted norm L_{∞,µ}, that is,

||T W_1 − T W_2||_{∞,µ} ≤ β ||W_1 − W_2||_{∞,µ}.
Proof. Let µ(x) be the maximum (over all policies) of the expected time to reach the terminal state from x. This can be easily cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution to the dynamic programming equation

µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y).

Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore,

Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x),

for

β = max_x (µ(x) − 1) / µ(x) < 1.
From this definition of µ and β we obtain the contraction property of T (and similarly of T^π) in the norm L_{∞,µ}:

||T W_1 − T W_2||_{∞,µ} = max_x |T W_1(x) − T W_2(x)| / µ(x)

≤ max_{x,a} Σ_y p(y|x, a) |W_1(y) − W_2(y)| / µ(x)

≤ max_{x,a} ( Σ_y p(y|x, a) µ(y) / µ(x) ) ||W_1 − W_2||_{∞,µ}

≤ β ||W_1 − W_2||_{∞,µ}.
2 Dynamic Programming
2.1 Value Iteration (VI)

The idea. We build a sequence of value functions. Let V_0 be any vector in R^N; then we iterate the application of the optimal Bellman operator so that, given V_k at iteration k, we compute

V_{k+1} = T V_k.

From Proposition 5 we have that lim_{k→∞} V_k = V*, thus repeated applications of the Bellman operator make the initial guess V_0 closer and closer to the actual optimal value function. After K iterations, the algorithm returns the greedy policy

π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
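A minimal value iteration sketch, assuming the array convention of the earlier snippets; it stops when two successive iterates are close, which by the contraction property also controls the distance to V*:

```python
import numpy as np

def value_iteration(P, R, gamma, eps=1e-6, max_iter=10_000):
    """P[x, a, y] = p(y|x, a), R[x, a] = r(x, a). Returns (V_K, greedy policy pi_K)."""
    N = P.shape[0]
    V = np.zeros(N)                                # V_0: any initial vector
    for _ in range(max_iter):
        V_new = (R + gamma * P @ V).max(axis=1)    # V_{k+1} = T V_k
        if np.abs(V_new - V).max() < eps:          # small update => V_K is close to V* (contraction)
            V = V_new
            break
        V = V_new
    pi_K = (R + gamma * P @ V).argmax(axis=1)      # greedy policy w.r.t. the returned V_K
    return V, pi_K
```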
Guarantees. Using the contraction property of the Bellman operator we obtain the convergence of V_k to V*. In fact,

||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ · · · ≤ γ^{k+1} ||V_0 − V*||_∞ → 0  as k → ∞.

This also provides the convergence rate of VI. Let ε > 0 be a desired level of accuracy in approximating V* and ||r||_∞ ≤ r_max; then after at most

K = log(r_max/ε) / log(1/γ)

iterations VI returns a value function V_K such that ||V_K − V*||_∞ < ε. In fact, after K iterations we have

||V_K − V*||_∞ ≤ γ^K ||V_0 − V*||_∞ = (ε / r_max) ||V_0 − V*||_∞,

which is smaller than ε whenever the initial guess satisfies ||V_0 − V*||_∞ ≤ r_max.
Computational complexity. One application of the optimal Bellman operator takes O(N²|A|) operations.
Remark: Notice that the previous guarantee is for the value function returned by VI and not the corre-
sponding policy πK .
Q-iteration. Exactly the same algorithm can be applied to Q-functions using the corresponding optimal
Bellman operator.
Implementation. There exist several implementations depending on the order used to update the different
components of Vk (i.e., the states). For instance, in asynchronous VI, at each iteration k, one single state xk
is chosen and only the corresponding component of Vk is updated, i.e., Vk+1 (xk ) = T Vk (xk ), while all the
other states remain unchanged. If all the states are selected infinitely often, then Vk → V ∗ .
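For comparison, a sketch of the asynchronous variant, updating one state per iteration; the round-robin schedule used below is just one possible choice satisfying the "selected infinitely often" condition:

```python
import numpy as np

def async_value_iteration(P, R, gamma, n_iter=50_000):
    """Asynchronous VI: at iteration k only the component V(x_k) is updated."""
    N = P.shape[0]
    V = np.zeros(N)
    for k in range(n_iter):
        x = k % N                                  # round-robin: every state is selected infinitely often
        V[x] = (R[x] + gamma * P[x] @ V).max()     # V_{k+1}(x_k) = (T V_k)(x_k)
    return V
```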
2.2 Policy Iteration (PI)

Notation. Further extending the vector notation, for any policy π we introduce the reward vector r^π ∈ R^N as r^π(x) = r(x, π(x)) and the transition matrix P^π ∈ R^{N×N} as P^π(x, y) = p(y|x, π(x)).
The idea. We build a sequence of policies. Let π_0 be any stationary policy. At each iteration k we perform the two following steps:

1. Policy evaluation: given π_k, compute V^{π_k}.
2. Policy improvement: compute the greedy policy

π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].

The iterations stop when V^{π_{k+1}} = V^{π_k} (equivalently, when the policy no longer changes).
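A minimal sketch of these two steps, again assuming the array convention P[x, a, y] = p(y|x, a), R[x, a] = r(x, a); the evaluation step solves the linear system of the Bellman equation (9) exactly:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P[x, a, y] = p(y|x, a), R[x, a] = r(x, a). Returns (V*, an optimal policy)."""
    N = P.shape[0]
    pi = np.zeros(N, dtype=int)                     # pi_0: an arbitrary stationary policy
    while True:
        # 1. Policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi  (Bellman equation (9))
        P_pi = P[np.arange(N), pi]                  # P^pi(x, y) = p(y|x, pi(x))
        r_pi = R[np.arange(N), pi]                  # r^pi(x) = r(x, pi(x))
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # 2. Policy improvement: greedy policy w.r.t. V^{pi_k}
        pi_new = (R + gamma * P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):              # termination: the policy is stable
            return V, pi
        pi = pi_new
```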
Proposition 9. The policy iteration algorithm generates a sequence of policies with non-decreasing performance (i.e., V^{π_{k+1}} ≥ V^{π_k}), and it converges to π* in a finite number of iterations.
Proof. From the definition of the operators T, T^{π_k}, T^{π_{k+1}} and of the greedy policy π_{k+1}, we have that

V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k},   (14)

and from the monotonicity property of T^{π_{k+1}} it follows that

V^{π_k} ≤ T^{π_{k+1}} V^{π_k},
T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k},
. . .
(T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k},
. . .

Joining the inequalities in the chain and taking the limit we obtain V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.
Then (V^{π_k})_k is a non-decreasing sequence. Furthermore, since a finite MDP admits a finite number of policies, the termination condition is eventually met for some k. At that point eq. 14 holds with equality, so that

V^{π_k} = T V^{π_k},

and by the uniqueness of the fixed point of T we have V^{π_k} = V*, which implies that π_k is an optimal policy.
Remark: More recent and refined proofs of the convergence rate of PI are available.
Policy evaluation. At each iteration k the value function V πk corresponding to the current policy πk is
computed. There are a number of different approaches to compute V πk .
• Direct computation (see the sketch after this list). From the vector notation introduced before and the Bellman equation in Proposition 3 we have that for any policy π

V^π = r^π + γ P^π V^π,  that is,  (I − γ P^π) V^π = r^π.

Since P^π is a stochastic matrix, all its eigenvalues have modulus at most 1. Thus the eigenvalues of the matrix (I − γ P^π) have modulus at least 1 − γ > 0, which guarantees that it is invertible. Then we can compute V^π as

V^π = (I − γ P^π)^{−1} r^π.

The inversion can be done with different methods. For instance, the Gauss-Jordan elimination algorithm has a complexity O(N³), which can be lowered to O(N^{2.807}) using Strassen's algorithm.
• Iterative policy evaluation. For any policy π, from the properties of the policy Bellman operator T^π, for any V_0 we have that lim_{n→∞} (T^π)^n V_0 = V^π. Thus an approximation of V^π can be computed by iterating the application of T^π for n steps. In particular, in order to achieve an ε-approximation of V^π we need O(N² log(1/ε) / log(1/γ)) operations.
• Monte-Carlo simulation. In each state x, we simulate n trajectories ((x^i_t)_{t≥0})_{1≤i≤n} following policy π, where for any i = 1, . . . , n, x^i_0 = x and x^i_{t+1} ∼ p(·|x^i_t, π(x^i_t)). We compute

V̂^π(x) ≃ (1/n) Σ_{i=1}^{n} Σ_{t≥0} γ^t r(x^i_t, π(x^i_t)).

The approximation error is of order O(1/√n).
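The sketch below compares the three approaches on a small randomly generated (hypothetical) MDP and a fixed policy; the Monte-Carlo estimate uses truncated trajectories, with the horizon chosen so that the neglected discounted tail is negligible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, gamma = 4, 2, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)   # hypothetical p(y|x,a)
R = rng.random((N, A))                                          # hypothetical r(x,a)
pi = rng.integers(A, size=N)                                    # a fixed deterministic policy

P_pi = P[np.arange(N), pi]          # P^pi(x, y) = p(y|x, pi(x))
r_pi = R[np.arange(N), pi]          # r^pi(x) = r(x, pi(x))

# 1) Direct computation: V^pi = (I - gamma P^pi)^{-1} r^pi
V_direct = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)

# 2) Iterative policy evaluation: repeated applications of T^pi starting from V_0 = 0
V_iter = np.zeros(N)
for _ in range(200):
    V_iter = r_pi + gamma * P_pi @ V_iter

# 3) Monte-Carlo simulation: average of n truncated discounted returns from each state
n, horizon = 500, 100               # gamma**horizon is tiny, so the truncation bias is negligible
V_mc = np.zeros(N)
for x0 in range(N):
    for _ in range(n):
        x, ret, discount = x0, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * r_pi[x]
            x = rng.choice(N, p=P_pi[x])
            discount *= gamma
        V_mc[x0] += ret / n

print(np.abs(V_direct - V_iter).max(), np.abs(V_direct - V_mc).max())
```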
Policy improvement. The computation of π_{k+1} requires the computation of an expectation w.r.t. the next state, which might be as expensive as O(N|A|) per state. If we move to a policy iteration on Q-functions, then the policy improvement step simplifies to

π_{k+1}(x) ∈ arg max_{a∈A} Q^{π_k}(x, a),

which has a computational cost of O(|A|) per state. Furthermore, this allows computing the greedy policy even when the dynamics p is not known.

Cons: each iteration requires a full policy evaluation, which might be expensive.
Proposition 10. Let B = T − I be the Bellman residual operator and B′ its derivative. Then the sequence of policies π_k generated by PI satisfies

V^{π_{k+1}} = V^{π_k} − (B′_k)^{−1} B V^{π_k},  with B′_k = γ P^{π_{k+1}} − I,

i.e., policy iteration coincides with Newton's method applied to finding a zero of the Bellman residual B.
[Figure: geometric interpretation of policy iteration. The convex function V ↦ T V − V is the upper envelope of the affine functions V ↦ T^{π_1} V − V, T^{π_2} V − V, T^{π_3} V − V; their zeros are the value functions V^{π_1}, V^{π_2}, V^{π_3}, and the zero of T V − V is V = V*.]
Following the geometric interpretation, V^{π_k} is the zero of the affine operator T^{π_k} − I. Since the map V ↦ T V − V = max_π (T^π V − V) is convex, Newton's method is guaranteed to converge for any V_0 such that T V_0 − V_0 ≥ 0.
Linear Programming. The previous intuition is at the basis of the observation that V* is the smallest vector V such that V ≥ T V. In fact, by monotonicity,

V ≥ T V  ⟹  V ≥ lim_{k→∞} (T)^k V = V*.

As a consequence, V* is the solution of the linear program:

• Min_V Σ_x V(x),
• subject to V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y), for all x ∈ X, a ∈ A.
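A sketch of this linear program using scipy.optimize.linprog on a small randomly generated (hypothetical) MDP; the constraints V(x) ≥ r(x, a) + γ Σ_y p(y|x, a) V(y) are rewritten in the ≤ form expected by the solver:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, A, gamma = 4, 2, 0.9
P = rng.random((N, A, N)); P /= P.sum(axis=2, keepdims=True)   # hypothetical p(y|x,a)
R = rng.random((N, A))                                          # hypothetical r(x,a)

c = np.ones(N)                       # objective: min sum_x V(x)
A_ub = np.zeros((N * A, N))          # one inequality constraint per pair (x, a)
b_ub = np.zeros(N * A)
for x in range(N):
    for a in range(A):
        # V(x) - gamma sum_y p(y|x,a) V(y) >= r(x,a)  <=>  (gamma p(.|x,a) - e_x) . V <= -r(x,a)
        row = gamma * P[x, a].copy()
        row[x] -= 1.0
        A_ub[x * A + a] = row
        b_ub[x * A + a] = -R[x, a]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * N, method="highs")
V_star = res.x                       # coincides (up to solver tolerance) with the value iteration solution
print(V_star)
```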
A Probability Theory
Definition 9 (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is

P(A|B) = P(A ∩ B) / P(B).

Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is

P(X ∈ A | Y ∈ B) = ( ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy ) / ( ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy ).
Definition 10 (Law of total expectation). Given a function f and two random variables X, Y we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].
B Norms and Contractions

Given a vector v ∈ R^d, we use the following norms:

• L_p-norm:

||v||_p = ( Σ_{i=1}^{d} |v_i|^p )^{1/p}.

• L_∞-norm:

||v||_∞ = max_{1≤i≤d} |v_i|.

• L_{µ,p}-norm:

||v||_{µ,p} = ( Σ_{i=1}^{d} |v_i|^p / µ_i )^{1/p}.

• L_{µ,∞}-norm:

||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.

• Matrix-weighted L_2-norm, for a positive definite matrix P:

||v||²_P = v^⊤ P v.
Definition 14. A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is
convergent in the norm of the space.
Proposition 11 (Banach fixed point theorem). Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then

1. T admits a unique fixed point v* ∈ V (i.e., T v* = v*), and
2. for any v_0 ∈ V, lim_{n→∞} T^n v_0 = v*.

Proof. (Uniqueness) For any v, u ∈ V,

||v − u|| (a)≤ ||v − T v|| + ||T v − T u|| + ||T u − u|| (b)≤ ||v − T v|| + γ ||v − u|| + ||T u − u||,

where (a) follows from the triangle inequality and (b) from the contraction property of T. Rearranging the terms we obtain:

||v − u|| ≤ ( ||v − T v|| + ||T u − u|| ) / (1 − γ).   (15)

If v and u are both fixed points then ||T v − v|| = 0 and ||T u − u|| = 0, thus from the previous inequality ||v − u|| ≤ 0, which implies v = u. Then T admits at most one fixed point.

(Existence) Since T is a γ-contraction, the sequence v_n = T^n v_0 is a Cauchy sequence and, by completeness of V, it admits a limit v = lim_{n→∞} v_n. Then

v = lim_{n→∞} v_{n+1} (a)= lim_{n→∞} T v_n (b)= T ( lim_{n→∞} v_n ) = T v,

where (a) follows from the definition of v_{n+1} and the fact that {T^n v_0}_n is a convergent Cauchy sequence, while (b) follows from the fact that a contraction operator T is also continuous. Joining the two equalities we obtain v = T v, which is the definition of a fixed point.
C Linear Algebra
Eigenvalues of a matrix. Given a square matrix A ∈ R^{N×N}, a vector v ∈ R^N (v ≠ 0) and a scalar λ are an eigenvector and the corresponding eigenvalue of A if

A v = λ v.
References
[Bellman, 1957] Bellman, R. E. (1957). Dynamic Programming. Princeton University Press, Princeton, N.J.
[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. and Tsitsiklis, J. (1996). Neuro-Dynamic Programming.
Athena Scientific, Belmont, MA.
[Fleming and Rishel, 1975] Fleming, W. and Rishel, R. (1975). Deterministic and Stochastic Optimal Control. Applications of Mathematics, vol. 1. Springer-Verlag, Berlin, New York.
[Howard, 1960] Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT Press, Cam-
bridge, MA.
[Puterman, 1994] Puterman, M. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.