Quick Start: Resolving a Markov Decision Process Problem Using the MDPtoolbox in Matlab
January 2014
1 MDP framework
(From Wikipedia, the free encyclopedia with minor changes)
The probability that the process moves into its new state s' is influenced by the chosen action. Specifically, it is given by the state transition function P(s, s', a). Thus, the next state s' depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.
Definition
In its typical definition, a Markov decision process is a 4-tuple < S, A, P, R >,
where:
- S is a finite set of states,
- A is a finite set of actions,
- P(s, s', a) is the probability that choosing action a in state s leads to state s' at the next time step,
- R(s, s', a) is the immediate reward received after transitioning from state s to state s' under action a.
The core problem of MDPs is to find a "policy" for the decision maker: a function π that specifies the action π(s) that the decision maker will choose when in state s. The goal is to choose a policy π that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon:
∑_{t=0}^{∞} γ^t R(s_t, s_{t+1}, π(s_t))
Algorithms
MDPs can be solved by linear programming or dynamic programming. In what follows we present the latter approach. The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π, which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.
The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. They are defined recursively as follows:

π(s) := argmax_a { ∑_{s'} P(s, s', a) ( R(s, s', a) + γ V(s') ) }

V(s) := ∑_{s'} P(s, s', π(s)) ( R(s, s', π(s)) + γ V(s') )
Their order depends on the variant of the algorithm; one can also do them
for all states at once or state by state, and more often to some states than
others. As long as no state is permanently excluded from either of the steps,
the algorithm will eventually arrive at the correct solution.
Notable variants: Value iteration, Policy iteration, Modified policy iteration.
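As an illustration of these two steps, here is a minimal value iteration sketch in Matlab. It is not the toolbox implementation: the function name simple_value_iteration, the S×S×A shape assumed for P and R, and the stopping threshold epsilon are choices made only for this example.

function [V, policy] = simple_value_iteration(P, R, gamma, epsilon)
% P(s,s',a): transition probabilities, R(s,s',a): rewards,
% gamma: discount factor, epsilon: stopping threshold.
S = size(P,1); A = size(P,3);
V = zeros(S,1); policy = ones(S,1);
delta = Inf;
while delta > epsilon
    Vold = V;
    Q = zeros(S,A);
    for a = 1:A
        % Q(s,a) = sum over s' of P(s,s',a) * ( R(s,s',a) + gamma*V(s') )
        Q(:,a) = sum( P(:,:,a) .* ( R(:,:,a) + gamma*repmat(Vold',S,1) ), 2 );
    end
    [V, policy] = max(Q, [], 2);   % value update and greedy policy improvement
    delta = max(abs(V - Vold));
end
end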
2 MDPtoolbox
The MDPtoolbox (http://www7.inra.fr/mia/T/MDPtoolbox) provides functions for solving discrete-time Markov Decision Processes: value iteration, policy iteration and linear programming algorithms, with several variants.
It is currently available for several environments: MATLAB, GNU Octave, Scilab and R.
To use the toolbox, start Matlab and add the MDPtoolbox directory to the search path. For example, go to the MDPtoolbox directory, start Matlab and execute:
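A minimal possibility, assuming Matlab was started from within the MDPtoolbox directory (so that pwd is the toolbox directory), is:

>> addpath(pwd)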
In this Matlab session, it is then possible to use all the MDPtoolbox functions. To access the HTML documentation, open the local file MDPtoolbox/documentation/DOCUMENTATION.html with a browser.
After a wildfire, the stand returns to the youngest age-class (state 1). Let p = 0.1 be the probability of wildfire occurrence during a time period. The problem is how to manage this stand over the long term so as to maximize the γ-discounted reward, with γ = 0.95.
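The transition matrix P and reward matrix R are assumed below to have already been built. One possible way to obtain matrices of this kind is the toolbox generator mdp_example_forest; in the call below, the 3 states, the reward of 4 for conserving the oldest age-class and the wildfire probability 0.1 follow the values used in this example, while the cutting reward of 2 is an extra assumption.

>> [P, R] = mdp_example_forest(3, 4, 2, 0.1);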
>> mdp_check(P, R)
ans =
''
When the output is empty, no error was detected.
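One possible way to obtain the policy and value function discussed below is a sketch like the following, assuming the P and R built above and the discount factor γ = 0.95; mdp_policy_iteration and mdp_LP are two of the toolbox solvers.

>> [V, policy] = mdp_policy_iteration(P, R, 0.95)
>> [V, policy] = mdp_LP(P, R, 0.95)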
The optimal policy found is to choose action Wait (1) in the 3 defined states, that is, in plainer terms, 'never cut'.
Note that the mdp_LP function provides the exact expected value function. For instance, V(2) = 61.9020 is the expected value when starting in state 2 (age-class 21-40 years for trees). For the other algorithms, some functions are available in the toolbox to better apprehend V. Let us call them. Note that mdp_eval_policy_matrix also provides the exact expected value function.
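A possible sketch of such calls, assuming the P, R, discount factor 0.95 and policy obtained above, is:

>> Vpolicy = mdp_eval_policy_matrix(P, R, 0.95, policy)
>> Vpolicy = mdp_eval_policy_iterative(P, R, 0.95, policy)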
function [mu, is_OK_mu] = get_stationary_distribution( p )
% Computes the stationary distribution mu of a Markov chain
% described by p (stochastic matrix, i.e. sum(p,2)=1).
% Input
%   p : transition matrix associated with a policy, p(s,s')
% Output
%   mu : stationary distribution over the states ( mu*p=mu, sum(mu)=1 )
%   is_OK_mu : false if p is not a stochastic matrix, else true
s=size(p,1);
mu=zeros(1,s);
is_OK_mu=false;
if any(abs(sum(p,2)-1)>10^-4) || (size(p,2)~=s)
    disp('ERROR in get_stationary_distribution: argument p must be a stochastic matrix')
else
    % mu satisfies transpose(p)*mu'=mu' and mu sums to one
    A=transpose(p)-eye(s);
    A(s,:)=ones(1,s);   % replace the last equation by the normalization sum(mu)=1
    b=zeros(s,1);
    b(s)=1;
    mu=transpose(A\b);
    is_OK_mu = ~isempty(mu) && all(mu>-10^-4) && (abs(sum(mu)-1)<10^-4);
    if ~is_OK_mu; mu=[]; end
end
Then call this function and plot (Figure 1) the stationary distribution in
age-classes.
>> mu = get_stationary_distribution( mdp_computePpolicyPRpolicy(P, R, policy) )
mu =
0.1000 0.0900 0.8100
>> bar(mu,0.4); ylim([0 1]);
>> xlabel('age-class'); ylabel('percentage of time in age-class');
Beware that when copying the Matlab code from this document, the character ’ must be replaced by '.
First, what happens with a lower incentive to conserve the oldest age-class (reward 0.4 instead of 4)?
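Assuming the matrices were generated with mdp_example_forest as sketched earlier, the modified problem can be set up and solved again with:

>> [P, R] = mdp_example_forest(3, 0.4, 2, 0.1);
>> [V, policy] = mdp_LP(P, R, 0.95)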
The policy found now asks to cut the oldest age-class.
Let us compute the state stationary distribution and plot it (Figure 2).
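Following the same pattern as for Figure 1, a sketch is:

>> mu = get_stationary_distribution( mdp_computePpolicyPRpolicy(P, R, policy) )
>> bar(mu,0.4); ylim([0 1]);
>> xlabel('age-class'); ylabel('percentage of time in age-class');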
Beware that when copying the Matlab code from this document, the character ’ must be replaced by '.
As expected, the policy found changes and now requires cutting at the second age-class, with lower expected values.
Let us compute the state stationary distribution and plot it (Figure 3).
>> mu = get_stationary_distribution( mdp_computePpolicyPRpolicy(P, R, policy) )
mu =
    0.8333    0.1667         0
>> bar(mu,0.4); ylim([0 1]);
>> xlabel('age-class'); ylabel('percentage of time in age-class');
Beware that when copying the Matlab code from this document, the character ’ must be replaced by '.