5.4-Reinforcement Learning-Part1-Introduction
Reinforcement Learning terminology
Environment The 'microworld' defined for the particular RL problem, including the agent. Designated as E.
Agent The entity that observes states and takes actions within the environment. Designated as A.
State A particular configuration of the agent within the environment. Designated as s.
Terminal states Defined end states for the particular RL problem. Designated as TS.
Action An agent selects an action a based upon the current state s and the policy π.
Policy A policy π is a mapping from states of the environment to the potential actions of an agent in those states. π(s) can be deterministic, depending only on s, or stochastic, π(s,a), depending also on a.
Transition probability function T(s'|s,a) specifies the probability that the environment will transition to state s' if the agent takes action a in state s. For a deterministic environment this can be written s' = T(s,a).
Episode Sometimes called an epoch. A sequence of states, actions and rewards, which ends in a terminal state.
Reward or Reward signal The reward R(s'|s,a) gives feedback from the environment on the effect of a single action a in state s leading to s'.
Discounted Reward When calculating the Return, the expected rewards for future steps can be weighted with a discount factor γ in the interval [0,1].
Return The accumulated rewards for an episode. Designated as G.
Value function The Value function V = V(s) is the estimation of the Value or Utility of s with respect to its average Return over all episodes possible under the current policy. It must continually be re-estimated for each action taken.
The state-value function v(s) for the policy π is given below. Note that the value of the terminal state (if any) is always zero.
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s ]
Model of the environment A set of tuples T(s,a)=s' and R(s'|s,a), representing the possible state transitions and the rewards given.
Example 1 - the 4x3 world
[Figure: a 4x3 grid board with columns 1-4 and rows 1-3; the start position is (1,1), and (4,3) and (4,2) are the terminal positions marked +1 and -1.]
Reward: There are two terminal positions, (4,3) and (4,2). Reaching (4,3) gives a reward of +1. Reaching (4,2) gives a reward of -1. Reaching all other positions gives a reward of 0.
Actions: up, down, left, right, but restricted by the board configuration. The policy is deterministic in the sense that an action in a specific state can only lead to one other state.
Episode: e.g. a sequence of states, actions and rewards leading from (1,1) to (4,3), for instance via (1,2), (1,3), (2,3) and (3,3). Three example episodes:
(1,1)→(1,2)→(1,3)→(1,2)→(1,3)→(2,3)→(3,3)→(4,3) +1
(1,1)→(1,2)→(1,3)→(2,3)→(3,3)→(3,2)→(3,3)→(4,3) +1
(1,1)→(2,1)→(3,1)→(3,2)→(4,2) -1
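To make the example concrete, here is a minimal Python sketch of the 4x3 world. It assumes a plain grid with no blocked cells, deterministic moves, and that an action leading off the board leaves the agent in place; these details are assumptions for illustration, not taken from the slide.

# Minimal sketch of the 4x3 world under the assumptions stated above.
# States are (column, row) pairs with columns 1-4 and rows 1-3.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TERMINAL = {(4, 3): +1, (4, 2): -1}   # terminal states and their rewards
START = (1, 1)

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):   # move would leave the board
        nxt = state
    return nxt, TERMINAL.get(nxt, 0), nxt in TERMINAL

def run_episode(action_sequence):
    """Follow a fixed action sequence from the start state; return the visited states and the Return G."""
    s, G, trace = START, 0, [START]
    for a in action_sequence:
        s, r, done = step(s, a)
        trace.append(s)
        G += r
        if done:
            break
    return trace, G

# The third example episode above: (1,1)->(2,1)->(3,1)->(3,2)->(4,2), Return -1.
print(run_episode(["right", "right", "up", "right"]))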
Reinforcement Learning modelled as Markov Decision Process (MDP)
A suitable model for the above intuitive scenario is a basic Markov Decision Process (MDP).
An MDP is typically defined by a 4-tuple (S, A, R, T) where:
• S is the set of states; s is the state of the environment (including the agent)
• A is the set of actions for each s that the agent can choose between, as defined by a policy π
• R(s'|s,a) is a function that returns the reward received for taking action a in state s leading to s'
• T(s'|s,a) is the transition probability function
[Figure: a state-transition graph of the MDP, with transition probabilities (e.g. 0.5, 0.33, 1.0) on the edges and the +1 and -1 terminal states.]
In one MDP scenario, there is a complete and exact model of the MDP, in the sense that T(s'|s,a) and R(s'|s,a) are fully defined.
However, in many cases T and R are not completely known, meaning that the model of the MDP is not complete.
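As an illustration of what a complete model means in practice, the sketch below stores T(s'|s,a) and R(s'|s,a) for a tiny hypothetical two-state MDP as Python dictionaries; all state names, action names and numbers are invented for illustration.

# Hypothetical two-state MDP with a fully specified model: the dictionaries
# below define T(s'|s,a) and R(s'|s,a) completely (the "complete model" case).
T = {  # (s, a) -> {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {  # (s, a, s') -> reward
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
    ("s1", "stay", "s1"): 0.5,
    ("s1", "go", "s0"): 0.0, ("s1", "go", "s1"): 0.5,
}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s under this model."""
    return sum(p * R[(s, a, s2)] for s2, p in T[(s, a)].items())

print(expected_reward("s0", "go"))   # 0.2*0.0 + 0.8*1.0 = 0.8

When T and R are not known, such tables are not available and the agent must learn from interaction with the environment instead.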
Dynamic Programming (DP) – motives behind the name
Like divide and conquer, DP simplifies a complicated problem by breaking it down into simpler sub-problems in a recursive manner and then combining the solutions to the sub-problems to form the total solution.
If a problem can be optimally solved by breaking it recursively into sub-problems and then forming the solution from optimal solutions to the sub-problems, then it is said to have optimal substructure.
Unlike divide and conquer, the sub-problems are not independent; sub-problems may share sub-sub-problems.
Typically, the utility value of a solution in a certain state is defined by a value function calculated recursively from the value functions for the remaining steps of an episode, weighted by the relevant probabilities and by a discount parameter that gives higher weight to closer reward contributions. Typically, the policy function can be calculated in a similar fashion.
The defining equation for finding an optimal value function is called the Bellman equation.
Bellman's principle of optimality and the Bellman equation
Dynamic programming algorithms have the ambition to optimize two functions: the value function V(s) and the policy π(s).
Bellman's principle of optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
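The Bellman equation itself is not written out above; in the notation of this section (rewards R(s'|s,a), transition probabilities T(s'|s,a) and discount factor γ), its standard form for the optimal state-value function V*, together with the corresponding greedy optimal policy π*, is:
V*(s) = max_a Σ_s' T(s'|s,a) [ R(s'|s,a) + γ V*(s') ]
π*(s) = argmax_a Σ_s' T(s'|s,a) [ R(s'|s,a) + γ V*(s') ]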
Value iteration
Initialise V arbitrarily
repeat
Improve V_{k+1} using the estimate of V_k
until convergence
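A compact Python sketch of value iteration for a generic finite MDP follows; the dictionary-based model format, the discount factor γ = 0.9 and the stopping threshold are choices made here for illustration rather than anything prescribed by the slide.

# Value iteration: repeatedly apply the Bellman optimality update
#   V_{k+1}(s) = max_a sum_{s'} T(s'|s,a) [ R(s'|s,a) + gamma * V_k(s') ]
# until the largest change over all states falls below a small threshold.
def value_iteration(states, A, T, R, gamma=0.9, theta=1e-6):
    # A: state -> list of available actions ([] for terminal states)
    # T: (s, a) -> {s': probability},  R: (s, a, s') -> reward
    V = {s: 0.0 for s in states}                # initialise V arbitrarily (here: zeros)
    while True:
        V_new, delta = {}, 0.0
        for s in states:
            if not A[s]:                        # terminal state: value stays 0
                V_new[s] = 0.0
            else:
                V_new[s] = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                    for a in A[s]
                )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                       # convergence
            return V

# Tiny illustrative MDP (all names and numbers made up): "go" reaches the
# terminal state "end" with reward 1, "stay" loops with reward 0.
STATES = ["s0", "end"]
A = {"s0": ["go", "stay"], "end": []}
T = {("s0", "go"): {"end": 1.0}, ("s0", "stay"): {"s0": 1.0}}
R = {("s0", "go", "end"): 1.0, ("s0", "stay", "s0"): 0.0}
print(value_iteration(STATES, A, T, R))         # {'s0': 1.0, 'end': 0.0}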
Policy iteration
Initialise V and π arbitrarily
repeat
Evaluate V using π
Improve π using V
until convergence.
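In the same illustrative dictionary-based format (again an assumption made for this sketch, not a format from the slide), policy iteration can look like this:

# Policy iteration: alternate policy evaluation (compute V for the current
# policy pi) and policy improvement (make pi greedy w.r.t. V) until pi is stable.
def policy_iteration(states, A, T, R, gamma=0.9, theta=1e-6):
    pi = {s: (A[s][0] if A[s] else None) for s in states}    # arbitrary initial policy
    V = {s: 0.0 for s in states}                             # arbitrary initial values
    while True:
        # Policy evaluation: iterate V(s) = sum_s' T(s'|s,pi(s)) [R + gamma V(s')]
        while True:
            delta = 0.0
            for s in states:
                if pi[s] is None:
                    continue
                v = sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                        for s2, p in T[(s, pi[s])].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: choose the greedy action in every non-terminal state
        stable = True
        for s in states:
            if pi[s] is None:
                continue
            best = max(A[s], key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items()))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                                           # policy unchanged -> done
            return pi, V

# With the same tiny illustrative MDP as in the value iteration sketch:
STATES = ["s0", "end"]
A = {"s0": ["stay", "go"], "end": []}
T = {("s0", "go"): {"end": 1.0}, ("s0", "stay"): {"s0": 1.0}}
R = {("s0", "go", "end"): 1.0, ("s0", "stay", "s0"): 0.0}
print(policy_iteration(STATES, A, T, R)[0])                  # {'s0': 'go', 'end': None}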
Linear programming
Prioritized sweeping - use priority queue to update states with large potential for change
Simple example to illustrate Dynamic Programming: The shortest path in multi-stage graphs
The shortest path is (S, C, F, T) => 5+2+2 = 9. A greedy method gives (S, A, D, T) => 1+4+18 = 23.
Example with Dynamic programming approach
Forward approach
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)} = min{1+22, 2+18, 5+4} = 9.
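The same recursion, d(v,T) = min over edges (v,w) of { c(v,w) + d(w,T) } with d(T,T) = 0, is sketched in Python below with memoisation. Since the full weighted graph of the figure is not reproduced here, the edge list is an illustrative reconstruction consistent with the costs quoted above, not necessarily the original graph.

from functools import lru_cache

# Shortest path in a multi-stage (acyclic) graph by dynamic programming:
# d(v) = min over outgoing edges (v, w) of  c(v, w) + d(w),  with d(T) = 0.
# The edge weights are illustrative; they reproduce the quoted costs
# (d(A,T)=22, d(B,T)=18, d(C,T)=4) but the middle stage is assumed.
EDGES = {            # node -> {successor: edge cost}
    "S": {"A": 1, "B": 2, "C": 5},
    "A": {"D": 4},
    "B": {"E": 7},
    "C": {"F": 2},
    "D": {"T": 18},
    "E": {"T": 11},
    "F": {"T": 2},
    "T": {},
}

@lru_cache(maxsize=None)
def d(v):
    """Return (cost, path) of the cheapest route from v to the sink T; sub-results are memoised."""
    if v == "T":
        return 0, ("T",)
    return min((c + d(w)[0], (v,) + d(w)[1]) for w, c in EDGES[v].items())

print(d("S"))        # (9, ('S', 'C', 'F', 'T')) -- the greedy route via A costs 23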
Markov Decision Process (MDP) can be generalized to
Partially Observable Markov Decision Process (POMDP)
A POMDP models an agent's decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. POMDPs enlarge the applicability of model-based MDPs.
Because the agent does not directly observe the environment's state, the
agent must make decisions under uncertainty of the true environment
state.