
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 5 Machine Learning enabled by prior Theories

Video 5.4 Reinforcement Learning – Part 1 - Introduction


Reinforcement Learning (RL)

The intuitive scenario for Reinforcement Learning is an Agent that learns from interaction with an Environment to achieve some long-term goals related to the State of the environment, by performing a sequence of actions and by receiving feedback.

The mapping from states to possible actions is called a Policy.

The achievement of goals is defined by Rewards or reward signals, being the feedback on actions from the environment.

The Return is defined as the cumulative (possibly discounted) rewards over an Episode (= action sequence) leading to a Terminal state.

The goal of any RL algorithm is to establish a policy that maximizes the Return.
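As a small worked formula (standard notation, assumed here rather than taken from the lecture), the return of an episode of T steps with rewards r_1, ..., r_T and discount factor \gamma can be written as:

G = \sum_{t=0}^{T-1} \gamma^{t} \, r_{t+1}

With \gamma = 1 this is the plain sum of rewards; with \gamma < 1, rewards closer in time carry more weight.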
[Image: Micromouse]
Reinforcement Learning terminology

Environment: The 'microworld' defined for the particular RL problem, including the agent. Designated as E.
Agent: The acting entity within the environment. Designated as A.
State: A particular configuration of the agent within the environment. Designated as s.
Terminal states: Defined end states for the particular RL problem. Designated as TS.
Action: An agent selects an action based upon the current state s and the policy π.
Policy: A policy π is a mapping from states of the environment to the potential actions of an agent in those states. A policy can be deterministic, π(s), depending only on s, or stochastic, π(s,a), depending also on a.
Transition probability function: s′ = T(s,a) specifies the probability that the environment will transition to state s′ if the agent takes action a in state s.
Episode: Sometimes called an epoch. A sequence of states, actions and rewards, which ends in a terminal state.
Reward or reward signal: The reward R(s′|s,a) gives feedback from the environment on the effect of a single action a taken in state s and leading to s′.
Discounted reward: When calculating the Return, the expected rewards for future steps can be weighted with a discount factor γ in the interval [0,1].
Return: The accumulated rewards for an episode. Designated as G.
Value function: The value function V = V(s) is the estimation of the value or utility of s with respect to its average Return over all episodes possible within the current policy. It must continually be re-estimated for each action taken. The state-value function v(s) for the policy π is given in the sketch after this table. Note that the value of the terminal state (if any) is always zero.
Model of the environment: A set of tuples T(s,a) = s′ and R(s′|s,a), representing the possible state transitions and the rewards given.
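A minimal sketch of the state-value function referred to in the Value function entry, assuming the standard definition (the formula from the original slide is not reproduced in this text):

v_\pi(s) = \mathbb{E}_\pi\left[\, G \mid s_0 = s \,\right] = \mathbb{E}_\pi\left[\, \sum_{t=0}^{T-1} \gamma^{t} r_{t+1} \;\middle|\; s_0 = s \,\right], \qquad v_\pi(s_{\text{terminal}}) = 0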
Example 1 - the 4x3 world

[Figure: a 4x3 grid with start at (1,1), terminal +1 at (4,3), terminal -1 at (4,2), and (2,2) blocked.]

Environment and agent: The 4x3 world is a board of 4 * 3 positions, indexed by two coordinates. The agent has a start position of (1,1). The position (2,2) is excluded.

Reward: There are two terminal positions, (4,3) and (4,2). Reaching (4,3) gives a reward = +1. Reaching (4,2) gives a reward = -1. Reaching all other positions gives reward = 0.

Actions: up, down, left, right, but restricted by the board configuration. The policy is deterministic in the sense that an action in a specific state can only lead to one other state.

Episode: e.g. a sequence of actions leading from (1,1) to (4,3) via (1,2), (1,3), (2,3) and (3,3).

Examples of episodes with returns:

(1,1)🡪(1,2)🡪(1,3)🡪(1,2)🡪(1,3)🡪(2,3)🡪(3,3)🡪(4,3)   return +1
(1,1)🡪(1,2)🡪(1,3)🡪(2,3)🡪(3,3)🡪(3,2)🡪(3,3)🡪(4,3)   return +1
(1,1)🡪(2,1)🡪(3,1)🡪(3,2)🡪(4,2)   return -1
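A minimal Python sketch of the 4x3 world under the assumptions above (deterministic moves, undiscounted rewards by default); the function and variable names are illustrative, not taken from the lecture:

# Minimal sketch of the 4x3 world: states are (column, row) pairs,
# (2, 2) is blocked, (4, 3) and (4, 2) are terminal.
BLOCKED = {(2, 2)}
TERMINAL_REWARD = {(4, 3): +1, (4, 2): -1}

# Deterministic moves: an action shifts the agent one step if the move stays on the board.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply one deterministic action; illegal moves leave the state unchanged."""
    dx, dy = MOVES[action]
    nxt = (state[0] + dx, state[1] + dy)
    if nxt in BLOCKED or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        nxt = state                       # bumping into a wall or the blocked cell
    reward = TERMINAL_REWARD.get(nxt, 0)  # reward 0 everywhere except the terminal states
    return nxt, reward

def episode_return(actions, start=(1, 1), gamma=1.0):
    """Accumulate the (optionally discounted) rewards of one episode."""
    state, g, discount = start, 0.0, 1.0
    for a in actions:
        state, r = step(state, a)
        g += discount * r
        discount *= gamma
        if state in TERMINAL_REWARD:      # reaching a terminal state ends the episode
            break
    return g

# Example: the first episode listed above, (1,1) -> ... -> (4,3), gives return +1.
print(episode_return(["up", "up", "down", "up", "right", "right", "right"]))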
Reinforcement Learning modelled as Markov Decision Process (MDP)

A suitable model for the above intuitive scenario is a basic Markov Decision Process (MDP).

An MDP is typically defined by a 4-tuple (S, A, R, T) where:

• S is the set of states; s is the state of the environment (including the agent).
• A is the set of actions for each s that the agent can choose between, defined by a policy π.
• R(s′|s,a) is a function that returns the reward received for taking action a in state s leading to s′.
• s′ = T(s,a) is a transition probability function, specifying the probability that the environment will transition to state s′ if the agent takes action a in state s. The Markov property: transition probabilities depend on the current state only, not on the path to the state.

[Figure: state-transition diagram for the 4x3 world, with transition probabilities such as .5 and .33 labelling the edges. State transitions: a state transitions to its neighbouring states, in this case with equal probability, and the probabilities out of a state sum to 1.]

The goal is to find a policy π that maximizes the return = the expected future accumulated (discounted) reward for episodes from a Start state s to a Terminal state.
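The Markov property mentioned above can be stated compactly (standard notation, an assumption rather than a quote from the slide):

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t, a_t)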
Dynamic Programming versus Reinforcement Learning

In one MDP scenario, there is a complete and exact model of the MDP in the sense that T(S,A) and R(S,A) are fully defined. In this case the MDP problem is a Planning Problem that can be exactly solved by use of e.g. Dynamic Programming.

However, in many cases T(S,A) and R(S,A) are not completely known, meaning that the model of the MDP is not complete. In such cases we have a true Reinforcement Learning problem.


Dynamic Programming (DP)
Dynamic Programming is an algorithm design technique for optimization problems, often for minimizing or maximizing. The method was developed by Richard Bellman in the 1950s.

Like divide and conquer, DP simplifies a complicated problem by breaking it down into simpler sub-problems in a recursive manner and then combining the solutions to the sub-problems to form the total solution.

If a problem can be optimally solved by breaking it recursively into sub-problems and then forming the solution from optimal solutions to the sub-problems, then it is said to have optimal substructure.

Unlike divide and conquer, the sub-problems are not independent: sub-problems may share sub-sub-problems.

Typically the utility value of a solution in a certain state is defined by a value function, calculated recursively from the value functions for the remaining steps of an episode, moderated by the relevant probabilities and by a discount parameter that gives higher weight to closer reward contributions. Typically the policy function can be calculated in a similar fashion.

The defining equation for finding an optimal value function is called the Bellman equation.
Dynamic Programming (DP) – motives behind the name
Bellman's principle of optimality and the Bellman equation

Dynamic programming algorithms have the ambition to optimize the following two functions:

Policy function:  π(s) := argmax_a ∑_{s′} T(s,a) * ( R(s′|s,a) + γ * V(s′) )

Value function:   V(s) := ∑_{s′} T(s,a) * ( R(s′|s,a) + γ * V(s′) )

Richard Bellman's Principle of Optimality

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

An optimal value function is a solution to the so-called Bellman equation:

V(s) = max_{i=1..n} ( R(s′|s,a_i) + γ * V( s′ = T(s,a_i) ) )

An optimal value function can be the basis for an optimal policy function.
Three standard approaches to solve the Bellman Equation

Value iteration (a runnable sketch follows after this list)
  Initialise V arbitrarily
  repeat
    improve V_{k+1} using the estimate of V_k
  until convergence

Policy iteration
  Initialise V and π arbitrarily
  repeat
    evaluate V using π
    improve π using V
  until convergence

Linear programming

Prioritized sweeping - use a priority queue to update the states with the largest potential for change
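As a hedged illustration of value iteration (not code from the lecture), here is a minimal Python sketch for a generic finite MDP given as tables; the names T, R, gamma and theta are assumptions made for this sketch:

# Value iteration sketch for a small finite MDP.
# T[s][a] is a list of (probability, next_state) pairs; R[(s, a, s2)] is the reward.
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6, terminals=()):
    V = {s: 0.0 for s in states}                      # arbitrary initialisation (zeros)
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue                              # value of a terminal state stays zero
            # Bellman optimality backup: V(s) = max_a sum_s' T * (R + gamma * V(s'))
            best = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for p, s2 in T[s][a])
                for a in actions[s]
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                             # stop when the largest update is tiny
            return V

# Tiny usage example (hypothetical two-state MDP):
states = ["s0", "s1"]
actions = {"s0": ["go"], "s1": []}
T = {"s0": {"go": [(1.0, "s1")]}}
R = {("s0", "go", "s1"): 1.0}
print(value_iteration(states, actions, T, R, terminals=("s1",)))

A greedy policy can then be read off from the resulting V by taking, in each state, the action that maximizes the same backed-up quantity.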
Simple example to illustrate Dynamic Programming: The shortest path in multi-stage graphs

The shortest path is (S, C, F, T) => 5+2+2 = 9. A greedy method gives (S, A, D, T) => 1+4+18 = 23.

Dynamic programming approach (forward approach):

• d(S,T) = min{ 1+d(A,T), 2+d(B,T), 5+d(C,T) }
• d(A,T) = min{ 4+d(D,T), 11+d(E,T) } = min{ 4+18, 11+13 } = 22
• d(B,T) = min{ 9+d(D,T), 5+d(E,T), 16+d(F,T) } = min{ 9+18, 5+13, 16+2 } = 18
• d(C,T) = min{ 2+d(F,T) } = 2+2 = 4
• d(S,T) = min{ 1+d(A,T), 2+d(B,T), 5+d(C,T) } = min{ 1+22, 2+18, 5+4 } = 9
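A minimal Python sketch of the same computation; the graph itself is not shown in the text, so the edge list below is an assumption reconstructed from the calculations above:

# Multi-stage graph from the example; edge weights taken from the calculations above.
import functools

EDGES = {
    "S": {"A": 1, "B": 2, "C": 5},
    "A": {"D": 4, "E": 11},
    "B": {"D": 9, "E": 5, "F": 16},
    "C": {"F": 2},
    "D": {"T": 18},
    "E": {"T": 13},
    "F": {"T": 2},
    "T": {},
}

@functools.lru_cache(maxsize=None)
def d(node, target="T"):
    """Shortest distance from node to target, computed recursively with memoisation."""
    if node == target:
        return 0
    # Bellman-style recursion: d(node) = min over successors of (edge weight + d(successor)).
    return min(w + d(nxt, target) for nxt, w in EDGES[node].items())

print(d("S"))   # 9, matching the hand calculation above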
Markov Decision Process (MDP) can be generalized to
Partially Observable Markov Decision Process (POMDP)
A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP. POMDPs enlarge the applicability of model-based MDPs.

A POMDP has two additional elements on top of the MDP elements:

• A set of observations
• A set of observation probabilities

Because the agent does not directly observe the environment's state, the agent must make decisions under uncertainty about the true environment state.

The details of POMDPs are outside the scope of this lecture.


To be continued in Part 2
