5.4-Reinforcement Learning-Part1-Introduction
Reinforcement Learning terminology
Environment The 'microworld' defined for the particular RL problem, including the agent. Designated as E.
Agent The entity that observes states and takes actions within the environment. Designated as A.
State A particular configuration of the agent within the environment. Designated as s.
Terminal states Defined end states for the particular RL problem. Designated as TS.
Action An agent selects an action a based upon the current state s and the policy π.
Policy A policy π is a mapping from states of the environment to the potential actions of an agent in those states. π(s) can be deterministic, depending only on s, or stochastic, π(s,a), depending also on a.
Transition probability function T(s'|s,a) specifies the probability that the environment will transition to state s' if the agent takes action a in state s. For a deterministic environment this can be written s' = T(s,a).
Episode Sometimes called an epoch. A sequence of states, actions and rewards, which ends in a terminal state.
Reward or Reward signal The reward R(s'|s,a) gives feedback from the environment on the effect of a single action a in state s leading to s'.
Discounted Reward When calculating the Return, the expected rewards for future steps can be weighted with a discount factor γ in the interval [0,1].
Return The accumulated rewards for an episode. Designated as G.
Value function The Value function V = V(s) is the estimation of the Value or Utility of s with respect to its average Return over all episodes possible under the current policy. It must continually be re-estimated for each action taken.
The state-value function v(s) for the policy π is given below. Note that the value of the terminal state (if any) is always zero.
v_π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s ]
Model of the environment A set of tuples T(s,a)=s' and R(s'|s,a), representing the possible state transitions and the rewards given.
Example 1 - the 4x3 world
[Figure: a 4x3 grid board with columns 1-4 and rows 1-3; the start position is (1,1), and (4,3) and (4,2) are the terminal positions marked +1 and -1.]
Reward: There are two terminal positions, (4,3) and (4,2). Reaching (4,3) gives a reward of +1. Reaching (4,2) gives a reward of -1. Reaching all other positions gives a reward of 0.
Actions: up, down, left, right, but restricted by the board configuration. The policy is deterministic in the sense that an action in a specific state can only lead to one other state.
Episode: e.g. a sequence of states, actions and rewards leading from (1,1) to (4,3), for instance via (1,2), (1,3), (2,3) and (3,3). Three example episodes:
(1,1)→(1,2)→(1,3)→(1,2)→(1,3)→(2,3)→(3,3)→(4,3) +1
(1,1)→(1,2)→(1,3)→(2,3)→(3,3)→(3,2)→(3,3)→(4,3) +1
(1,1)→(2,1)→(3,1)→(3,2)→(4,2) -1
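To make the example concrete, here is a minimal Python sketch of the 4x3 world. It assumes a plain grid with no blocked cells, deterministic moves, and that an action leading off the board leaves the agent in place; these details are assumptions for illustration, not taken from the slide.

# Minimal sketch of the 4x3 world under the assumptions stated above.
# States are (column, row) pairs with columns 1-4 and rows 1-3.
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TERMINAL = {(4, 3): +1, (4, 2): -1}   # terminal states and their rewards
START = (1, 1)

def step(state, action):
    """Apply one action; return (next_state, reward, done)."""
    dx, dy = ACTIONS[action]
    nxt = (state[0] + dx, state[1] + dy)
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):   # move would leave the board
        nxt = state
    return nxt, TERMINAL.get(nxt, 0), nxt in TERMINAL

def run_episode(action_sequence):
    """Follow a fixed action sequence from the start state; return the visited states and the Return G."""
    s, G, trace = START, 0, [START]
    for a in action_sequence:
        s, r, done = step(s, a)
        trace.append(s)
        G += r
        if done:
            break
    return trace, G

# The third example episode above: (1,1)->(2,1)->(3,1)->(3,2)->(4,2), Return -1.
print(run_episode(["right", "right", "up", "right"]))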
Reinforcement Learning modelled as Markov Decision Process (MDP)
A suitable model for the above intuitive scenario is a basic Markov Decision Process (MDP).
An MDP is typically defined by a 4-tuple (S, A, R, T) where:
• S is the set of states; s is the state of the environment (including the agent)
• A is the set of actions for each s that the agent can choose between, as defined by a policy π
• R(s'|s,a) is a function that returns the reward received for taking action a in state s leading to s'
• T(s'|s,a) is the transition probability function
[Figure: a state-transition graph of the MDP, with transition probabilities (e.g. 0.5, 0.33, 1.0) on the edges and the +1 and -1 terminal states.]
In one MDP scenario, there is a complete and exact model of the MDP, in the sense that T(s'|s,a) and R(s'|s,a) are fully defined.
However, in many cases T and R are not completely known, meaning that the model of the MDP is not complete.
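As an illustration of what a complete model means in practice, the sketch below stores T(s'|s,a) and R(s'|s,a) for a tiny hypothetical two-state MDP as Python dictionaries; all state names, action names and numbers are invented for illustration.

# Hypothetical two-state MDP with a fully specified model: the dictionaries
# below define T(s'|s,a) and R(s'|s,a) completely (the "complete model" case).
T = {  # (s, a) -> {s': probability}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.9, "s1": 0.1},
}
R = {  # (s, a, s') -> reward
    ("s0", "stay", "s0"): 0.0,
    ("s0", "go", "s0"): 0.0, ("s0", "go", "s1"): 1.0,
    ("s1", "stay", "s1"): 0.5,
    ("s1", "go", "s0"): 0.0, ("s1", "go", "s1"): 0.5,
}

def expected_reward(s, a):
    """Expected immediate reward of taking action a in state s under this model."""
    return sum(p * R[(s, a, s2)] for s2, p in T[(s, a)].items())

print(expected_reward("s0", "go"))   # 0.2*0.0 + 0.8*1.0 = 0.8

When T and R are not known, such tables are not available and the agent must learn from interaction with the environment instead.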
Dynamic Programming (DP) – motives behind the name
Like divide and conquer, DP simplifies a complicated problem by breaking it down into simpler sub-problems in a recursive manner and then combining the solutions to the sub-problems to form the total solution.
If a problem can be optimally solved by breaking it recursively into sub-problems and then forming the solution from optimal solutions to the sub-problems, then it is said to have optimal substructure.
Unlike divide and conquer, the sub-problems are not independent; sub-problems may share sub-sub-problems.
Typically, the utility value of a solution in a certain state is defined by a value function calculated recursively from the value functions for the remaining steps of an episode, weighted by the relevant probabilities and by a discount parameter that gives higher weight to closer reward contributions. Typically, the policy function can be calculated in a similar fashion.
The defining equation for finding an optimal value function is called the Bellman equation.
Bellman's principle of optimality and the Bellman equation
Dynamic programming algorithms have the ambition to optimize two functions: the value function V(s) and the policy π(s).
Bellman's principle of optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
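The Bellman equation itself is not written out above; in the notation of this section (rewards R(s'|s,a), transition probabilities T(s'|s,a) and discount factor γ), its standard form for the optimal state-value function V*, together with the corresponding greedy optimal policy π*, is:
V*(s) = max_a Σ_s' T(s'|s,a) [ R(s'|s,a) + γ V*(s') ]
π*(s) = argmax_a Σ_s' T(s'|s,a) [ R(s'|s,a) + γ V*(s') ]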
Value iteration
Initialise V arbitrarily
repeat
Improve V_{k+1} using the estimate of V_k
until convergence
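A compact Python sketch of value iteration for a generic finite MDP follows; the dictionary-based model format, the discount factor γ = 0.9 and the stopping threshold are choices made here for illustration rather than anything prescribed by the slide.

# Value iteration: repeatedly apply the Bellman optimality update
#   V_{k+1}(s) = max_a sum_{s'} T(s'|s,a) [ R(s'|s,a) + gamma * V_k(s') ]
# until the largest change over all states falls below a small threshold.
def value_iteration(states, A, T, R, gamma=0.9, theta=1e-6):
    # A: state -> list of available actions ([] for terminal states)
    # T: (s, a) -> {s': probability},  R: (s, a, s') -> reward
    V = {s: 0.0 for s in states}                # initialise V arbitrarily (here: zeros)
    while True:
        V_new, delta = {}, 0.0
        for s in states:
            if not A[s]:                        # terminal state: value stays 0
                V_new[s] = 0.0
            else:
                V_new[s] = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                    for a in A[s]
                )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:                       # convergence
            return V

# Tiny illustrative MDP (all names and numbers made up): "go" reaches the
# terminal state "end" with reward 1, "stay" loops with reward 0.
STATES = ["s0", "end"]
A = {"s0": ["go", "stay"], "end": []}
T = {("s0", "go"): {"end": 1.0}, ("s0", "stay"): {"s0": 1.0}}
R = {("s0", "go", "end"): 1.0, ("s0", "stay", "s0"): 0.0}
print(value_iteration(STATES, A, T, R))         # {'s0': 1.0, 'end': 0.0}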
Policy iteration
Initialise V and π arbitrarily
repeat
Evaluate V using π
Improve π using V
until convergence.
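In the same illustrative dictionary-based format (again an assumption made for this sketch, not a format from the slide), policy iteration can look like this:

# Policy iteration: alternate policy evaluation (compute V for the current
# policy pi) and policy improvement (make pi greedy w.r.t. V) until pi is stable.
def policy_iteration(states, A, T, R, gamma=0.9, theta=1e-6):
    pi = {s: (A[s][0] if A[s] else None) for s in states}    # arbitrary initial policy
    V = {s: 0.0 for s in states}                             # arbitrary initial values
    while True:
        # Policy evaluation: iterate V(s) = sum_s' T(s'|s,pi(s)) [R + gamma V(s')]
        while True:
            delta = 0.0
            for s in states:
                if pi[s] is None:
                    continue
                v = sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                        for s2, p in T[(s, pi[s])].items())
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                break
        # Policy improvement: choose the greedy action in every non-terminal state
        stable = True
        for s in states:
            if pi[s] is None:
                continue
            best = max(A[s], key=lambda a: sum(
                p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items()))
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:                                           # policy unchanged -> done
            return pi, V

# With the same tiny illustrative MDP as in the value iteration sketch:
STATES = ["s0", "end"]
A = {"s0": ["stay", "go"], "end": []}
T = {("s0", "go"): {"end": 1.0}, ("s0", "stay"): {"s0": 1.0}}
R = {("s0", "go", "end"): 1.0, ("s0", "stay", "s0"): 0.0}
print(policy_iteration(STATES, A, T, R)[0])                  # {'s0': 'go', 'end': None}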
Linear programming
Prioritized sweeping - use priority queue to update states with large potential for change
Simple example to illustrate Dynamic Programming: The shortest path in multi-stage graphs
The shortest path is (S, C, F, T) => 5+2+2 = 9. A greedy method gives (S, A, D, T) => 1+4+18 = 23.
Example with Dynamic programming approach
Forward approach
• d(S, T) = min{1+d(A, T), 2+d(B, T), 5+d(C, T)} = min{1+22, 2+18, 5+4} = 9.
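The same recursion, d(v,T) = min over edges (v,w) of { c(v,w) + d(w,T) } with d(T,T) = 0, is sketched in Python below with memoisation. Since the full weighted graph of the figure is not reproduced here, the edge list is an illustrative reconstruction consistent with the costs quoted above, not necessarily the original graph.

from functools import lru_cache

# Shortest path in a multi-stage (acyclic) graph by dynamic programming:
# d(v) = min over outgoing edges (v, w) of  c(v, w) + d(w),  with d(T) = 0.
# The edge weights are illustrative; they reproduce the quoted costs
# (d(A,T)=22, d(B,T)=18, d(C,T)=4) but the middle stage is assumed.
EDGES = {            # node -> {successor: edge cost}
    "S": {"A": 1, "B": 2, "C": 5},
    "A": {"D": 4},
    "B": {"E": 7},
    "C": {"F": 2},
    "D": {"T": 18},
    "E": {"T": 11},
    "F": {"T": 2},
    "T": {},
}

@lru_cache(maxsize=None)
def d(v):
    """Return (cost, path) of the cheapest route from v to the sink T; sub-results are memoised."""
    if v == "T":
        return 0, ("T",)
    return min((c + d(w)[0], (v,) + d(w)[1]) for w, c in EDGES[v].items())

print(d("S"))        # (9, ('S', 'C', 'F', 'T')) -- the greedy route via A costs 23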
Markov Decision Process (MDP) can be generalized to
Partially Observable Markov Decision Process (POMDP)
A POMDP models an agent's decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. POMDPs enlarge the applicability of model-based MDPs.
Because the agent does not directly observe the environment's state, the
agent must make decisions under uncertainty of the true environment
state.