
Reinforcement Learning
Amulya Viswambaran (202090007)
Kehkashan Fatima (202090202)
Sruthi Krishnan (202090333)

1
Machine Learning Techniques

• Supervised learning
• Unsupervised learning
• Reinforcement learning

2
3
Scenario – a man dropped off on an isolated island

4
Characteristics of RL

• There is no supervisor.
• Decision making is sequential.
• Feedback is not instantaneous; it is delayed at least until one target is completed.
• The agent's current actions determine the subsequent data it receives.
• Time plays a crucial role; decisions must be taken within a specified time.

5
Depending on the decision made, the feedback (reward) can be POSITIVE or NEGATIVE.
6
Key Features of RL
 The following parameters are used to attain a solution:

• Set of actions, A
• Set of states, S
• Reward, R
• Policy, π
• Value, V
7
Terms to be understood in RL
• Agent: the entity responsible for taking actions in the environment.
• Environment: the place through which the agent moves.
• Action: any of the possible moves that the agent can take.
• State: the current condition returned by the environment.
• Reward (R): an instant return from the environment used to appraise the last action.
• Policy (π): the approach that the agent uses to determine the next action based on the current state.
• Value (V): the expected long-term return with discount, as opposed to the short-term reward (R).
• Action Value (Q): similar to Value, except that it takes an extra parameter, the current action (A).
8
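To make these terms concrete, here is a minimal Python sketch of the agent-environment loop. The `SimpleEnv` class, its states 0..4, and its reward scheme are hypothetical, invented only to show where agent, environment, action, state, reward, and policy appear; they are not part of the slides.

```python
import random

class SimpleEnv:
    """Hypothetical environment: positions 0..4 on a line; reaching 4 ends the episode."""
    def __init__(self):
        self.state = 0                            # State: current condition of the environment

    def step(self, action):                       # Action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0      # Reward: instant feedback for the last action
        done = self.state == 4
        return self.state, reward, done

def policy(state):
    """Policy: maps the current state to the next action (here simply random)."""
    return random.choice([-1, +1])

env = SimpleEnv()                                 # Environment
state, done, total_reward = env.state, False, 0
while not done:                                   # the Agent acts in the environment
    action = policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
print("episode finished, total reward:", total_reward)
```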
Reward Maximization
(Basic Aim of RL)

• Discounting
• Exploration
• Exploitation (the exploration-exploitation trade-off is sketched below)

9
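As a small illustration of the exploration-exploitation trade-off, the sketch below uses ε-greedy action selection, a standard scheme that is not explicitly named on the slide: with probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. The action names and values are made up.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping each action to its estimated value in the current state."""
    if random.random() < epsilon:                 # explore: try a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)        # exploit: take the best-known action

print(epsilon_greedy({"left": 0.2, "right": 0.8}))  # usually "right", sometimes "left"
```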
Markov Decision Processes
• A mathematical approach for mapping out a solution in reinforcement learning.
• Markov decision processes formally describe an environment for RL in which the environment is fully observable, i.e. the current state completely characterizes the process.
• With a Markov Decision Process, an agent can arrive at an optimal policy for maximum reward over time.
10
Markov Property

“The future is independent of the past given the present”

• A state St is Markov if and only if

P [St+1 | St ] = P [St+1 | S1, ..., St ]

• The state captures all relevant information from the history

• Once the state is known, the history may be thrown away

• i.e. The state is a sufficient statistic of the future


11
State Transition Matrix
• For a Markov state s and successor state s′, the state transition probability is defined by

  Pss′ = P [St+1 = s′ | St = s]

• The state transition matrix P defines the transition probabilities from all states s to all successor states s′, where each row of the matrix sums to 1.

12
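A small illustration, not taken from the slides, of what a state transition matrix looks like in code: entry P[s, s′] holds Pss′, and every row sums to 1. The 3-state chain itself is hypothetical.

```python
import numpy as np

# Hypothetical 3-state Markov chain; states are indexed 0, 1, 2.
P = np.array([
    [0.5, 0.5, 0.0],   # from state 0: 50% stay, 50% move to state 1
    [0.0, 0.8, 0.2],   # from state 1: 80% stay, 20% move to state 2
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row of the matrix sums to 1
```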
Markov Process or Markov Chains

• A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.

• A Markov Process (or Markov Chain) is a tuple (S, P):

 S is a (finite) set of states
 P is a state transition probability matrix, Pss′ = P [St+1 = s′ | St = s]

• The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).
13
Example: Student Markov Chain

• Sample episodes for the Student Markov Chain starting from S1 = C1, each of the form S1, S2, ..., ST:

 C1 C2 C3 Pass Sleep
 C1 FB FB C1 C2 Sleep
 C1 C2 C3 Pub C2 C3 Pass Sleep
 C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

(Student example from David Silver's RL class.)
14
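Sample episodes like the ones above can be generated by repeatedly drawing the next state from the current state's transition probabilities. The transition probabilities below are recalled from David Silver's Student Markov Chain, so treat the exact numbers as an assumption; Sleep is the terminal state.

```python
import random

# Transition probabilities of the Student Markov Chain (numbers recalled from
# David Silver's lecture; treat them as an assumption, not ground truth).
P = {
    "C1":   [("C2", 0.5), ("FB", 0.5)],
    "C2":   [("C3", 0.8), ("Sleep", 0.2)],
    "C3":   [("Pass", 0.6), ("Pub", 0.4)],
    "Pass": [("Sleep", 1.0)],
    "Pub":  [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB":   [("C1", 0.1), ("FB", 0.9)],
}

def sample_episode(start="C1"):
    """Sample one episode S1, S2, ..., ST until the terminal Sleep state."""
    state, episode = start, [start]
    while state != "Sleep":
        successors, probs = zip(*P[state])
        state = random.choices(successors, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```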
Markov Reward Process

• A Markov reward process is a Markov chain with values.

• A Markov Reward Process is a tuple (S, P, R, γ):

 S is a finite set of states
 P is a state transition probability matrix, Pss′ = P [St+1 = s′ | St = s]
 R is a reward function, Rs = E [Rt+1 | St = s]
 γ is a discount factor, γ ∈ [0, 1]

15
Return
• The return Gt is the total discounted reward from time-step t:

  Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...

• The discount γ ∈ [0, 1] is the present value of future rewards.
• The value of receiving reward R after k + 1 time-steps is γᵏR.
• This values immediate reward above delayed reward:
 γ close to 0 leads to "myopic" evaluation
 γ close to 1 leads to "far-sighted" evaluation

16
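A one-function sketch of the return: given the sequence of rewards Rt+1, Rt+2, ..., each reward received after k further steps is discounted by γᵏ. The reward list used in the example calls is made up.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 2], gamma=0.1))   # myopic: 1.02, dominated by the first reward
print(discounted_return([1, 0, 2], gamma=1.0))   # far-sighted: 3, all rewards count equally
```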
Example: Student MRP

17
Example: Student MRP Returns
• Sample returns for the Student MRP, starting from S1 = C1 with γ = 1/2:

  G1 = R2 + γR3 + ... + γᵀ⁻²RT

 C1 C2 C3 Pass Sleep:  v1 = −2 − 2∗(1/2) − 2∗(1/4) + 10∗(1/8) = −2.25
 C1 FB FB C1 C2 Sleep:  v1 = −2 − 1∗(1/2) − 1∗(1/4) − 2∗(1/8) − 2∗(1/16) = −3.125
 C1 C2 C3 Pub C2 C3 Pass Sleep:  v1 = −2 − 2∗(1/2) − 2∗(1/4) + 1∗(1/8) − 2∗(1/16) − ... = −3.41
 C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:  v1 = −2 − 1∗(1/2) − 1∗(1/4) − 2∗(1/8) − 2∗(1/16) − ... = −3.20

18
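These sample returns can be reproduced with a few lines of Python. The per-state rewards below (−2 for each class, −1 for Facebook, +1 for Pub, +10 for Pass, 0 for Sleep) are read off from the calculations above and from David Silver's example, so treat them as recalled rather than authoritative.

```python
# Per-state rewards of the Student MRP (as used in the calculations above).
R = {"C1": -2, "C2": -2, "C3": -2, "Pub": 1, "FB": -1, "Pass": 10, "Sleep": 0}

def sample_return(episode, gamma=0.5):
    """G1 = R2 + gamma*R3 + ... for an episode starting in S1 (terminal Sleep gives no reward)."""
    return sum((gamma ** k) * R[s] for k, s in enumerate(episode[:-1]))

print(sample_return(["C1", "C2", "C3", "Pass", "Sleep"]))       # -2.25
print(sample_return(["C1", "FB", "FB", "C1", "C2", "Sleep"]))   # -3.125
```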
Value Function and Policy function

• Value Function determines how good it is for the agent to be in a particular state

• The value function v(s) gives the long-term value of state s

• The state value function v(s) of an MRP is the expected return starting from state s

v(s) = E [Gt | St = s]

• A policy defines what actions to perform in a particular state s.

• A policy is a simple function, that defines a probability distribution over Actions (a∈ A)


for each state (s ∈ S)

19
20
Bellman Equation for MRPs

• The agent tries to obtain the largest expected sum of rewards from every state it lands in.

• We try to find the optimal value function, i.e. the maximum sum of cumulative rewards.

• The value function can be decomposed into two parts:

 the immediate reward Rt+1
 the discounted value of the successor state, γ v(St+1)

  v(s) = E [Gt | St = s]
       = E [Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
       = E [Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
       = E [Rt+1 + γGt+1 | St = s]
       = E [Rt+1 + γ v(St+1) | St = s]
21
• That gives us the Bellman equation for MRPs:

  v(s) = Rs + γ Σs′∈S Pss′ v(s′)

• So, for each state in the state space, the Bellman equation gives us the value of that state.
22
Example: Bellman Equation for Student MRP
v(s) for γ =1

23
Solving the Bellman Equation
• The Bellman equation is a linear equation.
• It can be solved directly (see the sketch below):

  v = R + γPv  ⇒  v = (I − γP)⁻¹R

• The computational complexity is O(n³) for n states.
• A direct solution is only possible for small MRPs.
• There are many iterative methods for large MRPs, e.g.
 dynamic programming
 Monte-Carlo evaluation
 temporal-difference learning
24
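A minimal sketch of the direct solution in matrix form, v = (I − γP)⁻¹R, using the hypothetical 3-state chain from the transition-matrix example above (the reward vector is also made up). The O(n³) cost comes from the linear solve.

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P and per-state reward vector R.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)   # O(n^3) for n states
print(v)                                        # value of each state
```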
Dynamic Programming
• Objectives of the next slides:

 Give an overview of a collection of classical solution methods for MDPs known as dynamic programming (DP).
 Show how DP can be used to compute value functions and, hence, optimal policies.
 Discuss the efficiency and utility of DP.
25
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy π:

  Vπ(s) = Eπ [Gt | St = s]

• The value of taking action a in state s under policy π is the expected return starting from that state, taking that action, and thereafter following π:

  Qπ(s, a) = Eπ [Gt | St = s, At = a]
26
Policy Evaluation
• Policy evaluation: for a given policy π, compute the state-value function Vπ.

• Recall the Bellman equation for Vπ:

  Vπ(s) = Σa π(s, a) Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• This is a system of |S| simultaneous linear equations.
27
Iterative Methods

• A sweep consists of applying a backup operation to each state.


• A full policy evaluation backup:

28
[Flow chart: START → input π and threshold θ → initialise V → compute the backup for all s, tracking Δ ← max change in V → if Δ ≥ θ, sweep again; if Δ < θ, output V → STOP]

29
Flow chart for Iterative Policy Evaluation
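A hedged sketch of the loop in the flow chart, written for a generic MDP stored in nested dictionaries. The data layout (P[s][a] as a list of (next_state, probability) pairs, R[s][a] as the expected reward, policy[s][a] as an action probability) is an assumption for illustration, not something defined on the slides.

```python
def policy_evaluation(states, actions, P, R, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep until the largest change falls below theta."""
    V = {s: 0.0 for s in states}                    # initialise V arbitrarily (here: zeros)
    while True:
        delta = 0.0
        for s in states:                            # one sweep over the state space
            v_new = sum(policy[s][a] *
                        (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                        for a in actions)           # full policy evaluation backup
            delta = max(delta, abs(v_new - V[s]))   # Δ ← max change seen this sweep
            V[s] = v_new
        if delta < theta:                           # Δ < θ: output V ≈ Vπ
            return V
```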
Policy Improvement
• Suppose we have computed Vπ for a deterministic policy π. For a given state s, would it be better to take some action a ≠ π(s)?

• The value of doing a in state s is:

  Qπ(s, a) = Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• It is better to switch to action a for state s if and only if Qπ(s, a) > Vπ(s).
30
The Policy Improvement Theorem
• Let π and π′ be two policies such that Qπ(s, π′(s)) ≥ Vπ(s) for all s ∈ S.

• Then π′ is as good as, or better than, π, i.e. Vπ′(s) ≥ Vπ(s) for all s ∈ S.
31
Policy Improvement Cont.
• Do this for all states to get a new policy π′ that is greedy with respect to Vπ:

  π′(s) = argmaxa Qπ(s, a) = argmaxa Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• Then Vπ′ ≥ Vπ.

• What if Vπ′ = Vπ? Then for all s,

  Vπ′(s) = maxa Σs′ Pᵃss′ [Rᵃss′ + γ Vπ′(s′)]

• But this is the Bellman Optimality Equation.

• So Vπ′ = V*, and both π and π′ are optimal policies.
32
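Using the same hypothetical data layout as the policy evaluation sketch, greedy policy improvement picks, in each state, the action with the largest one-step look-ahead value Qπ(s, a):

```python
def policy_improvement(states, actions, P, R, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    def q(s, a):                                   # one-step look-ahead value of doing a in s
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```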
Policy Iteration

33
[Flowchart: START → 1. Initialization → 2. Policy Evaluation → 3. Policy Improvement → if the policy is not stable (some state has b ≠ π(s)), return to Policy Evaluation; if it is stable, STOP]

34
Flowchart for Policy Iteration
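Combining the two previous sketches gives the loop in the flowchart: evaluate the current policy, improve it greedily, and stop once the policy no longer changes. It assumes the hypothetical `policy_evaluation` and `policy_improvement` helpers defined above.

```python
def policy_iteration(states, actions, P, R, gamma=0.9):
    greedy = {s: actions[0] for s in states}        # 1. Initialization: arbitrary deterministic policy
    while True:
        # Represent the deterministic policy as action probabilities for the evaluation step.
        policy = {s: {a: 1.0 if a == greedy[s] else 0.0 for a in actions} for s in states}
        V = policy_evaluation(states, actions, P, R, policy, gamma)        # 2. Policy Evaluation
        new_greedy = policy_improvement(states, actions, P, R, V, gamma)   # 3. Policy Improvement
        if new_greedy == greedy:                    # policy stable: no state changed its action
            return greedy, V
        greedy = new_greedy
```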
Value Iteration
• A drawback of policy iteration is that each iteration involves a full policy evaluation, which may itself require multiple sweeps.
• Convergence of Vπ occurs only in the limit, so in principle we have to wait until convergence.
• As we have seen, the optimal policy is often obtained long before Vπ has converged.
• Fortunately, the policy evaluation step can be truncated in several ways without losing the convergence guarantees of policy iteration.
• Value iteration stops policy evaluation after just one sweep.

35
Value Iteration
• Recall the full policy evaluation backup:

  Vk+1(s) ← Σa π(s, a) Σs′ Pᵃss′ [Rᵃss′ + γ Vk(s′)]

• Here is the full value iteration backup:

  Vk+1(s) ← maxa Σs′ Pᵃss′ [Rᵃss′ + γ Vk(s′)]

• This is a combination of policy improvement and truncated policy evaluation (see the sketch after the flow chart).

36
[Flowchart: START → input θ → initialise V → apply the max-backup to V(s) for all s, tracking Δ ← max change → if Δ ≥ θ, sweep again; if Δ < θ, output the greedy policy and V → STOP]

37
Flow chart for Value Iteration
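A sketch of value iteration for the same hypothetical MDP layout: each sweep applies the max-backup directly, which amounts to truncating policy evaluation to a single sweep and improving greedily. It reuses the `policy_improvement` helper from above to read out the final greedy policy.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                    # initialise V arbitrarily
    while True:
        delta = 0.0
        for s in states:
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                        for a in actions)           # full value iteration backup (with a max)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            # Output a deterministic policy that is greedy with respect to the final V.
            return policy_improvement(states, actions, P, R, V, gamma), V
```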
Generalized Policy Iteration
• Generalized Policy Iteration (GPI) refers to any interaction of policy evaluation and policy improvement, independent of their granularity.

38
Efficiency of DP
• Finding an optimal policy with DP is polynomial in the number of states.

• BUT the number of states is often astronomical.

• In practice, classical DP can be applied to problems with a few million states.

• It is surprisingly easy to come up with MDPs for which DP methods are not practical.

39
Summary

• Policy evaluation: backups without a max
• Policy improvement: form a greedy policy, if only locally
• Policy iteration: alternate the above two processes
• Value iteration: backups with a max
• Full backups (to be contrasted later with sample backups)
• Generalized Policy Iteration (GPI)
• Bootstrapping: updating estimates based on other estimates

40
