
Reinforcement Learning
Amulya Viswambaran (202090007)
Kehkashan Fatima (202090202)
Sruthi Krishnan (202090333)

1
Machine Learning Techniques

• Supervised learning
• Unsupervised learning
• Reinforcement learning

2
3
Scenario – a man dropped off on an isolated island

4
Characteristics of RL

• There is no supervisor.
• Decision making is sequential.
• Feedback is not instantaneous; it is delayed at least until one target is completed.
• The agent's current actions determine the subsequent data it receives.
• Time plays a crucial role; decisions must be taken within a specified time.

5
Depending on the decision made, the feedback (reward) can be POSITIVE or NEGATIVE.
6
Key Features of RL
 The following parameters are used to attain a solution:

• Set of actions, A
• Set of states, S
• Reward, R
• Policy, π
• Value, V
7
Terms to be understood in RL
• Agent: the entity responsible for taking actions in the environment.
• Environment: the place through which the agent moves.
• Action: any of the possible moves that the agent can take.
• State: the current condition returned by the environment.
• Reward (R): an instant return from the environment used to appraise the last action.
• Policy (π): the approach that the agent uses to determine the next action based on the current state.
• Value (V): the expected long-term return with discount, as opposed to the short-term reward (R).
• Action Value (Q): similar to Value, except that it takes an extra parameter, the current action (A).
8
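To make these terms concrete, here is a minimal Python sketch of the agent-environment loop. The `SimpleEnv` class, its states 0..4, and its reward scheme are hypothetical, invented only to show where agent, environment, action, state, reward, and policy appear; they are not part of the slides.

```python
import random

class SimpleEnv:
    """Hypothetical environment: positions 0..4 on a line; reaching 4 ends the episode."""
    def __init__(self):
        self.state = 0                            # State: current condition of the environment

    def step(self, action):                       # Action: -1 (left) or +1 (right)
        self.state = max(0, min(4, self.state + action))
        reward = 1 if self.state == 4 else 0      # Reward: instant feedback for the last action
        done = self.state == 4
        return self.state, reward, done

def policy(state):
    """Policy: maps the current state to the next action (here simply random)."""
    return random.choice([-1, +1])

env = SimpleEnv()                                 # Environment
state, done, total_reward = env.state, False, 0
while not done:                                   # the Agent acts in the environment
    action = policy(state)
    state, reward, done = env.step(action)
    total_reward += reward
print("episode finished, total reward:", total_reward)
```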
Reward Maximization
(Basic Aim of RL)

• Discounting
• Exploration
• Exploitation (the exploration-exploitation trade-off is sketched below)

9
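As a small illustration of the exploration-exploitation trade-off, the sketch below uses ε-greedy action selection, a standard scheme that is not explicitly named on the slide: with probability ε the agent explores a random action, otherwise it exploits the action with the highest estimated value. The action names and values are made up.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: dict mapping each action to its estimated value in the current state."""
    if random.random() < epsilon:                 # explore: try a random action
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)        # exploit: take the best-known action

print(epsilon_greedy({"left": 0.2, "right": 0.8}))  # usually "right", sometimes "left"
```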
Markov Decision Processes
• A mathematical approach for mapping out a solution in reinforcement learning.
• Markov decision processes formally describe an environment for RL in which the environment is fully observable, i.e. the current state completely characterizes the process.
• With a Markov Decision Process, an agent can arrive at an optimal policy for maximum reward over time.
10
Markov Property

“The future is independent of the past given the present”

• A state St is Markov if and only if

P [St+1 | St ] = P [St+1 | S1, ..., St ]

• The state captures all relevant information from the history

• Once the state is known, the history may be thrown away

• i.e. The state is a sufficient statistic of the future


11
State Transition Matrix
• For a Markov state s and successor state s′, the state transition probability is defined by

  Pss′ = P [St+1 = s′ | St = s]

• The state transition matrix P defines the transition probabilities from all states s to all successor states s′, where each row of the matrix sums to 1.

12
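A small illustration, not taken from the slides, of what a state transition matrix looks like in code: entry P[s, s′] holds Pss′, and every row sums to 1. The 3-state chain itself is hypothetical.

```python
import numpy as np

# Hypothetical 3-state Markov chain; states are indexed 0, 1, 2.
P = np.array([
    [0.5, 0.5, 0.0],   # from state 0: 50% stay, 50% move to state 1
    [0.0, 0.8, 0.2],   # from state 1: 80% stay, 20% move to state 2
    [0.0, 0.0, 1.0],   # state 2 is absorbing
])
assert np.allclose(P.sum(axis=1), 1.0)   # each row of the matrix sums to 1
```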
Markov Process or Markov Chains

• A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.

• A Markov Process (or Markov Chain) is a tuple (S, P):

 S is a (finite) set of states
 P is a state transition probability matrix, Pss′ = P [St+1 = s′ | St = s]

• The dynamics of the environment can be fully defined using the states (S) and the transition probability matrix (P).
13
Example: Student Markov Chain

• Sample episodes for the Student Markov Chain starting from S1 = C1, each of the form S1, S2, ..., ST:

 C1 C2 C3 Pass Sleep
 C1 FB FB C1 C2 Sleep
 C1 C2 C3 Pub C2 C3 Pass Sleep
 C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

(Student example from David Silver's RL class.)
14
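Sample episodes like the ones above can be generated by repeatedly drawing the next state from the current state's transition probabilities. The transition probabilities below are recalled from David Silver's Student Markov Chain, so treat the exact numbers as an assumption; Sleep is the terminal state.

```python
import random

# Transition probabilities of the Student Markov Chain (numbers recalled from
# David Silver's lecture; treat them as an assumption, not ground truth).
P = {
    "C1":   [("C2", 0.5), ("FB", 0.5)],
    "C2":   [("C3", 0.8), ("Sleep", 0.2)],
    "C3":   [("Pass", 0.6), ("Pub", 0.4)],
    "Pass": [("Sleep", 1.0)],
    "Pub":  [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB":   [("C1", 0.1), ("FB", 0.9)],
}

def sample_episode(start="C1"):
    """Sample one episode S1, S2, ..., ST until the terminal Sleep state."""
    state, episode = start, [start]
    while state != "Sleep":
        successors, probs = zip(*P[state])
        state = random.choices(successors, weights=probs)[0]
        episode.append(state)
    return episode

print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```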
Markov Reward Process

• A Markov reward process is a Markov chain with values.

• A Markov Reward Process is a tuple (S, P, R, γ):

 S is a finite set of states
 P is a state transition probability matrix, Pss′ = P [St+1 = s′ | St = s]
 R is a reward function, Rs = E [Rt+1 | St = s]
 γ is a discount factor, γ ∈ [0, 1]

15
Return
• The return Gt is the total discounted reward from time-step t:

  Gt = Rt+1 + γRt+2 + γ²Rt+3 + ...

• The discount γ ∈ [0, 1] is the present value of future rewards.
• The value of receiving reward R after k + 1 time-steps is γᵏR.
• This values immediate reward above delayed reward:
 γ close to 0 leads to "myopic" evaluation
 γ close to 1 leads to "far-sighted" evaluation

16
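A one-function sketch of the return: given the sequence of rewards Rt+1, Rt+2, ..., each reward received after k further steps is discounted by γᵏ. The reward list used in the example calls is made up.

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 0, 2], gamma=0.1))   # myopic: 1.02, dominated by the first reward
print(discounted_return([1, 0, 2], gamma=1.0))   # far-sighted: 3, all rewards count equally
```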
Example: Student MRP

17
Example: Student MRP Returns
• Sample returns for the Student MRP, starting from S1 = C1 with γ = 1/2:

  G1 = R2 + γR3 + ... + γᵀ⁻²RT

 C1 C2 C3 Pass Sleep:  v1 = −2 − 2∗(1/2) − 2∗(1/4) + 10∗(1/8) = −2.25
 C1 FB FB C1 C2 Sleep:  v1 = −2 − 1∗(1/2) − 1∗(1/4) − 2∗(1/8) − 2∗(1/16) = −3.125
 C1 C2 C3 Pub C2 C3 Pass Sleep:  v1 = −2 − 2∗(1/2) − 2∗(1/4) + 1∗(1/8) − 2∗(1/16) − ... = −3.41
 C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:  v1 = −2 − 1∗(1/2) − 1∗(1/4) − 2∗(1/8) − 2∗(1/16) − ... = −3.20

18
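These sample returns can be reproduced with a few lines of Python. The per-state rewards below (−2 for each class, −1 for Facebook, +1 for Pub, +10 for Pass, 0 for Sleep) are read off from the calculations above and from David Silver's example, so treat them as recalled rather than authoritative.

```python
# Per-state rewards of the Student MRP (as used in the calculations above).
R = {"C1": -2, "C2": -2, "C3": -2, "Pub": 1, "FB": -1, "Pass": 10, "Sleep": 0}

def sample_return(episode, gamma=0.5):
    """G1 = R2 + gamma*R3 + ... for an episode starting in S1 (terminal Sleep gives no reward)."""
    return sum((gamma ** k) * R[s] for k, s in enumerate(episode[:-1]))

print(sample_return(["C1", "C2", "C3", "Pass", "Sleep"]))       # -2.25
print(sample_return(["C1", "FB", "FB", "C1", "C2", "Sleep"]))   # -3.125
```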
Value Function and Policy function

• Value Function determines how good it is for the agent to be in a particular state

• The value function v(s) gives the long-term value of state s

• The state value function v(s) of an MRP is the expected return starting from state s

v(s) = E [Gt | St = s]

• A policy defines what actions to perform in a particular state s.

• A policy is a simple function, that defines a probability distribution over Actions (a∈ A)


for each state (s ∈ S)

19
20
Bellman Equation for MRPs

• The agent tries to obtain the largest expected sum of rewards from every state it lands in.

• We try to find the optimal value function, i.e. the maximum sum of cumulative rewards.

• The value function can be decomposed into two parts:

 the immediate reward Rt+1
 the discounted value of the successor state, γ v(St+1)

  v(s) = E [Gt | St = s]
       = E [Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
       = E [Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
       = E [Rt+1 + γGt+1 | St = s]
       = E [Rt+1 + γ v(St+1) | St = s]
21
• That gives us the Bellman equation for MRPs:

  v(s) = Rs + γ Σs′∈S Pss′ v(s′)

• So, for each state in the state space, the Bellman equation gives us the value of that state.
22
Example: Bellman Equation for Student MRP
v(s) for γ =1

23
Solving the Bellman Equation
• The Bellman equation is a linear equation.
• It can be solved directly (see the sketch below):

  v = R + γPv  ⇒  v = (I − γP)⁻¹R

• The computational complexity is O(n³) for n states.
• A direct solution is only possible for small MRPs.
• There are many iterative methods for large MRPs, e.g.
 dynamic programming
 Monte-Carlo evaluation
 temporal-difference learning
24
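A minimal sketch of the direct solution in matrix form, v = (I − γP)⁻¹R, using the hypothetical 3-state chain from the transition-matrix example above (the reward vector is also made up). The O(n³) cost comes from the linear solve.

```python
import numpy as np

# Hypothetical 3-state MRP: transition matrix P and per-state reward vector R.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.8, 0.2],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  =>  (I - gamma*P) v = R
v = np.linalg.solve(np.eye(3) - gamma * P, R)   # O(n^3) for n states
print(v)                                        # value of each state
```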
Dynamic Programming
• Objectives of the next slides:

 Give an overview of a collection of classical solution methods for MDPs known as dynamic programming (DP).
 Show how DP can be used to compute value functions and, hence, optimal policies.
 Discuss the efficiency and utility of DP.
25
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent's policy π:

  Vπ(s) = Eπ [Gt | St = s]

• The value of taking action a in state s under policy π is the expected return starting from that state, taking that action, and thereafter following π:

  Qπ(s, a) = Eπ [Gt | St = s, At = a]
26
Policy Evaluation
• Policy evaluation: for a given policy π, compute the state-value function Vπ.

• Recall the Bellman equation for Vπ:

  Vπ(s) = Σa π(s, a) Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• This is a system of |S| simultaneous linear equations.
27
Iterative Methods

• A sweep consists of applying a backup operation to each state.


• A full policy evaluation backup:

28
[Flow chart: START → input π and threshold θ → initialise V → compute the backup for all s, tracking Δ ← max change in V → if Δ ≥ θ, sweep again; if Δ < θ, output V → STOP]

29
Flow chart for Iterative Policy Evaluation
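A hedged sketch of the loop in the flow chart, written for a generic MDP stored in nested dictionaries. The data layout (P[s][a] as a list of (next_state, probability) pairs, R[s][a] as the expected reward, policy[s][a] as an action probability) is an assumption for illustration, not something defined on the slides.

```python
def policy_evaluation(states, actions, P, R, policy, gamma=0.9, theta=1e-6):
    """Iterative policy evaluation: sweep until the largest change falls below theta."""
    V = {s: 0.0 for s in states}                    # initialise V arbitrarily (here: zeros)
    while True:
        delta = 0.0
        for s in states:                            # one sweep over the state space
            v_new = sum(policy[s][a] *
                        (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                        for a in actions)           # full policy evaluation backup
            delta = max(delta, abs(v_new - V[s]))   # Δ ← max change seen this sweep
            V[s] = v_new
        if delta < theta:                           # Δ < θ: output V ≈ Vπ
            return V
```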
Policy Improvement
• Suppose we have computed Vπ for a deterministic policy π. For a given state s, would it be better to take some action a ≠ π(s)?

• The value of doing a in state s is:

  Qπ(s, a) = Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• It is better to switch to action a for state s if and only if Qπ(s, a) > Vπ(s).
30
The Policy Improvement Theorem
• Let π and π′ be two policies such that Qπ(s, π′(s)) ≥ Vπ(s) for all s ∈ S.

• Then π′ is as good as, or better than, π, i.e. Vπ′(s) ≥ Vπ(s) for all s ∈ S.
31
Policy Improvement Cont.
• Do this for all states to get a new policy π′ that is greedy with respect to Vπ:

  π′(s) = argmaxa Qπ(s, a) = argmaxa Σs′ Pᵃss′ [Rᵃss′ + γ Vπ(s′)]

• Then Vπ′ ≥ Vπ.

• What if Vπ′ = Vπ? Then for all s,

  Vπ′(s) = maxa Σs′ Pᵃss′ [Rᵃss′ + γ Vπ′(s′)]

• But this is the Bellman Optimality Equation.

• So Vπ′ = V*, and both π and π′ are optimal policies.
32
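Using the same hypothetical data layout as the policy evaluation sketch, greedy policy improvement picks, in each state, the action with the largest one-step look-ahead value Qπ(s, a):

```python
def policy_improvement(states, actions, P, R, V, gamma=0.9):
    """Return a deterministic policy that is greedy with respect to V."""
    def q(s, a):                                   # one-step look-ahead value of doing a in s
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
    return {s: max(actions, key=lambda a: q(s, a)) for s in states}
```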
Policy Iteration

33
[Flowchart: START → 1. Initialization → 2. Policy Evaluation → 3. Policy Improvement → if the policy is not stable (some state has b ≠ π(s)), return to Policy Evaluation; if it is stable, STOP]

34
Flowchart for Policy Iteration
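Combining the two previous sketches gives the loop in the flowchart: evaluate the current policy, improve it greedily, and stop once the policy no longer changes. It assumes the hypothetical `policy_evaluation` and `policy_improvement` helpers defined above.

```python
def policy_iteration(states, actions, P, R, gamma=0.9):
    greedy = {s: actions[0] for s in states}        # 1. Initialization: arbitrary deterministic policy
    while True:
        # Represent the deterministic policy as action probabilities for the evaluation step.
        policy = {s: {a: 1.0 if a == greedy[s] else 0.0 for a in actions} for s in states}
        V = policy_evaluation(states, actions, P, R, policy, gamma)        # 2. Policy Evaluation
        new_greedy = policy_improvement(states, actions, P, R, V, gamma)   # 3. Policy Improvement
        if new_greedy == greedy:                    # policy stable: no state changed its action
            return greedy, V
        greedy = new_greedy
```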
Value Iteration
• A drawback of policy iteration is that each iteration involves a full policy evaluation, which may itself require multiple sweeps.
• Convergence of Vπ occurs only in the limit, so in principle we have to wait until convergence.
• As we have seen, the optimal policy is often obtained long before Vπ has converged.
• Fortunately, the policy evaluation step can be truncated in several ways without losing the convergence guarantees of policy iteration.
• Value iteration stops policy evaluation after just one sweep.

35
Value Iteration
• Recall the full policy evaluation backup:

  Vk+1(s) ← Σa π(s, a) Σs′ Pᵃss′ [Rᵃss′ + γ Vk(s′)]

• Here is the full value iteration backup:

  Vk+1(s) ← maxa Σs′ Pᵃss′ [Rᵃss′ + γ Vk(s′)]

• This is a combination of policy improvement and truncated policy evaluation (see the sketch after the flow chart).

36
[Flowchart: START → input θ → initialise V → apply the max-backup to V(s) for all s, tracking Δ ← max change → if Δ ≥ θ, sweep again; if Δ < θ, output the greedy policy and V → STOP]

37
Flow chart for Value Iteration
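A sketch of value iteration for the same hypothetical MDP layout: each sweep applies the max-backup directly, which amounts to truncating policy evaluation to a single sweep and improving greedily. It reuses the `policy_improvement` helper from above to read out the final greedy policy.

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in states}                    # initialise V arbitrarily
    while True:
        delta = 0.0
        for s in states:
            v_new = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                        for a in actions)           # full value iteration backup (with a max)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            # Output a deterministic policy that is greedy with respect to the final V.
            return policy_improvement(states, actions, P, R, V, gamma), V
```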
Generalized Policy Iteration
• Generalized Policy Iteration (GPI) refers to any interaction of policy evaluation and policy improvement, independent of their granularity.

38
Efficiency of DP
• Finding an optimal policy with DP is polynomial in the number of states.

• BUT the number of states is often astronomical.

• In practice, classical DP can be applied to problems with a few million states.

• It is surprisingly easy to come up with MDPs for which DP methods are not practical.

39
Summary

• Policy evaluation: backups without a max
• Policy improvement: form a greedy policy, if only locally
• Policy iteration: alternate the above two processes
• Value iteration: backups with a max
• Full backups (to be contrasted later with sample backups)
• Generalized Policy Iteration (GPI)
• Bootstrapping: updating estimates based on other estimates

40
