Reinforcement Learning
Amulya Viswambaran (202090007)
Kehkashan Fatima (202090202)
Sruthi Krishnan (202090333)
Machine Learning techniques
• Supervised learning
• Unsupervised learning
• Reinforcement learning
Scenario – A man dropped off on an isolated island
Characteristics of RL
• No supervisor.
• Decision making is sequential.
• Feedback is not instantaneous; it is delayed, at least until one target is completed.
• The agent’s current actions determine the subsequent data it receives.
• Time plays a crucial role; a decision must be taken within a specified time.
Depending on the decision made, the feedback received is either POSITIVE or NEGATIVE.
Key Features of RL
The following parameters are used to attain a solution:
• Set of actions, A
• Set of states, S
• Reward, R
• Policy, π
• Value, V
Terms to be understood in RL
• Agent: responsible for taking actions in the environment.
• Environment: the place through which the agent moves.
• Action (A): any of the possible moves the agent can take.
• State (S): the current condition returned by the environment.
• Reward (R): an instant return from the environment that appraises the last action.
• Policy (π): the approach the agent uses to determine the next action based on the current state.
• Value (V): the expected long-term return with discount, as opposed to the short-term reward R.
• Action value (Q): similar to Value, except that it takes an extra parameter, the current action A (see the interaction-loop sketch below).
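To make these terms concrete, here is a minimal sketch of the agent–environment interaction loop in Python; the `SimpleEnv` class, its four-state line world, and the random policy are hypothetical illustrations, not part of any library or of the material above.

```python
import random

class SimpleEnv:
    """A tiny hypothetical environment: states 0..3 on a line, reward for reaching state 3."""
    def __init__(self):
        self.state = 0                                   # State (S): current condition

    def step(self, action):
        # Action moves the agent left (-1) or right (+1), clipped to the state range 0..3.
        self.state = max(0, min(3, self.state + action))
        reward = 1.0 if self.state == 3 else 0.0         # Reward (R): appraises the last action
        done = self.state == 3
        return self.state, reward, done

def random_policy(state):
    """Policy (pi): here it ignores the state and picks an action from A = {-1, +1} at random."""
    return random.choice([-1, +1])

env = SimpleEnv()                                        # Environment
state, total_reward, done = env.state, 0.0, False
while not done:
    action = random_policy(state)                        # Agent takes an action
    state, reward, done = env.step(action)               # Environment returns next state and reward
    total_reward += reward
print("Reward collected:", total_reward)
```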
Reward Maximization (Basic Aim of RL)
• Discounting
• Exploration
• Exploitation (see the ε-greedy sketch below)
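As a sketch of the exploration–exploitation trade-off, the ε-greedy rule below explores with probability ε and otherwise exploits the action with the highest estimated value; the action-value numbers and ε = 0.1 are made-up assumptions.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index: explore with probability epsilon, otherwise exploit greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                    # exploration: try a random action
    return max(range(len(q_values)), key=lambda a: q_values[a])   # exploitation: best known action

# Hypothetical action-value estimates for a single state
q_values = [0.2, 0.5, 0.1]
print("Chosen action:", epsilon_greedy(q_values, epsilon=0.1))
```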
Markov Decision Processes
• A mathematical approach for mapping a solution in Reinforcement Learning.
• Markov decision processes formally describe an environment for RL in which the environment is fully observable, i.e. the current state completely characterizes the process.
• With a Markov Decision Process, an agent can arrive at an optimal policy for maximum rewards over time (a toy MDP is written out below).
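An MDP is commonly written as the tuple (S, A, P, R, γ). Below is a minimal, hypothetical two-state MDP encoded as plain Python dictionaries; the state names, action names, and numbers are invented purely for illustration (the same toy MDP is reused in later sketches).

```python
# A toy MDP written as the tuple (S, A, P, R, gamma); all names and numbers are illustrative.
states = ["sunny", "rainy"]                  # S: set of states
actions = ["walk", "stay"]                   # A: set of actions

# P[s][a][s'] = Pr(S_t+1 = s' | S_t = s, A_t = a): state transition probabilities
P = {
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "stay": {"sunny": 1.0, "rainy": 0.0}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}, "stay": {"sunny": 0.0, "rainy": 1.0}},
}

# R[s][a]: expected immediate reward for taking action a in state s
R = {
    "sunny": {"walk": 1.0, "stay": 0.0},
    "rainy": {"walk": -1.0, "stay": 0.0},
}

gamma = 0.9                                  # discount factor, gamma in [0, 1]
```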
Markov Property
• A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]: the future is independent of the past given the present.
Markov Process or Markov Chains
• P is a state transition probability matrix, Pss’ = P[St+1 = s’ | St = s]
• R is a reward function, Rs = E[Rt+1 | St = s]
• γ is a discount factor, γ ∈ [0, 1]
Return
• The return Gt is the total discounted reward from time-step t:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σk≥0 γᵏRt+k+1
Example: Student MRP
Example: Student MRP Returns
• Sample returns for Student MRP:
Starting from S1 = C1 with γ = 1/2:
G1 = R2 + γR3 + ... + γᵀ⁻²RT
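A quick numeric sketch of such a return: the sampled reward sequence below is an assumption chosen for illustration, but the calculation is the formula above with γ = 1/2.

```python
def discounted_return(rewards, gamma=0.5):
    """G_1 = R_2 + gamma*R_3 + gamma^2*R_4 + ... for one sampled episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Hypothetical sampled rewards R_2, R_3, R_4, R_5 from a single episode
rewards = [-2, -2, -2, 10]
print(discounted_return(rewards, gamma=0.5))   # -2 - 1 - 0.5 + 1.25 = -2.25
```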
Value Function and Policy function
• The value function determines how good it is for the agent to be in a particular state.
• The state-value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[Gt | St = s]
Bellman Equation for MRPs
• The agent tries to obtain the largest expected sum of rewards from every state it lands in, i.e. to reach the optimal value function, the maximum sum of cumulative rewards.
• So, for each state in the state space, the Bellman equation gives us the value of that state:
v(s) = E[Rt+1 + γ v(St+1) | St = s] = Rs + γ Σs’ Pss’ v(s’)
Example: Bellman Equation for Student MRP
v(s) for γ = 1
Solving the Bellman Equation
• The Bellman equation is a linear equation.
• In matrix form v = R + γPv, so it can be solved directly (see the NumPy sketch below):
v = (I − γP)⁻¹R
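As a sketch of the direct solution, the NumPy snippet below solves v = (I − γP)⁻¹R for a made-up two-state MRP; the transition matrix, reward vector, and γ are illustrative assumptions, not the Student MRP.

```python
import numpy as np

# Hypothetical 2-state MRP: transition matrix P, reward vector R, discount gamma
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
R = np.array([1.0, -1.0])
gamma = 0.9

# Direct solution of the Bellman equation: v = (I - gamma * P)^(-1) R
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)
```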
Value Functions
• The value of a state is the expected return starting from that state; it depends on the agent’s policy π:
vπ(s) = Eπ[Gt | St = s]
• The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π:
qπ(s, a) = Eπ[Gt | St = s, At = a]
Policy Evaluation
• Policy evaluation: for a given policy π, compute the state-value function vπ.
Iterative Methods
Flow chart for Iterative Policy Evaluation: start → input → initialise V → sweep the states, updating V and tracking Δ ← max change → repeat the sweep while Δ ≥ θ; once Δ < θ, output V → stop.
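A minimal sketch of iterative policy evaluation following the flow above (initialise V, sweep the states, stop when the largest change Δ drops below θ); it reuses the hypothetical two-state MDP from the earlier sketch and a made-up fixed policy.

```python
# Iterative policy evaluation for a fixed policy on the hypothetical two-state MDP.
states = ["sunny", "rainy"]
actions = ["walk", "stay"]
P = {   # P[s][a][s'] = Pr(s' | s, a)
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "stay": {"sunny": 1.0, "rainy": 0.0}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}, "stay": {"sunny": 0.0, "rainy": 1.0}},
}
R = {"sunny": {"walk": 1.0, "stay": 0.0}, "rainy": {"walk": -1.0, "stay": 0.0}}
gamma, theta = 0.9, 1e-6

policy = {"sunny": "walk", "rainy": "stay"}      # a made-up fixed deterministic policy to evaluate

V = {s: 0.0 for s in states}                     # initialise V(s) = 0
while True:
    delta = 0.0
    for s in states:                             # one sweep over the state space
        a = policy[s]
        new_v = R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                            # stop when the largest change is below theta
        break
print(V)
```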
Policy Improvement
• Suppose we have computed Vπ for a deterministic policy π. For a given state s, would it be better to take some action a ≠ π(s)?
• The value of taking action a in state s and thereafter following π is:
qπ(s, a) = E[Rt+1 + γVπ(St+1) | St = s, At = a]
The Policy Improvement Theorem
• Let π and π’ be two policies such that qπ(s, π’(s)) ≥ Vπ(s) for all s ∈ S.
• Then π’ is as good as, or better than, π: Vπ’(s) ≥ Vπ(s) for all s ∈ S.
Policy Improvement Cont.
• Do this for all states to get a new policy π’ that is greedy with respect to Vπ:
π’(s) = argmaxa qπ(s, a)
• Then Vπ’ ≥ Vπ.
• What if Vπ’ = Vπ?
Policy Iteration
Flowchart for Policy Iteration: start → 1. Initialization → 2. Policy Evaluation → check whether the policy is stable (no state has a greedy action b ≠ π(s)); if NO/FALSE (not stable), return to Policy Evaluation; if YES/TRUE (stable), stop.
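A compact sketch of policy iteration on the same hypothetical two-state MDP: evaluate the current policy, improve it greedily, and stop once the policy is stable.

```python
# Policy iteration on the same hypothetical two-state MDP.
states = ["sunny", "rainy"]
actions = ["walk", "stay"]
P = {
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "stay": {"sunny": 1.0, "rainy": 0.0}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}, "stay": {"sunny": 0.0, "rainy": 1.0}},
}
R = {"sunny": {"walk": 1.0, "stay": 0.0}, "rainy": {"walk": -1.0, "stay": 0.0}}
gamma, theta = 0.9, 1e-6

def q(V, s, a):
    """One-step lookahead value of taking action a in state s, then following V."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)

def evaluate(policy):
    """Iterative policy evaluation, as in the previous sketch."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = q(V, s, policy[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

policy = {s: "stay" for s in states}             # 1. Initialization: arbitrary starting policy
while True:
    V = evaluate(policy)                         # 2. Policy Evaluation
    stable = True
    for s in states:                             # 3. Policy Improvement: act greedily w.r.t. V
        best = max(actions, key=lambda a: q(V, s, a))
        if best != policy[s]:                    # some b != pi(s): policy not yet stable
            policy[s], stable = best, False
    if stable:                                   # stop once the policy no longer changes
        break
print(policy, V)
```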
Value Iteration
• A drawback of policy iteration is that each iteration involves a policy evaluation, which may itself require multiple sweeps through the state space.
• Exact convergence of Vπ occurs only in the limit, so in principle we have to wait until convergence.
• As we have seen, the optimal policy is often obtained long before Vπ has converged.
• Fortunately, the policy evaluation step can be truncated in several ways without losing the convergence guarantees of policy iteration.
• Value iteration stops policy evaluation after just one sweep.
Value Iteration
• Recall the full policy evaluation backup:
Vk+1(s) ← Σa π(a|s) [Rsᵃ + γ Σs’ Pss’ᵃ Vk(s’)]
• The value iteration backup simply replaces the average over the policy with a maximum over actions:
Vk+1(s) ← maxa [Rsᵃ + γ Σs’ Pss’ᵃ Vk(s’)]
Flow chart for Value Iteration: start → input → initialise V → sweep the states, updating V with the max-over-actions backup and tracking Δ ← max change → repeat the sweep while Δ ≥ θ; once Δ < θ, output V → stop.
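A minimal sketch of value iteration on the same hypothetical two-state MDP: one maximising sweep per iteration, stopping when the largest change falls below θ, then reading off the greedy policy.

```python
# Value iteration on the same hypothetical two-state MDP.
states = ["sunny", "rainy"]
actions = ["walk", "stay"]
P = {
    "sunny": {"walk": {"sunny": 0.8, "rainy": 0.2}, "stay": {"sunny": 1.0, "rainy": 0.0}},
    "rainy": {"walk": {"sunny": 0.4, "rainy": 0.6}, "stay": {"sunny": 0.0, "rainy": 1.0}},
}
R = {"sunny": {"walk": 1.0, "stay": 0.0}, "rainy": {"walk": -1.0, "stay": 0.0}}
gamma, theta = 0.9, 1e-6

def backup(V, s, a):
    """Expected immediate reward plus discounted value of successor states."""
    return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in states)

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:                             # one sweep using the max-over-actions backup
        new_v = max(backup(V, s, a) for a in actions)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < theta:                            # values have (nearly) converged
        break

greedy_policy = {s: max(actions, key=lambda a: backup(V, s, a)) for s in states}
print(V, greedy_policy)
```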
Generalized Policy Iteration
• Generalized Policy Iteration (GPI) refers to any interaction of policy evaluation and policy improvement, independent of their granularity.
Efficiency of DP
• Finding an optimal policy with DP takes time polynomial in the number of states.
• Even so, it is surprisingly easy to come up with MDPs for which DP methods are not practical.
Summary
• RL is sequential decision making with delayed feedback, driven by rewards rather than a supervisor.
• MDPs formalize the RL problem; value functions and the Bellman equation tell us how good each state (or action) is.
• Policy evaluation, policy improvement, policy iteration, value iteration, and GPI are dynamic-programming methods for computing optimal policies.