Non-Graded: Assignment 1

Note: This assignment is only for practice purposes and will not be counted towards the final score.

1) Which of the following best suits the notion of 'regret' in a standard multi-arm bandit problem? Assume bandits are stationary. (1 point)

“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward achieved by choosing the worst action from the beginning.”
“The number of time steps required for the solution method to find the optimal action.”
“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method.”
“The difference between the best possible reward that can be sampled by selecting the optimal action and the worst possible reward that can be sampled by selecting the optimal action.”

No, the answer is incorrect.
Score: 0
Accepted Answers:
“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method.”

2) The credit assignment problem is the issue of assigning a correct mapping of accumulated rewards to the action(s) that led to them. Which of the following is the reason for the credit assignment problem in RL? (1 point)

Rewards are restricted to be a scalar value
Rewards are delayed in the RL setting
The agent cannot observe the reward
RL agents do not face the credit assignment problem

No, the answer is incorrect.
Score: 0
Accepted Answers:
Rewards are delayed in the RL setting

3) Consider a standard multi-arm bandit problem. The probability of picking an action a using the softmax policy is given by: (1 point)

Pr(a_t = a) = e^{Q_t(a)/β} / Σ_b e^{Q_t(b)/β}

Now, assume the following action-value estimates:
Q_t(a_0) = 1, Q_t(a_1) = 0.2, Q_t(a_2) = 0.5 and Q_t(a_3) = −1

0
0.1
0.23
0.31

Yes, the answer is correct.
Score: 1
Accepted Answers:
0.23

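The part of the question specifying which action's probability is asked for (and the value of β) did not survive extraction. As an illustration only, a small sketch of the softmax computation; with the assumed temperature β = 0.5, the probability of a_2 comes out to roughly 0.23, matching the accepted answer:

```python
import numpy as np

def softmax_probs(q, beta):
    """Softmax (Boltzmann) action probabilities: Pr(a) = exp(Q(a)/beta) / sum_b exp(Q(b)/beta)."""
    prefs = np.exp(np.asarray(q) / beta)
    return prefs / prefs.sum()

# Action-value estimates from the question.
q = [1.0, 0.2, 0.5, -1.0]

# beta = 0.5 is an assumption (the original value was lost); it reproduces 0.23 for a2.
print(softmax_probs(q, beta=0.5))   # -> approx [0.63, 0.127, 0.232, 0.012]
```
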
4) Consider the following statements: (1 point)

i. The rewards are obtained at a fixed time after taking an action.
ii. Reinforcement Learning is neither supervised nor unsupervised learning.
iii. Two reinforcement learning agents can learn by playing against each other.
iv. Always selecting the action with maximum reward will automatically maximize the winning probability in a game.

Which of the above statements is/are correct?

i, ii, iii
ii
ii, iii
iii, iv

No, the answer is incorrect.
Score: 0
Accepted Answers:
ii, iii


Your score is: 1/4

Non-Graded: Assignment 2

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Which of the following is true of the UCB algorithm? (1 point)

The action with the highest Q value is chosen at every iteration.
After a very large number of iterations, the confidence intervals of unselected actions will not change much.
The true expected value of an action always lies within its estimated confidence interval.
With a small probability ϵ, we select a random action to ensure adequate exploration of the action space.

No, the answer is incorrect.
Score: 0
Accepted Answers:
After a very large number of iterations, the confidence intervals of unselected actions will not change much.

2) In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates of the Q values are Q_100(1) = 1.73, Q_100(2) = 1.83, Q_100(3) = 1.89, Q_100(4) = 1.55, and the number of times each arm has been sampled is n_1 = 25, n_2 = 20, n_3 = 30, n_4 = 15. Which arm will be sampled in the next trial? (1 point)

Arm 1
Arm 2
Arm 3
Arm 4

Yes, the answer is correct.
Score: 1
Accepted Answers:
Arm 2

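A small sketch of the arithmetic behind this answer, assuming the √(2 ln t / n_i) exploration bonus of UCB1 (the exact constant used in the lectures may differ, but the ranking of the arms is the same here):

```python
import numpy as np

# Q-value estimates and play counts after t = 100 UCB iterations (from the question).
q = np.array([1.73, 1.83, 1.89, 1.55])
n = np.array([25, 20, 30, 15])
t = 100

# UCB1-style score: estimate plus exploration bonus sqrt(2 ln t / n_i).
ucb = q + np.sqrt(2 * np.log(t) / n)
print(ucb)                    # -> approx [2.337, 2.509, 2.444, 2.334]
print(np.argmax(ucb) + 1)     # -> 2, i.e. Arm 2 is sampled next
```
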
3) Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations. (1 point)
Reason: The n_j term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are false

4) Consider the following equalities/inequalities for the UCB algorithm, following the notation used in the lectures. (T_i(n): the number of times that action i has been played in the previous n trials; C_{k,T_i(k)} represents the confidence bound for arm i after k trials.) (1 point)

i. T_i(n) = 1 + Σ_{m=k+1}^{n} 𝟙{I_m = i}
ii. T_i(n) = Σ_{m=1}^{n} 𝟙{I_m = i}
iii. T_i(n) ≤ 1 + Σ_{m=k+1}^{n} 𝟙{Q_{m−1}(a*) + C_{m−1,T_{a*}(m−1)} ≤ Q_{m−1}(i) + C_{m−1,T_i(m−1)}}
iv. T_i(n) ≤ 1 + Σ_{m=k+1}^{n} 𝟙{Q_{m−1}(a*) ≤ Q_{m−1}(i) + C_{m−1,T_i(m−1)}}

Which of these equalities/inequalities are correct?

i and iii
ii and iv
i, ii, iii
i, ii, iii, iv

No, the answer is incorrect.
Score: 0
Accepted Answers:
i, ii, iii, iv

Your score is: 1/4


Week 3: Assignment 3 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) In the full RL problem, which of the following is determined by the agent? (1 point)

State
Reward
Action
None of the above

No, the answer is incorrect.
Score: 0
Accepted Answers:
Action

2) Which of the following statements is true about the RL problem? (1 point)

Our main aim is to maximize the current reward.
The agent performs the actions in a deterministic fashion.
We assume that the agent determines the reward based on the current state and action.
It is possible to have zero rewards.

No, the answer is incorrect.
Score: 0
Accepted Answers:
It is possible to have zero rewards.


3) Let us say we are taking actions according to a Gaussian distribution with parameters μ and σ. We update the parameters according to REINFORCE, and a_t denotes the action taken at step t. (1 point)

(i) μ_{t+1} = μ_t + α r_t (a_t − μ_t)/σ_t²
(ii) μ_{t+1} = μ_t + α r_t (μ_t − a_t)/σ_t²
(iii) σ_{t+1} = σ_t + α r_t (a_t − μ_t)²/σ_t³
(iv) σ_{t+1} = σ_t + α r_t ( (a_t − μ_t)²/σ_t³ − 1/σ_t )

Which of the above updates are correct?

(i), (iii)
(i), (iv)
(ii), (iv)
(ii), (iii)

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (iv)

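For reference, these updates come from REINFORCE's r_t ∇ log π rule applied to the Gaussian policy; a short sketch of the standard score-function derivation (not part of the quiz page itself):

log π(a_t | μ, σ) = −(a_t − μ)²/(2σ²) − log σ − (1/2) log 2π
∂/∂μ log π(a_t | μ, σ) = (a_t − μ)/σ²
∂/∂σ log π(a_t | μ, σ) = (a_t − μ)²/σ³ − 1/σ

Gradient ascent with step size α r_t on these two partial derivatives gives exactly updates (i) and (iv).
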
4) Assertion: Contextual bandits can be modeled as a full reinforcement learning problem. (1 point)
Reason: We can define an MDP with n states where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode, and giving a reward according to the corresponding bandit and arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion and Reason are both true and Reason is a correct explanation of Assertion

5) Remember that for discounted returns, (1 point)

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...

where γ is a discount factor. Which of the following best explains what happens when γ > 1 (say γ = 5)?

Nothing, γ > 1 is common for many RL problems.
Theoretically nothing can go wrong, but this case does not represent any real-world problems.
The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
None of the above is true.

Yes, the answer is correct.
Score: 1
Accepted Answers:
The agent will learn that delayed rewards will always be beneficial and so will not learn properly.

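A quick numerical illustration (not part of the quiz) of why γ > 1 is problematic: the per-step weights γ^k grow with k, so the return diverges and delayed rewards dominate the sum:

```python
# Truncated return with a constant reward stream, for a well-behaved and an ill-behaved gamma.
def truncated_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 10                       # constant reward of 1 for 10 steps
print(truncated_return(rewards, 0.9))      # ~6.51: bounded, early rewards matter most
print(truncated_return(rewards, 5.0))      # ~2.44e6: blows up, all weight on delayed rewards
```
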
Your score is: 1/5

Week 4: Assignment 4 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True/False: (1 point)
The Bellman optimality equation can be solved as a linear system of equations.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Consider the following statements for a finite MDP (P_π is a stochastic matrix): (1 point)
(i) The Bellman equation for the value function of a finite MDP defines a contraction operator (using the max norm).
(ii) If 0 ≤ γ < 1, then the eigenvalues of γP_π are less than 1.
(iii) If 0 ≤ γ < 1, the sequence defined by v_n = r_π + γP_π v_{n−1} is a Cauchy sequence (using the max norm).
Which of the above statements are true?

Only (ii), (iii)
Only (i), (ii)
Only (i), (iii)
(i), (ii), (iii)

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (ii), (iii)

3) Select the correct Bellman optimality equation: (1 point)

v*(s) = max_a Σ_{s'} p(s'|s, a) [ E[r|s, a, s'] + γ v*(s') ]
v*(s) = max_a Σ_{s'} p(s'|s, a) v*(s')
v*(s) = max_a Σ_{s'} p(s'|s, a) [ γ E[r|s, a, s'] + v*(s') ]
v*(s) = max_a Σ_{s'} p(s'|s, a) γ [ E[r|s, a, s'] + v*(s') ]

No, the answer is incorrect.
Score: 0
Accepted Answers:
v*(s) = max_a Σ_{s'} p(s'|s, a) [ E[r|s, a, s'] + γ v*(s') ]

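As a companion to the accepted equation, a minimal value-iteration sketch that applies this backup on a small hypothetical MDP (the transition and reward numbers below are made up purely for illustration):

```python
import numpy as np

# Bellman optimality backup: v*(s) = max_a sum_{s'} p(s'|s,a) [ E[r|s,a,s'] + gamma * v*(s') ]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probabilities (hypothetical)
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],     # R[s, a, s'] expected immediate rewards (hypothetical)
              [[0.0, 0.5], [1.0, 1.0]]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = (P * (R + gamma * v)).sum(axis=2)   # Q(s, a) = sum_s' p(s'|s,a) [ r + gamma v(s') ]
    v_new = q.max(axis=1)                   # greedy max over actions
    if np.max(np.abs(v_new - v)) < 1e-8:    # stop once the update is tiny
        break
    v = v_new
print(v)
```
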

4) State True/False: (1 point)
In MDPs, there is a unique resultant state for any given state-action pair.

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) For an operator L, let x be one of its fixed points. Consider the following statements: (1 point)
(i) L^m x = L^n x, where m, n ∈ N
(ii) L² x = x
Which of the above statements are true?

Only (i)
Only (ii)
(i), (ii)
None of the above

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (ii)

Your score is: 1/5

Week 5: Assignment 5 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Let the policy and value function obtained by applying policy iteration on a finite MDP be π_n and v_{π_n}, after the stopping criterion is met. Then: (1 point)

π_n will be the optimal policy but v_{π_n} might not be the optimal value function.
π_n might not be the optimal policy but v_{π_n} will be the optimal value function.
Nothing can be said about π_n and v_{π_n}.
π_n will be the optimal policy and v_{π_n} will be the optimal value function.

No, the answer is incorrect.
Score: 0
Accepted Answers:
π_n will be the optimal policy and v_{π_n} will be the optimal value function.


2) Assertion: Monte Carlo policy evaluation must use exploring starts in the case of non-deterministic policies. (1 point)
Reason: They have to rely upon exploring starts in the case of non-deterministic policies to ensure adequate sampling of all states.
(Assume all states of the MDP are reachable from all other states.)

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are false

3) Which of the following is/are guaranteed to be true at the stopping criterion for value iteration? (1 point)

Note: We stop the algorithm when ||v_{n+1} − v_n|| < ϵ(1−γ)/(2γ)

||v^π − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ
||v^π − v*|| ≤ ϵ/2
||v_{n+1} − v*|| ≤ ϵ/2

Partially Correct.
Score: 0.33
Accepted Answers:
||v^π − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ/2

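For context, a sketch of the standard contraction argument behind these bounds (here π denotes the greedy policy with respect to v_{n+1}, and the stopping threshold is the reconstructed ϵ(1−γ)/(2γ) above):

||v_{n+1} − v*|| ≤ (γ/(1−γ)) ||v_{n+1} − v_n|| < (γ/(1−γ)) · ϵ(1−γ)/(2γ) = ϵ/2
||v^π − v*|| ≤ (2γ/(1−γ)) ||v_{n+1} − v_n|| < ϵ

which is why the ϵ/2 guarantee holds for v_{n+1}, while only the ϵ guarantee holds for v^π.
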

4) Consider an MDP where there are n actions (a ∈ A, with |A| = n), each of which is applicable in each state s ∈ S. If π is an ϵ-soft policy for some ϵ > 0, and q_π is the action-value function of the policy π, then consider the following statements and choose the appropriate option. (1 point)

Assertion: Any ϵ-greedy policy with respect to q_π is strictly better than π.
Reason: In an ϵ-greedy policy, the action with the maximal estimated action value is chosen with probability 1 − ϵ + ϵ/n and other actions are chosen at random with probability ϵ/n.

Both assertion and reason are true, and reason is a correct explanation for assertion.
Both assertion and reason are true, but reason is not a correct explanation for assertion.
Assertion is false but reason is true.
Assertion is true but reason is false.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is false but reason is true.


5) If v satisfies Lv = L_π v = v (where L is the Bellman optimality operator and L_π is the Bellman operator for the value function of π), then (1 point)

π is ϵ-optimal and v is the fixed point of π
π is not optimal and v is not the fixed point of π
π is an optimal policy and v is the fixed point of π
π is not optimal and v is the fixed point of the optimal policy

No, the answer is incorrect.
Score: 0
Accepted Answers:
π is an optimal policy and v is the fixed point of π

Your score is: 0.33/5



Week 6: Assignment 6 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Consider the following equation for TD control: (1 point)

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α[R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)]

Which part of the equation is referred to as the TD error (δ)?

δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1})
δ = α[R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)]
δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)
δ = Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)

No, the answer is incorrect.
Score: 0
Accepted Answers:
δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)

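A minimal sketch of this update in code (tabular Q, with hypothetical shapes and numbers just to show the call):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA backup; delta below is exactly the TD error from the question."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

Q = np.zeros((3, 2))                                              # 3 states x 2 actions (illustrative)
print(sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0))       # TD error = 1.0 for a zero-initialized Q
```
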

2) Which of the following are true for TD(0)? (Assume that the environment is truly Markov.) (1 point)

It uses the full return to update the value of states.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given a finite number of samples.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.
TD error is given by “δ = v_new(s_t, a_t) − v_old(s_t, a_t)”.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.

3) Consider the following equation for weighted importance sampling: (1 point)

Weighted Importance Sampling = [ Σ_{i=1}^{N} f(x_i) p(x_i)/q(x_i) ] / [ Σ_{i=1}^{N} p(x_i)/q(x_i) ]

where x_i is a trajectory, f(x_i) is the return on trajectory x_i, p(x_i) is the probability of trajectory x_i when following policy π, and q(x_i) is the probability of trajectory x_i when following policy µ. Then:

π is estimation policy and µ is behaviour policy
π is behaviour policy and µ is estimation policy

No, the answer is incorrect.
Score: 0
Accepted Answers:
π is estimation policy and µ is behaviour policy

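A minimal sketch of this estimator (the trajectory returns and probabilities below are hypothetical, just to show the weighting):

```python
import numpy as np

def weighted_is_estimate(returns, p, q):
    """Weighted importance-sampling estimate of the estimation policy's value.

    returns: f(x_i), the return of each sampled trajectory
    p:       probability of each trajectory under the estimation (target) policy pi
    q:       probability of each trajectory under the behaviour policy mu that generated the data
    """
    w = np.asarray(p) / np.asarray(q)           # importance weights
    return np.sum(w * np.asarray(returns)) / np.sum(w)

print(weighted_is_estimate(returns=[1.0, 0.0, 2.0], p=[0.2, 0.1, 0.4], q=[0.25, 0.5, 0.25]))
```
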

4) Assertion: SARSA is an on-policy method. (1 point)
Reason: In SARSA, we do not update the action that was actually used, so it is an on-policy method.

Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation for Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is true, Reason is false

5) Which of the following statements are true? (multi-correct) (1 point)

Learning requires at least a simulation model for the MDP.
Planning requires at least a simulation model for the MDP.
Learning can be done just using the real-world samples.
Planning can be done just using the real-world samples.

Partially Correct.
Score: 0.5
Accepted Answers:
Planning requires at least a simulation model for the MDP.
Learning can be done just using the real-world samples.

Your score is: 0.5/5



Week 7: Assignment 7 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Which of the following is equal to G_t^{(3)}? (1 point)

R_{t+1} + γR_{t+2} + γ²V(s_{t+2})
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+3})
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+4})
None of the above.

Yes, the answer is correct.
Score: 1
Accepted Answers:
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+3})

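A small sketch of the general n-step return that the accepted answer instantiates for n = 3 (illustrative numbers only):

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """n-step return: G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(s_{t+n}).

    rewards:     [R_{t+1}, ..., R_{t+n}]
    v_bootstrap: current estimate V(s_{t+n}) used to bootstrap the tail
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return g + (gamma ** n) * v_bootstrap

# A 3-step return with made-up rewards and bootstrap value.
print(n_step_return([1.0, 0.0, 2.0], v_bootstrap=0.5, gamma=0.9, n=3))
```
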

2) State True or False: The idea in Sarsa(λ) is to apply the TD(λ) prediction method to just the states rather than to state-action pairs. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False


3) Is the TD(λ) algorithm, with λ = 1, essentially Monte Carlo? (1 point)

yes
no

No, the answer is incorrect.
Score: 0
Accepted Answers:
yes

4) In solving the control problem, suppose that the first action that is taken is not an optimal action according to the current policy at the start of an episode. Would an update be made corresponding to this action and the subsequent reward received in Watkins's Q(λ) algorithm? (1 point)

Yes
No

Yes, the answer is correct.
Score: 1
Accepted Answers:
Yes

Your score is: 2/4


Week 8: Assignment 8 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Assuming that we use tile-coding state aggregation, which of the following is the correct update equation for replacing traces, for a tile “h”, when its indicator variable is 1? (1 point)

e_t(h) = γλ e_{t−1}(h) + 1
e_t(h) = γλ e_{t−1}(h)
e_t(h) = 1
Replacing traces are not well defined for tile-coding representations of states.

No, the answer is incorrect.
Score: 0
Accepted Answers:
e_t(h) = 1

2) Assertion: LSPI can be used for continuous action spaces. (1 point)
Reason: We can use samples of (state, action) pairs to train a regressor to approximate the policy.

Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation for Assertion.
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.

3) State true or false for the following statement. (1 point)
Statement: For the LSTD and LSTDQ methods, as we gather more samples, we can incrementally append data to the Ã and b̃ matrices. This allows us to stop and solve for the value of θ_π at any point in a sampled trajectory.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
True

4) Which of the following is the correct update equation for eligibility traces with a linear function approximator, V(s_t) = w^T φ(s_t)? (1 point)

e_t = γλ e_{t−1} + w
e_t = γλ e_{t−1} + φ(s_t)
e_t = γλ e_{t−1} + φ(s_{t−1})
None of the above

Yes, the answer is correct.
Score: 1
Accepted Answers:
e_t = γλ e_{t−1} + φ(s_t)

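A minimal sketch of one TD(λ) step with a linear value function and the accumulating-trace update from the accepted answer (hypothetical 3-dimensional features, illustrative numbers):

```python
import numpy as np

def td_lambda_step(w, e, phi_s, r, phi_s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) update with linear V(s) = w^T phi(s) and trace e <- gamma*lambda*e + phi(s_t)."""
    e = gamma * lam * e + phi_s                      # eligibility-trace update (accepted answer)
    delta = r + gamma * w @ phi_s_next - w @ phi_s   # TD error
    w = w + alpha * delta * e                        # credit all recently visited features
    return w, e

w, e = np.zeros(3), np.zeros(3)
w, e = td_lambda_step(w, e, phi_s=np.array([1., 0., 0.]), r=1.0, phi_s_next=np.array([0., 1., 0.]))
print(w, e)
```
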

Your score is: 2/4

Week 9: Assignment 9 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True or False: (1 point)
DQN is guaranteed to converge to an optimal policy.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Which of the following is true about DQN? (1 point)

It can be efficiently used for very large state spaces
It can be efficiently used for continuous action spaces

No, the answer is incorrect.
Score: 0
Accepted Answers:
It can be efficiently used for very large state spaces


3) Which of the following is the correct definition of the average reward formulation? (1 point)

ρ(π) = lim_{N→∞} E[r_1 + r_2 + ... + r_N]
ρ(π) = lim_{N→∞} E[r_1 + r_2 + ... + r_N | π]
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N]
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N | π]

No, the answer is incorrect.
Score: 0
Accepted Answers:
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N | π]

4) The Policy Gradient Theorem does not hold for the average reward formulation. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) How many outputs will we get from the final layer of a DQN network (|S| and |A| represent the total number of states and actions in the environment, respectively)? (1 point)

|S| × |A|
|S|
|A|
None of these

No, the answer is incorrect.
Score: 0
Accepted Answers:
|A|

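A minimal sketch of why the answer is |A|: a DQN maps a state to one Q-value per action, so its final layer has |A| outputs and an action is picked by argmax. The dimensions below are illustrative, and PyTorch is assumed as the framework:

```python
import torch
import torch.nn as nn

state_dim, num_actions = 8, 4            # hypothetical state-feature size and |A| = 4

q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),          # final layer: |A| outputs, one Q-value per action
)

state = torch.randn(1, state_dim)        # a dummy state
q_values = q_net(state)                  # shape (1, |A|)
action = q_values.argmax(dim=1)
print(q_values.shape, action.item())
```
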

Your score is: 1/5

Week 10: Assignment 10 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.

Instructions: In the following questions, one or more choices may be correct. Select all that apply.

1) Given a problem with a well-defined hierarchy, what ordering would you expect on the total expected reward for a hierarchically optimal policy (H), a recursively optimal policy (R) and a flat optimal policy (F)? (1 point)

R ≤ F ≤ H
F ≤ R ≤ H
R ≤ H ≤ F
F ≤ H ≤ R

No, the answer is incorrect.
Score: 0
Accepted Answers:
R ≤ H ≤ F


2) Do flat optimal solutions give the solution with the highest expected returns? (1 point)

Yes
No

Yes, the answer is correct.
Score: 1
Accepted Answers:
Yes

3) Which of the following are true for HAMs? (1 point)

The finite state machines in a HAM must be connected in a directed acyclic graph.
A choice state can make a transition to any state in any machine in the HAM.
Stop states terminate the episode.
Action states emit primitive actions.

No, the answer is incorrect.
Score: 0
Accepted Answers:
The finite state machines in a HAM must be connected in a directed acyclic graph.
Action states emit primitive actions.

4) In a Hierarchy of Abstract Machines, the core MDP state changes only when we visit a choice state. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) State True or False: (1 point)
Flat optimal policies are always easier to learn than recursively optimal policies.

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

Your score is: 1/5

Week 11: Assignment 11 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True or False: (1 point)
Each sub-task M_i in the MAXQ framework is not an SMDP.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Recall that in MAXQ Value Function Decomposition, we draw a “call graph” where nodes are ‘tasks’ and edges show the dependency of the tasks. Which of the following is true about the graph? (1 point)

The graph must be a tree
The graph must be a DAG
The graph can be any regular graph without self loops
Any directed graph can be a call graph

Yes, the answer is correct.
Score: 1
Accepted Answers:
The graph must be a DAG


3) State True or False: (1 point)
In the MAXQ framework, rewards of the core MDP are not available while learning the policies of the sub-tasks, i.e., the agent is restricted to the corresponding sub-task's pseudo-rewards.

True
False

4) State True or False: (1 point)
In the MAXQ framework, the expected reward function R̄(s, a) of the SMDP corresponding to sub-task M_i is equivalent to the projected value function V^{π_i}(a, s).

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

Your score is: 2/4

on=106)

Week 12: Assignment 12 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.

Instructions: In the following questions, one or more choices may be correct. Select all that apply.


1) Suppose that we solve a POMDP using a Q-MDP like solution discussed in the lectures - where we assume that the MDP is known and solve it to learn Q values for the true (state, action) pairs. Which of the following are true? (1 point)

We can recover a policy for execution in the partially observable environment by weighting Q values by the belief distribution bel so that π(s) = argmax_a Σ_s bel(s) Q(s, a).
We can recover an optimal policy for the POMDP from the Q values that have been learnt for the true (state, action) pairs.
Policies recovered from Q-MDP like solution methods are always better than policies learnt by history based methods.
None of the above

Yes, the answer is correct.
Score: 1
Accepted Answers:
We can recover a policy for execution in the partially observable environment by weighting Q values by the belief distribution bel so that π(s) = argmax_a Σ_s bel(s) Q(s, a).

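A tiny sketch of the Q-MDP action selection in the accepted answer (the belief and Q values below are hypothetical, just to show the weighting):

```python
import numpy as np

def qmdp_action(belief, Q):
    """Q-MDP action selection: argmax_a sum_s bel(s) Q(s, a).

    belief: bel(s), a probability distribution over states, shape (|S|,)
    Q:      Q(s, a) learned for the underlying fully observable MDP, shape (|S|, |A|)
    """
    return int(np.argmax(belief @ Q))

belief = np.array([0.5, 0.3, 0.2])       # hypothetical belief over 3 states
Q = np.array([[1.0, 0.0],
              [0.2, 0.9],
              [0.0, 2.0]])               # hypothetical Q values, 2 actions
print(qmdp_action(belief, Q))            # -> 1 for these numbers
```
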

https://onlinecourses.nptel.ac.in/noc24_cs52/unit?unit=109&assessment=199 1/4
4/21/24, 3:33 AM Reinforcement Learning - - Unit 15 - Week 12

2) Consider the below grid-world: 1 point


Week 8 ()

Week 9 ()

Week 10 ()

Week 11 ()

Week 12 ()

POMDP
Introduction
(unit?
unit=109&less
on=110)

Solving
POMDP (unit?
unit=109&less In the figure above, black squares are blocked. Assume the agent can see one step in the 4
on=111) cardinal directions. Assume that the agent’s observations are always correct and that there is no
prior information given regarding the states.
Week 12:
Solutions
Assertion: If the observation is that there are no obstruction to the East or West, but are present
(unit?
to the North and South, the belief that the agent is in the green shaded square is 0.5.
unit=109&less
on=113) Reason: Only the green and blue shaded squares have obstructions to the North and South, but
not to the East or West.
Practice:
Week 12 : Assertion and Reason are both true and Reason is a correct explanation of Assertion.
Assignment
Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
12(Non
Graded) Assertion is true but Reason is false.
(assessment? Assertion and Reason are both false.
name=199)
No, the answer is incorrect.
Week 12 Score: 0
Feedback Accepted Answers:
Form : Assertion and Reason are both false.
3) Consider the grid world shown below. Walls and obstacles are colored gray. An agent is dropped into one of the unoccupied cells of the environment uniformly at random. The agent is equipped with a sensor that can detect the presence of walls or obstacles immediately to its North, South, East or West. However, the sensor is noisy, and an observation made in each direction may be wrong with a probability of 0.1. Given that the agent senses no obstacles in any direction, what is the probability that it was dropped into the cell marked ‘x’? (1 point)
[Grid figure not included.]

1/5
82/91
164/173
None of the above.

No, the answer is incorrect.
Score: 0
Accepted Answers:
None of the above.


4) In the same environment as Question 3, what is the probability that the agent was not dropped onto the cell marked ‘x’, if the observation made is that there are obstacles present only to the North and to the South? (1 point)

4/5
82/91
164/173
None of the above.

No, the answer is incorrect.


Score: 0
Accepted Answers:
164/173
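Both of these questions follow the same Bayes-rule pattern: the posterior over cells is proportional to the prior times the likelihood of the four independent (possibly wrong) sensor readings. Since the grid figure is not reproduced here, the sketch below uses hypothetical cell signatures only to illustrate the computation, not to reproduce 82/91 or 164/173:

```python
import numpy as np

# Hypothetical example only (the actual grid figure is not available in this extract).
# Each cell is described by which of its 4 neighbours are obstacles, in (N, S, E, W) order.
cells = {
    "x": (1, 1, 0, 0),          # obstacles to N and S (assumed signature)
    "a": (1, 1, 0, 0),
    "b": (0, 0, 1, 1),
    "c": (0, 0, 0, 0),
}
obs = (1, 1, 0, 0)              # the agent senses obstacles only to N and S
p_correct = 0.9                 # each direction is read correctly with probability 0.9

prior = 1.0 / len(cells)        # dropped uniformly at random over unoccupied cells
likelihood = {
    name: np.prod([p_correct if o == t else 1 - p_correct for o, t in zip(obs, truth)])
    for name, truth in cells.items()
}
z = sum(prior * l for l in likelihood.values())
posterior = {name: prior * l / z for name, l in likelihood.items()}
print(posterior)                # belief over cells; P(not 'x') = 1 - posterior['x']
```
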

5) Assertion: In partially observable systems, histories that include both the sequence of observations and the sequence of actions are typically able to disambiguate the true state of an agent better than histories that include only the sequence of observations. (1 point)
Reason: Different sequences of actions can lead to different interpretations of the sequence of sensor observations.

Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation of the Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.



Your score is: 1/5
