Non-Graded: Assignment 1

Note: This assignment is only for practice purposes and will not be counted towards the final score.

1) Which of the following best suits the notion of 'regret' in a standard multi-arm bandit problem? Assume bandits are stationary. (1 point)

“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward achieved by choosing the worst action from the beginning.”
“The number of time steps required for the solution method to find the optimal action.”
“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method.”
“The difference between the best possible reward that can be sampled by selecting the optimal action and the worst possible reward that can be sampled by selecting the optimal action.”

No, the answer is incorrect.
Score: 0
Accepted Answers:
“The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method.”

2) The credit assignment problem is the issue of assigning a correct mapping of accumulated rewards to the action(s) that led to them. Which of the following is the reason for the credit assignment problem in RL? (1 point)

Rewards are restricted to be a scalar value
Rewards are delayed in the RL setting
The agent cannot observe the reward
RL agents do not face the credit assignment problem

No, the answer is incorrect.
Score: 0
Accepted Answers:
Rewards are delayed in the RL setting

3) Consider a standard multi-arm bandit problem. The probability of picking an action a using the softmax policy is given by: (1 point)

Pr(a_t = a) = e^{Q_t(a)/β} / Σ_b e^{Q_t(b)/β}

Now, assume the following action-value estimates:
Q_t(a_0) = 1, Q_t(a_1) = 0.2, Q_t(a_2) = 0.5 and Q_t(a_3) = −1

0
0.1
0.23
0.31

Yes, the answer is correct.
Score: 1
Accepted Answers:
0.23

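The part of the question specifying which action's probability is asked for (and the value of β) did not survive extraction. As an illustration only, a small sketch of the softmax computation; with the assumed temperature β = 0.5, the probability of a_2 comes out to roughly 0.23, matching the accepted answer:

```python
import numpy as np

def softmax_probs(q, beta):
    """Softmax (Boltzmann) action probabilities: Pr(a) = exp(Q(a)/beta) / sum_b exp(Q(b)/beta)."""
    prefs = np.exp(np.asarray(q) / beta)
    return prefs / prefs.sum()

# Action-value estimates from the question.
q = [1.0, 0.2, 0.5, -1.0]

# beta = 0.5 is an assumption (the original value was lost); it reproduces 0.23 for a2.
print(softmax_probs(q, beta=0.5))   # -> approx [0.63, 0.127, 0.232, 0.012]
```
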
4) Consider the following statements: (1 point)

i. The rewards are obtained at a fixed time after taking an action.
ii. Reinforcement Learning is neither supervised nor unsupervised learning.
iii. Two reinforcement learning agents can learn by playing against each other.
iv. Always selecting the action with maximum reward will automatically maximize the winning probability in a game.

Which of the above statements is/are correct?

i, ii, iii
ii
ii, iii
iii, iv

No, the answer is incorrect.
Score: 0
Accepted Answers:
ii, iii


Your score is: 1/4

Non-Graded: Assignment 2

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Which of the following is true of the UCB algorithm? (1 point)

The action with the highest Q value is chosen at every iteration.
After a very large number of iterations, the confidence intervals of unselected actions will not change much.
The true expected value of an action always lies within its estimated confidence interval.
With a small probability ϵ, we select a random action to ensure adequate exploration of the action space.

No, the answer is incorrect.
Score: 0
Accepted Answers:
After a very large number of iterations, the confidence intervals of unselected actions will not change much.

2) In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates of the Q values are Q_100(1) = 1.73, Q_100(2) = 1.83, Q_100(3) = 1.89, Q_100(4) = 1.55, and the number of times each arm has been sampled is n_1 = 25, n_2 = 20, n_3 = 30, n_4 = 15. Which arm will be sampled in the next trial? (1 point)

Arm 1
Arm 2
Arm 3
Arm 4

Yes, the answer is correct.
Score: 1
Accepted Answers:
Arm 2

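A small sketch of the arithmetic behind this answer, assuming the √(2 ln t / n_i) exploration bonus of UCB1 (the exact constant used in the lectures may differ, but the ranking of the arms is the same here):

```python
import numpy as np

# Q-value estimates and play counts after t = 100 UCB iterations (from the question).
q = np.array([1.73, 1.83, 1.89, 1.55])
n = np.array([25, 20, 30, 15])
t = 100

# UCB1-style score: estimate plus exploration bonus sqrt(2 ln t / n_i).
ucb = q + np.sqrt(2 * np.log(t) / n)
print(ucb)                    # -> approx [2.337, 2.509, 2.444, 2.334]
print(np.argmax(ucb) + 1)     # -> 2, i.e. Arm 2 is sampled next
```
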
3) Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations. (1 point)
Reason: The n_j term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are false

4) Consider the following equalities/inequalities for the UCB algorithm, following the notation used in the lectures. (T_i(n): the number of times that action i has been played in the previous n trials; C_{k,T_i(k)} represents the confidence bound for arm i after k trials.) (1 point)

i. T_i(n) = 1 + Σ_{m=k+1}^{n} 𝟙{I_m = i}
ii. T_i(n) = Σ_{m=1}^{n} 𝟙{I_m = i}
iii. T_i(n) ≤ 1 + Σ_{m=k+1}^{n} 𝟙{Q_{m−1}(a*) + C_{m−1,T_{a*}(m−1)} ≤ Q_{m−1}(i) + C_{m−1,T_i(m−1)}}
iv. T_i(n) ≤ 1 + Σ_{m=k+1}^{n} 𝟙{Q_{m−1}(a*) ≤ Q_{m−1}(i) + C_{m−1,T_i(m−1)}}

Which of these equalities/inequalities are correct?

i and iii
ii and iv
i, ii, iii
i, ii, iii, iv

No, the answer is incorrect.
Score: 0
Accepted Answers:
i, ii, iii, iv

Your score is: 1/4


Week 3: Assignment 3 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) In the full RL problem, which of the following is determined by the agent? (1 point)

State
Reward
Action
None of the above

No, the answer is incorrect.
Score: 0
Accepted Answers:
Action

2) Which of the following statements is true about the RL problem? (1 point)

Our main aim is to maximize the current reward.
The agent performs the actions in a deterministic fashion.
We assume that the agent determines the reward based on the current state and action.
It is possible to have zero rewards.

No, the answer is incorrect.
Score: 0
Accepted Answers:
It is possible to have zero rewards.


3) Let us say we are taking actions according to a Gaussian distribution with parameters μ and σ. We update the parameters according to REINFORCE, and a_t denotes the action taken at step t. (1 point)

(i) μ_{t+1} = μ_t + α r_t (a_t − μ_t)/σ_t²
(ii) μ_{t+1} = μ_t + α r_t (μ_t − a_t)/σ_t²
(iii) σ_{t+1} = σ_t + α r_t (a_t − μ_t)²/σ_t³
(iv) σ_{t+1} = σ_t + α r_t ( (a_t − μ_t)²/σ_t³ − 1/σ_t )

Which of the above updates are correct?

(i), (iii)
(i), (iv)
(ii), (iv)
(ii), (iii)

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (iv)

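For reference, these updates come from REINFORCE's r_t ∇ log π rule applied to the Gaussian policy; a short sketch of the standard score-function derivation (not part of the quiz page itself):

log π(a_t | μ, σ) = −(a_t − μ)²/(2σ²) − log σ − (1/2) log 2π
∂/∂μ log π(a_t | μ, σ) = (a_t − μ)/σ²
∂/∂σ log π(a_t | μ, σ) = (a_t − μ)²/σ³ − 1/σ

Gradient ascent with step size α r_t on these two partial derivatives gives exactly updates (i) and (iv).
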
4) Assertion: Contextual bandits can be modeled as a full reinforcement learning problem. (1 point)
Reason: We can define an MDP with n states where n is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode, and giving a reward according to the corresponding bandit and arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion and Reason are both true and Reason is a correct explanation of Assertion

5) Remember that for discounted returns, (1 point)

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ...

where γ is a discount factor. Which of the following best explains what happens when γ > 1 (say γ = 5)?

Nothing, γ > 1 is common for many RL problems.
Theoretically nothing can go wrong, but this case does not represent any real-world problems.
The agent will learn that delayed rewards will always be beneficial and so will not learn properly.
None of the above is true.

Yes, the answer is correct.
Score: 1
Accepted Answers:
The agent will learn that delayed rewards will always be beneficial and so will not learn properly.

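A quick numerical illustration (not part of the quiz) of why γ > 1 is problematic: the per-step weights γ^k grow with k, so the return diverges and delayed rewards dominate the sum:

```python
# Truncated return with a constant reward stream, for a well-behaved and an ill-behaved gamma.
def truncated_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 10                       # constant reward of 1 for 10 steps
print(truncated_return(rewards, 0.9))      # ~6.51: bounded, early rewards matter most
print(truncated_return(rewards, 5.0))      # ~2.44e6: blows up, all weight on delayed rewards
```
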
Your score is: 1/5

Week 4: Assignment 4 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True/False: (1 point)
The Bellman optimality equation can be solved as a linear system of equations.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Consider the following statements for a finite MDP (P_π is a stochastic matrix): (1 point)
(i) The Bellman equation for the value function of a finite MDP defines a contraction operator (using the max norm).
(ii) If 0 ≤ γ < 1, then the eigenvalues of γP_π are less than 1.
(iii) If 0 ≤ γ < 1, the sequence defined by v_n = r_π + γP_π v_{n−1} is a Cauchy sequence (using the max norm).
Which of the above statements are true?

Only (ii), (iii)
Only (i), (ii)
Only (i), (iii)
(i), (ii), (iii)

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (ii), (iii)

3) Select the correct Bellman optimality equation: (1 point)

v*(s) = max_a Σ_{s'} p(s'|s, a) [ E[r|s, a, s'] + γ v*(s') ]
v*(s) = max_a Σ_{s'} p(s'|s, a) v*(s')
v*(s) = max_a Σ_{s'} p(s'|s, a) [ γ E[r|s, a, s'] + v*(s') ]
v*(s) = max_a Σ_{s'} p(s'|s, a) γ [ E[r|s, a, s'] + v*(s') ]

No, the answer is incorrect.
Score: 0
Accepted Answers:
v*(s) = max_a Σ_{s'} p(s'|s, a) [ E[r|s, a, s'] + γ v*(s') ]

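As a companion to the accepted equation, a minimal value-iteration sketch that applies this backup on a small hypothetical MDP (the transition and reward numbers below are made up purely for illustration):

```python
import numpy as np

# Bellman optimality backup: v*(s) = max_a sum_{s'} p(s'|s,a) [ E[r|s,a,s'] + gamma * v*(s') ]
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] transition probabilities (hypothetical)
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],     # R[s, a, s'] expected immediate rewards (hypothetical)
              [[0.0, 0.5], [1.0, 1.0]]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    q = (P * (R + gamma * v)).sum(axis=2)   # Q(s, a) = sum_s' p(s'|s,a) [ r + gamma v(s') ]
    v_new = q.max(axis=1)                   # greedy max over actions
    if np.max(np.abs(v_new - v)) < 1e-8:    # stop once the update is tiny
        break
    v = v_new
print(v)
```
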

4) State True/False: (1 point)
In MDPs, there is a unique resultant state for any given state-action pair.

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) For an operator L, let x be one of its fixed points. Consider the following statements: (1 point)
(i) L^m x = L^n x, where m, n ∈ N
(ii) L² x = x
Which of the above statements are true?

Only (i)
Only (ii)
(i), (ii)
None of the above

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (ii)

Your score is: 1/5

Week 5: Assignment 5 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Let the policy and value function obtained by applying policy iteration on a finite MDP be π_n and v_{π_n}, after the stopping criterion is met. Then: (1 point)

π_n will be the optimal policy but v_{π_n} might not be the optimal value function.
π_n might not be the optimal policy but v_{π_n} will be the optimal value function.
Nothing can be said about π_n and v_{π_n}.
π_n will be the optimal policy and v_{π_n} will be the optimal value function.

No, the answer is incorrect.
Score: 0
Accepted Answers:
π_n will be the optimal policy and v_{π_n} will be the optimal value function.


2) Assertion: Monte Carlo policy evaluation must use exploring starts in the case of non-deterministic policies. (1 point)
Reason: They have to rely upon exploring starts in the case of non-deterministic policies to ensure adequate sampling of all states.
(Assume all states of the MDP are reachable from all other states.)

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are false

3) Which of the following is/are guaranteed to be true at the stopping criterion for value iteration? (1 point)

Note: We stop the algorithm when ||v_{n+1} − v_n|| < ϵ(1−γ)/(2γ)

||v^π − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ
||v^π − v*|| ≤ ϵ/2
||v_{n+1} − v*|| ≤ ϵ/2

Partially Correct.
Score: 0.33
Accepted Answers:
||v^π − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ
||v_{n+1} − v*|| ≤ ϵ/2

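For context, a sketch of the standard contraction argument behind these bounds (here π denotes the greedy policy with respect to v_{n+1}, and the stopping threshold is the reconstructed ϵ(1−γ)/(2γ) above):

||v_{n+1} − v*|| ≤ (γ/(1−γ)) ||v_{n+1} − v_n|| < (γ/(1−γ)) · ϵ(1−γ)/(2γ) = ϵ/2
||v^π − v*|| ≤ (2γ/(1−γ)) ||v_{n+1} − v_n|| < ϵ

which is why the ϵ/2 guarantee holds for v_{n+1}, while only the ϵ guarantee holds for v^π.
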

4) Consider an MDP where there are n actions (a ∈ A, with |A| = n), each of which is applicable in each state s ∈ S. If π is an ϵ-soft policy for some ϵ > 0, and q_π is the action-value function of the policy π, then consider the following statements and choose the appropriate option. (1 point)

Assertion: Any ϵ-greedy policy with respect to q_π is strictly better than π.
Reason: In an ϵ-greedy policy, the action with the maximal estimated action value is chosen with probability 1 − ϵ + ϵ/n and other actions are chosen at random with probability ϵ/n.

Both assertion and reason are true, and reason is a correct explanation for assertion.
Both assertion and reason are true, but reason is not a correct explanation for assertion.
Assertion is false but reason is true.
Assertion is true but reason is false.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is false but reason is true.


5) If v satisfies Lv = L_π v = v (where L is the Bellman optimality operator and L_π is the Bellman operator for the value function of π), then (1 point)

π is ϵ-optimal and v is the fixed point of π
π is not optimal and v is not the fixed point of π
π is an optimal policy and v is the fixed point of π
π is not optimal and v is the fixed point of the optimal policy

No, the answer is incorrect.
Score: 0
Accepted Answers:
π is an optimal policy and v is the fixed point of π

Your score is: 0.33/5



Week 6: Assignment 6 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Consider the following equation for TD control: (1 point)

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α[R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)]

Which part of the equation is referred to as the TD error (δ)?

δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1})
δ = α[R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)]
δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)
δ = Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)

No, the answer is incorrect.
Score: 0
Accepted Answers:
δ = R_{t+1} + γ Q_old(s_{t+1}, a_{t+1}) − Q_old(s_t, a_t)

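A minimal sketch of this update in code (tabular Q, with hypothetical shapes and numbers just to show the call):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA backup; delta below is exactly the TD error from the question."""
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * delta
    return delta

Q = np.zeros((3, 2))                                              # 3 states x 2 actions (illustrative)
print(sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0))       # TD error = 1.0 for a zero-initialized Q
```
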

2) Which of the following are true for TD(0)? (Assume that the environment is truly Markov.) (1 point)

It uses the full return to update the value of states.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given a finite number of samples.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.
TD error is given by “δ = v_new(s_t, a_t) − v_old(s_t, a_t)”.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.

3) Consider the following equation for weighted importance sampling: (1 point)

Weighted Importance Sampling = [ Σ_{i=1}^{N} f(x_i) p(x_i)/q(x_i) ] / [ Σ_{i=1}^{N} p(x_i)/q(x_i) ]

where x_i is a trajectory, f(x_i) is the return on trajectory x_i, p(x_i) is the probability of trajectory x_i when following policy π, and q(x_i) is the probability of trajectory x_i when following policy µ. Then:

π is estimation policy and µ is behaviour policy
π is behaviour policy and µ is estimation policy

No, the answer is incorrect.
Score: 0
Accepted Answers:
π is estimation policy and µ is behaviour policy

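A minimal sketch of this estimator (the trajectory returns and probabilities below are hypothetical, just to show the weighting):

```python
import numpy as np

def weighted_is_estimate(returns, p, q):
    """Weighted importance-sampling estimate of the estimation policy's value.

    returns: f(x_i), the return of each sampled trajectory
    p:       probability of each trajectory under the estimation (target) policy pi
    q:       probability of each trajectory under the behaviour policy mu that generated the data
    """
    w = np.asarray(p) / np.asarray(q)           # importance weights
    return np.sum(w * np.asarray(returns)) / np.sum(w)

print(weighted_is_estimate(returns=[1.0, 0.0, 2.0], p=[0.2, 0.1, 0.4], q=[0.25, 0.5, 0.25]))
```
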

4) Assertion: SARSA is an on-policy method. (1 point)
Reason: In SARSA, we do not update the action that was actually used, so it is an on-policy method.

Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation for Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is true, Reason is false

5) Which of the following statements are true? (multi-correct) (1 point)

Learning requires at least a simulation model for the MDP.
Planning requires at least a simulation model for the MDP.
Learning can be done just using the real-world samples.
Planning can be done just using the real-world samples.

Partially Correct.
Score: 0.5
Accepted Answers:
Planning requires at least a simulation model for the MDP.
Learning can be done just using the real-world samples.

Your score is: 0.5/5



Week 7: Assignment 7 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Which of the following is equal to G_t^{(3)}? (1 point)

R_{t+1} + γR_{t+2} + γ²V(s_{t+2})
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+3})
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+4})
None of the above.

Yes, the answer is correct.
Score: 1
Accepted Answers:
R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³V(s_{t+3})

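A small sketch of the general n-step return that the accepted answer instantiates for n = 3 (illustrative numbers only):

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """n-step return: G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(s_{t+n}).

    rewards:     [R_{t+1}, ..., R_{t+n}]
    v_bootstrap: current estimate V(s_{t+n}) used to bootstrap the tail
    """
    g = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
    return g + (gamma ** n) * v_bootstrap

# A 3-step return with made-up rewards and bootstrap value.
print(n_step_return([1.0, 0.0, 2.0], v_bootstrap=0.5, gamma=0.9, n=3))
```
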

2) State True or False: The idea in Sarsa(λ) is to apply the TD(λ) prediction method to just the states rather than to state-action pairs. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False


3) Is the TD(λ) algorithm, with λ = 1, essentially Monte Carlo? (1 point)

yes
no

No, the answer is incorrect.
Score: 0
Accepted Answers:
yes

4) In solving the control problem, suppose that the first action that is taken is not an optimal action according to the current policy at the start of an episode. Would an update be made corresponding to this action and the subsequent reward received in Watkins's Q(λ) algorithm? (1 point)

Yes
No

Yes, the answer is correct.
Score: 1
Accepted Answers:
Yes

Your score is: 2/4


Week 8: Assignment 8 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) Assuming that we use tile-coding state aggregation, which of the following is the correct update equation for replacing traces, for a tile “h”, when its indicator variable is 1? (1 point)

e_t(h) = γλ e_{t−1}(h) + 1
e_t(h) = γλ e_{t−1}(h)
e_t(h) = 1
Replacing traces are not well defined for tile-coding representations of states.

No, the answer is incorrect.
Score: 0
Accepted Answers:
e_t(h) = 1

2) Assertion: LSPI can be used for continuous action spaces. (1 point)
Reason: We can use samples of (state, action) pairs to train a regressor to approximate the policy.

Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation for Assertion.
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is a correct explanation for Assertion.

3) State true or false for the following statement. (1 point)
Statement: For the LSTD and LSTDQ methods, as we gather more samples, we can incrementally append data to the Ã and b̃ matrices. This allows us to stop and solve for the value of θ_π at any point in a sampled trajectory.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
True

4) Which of the following is the correct update equation for eligibility traces with a linear function approximator, V(s_t) = w^T φ(s_t)? (1 point)

e_t = γλ e_{t−1} + w
e_t = γλ e_{t−1} + φ(s_t)
e_t = γλ e_{t−1} + φ(s_{t−1})
None of the above

Yes, the answer is correct.
Score: 1
Accepted Answers:
e_t = γλ e_{t−1} + φ(s_t)

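A minimal sketch of one TD(λ) step with a linear value function and the accumulating-trace update from the accepted answer (hypothetical 3-dimensional features, illustrative numbers):

```python
import numpy as np

def td_lambda_step(w, e, phi_s, r, phi_s_next, alpha=0.1, gamma=0.99, lam=0.9):
    """One TD(lambda) update with linear V(s) = w^T phi(s) and trace e <- gamma*lambda*e + phi(s_t)."""
    e = gamma * lam * e + phi_s                      # eligibility-trace update (accepted answer)
    delta = r + gamma * w @ phi_s_next - w @ phi_s   # TD error
    w = w + alpha * delta * e                        # credit all recently visited features
    return w, e

w, e = np.zeros(3), np.zeros(3)
w, e = td_lambda_step(w, e, phi_s=np.array([1., 0., 0.]), r=1.0, phi_s_next=np.array([0., 1., 0.]))
print(w, e)
```
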

Your score is: 2/4

Week 9: Assignment 9 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True or False: (1 point)
DQN is guaranteed to converge to an optimal policy.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Which of the following is true about DQN? (1 point)

It can be efficiently used for very large state spaces
It can be efficiently used for continuous action spaces

No, the answer is incorrect.
Score: 0
Accepted Answers:
It can be efficiently used for very large state spaces


3) Which of the following is the correct definition of the average reward formulation? (1 point)

ρ(π) = lim_{N→∞} E[r_1 + r_2 + ... + r_N]
ρ(π) = lim_{N→∞} E[r_1 + r_2 + ... + r_N | π]
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N]
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N | π]

No, the answer is incorrect.
Score: 0
Accepted Answers:
ρ(π) = lim_{N→∞} (1/N) E[r_1 + r_2 + ... + r_N | π]

4) The Policy Gradient Theorem does not hold for the average reward formulation. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) How many outputs will we get from the final layer of a DQN network (|S| and |A| represent the total number of states and actions in the environment, respectively)? (1 point)

|S| × |A|
|S|
|A|
None of these

No, the answer is incorrect.
Score: 0
Accepted Answers:
|A|

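A minimal sketch of why the answer is |A|: a DQN maps a state to one Q-value per action, so its final layer has |A| outputs and an action is picked by argmax. The dimensions below are illustrative, and PyTorch is assumed as the framework:

```python
import torch
import torch.nn as nn

state_dim, num_actions = 8, 4            # hypothetical state-feature size and |A| = 4

q_net = nn.Sequential(
    nn.Linear(state_dim, 64),
    nn.ReLU(),
    nn.Linear(64, num_actions),          # final layer: |A| outputs, one Q-value per action
)

state = torch.randn(1, state_dim)        # a dummy state
q_values = q_net(state)                  # shape (1, |A|)
action = q_values.argmax(dim=1)
print(q_values.shape, action.item())
```
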

Your score is: 1/5

Week 10: Assignment 10 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.

Instructions: In the following questions, one or more choices may be correct. Select all that apply.

1) Given a problem with a well-defined hierarchy, what ordering would you expect on the total expected reward for a hierarchically optimal policy (H), a recursively optimal policy (R) and a flat optimal policy (F)? (1 point)

R ≤ F ≤ H
F ≤ R ≤ H
R ≤ H ≤ F
F ≤ H ≤ R

No, the answer is incorrect.
Score: 0
Accepted Answers:
R ≤ H ≤ F


2) Do flat optimal solutions give the solution with the highest expected returns? (1 point)

Yes
No

Yes, the answer is correct.
Score: 1
Accepted Answers:
Yes

3) Which of the following are true for HAMs? (1 point)

The finite state machines in a HAM must be connected in a directed acyclic graph.
A choice state can make a transition to any state in any machine in the HAM.
Stop states terminate the episode.
Action states emit primitive actions.

No, the answer is incorrect.
Score: 0
Accepted Answers:
The finite state machines in a HAM must be connected in a directed acyclic graph.
Action states emit primitive actions.

4) In a Hierarchy of Abstract Machines, the core MDP state changes only when we visit a choice state. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

5) State True or False: (1 point)
Flat optimal policies are always easier to learn than recursively optimal policies.

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

Your score is: 1/5

Week 11: Assignment 11 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.


1) State True or False: (1 point)
Each sub-task M_i in the MAXQ framework is not an SMDP.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
False

2) Recall that in MAXQ Value Function Decomposition, we draw a “call graph” where nodes are ‘tasks’ and edges show the dependency of the tasks. Which of the following is true about the graph? (1 point)

The graph must be a tree
The graph must be a DAG
The graph can be any regular graph without self loops
Any directed graph can be a call graph

Yes, the answer is correct.
Score: 1
Accepted Answers:
The graph must be a DAG


3) State True or False: (1 point)
In the MAXQ framework, rewards of the core MDP are not available while learning the policies of the sub-tasks, i.e., the agent is restricted to the corresponding sub-task's pseudo-rewards.

True
False

4) State True or False: (1 point)
In the MAXQ framework, the expected reward function R̄(s, a) of the SMDP corresponding to sub-task M_i is equivalent to the projected value function V^{π_i}(a, s).

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False

Your score is: 2/4

on=106)

Week 12: Assignment 12 (Non-Graded)

Note: This assignment is only for practice purposes and will not be counted towards the final score.

Instructions: In the following questions, one or more choices may be correct. Select all that apply.


1) Suppose that we solve a POMDP using a Q-MDP like solution discussed in the lectures - where we assume that the MDP is known and solve it to learn Q values for the true (state, action) pairs. Which of the following are true? (1 point)

We can recover a policy for execution in the partially observable environment by weighting Q values by the belief distribution bel so that π(s) = argmax_a Σ_s bel(s) Q(s, a).
We can recover an optimal policy for the POMDP from the Q values that have been learnt for the true (state, action) pairs.
Policies recovered from Q-MDP like solution methods are always better than policies learnt by history based methods.
None of the above

Yes, the answer is correct.
Score: 1
Accepted Answers:
We can recover a policy for execution in the partially observable environment by weighting Q values by the belief distribution bel so that π(s) = argmax_a Σ_s bel(s) Q(s, a).

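A tiny sketch of the Q-MDP action selection in the accepted answer (the belief and Q values below are hypothetical, just to show the weighting):

```python
import numpy as np

def qmdp_action(belief, Q):
    """Q-MDP action selection: argmax_a sum_s bel(s) Q(s, a).

    belief: bel(s), a probability distribution over states, shape (|S|,)
    Q:      Q(s, a) learned for the underlying fully observable MDP, shape (|S|, |A|)
    """
    return int(np.argmax(belief @ Q))

belief = np.array([0.5, 0.3, 0.2])       # hypothetical belief over 3 states
Q = np.array([[1.0, 0.0],
              [0.2, 0.9],
              [0.0, 2.0]])               # hypothetical Q values, 2 actions
print(qmdp_action(belief, Q))            # -> 1 for these numbers
```
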

https://onlinecourses.nptel.ac.in/noc24_cs52/unit?unit=109&assessment=199 1/4
4/21/24, 3:33 AM Reinforcement Learning - - Unit 15 - Week 12

2) Consider the below grid-world: 1 point


Week 8 ()

Week 9 ()

Week 10 ()

Week 11 ()

Week 12 ()

POMDP
Introduction
(unit?
unit=109&less
on=110)

Solving
POMDP (unit?
unit=109&less In the figure above, black squares are blocked. Assume the agent can see one step in the 4
on=111) cardinal directions. Assume that the agent’s observations are always correct and that there is no
prior information given regarding the states.
Week 12:
Solutions
Assertion: If the observation is that there are no obstruction to the East or West, but are present
(unit?
to the North and South, the belief that the agent is in the green shaded square is 0.5.
unit=109&less
on=113) Reason: Only the green and blue shaded squares have obstructions to the North and South, but
not to the East or West.
Practice:
Week 12 : Assertion and Reason are both true and Reason is a correct explanation of Assertion.
Assignment
Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
12(Non
Graded) Assertion is true but Reason is false.
(assessment? Assertion and Reason are both false.
name=199)
No, the answer is incorrect.
Week 12 Score: 0
Feedback Accepted Answers:
Form : Assertion and Reason are both false.
3) Consider the grid world shown below. Walls and obstacles are colored gray. An agent is dropped into one of the unoccupied cells of the environment uniformly at random. The agent is equipped with a sensor that can detect the presence of walls or obstacles immediately to its North, South, East or West. However, the sensor is noisy, and an observation made in each direction may be wrong with a probability of 0.1. Given that the agent senses no obstacles in any direction, what is the probability that it was dropped into the cell marked ‘x’? (1 point)
[Grid figure not included.]

1/5
82/91
164/173
None of the above.

No, the answer is incorrect.
Score: 0
Accepted Answers:
None of the above.


4) In the same environment as Question 3, what is the probability that the agent was not dropped onto the cell marked ‘x’, if the observation made is that there are obstacles present only to the North and to the South? (1 point)

4/5
82/91
164/173
None of the above.

No, the answer is incorrect.


Score: 0
Accepted Answers:
164/173
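Both of these questions follow the same Bayes-rule pattern: the posterior over cells is proportional to the prior times the likelihood of the four independent (possibly wrong) sensor readings. Since the grid figure is not reproduced here, the sketch below uses hypothetical cell signatures only to illustrate the computation, not to reproduce 82/91 or 164/173:

```python
import numpy as np

# Hypothetical example only (the actual grid figure is not available in this extract).
# Each cell is described by which of its 4 neighbours are obstacles, in (N, S, E, W) order.
cells = {
    "x": (1, 1, 0, 0),          # obstacles to N and S (assumed signature)
    "a": (1, 1, 0, 0),
    "b": (0, 0, 1, 1),
    "c": (0, 0, 0, 0),
}
obs = (1, 1, 0, 0)              # the agent senses obstacles only to N and S
p_correct = 0.9                 # each direction is read correctly with probability 0.9

prior = 1.0 / len(cells)        # dropped uniformly at random over unoccupied cells
likelihood = {
    name: np.prod([p_correct if o == t else 1 - p_correct for o, t in zip(obs, truth)])
    for name, truth in cells.items()
}
z = sum(prior * l for l in likelihood.values())
posterior = {name: prior * l / z for name, l in likelihood.items()}
print(posterior)                # belief over cells; P(not 'x') = 1 - posterior['x']
```
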

5) Assertion: In partially observable systems, histories that include both the sequence of observations and the sequence of actions are typically able to disambiguate the true state of an agent better than histories that include only the sequence of observations. (1 point)
Reason: Different sequences of actions can lead to different interpretations of the sequence of sensor observations.

Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation of the Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.



Your score is: 1/5
