Non-Graded: Assignment 1
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) Which of the following best suits the notion of 'regret' in a standard multi-arm bandit problem? Assume bandits are stationary. (1 point)

"The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward achieved by choosing the worst action from the beginning."
"The number of time steps required for the solution method to find the optimal action."
"The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method."
"The difference between the best possible reward that can be sampled by selecting the optimal action and the worst possible reward that can be sampled by selecting the optimal action."

No, the answer is incorrect.
Score: 0
Accepted Answers:
"The difference between the total reward that could have been achieved by choosing the optimal action from the beginning and the total reward accumulated by our solution method."
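As a worked illustration of the accepted definition (added here, not part of the quiz), the sketch below assumes Bernoulli arms and an ε-greedy learner, and measures regret as the optimal expected total reward minus what the learner actually accumulated:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stationary Bernoulli bandit: arm i pays 1 with probability means[i].
means = np.array([0.3, 0.5, 0.7])
T = 10_000
eps = 0.1

Q = np.zeros(len(means))   # sample-average value estimates
n = np.zeros(len(means))   # pull counts
total_reward = 0.0

for t in range(T):
    # epsilon-greedy action selection
    a = rng.integers(len(means)) if rng.random() < eps else int(np.argmax(Q))
    r = float(rng.random() < means[a])
    n[a] += 1
    Q[a] += (r - Q[a]) / n[a]  # incremental mean update
    total_reward += r

# Regret: expected total reward of always playing the optimal arm,
# minus the total reward our solution method actually accumulated.
regret = T * means.max() - total_reward
print(f"regret after {T} steps: {regret:.1f}")
```

Because the bandit is stationary, `T * means.max()` is exactly the expected total reward of choosing the optimal action from the beginning.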
2) The credit assignment problem is the issue of assigning a correct mapping of rewards accumulated to the action(s) that led to them. Which of the following is the reason for the credit assignment problem in RL? (1 point)

Rewards are restricted to be a scalar value
Rewards are delayed in the RL setting
Agent cannot observe the reward
3) (1 point)

0
0.1
0.23
0.31

Yes, the answer is correct.
Score: 1
Accepted Answers:
0.23
4) Consider the following statements: (1 point)

i. The rewards are obtained at a fixed time after taking an action.
ii. Reinforcement Learning is neither supervised nor unsupervised learning.
iii. Two reinforcement learning agents can learn by playing against each other.
iv. Always selecting the action with maximum reward will automatically maximize the winning probability in a game.

Which of the above statements is/are correct?

i, ii, iii
ii
ii, iii
iii, iv
Your score is: 1/4
Non-Graded: Assignment 2
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) Which of the following is true of the UCB algorithm? (1 point)

The action with the highest Q value is chosen at every iteration.
After a very large number of iterations, the confidence intervals of unselected actions will not change much.
The true expected value of an action always lies within its estimated confidence interval.
With a small probability ε, we select a random action to ensure adequate exploration of the action space.

No, the answer is incorrect.
Score: 0
Accepted Answers:
After a very large number of iterations, the confidence intervals of unselected actions will not change much.
2) In a 4-arm bandit problem, after executing 100 iterations of the UCB algorithm, the estimates of the Q values are $Q_{100}(1) = 1.73$, $Q_{100}(2) = 1.83$, $Q_{100}(3) = 1.89$, $Q_{100}(4) = 1.55$, and the numbers of times each arm has been sampled are $n_1 = 25$, $n_2 = 20$, $n_3 = 30$, $n_4 = 15$. Which arm will be … (1 point)
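The question is cut off at the page break, but it presumably asks which arm UCB selects next. A quick check of that computation (added here), assuming the standard UCB1 bonus $\sqrt{2\ln t / n_i}$ — the lectures may parameterize the bonus differently:

```python
import numpy as np

# Estimates and pull counts from the question, at t = 100 total pulls.
Q = np.array([1.73, 1.83, 1.89, 1.55])
n = np.array([25, 20, 30, 15])
t = n.sum()  # 100

# UCB1 score: estimate plus an exploration bonus that shrinks with more pulls.
ucb = Q + np.sqrt(2 * np.log(t) / n)
print(ucb)                      # per-arm scores
print(int(np.argmax(ucb)) + 1)  # 1-indexed arm with the highest score
```

With these numbers the exploration bonus favours the under-sampled arm 2 (score ≈ 2.51) over arm 3's higher estimate (score ≈ 2.44).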
i. $T_i(n) = 1 + \sum_{m=k+1}^{n} \mathbb{1}\{I_m = i\}$

iv. $T_i(n) \le 1 + \sum_{m=k+1}^{n} \mathbb{1}\{Q_{m-1}(a^*) \le Q_{m-1}(i) + C_{m-1,\,T_i(m-1)}\}$
Week 3: Assignment 3 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) In the full RL problem, which of the following is determined by the agent? (1 point)

State
Reward
Action
None of the above

No, the answer is incorrect.
Score: 0
Accepted Answers:
Action
2) Which of the following statements is true about the RL problem? (1 point)
3) Let us say we are taking actions according to a Gaussian distribution with parameters $\mu$ and $\sigma$. We update the parameters according to REINFORCE, and $a_t$ denotes the action taken at step $t$. (1 point)

(i) $\mu_{t+1} = \mu_t + \alpha r_t \dfrac{a_t - \mu_t}{\sigma_t^2}$

(ii) $\mu_{t+1} = \mu_t + \alpha r_t \dfrac{\mu_t - a_t}{\sigma_t^2}$

(iii) $\sigma_{t+1} = \sigma_t + \alpha r_t \dfrac{(a_t - \mu_t)^2}{\sigma_t^3}$

(iv) …
Which of the above updates are correct?

(i), (iii)
(i), (iv)
(ii), (iv)
(ii), (iii)

No, the answer is incorrect.
Score: 0
Accepted Answers:
(i), (iv)
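To see why the $(a_t - \mu_t)/\sigma_t^2$ form in update (i) is the right one (an added check, not quiz material): REINFORCE ascends $r_t \nabla \ln \pi(a_t)$, and for a Gaussian policy $\partial_\mu \ln \pi(a; \mu, \sigma) = (a - \mu)/\sigma^2$. A quick finite-difference verification:

```python
import math

# Log-density of a Gaussian policy pi(a; mu, sigma).
def log_pi(a, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (a - mu) ** 2 / (2 * sigma**2)

a, mu, sigma, h = 1.3, 0.5, 2.0, 1e-6

# Finite-difference gradient wrt mu vs. the closed form (a - mu) / sigma^2
# that appears in update (i).
fd = (log_pi(a, mu + h, sigma) - log_pi(a, mu - h, sigma)) / (2 * h)
closed = (a - mu) / sigma**2
print(fd, closed)  # both ~0.2; they agree to ~1e-9
```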
4) Assertion: Contextual bandits can be modeled as a full reinforcement learning problem. (1 point)
Reason: We can define an MDP with $n$ states, where $n$ is the number of bandits. The number of actions from each state corresponds to the arms in each bandit, with every action leading to termination of the episode and giving a reward according to the corresponding bandit and arm.

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion and Reason are both true and Reason is a correct explanation of Assertion
5) Recall that for discounted returns, (1 point)

$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$

where $\gamma$ is a discount factor. Which of the following best explains what happens when $\gamma > 1$ (say $\gamma = 5$)?
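For intuition about what this question probes (an added illustration): with $\gamma > 1$ the terms of $G_t$ grow instead of shrink, so even a constant reward stream gives a diverging return:

```python
# Partial sums of G_t = sum_k gamma^k * r, with constant reward r = 1.
for gamma in (0.9, 5.0):
    g, total = 1.0, 0.0
    for _ in range(20):
        total += g
        g *= gamma
    print(f"gamma={gamma}: 20-step partial sum = {total:.3g}")
# gamma=0.9 approaches the limit 1/(1-0.9) = 10; gamma=5 blows up (~2.4e13).
```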
Week 4: Assignment 4 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
$v^*(s) = \max_a \sum_{s'} p(s'|s, a)\, v^*(s')$

$v^*(s) = \max_a \sum_{s'} p(s'|s, a)\, [\gamma\, \mathbb{E}[r|s, a, s'] + v^*(s')]$

$v^*(s) = \max_a \sum_{s'} p(s'|s, a)\, \gamma\, [\mathbb{E}[r|s, a, s'] + v^*(s')]$
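For reference (an added note): the textbook Bellman optimality equation puts the discount only on the successor value, $v^*(s) = \max_a \sum_{s'} p(s'|s,a)\,[\mathbb{E}[r|s,a,s'] + \gamma v^*(s')]$, and that is exactly what a value-iteration backup computes. A minimal sketch on a made-up 2-state, 2-action MDP:

```python
import numpy as np

# Hypothetical MDP: P[a, s, s'] transition probs, R[a, s, s'] = E[r | s, a, s'].
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.5, 0.5], [1.0, 0.0]]])
gamma = 0.9

v = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: v(s) = max_a sum_s' p(s'|s,a) [E[r|s,a,s'] + gamma v(s')]
    q = (P * (R + gamma * v[None, None, :])).sum(axis=2)  # q[a, s]
    v = q.max(axis=0)
print(v)  # converges to v* by the contraction property
```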
4) State True/False: (1 point)
In MDPs, there is a unique resultant state for any given state-action pair.

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False
5) For an operator $L$, let $x$ be one of its fixed points. Consider the following statements: (1 point)

(i) $L^m x = L^n x$, where $m, n \in \mathbb{N}$

(ii) $L^2 x = x$
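For intuition (an added sketch): if $Lx = x$, then applying $L$ any number of times leaves $x$ unchanged, so $L^m x = L^n x = x$ for all $m, n$, and (ii) is just the special case $m = 2$. A numeric example with a hypothetical affine contraction:

```python
# A contraction on the reals: L(x) = 0.5*x + 1, with fixed point x* = 2.
L = lambda x: 0.5 * x + 1.0

def iterate(L, x, m):
    """Apply L to x, m times (computes L^m x)."""
    for _ in range(m):
        x = L(x)
    return x

x_star = 2.0                   # L(2) = 2, so 2 is a fixed point
print(L(x_star))               # 2.0
print(iterate(L, x_star, 3),   # L^3 x* = 2.0
      iterate(L, x_star, 7))   # L^7 x* = 2.0, hence L^m x* = L^n x*
```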
Week 5: Assignment 5 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) Let the policy and value function obtained by applying policy iteration on a finite MDP be $\pi_n$ and $v_{\pi_n}$, after the stopping criterion is met. Then: (1 point)

$\pi_n$ will be the optimal policy but $v_{\pi_n}$ might not be the optimal value function.
$\pi_n$ might not be the optimal policy but $v_{\pi_n}$ will be the optimal value function.
Nothing can be said about $\pi_n$ and $v_{\pi_n}$.
$\pi_n$ will be the optimal policy and $v_{\pi_n}$ will be the optimal value function.
2) Assertion: Monte Carlo policy evaluation must use exploring starts in the case of non-deterministic policies. (1 point)
Reason: They have to rely upon exploring starts in the case of non-deterministic policies to ensure adequate sampling of all states.

(Assume all states of the MDP are reachable from all other states.)

Assertion and Reason are both true and Reason is a correct explanation of Assertion
Assertion and Reason are both true and Reason is not a correct explanation of Assertion
Assertion is true and Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both Assertion and Reason are false
3) Which of the following is/are guaranteed to be true at the stopping criterion for value iteration? (1 point)

Note: We stop the algorithm when $\|v_{n+1} - v_n\| < \dfrac{\epsilon(1-\gamma)}{2\gamma}$.

$\|v^\pi - v^*\| \le \epsilon$
$\|v_{n+1} - v^*\| \le \epsilon$
$\|v^\pi - v^*\| \le \epsilon/2$
$\|v_{n+1} - v^*\| \le \epsilon/2$

Partially Correct.
Score: 0.33
Accepted Answers:
$\|v^\pi - v^*\| \le \epsilon$
$\|v_{n+1} - v^*\| \le \epsilon$
4) Consider an MDP where there are $n$ actions ($a \in A$, with $|A| = n$), each of which is applicable in each state $s \in S$. If $\pi$ is an $\epsilon$-soft policy for some $\epsilon > 0$, and $q_\pi$ is the action-value function of the policy $\pi$, consider the following statements and choose the appropriate option. (1 point)

Assertion: Any $\epsilon$-greedy policy with respect to $q_\pi$ is strictly better than $\pi$.

Reason: In an $\epsilon$-greedy policy, the action with the maximal estimated action value is chosen with probability $1 - \epsilon + \epsilon/n$, and other actions are chosen at random with probability $\epsilon/n$.

Both assertion and reason are true, and reason is the correct explanation for the assertion.
Both assertion and reason are true, but reason is not the correct explanation for the assertion.
Assertion is false but reason is true.
Assertion is true but reason is false.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is false but reason is true.
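To make the probabilities in the Reason concrete (an added sketch; the q values are made up):

```python
import numpy as np

def epsilon_greedy_probs(q, eps):
    """Action distribution of an epsilon-greedy policy w.r.t. estimates q."""
    n = len(q)
    probs = np.full(n, eps / n)          # every action gets eps/n ...
    probs[int(np.argmax(q))] += 1 - eps  # ... and the greedy action gets the rest
    return probs

q = np.array([0.1, 0.7, 0.4, 0.2])
print(epsilon_greedy_probs(q, eps=0.2))
# [0.05 0.85 0.05 0.05]: greedy prob = 1 - 0.2 + 0.2/4 = 0.85
```

Note the greedy action's probability is $(1-\epsilon) + \epsilon/n$, since it can also be drawn during the random exploration step.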
Week 6: Assignment 6 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) Which part of the equation is the TD error ($\delta$)? (1 point)

$\delta = R_{t+1} + \gamma\, Q_{old}(s_{t+1}, a_{t+1})$

$\delta = Q_{old}(s_{t+1}, a_{t+1}) - Q_{old}(s_t, a_t)$
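Assuming the equation referred to is the usual SARSA update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta$ (an added note, since only fragments of the options survive), the full TD error combines both pieces above:

```python
def sarsa_td_error(r_next, q_next, q_cur, gamma):
    """delta = R_{t+1} + gamma * Q_old(s_{t+1}, a_{t+1}) - Q_old(s_t, a_t)."""
    return r_next + gamma * q_next - q_cur

# e.g. R = 1.0, Q_old(s', a') = 2.0, Q_old(s, a) = 1.5, gamma = 0.9 -> delta = 1.3
print(sarsa_td_error(1.0, 2.0, 1.5, 0.9))
```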
2) Which of the following are True for TD(0)? (Assume that the environment is truly Markov.) (1 point)

It uses the full return to update the value of states.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given a finite number of samples.
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.
TD error is given by "$\delta = v_{new}(s_t, a_t) - v_{old}(s_t, a_t)$".

No, the answer is incorrect.
Score: 0
Accepted Answers:
Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.
3) Consider the following equation for weighted Importance Sampling: (1 point)

$\sum_{i=1}^{N} f(x_i)\, \dfrac{p(x_i)}{q(x_i)}$

where $x_i$ is a trajectory, $f(x_i)$ is the return of the trajectory, $p(x_i)$ is the probability of trajectory $x_i$ when following policy $\pi$, and $q(x_i)$ is the probability of trajectory $x_i$ when following policy $\mu$. Then:

$\pi$ is the estimation policy and $\mu$ is the behaviour policy
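A minimal sketch of this off-policy estimator (added; the trajectory returns and probabilities below are made up). It also shows the normalization that distinguishes weighted from ordinary importance sampling:

```python
import numpy as np

# Hypothetical per-trajectory returns and probabilities under the
# estimation policy pi (p) and the behaviour policy mu (q).
f = np.array([1.0, 0.0, 2.0, 1.0])      # returns f(x_i)
p = np.array([0.20, 0.10, 0.05, 0.30])  # p(x_i) under pi
q = np.array([0.25, 0.25, 0.25, 0.25])  # q(x_i) under mu (the sampling policy)

w = p / q                               # importance ratios
ordinary = np.sum(f * w) / len(f)       # ordinary IS estimate of E_pi[f]
weighted = np.sum(f * w) / np.sum(w)    # weighted IS: normalize by the ratios
print(ordinary, weighted)
```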
4) Assertion: SARSA is an on-policy method. (1 point)
Reason: In SARSA, we do not update the action that was actually used, so it is an on-policy method.

Both Assertion and Reason are true, and Reason is the correct explanation for the Assertion.
Both Assertion and Reason are true, but Reason is not the correct explanation for the Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion is true, Reason is false
5) Which of the following statements are true? (multi-correct) (1 point)

Score: 0.5
Accepted Answers:
Planning requires at least a simulation model for the MDP.
Learning can be done just using the real-world samples.
Week 7: Assignment 7 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) (1 point)

$R_{t+1} + \gamma R_{t+2} + \gamma^2 V(s_{t+2})$

$R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(s_{t+3})$

$R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V(s_{t+4})$
2) State True or False: The idea in Sarsa(λ) is to apply the TD(λ) prediction method to just the states rather than to state-action pairs. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False
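As a reminder of what Sarsa(λ) actually does (an added sketch, assuming tabular accumulating traces): the TD(λ) machinery is applied to state-action pairs, so the trace table has one entry per (s, a):

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
e = np.zeros_like(Q)                 # eligibility traces over (s, a) pairs
alpha, gamma, lam = 0.1, 0.9, 0.8

def sarsa_lambda_step(s, a, r, s_next, a_next):
    """One Sarsa(lambda) backup: traces are kept per state-action pair."""
    global Q, e
    delta = r + gamma * Q[s_next, a_next] - Q[s, a]
    e[s, a] += 1.0                   # accumulating trace on the visited pair
    Q += alpha * delta * e           # every eligible pair shares the TD error
    e *= gamma * lam                 # traces decay

sarsa_lambda_step(0, 1, 1.0, 2, 0)
print(Q[0, 1])  # 0.1, since delta = 1 + 0.9*0 - 0 = 1
```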
Week 8: Assignment 8 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
1) Assuming that we use tile-coding state aggregation, which of the following is the correct update equation for replacing traces, for a tile $h$, when its indicator variable is 1? (1 point)

$\vec{e}_t(h) = \gamma\lambda\, \vec{e}_{t-1}(h) + 1$

$\vec{e}_t(h) = \gamma\lambda\, \vec{e}_{t-1}(h)$

$\vec{e}_t(h) = 1$

Replacing traces are not well defined for tile-coding representations of states.

No, the answer is incorrect.
Score: 0
Accepted Answers:
$\vec{e}_t(h) = 1$
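A small added sketch contrasting the accepted replacing-trace update with the accumulating alternative from the first option, for tiles whose indicator is 1:

```python
import numpy as np

n_tiles = 8
e_acc = np.zeros(n_tiles)   # accumulating traces
e_rep = np.zeros(n_tiles)   # replacing traces
gamma, lam = 0.9, 0.8

def trace_step(active):
    """Decay all traces, then update the tiles whose indicator is 1."""
    global e_acc, e_rep
    e_acc *= gamma * lam
    e_rep *= gamma * lam
    e_acc[active] += 1.0    # accumulate: e <- gamma*lambda*e + 1
    e_rep[active] = 1.0     # replace:    e <- 1

for _ in range(3):          # revisit the same tiles three times
    trace_step([2, 5])
print(e_acc[2], e_rep[2])   # ~2.24 vs exactly 1.0: replacing traces stay bounded
```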
2) (1 point)

Both Assertion and Reason are true, and Reason is the correct explanation for the Assertion.
Both Assertion and Reason are true, but Reason is not the correct explanation for the Assertion.
Assertion is true and Reason is false

Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is the correct explanation for the Assertion.
3) State true or false for the following statement. (1 point)
Statement: For the LSTD and LSTDQ methods, as we gather more samples, we can incrementally append data to the $\tilde{A}$ and $\tilde{b}$ matrices. This allows us to stop and solve for the value of $\theta^\pi$ at any point in a sampled trajectory.

True
False

Yes, the answer is correct.
Score: 1
Accepted Answers:
True
4) Which of the following is the correct update equation for eligibility traces with a linear function approximator, $V(s_t) = w^\top \phi(s_t)$? (1 point)

$\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + w$

$\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \phi(s_t)$

$\vec{e}_t = \gamma\lambda\, \vec{e}_{t-1} + \phi(s_{t-1})$
Your score is: 2/4
Week 9: Assignment 9 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
3) Which of the following is the correct definition of the average reward formulation? (1 point)

$\rho(\pi) = \lim_{N \to \infty} \mathbb{E}[r_1 + r_2 + \ldots + r_N]$

$\rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}[r_1 + r_2 + \ldots + r_N]$

$\rho(\pi) = \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}[r_1 + r_2 + \ldots + r_N \mid \pi]$

$|S|$

$|A|$

None of these

No, the answer is incorrect.
Score: 0
Accepted Answers:
$|A|$
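For intuition about the $\frac{1}{N}$-normalized definition (an added simulation on a hypothetical two-state chain under a fixed policy):

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed policy on a 2-state Markov chain: P[s, s'] and reward r[s].
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])

s, total, N = 0, 0.0, 100_000
for _ in range(N):
    total += r[s]
    s = rng.choice(2, p=P[s])

print(total / N)  # long-run average reward rho(pi); here ~0.83 (= 5/6)
```

Without the $\frac{1}{N}$ factor the expectation diverges for any policy with positive average reward, which is why the normalized forms are the plausible candidates here.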
Your score is: 1/5
Week 10: Assignment 10 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
Instructions: In the following questions, one or more choices may be correct. Select all that apply.
1) Given a problem with a well-defined hierarchy, what ordering would you expect on the total expected reward for a hierarchically optimal policy (H), a recursively optimal policy (R), and a flat optimal policy (F)? (1 point)

R ≤ F ≤ H
F ≤ R ≤ H
R ≤ H ≤ F
F ≤ H ≤ R
2) Do flat optimal solutions give the solution with the highest expected returns? (1 point)

Yes
No

Yes, the answer is correct.
Score: 1
Accepted Answers:
Yes
3) Which of the following are true for HAMs? (1 point)

The finite state machines in a HAM must be connected in a directed acyclic graph.
A choice state can make a transition to any state in any machine in the HAM.
Stop states terminate the episode.
Action states emit primitive actions.

No, the answer is incorrect.
Score: 0
Accepted Answers:
The finite state machines in a HAM must be connected in a directed acyclic graph.
Action states emit primitive actions.
4) In a Hierarchy of Abstract Machines, the core MDP state changes only when we visit a choice state. (1 point)

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False
5) (1 point)

False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False
Your score is: 1/5
Week 11: Assignment 11 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
4) State True or False: (1 point)
In the MAXQ framework, the expected reward function $\bar{R}(s, a)$ of the SMDP corresponding to …

True
False

No, the answer is incorrect.
Score: 0
Accepted Answers:
False
Your score is: 2/4
Week 12: Assignment 12 (Non-Graded)
Assignment not submitted
Note: This assignment is only for practice purposes and will not be counted towards the final score.
Instructions: In the following questions, one or more choices may be correct. Select all that apply.
1) Suppose that we solve a POMDP using a Q-MDP-like solution discussed in the lectures, where we assume that the MDP is known and solve it to learn Q values for the true (state, action) pairs. Which of the following are true? (1 point)

We can recover a policy for execution in the partially observable environment by weighting Q values by the belief distribution $bel$, so that $\pi(s) = \arg\max_a \sum_s bel(s)\, Q(s, a)$.
We can recover an optimal policy for the POMDP from the Q values that have been learnt for the true (state, action) pairs.
Policies recovered from Q-MDP-like solution methods are always better than policies learnt by history-based methods.
None of the above
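A sketch of the Q-MDP heuristic described in the first option (added; the Q table and belief are made up):

```python
import numpy as np

# Q values learnt for the fully observable MDP: Q[s, a].
Q = np.array([[1.0, 0.2],
              [0.1, 0.8],
              [0.5, 0.5]])
bel = np.array([0.6, 0.3, 0.1])   # current belief over the 3 states

# Q-MDP: score each action by its belief-weighted Q value, then act greedily.
action_values = bel @ Q           # sum_s bel(s) * Q(s, a)
a = int(np.argmax(action_values))
print(action_values, a)           # [0.68 0.41] -> action 0
```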
2) (1 point)

[Figure: a grid world in which black squares are blocked; one square is shaded green and one shaded blue.]

In the figure above, black squares are blocked. Assume the agent can see one step in the 4 cardinal directions. Assume that the agent's observations are always correct and that there is no prior information given regarding the states.
Assertion: If the observation is that there are no obstructions to the East or West, but there are to the North and South, the belief that the agent is in the green shaded square is 0.5.
Reason: Only the green and blue shaded squares have obstructions to the North and South, but not to the East or West.

Assertion and Reason are both true and Reason is a correct explanation of Assertion.
Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
Assertion is true but Reason is false.
Assertion and Reason are both false.

No, the answer is incorrect.
Score: 0
Accepted Answers:
Assertion and Reason are both false.
3) Consider the grid world shown below. Walls and obstacles are colored gray. An agent is dropped into one of the unoccupied cells of the environment uniformly at random. The agent is equipped with a sensor that can detect the presence of walls or obstacles immediately to its North, South, East or West. However, the sensor is noisy, and an observation made in each direction may be wrong with a probability of 0.1. Given that the agent senses no obstacles in any direction, what is the probability that it was dropped into the cell marked 'x'? (1 point)

[Figure: gray-shaded grid world with one cell marked 'x'.]

1/5
82/91
164/173
None of the above.

No, the answer is incorrect.
Score: 0
Accepted Answers:
None of the above.
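The Bayes computation this question calls for, sketched with hypothetical cell counts since the grid itself is not reproduced: each unoccupied cell gets a uniform prior, and the likelihood of the observation is $0.9^c \cdot 0.1^{4-c}$, where $c$ is the number of directions in which the observation matches the cell's true surroundings:

```python
import numpy as np

# Hypothetical environment: for each unoccupied cell, the number of the 4
# directions in which the observation "no obstacle anywhere" is correct.
matches = np.array([4, 4, 3, 2, 2])  # say cell 0 is 'x' and is fully open

likelihood = 0.9 ** matches * 0.1 ** (4 - matches)
posterior = likelihood / likelihood.sum()   # uniform prior cancels out
print(posterior[0])  # belief that the agent was dropped into cell 'x'
```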
4) In the same environment as Question 3, what is the probability that the agent was not dropped onto the cell marked 'x', if the observation made is that there are obstacles present only to the North and to the South? (1 point)

4/5
82/91
164/173
None of the above.
5) Assertion: In partially observable systems, histories that include both the sequence of observations and the sequence of actions are typically able to disambiguate the true state of an agent better than histories that include only the sequence of observations. (1 point)
Reason: Different sequences of actions can lead to different interpretations of the sequence of sensor observations.

Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.
Both Assertion and Reason are true, but Reason is not a correct explanation of the Assertion.
Assertion is true, Reason is false
Both Assertion and Reason are false

Score: 0
Accepted Answers:
Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.