Assignment 4 (Sol.) : Reinforcement Learning
Reinforcement Learning
Prof. B. Ravindran
(f) (Q, O ∧ ¬I, L, −1)
(g) (Q, ¬O ∧ I, Q, +1)
(h) (Q, ¬O ∧ ¬I, L, −1)
2. Based on the above problem description, what advice will you give to At Wit’s End?
(a) if there is laughter, play the organ and do not burn incense; if the room is quiet, play the
organ and burn incense
(b) never play the organ, always burn incense
(c) always play the organ, never burn incense
(d) if there is laughter, play the organ; if the room is quiet, do not play the organ and burn
incense
Sol. (d)
3. If a policy is greedy with respect to its own value function, then it is an optimal policy.
(a) false
(b) true
Sol. (b)
Consider the value function corresponding to an arbitrary policy π. If we derive a policy that is
greedy with respect to this value function, then by the policy improvement theorem we are
guaranteed to get a policy that is at least as good as π. If π is itself greedy with respect to vπ,
no further improvement is possible: vπ(s) = max_a qπ(s, a) for every state s, which is exactly the
Bellman optimality equation, so vπ = v∗. Hence, if a policy is greedy with respect to its own
value function, it is optimal.
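For reference, the policy improvement theorem invoked above can be stated as follows (a standard
statement, added here for completeness, in the notation used elsewhere in this assignment):
$$q_\pi(s, \pi'(s)) \ge v_\pi(s) \ \text{ for all } s \quad \Longrightarrow \quad v_{\pi'}(s) \ge v_\pi(s) \ \text{ for all } s.$$
In particular, a policy π′ that is greedy with respect to vπ satisfies
qπ(s, π′(s)) = max_a qπ(s, a) ≥ vπ(s), so vπ′ ≥ vπ.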
4. Consider a 4 × 4 grid world problem where the goal is to reach either the top left corner or
the bottom right corner. The agent can choose from four actions {up, down, left, right} which
deterministically cause the corresponding state transitions, except that actions that would take
the agent off the grid leave the state unchanged. We model this as an undiscounted, episodic
task, where the reward is -1 for all transitions. Suppose that the agent follows the equiprobable
random policy. Given below is the partial value function for this problem. What, respectively,
are the missing values in the first and second rows? (Hint: the Bellman equation must hold
for every state.)
[Figure: partial value function for the 4 × 4 gridworld]
(a) -20, -14
(b) -14, -20
(c) -14, -18
(d) -20, -18
Sol. (b)
For the value in the first row, we have
$$v_\pi(s) = \sum_{a} \pi(a|s) \sum_{s'} p(s'|s,a)\,[r + v_\pi(s')]$$
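As a cross-check of the Bellman equation above (not part of the original solution), here is a
minimal iterative policy evaluation sketch in Python for the gridworld described in the question;
the 0-indexed grid layout, in-place sweeps, and convergence threshold are my own assumptions.

import numpy as np

# Iterative policy evaluation for the 4 x 4 gridworld of question 4:
# undiscounted, reward -1 on every transition, equiprobable random policy,
# terminal states at the top-left and bottom-right corners.
N = 4
terminals = {(0, 0), (N - 1, N - 1)}
actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

V = np.zeros((N, N))
while True:
    delta = 0.0
    for i in range(N):
        for j in range(N):
            if (i, j) in terminals:
                continue
            new_v = 0.0
            for di, dj in actions:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j  # off-grid actions leave the state unchanged
                new_v += 0.25 * (-1 + V[ni, nj])  # equiprobable policy, gamma = 1
            delta = max(delta, abs(new_v - V[i, j]))
            V[i, j] = new_v
    if delta < 1e-10:
        break

print(np.round(V))  # first row converges to [0, -14, -20, -22], second row to [-14, -18, -20, -20]

The converged first and second rows contain the values -14 and -20, consistent with option (b).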
5. If π is the equiprobable random policy, what are the respective values of qπ (s1 , down) and qπ (s2 , down)
given that s1 is the last cell in the third row (value -14) and s2 is the last cell in the second
row?
Sol. (a)
For s1, the down action leads to the terminal state (value 0), so
$$q_\pi(s_1, \text{down}) = \sum_{s'} p(s'|s_1, \text{down})\,[r + v_\pi(s')] = -1 + 0 = -1.$$
Similarly, the down action from s2 leads to s1 (value −14), so
$$q_\pi(s_2, \text{down}) = -1 + (-14) = -15.$$
6. In a particular grid-world example, rewards are positive for goals, negative for running into
the edge of the world, and zero the rest of the time. Are the signs of these rewards important,
or only the intervals between them? Prove, using the discounted return equation
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
that adding a constant C to all the rewards adds a constant, K, to the values of all states, and
thus does not affect the relative values of any states under any policies. What is K in terms
of C and γ?
(a) K = 1/(C(1 − γ))
(b) K = C(1/(1 − γ) − 1)
(c) K = C(1/(1 − γ) + 1)
(d) K = C/(1 − γ)
Sol. (d)
Assume that the grid-world problem is a continuing task. For some policy π and state s, the
value function can be given as
$$v_\pi(s) = E_\pi\{G_t \mid s_t = s\}.$$
Using the discounted return equation, we have
$$v_\pi(s) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\Big|\, s_t = s\Big\}.$$
If a constant C is added to every reward, the new value function is
$$v'_\pi(s) = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k (R_{t+k+1} + C) \,\Big|\, s_t = s\Big\} = E_\pi\Big\{\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} + C\sum_{k=0}^{\infty} \gamma^k \,\Big|\, s_t = s\Big\} = v_\pi(s) + \frac{C}{1-\gamma}.$$
We see that adding a constant C to all rewards does not affect the relative values of any states
under any policies. Here K = C/(1 − γ).
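As a quick numeric sanity check of this result (not part of the original solution; the discount
factor, constant, and horizon below are arbitrary choices), the return of any long reward sequence
shifts by approximately C/(1 − γ) when C is added to every reward:

import numpy as np

# Compare the discounted return of a reward sequence with and without a
# constant C added to every reward; the difference should be ~ C / (1 - gamma).
rng = np.random.default_rng(0)
gamma, C, T = 0.9, 2.0, 10_000        # discount, constant, long finite horizon
rewards = rng.normal(size=T)          # arbitrary reward sequence
discounts = gamma ** np.arange(T)
G = discounts @ rewards
G_shifted = discounts @ (rewards + C)
print(G_shifted - G, C / (1 - gamma))  # both print ~ 20.0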
7. Given a reinforcement learning problem, algorithm A will return the optimal state value func-
tion for that problem and algorithm B will return the optimal action value function. Your aim
is to use the value function so obtained to behave optimally in the environment. Assuming
that you know the expected rewards but not the transition probabilities corresponding to the
problem in question, which algorithm would you prefer to use for your control task?
(a) algorithm A
(b) algorithm B
Sol. (b)
Since algorithm B returns the optimal action value function, we can use the information
provided by the optimal action value function to control the behaviour of the agent without
knowledge of the transition probabilities of the underlying MDP.
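To make the point concrete, here is a small illustrative sketch (not from the original solution;
the state/action counts and the randomly generated Q, V, R, and P are made-up stand-ins): acting
greedily with an action value function requires no model, whereas acting greedily with a state
value function requires a one-step lookahead through the transition probabilities.

import numpy as np

# Greedy control from Q needs no model; greedy control from V needs P.
nS, nA, gamma = 5, 3, 0.9
rng = np.random.default_rng(1)
Q = rng.normal(size=(nS, nA))                    # stand-in for the optimal action value function
V = Q.max(axis=1)                                # stand-in for the optimal state value function
R = rng.normal(size=(nS, nA))                    # expected rewards (known, per the question)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # transition probabilities (unknown in the question)

def act_from_Q(s):
    # Model-free: simply pick the action with the largest Q-value.
    return int(np.argmax(Q[s]))

def act_from_V(s):
    # Needs the model: argmax_a R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
    return int(np.argmax(R[s] + gamma * P[s] @ V))

print(act_from_Q(0), act_from_V(0))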
8. In proving that Lπ is a contraction, we had the expression
$$\gamma \sum_{j} p(j|s)\,[v(j) - u(j)] \;\le\; \gamma\,\|v - u\| \sum_{j} p(j|s)$$
Sol. (d)
9. We defined the operator Lπ : V → V as Lπ v = rπ + γPπ v. Having seen the proof of the
Banach fixed point theorem and assuming that vπ and v∗ have their usual meanings, which
among the following are implications of showing that Lπ is a contraction?
(a) v = v′
(b) v ≠ v′
(c) ||Lπ v − Lπ v′|| ≤ λ||v − v′||, 0 ≤ λ < 1
(d) none of the above
Sol. (c)
The first option may not hold if v ≠ vπ. Similarly, the second option may not hold if v = vπ.
The third option is true because Lπ is a contraction, and in all three possible scenarios
(v ≠ v′ ≠ vπ, v ≠ v′ = vπ, and v = v′ = vπ), the statement holds.
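For context, a standard consequence of Lπ being a contraction (not stated in the original
solution): since (V, ||·||) is complete, the Banach fixed point theorem guarantees that Lπ has a
unique fixed point, which is vπ, and repeated application of Lπ converges to it from any starting
point,
$$\|L_\pi^n v - v_\pi\| \le \lambda^n \|v - v_\pi\| \to 0 \quad \text{as } n \to \infty.$$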