
Assignment 4 (Sol.) : Reinforcement Learning
Prof. B. Ravindran

1. You receive the following letter:


Dear Friend,
Some time ago, I bought this old house, but found it to be haunted by ghostly sardonic laughter.
As a result it is hardly habitable. There is hope, however, for by actual testing I have found
that this haunting is subject to certain laws, obscure but infallible, and that the laughter can
be affected by my playing the organ or burning incense. In each minute, the laughter occurs or
not; it shows no degree. What it will do during the ensuing minute depends, in the following
exact way, on what has been happening during the preceding minute:
Whenever there is laughter, it will continue in the succeeding minute unless I play the organ,
in which case it will stop. But continuing to play the organ does not keep the house quiet.
I notice, however, that whenever I burn incense when the house is quiet and do not play the
organ it remains quiet for the next minute.
At this minute of writing, the laughter is going on. Please tell me what manipulations of
incense and organ I should make to get that house quiet, and to keep it so.
Sincerely, At Wit's End
Assume that we make the following decisions in formulating this problem as an MDP:
State set: {L, Q}, where L indicates that there is laughter in the room, and Q indicates that
the room is quiet.
Action set: {O ∧ I, O ∧ ¬I, ¬O ∧ I, ¬O ∧ ¬I}, where O corresponds to playing the organ, and
I corresponds to burning incense.
We consider this as a continuing discounted problem with γ = 0.9 and we let the reward be
+1 on any transition into the silent state, and -1 on any transition into the laughing state.
Assuming deterministic state transitions and rewards based upon current state and action,
which among the following 4-tuples (current state, action, next state, reward) represent correct
state transitions and rewards?

(a) (L, O ∧ I, Q, +1)


(b) (L, O ∧ ¬I, L, −1)
(c) (L, ¬O ∧ I, Q, +1)
(d) (L, ¬O ∧ ¬I, L, −1)
(e) (Q, O ∧ I, Q, +1)

(f) (Q, O ∧ ¬I, L, −1)
(g) (Q, ¬O ∧ I, Q, +1)
(h) (Q, ¬O ∧ ¬I, L, −1)

Sol. (a), (d), (f), (g), (h)


We know that if there is laughter and the organ is played, then in the next step laughter will
stop. This contradicts option (b). Similarly, option (c) indicates that by burning incense alone
laughter can be made to stop, which is incorrect. Option (e) is also not correct because we
know that playing the organ when the house is quiet does not result in the house staying quiet.
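
The deterministic dynamics described above can also be checked with a short Python sketch. This is illustrative only; the encoding of states as strings and actions as (organ, incense) booleans is our own, not part of the assignment.

```python
# A minimal sketch of the haunted-house MDP as a deterministic transition/reward rule.
# States: 'L' (laughter), 'Q' (quiet). Actions: (organ, incense) booleans (illustrative encoding).

def step(state, organ, incense):
    """Next state and reward as described in the letter."""
    if state == 'L':
        next_state = 'Q' if organ else 'L'                    # playing the organ stops the laughter
    else:  # state == 'Q'
        next_state = 'Q' if (incense and not organ) else 'L'  # incense alone keeps the house quiet
    reward = +1 if next_state == 'Q' else -1                  # +1 into Q, -1 into L
    return next_state, reward

# Enumerate all (state, action, next state, reward) tuples to check options (a)-(h):
for s in ('L', 'Q'):
    for organ in (True, False):
        for incense in (True, False):
            print((s, organ, incense) + step(s, organ, incense))
```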

2. Based on the above problem description, what advice will you give to At Wit’s End?

(a) if there is laughter, play the organ and do not burn incense; if the room is quiet, play the
organ and burn incense
(b) never play the organ, always burn incense
(c) always play the organ, never burn incense
(d) if there is laughter, play the organ; if the room is quiet, do not play the organ and burn
incense

Sol. (d)

3. If a policy is greedy with respect to its own value function, then it is an optimal policy.

(a) false
(b) true

Sol. (b)
Consider the value function corresponding to an arbitrary policy π. If we derive a policy that
is greedy with respect to this value function, the policy improvement theorem guarantees that
the derived policy is at least as good as π. If π is itself greedy with respect to its own value
function, then no one-step change of action can improve upon it, so vπ satisfies the Bellman
optimality equation and π is therefore optimal.
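
This can be made concrete with a small policy-iteration sketch on the haunted-house MDP of questions 1 and 2 (re-using the same illustrative encoding, with γ = 0.9 as in question 1). The loop below stops exactly when the current policy is greedy with respect to its own value function, and the policy it stops at is the optimal one recommended in question 2.

```python
# Policy iteration on the haunted-house MDP; illustrative encoding of states and actions.
GAMMA = 0.9
STATES = ['L', 'Q']
ACTIONS = [(o, i) for o in (True, False) for i in (True, False)]  # (organ, incense)

def step(s, a):
    organ, incense = a
    ns = ('Q' if organ else 'L') if s == 'L' else ('Q' if (incense and not organ) else 'L')
    return ns, (1 if ns == 'Q' else -1)

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation for a deterministic policy."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        for s in STATES:
            ns, r = step(s, policy[s])
            v[s] = r + GAMMA * v[ns]
    return v

policy = {s: ACTIONS[0] for s in STATES}
while True:
    v = evaluate(policy)
    greedy = {s: max(ACTIONS, key=lambda a, s=s: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
              for s in STATES}
    if greedy == policy:   # greedy with respect to its own value function -> optimal, stop
        break
    policy = greedy

print(policy, v)  # organ when laughing (incense there is a tie), incense alone when quiet; v(L) = v(Q) = 10
```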

4. Consider a 4 × 4 grid world problem where the goal is to reach either the top left corner or
the bottom right corner. The agent can choose from four actions {up, down, left, right} which
deterministically cause the corresponding state transitions, except that actions that would take
the agent off the grid leave the state unchanged. We model this as an undiscounted, episodic
task, where the reward is -1 for all transitions. Suppose that the agent follows the equiprobable
random policy. Given below is the partial value function for this problem. Calculate, respectively,
the missing values in the first and second rows. (Hint: the Bellman equation must hold
for every state.)

(a) -20, -14
(b) -14, -20
(c) -14, -18
(d) -20, -18

Sol. (b)
For the value in the first row, we have

vπ(s) = Σ_a π(a|s) Σ_{s′} p(s′ | s, a) [r + vπ(s′)]

Each term in the parentheses below is (−1 + value of the successor state) for one of the four
equiprobable actions, where the off-grid action (up) leaves the agent in place:

vπ(s) = 0.25 ∗ (−1 + vπ(s) − 1 − 21 − 19)
vπ(s) = 0.25 vπ(s) − 10.5
0.75 vπ(s) = −10.5
vπ(s) = −14

Similarly, for the missing value in the second row (the last cell of that row, so here the off-grid
action ‘right’ leaves the agent in place), we have

vπ(s) = 0.25 ∗ (−23 + vπ(s) − 1 − 21 − 15)
vπ(s) = 0.25 vπ(s) − 15
0.75 vπ(s) = −15
vπ(s) = −20
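
The full value function can be reproduced with a short iterative policy evaluation sketch (illustrative; the grid indexing, sweep count, and use of NumPy are our own choices). The missing first-row entry comes out as -14 and the last cell of the second row as -20, matching option (b).

```python
import numpy as np

# Iterative policy evaluation for the 4 x 4 gridworld: undiscounted, reward -1 per step,
# equiprobable random policy, terminal states in the top-left and bottom-right corners.
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

v = np.zeros((N, N))
for _ in range(1000):                           # enough synchronous sweeps to converge
    new_v = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            if (r, c) in TERMINALS:
                continue                        # terminal values stay 0
            total = 0.0
            for dr, dc in ACTIONS:
                nr, nc = r + dr, c + dc
                if not (0 <= nr < N and 0 <= nc < N):
                    nr, nc = r, c               # off-grid moves leave the state unchanged
                total += 0.25 * (-1 + v[nr, nc])
            new_v[r, c] = total
    v = new_v

print(np.round(v))   # row 0 gap -> -14, last cell of row 1 -> -20
```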

5. If π is the equiprobable random policy, what are the respective values of qπ(s1, down) and
qπ(s2, down), given that s1 is the last cell in the third row (value -14) and s2 is the last cell
in the second row?

(a) -1, -15


(b) -15, -21
(c) 0, -14
(d) -13, -19

Sol. (a)
For s1, we have

qπ(s1, down) = Σ_{s′} p(s′ | s1, down) [r + vπ(s′)]

qπ(s1, down) = −1 + 0 = −1

Similarly, for s2, we have

qπ(s2, down) = −1 − 14 = −15
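
The same one-step lookahead can be written out directly; a minimal sketch using only the values quoted above (the variable names are our own):

```python
# q_pi(s, a) = sum over s' of p(s'|s, a) [r + v_pi(s')]; with deterministic moves and
# reward -1 per step this reduces to -1 + v_pi(next state).
v_terminal = 0                  # bottom-right terminal state
v_s1 = -14                      # last cell of the third row (the state below s2)
q_s1_down = -1 + v_terminal     # down from s1 reaches the terminal: -1
q_s2_down = -1 + v_s1           # down from s2 reaches s1: -15
print(q_s1_down, q_s2_down)
```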

6. In a particular grid-world example, rewards are positive for goals, negative for running into
the edge of the world, and zero the rest of the time. Are the signs of these rewards important,
or only the intervals between them? Prove, using the discounted return equation

Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1

that adding a constant C to all the rewards adds a constant, K, to the values of all states, and
thus does not affect the relative values of any states under any policies. What is K in terms
of C and γ?
(a) K = 1/(C(1 − γ))
(b) K = C(1/(1 − γ) − 1)
(c) K = C(1/(1 − γ) + 1)
(d) K = C/(1 − γ)

Sol. (d)
Assume that the grid-world problem is a continuing task. For some policy π and state s, the
value function can be given as

vπ(s) = Eπ{Gt | st = s}.

Using the discounted return equation, we have

vπ(s) = Eπ{ Σ_{k=0}^{∞} γ^k Rt+k+1 | st = s }.

Adding a constant C to all rewards, we have

v′π(s) = Eπ{ Σ_{k=0}^{∞} γ^k (Rt+k+1 + C) | st = s }
       = Eπ{ Σ_{k=0}^{∞} γ^k Rt+k+1 + C Σ_{k=0}^{∞} γ^k | st = s }
       = vπ(s) + C/(1 − γ).

We see that adding a constant C to all rewards does not affect the relative values of any states
under any policies. Here K = C/(1 − γ).
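
A quick numeric sanity check of this result (an illustration, not a proof; the reward distribution, horizon, and constant C below are arbitrary choices):

```python
import random

gamma, C, horizon = 0.9, 2.0, 1000
rewards = [random.uniform(-1, 1) for _ in range(horizon)]

# Truncated discounted returns with and without the constant added to every reward.
g = sum(gamma**k * r for k, r in enumerate(rewards))
g_shifted = sum(gamma**k * (r + C) for k, r in enumerate(rewards))

print(g_shifted - g)    # approximately 20.0 (the truncation error is negligible here)
print(C / (1 - gamma))  # K = C / (1 - gamma) = 20.0
```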

7. Given a reinforcement learning problem, algorithm A will return the optimal state value func-
tion for that problem and algorithm B will return the optimal action value function. Your aim
is to use the value function so obtained to behave optimally in the environment. Assuming
that you know the expected rewards but not the transition probabilities corresponding to the
problem in question, which algorithm would you prefer to use for your control task?

(a) algorithm A
(b) algorithm B

Sol. (b)
Since algorithm B returns the optimal action value function, we can use the information
provided by the optimal action value function to control the behaviour of the agent without
knowledge of the transition probabilities of the underlying MDP.
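
The difference between the two cases can be sketched as follows (the dictionary-based representation of the value tables, transition model p, and expected rewards r is an assumption for illustration): acting greedily from the optimal action value function is a table lookup, whereas acting greedily from the optimal state value function requires the transition probabilities for the one-step lookahead.

```python
# Greedy control from q*: no model needed, just the learned table.
def greedy_from_q(q, state, actions):
    return max(actions, key=lambda a: q[(state, a)])

# Greedy control from v*: requires p(s'|s, a) to evaluate the one-step lookahead.
def greedy_from_v(v, state, actions, p, r, gamma):
    return max(actions,
               key=lambda a: sum(p[(state, a, s2)] * (r[(state, a, s2)] + gamma * v[s2])
                                 for s2 in v))
```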
8. In proving that Lπ is a contraction, we had the expression

γ Σ_j p(j|s) [v(j) − u(j)] ≤ γ ||v − u|| Σ_j p(j|s)

This inequality holds because

(a) v(j) − u(j) is a component of ||v − u||


(b) the max norm of the difference on the LHS is less than the max norm of the difference
on the RHS
(c) the difference in the LHS can be negative but the norm in the RHS is non-negative
(d) the max norm on the RHS can at worst be equal to the difference in the LHS

Sol. (d)
For each j, v(j) − u(j) ≤ |v(j) − u(j)| ≤ max_j |v(j) − u(j)| = ||v − u||, so every term on the
left is bounded by the max norm, and at worst the two sides are equal.
9. We defined the operator Lπ : V → V as Lπ v = rπ + γ Pπ v. Having seen the proof of the
Banach fixed point theorem and assuming that vπ and v∗ have their usual meanings, which
among the following are implications of showing that Lπ is a contraction?

(a) vπ is a fixed point of Lπ
(b) vπ is the unique fixed point of Lπ
(c) repeatedly applying Lπ starting with an arbitrary v ∈ V results in convergence to vπ
(d) repeatedly applying Lπ starting with an arbitrary v ∈ V results in convergence to v∗

Sol. (b), (c)
Note that while the statement of option (a) is true, it is a result of the Bellman equation and
the definition of the Lπ operator rather than an implication of the contraction property. Option
(d) is not true since repeated application of the operator guarantees convergence only to vπ and
not to the optimal v∗.
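
The convergence in options (b) and (c) can be illustrated numerically (a minimal sketch; the 2 × 2 transition matrix below encodes the haunted-house policy from question 2 and is our own construction, with γ = 0.9):

```python
import numpy as np

gamma = 0.9
# Policy: play the organ when laughing, burn incense alone when quiet.
P_pi = np.array([[0.0, 1.0],    # from L the policy moves to Q
                 [0.0, 1.0]])   # from Q the policy stays in Q
r_pi = np.array([1.0, 1.0])     # +1 on every transition into the quiet state

v = np.random.randn(2)          # arbitrary starting point in V
for _ in range(200):
    v = r_pi + gamma * P_pi @ v # repeatedly apply L_pi

print(v)                        # converges to v_pi = [10, 10], the unique fixed point
```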
10. Given a value v ∈ V, suppose Lπ v = v′. Then we can conclude that

(a) v = v′
(b) v ≠ v′
(c) ||Lπ v − Lπ v′|| ≤ λ ||v − v′||, 0 ≤ λ < 1
(d) none of the above

Sol. (c)
The first option may not hold if v ≠ vπ. Similarly, the second option may not hold if v = vπ.
The third option is true because Lπ is a contraction and in all three possible scenarios
(v ≠ v′ ≠ vπ, v ≠ v′ = vπ, and v = v′ = vπ), the statement holds.
