CS461_Fall24_HW4


CS461 Artificial Intelligence

Fall 2024: Homework 4


Due Date: 10.12.2024


1 Informed Search
Consider a graph where the edge costs are always positive. We want to find
the least-cost path from a node S to another node G. Suppose we use A* tree
search with some heuristic h′ to find such a path. Let C denote the cost of the
path from S to G found using heuristic h′, let h∗(X) denote the shortest distance
from a node X to G, and let h be an admissible heuristic for this graph.
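
For reference, here is a minimal Python sketch of A* tree search with a pluggable heuristic; the example graph, edge costs, and heuristic values below are hypothetical and not part of the assignment.

import heapq

def a_star_tree_search(start, goal, successors, h):
    """A* tree search: repeatedly expand the frontier node with the least g + h.

    successors(node) yields (child, edge_cost) pairs; h(node) is the heuristic
    estimate of the remaining cost from node to the goal.
    """
    frontier = [(h(start), 0.0, start, [start])]   # entries are (f, g, node, path)
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:                           # goal test at expansion time
            return g, path
        for child, cost in successors(node):
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h(child), g2, child, path + [child]))
    return None

# Hypothetical example graph and heuristic values.
edges = {"S": [("A", 1.0), ("G", 4.0)], "A": [("G", 1.0)], "G": []}
h0 = {"S": 2.0, "A": 1.0, "G": 0.0}
print(a_star_tree_search("S", "G", lambda n: edges[n], lambda n: h0[n]))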

(a) For the given definitions of h′ below, determine whether C = h∗(S), C ≥ h∗(S), or C > h∗(S).

(i) h′(X) = h(X)/2

(ii) h′(X) = (h(X) + h∗(X)) / 2

(iii) h′(X) = h(X) + h∗(X)


(iv) h′(X) ≤ min_{Y ∈ K(X)} [h′(Y) − h(Y) + h(X)]  if K(X) ≠ ∅,
     h′(X) ≤ h(X)                                   if K(X) = ∅,
where K(X) denotes the set of all neighboring nodes Y that satisfy h∗(X) > h∗(Y).

(v) h′(X) = min_{Y ∈ K(X)} [h(Y) + cost(X, Y)]  if K(X) ≠ ∅,
    h′(X) = h(X)                                 if K(X) = ∅.


2 Bayes Networks - Variable Elimination


Consider the Bayes network below, where each variable has a binary domain. We
want to find P(Y | +z). In order to find it, we will perform variable elimination
with the order X, T, U, V, W.

[Figure: Bayes network over the variables T, U, V, W, X, Y, and Z; the edge structure is given in the original figure.]
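
As a reminder of the mechanics only (not the answer to this problem), here is a minimal sketch of summing out one variable from a list of factors; the factor tables and variable names below are hypothetical.

from itertools import product

def eliminate(factors, var, domains):
    """Sum out `var`: multiply all factors that mention it, then marginalize it.

    Each factor is a (scope, table) pair, where scope is a tuple of variable
    names and table maps tuples of values to probabilities.
    """
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {var}))
    table = {}
    for assign in product(*(domains[v] for v in new_scope)):
        ctx = dict(zip(new_scope, assign))
        total = 0.0
        for x in domains[var]:
            ctx[var] = x
            p = 1.0
            for scope, t in touching:
                p *= t[tuple(ctx[v] for v in scope)]
            total += p
        table[assign] = total
    return rest + [(new_scope, table)]

# Hypothetical binary factors P(X) and P(Z | X); eliminating X leaves a factor over Z.
domains = {"X": [0, 1], "Z": [0, 1]}
fX = (("X",), {(0,): 0.6, (1,): 0.4})
fZX = (("Z", "X"), {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8})
print(eliminate([fX, fZX], "X", domains))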

1. What factors do you have after inserting the evidence?

2. Find the factor f1 (as an equation) generated by eliminating X and give the resulting list of factors.

3. Find the factor f2 (as an equation) generated by eliminating T and give the resulting list of factors.

4. Find the factor f3 (as an equation) generated by eliminating U and give the resulting list of factors.


5. Find the factor f4 (as an equation) generated by eliminating V and give the resulting list of factors.

6. Find the factor f5 (as an equation) generated by eliminating W and give the resulting list of factors.

7. How will you obtain P(Y | +z) from the remaining factors?

8. Is the variable elimination order optimal? If it is not, give an optimal order.


3 HMM - Filtering
Consider the two HMM-like models below. Find the elapse-time (P(Xt, Zt | e1:t−1))
and observation (P(Xt, Zt | e1:t)) updates.

Hint: Recall that for standard HMMs, these updates take the following forms, respectively:

P(Xt | e1:t−1) = Σ_{xt−1} P(Xt | xt−1) P(xt−1 | e1:t−1)

P(Xt | e1:t) ∝ P(Xt | e1:t−1) P(et | Xt)
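
For concreteness, a minimal sketch of these two standard-HMM updates with arrays; the transition matrix, belief, and likelihood values below are hypothetical and unrelated to models (a) and (b).

import numpy as np

def elapse_time(belief, T):
    """P(X_t | e_{1:t-1}) = sum over x_{t-1} of P(X_t | x_{t-1}) P(x_{t-1} | e_{1:t-1})."""
    return T.T @ belief                # T[i, j] = P(X_t = j | X_{t-1} = i)

def observe(prior, obs_likelihood):
    """P(X_t | e_{1:t}) ∝ P(X_t | e_{1:t-1}) P(e_t | X_t)."""
    unnorm = prior * obs_likelihood    # elementwise product
    return unnorm / unnorm.sum()       # normalize to get a distribution

# Hypothetical two-state example.
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])             # rows: x_{t-1}, columns: X_t
belief = np.array([0.5, 0.5])          # P(X_{t-1} | e_{1:t-1})
like = np.array([0.9, 0.2])            # P(e_t | X_t) for each value of X_t
prior = elapse_time(belief, T)
print(prior, observe(prior, like))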

[Figure: Two HMM-like models, (a) and (b), each drawn over hidden states X1, X2, X3, variables Z1, Z2, Z3, and evidence E1, E2, E3; their edge structures are given in the original figure.]


(a) Equations for model (a)

(b) Equations for model (b)


4 HMM: Viterbi Algorithm


Consider the HMM structure below. The agent performs an action Ut from some
state-independent policy at every time step t, and also receives some evidence
et of the true state Xt .

[Figure: HMM structure with action nodes U1, U2, U3, state nodes X1, X2, X3, and evidence nodes e1, e2, e3; the edge structure is given in the original figure.]

In this setting, which (if any) of the expressions below are maximized by the
Viterbi algorithm? Explain your reasoning below.
Hint: Recall that the Viterbi update for vanilla HMMs with no control actions is:

m1:t+1 = P(et+1 | Xt+1) · max_{xt} P(Xt+1 | xt) · m1:t
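
For reference, a minimal sketch of one step of this vanilla-HMM Viterbi recursion; the two-state numbers are hypothetical, and the sketch does not model the action nodes Ut.

import numpy as np

def viterbi_step(m, T, obs_likelihood):
    """m_{1:t+1}(X_{t+1}) = P(e_{t+1} | X_{t+1}) * max over x_t of P(X_{t+1} | x_t) m_{1:t}(x_t)."""
    best_prev = (T * m[:, None]).max(axis=0)   # max over x_t for each value of X_{t+1}
    return obs_likelihood * best_prev          # T[i, j] = P(X_{t+1} = j | X_t = i)

# Hypothetical example.
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
m = np.array([0.5, 0.5])           # previous max-message m_{1:t}
like = np.array([0.8, 0.1])        # P(e_{t+1} | X_{t+1})
print(viterbi_step(m, T, like))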

1. P(X1:t)

2. P(U1:t)
3. P(e1:t)
4. P(X1:t, U1:t | e1:t)
5. P(X1:t, U1:t, e1:t)
6. P(U1) P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Ut−1) P(Xt | Ut) P(et | Xt)
7. P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Xt−1) P(Xt | Xt−1) P(et | Xt)


5 1D Gridworld MDP
Consider a 1D gridworld MDP. In this MDP, available actions are left, right,
and stay. Stay always results in the agent staying in its current square. Left
and right are successful in moving in the intended direction half of the time.
The other half of the time, the agent stays in its current square. An agent
cannot try to move left at the leftmost square, and cannot try to move right
on the rightmost square. Staying still on a square gives a reward equivalent to
the number on that square and all other transitions give zero reward (meaning
any transitions in which the agent moves to a different square give zero reward).
Initially, let V0 (s) = 0, ∀s and γ = 0.5.

Perform value iteration and put the values you find in the given empty grids.
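
For reference, a minimal value-iteration sketch for this kind of 1D gridworld; the square rewards in the example are hypothetical placeholders (the actual numbers are in the figure), and it assumes, as one reading of the statement, that failed left/right moves also give zero reward.

import numpy as np

def value_iteration(rewards, gamma=0.5, iters=100):
    """Iterate V(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V(s')]."""
    n = len(rewards)
    V = np.zeros(n)
    for _ in range(iters):
        new_V = np.empty(n)
        for s in range(n):
            q = [rewards[s] + gamma * V[s]]                       # stay: reward of the square
            if s > 0:                                             # left: succeeds half the time
                q.append(0.5 * gamma * V[s - 1] + 0.5 * gamma * V[s])
            if s < n - 1:                                         # right: succeeds half the time
                q.append(0.5 * gamma * V[s + 1] + 0.5 * gamma * V[s])
            new_V[s] = max(q)
        V = new_V
    return V

# Hypothetical square rewards standing in for the numbers printed on the grid.
print(value_iteration([0, 1, 0, 10]))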

(a) Compute V1 (s)

(b) Compute V2 (s)


(c) Compute V∗(s).

Hint 1: Think about states where the optimal action is obvious and work from there.
Hint 2: Σ_{k=0}^{∞} a^k = 1/(1 − a), for 0 < a < 1.

(d) What is the optimal policy? Fill each state with the optimal action ("left", "right", or "stay") in the grid below.


6 Discount Shaping
In MDPs, we usually select the discount factor as a hyperparameter and then compute
the value function. However, although it is much less common, finding the
discount factor that produces a desired value can also be useful (e.g.,
reverse-engineering a real-life process). Consider a gridworld-like MDP with five
states s1, ..., s5. The agent can either stay (aS) or continue (aC). You are also
given the following transition kernel T(s, a, s′) and reward function R(s, a):

T(si, aS, si) = 1 for i ∈ {1, 2, 3, 4}
T(si, aC, si+1) = 1 for i ∈ {1, 2, 3, 4}
T(s5, a, s5) = 1 for all actions a
R(si, a) = 0 for i ∈ {1, 2, 3, 5} and for all actions a
R(s4, aS) = 0, R(s4, aC) = 10

If the desired optimal value of state s1 is V∗(s1) = 1, what is the discount factor γ?
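
A minimal numerical sketch of "reverse engineering" γ for a target value, using value iteration plus bisection; it assumes the kernel above (continue moves from si to si+1, and s5 is absorbing), and the function names are hypothetical.

import numpy as np

def v_star(gamma, iters=200):
    """Value iteration for the five-state chain above (stay vs. continue)."""
    R_continue = np.array([0.0, 0.0, 0.0, 10.0, 0.0])        # R(s_i, a_C)
    V = np.zeros(5)
    for _ in range(iters):
        stay = gamma * V                                      # R(s_i, a_S) = 0 everywhere
        cont = R_continue + gamma * np.append(V[1:], V[-1])   # s5 continues into itself
        V = np.maximum(stay, cont)
    return V

def find_gamma(target, lo=0.0, hi=1.0, tol=1e-9):
    """Bisection on gamma; V*(s1) is increasing in gamma for this chain."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_star(mid)[0] < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(find_gamma(1.0))   # gamma such that V*(s1) = 1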


7 Reinforcement Learning I
Imagine an unknown game which has only two states {A, B} and in each state
the agent has two actions to choose from: {Up, Down}. Suppose a game agent
chooses actions according to some policy π and generates the following sequence
of actions and rewards in this unknown game:

t    s_t    a_t     s_{t+1}    r_t
0    A      Down    B           8
1    B      Down    B          -4
2    B      Up      B           0
3    B      Up      A           2
4    A      Up      A          -2

Assume a discount factor γ = 0.5 and a learning rate α = 0.5.


(a) In model-based reinforcement learning, we first estimate the transition
function T(s, a, s′) and the reward function R(s, a, s′). Fill in the following
estimates of T and R, obtained from the experience above. Write "n/a"
if not applicable or undefined.

T̂(A, Up, A) =          R̂(A, Up, A) =

T̂(A, Up, B) =          R̂(A, Up, B) =

T̂(B, Up, A) =          R̂(B, Up, A) =

T̂(B, Up, B) =          R̂(B, Up, B) =


(b) Recall the update function of Q-learning is:

Q(st, at) ← (1 − α) Q(st, at) + α (rt + γ max_{a′} Q(st+1, a′))    (1)
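
For reference, a minimal sketch of applying update (1) step by step to a sequence of samples; the transitions in the example are hypothetical placeholders, not the table above.

from collections import defaultdict

def q_learning(experience, alpha=0.5, gamma=0.5, actions=("Up", "Down")):
    """Apply Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a')) per sample."""
    Q = defaultdict(float)                        # all Q-values start at 0
    for s, a, s_next, r in experience:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return dict(Q)

# Hypothetical (state, action, next_state, reward) samples.
samples = [("A", "Up", "B", 1.0), ("B", "Down", "A", -1.0)]
print(q_learning(samples))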

Assume that all Q-values are initialized to 0. What are the following Q-values
learned by running Q-learning with the above experience sequence?

Q(A, Down) =          Q(B, Up) =


8 Approximate Q-Learning
Suppose that one weekend, you decide to go to an amusement park with your
friends. You start at the amusement park feeling well, receiving positive rewards
from rides, with some rides offering greater rewards than others. However, each
ride carries a risk of making you feel sick. If you continue to go on rides while
feeling sick, there is a chance you may recover, but the rewards you earn from
rides will likely be reduced and could even be negative.
You have no prior experience visiting amusement parks, so you are unsure
about the exact rewards you will receive from each ride (whether feeling well
or sick). Similarly, you do not know the probability of getting sick on any
particular ride or recovering while sick. What you do know about the rides is
summarized below:

Actions / Rides    Type             Wait     Speed
Big Dipper         Rollercoaster    Long     Fast
Wild Mouse         Rollercoaster    Short    Slow
Hair Raiser        Drop tower       Short    Fast
Moon Ranger        Pendulum         Short    Slow
Leave the Park     Leave            Short    Slow
Let's formulate this problem as a two-state MDP: Well and Sick. The actions
are the choice of which ride to take. The 'Leave the Park' action terminates the run.
Taking a ride can either transition to the same state with some probability or
take you to the other state. We'll use a feature-based approximation to the
Q-values, defined by the following four features and associated weights:

Features                                                              Initial Weights
f0(state, action) = 1 (a bias feature that is always 1)               w0 = 1
f1(state, action) = 1 if action type is Rollercoaster, 0 otherwise    w1 = 2
f2(state, action) = 1 if action wait is Short, 0 otherwise            w2 = 1
f3(state, action) = 1 if action speed is Fast, 0 otherwise            w3 = 0.5
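
For reference, a minimal sketch of feature-based Q-values and the weight update typically used in approximate Q-learning; the feature vectors, reward, and next-state actions in the example are hypothetical and deliberately not one of the rides above.

def q_value(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def q_update(weights, feats, reward, next_feats_per_action, alpha=0.5, gamma=0.5):
    """Typical update: difference = (r + gamma * max_a' Q(s', a')) - Q(s, a);
    then w_i <- w_i + alpha * difference * f_i(s, a)."""
    diff = (reward + gamma * max(q_value(weights, f) for f in next_feats_per_action)
            - q_value(weights, feats))
    return [w + alpha * diff * f for w, f in zip(weights, feats)]

# Hypothetical feature vectors in the order [f0, f1, f2, f3].
w = [1.0, 2.0, 1.0, 0.5]
some_action = [1, 0, 1, 0]
print(q_value(w, some_action))
print(q_update(w, some_action, -1.0, [[1, 1, 0, 1], [1, 0, 0, 0]]))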

(a) Calculate Q(Well, BigDipper).


(b) Apply a Q-learning update based on the sample (Well, BigDipper, Sick, −10.5),
using a learning rate of α = 0.5 and discount of γ = 0.5. What are the
new weights?

(c) Using our approximation, are the Q-values that involve the Sick state the
same as or different from the corresponding Q-values that involve the Well
state? In other words, is Q(Well, a) = Q(Sick, a) for any possible action
a? Why or why not? (In just one sentence.)

(d) Assume we have the original weights from the table above. What action
will an ϵ-greedy approach choose from the Well state? If multiple actions
could be chosen, give each action and its probability.


9 Deep RL - Value-based Methods


Answer the questions about Deep Q Networks (DQN), a value-based method,
below.

(a) Briefly explain the three key differences between (tabular) Q-learning and
DQN.

(b) We usually use deep neural networks to approximate the Q-function in
DQNs. In this context, which one of the three differences you explained
above do you think improves the performance? Explain your reasoning.

(c) The update frequency of the target network is an important hyperparameter
in the training of DQNs. Suppose that while training your agent,
you set the update frequency so that the target is updated very rarely, say every
2^64 steps. How do you think this would impact your training and agent
performance?
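
For intuition, a tiny sketch of the target-network pattern, where a linear function stands in for the deep Q-network and all names and numbers are hypothetical; the point is only that the TD target uses weights that are synced rarely.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
online_w = rng.normal(size=(n_features, n_actions))   # updated every step
target_w = online_w.copy()                             # synced only occasionally
TARGET_UPDATE_EVERY = 1000                             # the hyperparameter in question
alpha, gamma = 0.01, 0.99

for step in range(5000):
    # Hypothetical transition (phi, a, r, phi_next) drawn at random.
    phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
    a, r = rng.integers(n_actions), rng.normal()
    td_target = r + gamma * np.max(phi_next @ target_w)   # target net, not the online net
    td_error = td_target - phi @ online_w[:, a]
    online_w[:, a] += alpha * td_error * phi               # gradient step on the online net
    if step % TARGET_UPDATE_EVERY == 0:                    # a very large interval means stale targets
        target_w = online_w.copy()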


10 Deep RL - Policy Gradient and Actor-Critic Methods
Answer the questions about policy gradient- and actor-critic-based deep RL
methods below.

(a) In the context of policy gradient methods in reinforcement learning, the
variance of gradient estimates can significantly affect the learning process.
One common technique to reduce variance in policy gradient methods
involves leveraging the principle of causality.
Explain how the concept of causality is utilized in policy gradient methods
to decrease the variability in the calculated gradients, either through
mathematical expressions or verbal explanation.

(b) In actor-critic reinforcement learning methods, the system is composed of
two main components: the actor, which dictates the policy πθ(a|s), and
the critic, which estimates the value function Vϕ(s). The two components
are parameterized by θ and ϕ, respectively.
How does the value function estimated by the critic implicitly serve as a
baseline to improve the gradient estimates for the actor's policy?
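
For reference, a minimal sketch of where reward-to-go returns and a value baseline typically enter a policy-gradient estimate; all names and numbers are hypothetical, and this is one common formulation rather than the specific one expected in your answer.

import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """G_t = sum_{k >= t} gamma^(k - t) r_k; rewards earned before step t are dropped."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def pg_weights(rewards, values, gamma=0.99):
    """Per-step weights on grad log pi(a_t | s_t): reward-to-go minus the critic's baseline."""
    return reward_to_go(rewards, gamma) - np.asarray(values)

# Hypothetical trajectory: rewards and the critic's V(s_t) estimates.
rewards = [1.0, 0.0, 2.0]
values = [1.5, 1.0, 1.8]
print(pg_weights(rewards, values))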


(c) Compare the actor-critic and policy gradient methods in terms of bias and
variance. Fill in the table below with ’High’ or ’Low’ to indicate the level
of bias or variance you associate with each method and provide a brief
(∼1-2 sentence) explanation for each of your answers.

Method             Bias Level    Variance Level
Actor-Critic
Policy Gradient
