CS461 Artificial Intelligence Homework 4
1 Informed Search
Consider a graph whose edge costs are always positive. We want to find the least-cost path from a node S to another node G. Suppose we use A* tree search with some heuristic h′ to find such a path. Let C denote the cost of the path from S to G found by A* using heuristic h′, let h∗(X) denote the shortest distance from a node X to G, and let h be an admissible heuristic for this graph.
(ii) h′(X) = (h(X) + h∗(X)) / 2
(iv) h′(X) ≤ min_{Y ∈ K(X)} [h′(Y) − h(Y) + h(X)]   if K(X) ≠ ∅,
     h′(X) ≤ h(X)                                   if K(X) = ∅,
where K(X) is the set of all neighboring nodes Y that satisfy h∗(X) > h∗(Y).
(v) h′(X) = min_{Y ∈ K(X)} [h(Y) + cost(X, Y)]   if K(X) ≠ ∅,
    h′(X) = h(X)                                 if K(X) = ∅.
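For reference while reasoning about these heuristics, here is a minimal sketch of A* tree search (not part of the assignment). The adjacency-dictionary format graph[node] = [(neighbor, edge_cost), ...] and the callable h_prime are illustrative assumptions.

```python
import heapq

def a_star_tree_search(graph, start, goal, h_prime):
    """A* tree search: expand nodes in order of f = g + h_prime(node).

    graph: dict mapping node -> list of (neighbor, edge_cost) pairs (assumed format).
    h_prime: heuristic, h_prime(node) -> estimated remaining cost to goal.
    Returns (cost, path) the first time the goal is popped from the frontier.
    """
    # Frontier entries: (f, g, node, path so far). Tree search keeps no closed set,
    # so a node may be re-expanded along different paths.
    frontier = [(h_prime(start), 0.0, start, [start])]
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path  # this g is the cost C obtained with heuristic h_prime
        for neighbor, edge_cost in graph.get(node, []):
            g_new = g + edge_cost
            heapq.heappush(frontier,
                           (g_new + h_prime(neighbor), g_new, neighbor, path + [neighbor]))
    return float("inf"), None
```

Whether the cost C returned by this loop equals the optimal cost h∗(S) depends on the properties of h′, which is exactly what the parts above ask you to analyze.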
[Figure: graphical model with nodes U, V, W, X, Y, Z.]
7. How will you obtain P(Y | +z) from the remaining factors?
3 HMM - Filtering
Consider the two HMM-like models below. Find the elapse-time (P(Xt, Zt | e1:t−1)) and observation (P(Xt, Zt | e1:t)) updates for each.
Hint: Recall that for standard HMMs, these updates take the following forms, respectively:

P(Xt | e1:t−1) = Σ_{xt−1} P(Xt | xt−1) P(xt−1 | e1:t−1),
P(Xt | e1:t) ∝ P(et | Xt) P(Xt | e1:t−1).
[Figure (a): first HMM-like model, with variables X1, X2, X3 and Z1, Z2, Z3 and evidence E1, E2, E3.]
[Figure (b): second HMM-like model, with variables X1, X2, X3 and Z1, Z2, Z3 and evidence E1, E2, E3.]
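As a reference for the standard-HMM updates quoted in the hint (not for the modified models in the figures), here is a small sketch over a discrete state space. The dictionaries belief, transition, and emission are hypothetical stand-ins for the current belief and the model's conditional tables.

```python
def elapse_time(belief, transition):
    """Elapse-time update:
    P(X_t | e_1:t-1) = sum over x_t-1 of P(X_t | x_t-1) * P(x_t-1 | e_1:t-1).

    belief:     dict state -> P(x_t-1 | e_1:t-1)
    transition: dict (prev_state, next_state) -> P(next_state | prev_state)
    """
    states = list(belief)
    return {x: sum(transition[(x_prev, x)] * belief[x_prev] for x_prev in states)
            for x in states}


def observe(predicted, emission, evidence):
    """Observation update: P(X_t | e_1:t) is proportional to P(e_t | X_t) * P(X_t | e_1:t-1).

    predicted: dict state -> P(x_t | e_1:t-1), e.g. the output of elapse_time
    emission:  dict (state, observation) -> P(observation | state)
    """
    unnormalized = {x: emission[(x, evidence)] * p for x, p in predicted.items()}
    total = sum(unnormalized.values())
    return {x: p / total for x, p in unnormalized.items()}
```

For models (a) and (b), the query is over the pair (Xt, Zt), so the corresponding updates track a joint belief over (Xt, Zt); how the transition and emission terms factor depends on the structure shown in each figure.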
[Figure: HMM with control actions, with controls U1, U2, U3, states X1, X2, X3, and observations e1, e2, e3.]
In this setting, which (if any) of the expressions below are maximized by the Viterbi algorithm? Explain your reasoning.
Hint: Recall the Viterbi update for vanilla HMMs with no control actions:

mt[xt] = P(et | xt) max_{xt−1} P(xt | xt−1) mt−1[xt−1].
1. P (X1:t )
2. P (U1:t )
3. P (e1:t )
4. P (X1:t , U1:t | e1:t )
5. P (X1:t , U1:t , e1:t )
6. P(U1) P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Ut−1) P(Xt | Ut) P(et | Xt)
7. P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Xt−1) P(Xt | Xt−1) P(et | Xt)
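For reference, a minimal sketch of the vanilla Viterbi recursion from the hint (no control actions); the dictionary-based transition, emission, and prior arguments are illustrative assumptions, not part of the assignment.

```python
def viterbi(states, transition, emission, prior, observations):
    """Vanilla Viterbi for an HMM with no control actions.

    Tracks m_t[x_t] = P(e_t | x_t) * max over x_t-1 of P(x_t | x_t-1) * m_t-1[x_t-1],
    the probability of the best state sequence ending in x_t that explains e_1:t.
    transition: dict (prev_state, state) -> P(state | prev_state)
    emission:   dict (state, observation) -> P(observation | state)
    prior:      dict state -> P(x_1)
    """
    # Initialization: m_1[x_1] = P(x_1) * P(e_1 | x_1)
    m = {x: prior[x] * emission[(x, observations[0])] for x in states}
    backpointers = []
    for obs in observations[1:]:
        prev_m, m, back = m, {}, {}
        for x in states:
            best_prev = max(states, key=lambda xp: transition[(xp, x)] * prev_m[xp])
            m[x] = emission[(x, obs)] * transition[(best_prev, x)] * prev_m[best_prev]
            back[x] = best_prev
        backpointers.append(back)
    # Follow backpointers from the best final state to recover the sequence.
    last = max(m, key=m.get)
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return list(reversed(path)), m[last]
```

The question above asks what this maximization corresponds to once the control variables Ut are added; the sketch only implements the vanilla recursion.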
5 1D Gridworld MDP
Consider a 1D gridworld MDP. In this MDP, the available actions are left, right, and stay. Stay always keeps the agent in its current square. Left and right succeed in moving in the intended direction half of the time; the other half of the time, the agent stays in its current square. The agent cannot try to move left in the leftmost square and cannot try to move right in the rightmost square. Staying in a square gives a reward equal to the number on that square, and all other transitions give zero reward (that is, any transition in which the agent moves to a different square gives zero reward).
Initially, let V0(s) = 0, ∀s, and γ = 0.5.
Perform value iteration and enter the values you find in the empty grids provided.
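The square rewards live in a figure that is not reproduced here, so the rewards list below is a hypothetical placeholder; this is only a sketch of the value-iteration updates under one reading of the reward model (any transition that leaves the agent in place, including a failed move, pays the square's reward).

```python
def value_iteration_1d(rewards, gamma=0.5, iterations=2):
    """Value iteration for a 1D gridworld of the kind described above.

    rewards: the numbers written on the squares (hypothetical placeholders here;
             use the grid from the assignment's figure).
    Assumed reward model: any transition that leaves the agent in its square
    (the stay action, or a failed left/right move) pays that square's reward;
    moving to a different square pays 0. Left/right succeed with probability 0.5.
    """
    n = len(rewards)
    V = [0.0] * n  # V_0(s) = 0 for all s
    for _ in range(iterations):
        V_new = []
        for s in range(n):
            q_values = [rewards[s] + gamma * V[s]]  # stay
            if s > 0:  # left is unavailable in the leftmost square
                q_values.append(0.5 * (0 + gamma * V[s - 1])
                                + 0.5 * (rewards[s] + gamma * V[s]))
            if s < n - 1:  # right is unavailable in the rightmost square
                q_values.append(0.5 * (0 + gamma * V[s + 1])
                                + 0.5 * (rewards[s] + gamma * V[s]))
            V_new.append(max(q_values))
        V = V_new
    return V

# Made-up square rewards, purely for illustration.
print(value_iteration_1d([0, 1, 10, 1, 0], gamma=0.5, iterations=2))
```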
(d) What is the optimal policy? Fill in each state with the optimal action ("left", "right", or "stay") in the grid below.
6 Discount Shaping
In MDPs, we usually select the discount factor as a hyperparameter and then compute the value function. Although it is much less common, finding a discount factor that produces a desired value can also be useful (e.g., reverse engineering a real-life process). Consider a gridworld-like MDP with five states s1, ..., s5. The agent can either stay (aS) or continue (aC). You are also given the following transition kernel T(s, a, s′) and reward function R(s, a):
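The T(s, a, s′) and R(s, a) table for this question is on the handout rather than reproduced here, so as a generic illustration of solving for a discount: if a policy (hypothetically) earns a constant reward r on every step while remaining in the same state, then V = r / (1 − γ), and a target value V∗ pins down γ = 1 − r / V∗.

```python
def discount_for_target_value(r, v_target):
    """Solve V = r / (1 - gamma) for gamma, given a constant per-step reward r.

    A hypothetical illustration of choosing a discount factor to hit a desired value;
    the actual question uses the T(s, a, s') and R(s, a) table from the handout.
    """
    return 1.0 - r / v_target

# e.g. a per-step reward of 1 and a desired value of 4 need gamma = 0.75
print(discount_for_target_value(1.0, 4.0))
```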
7 Reinforcement Learning I
Imagine an unknown game that has only two states {A, B}; in each state the agent has two actions to choose from: {Up, Down}. Suppose the agent chooses actions according to some policy π and generates the following sequence of states, actions, and rewards in this unknown game:
t   st   at     st+1   rt
0   A    Down   B       8
1   B    Down   B      -4
2   B    Up     B       0
3   B    Up     A       2
4   A    Up     A      -2
T̂(A, Up, A) =          R̂(A, Up, A) =
T̂(A, Up, B) =          R̂(A, Up, B) =
T̂(B, Up, A) =          R̂(B, Up, A) =
T̂(B, Up, B) =          R̂(B, Up, B) =
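If these blanks ask for the usual maximum-likelihood model estimates from the episode above (normalized counts for T̂, average observed rewards for R̂), here is a minimal sketch, assuming the transitions are stored as (s, a, s′, r) tuples.

```python
from collections import defaultdict

def estimate_model(transitions):
    """Maximum-likelihood model estimates from observed (s, a, s_next, r) tuples.

    T_hat(s, a, s') = count(s, a, s') / count(s, a)
    R_hat(s, a, s') = average reward observed on (s, a, s') transitions
    """
    sa_counts = defaultdict(int)
    sas_counts = defaultdict(int)
    reward_sums = defaultdict(float)
    for s, a, s_next, r in transitions:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a, s_next)] += 1
        reward_sums[(s, a, s_next)] += r
    T_hat = {k: c / sa_counts[(k[0], k[1])] for k, c in sas_counts.items()}
    R_hat = {k: reward_sums[k] / c for k, c in sas_counts.items()}
    return T_hat, R_hat

# The episode from the table above, as (s_t, a_t, s_t+1, r_t) tuples.
episode = [("A", "Down", "B", 8), ("B", "Down", "B", -4),
           ("B", "Up", "B", 0), ("B", "Up", "A", 2), ("A", "Up", "A", -2)]
T_hat, R_hat = estimate_model(episode)
```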
Q(st, at) ← (1 − α) Q(st, at) + α (rt + γ max_{a′} Q(st+1, a′))    (1)
Assume that all Q-values are initialized to 0. What are the following Q-values learned by running Q-learning on the experience sequence above?
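A minimal sketch of applying update (1) once per observed transition, in order; the dict-based Q-table and the alpha/gamma arguments are assumptions (use the values specified on the assignment sheet).

```python
from collections import defaultdict

def q_learning(transitions, actions, alpha, gamma):
    """Tabular Q-learning: apply update (1) once per observed transition, in order.

    transitions: iterable of (s, a, s_next, r) tuples
    actions:     list of actions available in every state
    """
    Q = defaultdict(float)  # all Q-values start at 0
    for s, a, s_next, r in transitions:
        target = r + gamma * max(Q[(s_next, a_next)] for a_next in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return dict(Q)
```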
8 Approximate Q-Learning
Suppose that one weekend, you decide to go to an amusement park with your
friends. You start at the amusement park feeling well, receiving positive rewards
from rides, with some rides offering greater rewards than others. However, each
ride carries a risk of making you feel sick. If you continue to go on rides while
feeling sick, there is a chance you may recover, but the rewards you earn from
rides will likely be reduced and could even be negative.
You have no prior experience visiting amusement parks, so you are unsure
about the exact rewards you will receive from each ride (whether feeling well
or sick). Similarly, you do not know the probability of getting sick on any
particular ride or recovering while sick. What you do know about the rides is
summarized below:
(b) Apply a Q-learning update based on the sample (Well, BigDipper, Sick, −10.5), using a learning rate of α = 0.5 and discount of γ = 0.5. What are the new weights? (A generic sketch of this weight update appears after part (d).)
(c) Using our approximation, are the Q-values that involve the sick state the same as or different from the corresponding Q-values that involve the well state? In other words, is Q(Well, a) = Q(Sick, a) for any possible action a? Why or why not? (in just one sentence)
(d) Assume we have the original weights from the table on the previous page. What action will an ϵ-greedy approach choose from the Well state? If multiple actions could be chosen, give each action and its probability.
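As referenced in part (b): since the feature and weight table lives on the handout, the sketch below uses hypothetical feature names and weights and only illustrates the standard linear approximate Q-learning update, Q(s, a) = Σ_i w_i f_i(s, a) with w_i ← w_i + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)] f_i(s, a).

```python
def approx_q_value(weights, features):
    """Q(s, a) = sum over i of w_i * f_i(s, a), with features as a dict name -> value."""
    return sum(weights[name] * value for name, value in features.items())


def approx_q_update(weights, features_sa, reward, max_q_next, alpha, gamma):
    """One approximate Q-learning weight update for a sample (s, a, s', r).

    difference = (r + gamma * max over a' of Q(s', a')) - Q(s, a)
    w_i       <- w_i + alpha * difference * f_i(s, a)
    """
    difference = (reward + gamma * max_q_next) - approx_q_value(weights, features_sa)
    return {name: w + alpha * difference * features_sa.get(name, 0.0)
            for name, w in weights.items()}


# Hypothetical feature values and weights; substitute the table from the handout.
weights = {"bias": 1.0, "ride-intensity": 2.0}
features_well_bigdipper = {"bias": 1.0, "ride-intensity": 3.0}
new_weights = approx_q_update(weights, features_well_bigdipper, reward=-10.5,
                              max_q_next=0.0,  # assuming, hypothetically, max_a' Q(Sick, a') = 0
                              alpha=0.5, gamma=0.5)
```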
(a) Briefly explain the three key differences between (tabular) Q-learning and
DQN.
(c) Compare the actor-critic and policy gradient methods in terms of bias and variance. Fill in the table below with 'High' or 'Low' to indicate the level of bias or variance you associate with each method, and provide a brief (~1-2 sentence) explanation for each of your answers.