CS461_Fall24_HW4


CS461 Artificial Intelligence

Fall 2024: Homework 4


Due Date: 10.12.2024


1 Informed Search
Consider a graph where the edge costs are always positive. We want to find
the least-cost path from a node S to another node G. Suppose we use A* tree
search with some heuristic h′ to find such a path. Let C denote the cost of the
path from S to G found using heuristic h′, let h∗(X) denote the shortest distance
from a node X to G, and let h be an admissible heuristic for this graph.
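
For reference, here is a minimal Python sketch of A* tree search with a pluggable heuristic; the example graph, edge costs, and heuristic values below are hypothetical and not part of the assignment.

import heapq

def a_star_tree_search(start, goal, successors, h):
    """A* tree search: repeatedly expand the frontier node with the least g + h.

    successors(node) yields (child, edge_cost) pairs; h(node) is the heuristic
    estimate of the remaining cost from node to the goal.
    """
    frontier = [(h(start), 0.0, start, [start])]   # entries are (f, g, node, path)
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:                           # goal test at expansion time
            return g, path
        for child, cost in successors(node):
            g2 = g + cost
            heapq.heappush(frontier, (g2 + h(child), g2, child, path + [child]))
    return None

# Hypothetical example graph and heuristic values.
edges = {"S": [("A", 1.0), ("G", 4.0)], "A": [("G", 1.0)], "G": []}
h0 = {"S": 2.0, "A": 1.0, "G": 0.0}
print(a_star_tree_search("S", "G", lambda n: edges[n], lambda n: h0[n]))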

(a) For the given definitions of h′ below, determine whether C = h∗(S), C ≥ h∗(S), or C > h∗(S).

(i) h′(X) = h(X)/2

(ii) h′(X) = (h(X) + h∗(X)) / 2

(iii) h′(X) = h(X) + h∗(X)


(iv) h′(X) ≤ min_{Y ∈ K(X)} [h′(Y) − h(Y) + h(X)]  if K(X) ≠ ∅,
     h′(X) ≤ h(X)                                   if K(X) = ∅,
where K(X) denotes the set of all neighboring nodes Y that satisfy h∗(X) > h∗(Y).

(v) h′(X) = min_{Y ∈ K(X)} [h(Y) + cost(X, Y)]  if K(X) ≠ ∅,
    h′(X) = h(X)                                 if K(X) = ∅.


2 Bayes Networks - Variable Elimination


Consider the Bayes network below, where each variable has a binary domain. We
want to find P(Y | +z). In order to find it, we will perform variable elimination
with the order X, T, U, V, W.

[Figure: Bayes network over the variables T, U, V, W, X, Y, and Z; the edge structure is given in the original figure.]
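
As a reminder of the mechanics only (not the answer to this problem), here is a minimal sketch of summing out one variable from a list of factors; the factor tables and variable names below are hypothetical.

from itertools import product

def eliminate(factors, var, domains):
    """Sum out `var`: multiply all factors that mention it, then marginalize it.

    Each factor is a (scope, table) pair, where scope is a tuple of variable
    names and table maps tuples of values to probabilities.
    """
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {var}))
    table = {}
    for assign in product(*(domains[v] for v in new_scope)):
        ctx = dict(zip(new_scope, assign))
        total = 0.0
        for x in domains[var]:
            ctx[var] = x
            p = 1.0
            for scope, t in touching:
                p *= t[tuple(ctx[v] for v in scope)]
            total += p
        table[assign] = total
    return rest + [(new_scope, table)]

# Hypothetical binary factors P(X) and P(Z | X); eliminating X leaves a factor over Z.
domains = {"X": [0, 1], "Z": [0, 1]}
fX = (("X",), {(0,): 0.6, (1,): 0.4})
fZX = (("Z", "X"), {(0, 0): 0.9, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.8})
print(eliminate([fX, fZX], "X", domains))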

1. What factors do you have after inserting the evidence?

2. Find the factor f1 (as an equation) generated by eliminating X and give the resulting list of factors.

3. Find the factor f2 (as an equation) generated by eliminating T and give the resulting list of factors.

4. Find the factor f3 (as an equation) generated by eliminating U and give the resulting list of factors.


5. Find the factor f4 (as an equation) generated by eliminating V and give the resulting list of factors.

6. Find the factor f5 (as an equation) generated by eliminating W and give the resulting list of factors.

7. How will you obtain P(Y | +z) from the remaining factors?

8. Is the variable elimination order optimal? If it is not, give an optimal order.


3 HMM - Filtering
Consider the two HMM-like models below. Find the elapse-time (P(Xt, Zt | e1:t−1))
and observation (P(Xt, Zt | e1:t)) updates.

Hint: Recall that for standard HMMs, these updates take the following forms, respectively:

P(Xt | e1:t−1) = Σ_{xt−1} P(Xt | xt−1) P(xt−1 | e1:t−1)

P(Xt | e1:t) ∝ P(Xt | e1:t−1) P(et | Xt)
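
For concreteness, a minimal sketch of these two standard-HMM updates with arrays; the transition matrix, belief, and likelihood values below are hypothetical and unrelated to models (a) and (b).

import numpy as np

def elapse_time(belief, T):
    """P(X_t | e_{1:t-1}) = sum over x_{t-1} of P(X_t | x_{t-1}) P(x_{t-1} | e_{1:t-1})."""
    return T.T @ belief                # T[i, j] = P(X_t = j | X_{t-1} = i)

def observe(prior, obs_likelihood):
    """P(X_t | e_{1:t}) ∝ P(X_t | e_{1:t-1}) P(e_t | X_t)."""
    unnorm = prior * obs_likelihood    # elementwise product
    return unnorm / unnorm.sum()       # normalize to get a distribution

# Hypothetical two-state example.
T = np.array([[0.7, 0.3],
              [0.2, 0.8]])             # rows: x_{t-1}, columns: X_t
belief = np.array([0.5, 0.5])          # P(X_{t-1} | e_{1:t-1})
like = np.array([0.9, 0.2])            # P(e_t | X_t) for each value of X_t
prior = elapse_time(belief, T)
print(prior, observe(prior, like))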

[Figure: Two HMM-like models, (a) and (b), each drawn over hidden states X1, X2, X3, variables Z1, Z2, Z3, and evidence E1, E2, E3; their edge structures are given in the original figure.]


(a) Equations for model (a)

(b) Equations for model (b)


4 HMM: Viterbi Algorithm


Consider the HMM structure below. The agent performs an action Ut from some
state-independent policy at every time step t, and also receives some evidence
et of the true state Xt .

[Figure: HMM structure with action nodes U1, U2, U3, state nodes X1, X2, X3, and evidence nodes e1, e2, e3; the edge structure is given in the original figure.]

In this setting, which (if any) of the expressions below are maximized by the
Viterbi algorithm? Explain your reasoning below.
Hint: Recall that the Viterbi update for vanilla HMMs with no control actions is:

m1:t+1 = P(et+1 | Xt+1) · max_{xt} P(Xt+1 | xt) · m1:t
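
For reference, a minimal sketch of one step of this vanilla-HMM Viterbi recursion; the two-state numbers are hypothetical, and the sketch does not model the action nodes Ut.

import numpy as np

def viterbi_step(m, T, obs_likelihood):
    """m_{1:t+1}(X_{t+1}) = P(e_{t+1} | X_{t+1}) * max over x_t of P(X_{t+1} | x_t) m_{1:t}(x_t)."""
    best_prev = (T * m[:, None]).max(axis=0)   # max over x_t for each value of X_{t+1}
    return obs_likelihood * best_prev          # T[i, j] = P(X_{t+1} = j | X_t = i)

# Hypothetical example.
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
m = np.array([0.5, 0.5])           # previous max-message m_{1:t}
like = np.array([0.8, 0.1])        # P(e_{t+1} | X_{t+1})
print(viterbi_step(m, T, like))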

1. P(X1:t)

2. P(U1:t)
3. P(e1:t)
4. P(X1:t, U1:t | e1:t)
5. P(X1:t, U1:t, e1:t)
6. P(U1) P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Ut−1) P(Xt | Ut) P(et | Xt)
7. P(X1 | U1) P(e1 | X1) ∏_{t=2}^{t} P(Ut | Xt−1) P(Xt | Xt−1) P(et | Xt)


5 1D Gridworld MDP
Consider a 1D gridworld MDP. In this MDP, available actions are left, right,
and stay. Stay always results in the agent staying in its current square. Left
and right are successful in moving in the intended direction half of the time.
The other half of the time, the agent stays in its current square. An agent
cannot try to move left at the leftmost square, and cannot try to move right
on the rightmost square. Staying still on a square gives a reward equivalent to
the number on that square and all other transitions give zero reward (meaning
any transitions in which the agent moves to a different square give zero reward).
Initially, let V0 (s) = 0, ∀s and γ = 0.5.

Perform value iteration and put the values you find in the given empty grids.
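
For reference, a minimal value-iteration sketch for this kind of 1D gridworld; the square rewards in the example are hypothetical placeholders (the actual numbers are in the figure), and it assumes, as one reading of the statement, that failed left/right moves also give zero reward.

import numpy as np

def value_iteration(rewards, gamma=0.5, iters=100):
    """Iterate V(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V(s')]."""
    n = len(rewards)
    V = np.zeros(n)
    for _ in range(iters):
        new_V = np.empty(n)
        for s in range(n):
            q = [rewards[s] + gamma * V[s]]                       # stay: reward of the square
            if s > 0:                                             # left: succeeds half the time
                q.append(0.5 * gamma * V[s - 1] + 0.5 * gamma * V[s])
            if s < n - 1:                                         # right: succeeds half the time
                q.append(0.5 * gamma * V[s + 1] + 0.5 * gamma * V[s])
            new_V[s] = max(q)
        V = new_V
    return V

# Hypothetical square rewards standing in for the numbers printed on the grid.
print(value_iteration([0, 1, 0, 10]))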

(a) Compute V1 (s)

(b) Compute V2 (s)


(c) Compute V∗(s).

Hint 1: Think about states where the optimal action is obvious and work from there.
Hint 2: Σ_{k=0}^{∞} a^k = 1/(1 − a), for 0 < a < 1.

(d) What is the optimal policy? Fill each state with the optimal action ("left", "right", or "stay") in the grid below.


6 Discount Shaping
In MDPs, we usually select the discount factor as a hyperparameter and then compute
the value function. However, although it is much less common, finding the
discount factor that produces a desired value can also be useful (e.g.,
reverse-engineering a real-life process). Consider a gridworld-like MDP with five
states s1, ..., s5. The agent can either stay (aS) or continue (aC). You are also
given the following transition kernel T(s, a, s′) and reward function R(s, a):

T(si, aS, si) = 1 for i ∈ {1, 2, 3, 4}
T(si, aC, si+1) = 1 for i ∈ {1, 2, 3, 4}
T(s5, a, s5) = 1 for all actions a
R(si, a) = 0 for i ∈ {1, 2, 3, 5} and for all actions a
R(s4, aS) = 0, R(s4, aC) = 10

If the desired optimal value of state s1 is V∗(s1) = 1, what is the discount factor γ?
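
A minimal numerical sketch of "reverse engineering" γ for a target value, using value iteration plus bisection; it assumes the kernel above (continue moves from si to si+1, and s5 is absorbing), and the function names are hypothetical.

import numpy as np

def v_star(gamma, iters=200):
    """Value iteration for the five-state chain above (stay vs. continue)."""
    R_continue = np.array([0.0, 0.0, 0.0, 10.0, 0.0])        # R(s_i, a_C)
    V = np.zeros(5)
    for _ in range(iters):
        stay = gamma * V                                      # R(s_i, a_S) = 0 everywhere
        cont = R_continue + gamma * np.append(V[1:], V[-1])   # s5 continues into itself
        V = np.maximum(stay, cont)
    return V

def find_gamma(target, lo=0.0, hi=1.0, tol=1e-9):
    """Bisection on gamma; V*(s1) is increasing in gamma for this chain."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if v_star(mid)[0] < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(find_gamma(1.0))   # gamma such that V*(s1) = 1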


7 Reinforcement Learning I
Imagine an unknown game which has only two states {A, B} and in each state
the agent has two actions to choose from: {Up, Down}. Suppose a game agent
chooses actions according to some policy π and generates the following sequence
of actions and rewards in this unknown game:

t    s_t    a_t     s_{t+1}    r_t
0    A      Down    B           8
1    B      Down    B          -4
2    B      Up      B           0
3    B      Up      A           2
4    A      Up      A          -2

Assume a discount factor γ = 0.5 and a learning rate α = 0.5.


(a) In model-based reinforcement learning, we first estimate the transition
function T(s, a, s′) and the reward function R(s, a, s′). Fill in the following
estimates of T and R, obtained from the experience above. Write "n/a"
if not applicable or undefined.

T̂(A, Up, A) =          R̂(A, Up, A) =

T̂(A, Up, B) =          R̂(A, Up, B) =

T̂(B, Up, A) =          R̂(B, Up, A) =

T̂(B, Up, B) =          R̂(B, Up, B) =


(b) Recall the update function of Q-learning is:

Q(st, at) ← (1 − α) Q(st, at) + α (rt + γ max_{a′} Q(st+1, a′))    (1)
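
For reference, a minimal sketch of applying update (1) step by step to a sequence of samples; the transitions in the example are hypothetical placeholders, not the table above.

from collections import defaultdict

def q_learning(experience, alpha=0.5, gamma=0.5, actions=("Up", "Down")):
    """Apply Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s',a')) per sample."""
    Q = defaultdict(float)                        # all Q-values start at 0
    for s, a, s_next, r in experience:
        target = r + gamma * max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return dict(Q)

# Hypothetical (state, action, next_state, reward) samples.
samples = [("A", "Up", "B", 1.0), ("B", "Down", "A", -1.0)]
print(q_learning(samples))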

Assume that all Q-values are initialized to 0. What are the following Q-values
learned by running Q-learning with the above experience sequence?

Q(A, Down) =          Q(B, Up) =


8 Approximate Q-Learning
Suppose that one weekend, you decide to go to an amusement park with your
friends. You start at the amusement park feeling well, receiving positive rewards
from rides, with some rides offering greater rewards than others. However, each
ride carries a risk of making you feel sick. If you continue to go on rides while
feeling sick, there is a chance you may recover, but the rewards you earn from
rides will likely be reduced and could even be negative.
You have no prior experience visiting amusement parks, so you are unsure
about the exact rewards you will receive from each ride (whether feeling well
or sick). Similarly, you do not know the probability of getting sick on any
particular ride or recovering while sick. What you do know about the rides is
summarized below:

Actions / Rides    Type             Wait     Speed
Big Dipper         Rollercoaster    Long     Fast
Wild Mouse         Rollercoaster    Short    Slow
Hair Raiser        Drop tower       Short    Fast
Moon Ranger        Pendulum         Short    Slow
Leave the Park     Leave            Short    Slow
Let's formulate this problem as a two-state MDP: Well and Sick. The actions
are the choice of which ride to take. The 'Leave the Park' action terminates the run.
Taking a ride can either transition to the same state with some probability or
take you to the other state. We'll use a feature-based approximation to the
Q-values, defined by the following four features and associated weights:

Features                                                              Initial Weights
f0(state, action) = 1 (a bias feature that is always 1)               w0 = 1
f1(state, action) = 1 if action type is Rollercoaster, 0 otherwise    w1 = 2
f2(state, action) = 1 if action wait is Short, 0 otherwise            w2 = 1
f3(state, action) = 1 if action speed is Fast, 0 otherwise            w3 = 0.5
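
For reference, a minimal sketch of feature-based Q-values and the weight update typically used in approximate Q-learning; the feature vectors, reward, and next-state actions in the example are hypothetical and deliberately not one of the rides above.

def q_value(weights, features):
    """Q(s, a) = sum_i w_i * f_i(s, a)."""
    return sum(w * f for w, f in zip(weights, features))

def q_update(weights, feats, reward, next_feats_per_action, alpha=0.5, gamma=0.5):
    """Typical update: difference = (r + gamma * max_a' Q(s', a')) - Q(s, a);
    then w_i <- w_i + alpha * difference * f_i(s, a)."""
    diff = (reward + gamma * max(q_value(weights, f) for f in next_feats_per_action)
            - q_value(weights, feats))
    return [w + alpha * diff * f for w, f in zip(weights, feats)]

# Hypothetical feature vectors in the order [f0, f1, f2, f3].
w = [1.0, 2.0, 1.0, 0.5]
some_action = [1, 0, 1, 0]
print(q_value(w, some_action))
print(q_update(w, some_action, -1.0, [[1, 1, 0, 1], [1, 0, 0, 0]]))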

(a) Calculate Q(Well, BigDipper).


(b) Apply a Q-learning update based on the sample (Well, BigDipper, Sick, −10.5),
using a learning rate of α = 0.5 and discount of γ = 0.5. What are the
new weights?

(c) Using our approximation, are the Q-values that involve the Sick state the
same as or different from the corresponding Q-values that involve the Well
state? In other words, is Q(Well, a) = Q(Sick, a) for any possible action
a? Why or why not? (In just one sentence.)

(d) Assume we have the original weights from the table above. What action
will an ϵ-greedy approach choose from the Well state? If multiple actions
could be chosen, give each action and its probability.


9 Deep RL - Value-based Methods


Answer the questions about Deep Q Networks (DQN), a value-based method,
below.

(a) Briefly explain the three key differences between (tabular) Q-learning and
DQN.

(b) We usually use deep neural networks to approximate the Q-function in
DQNs. In this context, which one of the three differences you explained
above do you think improves the performance? Explain your reasoning.

(c) The update frequency of the target network is an important hyperparameter
in the training of DQNs. Suppose that while training your agent,
you set the update frequency so that the target is updated very rarely, say every
2^64 steps. How do you think this would impact your training and agent
performance?
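
For intuition, a tiny sketch of the target-network pattern, where a linear function stands in for the deep Q-network and all names and numbers are hypothetical; the point is only that the TD target uses weights that are synced rarely.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2
online_w = rng.normal(size=(n_features, n_actions))   # updated every step
target_w = online_w.copy()                             # synced only occasionally
TARGET_UPDATE_EVERY = 1000                             # the hyperparameter in question
alpha, gamma = 0.01, 0.99

for step in range(5000):
    # Hypothetical transition (phi, a, r, phi_next) drawn at random.
    phi, phi_next = rng.normal(size=n_features), rng.normal(size=n_features)
    a, r = rng.integers(n_actions), rng.normal()
    td_target = r + gamma * np.max(phi_next @ target_w)   # target net, not the online net
    td_error = td_target - phi @ online_w[:, a]
    online_w[:, a] += alpha * td_error * phi               # gradient step on the online net
    if step % TARGET_UPDATE_EVERY == 0:                    # a very large interval means stale targets
        target_w = online_w.copy()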


10 Deep RL - Policy Gradient and Actor-Critic Methods
Answer the questions about policy gradient- and actor-critic-based deep RL
methods below.

(a) In the context of policy gradient methods in reinforcement learning, the
variance of gradient estimates can significantly affect the learning process.
One common technique to reduce variance in policy gradient methods
involves leveraging the principle of causality.
Explain how the concept of causality is utilized in policy gradient methods
to decrease the variability in the calculated gradients, either through
mathematical expressions or verbal explanation.

(b) In actor-critic reinforcement learning methods, the system is composed of
two main components: the actor, which dictates the policy πθ(a|s), and
the critic, which estimates the value function Vϕ(s). The two components
are parameterized by θ and ϕ, respectively.
How does the value function estimated by the critic implicitly serve as a
baseline to improve the gradient estimates for the actor's policy?
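
For reference, a minimal sketch of where reward-to-go returns and a value baseline typically enter a policy-gradient estimate; all names and numbers are hypothetical, and this is one common formulation rather than the specific one expected in your answer.

import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """G_t = sum_{k >= t} gamma^(k - t) r_k; rewards earned before step t are dropped."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

def pg_weights(rewards, values, gamma=0.99):
    """Per-step weights on grad log pi(a_t | s_t): reward-to-go minus the critic's baseline."""
    return reward_to_go(rewards, gamma) - np.asarray(values)

# Hypothetical trajectory: rewards and the critic's V(s_t) estimates.
rewards = [1.0, 0.0, 2.0]
values = [1.5, 1.0, 1.8]
print(pg_weights(rewards, values))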


(c) Compare the actor-critic and policy gradient methods in terms of bias and
variance. Fill in the table below with ’High’ or ’Low’ to indicate the level
of bias or variance you associate with each method and provide a brief
(∼1-2 sentence) explanation for each of your answers.

Method             Bias Level    Variance Level
Actor-Critic
Policy Gradient
