
CS234 Notes - Lecture 5

Value Function Approximation

Alex Jin, Emma Brunskill

March 20, 2018

7 Introduction
So far we have represented the value function by a lookup table, where each state has a corresponding
entry, V(s), or each state-action pair has an entry, Q(s, a). However, this approach may not generalize
well to problems with very large state and/or action spaces, and in other cases we may prefer to quickly
learn a good approximation rather than wait for the value of every state to converge. A popular approach
to address this problem is function approximation:

vπ(s) ≈ v̂(s, w)    or    qπ(s, a) ≈ q̂(s, a, w)

Here w is usually referred to as the parameter or weights of our function approximator. We list a few
popular choices for function approximators:

• Linear combinations of features

• Neural networks

• Decision trees

• Nearest neighbors

• Fourier / wavelet bases

We will further explore two popular classes of differentiable function approximators: linear feature
representations and neural networks. The reason for requiring a differentiable function is covered in
the following section.

8 Linear Feature Representations


In linear feature representations, we use a feature vector to represent a state:

x(s) = [x1(s)  x2(s)  ...  xn(s)]

We then approximate our value functions using a linear combination of features:


v̂(s, w) = x(s)^T w = Σ_{j=1}^{n} xj(s) wj

We define the (quadratic) objective (also known as the loss) function to be:

J(w) = Eπ[(vπ(s) − v̂(s, w))²]
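
As a concrete (hypothetical) illustration, the sketch below computes v̂(s, w) = x(s)^T w and one sample
of the squared error for a single state; the feature map and the target value are stand-ins, since vπ(s)
is unknown in practice.

import numpy as np

def features(state, n=4):
    # Hypothetical hand-crafted feature map x(s); here, simple polynomial features.
    return np.array([state ** j for j in range(n)], dtype=float)

w = np.zeros(4)                         # parameter / weight vector w
x_s = features(0.5)                     # x(s)
v_hat = x_s @ w                         # v̂(s, w) = x(s)^T w
v_target = 1.0                          # stand-in for v_pi(s), unknown in practice
sample_loss = (v_target - v_hat) ** 2   # one sample of (v_pi(s) - v̂(s, w))^2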

Figure 1: Visualization of gradient descent. We wish to reach the center point, where our objective
function is minimized. We do so by following the red arrows, which point in the negative direction of
the evaluated gradient.

8.1 Gradient descent


A common technique for minimizing the above objective function is gradient descent. Figure 1 provides
a visual illustration: we start at some particular point x0, corresponding to some initial value of our
parameters w; we then evaluate the gradient at x0, which tells us the direction of steepest increase of
the objective function; to minimize the objective, we take a step along the negative direction of the
gradient vector and arrive at x1; this process is repeated until we reach some convergence criterion.

Mathematically, this can be summarized as:

∇w J(w) = [∂J(w)/∂w1   ∂J(w)/∂w2   ...   ∂J(w)/∂wn]        compute the gradient

∆w = −(1/2) α ∇w J(w)                                       compute an update step using gradient descent

w ← w + ∆w                                                  take a step towards the local minimum
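
As a minimal sketch (not from the original notes), here is one batch gradient descent step for the
linear VFA, assuming for illustration that the true values vπ(s) are available on a batch of states;
in practice they are not, and the following sections replace them with sampled targets:

import numpy as np

def gradient_descent_step(w, X, v_true, alpha=0.1):
    """One batch gradient descent step on J(w) = E[(v_pi(s) - x(s)^T w)^2].

    X      : (N, n) matrix whose rows are the feature vectors x(s)
    v_true : (N,) stand-in for the true values v_pi(s) (illustration only)
    """
    v_hat = X @ w                              # predictions v̂(s, w) = x(s)^T w
    error = v_true - v_hat                     # v_pi(s) - v̂(s, w)
    grad_J = -2.0 * X.T @ error / len(X)       # ∇_w J(w) for the quadratic objective
    return w - 0.5 * alpha * grad_J            # w <- w + ∆w, with ∆w = -(1/2) α ∇_w J(w)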

8.2 Stochastic gradient descent


In practice, gradient descent is not considered a sample-efficient optimizer, and stochastic gradient
descent (SGD) is used more often. Although the original SGD algorithm refers to updating the weights
using a single sample, due to the convenience of vectorization people often refer to gradient descent
on a minibatch of samples as SGD as well. In (minibatch) SGD, we sample a minibatch of past experiences,
compute our objective function on that minibatch, and update our parameters by taking a gradient step
on the minibatch objective. Let us now go back to several algorithms we covered in previous lectures
and see how value function approximation can be incorporated.
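
Before revisiting those algorithms, here is a minimal sketch of such a minibatch SGD loop on the
quadratic objective, again with stand-in targets (in the algorithms below, these become sampled
returns or TD targets):

import numpy as np

def sgd(w, X, targets, alpha=0.01, batch_size=32, steps=1000, seed=0):
    """Minibatch SGD on the quadratic VFA objective (targets are stand-ins)."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(X), size=batch_size)      # sample a minibatch of experiences
        X_b, t_b = X[idx], targets[idx]
        error = t_b - X_b @ w                          # target - v̂(s, w) on the minibatch
        w = w + alpha * X_b.T @ error / batch_size     # gradient step on the minibatch objective
    return w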

8.3 Monte Carlo with linear VFA


Algorithm 1 is a modification of First-Visit Monte Carlo Policy Evaluation in which we replace the
tabular value function with our linear VFA. We also note that although the return, Return(st), is an
unbiased estimate of vπ(st), it is often very noisy.

Algorithm 1 Monte Carlo Linear Value Function Approximation for Policy Evaluation

1: Initialize w = 0, Returns(s) = 0 ∀s, k = 1
2: loop
3:    Sample the k-th episode (sk1, ak1, rk1, sk2, . . . , sk,Lk) given π
4:    for t = 1, . . . , Lk do
5:       if first visit to st in episode k then
6:          Append Σ_{j=t}^{Lk} rkj to Return(st)
7:          w ← w + α(Return(st) − v̂(st, w)) x(st)
8:    k = k + 1

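A minimal Python sketch of Algorithm 1 is given below; it assumes episodes are supplied as lists of
(state, action, reward) tuples, states are hashable, and features is a hypothetical feature map, none
of which come from the original notes.

import numpy as np

def mc_linear_vfa(episodes, features, n_features, alpha=0.01):
    """First-visit Monte Carlo policy evaluation with a linear VFA (sketch)."""
    w = np.zeros(n_features)
    for episode in episodes:                        # each episode: [(s, a, r), ...]
        rewards = [r for (_, _, r) in episode]
        returns = np.cumsum(rewards[::-1])[::-1]    # Return(s_t) = sum of rewards from t onward
        visited = set()
        for t, (s, _, _) in enumerate(episode):
            if s in visited:                        # first-visit check
                continue
            visited.add(s)
            x = features(s)
            w = w + alpha * (returns[t] - x @ w) * x    # line 7 of Algorithm 1
    return w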

8.4 Temporal Difference (TD(0)) with linear VFA


Recall that in the tabular setting, we estimate Vπ via bootstrapping and sampling, and we update
Vπ(s) by
Vπ(s) ← Vπ(s) + α(r + γVπ(s′) − Vπ(s))

where r + γVπ(s′) represents our target. Here we use the same VFA for evaluating both our target and
our current value estimate. In later lectures we will consider techniques that use different VFAs
(such as in the control setting).

Using a VFA, we replace Vπ with V̂π, and the update is now applied to the weights:

w ← w + α(r + γ V̂π(s′, w) − V̂π(s, w)) ∇w V̂π(s, w)
  = w + α(r + γ V̂π(s′, w) − V̂π(s, w)) x(s)

With value function approximation, although our target is a biased and approximate estimate of the
true value Vπ(s), linear TD(0) still converges to (a point close to) the global optimum. We state this
result more precisely below.
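
A minimal sketch of a single TD(0) update with a linear VFA (hypothetical features helper; terminal
states handled via a done flag, which the notes leave implicit):

import numpy as np

def td0_linear_update(w, s, r, s_next, features, alpha=0.01, gamma=0.99, done=False):
    """One TD(0) update: w <- w + α (r + γ v̂(s', w) - v̂(s, w)) x(s)."""
    x = features(s)
    v_next = 0.0 if done else features(s_next) @ w    # bootstrapped value of the next state
    td_error = r + gamma * v_next - x @ w             # target minus current estimate
    return w + alpha * td_error * x                   # ∇_w v̂(s, w) = x(s) for a linear VFA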

8.5 Convergence Guarantees for Linear VFA for Policy Evaluation


We define the mean squared value error of a linear VFA for a particular policy π, relative to the true
value function, as:

MSVE(w) = Σ_{s∈S} d(s) (vπ(s) − v̂π(s, w))²

where d(s) is the stationary distribution of states under policy π in the true decision process and
v̂π(s, w) = x(s)^T w is our linear VFA.
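
As a direct transcription of this definition (only computable in small illustrative problems where d
and vπ can be tabulated; the dictionary inputs are assumptions of the sketch):

def msve(w, states, d, v_pi, features):
    """MSVE(w) = Σ_s d(s) (v_pi(s) - x(s)^T w)^2, for tabulated d and v_pi."""
    return sum(d[s] * (v_pi[s] - features(s) @ w) ** 2 for s in states)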
Theorem 8.1. Monte Carlo policy evaluation with VFA converges to the weights wMC with minimum mean
squared value error:

MSVE(wMC) = min_w Σ_{s∈S} d(s) (vπ(s) − v̂π(s, w))²

Theorem 8.2. TD(0) policy evaluation with VFA converges to weights wTD that are within a constant
expansion factor of the minimum mean squared value error:

MSVE(wTD) ≤ (1 / (1 − γ)) min_w Σ_{s∈S} d(s) (vπ(s) − v̂π(s, w))²

We omit the proofs here and encourage interested readers to look at [1] for a more in-depth discussion.

Algorithm              Tabular   Linear VFA   Nonlinear VFA
Monte-Carlo Control    Yes       (Yes)        No
SARSA                  Yes       (Yes)        No
Q-learning             Yes       No           No

Table 1: Summary of convergence of control methods with VFA. (Yes) means the result chatters around
the near-optimal value function.

8.6 Control using VFA


Similar to state-value VFAs, we can also use function approximators for action-values. That is, we let
q̂(s, a, w) ≈ qπ(s, a). We may then interleave policy evaluation, approximating with q̂(s, a, w), and
policy improvement, using ε-greedy policy improvement. To be more concrete, let us write this out
mathematically.

First, we define our objective function J(w) as

J(w) = Eπ[(qπ(s, a) − q̂(s, a, w))²]

Similar to what we did earlier, we may then use either gradient descent or stochastic gradient descent
to minimize the objective function. For example, for a linear action-value function approximator, this
can be summarized as:

x(s, a) = [x1(s, a)  x2(s, a)  ...  xn(s, a)]                       state-action features

q̂(s, a, w) = x(s, a)^T w                                            represent the action-value as a linear combination of features

J(w) = Eπ[(qπ(s, a) − q̂(s, a, w))²]                                 define the objective function

−(1/2) ∇w J(w) = Eπ[(qπ(s, a) − q̂(s, a, w)) ∇w q̂(s, a, w)]          compute the gradient

∆w = −(1/2) α ∇w J(w)                                               compute an update step (using a single sample of the gradient)
   = α (qπ(s, a) − q̂(s, a, w)) x(s, a)

w ← w + ∆w                                                          take a step towards the local minimum

For Monte Carlo methods, we substitute our target qπ (s, a) with a return Gt . That is, our update
becomes:
∆w = α(Gt − q̂(s, a, w))∇w q̂(s, a, w)

For SARSA, we substitute our target with a TD target:

∆w = α(r + γ q̂(s′, a′, w) − q̂(s, a, w)) ∇w q̂(s, a, w)

For Q-learning, we substitute our target with a max TD target:

∆w = α(r + γ max_{a′} q̂(s′, a′, w) − q̂(s, a, w)) ∇w q̂(s, a, w)
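
A minimal sketch of these two control updates with a linear action-value approximator; features(s, a)
is a hypothetical state-action feature map, and terminal-state handling is omitted for brevity:

import numpy as np

def sarsa_update(w, s, a, r, s_next, a_next, features, alpha=0.01, gamma=0.99):
    """SARSA: the TD target uses the action actually taken at the next state."""
    x = features(s, a)
    target = r + gamma * (features(s_next, a_next) @ w)
    return w + alpha * (target - x @ w) * x             # ∇_w q̂(s, a, w) = x(s, a)

def q_learning_update(w, s, a, r, s_next, actions, features, alpha=0.01, gamma=0.99):
    """Q-learning: the TD target maximizes over the next actions."""
    x = features(s, a)
    q_next = max(features(s_next, a2) @ w for a2 in actions)
    return w + alpha * (r + gamma * q_next - x @ w) * x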

We note that because we use value function approximation, which can be an expansion, to carry out our
Bellman backup, convergence is no longer guaranteed. We refer readers to Baird's counterexample for a
more concrete illustration. We summarize the convergence guarantees in Table 1.

Figure 2: A generic feedforward neural network with four input units, two output units, and two
hidden layers.

9 Neural Networks
Although linear VFAs often work well given the right set of features, it can also be dicult to hand-
craft such feature set. Neural Networks provide a much richer function approximation class that is
able to directly go from states without requiring an explicit specication of features.

Figure 2 shows a generic feedforward ANN. The network in the figure has an output layer consisting of
two output units, an input layer with four input units, and two hidden layers: layers that are neither
input nor output layers. A real-valued weight is associated with each link. The units are typically
semi-linear, meaning that they compute a weighted sum of their input signals and then apply a nonlinear
function to the result, usually referred to as the activation function. In assignment 2, we studied how
neural networks with even a single hidden layer have the universal approximation property; nevertheless,
both experience and theory show that approximating the complex functions we need is made much easier
with a hierarchical composition of multiple hidden layers.
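
As a rough sketch of the kind of function such a network computes — two hidden layers with a ReLU
activation, parameters passed in as plain arrays (in practice a deep learning library would manage the
parameters and compute gradients):

import numpy as np

def relu(z):
    # Nonlinear activation applied after each weighted sum.
    return np.maximum(0.0, z)

def mlp_value(state, params):
    """Forward pass of a feedforward VFA with two hidden layers (sketch).

    params is ((W1, b1), (W2, b2), (W3, b3)) with compatible shapes."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h = relu(W1 @ np.asarray(state, dtype=float) + b1)   # first hidden layer
    h = relu(W2 @ h + b2)                                # second hidden layer
    return W3 @ h + b3                                   # output units: estimated value(s)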

In the next lecture, we will take a closer look at some of the theory and recent results on applying
neural networks to reinforcement learning problems.

References
[1] Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation.
1997.
