Lecture Notes 5
7 Introduction
So far we have represented the value function by a lookup table where each state has a corresponding entry, $V(s)$, or each state-action pair has an entry, $Q(s, a)$. However, this approach may not generalize well to problems with very large state and/or action spaces, and in other cases we may prefer to quickly learn a good approximation rather than wait for the value of each state to converge. A popular approach to address this problem is function approximation:

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s) \qquad \text{or} \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

Here $\mathbf{w}$ is usually referred to as the parameters or weights of our function approximator. We list a few popular choices for function approximators:
• Neural networks
• Decision trees
• Nearest neighbors
We will further explore two popular classes of differentiable function approximators: linear feature representations and neural networks. The reason for demanding a differentiable function is covered in the following section.
We define the (quadratic) objective (also known as the loss) function to be:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[(v_\pi(s) - \hat{v}(s, \mathbf{w}))^2\right]$$
Figure 1: Visualization of gradient descent. We wish to reach the center point, where our objective function is minimized. We do so by following the red arrows, which point in the direction of the negative of the evaluated gradient.
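For a linear VFA, $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$, a sketch of the stochastic gradient descent update on this objective, written with the true value $v_\pi(s)$ as an oracle target that the algorithms below replace with sampled estimates, is:

$$-\frac{1}{2}\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbb{E}_\pi\left[(v_\pi(s) - \hat{v}(s, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{v}(s, \mathbf{w})\right], \qquad \Delta\mathbf{w} = \alpha\,(v_\pi(s) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s)$$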
Algorithm 1 Monte Carlo Linear Value Function Approximation for Policy Evaluation
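A minimal Python sketch of this procedure, assuming a hypothetical sample_episode() helper that returns one episode's (state, reward) pairs generated by following the policy being evaluated, and a feature map x(s) returning a NumPy vector:

import numpy as np

def mc_linear_vfa_evaluation(sample_episode, x, n_features,
                             alpha=0.01, gamma=1.0, n_episodes=1000):
    """Monte Carlo policy evaluation with a linear VFA, v_hat(s, w) = x(s)^T w.

    sample_episode() -> list of (state, reward) pairs, i.e. [(s_0, r_1), (s_1, r_2), ...]
    x(s)             -> feature vector of state s, shape (n_features,)
    """
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = sample_episode()
        G = 0.0
        # Walk the episode backwards so the return G_t accumulates incrementally.
        for s, r in reversed(episode):
            G = r + gamma * G
            v_hat = x(s) @ w                 # current estimate v_hat(s, w)
            w += alpha * (G - v_hat) * x(s)  # SGD step on (G_t - v_hat(s, w))^2
    return w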
For temporal difference methods, we may instead use the TD(0) update

$$\Delta\mathbf{w} = \alpha\,(r + \gamma\,\hat{v}(s', \mathbf{w}) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s),$$

where $r + \gamma\,\hat{v}(s', \mathbf{w})$ represents our target. Here we use the same VFA for evaluating both our target and our current value. In later lectures we will consider techniques for using different VFAs (such as in the control setting).
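A corresponding sketch of a single TD(0) update step with a linear VFA, assuming the same illustrative feature map x(s) as above and NumPy arrays for w and x(s):

def td0_linear_update(w, x, s, r, s_next, alpha=0.01, gamma=0.99, done=False):
    """One TD(0) weight update for a linear VFA, v_hat(s, w) = x(s)^T w."""
    target = r if done else r + gamma * (x(s_next) @ w)  # bootstrap from the same VFA
    td_error = target - x(s) @ w
    return w + alpha * td_error * x(s)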
In value function approximation, although our target is a biased and approximate estimate of the true value $V^\pi(s)$, linear TD(0) will still converge to a point close to the global optimum. We state this result more formally below.
Define the mean squared value error of a weight vector $\mathbf{w}$ under policy $\pi$ as

$$MSVE(\mathbf{w}) = \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2,$$

where $d(s)$ is the stationary distribution of states under policy $\pi$ in the true decision process and $\hat{v}_\pi(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$ is our linear VFA.
Theorem 8.1. Monte Carlo policy evaluation with VFA converges to the weights $\mathbf{w}_{MC}$ with minimum mean squared value error:

$$MSVE(\mathbf{w}_{MC}) = \min_{\mathbf{w}} \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2$$
Theorem 8.2. TD(0) policy evaluation with VFA converges to weights $\mathbf{w}_{TD}$ whose error is within a constant factor of the minimum mean squared value error:

$$MSVE(\mathbf{w}_{TD}) \le \frac{1}{1 - \gamma} \min_{\mathbf{w}} \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2$$
We omit the proofs here and encourage interested readers to look at [1] for a more in-depth discussion.
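As a small numerical illustration of the MSVE objective on a finite state space, here is a sketch assuming illustrative arrays d (stationary distribution), v_pi (true values), and a feature matrix X whose rows are the feature vectors x(s):

import numpy as np

def msve(w, d, v_pi, X):
    """Mean squared value error: sum_s d(s) * (v_pi(s) - x(s)^T w)^2."""
    return float(np.sum(d * (v_pi - X @ w) ** 2))

def min_msve_weights(d, v_pi, X):
    """The minimizing weights in the theorems above: a d-weighted least-squares fit."""
    sqrt_d = np.sqrt(d)
    w_star, *_ = np.linalg.lstsq(sqrt_d[:, None] * X, sqrt_d * v_pi, rcond=None)
    return w_star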
Algorithm              Tabular    Linear VFA    Nonlinear VFA
Monte-Carlo Control    Yes        (Yes)         No
SARSA                  Yes        (Yes)         No
Q-learning             Yes        No            No

Table 1: Summary of convergence of control methods with VFA. (Yes) means the result chatters around a near-optimal value function.
Similar to what we did earlier, we may then use either gradient descent or stochastic gradient descent
to minimize the objective function. For example, for a linear action-value function approximator, this
can be summarized as:
\begin{align*}
\mathbf{x}(s, a) &= [x_1(s, a) \;\; x_2(s, a) \;\; \dots \;\; x_n(s, a)] && \text{state-action features} \\
-\tfrac{1}{2}\nabla_{\mathbf{w}} J(\mathbf{w}) &= \mathbb{E}_\pi\big[(q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})\big] && \text{compute the gradient} \\
&= (q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\mathbf{x}(s, a) && \\
\Delta\mathbf{w} &= -\tfrac{1}{2}\alpha\,\nabla_{\mathbf{w}} J(\mathbf{w}) && \text{compute an update step using gradient descent} \\
&= \alpha\,(q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\mathbf{x}(s, a) && \\
\mathbf{w} &\leftarrow \mathbf{w} + \Delta\mathbf{w} && \text{take a step towards the local minimum}
\end{align*}
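In code, this update amounts to a one-line weight change once a target for $q_\pi(s, a)$ is available. A minimal sketch, assuming a hypothetical state-action feature map x_sa(s, a) returning a NumPy vector:

def linear_q_update(w, x_sa, s, a, target, alpha=0.01):
    """One SGD step for a linear action-value approximator, q_hat(s, a, w) = x(s, a)^T w.

    `target` stands in for q_pi(s, a); Monte Carlo and Q-learning differ only in
    how this target is computed (see the substitutions below).
    """
    features = x_sa(s, a)
    q_hat = features @ w
    return w + alpha * (target - q_hat) * features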
For Monte Carlo methods, we substitute our target $q_\pi(s, a)$ with a return $G_t$. That is, our update becomes:

$$\Delta\mathbf{w} = \alpha\,(G_t - \hat{q}(s, a, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})$$
Similarly, for Q-learning we substitute the target with the TD target $r + \gamma \max_{a'} \hat{q}(s', a', \mathbf{w})$, giving the update:

$$\Delta\mathbf{w} = \alpha\,\big(r + \gamma \max_{a'} \hat{q}(s', a', \mathbf{w}) - \hat{q}(s, a, \mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})$$
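Putting the pieces together, a sketch of Q-learning with a linear VFA and epsilon-greedy exploration. The environment interface (env.reset() -> s, env.step(a) -> (s', r, done)), the finite action list actions, and the feature map x_sa are illustrative assumptions, not a specific library's API:

import numpy as np

def q_learning_linear(env, actions, x_sa, n_features,
                      alpha=0.01, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Q-learning with a linear action-value approximator and epsilon-greedy exploration."""
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)

    def q(s, a):
        return x_sa(s, a) @ w

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = actions[rng.integers(len(actions))]
            else:
                a = max(actions, key=lambda act: q(s, act))
            s_next, r, done = env.step(a)
            # TD target uses a max over next actions (zero at terminal states).
            target = r if done else r + gamma * max(q(s_next, act) for act in actions)
            w += alpha * (target - q(s, a)) * x_sa(s, a)
            s = s_next
    return w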
We note that because our value function approximation, which can be an expansion rather than a contraction, is used to carry out the Bellman backup, convergence is not guaranteed. We refer interested readers to Baird's counterexample for a more concrete illustration. We summarize the convergence guarantees in Table 1.
Figure 2: A generic feedforward neural network with four input units, two output units, and two
hidden layers.
9 Neural Networks
Although linear VFAs often work well given the right set of features, it can also be difficult to hand-craft such a feature set. Neural networks provide a much richer function approximation class that is able to work directly from states without requiring an explicit specification of features.
Figure 2 shows a generic feedforward ANN. The network in the figure has an output layer consisting of two output units, an input layer with four input units, and two hidden layers: layers that are neither input nor output layers. A real-valued weight is associated with each link. The units are typically semi-linear, meaning that they compute a weighted sum of their input signals and then apply a nonlinear function, usually referred to as the activation function, to the result. Although we studied in Assignment 2 how neural networks with a single hidden layer can have the universal approximation property, both experience and theory show that approximating the complex functions we need is made much easier with a hierarchical composition of multiple hidden layers.
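As a small illustration of the semi-linear units described above, here is a minimal sketch of a forward pass through a network shaped like the one in Figure 2 (four inputs, two hidden layers, two outputs); the hidden-layer width and the tanh activation are illustrative choices:

import numpy as np

def forward(x, params, activation=np.tanh):
    """Forward pass of a fully connected feedforward network.

    params is a list of (W, b) pairs, one per layer; each unit computes a
    weighted sum of its inputs plus a bias, then applies the activation function.
    The final layer is left linear, as is common for value prediction.
    """
    h = x
    for W, b in params[:-1]:
        h = activation(W @ h + b)
    W_out, b_out = params[-1]
    return W_out @ h + b_out

# Example: 4 inputs -> two hidden layers of 8 units -> 2 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
params = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(forward(np.ones(4), params))  # two real-valued outputs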
In the next lecture, we will take a closer look at some theory and recent results of neural networks
being applied to solve reinforcement learning problems.
References
[1] Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997.