Lecture Notes 5
7 Introduction
So far we have represented the value function by a lookup table where each state has a corresponding entry, $V(s)$, or each state-action pair has an entry, $Q(s, a)$. However, this approach may not generalize well to problems with very large state and/or action spaces, and in other cases we may prefer to quickly learn a good approximation rather than wait for the value of each state to converge. A popular approach to address this problem is function approximation:

$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s) \qquad \text{or} \qquad \hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

Here $\mathbf{w}$ is usually referred to as the parameters or weights of our function approximator. We list a few popular choices for function approximators:
• Neural networks
• Decision trees
• Nearest neighbors
We will further explore two popular classes of differentiable function approximators: linear feature representations and neural networks. The reason for demanding a differentiable function is covered in the following section.
We define the (quadratic) objective (also known as the loss) function to be:

$$J(\mathbf{w}) = \mathbb{E}_\pi\left[(v_\pi(s) - \hat{v}(s, \mathbf{w}))^2\right]$$
Figure 1: Visualization of gradient descent. We wish to reach the center point, where our objective function is minimized. We do so by following the red arrows, which point in the direction of the negative of the evaluated gradient.
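For a linear VFA, $\hat{v}(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$, a sketch of the stochastic gradient descent update on this objective, written with the true value $v_\pi(s)$ as an oracle target that the algorithms below replace with sampled estimates, is:

$$-\frac{1}{2}\nabla_{\mathbf{w}} J(\mathbf{w}) = \mathbb{E}_\pi\left[(v_\pi(s) - \hat{v}(s, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{v}(s, \mathbf{w})\right], \qquad \Delta\mathbf{w} = \alpha\,(v_\pi(s) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s)$$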
Algorithm 1 Monte Carlo Linear Value Function Approximation for Policy Evaluation
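A minimal Python sketch of this procedure, assuming a hypothetical sample_episode() helper that returns one episode's (state, reward) pairs generated by following the policy being evaluated, and a feature map x(s) returning a NumPy vector:

import numpy as np

def mc_linear_vfa_evaluation(sample_episode, x, n_features,
                             alpha=0.01, gamma=1.0, n_episodes=1000):
    """Monte Carlo policy evaluation with a linear VFA, v_hat(s, w) = x(s)^T w.

    sample_episode() -> list of (state, reward) pairs, i.e. [(s_0, r_1), (s_1, r_2), ...]
    x(s)             -> feature vector of state s, shape (n_features,)
    """
    w = np.zeros(n_features)
    for _ in range(n_episodes):
        episode = sample_episode()
        G = 0.0
        # Walk the episode backwards so the return G_t accumulates incrementally.
        for s, r in reversed(episode):
            G = r + gamma * G
            v_hat = x(s) @ w                 # current estimate v_hat(s, w)
            w += alpha * (G - v_hat) * x(s)  # SGD step on (G_t - v_hat(s, w))^2
    return w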
For temporal difference methods, we may instead use the TD(0) update

$$\Delta\mathbf{w} = \alpha\,(r + \gamma\,\hat{v}(s', \mathbf{w}) - \hat{v}(s, \mathbf{w}))\,\mathbf{x}(s),$$

where $r + \gamma\,\hat{v}(s', \mathbf{w})$ represents our target. Here we use the same VFA for evaluating both our target and our current value. In later lectures we will consider techniques for using different VFAs (such as in the control setting).
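A corresponding sketch of a single TD(0) update step with a linear VFA, assuming the same illustrative feature map x(s) as above and NumPy arrays for w and x(s):

def td0_linear_update(w, x, s, r, s_next, alpha=0.01, gamma=0.99, done=False):
    """One TD(0) weight update for a linear VFA, v_hat(s, w) = x(s)^T w."""
    target = r if done else r + gamma * (x(s_next) @ w)  # bootstrap from the same VFA
    td_error = target - x(s) @ w
    return w + alpha * td_error * x(s)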
In value function approximation, although our target is a biased and approximate estimate of the true value $V^\pi(s)$, linear TD(0) will still converge to a point close to the global optimum. We state this result more formally below.
Define the mean squared value error of a weight vector $\mathbf{w}$ under policy $\pi$ as

$$MSVE(\mathbf{w}) = \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2,$$

where $d(s)$ is the stationary distribution of states under policy $\pi$ in the true decision process and $\hat{v}_\pi(s, \mathbf{w}) = \mathbf{x}(s)^\top \mathbf{w}$ is our linear VFA.
Theorem 8.1. Monte Carlo policy evaluation with VFA converges to the weights $\mathbf{w}_{MC}$ with minimum mean squared value error:

$$MSVE(\mathbf{w}_{MC}) = \min_{\mathbf{w}} \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2$$
Theorem 8.2. TD(0) policy evaluation with VFA converges to weights $\mathbf{w}_{TD}$ whose error is within a constant factor of the minimum mean squared value error:

$$MSVE(\mathbf{w}_{TD}) \le \frac{1}{1 - \gamma} \min_{\mathbf{w}} \sum_{s \in S} d(s)\,(v_\pi(s) - \hat{v}_\pi(s, \mathbf{w}))^2$$
We omit the proofs here and encourage interested readers to look at [1] for a more in-depth discussion.
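As a small numerical illustration of the MSVE objective on a finite state space, here is a sketch assuming illustrative arrays d (stationary distribution), v_pi (true values), and a feature matrix X whose rows are the feature vectors x(s):

import numpy as np

def msve(w, d, v_pi, X):
    """Mean squared value error: sum_s d(s) * (v_pi(s) - x(s)^T w)^2."""
    return float(np.sum(d * (v_pi - X @ w) ** 2))

def min_msve_weights(d, v_pi, X):
    """The minimizing weights in the theorems above: a d-weighted least-squares fit."""
    sqrt_d = np.sqrt(d)
    w_star, *_ = np.linalg.lstsq(sqrt_d[:, None] * X, sqrt_d * v_pi, rcond=None)
    return w_star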
Algorithm              Tabular    Linear VFA    Nonlinear VFA
Monte-Carlo Control    Yes        (Yes)         No
SARSA                  Yes        (Yes)         No
Q-learning             Yes        No            No

Table 1: Summary of convergence of control methods with VFA. (Yes) means the result chatters around a near-optimal value function.
Similar to what we did earlier, we may then use either gradient descent or stochastic gradient descent
to minimize the objective function. For example, for a linear action-value function approximator, this
can be summarized as:
\begin{align*}
\mathbf{x}(s, a) &= [x_1(s, a) \;\; x_2(s, a) \;\; \dots \;\; x_n(s, a)] && \text{state-action features} \\
-\tfrac{1}{2}\nabla_{\mathbf{w}} J(\mathbf{w}) &= \mathbb{E}_\pi\big[(q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})\big] && \text{compute the gradient} \\
&= (q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\mathbf{x}(s, a) && \\
\Delta\mathbf{w} &= -\tfrac{1}{2}\alpha\,\nabla_{\mathbf{w}} J(\mathbf{w}) && \text{compute an update step using gradient descent} \\
&= \alpha\,(q_\pi(s, a) - \hat{q}(s, a, \mathbf{w}))\,\mathbf{x}(s, a) && \\
\mathbf{w} &\leftarrow \mathbf{w} + \Delta\mathbf{w} && \text{take a step towards the local minimum}
\end{align*}
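In code, this update amounts to a one-line weight change once a target for $q_\pi(s, a)$ is available. A minimal sketch, assuming a hypothetical state-action feature map x_sa(s, a) returning a NumPy vector:

def linear_q_update(w, x_sa, s, a, target, alpha=0.01):
    """One SGD step for a linear action-value approximator, q_hat(s, a, w) = x(s, a)^T w.

    `target` stands in for q_pi(s, a); Monte Carlo and Q-learning differ only in
    how this target is computed (see the substitutions below).
    """
    features = x_sa(s, a)
    q_hat = features @ w
    return w + alpha * (target - q_hat) * features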
For Monte Carlo methods, we substitute our target $q_\pi(s, a)$ with a return $G_t$. That is, our update becomes:

$$\Delta\mathbf{w} = \alpha\,(G_t - \hat{q}(s, a, \mathbf{w}))\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})$$
Similarly, for Q-learning we substitute the target with the TD target $r + \gamma \max_{a'} \hat{q}(s', a', \mathbf{w})$, giving the update:

$$\Delta\mathbf{w} = \alpha\,\big(r + \gamma \max_{a'} \hat{q}(s', a', \mathbf{w}) - \hat{q}(s, a, \mathbf{w})\big)\,\nabla_{\mathbf{w}}\hat{q}(s, a, \mathbf{w})$$
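Putting the pieces together, a sketch of Q-learning with a linear VFA and epsilon-greedy exploration. The environment interface (env.reset() -> s, env.step(a) -> (s', r, done)), the finite action list actions, and the feature map x_sa are illustrative assumptions, not a specific library's API:

import numpy as np

def q_learning_linear(env, actions, x_sa, n_features,
                      alpha=0.01, gamma=0.99, epsilon=0.1, n_episodes=500):
    """Q-learning with a linear action-value approximator and epsilon-greedy exploration."""
    rng = np.random.default_rng(0)
    w = np.zeros(n_features)

    def q(s, a):
        return x_sa(s, a) @ w

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = actions[rng.integers(len(actions))]
            else:
                a = max(actions, key=lambda act: q(s, act))
            s_next, r, done = env.step(a)
            # TD target uses a max over next actions (zero at terminal states).
            target = r if done else r + gamma * max(q(s_next, act) for act in actions)
            w += alpha * (target - q(s, a)) * x_sa(s, a)
            s = s_next
    return w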
We note that because our value function approximation, which can be an expansion rather than a contraction, is used to carry out the Bellman backup, convergence is not guaranteed. We refer interested readers to Baird's counterexample for a more concrete illustration. We summarize the convergence guarantees in Table 1.
Figure 2: A generic feedforward neural network with four input units, two output units, and two
hidden layers.
9 Neural Networks
Although linear VFAs often work well given the right set of features, it can also be difficult to hand-craft such a feature set. Neural networks provide a much richer function approximation class that is able to work directly from states without requiring an explicit specification of features.
Figure 2 shows a generic feedforward ANN. The network in the figure has an output layer consisting of two output units, an input layer with four input units, and two hidden layers: layers that are neither input nor output layers. A real-valued weight is associated with each link. The units are typically semi-linear, meaning that they compute a weighted sum of their input signals and then apply a nonlinear function, usually referred to as the activation function, to the result. Although we studied in Assignment 2 how neural networks with a single hidden layer can have the universal approximation property, both experience and theory show that approximating the complex functions we need is made much easier with a hierarchical composition of multiple hidden layers.
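As a small illustration of the semi-linear units described above, here is a minimal sketch of a forward pass through a network shaped like the one in Figure 2 (four inputs, two hidden layers, two outputs); the hidden-layer width and the tanh activation are illustrative choices:

import numpy as np

def forward(x, params, activation=np.tanh):
    """Forward pass of a fully connected feedforward network.

    params is a list of (W, b) pairs, one per layer; each unit computes a
    weighted sum of its inputs plus a bias, then applies the activation function.
    The final layer is left linear, as is common for value prediction.
    """
    h = x
    for W, b in params[:-1]:
        h = activation(W @ h + b)
    W_out, b_out = params[-1]
    return W_out @ h + b_out

# Example: 4 inputs -> two hidden layers of 8 units -> 2 outputs.
rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]
params = [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
print(forward(np.ones(4), params))  # two real-valued outputs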
In the next lecture, we will take a closer look at some theory and recent results of neural networks
being applied to solve reinforcement learning problems.
References
[1] Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997.