
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMME


Course Handout

Part A: Content Design


Course Title Deep Reinforcement Learning

Course No(s)

Credit Units 4

Credit Model 2 - 0.5 - 1.5


1 unit for classroom hours, 0.5 unit for Tutorial, 1.5 units
for Student preparation. 1 unit = 32 hours

Content Authors S.P. Vimal

Version 1.0

Date 17 May 2023

Course Objectives

CO1: Understand
a. the conceptual and mathematical foundations of deep reinforcement learning
b. various classic & state-of-the-art Deep Reinforcement Learning algorithms
CO2: Implement and evaluate deep reinforcement learning solutions to problems such as
planning, control, and decision making in various domains
CO3: Provide conceptual, mathematical, and practical exposure to DRL
c. to understand the recent developments in deep reinforcement learning and
d. to enable modeling new problems as DRL problems.

Text Book(s)
T1 Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto,
Second Edition, MIT Press

T2 Foundations of Deep Reinforcement Learning: Theory and Practice in Python
(Addison-Wesley Data & Analytics Series), 1st Edition, by Laura Graesser and Wah
Loon Keng
Content Structure

1. Introduction: Introducing RL
1.1. Introduction to Reinforcement Learning (RL); Examples; Elements of
Reinforcement Learning (Policy, Reward, Value, Model of the
environment) & their characteristics; Example: RL for Tic-Tac-Toe;
Historical Background;
1.2. Multi-armed Bandit Problem - Motivation and Problem Statement;
Incremental solution to the stationary & non-stationary MAB problems;
Exploration vs. Exploitation tradeoff; Bandit Gradient Algorithm as
Stochastic Gradient Ascent; Associative Search
2. MDP: Framework
2.1. (Finite) Markov Decision Processes: Modelling Agent-Environment
interaction using MDP; Examples; Discussion on Goals,
2.2. Rewards & Returns; Policy and Value Functions;
2.3. Bellman Equation for value functions (a compact form is given after this outline);
2.4. Optimal Policy and Optimal Value functions;
3. Approaches to Solving Reinforcement Problems
3.1. Dynamic Programming Solution (Policy Iteration; Value Iteration;
Generalized policy iteration; Efficiency of Dynamic Programming)
3.2. Monte Carlo (MC) Methods (MC prediction, MC control, incremental MC.)

[ Mid-Semester Exam ]

3.3. Temporal-Difference (TD) Learning


3.4. Discussion on Other Classic Approaches that combine 3.1, 3.2, 3.3
4. Discussion on the classification of (Deep) Reinforcement Learning Approaches,
Algorithms, and Applications
4.1. Model-Based vs. Model Free;
4.2. Value-based vs. Policy-Based;
4.3. On-Policy vs. Off-Policy;
4.4. Deep Learning as a Function Approximator and Review of Related Literature
5. Value-Based DRL Methods
5.1. Function approximation; Feature Construction for Linear Methods (Tile
Coding, Asymmetric Tile Coding);
5.2. Linear function approximation; Semi-Gradient TD methods; Off-policy
function approximation TD divergence;
5.3. Deep Q Network; Double DQN; Rainbow

6. Policy Gradients Methods


6.1. Policy Gradient Methods, Policy Gradient Theorem,
6.2. REINFORCE algorithm, REINFORCE with baseline algorithm,
6.3. Actor-Critic methods, REINFORCE algorithm for continuing problems (problems
without episode boundaries)

7. Model-Based Deep RL
7.1. Upper-Confidence-Bound Action Selection,
7.2. Monte-Carlo tree search,
7.3. AlphaGo Zero, MuZero, PlaNet

8. Imitation Learning
8.1. Introduction to Imitation Learning;
8.2. Imitation Learning Via Supervised Learning, Behavior Cloning, Inverse
Reinforcement Learning,
8.3. GAIL; Dataset augmentation, DAGGER;
8.4. Applications in autonomous Driving, Game Playing, and Robotics;

9. (Optional Content) Multi-Agent RL


9.1. Understanding multi-agent environments;
9.2. Cooperative vs. Competitive agents, centralized vs. decentralized RL;
9.3. Proximal Policy Optimization (Surrogate Objective Function, Clipping);
Multi-agent PPO

10. (Optional Content) Special Topics


10.1. High-level discussion of a few selected topics from: Safety in
Reinforcement Learning (Constrained RL, Safe Exploration, Adversarial
Training, Corrigibility, Distributional Shift, Human-in-the-Loop, Formal
Methods in Safe RL), Offline/Batch Reinforcement Learning

11. Course Summary
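
As a pointer for item 2.3 above, the Bellman expectation equation for the state-value
function of a policy \pi (standard notation from T1) is

    v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \big[ r + \gamma \, v_\pi(s') \big]

where \gamma is the discount factor and p denotes the environment's dynamics; replacing the
policy-weighted sum over actions with a maximum over actions gives the Bellman optimality
equation for v_*.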

Learning Outcomes
After successfully completing this course, the students will be able to
LO-1: Understand the fundamental concepts and algorithms of reinforcement learning (RL)
and apply them to problems in control, decision-making, and planning.
LO-2: Implement DRL algorithms and handle training challenges related to stability and
convergence.
LO-3: Evaluate the performance of DRL algorithms using metrics such as sample
efficiency, robustness, and generalization.
LO-4: Understand the challenges and opportunities of applying DRL to real-world problems
and model real-life problems as DRL problems.
Part B: Learning Plan
Academic Term

Course Title Deep Reinforcement Learning

Course No

Lead Instructor

Detailed Plan of Contact Sessions


Session # # of Hrs Topics Ref

1 2 Introduction to Reinforcement Learning (RL); [TB] Ch-1


Examples; Elements of Reinforcement Learning (
Policy, Reward, Value, Model of the environment) &
their characteristics; Example: RL for Tic-Tac-Toe;
Historical Background;
2-3 3 Multi-armed Bandit Problem - Motivation and Problem [TB] Ch-2
Statement; Incremental solution to the stationary &
non-stationary MAB problems; Exploration vs.
Exploitation tradeoff; Bandit Gradient Algorithm as
Stochastic Gradient Ascent; Associative Search

3-4 2 (Finite) Markov Decision Processes: Modelling [TB] Ch-3


Agent-Environment interaction using MDP; Examples;
Discussion on Goals, Rewards & Returns; Policy and
Value Functions; Bellman Equation for value functions;
Optimal Policy and Optimal Value functions;

4-6 3 Introduction to Dynamic Programming; Policy Iteration; [TB] Ch-4


Value Iteration; Generalized policy iteration; Efficiency
of Dynamic Programming

7-8 2 MC prediction, MC control, incremental MC. [TB] Ch-5

9-10 3 TD Prediction, On-policy temporal difference [TB] Ch-6,7


model-free (SARSA), Off-policy temporal difference
model-free (Q-Learning), Expected SARSA, N-step TD
policy evaluation, N-step Sarsa, Off-policy n-step
Sarsa, Tree backup algorithm

11 1 Discussion on the classification of (Deep) Notes


Reinforcement Learning Approaches, Algorithms,
Applications: Model-Based vs. Model-Free; Value-based
vs. Policy-Based; On-Policy vs. Off-Policy;

Mid-Semester Exam

12 3 Function approximation; Feature Construction for Linear [TB-2] Ch-9,
Methods (Tile Coding, Asymmetric Tile Coding); Linear [DQN], [DDQN],
function approximation; Semi-Gradient TD methods; [Rainbow]
Off-policy function approximation TD divergence;
Deep Q Network; Double DQN; Rainbow

13 3 Policy Gradient Methods, Policy Gradient Theorem, [TB-2] Ch-13,


REINFORCE algorithm, REINFORCE with baseline
algorithm, Actor-Critic methods, REINFORCE algorithm
for continuing problems (problems without episode
boundaries)

14-15 2 Upper-Confidence-Bound Action Selection, [AlphaZero],
Monte-Carlo tree search, AlphaGo Zero, MuZero, [AlphaGo Zero],
PlaNet [MuZero], [PlaNet]

15-16 3 Imitation Learning: Introduction; Imitation Learning [DeepMimic],
via Supervised Learning, Behavior Cloning, Inverse [BAIL],
Reinforcement Learning, GAIL; Dataset augmentation, [ACM-SUR-IL]
DAGGER; Applications in Autonomous Driving, Game
Playing, and Robotics;
[optional] 3 Understanding multi-agent environments; Cooperative [MARL]
vs. Competitive agents, centralized vs. decentralized RL;
Proximal Policy Optimization (Surrogate Objective
Function, Clipping); Multi-agent PPO

[optional] 1 Discussion on a few selected advanced topics; [NAS-1], [NAS-2],
Course Summary. [SafeRL-Sur],
[OfflineRL]

[RL: Reinforcement Learning; DRL: Deep Reinforcement Learning]

Detailed Plan for Lab work


Lab No. Lab Objective Lab Sheet Access URL Session Reference

1 Bandit Gradient Algorithm NA

2-4 Implementing Dynamic Programming, Monte Carlo, SARSA,
Q-Learning (an indicative sketch follows this table) NA

3 Deep Q Network NA

4 REINFORCE NA

5 Imitation Learning NA
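
Indicative only: the following minimal tabular Q-learning loop sketches the kind of
implementation expected in Labs 2-4; it is not a lab solution or prescribed template. It
assumes a Gymnasium-style discrete environment, and the environment name ("FrozenLake-v1")
and hyperparameter values are placeholders rather than part of the lab specification.

import numpy as np
import gymnasium as gym

# Placeholder environment; any discrete-state, discrete-action task works.
env = gym.make("FrozenLake-v1")
n_states = env.observation_space.n
n_actions = env.action_space.n

Q = np.zeros((n_states, n_actions))      # action-value table
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # step size, discount, exploration rate (placeholders)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning (off-policy TD) update; bootstrap only from non-terminal successors
        target = reward + gamma * np.max(Q[next_state]) * (not terminated)
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state

Replacing the max over next-state values with the value of the action actually taken next
would turn this update into SARSA (on-policy TD), which is also within the scope of these labs.
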
Evaluation Scheme:
Legend: EC = Evaluation Component; AN = After Noon Session; FN = Fore Noon Session
No Name Type Duration Weight Schedule Remarks

EC-1(a) Quizzes Online 5% TBA Two quizzes of 5% each will be
conducted, and the higher of the
two scores will be taken towards
grading; strictly no make-up.

EC-1(b) Assignments Take Home 25% TBA

EC-2 Mid-Semester Test Closed Book 30% TBA

EC-3 Comprehensive Exam Open Book 40% TBA

Note:
Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 8
Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)

Important links and information:

Elearn portal: https://elearn.bits-pilani.ac.in or Canvas


Students are expected to visit the Elearn portal on a regular basis and stay up to date with
the latest announcements and deadlines.

Contact sessions: Students should attend the online lectures as per the schedule
provided on the Elearn portal.

Evaluation Guidelines:
1 EC-1 consists of two Quizzes. Students will attempt them through the course pages on
the Elearn portal. Announcements will be made on the portal, in a timely manner.
2 EC-1(b) consists of either one or two Assignments. Students will attempt them through
the course pages on the Elearn portal. Announcements will be made on the portal,
in a timely manner.
3 For Closed Book tests: No books or reference material of any kind will be permitted.
4 For Open Book exams: Use of books and any printed / written reference material (filed
or bound) is permitted. However, loose sheets of paper will not be allowed. Use of
calculators is permitted in all exams. Laptops/mobiles of any kind are not allowed.
Exchange of any material is not allowed.
5 If a student is unable to appear for the Regular Test/Exam due to genuine exigencies,
the student should follow the procedure to apply for the Make-Up Test/Exam which
will be made available on the Elearn portal. The Make-Up Test/Exam will be
conducted only at selected exam centres on the dates to be announced later.
Plagiarism Policy:

All submissions for graded components must be the result of your original effort. It is
strictly prohibited to copy and paste verbatim from any sources, whether online or from
your peers. The use of unauthorized sources or materials, as well as collusion or
unauthorized collaboration to gain an unfair advantage, is also strictly prohibited.
Please note that we will not distinguish between the person sharing their resources and
the one receiving them for plagiarism, and the consequences will apply to both parties
equally.

Cases where suspicious circumstances arise, such as identical verbatim answers or a
significant overlap of unreasonable similarities in a set of submissions, will be
investigated, and severe punishments will be imposed on all those found guilty of
plagiarism.
