Sergey Levine Course, Introduction to Reinforcement Learning 6: Value Function Methods

Value function methods estimate the value or Q-function without explicitly learning a policy. Fitted Q-iteration is an off-policy batch method that fits a Q-function directly from samples without knowing the transition dynamics. While value iteration converges in the tabular case, function approximation methods like fitted value iteration and Q-learning do not theoretically converge, though they often work well in practice with tuning.


Value Function Methods

CS 294-112: Deep Reinforcement Learning


Sergey Levine
Class Notes
1. Extra TensorFlow session today (see Piazza)
2. Homework 2 is due in one week
• Don’t wait, start early!
3. Remember to start forming final project groups
Today’s Lecture
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
• Goals:
• Understand how value functions give rise to policies
• Understand the Q-learning algorithm
• Understand practical considerations for Q-learning
Recap: actor-critic

[Diagram: the RL loop: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Can we omit policy gradient completely?

forget policies, let’s just do this!

[RL loop diagram, as above]
Policy iteration
High level idea: alternate between (1) evaluating the current policy and (2) improving the policy by acting greedily with respect to that evaluation. How do we do the evaluation step?

[RL loop diagram, as above]
Dynamic programming

[Figure: gridworld with a tabular value estimate stored for each state]

just use the current estimate here


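The update this slide annotates, written out in a standard form (a reconstruction, not copied verbatim from the slide): bootstrapped policy evaluation plugs the current table entry in for the next state's value rather than summing actual returns.

V^{\pi}(s) \;\leftarrow\; \sum_a \pi(a \mid s)\,\Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi}(s') \Big]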
Policy iteration with dynamic programming

[RL loop diagram, as above]

[Figure: gridworld with the current tabular value estimate for each state]
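To make the loop concrete, here is a minimal tabular policy iteration sketch in Python. The array names and shapes (P as state-action-state transition probabilities, R as per state-action rewards) are assumptions for the illustration, not taken from the slides, and policy evaluation uses a fixed number of bootstrapped sweeps rather than an exact solve.

import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=50):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards (assumed layout)."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # start from an arbitrary deterministic policy
    while True:
        # policy evaluation: repeated bootstrapped sweeps under the current policy
        V = np.zeros(S)
        for _ in range(eval_sweeps):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # policy improvement: act greedily with respect to the evaluated values
        Q = R + gamma * P @ V                   # (S, A) one-step lookahead values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):          # stop when the greedy policy no longer changes
            return pi, V
        pi = new_pi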
Even simpler dynamic programming

taking the max over actions approximates the new (greedy) policy's value!

[RL loop diagram, as above]
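The "even simpler" variant collapses evaluation and improvement into a single bootstrapped update with a max over actions. A minimal value iteration sketch under the same assumed P and R arrays as in the previous example:

import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Repeatedly apply V(s) <- max_a [ r(s, a) + gamma * E[V(s')] ]."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V          # (S, A): one-step lookahead for every action
        V = Q.max(axis=1)              # the max over actions approximates the new policy's value
    return Q.argmax(axis=1), V         # greedy policy and its value estimate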
Fitted value iteration
curse of dimensionality: a value table with one entry per state is intractable for large or continuous state spaces, so fit a parametric value function instead

[RL loop diagram, as above]
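A sketch of fitted value iteration under strong simplifying assumptions: the dynamics are deterministic and exposed as hypothetical reward(s, a) and next_state(s, a) callables, phi(s) is a user-supplied feature map, and plain least squares stands in for whatever function approximator is actually used.

import numpy as np

def fitted_value_iteration(states, actions, reward, next_state, phi, gamma=0.99, iters=50):
    """states: list of sampled states; phi(s) -> 1-D feature vector; V(s) ~= phi(s) @ w."""
    X = np.stack([phi(s) for s in states])      # (N, d) design matrix over the sampled states
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # bootstrapped targets: y_i = max_a [ r(s_i, a) + gamma * V(s_i') ]
        # NOTE: assumes deterministic dynamics; a stochastic model would average over next states
        y = np.array([
            max(reward(s, a) + gamma * (phi(next_state(s, a)) @ w) for a in actions)
            for s in states
        ])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # supervised regression onto the targets
    return w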
What if we don’t know the transition dynamics?
the max over actions needs to know the outcomes of different actions from the same state, which requires a model!

Back to policy iteration…

evaluating the Q-function instead of the value function: this can be fit using samples

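The update being referred to, in a standard reconstruction: the expectation over the unknown dynamics is replaced by the sampled next state $s_i'$ actually observed in the data.

Q^{\pi}(s, a) \;\leftarrow\; r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ Q^{\pi}(s', \pi(s')) \big] \;\approx\; r(s_i, a_i) + \gamma\, Q^{\pi}\big(s_i', \pi(s_i')\big)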

Can we do the “max” trick again?

forget policy, compute value directly


can we do this with Q-values also, without knowing the transitions?

doesn’t require simulation of actions!


+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation (more on this later)
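How the max trick works with Q-values, in standard notation (a reconstruction rather than the slide's exact symbols): the max is evaluated inside the learned function, so no action needs to be simulated.

V(s) = \max_a Q(s, a), \qquad y_i = r(s_i, a_i) + \gamma \max_{a'} Q(s_i', a')

Only the single observed transition $(s_i, a_i, s_i', r_i)$ is needed to form the target, which is also why off-policy samples can be used.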
Fitted Q-iteration
Why is this algorithm off-policy?

dataset of transitions

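A minimal sketch of the full fitted Q-iteration loop described here, written for discrete scalar actions and a generic scikit-learn-style regressor supplied by the caller (that interface, and the omission of terminal-state handling, are simplifications of this sketch, not part of the slides). Because the targets only need the stored transitions, the data can come from any behavior policy, which is what makes the algorithm off-policy.

import numpy as np

def fitted_q_iteration(transitions, actions, make_regressor, gamma=0.99, iters=100):
    """transitions: list of (s, a, s_next, r) tuples collected by any behavior policy."""
    S  = np.array([s for s, a, sn, r in transitions])
    A  = np.array([a for s, a, sn, r in transitions])
    Sn = np.array([sn for s, a, sn, r in transitions])
    Rw = np.array([r for s, a, sn, r in transitions])

    X = np.column_stack([S, A])                  # regress Q on (state, action) inputs
    Q = make_regressor().fit(X, Rw)              # initialize by fitting the immediate reward
    for _ in range(iters):
        # targets: y_i = r_i + gamma * max_a' Q(s_i', a'), using only the stored transitions
        next_q = np.column_stack([
            Q.predict(np.column_stack([Sn, np.full(len(Sn), a)])) for a in actions
        ])
        y = Rw + gamma * next_q.max(axis=1)
        Q = make_regressor().fit(X, y)           # supervised regression step
    return Q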
Fitted Q-iteration
What is fitted Q-iteration optimizing?

most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
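What the regression step is (approximately) minimizing at each iteration, in standard notation: the Bellman error with respect to a frozen target, where $\beta$ denotes the distribution of the collected transitions. This is a reconstruction of the usual statement, not copied from the slide.

\mathcal{E} \;=\; \tfrac{1}{2}\, \mathbb{E}_{(s, a) \sim \beta}\Big[ \big( Q_\phi(s, a) - \big[ r(s, a) + \gamma \max_{a'} Q_\phi(s', a') \big] \big)^{2} \Big]

If this error is driven exactly to zero, the learned $Q_\phi$ is the optimal Q-function; once we leave the tabular case, as noted above, that guarantee no longer holds.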
Online Q-learning algorithms
[RL loop diagram, as above]

off-policy, so there are many choices for how the samples are collected here!


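A sketch of the online, one-transition-at-a-time version, written tabularly for clarity. The Gymnasium-style reset/step interface and the behavior_policy callable are assumptions of the sketch; with function approximation, step 3 would become a single gradient step on the parameters instead of a table update.

import numpy as np

def online_q_learning(env, n_states, n_actions, behavior_policy,
                      alpha=0.1, gamma=0.99, steps=100_000):
    Q = np.zeros((n_states, n_actions))
    s, _ = env.reset()
    for _ in range(steps):
        a = behavior_policy(Q, s)                           # 1. take one action, observe the transition
        s_next, r, terminated, truncated, _ = env.step(a)
        target = r + (0.0 if terminated else gamma * Q[s_next].max())   # 2. bootstrapped target
        Q[s, a] += alpha * (target - Q[s, a])               # 3. move Q(s, a) toward the target
        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q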
Exploration with Q-learning
final policy: act greedily, i.e. deterministically take the argmax action under the learned Q-function

why is acting greedily a bad idea for step 1 (collecting samples)? a deterministic policy never explores

“epsilon-greedy”

“Boltzmann exploration”

We’ll discuss exploration in more detail in a later lecture!


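Both exploration rules named above, written as behavior policies compatible with the online sketch; the epsilon and temperature values are arbitrary placeholders.

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(Q.shape[1]))
    return int(Q[s].argmax())

def boltzmann(Q, s, temperature=1.0):
    """Sample actions with probability proportional to exp(Q(s, a) / temperature)."""
    logits = Q[s] / temperature
    probs = np.exp(logits - logits.max())       # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))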
Review
• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value or Q-function
  • If we have a value function, we have a policy
• Fitted Q-iteration
  • Batch-mode, off-policy method
• Q-learning
  • Online analogue of fitted Q-iteration

[RL loop diagram, as above]
Break
Value function learning theory

[Figure: gridworld with a tabular value estimate for each state, used to analyze the value iteration backup]
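In standard notation (a reconstruction; the slides' own symbols may differ), the object analyzed here is the value iteration backup operator, which is a contraction in the max norm:

\mathcal{B}V \;=\; \max_a \big( r_a + \gamma\, \mathcal{T}_a V \big), \qquad \lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_\infty \;\le\; \gamma\, \lVert V - \bar{V} \rVert_\infty

Here $r_a$ stacks the rewards for action $a$ over states and $\mathcal{T}_a$ is the corresponding transition matrix, so repeated application of $\mathcal{B}$ converges to the unique fixed point $V^\ast = \mathcal{B}V^\ast$; that is the tabular value iteration result.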
Non-tabular value function learning

Conclusions:
• value iteration converges (tabular case)
• fitted value iteration does not converge
  • not in general
  • often not in practice
What about fitted Q-iteration?
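The reasoning behind these conclusions, reconstructed in the standard formulation (consistent with the review at the end of this lecture): fitted value iteration composes the backup $\mathcal{B}$ with a projection $\Pi$ onto the set $\Omega$ of functions the approximator can represent,

V \;\leftarrow\; \Pi \mathcal{B} V, \qquad \Pi V \;=\; \arg\min_{V' \in \Omega} \lVert V' - V \rVert^2

$\mathcal{B}$ is a contraction in the $\infty$-norm and $\Pi$ is a contraction in the $\ell_2$ norm, but their composition $\Pi\mathcal{B}$ is not, in general, a contraction in any norm, so the iteration can fail to converge. The same argument carries over to fitted Q-iteration.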

Applies also to online Q-learning


But… it’s just regression!

Q-learning is not gradient descent!

no gradient through target value


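Concretely, "no gradient through the target value" means the online update differentiates only the prediction, treating the target as a constant even though it also depends on $\phi$ (standard form, reconstructed rather than copied from the slide):

\phi \;\leftarrow\; \phi - \alpha\, \frac{dQ_\phi(s_i, a_i)}{d\phi}\, \Big( Q_\phi(s_i, a_i) - \big[ r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a') \big] \Big)

Because the $\max_{a'} Q_\phi$ term is not differentiated, this is not gradient descent on any fixed objective, which is why the "it's just regression" intuition does not deliver convergence guarantees.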
A sad corollary

An aside regarding terminology


Review
• Value iteration theory
  • Linear operator for backup
  • Linear operator for projection
  • Backup is a contraction
  • Value iteration converges
• Convergence with function approximation
  • Projection is also a contraction
  • Projection + backup is not a contraction
  • Fitted value iteration does not converge in general
• Implications for Q-learning
  • Q-learning, fitted Q-iteration, etc. do not converge with function approximation
  • But we can make it work in practice!
  • Sometimes – tune in next time

[RL loop diagram, as above]
