Sergey Levine Course, Introduction to Reinforcement Learning 6: Value Function Methods

Value function methods estimate the value or Q-function without explicitly learning a policy. Fitted Q-iteration is an off-policy batch method that fits a Q-function directly from samples without knowing the transition dynamics. While value iteration converges in the tabular case, function approximation methods like fitted value iteration and Q-learning do not theoretically converge, though they often work well in practice with tuning.


Value Function Methods

CS 294-112: Deep Reinforcement Learning


Sergey Levine
Class Notes
1. Extra TensorFlow session today (see Piazza)
2. Homework 2 is due in one week
• Don’t wait, start early!
3. Remember to start forming final project groups
Today’s Lecture
1. What if we just use a critic, without an actor?
2. Extracting a policy from a value function
3. The Q-learning algorithm
4. Extensions: continuous actions, improvements
• Goals:
• Understand how value functions give rise to policies
• Understand the Q-learning algorithm
• Understand practical considerations for Q-learning
Recap: actor-critic

[Diagram: the RL loop: generate samples (i.e. run the policy) → fit a model to estimate return → improve the policy]
Can we omit policy gradient completely?

forget policies, let’s just do this!

[RL loop diagram, as above]
Policy iteration
High level idea: alternate between (1) evaluating the current policy and (2) improving the policy by acting greedily with respect to that evaluation. How do we do the evaluation step?

[RL loop diagram, as above]
Dynamic programming

[Figure: gridworld with a tabular value estimate stored for each state]

just use the current estimate here


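The update this slide annotates, written out in a standard form (a reconstruction, not copied verbatim from the slide): bootstrapped policy evaluation plugs the current table entry in for the next state's value rather than summing actual returns.

V^{\pi}(s) \;\leftarrow\; \sum_a \pi(a \mid s)\,\Big[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi}(s') \Big]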
Policy iteration with dynamic programming

[RL loop diagram, as above]

[Figure: gridworld with the current tabular value estimate for each state]
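To make the loop concrete, here is a minimal tabular policy iteration sketch in Python. The array names and shapes (P as state-action-state transition probabilities, R as per state-action rewards) are assumptions for the illustration, not taken from the slides, and policy evaluation uses a fixed number of bootstrapped sweeps rather than an exact solve.

import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_sweeps=50):
    """P: (S, A, S) transition probabilities, R: (S, A) rewards (assumed layout)."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                 # start from an arbitrary deterministic policy
    while True:
        # policy evaluation: repeated bootstrapped sweeps under the current policy
        V = np.zeros(S)
        for _ in range(eval_sweeps):
            V = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
        # policy improvement: act greedily with respect to the evaluated values
        Q = R + gamma * P @ V                   # (S, A) one-step lookahead values
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):          # stop when the greedy policy no longer changes
            return pi, V
        pi = new_pi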
Even simpler dynamic programming

taking the max over actions approximates the new (greedy) policy's value!

[RL loop diagram, as above]
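The "even simpler" variant collapses evaluation and improvement into a single bootstrapped update with a max over actions. A minimal value iteration sketch under the same assumed P and R arrays as in the previous example:

import numpy as np

def value_iteration(P, R, gamma=0.99, iters=1000):
    """Repeatedly apply V(s) <- max_a [ r(s, a) + gamma * E[V(s')] ]."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = R + gamma * P @ V          # (S, A): one-step lookahead for every action
        V = Q.max(axis=1)              # the max over actions approximates the new policy's value
    return Q.argmax(axis=1), V         # greedy policy and its value estimate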
Fitted value iteration
curse of dimensionality: a value table with one entry per state is intractable for large or continuous state spaces, so fit a parametric value function instead

[RL loop diagram, as above]
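A sketch of fitted value iteration under strong simplifying assumptions: the dynamics are deterministic and exposed as hypothetical reward(s, a) and next_state(s, a) callables, phi(s) is a user-supplied feature map, and plain least squares stands in for whatever function approximator is actually used.

import numpy as np

def fitted_value_iteration(states, actions, reward, next_state, phi, gamma=0.99, iters=50):
    """states: list of sampled states; phi(s) -> 1-D feature vector; V(s) ~= phi(s) @ w."""
    X = np.stack([phi(s) for s in states])      # (N, d) design matrix over the sampled states
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # bootstrapped targets: y_i = max_a [ r(s_i, a) + gamma * V(s_i') ]
        # NOTE: assumes deterministic dynamics; a stochastic model would average over next states
        y = np.array([
            max(reward(s, a) + gamma * (phi(next_state(s, a)) @ w) for a in actions)
            for s in states
        ])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # supervised regression onto the targets
    return w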
What if we don’t know the transition dynamics?
the max over actions needs to know the outcomes of different actions from the same state, which requires a model!

Back to policy iteration…

evaluating the Q-function instead of the value function: this can be fit using samples

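The update being referred to, in a standard reconstruction: the expectation over the unknown dynamics is replaced by the sampled next state $s_i'$ actually observed in the data.

Q^{\pi}(s, a) \;\leftarrow\; r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(s' \mid s, a)}\big[ Q^{\pi}(s', \pi(s')) \big] \;\approx\; r(s_i, a_i) + \gamma\, Q^{\pi}\big(s_i', \pi(s_i')\big)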

Can we do the “max” trick again?

forget policy, compute value directly


can we do this with Q-values also, without knowing the transitions?

doesn’t require simulation of actions!


+ works even for off-policy samples (unlike actor-critic)
+ only one network, no high-variance policy gradient
- no convergence guarantees for non-linear function approximation (more on this later)
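How the max trick works with Q-values, in standard notation (a reconstruction rather than the slide's exact symbols): the max is evaluated inside the learned function, so no action needs to be simulated.

V(s) = \max_a Q(s, a), \qquad y_i = r(s_i, a_i) + \gamma \max_{a'} Q(s_i', a')

Only the single observed transition $(s_i, a_i, s_i', r_i)$ is needed to form the target, which is also why off-policy samples can be used.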
Fitted Q-iteration
Why is this algorithm off-policy?

dataset of transitions

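A minimal sketch of the full fitted Q-iteration loop described here, written for discrete scalar actions and a generic scikit-learn-style regressor supplied by the caller (that interface, and the omission of terminal-state handling, are simplifications of this sketch, not part of the slides). Because the targets only need the stored transitions, the data can come from any behavior policy, which is what makes the algorithm off-policy.

import numpy as np

def fitted_q_iteration(transitions, actions, make_regressor, gamma=0.99, iters=100):
    """transitions: list of (s, a, s_next, r) tuples collected by any behavior policy."""
    S  = np.array([s for s, a, sn, r in transitions])
    A  = np.array([a for s, a, sn, r in transitions])
    Sn = np.array([sn for s, a, sn, r in transitions])
    Rw = np.array([r for s, a, sn, r in transitions])

    X = np.column_stack([S, A])                  # regress Q on (state, action) inputs
    Q = make_regressor().fit(X, Rw)              # initialize by fitting the immediate reward
    for _ in range(iters):
        # targets: y_i = r_i + gamma * max_a' Q(s_i', a'), using only the stored transitions
        next_q = np.column_stack([
            Q.predict(np.column_stack([Sn, np.full(len(Sn), a)])) for a in actions
        ])
        y = Rw + gamma * next_q.max(axis=1)
        Q = make_regressor().fit(X, y)           # supervised regression step
    return Q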
Fitted Q-iteration
What is fitted Q-iteration optimizing?

most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
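What the regression step is (approximately) minimizing at each iteration, in standard notation: the Bellman error with respect to a frozen target, where $\beta$ denotes the distribution of the collected transitions. This is a reconstruction of the usual statement, not copied from the slide.

\mathcal{E} \;=\; \tfrac{1}{2}\, \mathbb{E}_{(s, a) \sim \beta}\Big[ \big( Q_\phi(s, a) - \big[ r(s, a) + \gamma \max_{a'} Q_\phi(s', a') \big] \big)^{2} \Big]

If this error is driven exactly to zero, the learned $Q_\phi$ is the optimal Q-function; once we leave the tabular case, as noted above, that guarantee no longer holds.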
Online Q-learning algorithms
[RL loop diagram, as above]

off-policy, so there are many choices for how the samples are collected here!


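A sketch of the online, one-transition-at-a-time version, written tabularly for clarity. The Gymnasium-style reset/step interface and the behavior_policy callable are assumptions of the sketch; with function approximation, step 3 would become a single gradient step on the parameters instead of a table update.

import numpy as np

def online_q_learning(env, n_states, n_actions, behavior_policy,
                      alpha=0.1, gamma=0.99, steps=100_000):
    Q = np.zeros((n_states, n_actions))
    s, _ = env.reset()
    for _ in range(steps):
        a = behavior_policy(Q, s)                           # 1. take one action, observe the transition
        s_next, r, terminated, truncated, _ = env.step(a)
        target = r + (0.0 if terminated else gamma * Q[s_next].max())   # 2. bootstrapped target
        Q[s, a] += alpha * (target - Q[s, a])               # 3. move Q(s, a) toward the target
        s = s_next
        if terminated or truncated:
            s, _ = env.reset()
    return Q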
Exploration with Q-learning
final policy: act greedily, i.e. deterministically take the argmax action under the learned Q-function

why is acting greedily a bad idea for step 1 (collecting samples)? a deterministic policy never explores

“epsilon-greedy”

“Boltzmann exploration”

We’ll discuss exploration in more detail in a later lecture!


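Both exploration rules named above, written as behavior policies compatible with the online sketch; the epsilon and temperature values are arbitrary placeholders.

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, otherwise act greedily."""
    if np.random.rand() < epsilon:
        return int(np.random.randint(Q.shape[1]))
    return int(Q[s].argmax())

def boltzmann(Q, s, temperature=1.0):
    """Sample actions with probability proportional to exp(Q(s, a) / temperature)."""
    logits = Q[s] / temperature
    probs = np.exp(logits - logits.max())       # subtract the max for numerical stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))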
Review
• Value-based methods
  • Don't learn a policy explicitly
  • Just learn a value or Q-function
  • If we have a value function, we have a policy
• Fitted Q-iteration
  • Batch-mode, off-policy method
• Q-learning
  • Online analogue of fitted Q-iteration

[RL loop diagram, as above]
Break
Value function learning theory

[Figure: gridworld with a tabular value estimate for each state, used to analyze the value iteration backup]
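In standard notation (a reconstruction; the slides' own symbols may differ), the object analyzed here is the value iteration backup operator, which is a contraction in the max norm:

\mathcal{B}V \;=\; \max_a \big( r_a + \gamma\, \mathcal{T}_a V \big), \qquad \lVert \mathcal{B}V - \mathcal{B}\bar{V} \rVert_\infty \;\le\; \gamma\, \lVert V - \bar{V} \rVert_\infty

Here $r_a$ stacks the rewards for action $a$ over states and $\mathcal{T}_a$ is the corresponding transition matrix, so repeated application of $\mathcal{B}$ converges to the unique fixed point $V^\ast = \mathcal{B}V^\ast$; that is the tabular value iteration result.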
Non-tabular value function learning

Conclusions:
• value iteration converges (tabular case)
• fitted value iteration does not converge
  • not in general
  • often not in practice
What about fitted Q-iteration?
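The reasoning behind these conclusions, reconstructed in the standard formulation (consistent with the review at the end of this lecture): fitted value iteration composes the backup $\mathcal{B}$ with a projection $\Pi$ onto the set $\Omega$ of functions the approximator can represent,

V \;\leftarrow\; \Pi \mathcal{B} V, \qquad \Pi V \;=\; \arg\min_{V' \in \Omega} \lVert V' - V \rVert^2

$\mathcal{B}$ is a contraction in the $\infty$-norm and $\Pi$ is a contraction in the $\ell_2$ norm, but their composition $\Pi\mathcal{B}$ is not, in general, a contraction in any norm, so the iteration can fail to converge. The same argument carries over to fitted Q-iteration.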

Applies also to online Q-learning


But… it’s just regression!

Q-learning is not gradient descent!

no gradient through target value


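Concretely, "no gradient through the target value" means the online update differentiates only the prediction, treating the target as a constant even though it also depends on $\phi$ (standard form, reconstructed rather than copied from the slide):

\phi \;\leftarrow\; \phi - \alpha\, \frac{dQ_\phi(s_i, a_i)}{d\phi}\, \Big( Q_\phi(s_i, a_i) - \big[ r(s_i, a_i) + \gamma \max_{a'} Q_\phi(s_i', a') \big] \Big)

Because the $\max_{a'} Q_\phi$ term is not differentiated, this is not gradient descent on any fixed objective, which is why the "it's just regression" intuition does not deliver convergence guarantees.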
A sad corollary

An aside regarding terminology


Review
• Value iteration theory
  • Linear operator for backup
  • Linear operator for projection
  • Backup is a contraction
  • Value iteration converges
• Convergence with function approximation
  • Projection is also a contraction
  • Projection + backup is not a contraction
  • Fitted value iteration does not converge in general
• Implications for Q-learning
  • Q-learning, fitted Q-iteration, etc. do not converge with function approximation
  • But we can make it work in practice!
  • Sometimes – tune in next time

[RL loop diagram, as above]
