Sergey Levine Course: Introduction to Reinforcement Learning, Lecture 6: Value Functions
[Figure: the RL algorithm anatomy loop: generate samples (i.e. run the policy) -> fit a model / estimate the return -> improve the policy]
Can we omit the policy gradient completely?
Policy iteration
High level idea: alternate between evaluating the current policy (compute its value or advantage) and improving the policy by acting greedily with respect to that estimate. How do we do this?
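As a concrete reference, here is a minimal tabular policy-iteration sketch in Python/NumPy. It assumes a known transition model P[s, a, s'] and expected reward R[s, a]; these array names and the exact-solve evaluation step are illustrative choices, not taken from the lecture slides.

    import numpy as np

    def policy_iteration(P, R, gamma=0.99):
        """Tabular policy iteration.
        P: transition probabilities, shape (S, A, S)
        R: expected rewards, shape (S, A)
        """
        S, A, _ = P.shape
        pi = np.zeros(S, dtype=int)           # start from an arbitrary deterministic policy
        while True:
            # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly (a linear system)
            P_pi = P[np.arange(S), pi]        # (S, S)
            R_pi = R[np.arange(S), pi]        # (S,)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
            # Policy improvement: act greedily with respect to Q(s, a)
            Q = R + gamma * P @ V             # (S, A)
            new_pi = Q.argmax(axis=1)
            if np.array_equal(new_pi, pi):    # policy stopped changing -> done
                return pi, V
            pi = new_pi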
Dynamic programming
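The dynamic programming view skips the explicit policy and works directly on values: repeatedly apply the Bellman backup V(s) <- max_a [r(s, a) + gamma E[V(s')]]. A minimal tabular sketch, under the same assumed P and R arrays as above:

    import numpy as np

    def value_iteration(P, R, gamma=0.99, tol=1e-8):
        """Tabular value iteration: repeat V <- max_a (R + gamma * P V) until it stops changing."""
        S, A, _ = P.shape
        V = np.zeros(S)
        while True:
            Q = R + gamma * P @ V              # Bellman backup for every (s, a) pair
            V_new = Q.max(axis=1)
            if np.abs(V_new - V).max() < tol:  # backup barely changes V -> converged
                return V_new, Q.argmax(axis=1) # values and the induced greedy policy
            V = V_new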
Fitted value iteration
Curse of dimensionality: a tabular value function needs one entry per state, which is intractable for large or continuous state spaces, so we need function approximation.
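Fitted value iteration sidesteps the curse of dimensionality by replacing the table with a function approximator: compute backed-up targets max_a [r + gamma V_phi(s')] at a set of sampled states, then regress V_phi onto those targets. A minimal sketch with a linear value function is below; the helpers features, sample_states, and model (a known dynamics/reward model) are hypothetical stand-ins.

    import numpy as np

    def fitted_value_iteration(sample_states, features, model, num_actions,
                               gamma=0.99, iters=50):
        """Fitted value iteration sketch with a linear value function V_w(s) = phi(s) @ w.
        features(states) -> (N, d) feature matrix
        model(states, a) -> (next_states, rewards) for every sampled state under action a
        """
        Phi = features(sample_states)                      # (N, d)
        w = np.zeros(Phi.shape[1])
        for _ in range(iters):
            # 1. backed-up targets: y_i = max_a [ r(s_i, a) + gamma * V_w(s'_i) ]
            per_action = []
            for a in range(num_actions):
                next_states, rewards = model(sample_states, a)
                per_action.append(rewards + gamma * features(next_states) @ w)
            y = np.max(np.stack(per_action, axis=1), axis=1)
            # 2. regression step: fit V_w to the targets (replaces the tabular assignment)
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w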
What if we don’t know the transition dynamics?
The max over actions in the backup needs to know the outcomes for different actions! Without a model, we instead work from a dataset of transitions (s, a, s', r).
Fitted Q-iteration
What is fitted Q-iteration optimizing?
most guarantees are lost when we leave the tabular case (e.g., when we use neural network function approximation)
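Fitted Q-iteration removes the need for a known model: the max over actions happens inside the learned Q-function, so the backup y_i = r_i + gamma max_{a'} Q_phi(s'_i, a') can be computed directly from a stored dataset of transitions. Each iteration is an ordinary regression onto these targets, but the targets themselves move as Q changes, which is exactly why the tabular guarantees are lost. A hedged sketch with a linear Q-function; q_features is a hypothetical feature map over state-action pairs.

    import numpy as np

    def fitted_q_iteration(dataset, q_features, num_actions, gamma=0.99, iters=100):
        """Fitted Q-iteration from an off-policy dataset, Q_w(s, a) = phi(s, a) @ w.
        dataset: tuple of arrays (states, actions, rewards, next_states)
        q_features(states, actions) -> (N, d) feature matrix
        """
        s, a, r, s2 = dataset
        Phi = q_features(s, a)                                  # (N, d)
        w = np.zeros(Phi.shape[1])
        for _ in range(iters):
            # targets: y_i = r_i + gamma * max_a' Q_w(s'_i, a')  -- the "moving target"
            next_q = np.stack([q_features(s2, np.full(len(r), ap)) @ w
                               for ap in range(num_actions)], axis=1)
            y = r + gamma * next_q.max(axis=1)
            # regression: minimize sum_i (Q_w(s_i, a_i) - y_i)^2 over w
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w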
Online Q-learning algorithms
• “epsilon-greedy”: act greedily with probability 1 − ε, pick a uniformly random action with probability ε
• “Boltzmann exploration”: sample actions with probability proportional to exp(Q(s, a))
• Q-learning: the online analogue of fitted Q-iteration
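A minimal sketch of tabular online Q-learning with epsilon-greedy action selection: take one transition, do one backup, repeat. The environment interface assumed here (reset() returning a state index, step(a) returning (next_state, reward, done)) is an illustration, not from the lecture.

    import numpy as np

    def q_learning(env, num_states, num_actions, gamma=0.99,
                   alpha=0.1, epsilon=0.1, num_steps=100_000):
        """Online (tabular) Q-learning with epsilon-greedy exploration."""
        Q = np.zeros((num_states, num_actions))
        s = env.reset()
        for _ in range(num_steps):
            # epsilon-greedy: mostly act greedily, sometimes explore uniformly at random
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(Q[s].argmax())
            s2, r, done = env.step(a)
            # single-sample backup toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * Q[s2].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = env.reset() if done else s2
        return Q

Boltzmann exploration would replace the epsilon-greedy branch by sampling a with probability proportional to exp(Q[s, a] / temperature), so better-looking actions are tried more often without ever being ruled out.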
Break
Value function learning theory
[Figure: a 4x4 table of example state values (0.2–0.7) used to illustrate the Bellman backup on a tabular value function]
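The key fact behind the tabular convergence argument is that the Bellman backup B is a gamma-contraction in the infinity norm, ||BV - BV'||_inf <= gamma ||V - V'||_inf, so repeated backups pull any two value functions (and in particular V and V*) together. A quick numerical check on a made-up random MDP:

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, gamma = 5, 3, 0.9

    # random MDP: P[s, a] is a distribution over next states, R[s, a] an expected reward
    P = rng.random((S, A, S))
    P /= P.sum(axis=-1, keepdims=True)
    R = rng.random((S, A))

    def backup(V):
        """Bellman optimality backup: (BV)(s) = max_a [ R(s, a) + gamma * E[V(s')] ]."""
        return (R + gamma * P @ V).max(axis=1)

    V1, V2 = rng.random(S), rng.random(S)
    lhs = np.abs(backup(V1) - backup(V2)).max()   # ||BV1 - BV2||_inf
    rhs = gamma * np.abs(V1 - V2).max()           # gamma * ||V1 - V2||_inf
    print(lhs <= rhs)                             # contraction holds: prints True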
Non-tabular value function learning
Conclusions:
• value iteration converges (tabular case)
• fitted value iteration does not converge: not in general, and often not in practice
• what about fitted Q-iteration?
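The non-convergence of fitted value iteration can be seen in the classic two-state counterexample (in the style of Tsitsiklis and Van Roy, not taken from this lecture): a linear value function V_w with features 1 and 2, all rewards zero so the true values are 0, and both states transitioning to the second state. The projected Bellman backup multiplies w by 6*gamma/5 each iteration, so for gamma > 5/6 the fitted values blow up even though every individual step is an exact least-squares fit.

    import numpy as np

    # Two-state "w, 2w" counterexample: V_w(s1) = w, V_w(s2) = 2w, rewards all zero,
    # both states deterministically transition to s2, so the true values are exactly 0.
    phi = np.array([1.0, 2.0])
    gamma = 0.99
    w = 1.0                                # any nonzero initialization

    for i in range(20):
        # backed-up targets: y(s) = 0 + gamma * V_w(s2) for both states
        y = gamma * phi[1] * w * np.ones(2)
        # projection step: least-squares fit of phi * w_new to the targets
        w = (phi @ y) / (phi @ phi)        # closed-form 1-D least squares; w <- (6*gamma/5) * w
        print(i, w)                        # |w| grows every iteration instead of shrinking to 0

The same backup-then-project structure appears in fitted Q-iteration, so the concern carries over to it as well.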