Lectures 11-12 - Model-Free Prediction, Monte-Carlo Learning, Temporal-Difference Learning
Today's Agenda
• Monte-Carlo Learning
• Temporal-Difference Learning
• Previous lectures:
• Planning by dynamic programming
• Solve a known MDP
• This week's lectures:
• Model-free prediction
• Estimate the value function of an unknown MDP
• Subsequent lectures:
• Model-free control
• Optimize the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
Monte-Carlo Policy Evaluation
First Visit Monte-Carlo Policy Evaluation
Every-Visit Monte Carlo Policy Evaluation
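As a concrete reference for the two variants, here is a minimal sketch in Python. The episode representation as lists of (state, reward) pairs, the function name, and the gamma parameter are illustrative assumptions, not notation from the lecture:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return following visits to s.

    episodes: list of trajectories, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    for episode in episodes:
        # Scan backwards so each return G_t is built incrementally.
        G = 0.0
        step_returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            step_returns.append((state, G))
        step_returns.reverse()  # step_returns[t] == (S_t, G_t)

        visited = set()
        for state, G in step_returns:
            if first_visit and state in visited:
                continue  # first-visit: only the first occurrence counts
            visited.add(state)
            returns_sum[state] += G
            returns_count[state] += 1

    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

The only difference between the two variants is the `visited` check: every-visit MC averages the return over every occurrence of a state within an episode, first-visit MC over the first occurrence only.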
Blackjack Example
Blackjack Value Function after Monte-Carlo Learning
Incremental Mean
The new mean, μ_k, is the old mean, μ_{k-1}, plus a step of size 1/k (a little increment) towards the difference between the new sample, x_k, and what you thought the mean was:

μ_k = μ_{k-1} + (1/k)(x_k − μ_{k-1})

Here μ_{k-1} is the old estimate and (x_k − μ_{k-1}) is the error term.
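A quick check of this rule with made-up numbers: for the sequence x = (3, 7, 8),

μ_1 = 3
μ_2 = 3 + (1/2)(7 − 3) = 5
μ_3 = 5 + (1/3)(8 − 5) = 6

which matches the batch mean (3 + 7 + 8)/3 = 6.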
Incremental Monte-Carlo Updates
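A sketch of the incremental form, assuming the same (state, reward) episode representation as above. It is exactly the incremental-mean rule with the return G_t as the new sample; a constant step size alpha in place of 1/N(s) is the usual choice for non-stationary problems:

```python
def incremental_mc_update(V, N, episode, gamma=1.0, alpha=None):
    """Update value table V in place after one complete episode.

    V: dict state -> value estimate, N: dict state -> visit count.
    If alpha is None, use the running-mean step size 1/N(s);
    pass a constant alpha for non-stationary problems.
    """
    G = 0.0
    # Walk the episode backwards so returns accumulate incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        N[state] = N.get(state, 0) + 1
        step = alpha if alpha is not None else 1.0 / N[state]
        # Move the old estimate towards the observed return (the error term).
        V[state] = V.get(state, 0.0) + step * (G - V[state])
```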
Example of Monte Carlo
Example episodes
Compute V(S1) and V(S2) using first-visit and every-visit Monte-Carlo methods.
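With a hypothetical pair of episodes (gamma = 1, invented here for illustration rather than taken from the slide), the two variants disagree exactly when a state recurs within an episode. Using the mc_policy_evaluation sketch from above:

```python
# Hypothetical episodes:
# Episode 1: S1 (r=2) -> S2 (r=1) -> S1 (r=3) -> terminal
# Episode 2: S2 (r=4) -> terminal
ep1 = [("S1", 2), ("S2", 1), ("S1", 3)]
ep2 = [("S2", 4)]

print(mc_policy_evaluation([ep1, ep2], first_visit=True))
# First-visit:  V(S1) = 6.0 (return from the first S1 visit only),
#               V(S2) = (4 + 4) / 2 = 4.0
print(mc_policy_evaluation([ep1, ep2], first_visit=False))
# Every-visit:  V(S1) = (6 + 3) / 2 = 4.5, V(S2) = 4.0
```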
Today’s Agenda
• Temporal-Difference Learning
Substituting the remainder of the trajectory with our estimate of what will happen from that point onwards is called bootstrapping: we update our guess of the value function towards a subsequent guess.
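As a concrete illustration of bootstrapping, a minimal TD(0) update sketch under the same dict-based conventions as above (alpha and gamma are assumed hyperparameters; terminal states implicitly have value 0 because they never appear as keys):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=1.0):
    """One TD(0) step: move V(S_t) towards the TD target
    R_{t+1} + gamma * V(S_{t+1}) instead of the full return G_t."""
    td_target = reward + gamma * V.get(next_state, 0.0)
    td_error = td_target - V.get(state, 0.0)  # the TD error, delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error

# Applied online, after every transition, without waiting for the episode to end:
V = {}
for state, reward, next_state in [("A", 0, "B"), ("B", 1, "terminal")]:
    td0_update(V, state, reward, next_state)
```

Unlike the Monte-Carlo updates, this learns from incomplete episodes, which is what the driving-home example below contrasts.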
Monte-Carlo and Temporal-Difference Learning
Driving Home Example
Driving Home Example: MC vs. TD