Lecture 4: Model-Free Prediction
September 2024
1. Review
2. Optimality Proof
3. Model-Free Prediction: MC vs. TD
Markov decision process
A Markov decision process (MDP) is a Markov reward process with
decisions.
Example: Recycling Robot
Example 3.3 (from Intro to RL): A mobile robot has the job of collecting empty soda cans in an office environment. The rewards are zero most of the time, but become positive when the robot secures an empty can, or large and negative if the battery runs all the way down.
Bellman Expectation Equation
One-step look-ahead with the state-value function
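The standard one-step look-ahead form of the Bellman expectation equation, in the ⟨S, A, P, R, 𝛾⟩ notation used below:

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big)

q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')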
Optimal Solutions
The reward assumption:
- All goals can be described by the maximisation of expected cumulative reward
Therefore, we are interested in:
- The optimal state-value function v∗(s) is the maximum value function over all policies
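In symbols, using the standard definitions:

v_*(s) = \max_{\pi} v_\pi(s), \qquad q_*(s,a) = \max_{\pi} q_\pi(s,a)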
Theorem
- For any Markov Decision Process, there exists an optimal policy π∗ that is better than or equal to all other policies: π∗ ≥ π, ∀π
- All optimal policies achieve the optimal value function: vπ∗(s) = v∗(s)
- All optimal policies achieve the optimal action-value function: qπ∗(s, a) = q∗(s, a)
Bellman Optimality Equation
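The standard form, in the same notation as above:

v_*(s) = \max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_*(s') \Big)

q_*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_*(s',a')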
Solving MDPs by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP:
For prediction:
+ Input: MDP <S, A, P, R, 𝛾> and π
+ Output: value function vπ
Or for control:
+ Input: MDP <S, A, P, R, 𝛾>
+ Output: optimal value function v* and optimal policy π*
Iterative Policy Evaluation
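A minimal sketch of iterative policy evaluation with synchronous backups. The MDP representation (P[s][a] as a list of (prob, next_state, reward) triples) and the function name are illustrative assumptions, not fixed by the lecture:

import numpy as np

def policy_evaluation(P, pi, n_states, n_actions, gamma=0.9, theta=1e-8):
    # P[s][a]: list of (prob, next_state, reward) triples (assumed format)
    # pi[s][a]: probability of taking action a in state s under the evaluated policy
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # one-step look-ahead weighted by the policy (Bellman expectation backup)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += pi[s][a] * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new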
Greedy Policy Improvement
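Acting greedily with respect to q_π gives the improved policy:

\pi'(s) = \arg\max_{a} q_\pi(s,a) = \arg\max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big)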
Policy Improvement
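Alternating evaluation and greedy improvement yields policy iteration. A sketch reusing the (hypothetical) policy_evaluation above, under the same assumed representation of P and pi:

import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    # start from the uniform random policy
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        V = policy_evaluation(P, pi, n_states, n_actions, gamma)
        stable = True
        for s in range(n_states):
            # greedy one-step look-ahead with respect to the current V
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            best = int(np.argmax(q))
            if int(np.argmax(pi[s])) != best:
                stable = False
            pi[s] = np.eye(n_actions)[best]  # make the policy greedy (one-hot)
        if stable:
            return pi, V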
Value Iteration
Problem: find the optimal policy π∗
Solution: iterative application of the Bellman optimality backup (see below)
● v1 → v2 → ... → v∗
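The backup applied at each iteration (standard form):

v_{k+1}(s) = \max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_k(s') \Big)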
Value Iteration
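A minimal value-iteration sketch under the same assumed MDP representation as above:

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions of the one-step look-ahead
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            best = q.max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place backup
        if delta < theta:
            return V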
Examples
Slow: +1 / Fast: +2
Overheated: -10
Solving with 𝛾 = 0, 0.5, 0.9
Examples
Compute v*(high) and v*(low). Verify that, for a specific setting of the parameter values, the optimal values are fixed.
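With the textbook's parameterization (search and wait rewards r_search and r_wait, battery dynamics α and β, and a −3 penalty for running the battery down), the Bellman optimality equations to verify are:

v_*(\text{high}) = \max\Big\{\, r_{\text{search}} + \gamma\big[\alpha\, v_*(\text{high}) + (1-\alpha)\, v_*(\text{low})\big],\;\; r_{\text{wait}} + \gamma\, v_*(\text{high}) \,\Big\}

v_*(\text{low}) = \max\Big\{\, \beta\, r_{\text{search}} - 3(1-\beta) + \gamma\big[(1-\beta)\, v_*(\text{high}) + \beta\, v_*(\text{low})\big],\;\; r_{\text{wait}} + \gamma\, v_*(\text{low}),\;\; \gamma\, v_*(\text{high}) \,\Big\}

Plugging in a specific choice of α, β, r_search, r_wait, and 𝛾 gives a pair of equations whose solution is the fixed point v_*.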
Optimality Proof
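A standard version of the argument is the policy improvement theorem (cf. Sutton & Barto, Section 4.2). If π'(s) = argmax_a q_π(s, a), then q_π(s, π'(s)) ≥ v_π(s) for every state s, and expanding the right-hand side one step at a time gives

v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s)\big]
\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big]
\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2\, q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s\big]
\le \cdots \le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big] = v_{\pi'}(s)

So greedy improvement never makes the policy worse, and a policy that can no longer be improved satisfies the Bellman optimality equation and is therefore optimal.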
Quiz: Effect of noise and discount
Model-Free Prediction
Setup:
- Given an MDP problem
- Given a policy π(·|s)
- Goal: estimate the value function vπ from sampled experience, without knowledge of the transition dynamics P or the reward function R
Examples
We do not know much about the environment
- We only observe the current state and the available actions
Examples
Monte Carlo method
Main idea: we evaluate the policy using returns from complete trajectories (episodes)
Monte Carlo method
To evaluate state s with first-visit MC (a code sketch follows below):
- At the first time-step t that state s is visited in an episode,
- Increment the counter N(s) ← N(s) + 1
- Increment the total return S(s) ← S(s) + Gt
- The value is estimated by the mean return V(s) = S(s)/N(s)
- By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
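A minimal sketch of first-visit MC prediction. The episode format (a list of (state, reward) pairs collected by following π, where reward is the reward received after leaving that state) and the function name are illustrative assumptions:

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)
    for episode in episodes:
        # compute the return G_t for every time-step by scanning backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # first-visit update: only the first occurrence of each state counts
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V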
Example
Incremental Update
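The mean return can be maintained incrementally instead of storing all returns; the second form, with a constant step size α, also tracks non-stationary problems:

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)

V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)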
Temporal Difference method
The MC approach works on complete episodes
Temporal difference (TD) learning: learns from incomplete episodes, by bootstrapping
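The simplest case, TD(0), updates the value toward the estimated one-step return (the TD target):

V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)\big)

where R_{t+1} + γV(S_{t+1}) is the TD target and δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t) is the TD error.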
Temporal Difference method
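A minimal TD(0) sketch. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the policy being a function from states to actions are illustrative assumptions:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0): move V(s) toward the one-step target R + gamma * V(s')
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V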
Bias-Variance Trade-off
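The standard comparison: the MC target G_t is an unbiased estimate of v_π(S_t) but has high variance, since it depends on many random actions, transitions, and rewards; the TD target depends on only one random transition, so it has much lower variance, but it is biased because it bootstraps from the current estimate V(S_{t+1}).

\text{MC target: } G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \qquad \text{TD target: } R_{t+1} + \gamma\, V(S_{t+1})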
Advantages and Disadvantages of MC vs. TD
MC has high variance, zero bias
- Good convergence properties (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to vπ(s) (but not always with function approximation)
- More sensitive to initial value
Example: How to extract a policy
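A standard recipe: with a model, act greedily via a one-step look-ahead on v; without a model, act greedily on action values q, which is why model-free control estimates q rather than v:

\pi(s) = \arg\max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \Big) \quad \text{(with a model)} \qquad \pi(s) = \arg\max_{a} q(s,a) \quad \text{(model-free)}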
Reading Materials
● Intro to RL, Chapter 5: 5.1-5.2
● Intro to RL, Chapter 6: 6.1-6.2