
AIT3007

Lecture 4: Model Free Control

Author: Ta Viet Cuong, Ph.D.


HMI laboratory, University of Engineering and Technology
Slides adapted from davidsilver.uk/teaching/

Sep, 2024
1. Review
2. Optimality Proof
3. Model Free Prediction: MC vs TD

2
Markov decision process
A Markov decision process (MDP) is a Markov reward process with
decisions.

3
Example: Recycling Robot
Example 3.3 (from Intro to RL) A mobile robot has the job of collecting empty soda cans in an office
environment. The rewards are zero most of the time, but become positive when the robot secures an empty can, or
large and negative if the battery runs all the way down.

4
Bellman Expectation Equation
One-step look-ahead with the state-value function
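The equation itself is not in the extracted slide text; the standard Bellman expectation equation for the state-value function, in the <S, A, P, R, 𝛾> notation used later in the lecture, is:

```latex
v_\pi(s) = \sum_{a \in A} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s') \Big)
```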

5
Optimal Solutions
The reward assumption:
- All goals can be described by the maximisation of expected cumulative reward
Therefore, we are interested in:
- The optimal state-value function v∗(s) is the maximum value function over all
policies

- The optimal action-value function q∗(s, a) is the maximum action-value function


over all policies
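In symbols, following the definitions above:

```latex
v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a)
```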

6
Theorem
- For any Markov Decision Process, there exists an optimal policy π∗ that is better than or equal to all other policies: π∗ ≥ π, ∀π
- All optimal policies achieve the optimal value function: vπ∗(s) = v∗(s)
- All optimal policies achieve the optimal action-value function: qπ∗(s, a) = q∗(s, a)

Here π ≥ π′ means vπ(s) ≥ vπ′(s) for all s; this relation is a partial order over policies.

7
Bellman Optimality Equation
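The equations are not reproduced in the text; the standard Bellman optimality equations, in the same notation as the expectation equation above, are:

```latex
v_*(s) = \max_{a \in A} \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_*(s') \Big), \qquad
q_*(s, a) = R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a' \in A} q_*(s', a')
```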

8
Solving MDP by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP:
For prediction:
+ Input: MDP <S, A, P, R, 𝛾> and π
+ Output: value function vπ
Or for control:
+ Input: MDP <S, A, P, R, 𝛾>
+ Output: optimal value function v* and optimal policy π*

9
Iterative Policy Evaluation
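The update rule and pseudo-code on this slide are not reproduced in the extracted text. Below is a minimal Python sketch of synchronous iterative policy evaluation; the tabular representation (`P[s][a]` as a list of `(prob, next_state, reward)` triples and `policy[s]` as a dict of action probabilities) is an assumption made for illustration, not the lecture's own code.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman expectation backup until the value changes are tiny.

    Assumptions: P[s][a] is a list of (prob, next_state, reward) triples,
    and policy[s][a] is the probability of taking action a in state s.
    """
    V = {s: 0.0 for s in P}                          # initialise v(s) = 0
    while True:
        delta, V_new = 0.0, {}
        for s in P:                                  # synchronous backup over all states
            v = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward in P[s][a]:
                    v += pi_sa * prob * (reward + gamma * V[s_next])
            V_new[s] = v
            delta = max(delta, abs(v - V[s]))
        V = V_new
        if delta < theta:                            # stop when no state changes much
            return V
```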

10
Greedy Policy Improvement
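The slide's formula is not in the extracted text; the standard greedy improvement step, written with a one-step look-ahead on vπ, is:

```latex
\pi'(s) = \operatorname*{arg\,max}_{a \in A} q_\pi(s, a)
        = \operatorname*{arg\,max}_{a \in A} \Big( R_s^a + \gamma \sum_{s' \in S} P_{ss'}^a \, v_\pi(s') \Big)
```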

11
Policy Improvement

We need to prove that vπ(s) ≤ vπ′(s), ∀s, where π′ is the greedy policy derived from vπ; a proof sketch is given below.
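The standard argument (the policy improvement theorem) repeatedly unrolls the greedy choice; a sketch in the notation used above:

```latex
\begin{aligned}
v_\pi(s) &\le q_\pi(s, \pi'(s))
          = \mathbb{E}\big[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s, A_t = \pi'(s) \big] \\
         &\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big] \\
         &\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \big]
          \le \cdots \le v_{\pi'}(s)
\end{aligned}
```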

12
Value Iteration
Problem: find optimal policy π
Solution: iterative application of Bellman optimality backup
● v1 → v2 → ... → v∗

Using synchronous backups


○ At each iteration k + 1
○ For all states s ∈ S
○ Update vk+1(s) from vk (s’) where s’ is a successor state of s
Unlike policy iteration, there is no explicit policy during the iterations (a minimal code sketch follows below)
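A minimal Python sketch of this loop, under the same assumed tabular representation (`P[s][a]` as a list of `(prob, next_state, reward)` triples) as the policy-evaluation sketch above:

```python
def value_iteration(P, gamma=0.9, theta=1e-6):
    """Iteratively apply the Bellman optimality backup; no explicit policy is stored."""
    V = {s: 0.0 for s in P}
    while True:
        delta, V_new = 0.0, {}
        for s in P:
            # v_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * v_k(s'))
            V_new[s] = max(
                sum(prob * (reward + gamma * V[s_next])
                    for prob, s_next, reward in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta < theta:
            return V
```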

13
Value Iteration

14
Examples
Slow: +1 / Fast: +2
Overheated: -10
Solving with 𝛾 = 0, 0.5, 0.9

15
Examples
Compute v*(high) and v*(low). Verify that, for a specific set of parameter values, the optimal values are fixed points of the Bellman optimality equations.

16
Optimality Proof

17
Optimality Proof

18
Optimality Proof

19
Optimality Proof

20
Quiz: Effect of noise and discount

21
Model Free Prediction
Set-up:
- Given an MDP problem
- Given a policy π(.|s)

Estimate the value function Vπ without knowing the MDP (no access to the transition and reward model), using:


- Monte-Carlo methods
- Temporal-difference methods

22
Examples
We do not know much about the environment:
- Only the current state and the available actions are observed

23
Examples

24
Monte Carlo method
Main idea: we evaluate our policy by following complete trajectories (episodes)

25
Monte Carlo method
To evaluate state s (a code sketch follows the list):
- At the first time-step t that state s is visited in an episode:
- Increment counter N(s) ← N(s) + 1
- Increment total return S(s) ← S(s) + Gt
- Value is estimated by the mean return V(s) = S(s)/N(s)
- By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
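A minimal Python sketch of this first-visit procedure; the episode format (a list of `(state, reward)` pairs generated by following π, where the reward is the one received on leaving that state) is an assumption made for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) as the mean first-visit return over the sampled episodes."""
    N = defaultdict(int)        # visit counter N(s)
    S = defaultdict(float)      # total return  S(s)
    for episode in episodes:    # episode: list of (state, reward) pairs
        G, returns = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):      # compute G_t backwards
            G = episode[t][1] + gamma * G
            returns[t] = G
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state not in seen:                    # first visit to this state only
                seen.add(state)
                N[state] += 1
                S[state] += returns[t]
    return {s: S[s] / N[s] for s in N}               # V(s) = S(s) / N(s)
```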

26
Example

27
Incremental Update
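The formulas on this slide are not in the extracted text; the standard incremental form of the Monte-Carlo mean, and its constant-step-size variant for non-stationary problems, are:

```latex
V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)} \big( G_t - V(S_t) \big), \qquad
V(S_t) \leftarrow V(S_t) + \alpha \big( G_t - V(S_t) \big)
```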

28
Temporal Difference method
The MC approach works on full episodes
Temporal-difference (TD) learning learns from incomplete episodes

Main idea: update a guess towards a guess


- Update V(St) using Rt+1 and V(St+1)
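The TD(0) update behind "update a guess towards a guess" (not reproduced in the slide text) replaces the full return with the one-step bootstrapped target:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big)
```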

29
Temporal Difference method

30
Bias-Variance Trade-off
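The comparison on this slide is an image; in standard terms, the MC target is an unbiased but high-variance sample of vπ(St), while the TD target is biased (it bootstraps from the current estimate V) but depends on only one random transition and so has much lower variance:

```latex
\text{MC target: } G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T, \qquad
\text{TD target: } R_{t+1} + \gamma V(S_{t+1})
```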

31
Advantages and Disadvantages of MC vs. TD
MC has high variance, zero bias
- Good convergence properties (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to vπ(s) (but not always with function approximation)
- More sensitive to initial value

32
Example: How to extract policy
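The worked example is not in the extracted slide text; as a general illustration, acting greedily with respect to V needs the model for a one-step look-ahead, whereas acting greedily with respect to Q does not. A minimal sketch, assuming Q is a dict mapping (state, action) pairs to estimated values:

```python
def greedy_from_q(Q, state):
    """Model-free policy extraction: pick the action with the largest estimated action-value."""
    actions = [a for (s, a) in Q if s == state]      # actions recorded for this state
    return max(actions, key=lambda a: Q[(state, a)])
```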

33
Reading Materials (Sutton & Barto, Reinforcement Learning: An Introduction)
● Chapter 5: 5.1-5.2
● Chapter 6: 6.1-6.2

34
