Lecture 4: Model-Free Prediction
September 2024
1. Review
2. Optimality Proof
3. Model-Free Prediction: MC vs. TD
Markov decision process
A Markov decision process (MDP) is a Markov reward process with
decisions.
Example: Recycling Robot
Example 3.3 (from Intro to RL): A mobile robot has the job of collecting empty soda cans in an office environment. The rewards are zero most of the time, but become positive when the robot secures an empty can, or large and negative if the battery runs all the way down.
Bellman Expectation Equation
One-step look-ahead with the state-value function
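The standard one-step look-ahead form of the Bellman expectation equation, in the ⟨S, A, P, R, 𝛾⟩ notation used below:

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big)

q_\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi(a' \mid s')\, q_\pi(s',a')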
Optimal Solutions
The reward assumption:
- All goals can be described by the maximisation of expected cumulative reward
Therefore, we are interested in:
- The optimal state-value function v∗(s) is the maximum value function over all policies
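In symbols, using the standard definitions:

v_*(s) = \max_{\pi} v_\pi(s), \qquad q_*(s,a) = \max_{\pi} q_\pi(s,a)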
Theorem
- For any Markov Decision Process, there exists an optimal policy π∗ that is better than or equal to all other policies: π∗ ≥ π, ∀π
- All optimal policies achieve the optimal value function: vπ∗(s) = v∗(s)
- All optimal policies achieve the optimal action-value function: qπ∗(s, a) = q∗(s, a)
Bellman Optimality Equation
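The standard form, in the same notation as above:

v_*(s) = \max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_*(s') \Big)

q_*(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} q_*(s',a')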
Solving MDPs by Dynamic Programming
Dynamic programming assumes full knowledge of the MDP
It is used for planning in an MDP:
For prediction:
+ Input: MDP <S, A, P, R, 𝛾> and π
+ Output: value function vπ
Or for control:
+ Input: MDP <S, A, P, R, 𝛾>
+ Output: optimal value function v* and optimal policy π*
Iterative Policy Evaluation
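A minimal sketch of iterative policy evaluation with synchronous backups. The MDP representation (P[s][a] as a list of (prob, next_state, reward) triples) and the function name are illustrative assumptions, not fixed by the lecture:

import numpy as np

def policy_evaluation(P, pi, n_states, n_actions, gamma=0.9, theta=1e-8):
    # P[s][a]: list of (prob, next_state, reward) triples (assumed format)
    # pi[s][a]: probability of taking action a in state s under the evaluated policy
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            # one-step look-ahead weighted by the policy (Bellman expectation backup)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    V_new[s] += pi[s][a] * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new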
Greedy Policy Improvement
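Acting greedily with respect to q_π gives the improved policy:

\pi'(s) = \arg\max_{a} q_\pi(s,a) = \arg\max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_\pi(s') \Big)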
Policy Improvement
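Alternating evaluation and greedy improvement yields policy iteration. A sketch reusing the (hypothetical) policy_evaluation above, under the same assumed representation of P and pi:

import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9):
    # start from the uniform random policy
    pi = np.full((n_states, n_actions), 1.0 / n_actions)
    while True:
        V = policy_evaluation(P, pi, n_states, n_actions, gamma)
        stable = True
        for s in range(n_states):
            # greedy one-step look-ahead with respect to the current V
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            best = int(np.argmax(q))
            if int(np.argmax(pi[s])) != best:
                stable = False
            pi[s] = np.eye(n_actions)[best]  # make the policy greedy (one-hot)
        if stable:
            return pi, V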
Value Iteration
Problem: find the optimal policy π∗
Solution: iterative application of the Bellman optimality backup (see below)
● v1 → v2 → ... → v∗
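The backup applied at each iteration (standard form):

v_{k+1}(s) = \max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v_k(s') \Big)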
Value Iteration
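A minimal value-iteration sketch under the same assumed MDP representation as above:

import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, theta=1e-8):
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # Bellman optimality backup: max over actions of the one-step look-ahead
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next])
            best = q.max()
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place backup
        if delta < theta:
            return V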
Examples
Slow: +1 / Fast: +2
Overheated: -10
Solving with 𝛾 = 0, 0.5, 0.9
Examples
Compute v*(high) and v*(low). Verify that, for a specific setting of the parameter values, the optimal values are fixed.
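With the textbook's parameterization (search and wait rewards r_search and r_wait, battery dynamics α and β, and a −3 penalty for running the battery down), the Bellman optimality equations to verify are:

v_*(\text{high}) = \max\Big\{\, r_{\text{search}} + \gamma\big[\alpha\, v_*(\text{high}) + (1-\alpha)\, v_*(\text{low})\big],\;\; r_{\text{wait}} + \gamma\, v_*(\text{high}) \,\Big\}

v_*(\text{low}) = \max\Big\{\, \beta\, r_{\text{search}} - 3(1-\beta) + \gamma\big[(1-\beta)\, v_*(\text{high}) + \beta\, v_*(\text{low})\big],\;\; r_{\text{wait}} + \gamma\, v_*(\text{low}),\;\; \gamma\, v_*(\text{high}) \,\Big\}

Plugging in a specific choice of α, β, r_search, r_wait, and 𝛾 gives a pair of equations whose solution is the fixed point v_*.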
Optimality Proof
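A standard version of the argument is the policy improvement theorem (cf. Sutton & Barto, Section 4.2). If π'(s) = argmax_a q_π(s, a), then q_π(s, π'(s)) ≥ v_π(s) for every state s, and expanding the right-hand side one step at a time gives

v_\pi(s) \le q_\pi(s, \pi'(s)) = \mathbb{E}\big[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s)\big]
\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma\, q_\pi(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s\big]
\le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2\, q_\pi(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s\big]
\le \cdots \le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big] = v_{\pi'}(s)

So greedy improvement never makes the policy worse, and a policy that can no longer be improved satisfies the Bellman optimality equation and is therefore optimal.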
Quiz: Effect of noise and discount
Model-Free Prediction
Setup:
- Given an MDP problem
- Given a policy π(·|s)
- Goal: estimate the value function vπ from sampled experience, without knowledge of the transition dynamics P or the reward function R
Examples
We do not know much about the environment
- We only observe the current state and the available actions
Examples
Monte Carlo method
Main idea: we evaluate the policy using returns from complete trajectories (episodes)
Monte Carlo method
To evaluate state s with first-visit MC (a code sketch follows below):
- At the first time-step t that state s is visited in an episode,
- Increment the counter N(s) ← N(s) + 1
- Increment the total return S(s) ← S(s) + Gt
- The value is estimated by the mean return V(s) = S(s)/N(s)
- By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
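A minimal sketch of first-visit MC prediction. The episode format (a list of (state, reward) pairs collected by following π, where reward is the reward received after leaving that state) and the function name are illustrative assumptions:

from collections import defaultdict

def mc_prediction(episodes, gamma=0.9):
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s)
    for episode in episodes:
        # compute the return G_t for every time-step by scanning backwards
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # first-visit update: only the first occurrence of each state counts
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V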
Example
Incremental Update
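The mean return can be maintained incrementally instead of storing all returns; the second form, with a constant step size α, also tracks non-stationary problems:

V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\big(G_t - V(S_t)\big)

V(S_t) \leftarrow V(S_t) + \alpha\big(G_t - V(S_t)\big)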
Temporal Difference method
The MC approach works on complete episodes
Temporal difference (TD) learning: learns from incomplete episodes, by bootstrapping
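The simplest case, TD(0), updates the value toward the estimated one-step return (the TD target):

V(S_t) \leftarrow V(S_t) + \alpha\big(R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)\big)

where R_{t+1} + γV(S_{t+1}) is the TD target and δ_t = R_{t+1} + γV(S_{t+1}) − V(S_t) is the TD error.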
Temporal Difference method
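A minimal TD(0) sketch. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the policy being a function from states to actions are illustrative assumptions:

from collections import defaultdict

def td0_prediction(env, policy, n_episodes=1000, alpha=0.1, gamma=0.9):
    V = defaultdict(float)
    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0): move V(s) toward the one-step target R + gamma * V(s')
            target = reward + gamma * (0.0 if done else V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V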
Bias-Variance Trade-off
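The standard comparison: the MC target G_t is an unbiased estimate of v_π(S_t) but has high variance, since it depends on many random actions, transitions, and rewards; the TD target depends on only one random transition, so it has much lower variance, but it is biased because it bootstraps from the current estimate V(S_{t+1}).

\text{MC target: } G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \qquad \text{TD target: } R_{t+1} + \gamma\, V(S_{t+1})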
Advantages and Disadvantages of MC vs. TD
MC has high variance, zero bias
- Good convergence properties (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use
TD has low variance, some bias
- Usually more efficient than MC
- TD(0) converges to vπ(s) (but not always with function approximation)
- More sensitive to initial value
Example: How to extract a policy
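A standard recipe: with a model, act greedily via a one-step look-ahead on v; without a model, act greedily on action values q, which is why model-free control estimates q rather than v:

\pi(s) = \arg\max_{a} \Big( R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') \Big) \quad \text{(with a model)} \qquad \pi(s) = \arg\max_{a} q(s,a) \quad \text{(model-free)}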
Reading Materials
● Intro to RL, Chapter 5: 5.1-5.2
● Intro to RL, Chapter 6: 6.1-6.2