3. Online Evaluation
Thomas Bonald
2024 – 2025
Markov decision process → Model
Definition
Given a Markov decision process, a policy defines the action taken
in each non-terminal state:
$\forall s \in S, \quad \pi(a \mid s) = \mathbb{P}(a_t = a \mid s_t = s)$
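As an illustration (not from the lecture), a tabular policy can be stored as a map from each non-terminal state to a distribution over actions; the states, actions, and probabilities below are made up.

```python
import numpy as np

# Hypothetical example: four states, two actions, uniform random policy.
states = ["A", "B", "C", "D"]
actions = ["left", "right"]
pi = {s: np.array([0.5, 0.5]) for s in states}  # pi[s][i] = P(a_t = actions[i] | s_t = s)

def sample_action(s, rng=np.random.default_rng(0)):
    """Draw a_t according to pi(.|s_t = s)."""
    return actions[rng.choice(len(actions), p=pi[s])]
```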
Gain → Objective
Definition
Given the rewards r0 , r1 , r2 , . . ., we refer to the gain as:
$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
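As a quick sanity check with made-up numbers: for $\gamma = 0.5$ and rewards $r_0 = 1$, $r_1 = 0$, $r_2 = 2$ (zero afterwards), the gain is $G = 1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$.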
Value function → Expectation
Bellman’s equation
The value function Vπ is the unique solution to the fixed-point
equation:
$\forall s, \quad V(s) = \mathbb{E}(r_0 + \gamma V(s_1) \mid s_0 = s)$
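A one-line sketch of why this holds, splitting the gain after the first reward and using the Markov property: $V_\pi(s) = \mathbb{E}(r_0 + \gamma (r_1 + \gamma r_2 + \dots) \mid s_0 = s) = \mathbb{E}(r_0 + \gamma V_\pi(s_1) \mid s_0 = s)$.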
Maze (random policy)
Online evaluation
How can the agent estimate the value function Vπ of her policy
while interacting with the environment?
Useful when:
▶ The environment is unknown (e.g., robot, maze)
▶ The state space is too large (e.g., games)
Outline
1. Incremental mean
2. Monte-Carlo learning
3. TD learning
Incremental mean
How to compute the mean M of some data stream $x_1, x_2, \dots$?
Two options:
1. Store the sum:
$S \leftarrow S + x_t, \quad M \leftarrow \frac{S}{t}$
2. Use the incremental mean:
$M \leftarrow M + \alpha (x_t - M), \quad \alpha = \frac{1}{t}$
We use the notation:
$M \xleftarrow{\alpha} x_t$
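A minimal Python sketch comparing the two options; the function name and test stream are mine, not from the slides.

```python
def running_mean(stream):
    """Compute the mean of a data stream both ways; they agree at every step."""
    S, M, t = 0.0, 0.0, 0
    for x in stream:
        t += 1
        # Option 1: store the sum.
        S += x
        mean_from_sum = S / t
        # Option 2: incremental mean with step size alpha = 1/t.
        alpha = 1.0 / t
        M += alpha * (x - M)
        assert abs(M - mean_from_sum) < 1e-9
    return M

print(running_mean([2.0, 4.0, 6.0]))  # 4.0
```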
MC learning
Idea: Estimate the value function of some policy π using complete episodes $s_0, s_1, \dots, s_T$ (assuming the presence of terminal states or setting the time horizon T).
Gain $G_t$ at time $t = 0, 1, \dots$:
$G_0 = r_0 + \gamma r_1 + \dots + \gamma^{T-1} r_{T-1}$
$G_1 = r_1 + \dots + \gamma^{T-2} r_{T-1}$
$\vdots$
$G_{T-1} = r_{T-1}$
MC updates
$\forall t = 0, \dots, T-1, \quad V(s_t) \xleftarrow{\alpha} G_t$
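A possible tabular implementation of these MC updates from one complete episode; the episode format (a list of (state, reward) pairs), the constant step size α, and the every-visit convention are assumptions of this sketch.

```python
from collections import defaultdict

def mc_update(V, episode, gamma=0.9, alpha=0.1):
    """Every-visit Monte-Carlo updates from one complete episode.

    episode: list of pairs (s_0, r_0), ..., (s_{T-1}, r_{T-1}).
    V: dict mapping state -> current value estimate (updated in place).
    """
    T = len(episode)
    for t in range(T):
        # Gain G_t = r_t + gamma * r_{t+1} + ... + gamma^(T-1-t) * r_{T-1}
        G = sum(gamma ** (k - t) * episode[k][1] for k in range(t, T))
        s_t = episode[t][0]
        V[s_t] += alpha * (G - V[s_t])  # V(s_t) <-α- G_t
    return V

# Usage with made-up states and rewards:
V = defaultdict(float)
mc_update(V, [("A", 1.0), ("C", -3.0), ("B", 5.0)])
```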
Example: A or B
[Figure: example MDP with states A, B, C, D and rewards +1, −2, +5, −3]
Exercise
[Figure: exercise MDP with states A, B, C, D and rewards −1, +1, +3]
Backward updates
Let $s_0, r_0, s_1, r_1, \dots, s_{T-1}, r_{T-1}, s_T$ be an episode:
$G_0 = r_0 + \gamma r_1 + \dots + \gamma^{T-1} r_{T-1}$
$G_1 = r_1 + \dots + \gamma^{T-2} r_{T-1}$
$\vdots$
$G_{T-2} = r_{T-2} + \gamma r_{T-1}$
$G_{T-1} = r_{T-1}$
MC updates (backward)
Init: $G \leftarrow 0$
Updates: for $t = T-1, \dots, 0$:
$G \leftarrow r_t + \gamma G$
$V(s_t) \xleftarrow{\alpha} G$
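The same updates written as a single backward pass over the episode, following the slide; the episode format and step size match the sketch above.

```python
def mc_update_backward(V, episode, gamma=0.9, alpha=0.1):
    """Backward Monte-Carlo updates: the gain G is accumulated from the end of the episode."""
    G = 0.0
    for s_t, r_t in reversed(episode):  # t = T-1, ..., 0
        G = r_t + gamma * G             # G <- r_t + gamma * G
        V[s_t] += alpha * (G - V[s_t])  # V(s_t) <-α- G
    return V
```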
TD learning
TD updates
$\forall t = 0, 1, \dots \quad V(s_t) \xleftarrow{\alpha} r_t + \gamma V(s_{t+1})$
Note: Bootstrapping!
cf. Bellman’s equation
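A minimal sketch of the corresponding TD(0) update, applied after every transition; treating terminal states as having value 0 and using a constant step size are assumptions here.

```python
from collections import defaultdict

V = defaultdict(float)  # tabular value estimates, initialized to 0

def td_update(s, r, s_next, done, gamma=0.9, alpha=0.1):
    """One TD(0) update: V(s) <-α- r + gamma * V(s_next)."""
    target = r if done else r + gamma * V[s_next]  # no bootstrap from a terminal state
    V[s] += alpha * (target - V[s])
```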
[Figure: the "A or B" example MDP shown above]
Exercise
[Figure: the exercise MDP shown above]
Value function
[Figure: Spearman correlation of the value function estimates over episodes 0–10, comparing MC and TD]
MC learning vs TD learning
MC
▶ requires complete episodes
▶ requires memory
▶ has high variance but no bias
TD
▶ learns continuously
▶ is memory-less (cf. Markov property)
▶ has low variance but potentially high bias
(depending on the initial value of V)
From TD to MC: n-step TD
Estimation of the gain at time t after n time steps:
$G_t^{(1)} = r_t + \gamma V(s_{t+1})$
$G_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$
$\vdots$
$G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$
n-step TD updates
$\forall t = 0, 1, \dots \quad V(s_t) \xleftarrow{\alpha} G_t^{(n)}$
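A possible tabular implementation of n-step TD over a complete episode; the episode format, the step size, and the handling of the last few time steps (where fewer than n rewards remain and the terminal value is 0) are assumptions of this sketch.

```python
def n_step_td_episode(V, episode, n=3, gamma=0.9, alpha=0.1):
    """n-step TD updates over one episode given as a list of (state, reward) pairs.

    Uses G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n}).
    """
    T = len(episode)
    states = [s for s, _ in episode]
    rewards = [r for _, r in episode]
    for t in range(T):
        # Rewards actually available after time t (fewer than n near the end of the episode).
        steps = min(n, T - t)
        G = sum(gamma ** k * rewards[t + k] for k in range(steps))
        if t + n < T:
            G += gamma ** n * V[states[t + n]]  # bootstrap on the state reached after n steps
        # If t + n >= T, the episode has terminated: no bootstrap term (terminal value is 0).
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```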
[Figure: the exercise MDP shown above]
Online prediction
▶ MC learning: from complete episodes
▶ TD learning: memory-less → online learning
▶ n-step TD learning: limited memory
Next lecture
Online control