3. Online Evaluation
Thomas Bonald
2024 – 2025
Markov decision process → Model
Definition
Given a Markov decision process, a policy defines the action taken
in each non-terminal state:
$\forall s \in S, \quad \pi(a \mid s) = \mathbb{P}(a_t = a \mid s_t = s)$
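As an illustration (not from the lecture), a tabular policy can be stored as a map from each non-terminal state to a distribution over actions; the states, actions, and probabilities below are made up.

```python
import numpy as np

# Hypothetical example: four states, two actions, uniform random policy.
states = ["A", "B", "C", "D"]
actions = ["left", "right"]
pi = {s: np.array([0.5, 0.5]) for s in states}  # pi[s][i] = P(a_t = actions[i] | s_t = s)

def sample_action(s, rng=np.random.default_rng(0)):
    """Draw a_t according to pi(.|s_t = s)."""
    return actions[rng.choice(len(actions), p=pi[s])]
```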
Gain → Objective
Definition
Given the rewards r0 , r1 , r2 , . . ., we refer to the gain as:
$G = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$
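As a quick sanity check with made-up numbers: for $\gamma = 0.5$ and rewards $r_0 = 1$, $r_1 = 0$, $r_2 = 2$ (zero afterwards), the gain is $G = 1 + 0.5 \cdot 0 + 0.25 \cdot 2 = 1.5$.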
Value function → Expectation
Bellman’s equation
The value function Vπ is the unique solution to the fixed-point
equation:
$\forall s, \quad V(s) = \mathbb{E}(r_0 + \gamma V(s_1) \mid s_0 = s)$
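A one-line sketch of why this holds, splitting the gain after the first reward and using the Markov property: $V_\pi(s) = \mathbb{E}(r_0 + \gamma (r_1 + \gamma r_2 + \dots) \mid s_0 = s) = \mathbb{E}(r_0 + \gamma V_\pi(s_1) \mid s_0 = s)$.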
Maze (random policy)
Online evaluation
How can the agent estimate the value function Vπ of her policy
while interacting with the environment?
Useful when:
▶ The environment is unknown (e.g., robot, maze)
▶ The state space is too large (e.g., games)
Outline
1. Incremental mean
2. Monte-Carlo learning
3. TD learning
Incremental mean
How to compute the mean M of some data stream $x_1, x_2, \dots$?
Two options:
1. Store the sum:
$S \leftarrow S + x_t, \quad M \leftarrow \frac{S}{t}$
2. Use the incremental mean:
$M \leftarrow M + \alpha (x_t - M), \quad \alpha = \frac{1}{t}$
We use the notation:
$M \xleftarrow{\alpha} x_t$
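A minimal Python sketch comparing the two options; the function name and test stream are mine, not from the slides.

```python
def running_mean(stream):
    """Compute the mean of a data stream both ways; they agree at every step."""
    S, M, t = 0.0, 0.0, 0
    for x in stream:
        t += 1
        # Option 1: store the sum.
        S += x
        mean_from_sum = S / t
        # Option 2: incremental mean with step size alpha = 1/t.
        alpha = 1.0 / t
        M += alpha * (x - M)
        assert abs(M - mean_from_sum) < 1e-9
    return M

print(running_mean([2.0, 4.0, 6.0]))  # 4.0
```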
MC learning
Idea: Estimate the value function of some policy π using complete episodes $s_0, s_1, \dots, s_T$ (assuming the presence of terminal states or setting the time horizon T).
Gain $G_t$ at time $t = 0, 1, \dots$:
$G_0 = r_0 + \gamma r_1 + \dots + \gamma^{T-1} r_{T-1}$
$G_1 = r_1 + \dots + \gamma^{T-2} r_{T-1}$
$\vdots$
$G_{T-1} = r_{T-1}$
MC updates
$\forall t = 0, \dots, T-1, \quad V(s_t) \xleftarrow{\alpha} G_t$
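A possible tabular implementation of these MC updates from one complete episode; the episode format (a list of (state, reward) pairs), the constant step size α, and the every-visit convention are assumptions of this sketch.

```python
from collections import defaultdict

def mc_update(V, episode, gamma=0.9, alpha=0.1):
    """Every-visit Monte-Carlo updates from one complete episode.

    episode: list of pairs (s_0, r_0), ..., (s_{T-1}, r_{T-1}).
    V: dict mapping state -> current value estimate (updated in place).
    """
    T = len(episode)
    for t in range(T):
        # Gain G_t = r_t + gamma * r_{t+1} + ... + gamma^(T-1-t) * r_{T-1}
        G = sum(gamma ** (k - t) * episode[k][1] for k in range(t, T))
        s_t = episode[t][0]
        V[s_t] += alpha * (G - V[s_t])  # V(s_t) <-α- G_t
    return V

# Usage with made-up states and rewards:
V = defaultdict(float)
mc_update(V, [("A", 1.0), ("C", -3.0), ("B", 5.0)])
```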
Example: A or B
[Figure: example MDP with states A, B, C, D and rewards +1, −2, +5, −3]
Exercise
[Figure: exercise MDP with states A, B, C, D and rewards −1, +1, +3]
Backward updates
Let $s_0, r_0, s_1, r_1, \dots, s_{T-1}, r_{T-1}, s_T$ be an episode:
$G_0 = r_0 + \gamma r_1 + \dots + \gamma^{T-1} r_{T-1}$
$G_1 = r_1 + \dots + \gamma^{T-2} r_{T-1}$
$\vdots$
$G_{T-2} = r_{T-2} + \gamma r_{T-1}$
$G_{T-1} = r_{T-1}$
MC updates (backward)
Init: $G \leftarrow 0$
Updates: for $t = T-1, \dots, 0$:
$G \leftarrow r_t + \gamma G$
$V(s_t) \xleftarrow{\alpha} G$
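The same updates written as a single backward pass over the episode, following the slide; the episode format and step size match the sketch above.

```python
def mc_update_backward(V, episode, gamma=0.9, alpha=0.1):
    """Backward Monte-Carlo updates: the gain G is accumulated from the end of the episode."""
    G = 0.0
    for s_t, r_t in reversed(episode):  # t = T-1, ..., 0
        G = r_t + gamma * G             # G <- r_t + gamma * G
        V[s_t] += alpha * (G - V[s_t])  # V(s_t) <-α- G
    return V
```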
TD learning
TD updates
$\forall t = 0, 1, \dots \quad V(s_t) \xleftarrow{\alpha} r_t + \gamma V(s_{t+1})$
Note: Bootstrapping!
cf. Bellman’s equation
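A minimal sketch of the corresponding TD(0) update, applied after every transition; treating terminal states as having value 0 and using a constant step size are assumptions here.

```python
from collections import defaultdict

V = defaultdict(float)  # tabular value estimates, initialized to 0

def td_update(s, r, s_next, done, gamma=0.9, alpha=0.1):
    """One TD(0) update: V(s) <-α- r + gamma * V(s_next)."""
    target = r if done else r + gamma * V[s_next]  # no bootstrap from a terminal state
    V[s] += alpha * (target - V[s])
```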
[Figure: the "A or B" example MDP shown above]
Exercise
[Figure: the exercise MDP shown above]
Value function
[Figure: Spearman correlation of the value function estimates over episodes 0–10, comparing MC and TD]
MC learning vs TD learning
MC
▶ requires complete episodes
▶ requires memory
▶ has high variance but no bias
TD
▶ learns continuously
▶ is memory-less (cf. Markov property)
▶ has low variance but potentially high bias
(depending on the initial value of V)
From TD to MC: n-step TD
Estimation of the gain at time t after n time steps:
$G_t^{(1)} = r_t + \gamma V(s_{t+1})$
$G_t^{(2)} = r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2})$
$\vdots$
$G_t^{(n)} = r_t + \gamma r_{t+1} + \dots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})$
n-step TD updates
$\forall t = 0, 1, \dots \quad V(s_t) \xleftarrow{\alpha} G_t^{(n)}$
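A possible tabular implementation of n-step TD over a complete episode; the episode format, the step size, and the handling of the last few time steps (where fewer than n rewards remain and the terminal value is 0) are assumptions of this sketch.

```python
def n_step_td_episode(V, episode, n=3, gamma=0.9, alpha=0.1):
    """n-step TD updates over one episode given as a list of (state, reward) pairs.

    Uses G_t^(n) = r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * V(s_{t+n}).
    """
    T = len(episode)
    states = [s for s, _ in episode]
    rewards = [r for _, r in episode]
    for t in range(T):
        # Rewards actually available after time t (fewer than n near the end of the episode).
        steps = min(n, T - t)
        G = sum(gamma ** k * rewards[t + k] for k in range(steps))
        if t + n < T:
            G += gamma ** n * V[states[t + n]]  # bootstrap on the state reached after n steps
        # If t + n >= T, the episode has terminated: no bootstrap term (terminal value is 0).
        V[states[t]] += alpha * (G - V[states[t]])
    return V
```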
[Figure: the exercise MDP shown above]
Online prediction
▶ MC learning: from complete episodes
▶ TD learning: memory-less → online learning
▶ n-step TD learning: limited memory
Next lecture
Online control