
Reinforcement Learning

3. Online Evaluation

Thomas Bonald

2024 – 2025
Markov decision process → Model

At time t = 0, 1, 2, …, the agent in state s_t takes action a_t and:

▶ receives reward r_t
▶ moves to state s_{t+1}

The rewards and transitions are stochastic in general.
Some states may be terminal.
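As an illustrative sketch (not from the lecture), the interaction loop can be written as follows; the env.reset / env.step interface and the policy function are assumptions, in the spirit of Gym-like APIs:

# Sketch of the agent-environment loop of an MDP (illustrative assumptions:
# env.reset() returns the initial state, env.step(state, action) returns
# the reward, the next state and a terminal flag).

def run_episode(env, policy, horizon=100):
    """Interact with the environment for at most `horizon` steps."""
    state = env.reset()
    trajectory = []                              # list of (state, action, reward)
    for t in range(horizon):
        action = policy(state)                   # action sampled from pi(.|state)
        reward, next_state, terminal = env.step(state, action)
        trajectory.append((state, action, reward))
        if terminal:                             # some states may be terminal
            break
        state = next_state
    return trajectory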
Policy → Agent

Definition
Given a Markov decision process, a policy defines the action taken
in each non-terminal state:

∀s ∈ S,   π(a|s) = P(a_t = a | s_t = s)
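For instance (a sketch with made-up states and actions, not from the slides), a stochastic policy over finite state and action spaces can be stored as a table of distributions π(·|s):

import random

# Hypothetical policy: for each state, the distribution pi(.|s) over actions.
pi = {
    "s0": {"left": 0.5, "right": 0.5},
    "s1": {"left": 0.2, "right": 0.8},
}

def sample_action(pi, state):
    """Draw a ~ pi(.|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]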
Gain → Objective

Definition
Given the rewards r_0, r_1, r_2, …, we refer to the gain as:

G = r_0 + γ r_1 + γ^2 r_2 + …
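As a quick numerical check (reward values assumed for illustration), the gain reduces to a finite sum when the episode ends:

def discounted_gain(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: rewards 1, 0, 2 with gamma = 0.9 give G = 1 + 0 + 0.81 * 2 = 2.62
print(discounted_gain([1.0, 0.0, 2.0], 0.9))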
Value function → Expectation

Consider some policy π.

Definition
The value function of π is the expected gain from each state:

∀s,   V_π(s) = E(G | s_0 = s)

Bellman's equation
The value function V_π is the unique solution to the fixed-point
equation:

∀s,   V(s) = E(r_0 + γ V(s_1) | s_0 = s)
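When the model is known, this fixed-point equation can be solved by simple iteration; here is a minimal sketch on a made-up two-state chain (the lecture itself focuses on the model-free setting below):

gamma = 0.9
# Assumed model under the policy: transitions[s] = [(probability, reward, next_state), ...]
transitions = {
    "A": [(0.5, 1.0, "B"), (0.5, 0.0, "A")],
    "B": [(1.0, 2.0, "A")],
}

V = {s: 0.0 for s in transitions}
for _ in range(200):                    # iterate V <- E(r_0 + gamma * V(s_1))
    V = {s: sum(p * (r + gamma * V[s_next]) for p, r, s_next in transitions[s])
         for s in transitions}
print(V)                                # approximation of the fixed point V_pi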
Maze (random policy)

[Figure: maze example under the random policy]
Online evaluation

How can the agent estimate the value function Vπ of her policy
while interacting with the environment?

Useful when:
▶ The environment is unknown
(e.g., robot, maze)
▶ The state space is too large
(e.g., games)
Outline

1. Incremental mean
2. Monte-Carlo learning
3. TD learning
Incremental mean

How to compute the mean M of some data stream x_1, x_2, …?

Two options:
1. Store the sum:

   S ← S + x_t,   M ← S / t

2. Use the incremental mean:

   M ← M + α (x_t − M),   with α = 1/t

We use the notation M ←α x_t for this update.

Unless otherwise specified, we take α = 1/t.

In practice, the parameter α can be constant → learning rate
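A minimal sketch of the two options (illustrative data; the constant-α variant corresponds to a fixed learning rate):

def mean_by_sum(stream):
    """Option 1: store the running sum S and divide by t."""
    S, t = 0.0, 0
    for x in stream:
        t += 1
        S += x
    return S / t

def mean_incremental(stream, alpha=None):
    """Option 2: M <- M + alpha * (x_t - M); alpha = 1/t unless a constant is given."""
    M, t = 0.0, 0
    for x in stream:
        t += 1
        step = 1.0 / t if alpha is None else alpha   # constant alpha = learning rate
        M += step * (x - M)
    return M

data = [2.0, 4.0, 6.0]
print(mean_by_sum(data), mean_incremental(data))     # both give 4.0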
MC learning

Idea: Estimate the value function of some policy π using
complete episodes s_0, s_1, …, s_T (assuming the presence of
terminal states or setting the time horizon T)

Gain G_t at time t = 0, 1, …

G_0 = r_0 + γ r_1 + … + γ^{T−1} r_{T−1}
G_1 = r_1 + … + γ^{T−2} r_{T−1}
⋮
G_{T−1} = r_{T−1}

MC updates

∀t = 0, …, T − 1,   V(s_t) ←α G_t
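A sketch of these updates in code, assuming an episode is given as a list of (state, reward) pairs (s_0, r_0), …, (s_{T−1}, r_{T−1}) and using α = 1 / (number of visits), as in the incremental mean:

from collections import defaultdict

def mc_episode_update(V, counts, episode, gamma):
    """Every-visit MC updates: V(s_t) <- V(s_t) + alpha * (G_t - V(s_t))."""
    T = len(episode)
    for t in range(T):
        # Gain from time t: G_t = r_t + gamma * r_{t+1} + ... + gamma^{T-1-t} * r_{T-1}
        G = sum(gamma**(k - t) * episode[k][1] for k in range(t, T))
        s = episode[t][0]
        counts[s] += 1
        V[s] += (G - V[s]) / counts[s]       # alpha = 1 / number of visits
    return V

V, counts = defaultdict(float), defaultdict(int)
# Example call (made-up episode): mc_episode_update(V, counts, [("A", 1.0), ("B", -2.0)], 0.9)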
Example: A or B

[Figure: transition diagram over states A, B, C, D with rewards +1, −2, +5, −3]
Exercise

[Figure: transition diagram over states S, A, B, C, D with rewards −1, +1, +3, +2, +3, +1]

What is the value function after MC learning
over the episodes [S, A, B, C] and [S, B, C]?
Example: Random walk

Random walk in a 5 × 5 grid

Discount factor γ = 0.9
Time horizon = 100

[Figure: rewards and value function]

Random walk: MC learning

Random walk in a 5 × 5 grid

Discount factor γ = 0.9
Time horizon = 100

[Figure: value function, MC estimate after 1 episode, MC estimate after 10 episodes]


Backward updates

Let s_0, r_0, s_1, r_1, …, s_{T−1}, r_{T−1}, s_T be an episode

G_0 = r_0 + γ r_1 + … + γ^{T−1} r_{T−1}
G_1 = r_1 + … + γ^{T−2} r_{T−1}
⋮
G_{T−2} = r_{T−2} + γ r_{T−1}
G_{T−1} = r_{T−1}

MC updates (backward)

Init: G ← 0
Updates: For t = T − 1, …, 0

  G ← r_t + γ G
  V(s_t) ←α G
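The same updates in code (again assuming episodes as lists of (state, reward) pairs); accumulating G backward avoids recomputing each gain from scratch:

def mc_episode_update_backward(V, counts, episode, gamma):
    """Backward MC updates: accumulate G <- r_t + gamma * G from the end of the episode."""
    G = 0.0
    for s, r in reversed(episode):           # t = T-1, ..., 0
        G = r + gamma * G                    # gain from time t
        counts[s] += 1
        V[s] += (G - V[s]) / counts[s]       # alpha = 1 / number of visits
    return V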
TD learning

Idea: Online estimation of the value function of some policy π
(no need for complete episodes)

TD updates

∀t = 0, 1, …   V(s_t) ←α r_t + γ V(s_{t+1})

Note: Bootstrapping!
cf. Bellman's equation

∀s,   V(s) = E(r_t + γ V(s_{t+1}) | s_t = s)
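A minimal sketch of one TD update (α taken constant here; V(s_{t+1}) is set to 0 when s_{t+1} is terminal):

def td_update(V, state, reward, next_state, terminal, alpha, gamma):
    """One TD(0) update: V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))."""
    bootstrap = 0.0 if terminal else gamma * V[next_state]
    V[state] += alpha * (reward + bootstrap - V[state])
    return V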


Example: A or B

[Figure: transition diagram over states A, B, C, D with rewards +1, −2, +5, −3]
Exercise

[Figure: transition diagram over states S, A, B, C, D with rewards −1, +1, +3, +2, +3, +1]

You start from V = 0.

What is the value function after TD learning
over the episodes [S, A, B, C] and [S, B, C]?
Random walk: TD learning

Random walk in a 5 × 5 grid

Discount factor γ = 0.9

[Figure: value function, TD estimate at t = 100, TD estimate at t = 1000]


Random walk: TD learning vs MC learning

[Figure: value function, TD estimates (t = 100, t = 1000), MC estimates (1 × 100, 10 × 100)]


Random walk: Learning curve

Spearman correlation with the true value function

Time horizon = 100

[Figure: Spearman correlation between the estimated and true value functions (0 to 1) as a function of the number of episodes (0 to 10), for MC and TD]
MC learning vs TD learning

MC
▶ requires complete episodes
▶ requires memory
▶ has high variance but no bias

TD
▶ learns continuously
▶ is memory-less (cf. Markov property)
▶ has low variance but potentially high bias
  (depending on the initial value of V)
From TD to MC: n-step TD

Estimation of the gain at time t after n time steps:

G_t^(1) = r_t + γ V(s_{t+1})
G_t^(2) = r_t + γ r_{t+1} + γ^2 V(s_{t+2})
⋮
G_t^(n) = r_t + γ r_{t+1} + … + γ^{n−1} r_{t+n−1} + γ^n V(s_{t+n})

n-step TD updates

∀t = 0, 1, …   V(s_t) ←α G_t^(n)

Note: Requires a memory of size n
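A sketch of n-step TD with a sliding window of size n over a stream of (state, reward) pairs (end-of-episode handling omitted; α taken constant):

from collections import deque

def n_step_td(stream, V, n, alpha, gamma):
    """n-step TD: when s_{t+n} is observed, update V(s_t) towards G_t^(n)."""
    window = deque()                          # memory of size n: pairs (s_k, r_k)
    for state, reward in stream:
        if len(window) == n:
            s_t = window[0][0]                # the new `state` plays the role of s_{t+n}
            G = sum(gamma**k * r for k, (_, r) in enumerate(window))
            G += gamma**n * V[state]          # G_t^(n) = r_t + ... + gamma^n * V(s_{t+n})
            V[s_t] += alpha * (G - V[s_t])
            window.popleft()
        window.append((state, reward))
    return V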


Exercise

[Figure: transition diagram over states S, A, B, C, D with rewards −1, +1, +3, +2, +3, +1]

You start from V = 0.

What is the value function after 2-step TD learning over the
episodes [S, A, B, C] and [S, B, C]?
Summary

Online prediction
▶ MC learning
From complete episodes
▶ TD learning
Memory-less → online learning
▶ n-step TD learning
Limited memory

Next lecture
Online control
