
Lecture 2: Making Sequences of Good Decisions Given a Model of the World

Emma Brunskill

CS234 Reinforcement Learning

Winter 2019

Today’s Plan

Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when we don't have a model of how the world works

Models, Policies, Values

Model: Mathematical models of dynamics and reward


Policy: Function mapping agent’s states to actions
Value function: expected future rewards from being in a state (and/or taking an action)
when following a particular policy

Today: Given a model of the world

Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs

Full Observability: Markov Decision Process (MDP)

MDPs can model a huge number of interesting problems and settings


Bandits: single-state MDPs
Optimal control: mostly about continuous-state MDPs
Partially observable MDPs: an MDP where the state is the history
Recall: Markov Property

Information state: sufficient statistic of history


State $s_t$ is Markov if and only if:

$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$

Future is independent of past given present

Markov Process or Markov Chain

Memoryless random process


Sequence of random states with Markov property
Definition of Markov Process
S is a (finite) set of states (s ∈ S)
P is the dynamics/transition model that specifies $p(s_{t+1} = s' \mid s_t = s)$
Note: no rewards, no actions
If finite number (N) of states, can express P as a matrix

$$P = \begin{pmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & \ddots & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{pmatrix}$$

Example: Mars Rover Markov Chain Transition Matrix, P

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$; each interior state moves to either neighbor with probability 0.4 and stays put with probability 0.2; $s_1$ and $s_7$ stay put with probability 0.6.]

$$P = \begin{pmatrix} 0.6 & 0.4 & 0 & 0 & 0 & 0 & 0 \\ 0.4 & 0.2 & 0.4 & 0 & 0 & 0 & 0 \\ 0 & 0.4 & 0.2 & 0.4 & 0 & 0 & 0 \\ 0 & 0 & 0.4 & 0.2 & 0.4 & 0 & 0 \\ 0 & 0 & 0 & 0.4 & 0.2 & 0.4 & 0 \\ 0 & 0 & 0 & 0 & 0.4 & 0.2 & 0.4 \\ 0 & 0 & 0 & 0 & 0 & 0.4 & 0.6 \end{pmatrix}$$

Example: Mars Rover Markov Chain Episodes

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$, with the same transition probabilities as above.]

Example: Sample episodes starting from S4


s4 , s5 , s6 , s7 , s7 , s7 , . . .
s4 , s4 , s5 , s4 , s5 , s6 , . . .
s4 , s3 , s2 , s1 , . . .
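
As a quick illustration (not part of the original slides), here is a minimal Python/NumPy sketch that samples episodes like these from the Mars rover Markov chain; the transition matrix is taken from the slide above, and the function name `sample_episode` is my own.

```python
import numpy as np

# Mars rover Markov chain transition matrix from the slide (rows = current state s1..s7).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])

def sample_episode(P, start, length, rng):
    """Sample a state sequence of the given length, starting from `start` (0-indexed)."""
    states = [start]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

rng = np.random.default_rng(0)
# Sample a few episodes starting from s4 (index 3), as on the slide.
for _ in range(3):
    print([f"s{s + 1}" for s in sample_episode(P, start=3, length=6, rng=rng)])
```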

Markov Reward Process (MRP)

Markov Reward Process is a Markov Chain + rewards


Definition of Markov Reward Process (MRP)
S is a (finite) set of states (s ∈ S)
P is the dynamics/transition model that specifies $P(s_{t+1} = s' \mid s_t = s)$
R is a reward function: $R(s_t = s) = \mathbb{E}[r_t \mid s_t = s]$
Discount factor γ ∈ [0, 1]
Note: no actions
If finite number (N) of states, can express R as a vector

Example: Mars Rover MRP

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$, with the same transition probabilities as above.]

Reward: +1 in s1 , +10 in s7 , 0 in all other states

Return & Value Function

Definition of Horizon
Number of time steps in each episode
Can be infinite
Otherwise called finite Markov reward process
Definition of Return, Gt (for a MRP)
Discounted sum of rewards from time step t to horizon

$$G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots$$

Definition of State Value Function, V (s) (for a MRP)


Expected return from starting in state s

$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s]$$

Discount Factor

Mathematically convenient (avoid infinite returns and values)


Humans often act as if there’s a discount factor < 1
γ = 0: Only care about immediate reward
γ = 1: Future reward is as beneficial as immediate reward
If episode lengths are always finite, can use γ = 1

Example: Mars Rover MRP

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$, with the same transition probabilities as above.]

Reward: +1 in s1 , +10 in s7 , 0 in all other states


Sample returns for sample 4-step episodes, γ = 1/2
$s_4, s_5, s_6, s_7$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 10 = 1.25$

Example: Mars Rover MRP

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$, with the same transition probabilities as above.]

Reward: +1 in s1 , +10 in s7 , 0 in all other states


Sample returns for sample 4-step episodes, γ = 1/2
$s_4, s_5, s_6, s_7$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 10 = 1.25$
$s_4, s_4, s_5, s_4$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 0 = 0$
$s_4, s_3, s_2, s_1$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 1 = 0.125$

Example: Mars Rover MRP

[Figure: Mars rover Markov chain over states $s_1, \dots, s_7$, with the same transition probabilities as above.]

Reward: +1 in s1 , +10 in s7 , 0 in all other states


Value function: expected return from starting in state s
$$V(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots \mid s_t = s]$$
Sample returns for sample 4-step episodes, $\gamma = 1/2$:
$s_4, s_5, s_6, s_7$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 10 = 1.25$
$s_4, s_4, s_5, s_4$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 0 = 0$
$s_4, s_3, s_2, s_1$: $0 + \tfrac{1}{2} \times 0 + \tfrac{1}{4} \times 0 + \tfrac{1}{8} \times 1 = 0.125$
V = [1.53 0.37 0.13 0.22 0.85 3.59 15.31]
Computing the Value of a Markov Reward Process

Could estimate by simulation


Generate a large number of episodes
Average returns
Concentration inequalities bound how quickly average concentrates to
expected value
Requires no assumption of Markov structure

Computing the Value of a Markov Reward Process

Could estimate by simulation


Markov property yields additional structure
MRP value function satisfies
$$V(s) = \underbrace{R(s)}_{\text{Immediate reward}} + \underbrace{\gamma \sum_{s' \in S} P(s'|s)\, V(s')}_{\text{Discounted sum of future rewards}}$$

Matrix Form of Bellman Equation for MRP

For finite state MRP, we can express V(s) using a matrix equation

$$\begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix} = \begin{pmatrix} R(s_1) \\ \vdots \\ R(s_N) \end{pmatrix} + \gamma \begin{pmatrix} P(s_1|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \ddots & \vdots \\ P(s_1|s_N) & \cdots & P(s_N|s_N) \end{pmatrix} \begin{pmatrix} V(s_1) \\ \vdots \\ V(s_N) \end{pmatrix}$$

$$V = R + \gamma P V$$

Analytic Solution for Value of MRP

For finite state MRP, we can express V(s) using the matrix equation above

$$V = R + \gamma P V$$
$$V - \gamma P V = R$$
$$(I - \gamma P)V = R$$
$$V = (I - \gamma P)^{-1} R$$

Solving directly requires taking a matrix inverse, $\sim O(N^3)$
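
As an illustrative sketch (not part of the original slides), the closed-form solution can be computed directly with NumPy for the Mars rover MRP; the reward vector follows the earlier slide (+1 in $s_1$, +10 in $s_7$), and solving the linear system is used rather than forming the inverse explicitly.

```python
import numpy as np

# Mars rover MRP: transition matrix P and reward vector R (+1 in s1, +10 in s7).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
R = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])
gamma = 0.5

# Solve (I - gamma * P) V = R instead of explicitly inverting the matrix.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(np.round(V, 2))  # roughly [1.53 0.37 0.13 0.22 0.85 3.59 15.31], matching the earlier slide
```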

Iterative Algorithm for Computing Value of a MRP

Dynamic programming
Initialize V0 (s) = 0 for all s
For k = 1 until convergence
For all s in S
$$V_k(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s)\, V_{k-1}(s')$$

Computational complexity: $O(|S|^2)$ for each iteration ($|S| = N$)
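
A minimal sketch of this dynamic-programming iteration (my own illustration; it assumes arrays `P` and `R` like the Mars rover ones above, and uses a simple max-change stopping rule):

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Iteratively compute the MRP value function: V_k = R + gamma * P @ V_{k-1}."""
    V = np.zeros(len(R))                        # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:     # stop once the values stop changing
            return V_new
        V = V_new

# Example usage with the Mars rover MRP defined earlier:
# V = mrp_value_iterative(P, R, gamma=0.5)
```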

Markov Decision Process (MDP)

Markov Decision Process is Markov Reward Process + actions


Definition of MDP
S is a (finite) set of Markov states s ∈ S
A is a (finite) set of actions a ∈ A
P is the dynamics/transition model for each action, that specifies $P(s_{t+1} = s' \mid s_t = s, a_t = a)$
R is a reward function¹: $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
Discount factor $\gamma \in [0, 1]$


MDP is a tuple: (S, A, P, R, γ)

¹ Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.
Example: Mars Rover MDP

[Figure: Mars rover MDP over states $s_1, \dots, s_7$ with two deterministic actions.]

$$P(s'|s, a_1) = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix} \qquad P(s'|s, a_2) = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

2 deterministic actions
MDP Policies

Policy specifies what action to take in each state


Can be deterministic or stochastic
For generality, consider as a conditional distribution
Given a state, specifies a distribution over actions
Policy: $\pi(a|s) = P(a_t = a \mid s_t = s)$

MDP + Policy

MDP + π(a|s) = Markov Reward Process


Precisely, it is the MRP $(S, R^{\pi}, P^{\pi}, \gamma)$, where

$$R^{\pi}(s) = \sum_{a \in A} \pi(a|s)\, R(s, a)$$
$$P^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s)\, P(s'|s, a)$$

Implies we can use the same techniques to evaluate the value of a policy for an MDP as we used to compute the value of an MRP, by defining an MRP with $R^{\pi}$ and $P^{\pi}$
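
This reduction is easy to implement; the following NumPy sketch (my own, not from the slides) collapses a tabular MDP and a stochastic policy into the induced MRP quantities $R^{\pi}$ and $P^{\pi}$:

```python
import numpy as np

def mdp_plus_policy_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy into the induced MRP.

    P:  transitions, shape (A, S, S), P[a, s, s'] = P(s'|s, a)
    R:  rewards, shape (S, A), R[s, a] = R(s, a)
    pi: policy, shape (S, A), pi[s, a] = pi(a|s)
    Returns (R_pi, P_pi) with shapes (S,) and (S, S).
    """
    R_pi = np.einsum("sa,sa->s", pi, R)      # R^pi(s)     = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum("sa,ast->st", pi, P)    # P^pi(s'|s)  = sum_a pi(a|s) P(s'|s, a)
    return R_pi, P_pi

# Example usage (uniform-random policy on the 2-action Mars rover MDP sketched above):
# P = np.stack([P_a1, P_a2]); R = np.zeros((7, 2)); R[0, :] = 1; R[6, :] = 10
# pi = np.full((7, 2), 0.5)
# R_pi, P_pi = mdp_plus_policy_to_mrp(P, R, pi)
```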

MDP Policy Evaluation, Iterative Algorithm

Initialize V0 (s) = 0 for all s


For k = 1 until convergence
For all s in S
$$V_k^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s))\, V_{k-1}^{\pi}(s')$$

This is a Bellman backup for a particular policy
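
For a deterministic policy this backup is just the MRP iteration with $R^{\pi}$ and $P^{\pi}$; below is a compact sketch (my own, not from the slides), with `policy` given as an array of action indices:

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma, tol=1e-8):
    """Iterative policy evaluation for a deterministic policy.

    P:      transitions, shape (A, S, S); R: rewards, shape (S, A)
    policy: array of shape (S,) giving the action taken in each state
    """
    S = R.shape[0]
    states = np.arange(S)
    P_pi = P[policy, states, :]              # P_pi[s, s'] = P(s' | s, policy(s))
    R_pi = R[states, policy]                 # R_pi[s]     = R(s, policy(s))
    V = np.zeros(S)
    while True:
        V_new = R_pi + gamma * P_pi @ V      # Bellman backup for this fixed policy
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```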

Policy Evaluation: Example & Check Your Understanding

[Figure: Mars rover states $s_1, \dots, s_7$.]

Two actions
Reward: for all actions, +1 in state s1 , +10 in state s7 , 0 otherwise
Let π(s) = a1 ∀s. γ = 0.
What is the value of this policy?
Recall the iterative update:

$$V_k^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s))\, V_{k-1}^{\pi}(s')$$

With $\gamma = 0$, only the immediate reward matters: $V^{\pi} = [1\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\ \ 10]$
Practice: MDP 1 Iteration of Policy Evaluation, Mars
Rover Example
Dynamics: $p(s_6|s_6, a_1) = 0.5$, $p(s_7|s_6, a_1) = 0.5$, . . .
Reward: for all actions, +1 in state $s_1$, +10 in state $s_7$, 0 otherwise
Let $\pi(s) = a_1\ \forall s$, assume $V_k = [1\ 0\ 0\ 0\ 0\ 0\ 10]$ and $k = 1$, $\gamma = 0.5$
For all $s$ in $S$:

$$V_k^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s' \in S} p(s'|s, \pi(s))\, V_{k-1}^{\pi}(s')$$

$$V_{k+1}(s_6) = r(s_6, a_1) + \gamma \cdot 0.5 \cdot V_k(s_6) + \gamma \cdot 0.5 \cdot V_k(s_7)$$
$$V_{k+1}(s_6) = 0 + 0.5 \cdot 0.5 \cdot 0 + 0.5 \cdot 0.5 \cdot 10$$
$$V_{k+1}(s_6) = 2.5$$
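
A short numeric check of this backup (my own sketch, not from the slides; it uses the 0.5/0.5 dynamics for $s_6$ under $a_1$ given above):

```python
gamma = 0.5
V_k = [1, 0, 0, 0, 0, 0, 10]              # indices 0..6 correspond to s1..s7
# One Bellman backup for s6 under a1: r(s6, a1) = 0, then 0.5/0.5 split to s6 and s7.
V_next_s6 = 0 + gamma * (0.5 * V_k[5] + 0.5 * V_k[6])
print(V_next_s6)                           # 2.5, matching the slide
```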

MDP Control

Compute the optimal policy

$$\pi^*(s) = \arg\max_{\pi} V^{\pi}(s)$$

There exists a unique optimal value function


Optimal policy for an MDP in an infinite horizon problem is deterministic

Check Your Understanding

[Figure: Mars rover states $s_1, \dots, s_7$.]

7 discrete states (location of rover)


2 actions: Left or Right
How many deterministic policies are there?
$2^7 = 128$

Is the optimal policy for an MDP always unique?

No, there may be two actions that have the same optimal value
function

MDP Control

Compute the optimal policy

$$\pi^*(s) = \arg\max_{\pi} V^{\pi}(s)$$

There exists a unique optimal value function


Optimal policy for an MDP in an infinite horizon problem (agent acts forever) is:
Deterministic
Stationary (does not depend on time step)
Unique? Not necessarily; there may be state-actions with identical optimal values

Policy Search

One option is searching to compute best policy


Number of deterministic policies is $|A|^{|S|}$
Policy iteration is generally more efficient than enumeration

MDP Policy Iteration (PI)

Set $i = 0$
Initialize $\pi_0(s)$ randomly for all states $s$
While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm, measures if the policy changed for any state):
$V^{\pi_i} \leftarrow$ MDP V function policy evaluation of $\pi_i$
$\pi_{i+1} \leftarrow$ Policy improvement
$i = i + 1$

New Definition: State-Action Value Q

State-action value of a policy


$$Q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi}(s')$$

Take action a, then follow the policy π

Policy Improvement

Compute state-action value of a policy πi


For $s$ in $S$ and $a$ in $A$:

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$

Compute new policy $\pi_{i+1}$, for all $s \in S$:

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a) \quad \forall s \in S$$

MDP Policy Iteration (PI)

Set $i = 0$
Initialize $\pi_0(s)$ randomly for all states $s$
While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm, measures if the policy changed for any state):
$V^{\pi_i} \leftarrow$ MDP V function policy evaluation of $\pi_i$
$\pi_{i+1} \leftarrow$ Policy improvement
$i = i + 1$
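
Putting evaluation and improvement together, here is a compact policy-iteration sketch (my own illustration, not from the slides; it uses exact linear-solve policy evaluation rather than the iterative evaluation above, and assumes tabular arrays `P` of shape (A, S, S) and `R` of shape (S, A)):

```python
import numpy as np

def policy_iteration(P, R, gamma, rng=None):
    """Tabular policy iteration: alternate exact policy evaluation and greedy improvement."""
    rng = np.random.default_rng() if rng is None else rng
    A, S, _ = P.shape
    states = np.arange(S)
    policy = rng.integers(A, size=S)                      # random initial deterministic policy
    while True:
        # Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi for the induced MRP.
        P_pi = P[policy, states, :]
        R_pi = R[states, policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Greedy policy improvement on Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s').
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):            # policy stable: done
            return new_policy, V
        policy = new_policy

# Example usage: policy_iteration(P, R, gamma=0.5) on the Mars rover arrays sketched earlier.
```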

Delving Deeper Into Policy Improvement Step

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$

Delving Deeper Into Policy Improvement Step

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$
$$\max_a Q^{\pi_i}(s, a) \ge R(s, \pi_i(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_i(s))\, V^{\pi_i}(s') = V^{\pi_i}(s)$$
$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$$

Suppose we take $\pi_{i+1}(s)$ for one action, then follow $\pi_i$ forever
Our expected sum of rewards is at least as good as if we had always followed $\pi_i$
But the new proposed policy is to always follow $\pi_{i+1}$ ...

Monotonic Improvement in Policy

Definition
$V^{\pi_1} \ge V^{\pi_2}$: $V^{\pi_1}(s) \ge V^{\pi_2}(s)$, $\forall s \in S$
Proposition: $V^{\pi_{i+1}} \ge V^{\pi_i}$, with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$

Proof: Monotonic Improvement in Policy

$$V^{\pi_i}(s) \le \max_a Q^{\pi_i}(s, a) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s') \right]$$

Proof: Monotonic Improvement in Policy

$$\begin{aligned}
V^{\pi_i}(s) &\le \max_a Q^{\pi_i}(s, a) \\
&= \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s') \right] \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s))\, V^{\pi_i}(s') \quad \text{// by the definition of } \pi_{i+1} \\
&\le R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) \left( \max_{a'} Q^{\pi_i}(s', a') \right) \\
&= R(s, \pi_{i+1}(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_{i+1}(s)) \left( R(s', \pi_{i+1}(s')) + \gamma \sum_{s'' \in S} P(s''|s', \pi_{i+1}(s'))\, V^{\pi_i}(s'') \right) \\
&\;\;\vdots \\
&= V^{\pi_{i+1}}(s)
\end{aligned}$$

Policy Iteration (PI): Check Your Understanding
Note: all the below is for finite state-action spaces
Set $i = 0$
Initialize $\pi_0(s)$ randomly for all states $s$
While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm, measures if the policy changed for any state):
$V^{\pi_i} \leftarrow$ MDP V function policy evaluation of $\pi_i$
$\pi_{i+1} \leftarrow$ Policy improvement
$i = i + 1$
If policy doesn’t change, can it ever change again?
No

Is there a maximum number of iterations of policy iteration?


$|A|^{|S|}$, since that is the maximum number of deterministic policies, and as the policy improvement step is monotonically improving, each policy can appear in at most one round of policy iteration unless it is an optimal policy.

Policy Iteration (PI): Check Your Understanding

Suppose for all $s \in S$, $\pi_{i+1}(s) = \pi_i(s)$

Then for all $s \in S$, $Q^{\pi_{i+1}}(s, a) = Q^{\pi_i}(s, a)$
Recall the policy improvement step:

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$
$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$$
$$\pi_{i+2}(s) = \arg\max_a Q^{\pi_{i+1}}(s, a) = \arg\max_a Q^{\pi_i}(s, a)$$

Therefore policy cannot ever change again

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes optimal value and policy


Value iteration is another technique
Idea: Maintain the optimal value of starting in a state $s$ if we have a finite number of steps $k$ left in the episode
Iterate to consider longer and longer episodes

Bellman Equation and Bellman Backup Operators

Value function of a policy must satisfy the Bellman equation


$$V^{\pi}(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s'|s)\, V^{\pi}(s')$$

Bellman backup operator


Applied to a value function
Returns a new value function
Improves the value if possible
$$BV(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} p(s'|s, a)\, V(s') \right]$$

BV yields a value function over all states s

Value Iteration (VI)

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until [finite horizon, convergence]:
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$

View as Bellman backup on value function

$$V_{k+1} = BV_k$$
$$\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
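
A tabular value-iteration sketch of this loop (my own illustration, not from the slides; same array conventions as before, `P` of shape (A, S, S) and `R` of shape (S, A)):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Tabular value iteration: repeatedly apply the Bellman backup V <- BV."""
    A, S, _ = P.shape
    V = np.zeros(S)                                       # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)      # Q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)                             # Bellman backup: V_{k+1} = B V_k
        if np.max(np.abs(V_new - V)) < tol:               # stop once the values have converged
            return Q.argmax(axis=1), V_new                # greedy policy extracted from the last backup
        V = V_new
```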

Policy Iteration as Bellman Operations

Bellman backup operator $B^{\pi}$ for a particular policy is defined as

$$B^{\pi} V(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s'|s)\, V(s')$$

Policy evaluation amounts to computing the fixed point of $B^{\pi}$

To do policy evaluation, repeatedly apply the operator until $V$ stops changing:
$$V^{\pi} = B^{\pi} B^{\pi} \cdots B^{\pi} V$$

Policy Iteration as Bellman Operations

Bellman backup operator $B^{\pi}$ for a particular policy is defined as

$$B^{\pi} V(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s'|s)\, V(s')$$

To do policy improvement:

$$\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_k}(s') \right]$$

Going Back to Value Iteration (VI)

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until [finite horizon, convergence]:
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$

Equivalently, in Bellman backup notation

$$V_{k+1} = BV_k$$

To extract the optimal policy if we can act for $k + 1$ more steps:

$$\pi(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_{k+1}(s') \right]$$

Contraction Operator

Let $O$ be an operator, and $|x|$ denote (any) norm of $x$

If $|OV - OV'| \le |V - V'|$, then $O$ is a contraction operator

Will Value Iteration Converge?

Yes, if the discount factor $\gamma < 1$, or if we end up in a terminal state with probability 1
The Bellman backup is a contraction if the discount factor $\gamma < 1$
If we apply it to two different value functions, the distance between the value functions shrinks after applying the Bellman equation to each

Proof: Bellman Backup is a Contraction on V for γ < 1

Let $\|V - V'\| = \max_s |V(s) - V'(s)|$ be the infinity norm

$$\|BV_k - BV_j\| = \left\| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right] - \max_{a'} \left[ R(s, a') + \gamma \sum_{s' \in S} P(s'|s, a')\, V_j(s') \right] \right\|$$

Proof: Bellman Backup is a Contraction on V for γ < 1

Let $\|V - V'\| = \max_s |V(s) - V'(s)|$ be the infinity norm

$$\begin{aligned}
\|BV_k - BV_j\| &= \left\| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right] - \max_{a'} \left[ R(s, a') + \gamma \sum_{s' \in S} P(s'|s, a')\, V_j(s') \right] \right\| \\
&\le \left\| \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') - R(s, a) - \gamma \sum_{s' \in S} P(s'|s, a)\, V_j(s') \right] \right\| \\
&= \left\| \max_a\, \gamma \sum_{s' \in S} P(s'|s, a) \big( V_k(s') - V_j(s') \big) \right\| \\
&\le \left\| \max_a\, \gamma \sum_{s' \in S} P(s'|s, a)\, \|V_k - V_j\| \right\| \\
&= \left\| \gamma\, \|V_k - V_j\| \max_a \sum_{s' \in S} P(s'|s, a) \right\| \\
&= \gamma \|V_k - V_j\|
\end{aligned}$$

Note: Even if all inequalities are equalities, this is still a contraction if γ < 1
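
As an illustrative check (my own, not from the slides), the contraction bound can be verified numerically on a randomly generated tabular MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
A, S, gamma = 3, 5, 0.9

# Random tabular MDP: each row P[a, s, :] is a probability distribution over next states.
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))

def bellman_backup(V):
    """B V(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)

V1, V2 = rng.normal(size=S), rng.normal(size=S)
lhs = np.max(np.abs(bellman_backup(V1) - bellman_backup(V2)))   # ||B V1 - B V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2))                            # gamma * ||V1 - V2||_inf
print(lhs <= rhs + 1e-12)                                        # True: the backup is a gamma-contraction
```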

Check Your Understanding

Prove value iteration converges to a unique solution for discrete state


and action spaces with γ < 1
Does the initialization of values in value iteration impact anything?

Value Iteration for Finite Horizon H

Vk = optimal value if making k more decisions


πk = optimal policy if making k more decisions
Initialize V0 (s) = 0 for all states s
For k = 1 : H
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
$$\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
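
A finite-horizon variant of the earlier value-iteration sketch (my own illustration, not from the slides); it stores one greedy policy per number of remaining steps, which is exactly why the optimal policy need not be stationary in finite-horizon problems:

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Value iteration for horizon H: returns V[k] and pi[k] for k = 0..H remaining steps."""
    A, S, _ = P.shape
    V = [np.zeros(S)]                                      # V_0 = 0: no steps left, no reward to go
    policies = []
    for _ in range(H):
        Q = R + gamma * np.einsum("ast,t->sa", P, V[-1])   # backup from the k-steps-to-go values
        policies.append(Q.argmax(axis=1))                  # pi_{k+1}: best first action with k+1 steps left
        V.append(Q.max(axis=1))                            # V_{k+1}
    return V, policies
```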

Is the optimal policy stationary (independent of time step) in finite horizon tasks? In general, no

Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until k == H:
For each state s
$$V_{k+1}(s) = \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$
$$\pi_{k+1}(s) = \arg\max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V_k(s') \right]$$

Value vs Policy Iteration

Value iteration:
Compute optimal value for horizon = k
Note this can be used to compute optimal policy if horizon = k
Increment k
Policy iteration
Compute infinite horizon value of a policy
Use to select another (better) policy
Closely related to a very popular method in RL: policy gradient

What You Should Know

Define MP, MRP, MDP, Bellman operator, contraction, model,


Q-value, policy
Be able to implement
Value Iteration
Policy Iteration
Give pros and cons of different policy evaluation approaches
Be able to prove contraction properties
Limitations of presented approaches and Markov assumptions
Which policy evaluation methods require the Markov assumption?

Policy Improvement

Compute state-action value of a policy πi


$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$

Note:

$$\begin{aligned}
\max_a Q^{\pi_i}(s, a) &= \max_a \left[ R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s') \right] \\
&\ge R(s, \pi_i(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi_i(s))\, V^{\pi_i}(s') \\
&= V^{\pi_i}(s)
\end{aligned}$$

Define new policy, for all $s \in S$:

$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a) \quad \forall s \in S$$

Policy Iteration (PI)

Set i = 0
Initialize π0 (s) randomly for all states s
While $i == 0$ or $\|\pi_i - \pi_{i-1}\|_1 > 0$ (L1-norm):
Policy evaluation of $\pi_i$
$i = i + 1$
Policy improvement:

$$Q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, V^{\pi_i}(s')$$
$$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s, a)$$

