Lecture 2: Making Sequences of Good Decisions Given A Model of The World
Emma Brunskill
Winter 2019
Last Time:
Introduction
Components of an agent: model, value, policy
This Time:
Making good decisions given a Markov decision process
Next Time:
Policy evaluation when we don’t have a model of how the world works
Markov Processes
Markov Reward Processes (MRPs)
Markov Decision Processes (MDPs)
Evaluation and Control in MDPs
Mars Rover Markov chain transition matrix:

P =
| 0.6 0.4 0   0   0   0   0   |
| 0.4 0.2 0.4 0   0   0   0   |
| 0   0.4 0.2 0.4 0   0   0   |
| 0   0   0.4 0.2 0.4 0   0   |
| 0   0   0   0.4 0.2 0.4 0   |
| 0   0   0   0   0.4 0.2 0.4 |
| 0   0   0   0   0   0.4 0.6 |
Definition of Horizon
Number of time steps in each episode
Can be infinite
If finite, the process is called a finite Markov reward process
Definition of Return, Gt (for a MRP)
Discounted sum of rewards from time step t to horizon
For a finite state MRP, we can express V(s) using a matrix equation:

[ V(s1) ]   [ R(s1) ]       [ P(s1|s1)  ···  P(sN|s1) ] [ V(s1) ]
[   ⋮   ] = [   ⋮   ]  + γ  [    ⋮       ⋱      ⋮     ] [   ⋮   ]
[ V(sN) ]   [ R(sN) ]       [ P(s1|sN)  ···  P(sN|sN) ] [ V(sN) ]

V = R + γPV
V − γPV = R
(I − γP)V = R
V = (I − γP)⁻¹ R
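As a concrete illustration, here is a small numerical sketch of the analytic solution, assuming the Mars Rover transition matrix above, the reward pattern used later in the lecture (+1 in s1, +10 in s7, 0 otherwise), and an illustrative γ = 0.5:

```python
import numpy as np

# Mars Rover MRP transition matrix from the slide (7 states).
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.2, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.2, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.2, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.2, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.2, 0.4],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.6],
])
# Reward vector: +1 in s1, +10 in s7, 0 otherwise.
R = np.array([1.0, 0, 0, 0, 0, 0, 10.0])
gamma = 0.5  # illustrative choice; any gamma < 1 works

# Analytic solution: V = (I - gamma * P)^{-1} R
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(V)
```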
Dynamic programming
Initialize V0 (s) = 0 for all s
For k = 1 until convergence
For all s in S
V_k(s) = R(s) + γ Σ_{s'∈S} P(s'|s) V_{k-1}(s')
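A minimal sketch of this iteration for a tabular MRP (the `tol` convergence threshold is an assumed choice):

```python
import numpy as np

def mrp_value_dp(P, R, gamma, tol=1e-8):
    """Iteratively compute V(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s')."""
    V = np.zeros(len(R))           # V_0(s) = 0 for all s
    while True:
        V_new = R + gamma * P @ V  # one synchronous backup over all states
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```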
Note: Reward is sometimes defined as a function of the current state, or as a function of the (state, action, next state) tuple. Most frequently in this class, we will assume reward is a function of state and action.
Example: Mars Rover MDP
P(s'|s, a1) =
| 1 0 0 0 0 0 0 |
| 1 0 0 0 0 0 0 |
| 0 1 0 0 0 0 0 |
| 0 0 1 0 0 0 0 |
| 0 0 0 1 0 0 0 |
| 0 0 0 0 1 0 0 |
| 0 0 0 0 0 1 0 |

P(s'|s, a2) =
| 0 1 0 0 0 0 0 |
| 0 0 1 0 0 0 0 |
| 0 0 0 1 0 0 0 |
| 0 0 0 0 1 0 0 |
| 0 0 0 0 0 1 0 |
| 0 0 0 0 0 0 1 |
| 0 0 0 0 0 0 1 |
2 deterministic actions
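For use in later sketches, the two transition matrices can be written down directly and stacked into a P[a, s, s'] layout (a sketch; reading a1/a2 as "move left"/"move right" is my own interpretation of the matrices):

```python
import numpy as np

# P_a1[s, s'] = P(s'|s, a1): deterministic move toward s1 (stays at s1).
P_a1 = np.array([
    [1, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
], dtype=float)

# P_a2[s, s'] = P(s'|s, a2): deterministic move toward s7 (stays at s7).
P_a2 = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 1],
], dtype=float)

P = np.stack([P_a1, P_a2])  # P[a, s, s']
```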
MDP Policies
Two actions
Reward: for all actions, +1 in state s1 , +10 in state s7 , 0 otherwise
Let π(s) = a1 ∀s. γ = 0.
What is the value of this policy?
Recall the iterative policy evaluation update:
V_k^π(s) = r(s, π(s)) + γ Σ_{s'∈S} p(s'|s, π(s)) V_{k-1}^π(s')
V^π = [1 0 0 0 0 0 10] (with γ = 0, V^π(s) = r(s, π(s)), the immediate reward)
Practice: MDP 1 Iteration of Policy Evaluation, Mars Rover Example
Dynamics: p(s6|s6, a1) = 0.5, p(s7|s6, a1) = 0.5, . . .
Reward: for all actions, +1 in state s1, +10 in state s7, 0 otherwise
Let π(s) = a1 ∀s, assume V_k = [1 0 0 0 0 0 10] and k = 1, γ = 0.5
For all s in S
V_k^π(s) = r(s, π(s)) + γ Σ_{s'∈S} p(s'|s, π(s)) V_{k-1}^π(s')
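A sketch of the single backup being asked for; the full transition model under π is elided on the slide (". . ."), so `P_pi` below is a placeholder to be completed from the stated dynamics:

```python
import numpy as np

def one_step_policy_eval(P_pi, R_pi, V_prev, gamma):
    """One synchronous backup: V_k(s) = r(s, pi(s)) + gamma * sum_{s'} p(s'|s, pi(s)) V_{k-1}(s')."""
    return R_pi + gamma * P_pi @ V_prev

# Values from the slide: previous estimate [1 0 0 0 0 0 10], gamma = 0.5.
V_prev = np.array([1.0, 0, 0, 0, 0, 0, 10.0])
R_pi = np.array([1.0, 0, 0, 0, 0, 0, 10.0])  # reward under pi(s) = a1 (same for all actions)
gamma = 0.5

# P_pi[s, s'] = p(s'|s, a1); fill in from the dynamics on the slide, e.g. the two given entries:
P_pi = np.zeros((7, 7))          # placeholder for the remaining (elided) dynamics
P_pi[5, 5], P_pi[5, 6] = 0.5, 0.5  # p(s6|s6, a1) = 0.5, p(s7|s6, a1) = 0.5

# V_k = one_step_policy_eval(P_pi, R_pi, V_prev, gamma)  # once P_pi is completed
```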
Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ‖πi − πi−1‖₁ > 0 (L1-norm, measures if the policy changed for any state):
  V^{πi} ← MDP V function policy evaluation of πi
  π_{i+1} ← policy improvement
  i = i + 1
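A compact sketch of this loop for a tabular MDP, assuming a transition tensor `P[a, s, s']` and reward table `R[s, a]`; exact policy evaluation is done with a linear solve, and the initial policy is arbitrary rather than random:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Exact V^pi via (I - gamma * P_pi) V = R_pi for a tabular MDP."""
    nS = R.shape[0]
    P_pi = P[pi, np.arange(nS)]   # P_pi[s, s'] = P(s'|s, pi(s))
    R_pi = R[np.arange(nS), pi]   # R_pi[s] = R(s, pi(s))
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

def policy_improvement(P, R, V, gamma):
    """Greedy policy w.r.t. Q(s,a) = R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')."""
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return np.argmax(Q, axis=1)

def policy_iteration(P, R, gamma):
    nS, nA = R.shape
    pi = np.zeros(nS, dtype=int)          # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, pi, gamma)
        pi_new = policy_improvement(P, R, V, gamma)
        if np.array_equal(pi_new, pi):    # policy unchanged for every state: stop
            return pi, V
        pi = pi_new
```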
Policy improvement: compute the state-action value of policy πi
  Q^{πi}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{πi}(s')
max_a Q^{πi}(s, a) ≥ R(s, πi(s)) + γ Σ_{s'∈S} P(s'|s, πi(s)) V^{πi}(s') = V^{πi}(s)
Suppose we take πi+1 (s) for one action, then follow πi forever
Our expected sum of rewards is at least as good as if we had always
followed πi
But the new proposed policy is to always follow π_{i+1}...
Definition
V^{π1} ≥ V^{π2} : V^{π1}(s) ≥ V^{π2}(s), ∀s ∈ S
Proposition: V^{π_{i+1}} ≥ V^{πi}, with strict inequality if πi is suboptimal,
where π_{i+1} is the new policy we get from policy improvement on πi
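A sketch of the standard argument behind the proposition, repeatedly applying the inequality above with π_{i+1}(s) = argmax_a Q^{πi}(s, a):

```latex
\begin{align*}
V^{\pi_i}(s) &\le \max_a Q^{\pi_i}(s,a) \\
  &= R(s,\pi_{i+1}(s)) + \gamma \sum_{s'\in S} P(s'|s,\pi_{i+1}(s))\, V^{\pi_i}(s') \\
  &\le R(s,\pi_{i+1}(s)) + \gamma \sum_{s'\in S} P(s'|s,\pi_{i+1}(s))
       \Big( R(s',\pi_{i+1}(s')) + \gamma \sum_{s''\in S} P(s''|s',\pi_{i+1}(s'))\, V^{\pi_i}(s'') \Big) \\
  &\;\;\vdots \\
  &\le V^{\pi_{i+1}}(s)
\end{align*}
```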
Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until [finite horizon, convergence]:
For each state s
V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
View as a Bellman backup on the value function: V_{k+1} = B V_k
To extract a policy from the current value estimate:
π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
To do policy improvement
π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{πk}(s') ]
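A minimal sketch of value iteration with greedy policy extraction (same assumed `P[a, s, s']` / `R[s, a]` layout as in the policy iteration sketch; `tol` is an assumed stopping threshold):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V_{k+1} = B V_k until the Bellman backup changes V by less than tol."""
    nS, nA = R.shape
    V = np.zeros(nS)                                   # V_0(s) = 0 for all s
    while True:
        Q = R + gamma * np.einsum('ast,t->sa', P, V)   # Q[s,a] = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        V_new = Q.max(axis=1)                          # Bellman backup: (B V)(s) = max_a Q(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new             # greedy policy extraction and value
        V = V_new
```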
Proof sketch that the Bellman backup B is a contraction (here ‖ · ‖ denotes the infinity norm over states):

‖BV_k − BV_j‖ = ‖ max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ] − max_{a'} [ R(s, a') + γ Σ_{s'∈S} P(s'|s, a') V_j(s') ] ‖
             ≤ ‖ max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') − R(s, a) − γ Σ_{s'∈S} P(s'|s, a) V_j(s') ] ‖
             = ‖ max_a γ Σ_{s'∈S} P(s'|s, a) (V_k(s') − V_j(s')) ‖
             ≤ ‖ max_a γ Σ_{s'∈S} P(s'|s, a) ‖V_k − V_j‖ ‖
             = γ ‖V_k − V_j‖ max_a Σ_{s'∈S} P(s'|s, a)
             = γ ‖V_k − V_j‖
Note: Even if all inequalities are equalities, this is still a contraction if γ < 1
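A quick numerical sanity check of the contraction property on a randomly generated tabular MDP (purely illustrative; the shapes and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 5, 3, 0.9

# Random tabular MDP: P[a, s, :] is a distribution over next states.
P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def bellman_backup(V):
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    return Q.max(axis=1)

V_k, V_j = rng.random(nS), rng.random(nS)
lhs = np.max(np.abs(bellman_backup(V_k) - bellman_backup(V_j)))  # ||B V_k - B V_j||_inf
rhs = gamma * np.max(np.abs(V_k - V_j))                          # gamma * ||V_k - V_j||_inf
assert lhs <= rhs + 1e-12
```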
Set k = 1
Initialize V0 (s) = 0 for all states s
Loop until k == H:
For each state s
V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
π_{k+1}(s) = arg max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V_k(s') ]
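A sketch of this finite-horizon variant, which keeps one greedy policy per number of steps to go rather than a single stationary policy (same assumed `P[a, s, s']` / `R[s, a]` layout):

```python
import numpy as np

def finite_horizon_value_iteration(P, R, gamma, H):
    """Return V_H and the greedy policies pi_1, ..., pi_H (indexed by steps to go)."""
    nS, nA = R.shape
    V = np.zeros(nS)                                   # value with 0 steps to go
    policies = []
    for _ in range(H):
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        policies.append(Q.argmax(axis=1))              # greedy policy with k+1 steps to go
        V = Q.max(axis=1)
    return V, policies
```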
Value iteration:
Compute optimal value for horizon = k
Note this can be used to compute optimal policy if horizon = k
Increment k
Policy iteration:
Compute infinite horizon value of a policy
Use to select another (better) policy
Closely related to a very popular method in RL: policy gradient
Note:
  max_a Q^{πi}(s, a) = max_a [ R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{πi}(s') ]
                     ≥ R(s, πi(s)) + γ Σ_{s'∈S} P(s'|s, πi(s)) V^{πi}(s')
                     = V^{πi}(s)
Set i = 0
Initialize π0 (s) randomly for all states s
While i == 0 or ‖πi − πi−1‖₁ > 0 (L1-norm):
  Policy evaluation of πi
  i = i + 1
Policy improvement:
  Q^{πi}(s, a) = R(s, a) + γ Σ_{s'∈S} P(s'|s, a) V^{πi}(s')