Book: Mathematical Foundation of Reinforcement Learning - Lecture Slides
Shiyu Zhao
Shiyu Zhao 1 / 26
A grid-world example
(Figure: a grid world with a start cell and forbidden cells.)
Shiyu Zhao 2 / 26
State
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 3 / 26
Action
(Figure: the five actions available at every state of the 3×3 grid of states s1–s9: a1 (move up), a2 (move right), a3 (move down), a4 (move left), a5 (stay).)
When taking an action, the agent may move from one state to another. Such a
process is called state transition.
• At state s1 , if we choose action a2 , then what is the next state?
$s_1 \xrightarrow{a_2} s_2$
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 6 / 26
State transition
s1 s2 s3
s4 s5 s6
s7 s8 s9
s1 s2 s3
s4 s5 s6
s7 s8 s9
$p(s_2 \mid s_1, a_2) = 1$
$p(s_i \mid s_1, a_2) = 0 \quad \forall i \neq 2$
Shiyu Zhao 8 / 26
Policy
Based on this policy, we get the following paths with different starting points.
Shiyu Zhao 9 / 26
Policy
π(a1 |s1 ) = 0
π(a2 |s1 ) = 1
π(a3 |s1 ) = 0
π(a4 |s1 ) = 0
π(a5 |s1 ) = 0
It is a deterministic policy.
Shiyu Zhao 10 / 26
Policy
Prob=0.5
Shiyu Zhao 11 / 26
Policy
Prob=0.5
Prob=0.5
Shiyu Zhao 13 / 26
Reward
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 14 / 26
Reward
s1 s2 s3
s4 s5 s6
s7 s8 s9
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 16 / 26
Trajectory and return
r=0
s1 s2 s3
r=0
s4 s5 s6
r=0
s7 s8 s9
r=1
The return of this trajectory is the sum of all the rewards collected along the
trajectory:
return = 0 + 0 + 0 + 1 = 1
Shiyu Zhao 17 / 26
Trajectory and return
s1 s2 s3
r=0
s4 s5 s6
r=-1
s7 s8 s9
r=0 r=+1
return = 0 − 1 + 0 + 1 = 0
Shiyu Zhao 18 / 26
Trajectory and return
r=0
s1 s2 s3 s1 s2 s3
r=0 r=0
s4 s5 s6 s4 s5 s6
r=0 r=-1
s7 s8 s9 s7 s8 s9
r=1 r=0 r=+1
Shiyu Zhao 19 / 26
Discounted return
r=0
s1 s2 s3
r=0
s4 s5 s6
r=0
s7 s8 s9
r=+1 r=+1, r=+1, r=+1,...
The return is
return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞
Shiyu Zhao 20 / 26
Discounted return
r=0
s1 s2 s3
r=0
s4 s5 s6
r=0
s7 s8 s9
r=+1 r=+1, r=+1, r=+1,...
discounted return $= 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \gamma^5 \cdot 1 + \cdots = \gamma^3(1 + \gamma + \gamma^2 + \cdots) = \gamma^3 \frac{1}{1-\gamma}.$
Roles of the discount rate: 1) it makes the sum finite; 2) it balances the near-future and far-future rewards:
• If γ is close to 0, the value of the discounted return is dominated by the
rewards obtained in the near future.
• If γ is close to 1, the value of the discounted return is dominated by the
rewards obtained in the far future.
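The closed form above is easy to check numerically. The following sketch (ours, not part of the slides; the function name discounted_return is an assumption) truncates the infinite sum and compares it with γ³/(1−γ):

```python
# A minimal sketch: the discounted return of the trajectory with rewards 0, 0, 0, 1, 1, 1, ...

def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a (finite) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
# Truncate the infinite tail of +1 rewards; the truncation error is gamma^T / (1 - gamma).
rewards = [0, 0, 0] + [1] * 500
approx = discounted_return(rewards, gamma)
exact = gamma ** 3 / (1 - gamma)
print(approx, exact)   # both are approximately 7.29

# A gamma close to 0 emphasizes near-future rewards, a gamma close to 1 far-future rewards.
print(discounted_return(rewards, 0.1), 0.1 ** 3 / (1 - 0.1))
```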
Shiyu Zhao 21 / 26
Episode
When interacting with the environment following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a
trial).
r=0
s1 s2 s3
r=0
s4 s5 s6
r=0
s7 s8 s9
r=1
Example: episode
$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$
Some tasks may have no terminal states, meaning the interaction with the
environment will never end. Such tasks are called continuing tasks.
Shiyu Zhao 23 / 26
Markov decision process (MDP)
All the concepts introduced in this lecture can be put into the framework of Markov decision processes (MDPs).
Shiyu Zhao 24 / 26
Markov decision process (MDP)
The grid world can be abstracted as a more general model, a Markov process.
(Figure: the Markov process induced by the policy over the states s1–s9; the links between states are labeled with transition probabilities such as Prob=0.5 and Prob=1.)
The circles represent states and the links with arrows represent state transitions.
A Markov decision process becomes a Markov process once the policy is given!
Shiyu Zhao 25 / 26
Summary
Shiyu Zhao 26 / 26
Lecture 2: Bellman Equation
Shiyu Zhao
Shiyu Zhao 1 / 52
Outline
In this lecture:
• A core concept: state value
• A fundamental tool: the Bellman equation
Shiyu Zhao 2 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 3 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 4 / 52
Motivating example 1: Why return is important?
(Figure: three policies on a 2×2 grid world with states s1, s2 (forbidden), s3, s4 (target). Policy 1 (left): at s1 go down to s3 (r=0), then to s4 (r=1). Policy 2 (middle): at s1 go right into the forbidden state s2 (r=−1), then to s4 (r=1). Policy 3 (right): at s1 go down or right, each with probability 0.5.)
$\text{return}_1 = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma(1 + \gamma + \gamma^2 + \cdots) = \frac{\gamma}{1-\gamma}.$
Shiyu Zhao 6 / 52
Motivating example 1: Why return is important?
$\text{return}_2 = -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = -1 + \gamma(1 + \gamma + \gamma^2 + \cdots) = -1 + \frac{\gamma}{1-\gamma}.$
Shiyu Zhao 7 / 52
Motivating example 1: Why return is important?
Policy 3 is stochastic!
Exercise: Based on policy 3 (right figure), starting from s1, what is the discounted return?
Answer:
$\text{return}_3 = 0.5\Big(-1 + \frac{\gamma}{1-\gamma}\Big) + 0.5\,\frac{\gamma}{1-\gamma} = -0.5 + \frac{\gamma}{1-\gamma}.$
Shiyu Zhao 8 / 52
Motivating example 1: Why return is important?
Comparing the three returns gives return_1 > return_3 > return_2. This inequality suggests that the first policy is the best and the second policy is the worst, which is exactly the same as our intuition.
Calculating returns is important for evaluating policies.
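As a quick numerical check (a sketch that is not part of the slides), the three closed-form returns can be evaluated for a concrete discount rate:

```python
# Sketch: evaluate return_1, return_2, return_3 from their closed forms for gamma = 0.9.
gamma = 0.9

ret1 = gamma / (1 - gamma)            # policy 1: rewards 0, 1, 1, ...
ret2 = -1 + gamma / (1 - gamma)       # policy 2: rewards -1, 1, 1, ...
ret3 = 0.5 * ret2 + 0.5 * ret1        # policy 3: mixes the two trajectories with prob 0.5

print(ret1, ret2, ret3)               # 9.0, 8.0, 8.5  ->  return_1 > return_3 > return_2
```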
Shiyu Zhao 9 / 52
Motivating example 2: How to calculate return?
While return is important, how to calculate it?
Method 1: by definition
Let vi denote the return obtained starting from si (i = 1, 2, 3, 4)
$v_1 = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$
$v_2 = r_2 + \gamma r_3 + \gamma^2 r_4 + \cdots$
$v_3 = r_3 + \gamma r_4 + \gamma^2 r_1 + \cdots$
$v_4 = r_4 + \gamma r_1 + \gamma^2 r_2 + \cdots$
Shiyu Zhao 10 / 52
Motivating example 2: How to calculate return?
While return is important, how to calculate it?
Method 2: note that the returns rely on each other. In matrix-vector form,
$v = r + \gamma P v$
Motivating example 2: How to calculate return?
Exercise: Consider the policy shown in the figure. Please write out the
relation among the returns (that is to write out the Bellman equation)
s1 s2
r=0 r=1
s3 s4
r=1 r=1
Answer:
v1 = 0 + γv3
v2 = 1 + γv4
v3 = 1 + γv4
v4 = 1 + γv4
Exercise: How to solve them? We can first calculate v4 , and then
v3 , v2 , v1 .
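A small sketch (ours, not from the slides) that carries out this back-substitution numerically:

```python
# Sketch: solve v1..v4 from the relations v1 = 0 + g*v3, v2 = 1 + g*v4,
# v3 = 1 + g*v4, v4 = 1 + g*v4, starting from the last equation.
g = 0.9                      # discount rate gamma

v4 = 1 / (1 - g)             # v4 = 1 + g*v4  =>  v4 = 1/(1-g)
v3 = 1 + g * v4
v2 = 1 + g * v4
v1 = 0 + g * v3

print(v1, v2, v3, v4)        # 9.0, 10.0, 10.0, 10.0 for gamma = 0.9
```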
Shiyu Zhao 13 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 14 / 52
Some notations
Consider the following single-step process:
$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$
where $S_t$ is the state at time t, $A_t$ is the action taken at $S_t$, and $R_{t+1}$, $S_{t+1}$ are the resulting reward and next state. Along a trajectory, the discounted return is $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$.
Shiyu Zhao 16 / 52
State value
The expectation (or called expected value or mean) of Gt is defined as
the state-value function or simply state value:
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$
Remarks:
• It is a function of s. It is a conditional expectation with the condition
that the state starts from s.
• It is based on the policy π. For a different policy, the state value may
be different.
• It represents the “value” of a state. If the state value is greater, then
the policy is better because greater cumulative rewards can be
obtained.
Q: What is the relationship between return and state value?
A: The state value is the mean of all possible returns that can be
obtained starting from a state. If everything (π(a|s), p(r|s,a), p(s'|s,a)) is deterministic, then the state value is the same as the return.
Shiyu Zhao 17 / 52
State value
Example:
(Figure: the three policies from the motivating example. Their state values at s1 are exactly the returns computed earlier: $v_{\pi_1}(s_1) = \frac{\gamma}{1-\gamma}$, $v_{\pi_2}(s_1) = -1 + \frac{\gamma}{1-\gamma}$, $v_{\pi_3}(s_1) = -0.5 + \frac{\gamma}{1-\gamma}$.)
Shiyu Zhao 18 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 19 / 52
Bellman equation
Shiyu Zhao 20 / 52
Deriving the Bellman equation
Consider a random trajectory:
$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \ldots$
The return satisfies $G_t = R_{t+1} + \gamma G_{t+1}$, so the state value can be decomposed as
$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\,\mathbb{E}[G_{t+1} \mid S_t = s]$
Note that
• The first term is the mean of the immediate rewards
Shiyu Zhao 22 / 52
Deriving the Bellman equation
Note that
• The second term is the mean of the future rewards
• $\mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$ due to the memoryless Markov property.
Shiyu Zhao 23 / 52
Deriving the Bellman equation
Therefore, we have
$v_\pi(s) = \sum_a \pi(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')\Big], \quad \forall s \in \mathcal{S}$
Highlights:
• The above equation is called the Bellman equation, which characterizes
the relationship among the state-value functions of different states.
• It consists of two terms: the immediate reward term and the future
reward term.
• A set of equations: every state has an equation like this!!!
Shiyu Zhao 24 / 52
Deriving the Bellman equation
Consider the policy shown in the 2×2 grid example. Writing the Bellman equation for every state gives
$v_\pi(s_1) = 0 + \gamma v_\pi(s_3)$
$v_\pi(s_2) = 1 + \gamma v_\pi(s_4)$
$v_\pi(s_3) = 1 + \gamma v_\pi(s_4)$
$v_\pi(s_4) = 1 + \gamma v_\pi(s_4)$
Solve the above equations one by one from the last to the first:
$v_\pi(s_4) = \frac{1}{1-\gamma},\quad v_\pi(s_3) = \frac{1}{1-\gamma},\quad v_\pi(s_2) = \frac{1}{1-\gamma},\quad v_\pi(s_1) = \frac{\gamma}{1-\gamma}.$
Shiyu Zhao 28 / 52
An illustrative example
If γ = 0.9, then
$v_\pi(s_4) = \frac{1}{1-0.9} = 10,\quad v_\pi(s_3) = 10,\quad v_\pi(s_2) = 10,\quad v_\pi(s_1) = \frac{0.9}{1-0.9} = 9.$
What should we do after the state values have been calculated? Be patient: we will calculate action values and improve the policy.
Shiyu Zhao 29 / 52
Exercise
(Figure: a 2×2 grid world where the policy at s1 is stochastic: with probability 0.5 it goes right into the forbidden state (r=−1) and with probability 0.5 it goes down (r=0); the other states move toward or stay at the target (r=1).)
Exercise: write out and solve the Bellman equations for this policy, using the elementwise form
$v_\pi(s) = \sum_a \pi(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')\Big]$
Shiyu Zhao 30 / 52
Exercise
Answer:
Solve the above equations one by one from the last to the first.
$v_\pi(s_4) = \frac{1}{1-\gamma},\quad v_\pi(s_3) = \frac{1}{1-\gamma},\quad v_\pi(s_2) = \frac{1}{1-\gamma},$
$v_\pi(s_1) = 0.5\big[0 + \gamma v_\pi(s_3)\big] + 0.5\big[-1 + \gamma v_\pi(s_2)\big] = -0.5 + \frac{\gamma}{1-\gamma}.$
Substituting γ = 0.9 yields $v_\pi(s_4) = v_\pi(s_3) = v_\pi(s_2) = 10$ and $v_\pi(s_1) = 8.5$.
Shiyu Zhao 31 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 32 / 52
Matrix-vector form of the Bellman equation
• The above elementwise form is valid for every state s ∈ S. That means
there are |S| equations like this!
• If we put all the equations together, we have a set of linear equations,
which can be concisely written in a matrix-vector form.
• The matrix-vector form is very elegant and important.
Shiyu Zhao 33 / 52
Matrix-vector form of the Bellman equation
Recall that:
$v_\pi(s) = \sum_a \pi(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')\Big]$
where
$r_\pi(s) \triangleq \sum_a \pi(a|s)\sum_r p(r|s,a)\,r, \qquad p_\pi(s'|s) \triangleq \sum_a \pi(a|s)\,p(s'|s,a)$
Shiyu Zhao 34 / 52
Matrix-vector form of the Bellman equation
Put all these equations for all the states together and rewrite to a
matrix-vector form
vπ = rπ + γPπ vπ
where
• $v_\pi = [v_\pi(s_1), \ldots, v_\pi(s_n)]^T \in \mathbb{R}^n$
• $r_\pi = [r_\pi(s_1), \ldots, r_\pi(s_n)]^T \in \mathbb{R}^n$
• $P_\pi \in \mathbb{R}^{n \times n}$, where $[P_\pi]_{ij} = p_\pi(s_j|s_i)$, is the state transition matrix
Shiyu Zhao 35 / 52
Illustrative examples
If there are four states, vπ = rπ + γPπ vπ can be written out as
$\begin{bmatrix} v_\pi(s_1)\\ v_\pi(s_2)\\ v_\pi(s_3)\\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} r_\pi(s_1)\\ r_\pi(s_2)\\ r_\pi(s_3)\\ r_\pi(s_4) \end{bmatrix} + \gamma \begin{bmatrix} p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1)\\ p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2)\\ p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3)\\ p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_\pi(s_1)\\ v_\pi(s_2)\\ v_\pi(s_3)\\ v_\pi(s_4) \end{bmatrix}$
Shiyu Zhao 36 / 52
Illustrative examples
For the two 2×2 grid-world examples considered earlier (one with a deterministic policy, one stochastic at s1), the vectors $r_\pi$ and the matrices $P_\pi$ can be written out directly from the figures.
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 38 / 52
Solve state values
Shiyu Zhao 39 / 52
Solve state values
The Bellman equation in matrix-vector form is
vπ = rπ + γPπ vπ
• The closed-form solution is
$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$
In practice, we still need numerical tools to compute the matrix inverse.
Can we avoid the matrix inverse operation? Yes, by iterative algorithms.
• An iterative solution is
$v_{k+1} = r_\pi + \gamma P_\pi v_k, \qquad k = 0, 1, 2, \ldots$
The sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$ for any initial guess $v_0$.
Proof.
Define the error as $\delta_k = v_k - v_\pi$. We only need to show $\delta_k \to 0$. Substituting $v_{k+1} = \delta_{k+1} + v_\pi$ and $v_k = \delta_k + v_\pi$ into $v_{k+1} = r_\pi + \gamma P_\pi v_k$ gives
$\delta_{k+1} = \gamma P_\pi \delta_k.$
As a result,
$\delta_{k+1} = \gamma^{k+1} P_\pi^{k+1} \delta_0.$
Note that $0 \le P_\pi^k \le 1$, which means every entry of $P_\pi^k$ is nonnegative and no greater than 1 for any $k = 0, 1, 2, \ldots$. That is because $P_\pi^k \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = [1, \ldots, 1]^T$. On the other hand, since $\gamma < 1$, we know $\gamma^{k+1} \to 0$ and hence $\delta_{k+1} = \gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$ as $k \to \infty$.
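Below is a small numerical sketch (ours, not from the slides) of both solution methods for the 2×2 example with the deterministic policy, where $r_\pi = [0, 1, 1, 1]^T$ and the policy moves s1→s3, s2→s4, s3→s4, and stays at s4:

```python
import numpy as np

# Sketch: solve v_pi = r_pi + gamma * P_pi * v_pi for the 2x2 grid example.
gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([            # row i gives p_pi(. | s_{i+1})
    [0, 0, 1, 0],            # s1 -> s3
    [0, 0, 0, 1],            # s2 -> s4
    [0, 0, 0, 1],            # s3 -> s4
    [0, 0, 0, 1],            # s4 -> s4 (stay at the target)
])

# Closed-form solution: v = (I - gamma * P)^(-1) r
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r + gamma * P v_k
v = np.zeros(4)
for _ in range(200):
    v = r_pi + gamma * P_pi @ v

print(v_closed)  # [ 9. 10. 10. 10.]
print(v)         # converges to the same values
```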
Shiyu Zhao 41 / 52
Solve state values
Examples: $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = +1$, $\gamma = 0.9$
The following are two "good" policies and their state values. The two policies differ only at the top two states in the fourth column.
(Figure: the two policies and the corresponding state values on the 5×5 grid.)
Shiyu Zhao 42 / 52
Solve state values
Examples: $r_{\text{boundary}} = r_{\text{forbidden}} = -1$, $r_{\text{target}} = +1$, $\gamma = 0.9$
The following are two "bad" policies and their state values. The state values are less than those of the good policies.
(Figure: the two policies and the corresponding state values on the 5×5 grid.)
Shiyu Zhao 43 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 44 / 52
Action value
Shiyu Zhao 45 / 52
Action value
Definition:
qπ (s, a) = E[Gt |St = s, At = a]
Hence,
$v_\pi(s) = \sum_a \pi(a|s)\,q_\pi(s,a) \qquad (2)$
Shiyu Zhao 46 / 52
Action value
Comparing (2) with the Bellman equation shows that the action value is
$q_\pi(s,a) = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_\pi(s') \qquad (4)$
(2) and (4) are the two sides of the same coin:
• (2) shows how to obtain state values from action values.
• (4) shows how to obtain action values from state values.
Shiyu Zhao 47 / 52
Illustrative example for action value
s1 s2
r=-1 r=1
s3 s4
r=1 r=1
Questions:
• qπ (s1 , a1 ), qπ (s1 , a3 ), qπ (s1 , a4 ), qπ (s1 , a5 ) =? Be careful!
Shiyu Zhao 48 / 52
Illustrative example for action value
s1 s2
r=-1 r=1
s3 s4
r=1 r=1
Highlights:
• Action value is important since we care about which action to take.
• We can first calculate all the state values and then calculate the action
values.
• We can also directly calculate the action values with or without models.
Shiyu Zhao 50 / 52
Outline
1 Motivating examples
2 State value
6 Action value
7 Summary
Shiyu Zhao 51 / 52
Summary
Key concepts and results:
• State value: vπ (s) = E[Gt |St = s]
• Action value: qπ (s, a) = E[Gt |St = s, At = a]
• The Bellman equation (elementwise form):
$v_\pi(s) = \sum_a \pi(a|s)\Big[\underbrace{\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_\pi(s')}_{q_\pi(s,a)}\Big] = \sum_a \pi(a|s)\,q_\pi(s,a)$
• The Bellman equation (matrix-vector form): $v_\pi = r_\pi + \gamma P_\pi v_\pi$
Shiyu Zhao
Lecture 3: Bellman Optimality Equation
Shiyu Zhao
Outline
Shiyu Zhao 1 / 50
Outline
In this lecture:
• Core concepts: optimal state value and optimal policy
• A fundamental tool: the Bellman optimality equation (BOE)
Shiyu Zhao 2 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 3 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 4 / 50
Motivating examples
s1 s2
r=-1 r=1
s3 s4
r=1 r=1
Bellman equation (for this policy, which takes the action leading into the forbidden area at s1):
$v_\pi(s_1) = -1 + \gamma v_\pi(s_2),\quad v_\pi(s_2) = 1 + \gamma v_\pi(s_4),\quad v_\pi(s_3) = 1 + \gamma v_\pi(s_4),\quad v_\pi(s_4) = 1 + \gamma v_\pi(s_4)$
Shiyu Zhao 6 / 50
Motivating examples
s1 s2
r=-1 r=1
s3 s4
r=1 r=1
Question: While the policy is not good, how can we improve it?
Answer: by using action values.
The current policy at s1 is
$\pi(a|s_1) = \begin{cases} 1, & a = a_2 \\ 0, & a \neq a_2 \end{cases}$
Shiyu Zhao 7 / 50
Motivating examples
s1 s2
r=-1 r=1
s3 s4
r=1 r=1
Shiyu Zhao 9 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 10 / 50
Optimal policy
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 12 / 50
Bellman optimality equation (BOE)
Shiyu Zhao 13 / 50
Bellman optimality equation (BOE)
Bellman optimality equation (elementwise form):
$v(s) = \max_\pi \sum_a \pi(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v(s')\Big], \quad \forall s \in \mathcal{S}$
Remarks:
• p(r|s,a) and p(s'|s,a) are known.
• v(s), v(s') are unknown and to be calculated.
• Is π(s) known or unknown?
Shiyu Zhao 13 / 50
Bellman optimality equation (BOE)
$v = \max_\pi (r_\pi + \gamma P_\pi v)$
Shiyu Zhao 14 / 50
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 16 / 50
Maximization on the right-hand side of BOE
$x = \max_a (2x - 1 - a^2).$
This equation has two unknowns, x and a. To solve them, first consider the right-hand side. Regardless of the value of x, $\max_a (2x - 1 - a^2) = 2x - 1$, where the maximum is achieved at a = 0. Second, when a = 0, the equation becomes x = 2x − 1, which gives x = 1. Therefore, a = 0
Shiyu Zhao 17 / 50
Maximization on the right-hand side of BOE
Example (How to solve $\max_\pi \sum_a \pi(a|s)\,q(s,a)$)
Suppose $q_1, q_2, q_3 \in \mathbb{R}$ are given. Find $c_1^*, c_2^*, c_3^*$ solving
$\max_{c_1, c_2, c_3}\ c_1 q_1 + c_2 q_2 + c_3 q_3,$
where $c_1 + c_2 + c_3 = 1$ and $c_1, c_2, c_3 \ge 0$.
Without loss of generality, suppose q3 ≥ q1 , q2 . Then, the optimal
solution is c∗3 = 1 and c∗1 = c∗2 = 0. That is because for any c1 , c2 , c3
q3 = (c1 + c2 + c3 )q3 = c1 q3 + c2 q3 + c3 q3 ≥ c1 q1 + c2 q2 + c3 q3 .
Shiyu Zhao 18 / 50
Maximization on the right-hand side of BOE
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 20 / 50
Solve the Bellman optimality equation
$v = f(v)$
where
$[f(v)]_s = \max_\pi \sum_a \pi(a|s)\,q(s,a), \quad s \in \mathcal{S}$
Shiyu Zhao 21 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 22 / 50
Preliminaries: Contraction mapping theorem
Some concepts:
• Fixed point: x ∈ X is a fixed point of f : X → X if f(x) = x
• Contraction mapping: f is a contraction mapping if $\|f(x_1) - f(x_2)\| \le \gamma \|x_1 - x_2\|$ for some $\gamma \in (0, 1)$ and all $x_1, x_2$
Shiyu Zhao 23 / 50
Preliminaries: Contraction mapping theorem
Shiyu Zhao 24 / 50
Preliminaries: Contraction mapping theorem
Shiyu Zhao 25 / 50
Preliminaries: Contraction mapping theorem
Examples:
xk+1 = 0.5xk
xk+1 = Axk
Shiyu Zhao 26 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 27 / 50
Contraction property of BOE
Shiyu Zhao 28 / 50
Solve the Bellman optimality equation
Shiyu Zhao 29 / 50
Solve the Bellman optimality equation
Elementwise form:
$v_{k+1}(s) = \max_\pi \sum_a \pi(a|s)\Big(\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s')\Big) = \max_\pi \sum_a \pi(a|s)\,q_k(s,a) = \max_a q_k(s,a)$
Shiyu Zhao 30 / 50
Solve the Bellman optimality equation
Procedure summary:
$v_k(s) \to q_k(s,a) \to$ greedy policy $\pi_{k+1}(a|s) \to$ new value $v_{k+1}(s) = \max_a q_k(s,a)$
Example: three states s1, s2, s3 arranged in a row, where s2 is the target area; at every state the available actions are aℓ (move left), a0 (stay), and ar (move right).
Shiyu Zhao 32 / 50
Example
s2
s1 s3
target area
q-value table:     aℓ                  a0                  ar
s1:       −1 + γv(s1)       0 + γv(s1)       1 + γv(s2)
s2:        0 + γv(s1)       1 + γv(s2)       0 + γv(s3)
s3:        1 + γv(s2)       0 + γv(s3)      −1 + γv(s3)
Consider γ = 0.9.
Shiyu Zhao 33 / 50
Example
• k = 1:
Exercise: with v1(s) calculated in the previous step, compute the following by yourself.
q-values:
         aℓ       a0       ar
s1     −0.1      0.9      1.9
s2      0.9      1.9      0.9
s3      1.9      0.9     −0.1
Greedy policy (select the greatest q-value):
The policy is the same as the previous one, which is already optimal.
v-value: v2 (s) = ...
• k = 2, 3, . . .
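The iteration above can be reproduced with a short script. The sketch below (ours, not part of the slides) runs value iteration on the three-state example with γ = 0.9 and prints the value estimates and greedy actions at each step; at k = 1 it recovers exactly the q-values in the table above.

```python
# Sketch: value iteration for the 3-state line world (s1, s2, s3; s2 is the target).
# Actions: 0 = move left, 1 = stay, 2 = move right. Reward +1 for entering/staying
# at the target, -1 for bumping into a boundary, 0 otherwise.
gamma = 0.9
# next_state[s][a], reward[s][a] for s in {0, 1, 2} (i.e., s1, s2, s3)
next_state = [[0, 0, 1], [0, 1, 2], [1, 2, 2]]
reward     = [[-1, 0, 1], [0, 1, 0], [1, 0, -1]]

v = [0.0, 0.0, 0.0]
for k in range(5):
    q = [[reward[s][a] + gamma * v[next_state[s][a]] for a in range(3)] for s in range(3)]
    policy = [max(range(3), key=lambda a: q[s][a]) for s in range(3)]   # greedy policy
    v = [max(q[s]) for s in range(3)]                                    # value update
    print(k, [round(x, 2) for x in v], policy)
# After the first iteration the greedy policy is already optimal: right, stay, left.
```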
Shiyu Zhao 35 / 50
Outline
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 36 / 50
Policy optimality
$v^* = \max_\pi (r_\pi + \gamma P_\pi v^*)$
Suppose
$\pi^* = \arg\max_\pi (r_\pi + \gamma P_\pi v^*)$
Then
$v^* = r_{\pi^*} + \gamma P_{\pi^*} v^*$
Shiyu Zhao 37 / 50
Policy optimality
Shiyu Zhao 38 / 50
Optimal policy
1 Motivating examples
3 BOE: Introduction
7 BOE: Solution
8 BOE: Optimality
Shiyu Zhao 40 / 50
Analyzing optimal policies
Shiyu Zhao 41 / 50
Analyzing optimal policies
Shiyu Zhao 41 / 50
Analyzing optimal policies
Shiyu Zhao 41 / 50
Analyzing optimal policies
The optimal policy and the corresponding optimal state value are
obtained by solving the BOE.
(Figure: the optimal policy and the optimal state values on the 5×5 grid.)
Shiyu Zhao 42 / 50
Analyzing optimal policies
(Figure: the optimal policy and state values for γ = 0.5.)
(b) The discount rate is γ = 0.5. Others are the same as (a).
Shiyu Zhao 43 / 50
Analyzing optimal policies
If we change γ to 0, the optimal policy becomes extremely short-sighted: every state simply selects the action with the greatest immediate reward, and the agent may never reach the target.
(Figure: the optimal policy and state values for γ = 0.)
Shiyu Zhao 44 / 50
Analyzing optimal policies
(Figure: the optimal policy and state values under a modified reward setting.)
Shiyu Zhao 45 / 50
Analyzing optimal policies
What if we change every reward affinely, r → ar + b with a > 0?
For example, every reward in the previous setting is scaled by a and shifted by b.
Shiyu Zhao 46 / 50
Analyzing optimal policies
where γ ∈ (0, 1) is the discount rate and 1 = [1, . . . , 1]T . Consequently, the
optimal policies are invariant to the affine transformation of the reward signals.
Shiyu Zhao 47 / 50
Analyzing optimal policies
Meaningless detour?
(Figure: a small two-state example comparing a policy that goes directly to the target with one that takes a detour.)
• Matrix-vector form:
$v = \max_\pi (r_\pi + \gamma P_\pi v)$
Shiyu Zhao 49 / 50
Summary
Shiyu Zhao 50 / 50
Lecture 4: Value Iteration and Policy Iteration
Shiyu Zhao
Outline
Shiyu Zhao 1 / 40
Outline
Shiyu Zhao 2 / 40
Outline
Shiyu Zhao 3 / 40
Value iteration algorithm
The algorithm is
$v_{k+1} = \max_\pi (r_\pi + \gamma P_\pi v_k), \qquad k = 0, 1, 2, \ldots$
It can be decomposed into two steps:
• Step 1: policy update. Solve $\pi_{k+1} = \arg\max_\pi (r_\pi + \gamma P_\pi v_k)$, where vk is given.
• Step 2: value update. $v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$.
Shiyu Zhao 6 / 40
Value iteration algorithm - Elementwise form
The policy update step in elementwise form is
$\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a|s)\underbrace{\Big(\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s')\Big)}_{q_k(s,a)}, \quad s \in \mathcal{S}$
The resulting policy is $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$ and 0 otherwise, where $a_k^*(s) = \arg\max_a q_k(s,a)$. $\pi_{k+1}$ is called a greedy policy, since it simply selects the action with the greatest q-value.
Shiyu Zhao 7 / 40
Value iteration algorithm - Elementwise form
The value update step in elementwise form is
$v_{k+1}(s) = \sum_a \pi_{k+1}(a|s)\underbrace{\Big(\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s')\Big)}_{q_k(s,a)}, \quad s \in \mathcal{S}$
Shiyu Zhao 8 / 40
Value iteration algorithm - Pseudocode
. Procedure summary:
$v_k(s) \to q_k(s,a) \to$ greedy policy $\pi_{k+1}(a|s) \to$ new value $v_{k+1}(s) = \max_a q_k(s,a)$
Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are
known. Initial guess v0 .
Aim: Search the optimal state value and an optimal policy solving the Bellman
optimality equation.
While $v_k$ has not converged in the sense that $\|v_k - v_{k-1}\|$ is greater than a predefined small threshold, for the kth iteration, do
    For every state s ∈ S, do
        For every action a ∈ A(s), do
            q-value: $q_k(s,a) = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s')$
        Maximum action value: $a_k^*(s) = \arg\max_a q_k(s,a)$
        Policy update: $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise
        Value update: $v_{k+1}(s) = \max_a q_k(s,a)$
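For concreteness, the following sketch (ours, not from the slides) implements the pseudocode for a generic finite MDP given as arrays; the array conventions are assumptions:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """Sketch of value iteration.
    P[s, a, s'] = p(s'|s, a),  R[s, a] = expected immediate reward sum_r p(r|s,a) * r.
    Returns the optimal value estimate and a greedy (deterministic) policy."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v            # q[s, a] = R[s, a] + gamma * sum_s' P[s,a,s'] v[s']
        v_new = q.max(axis=1)            # value update
        policy = q.argmax(axis=1)        # greedy policy update
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
```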
Shiyu Zhao 9 / 40
Value iteration algorithm - Example
s1 s2 s1 s2 s1 s2
s3 s4 s3 s4 s3 s4
q-value a1 a2 a3 a4 a5
s1 −1 + γv(s1 ) −1 + γv(s2 ) 0 + γv(s3 ) −1 + γv(s1 ) 0 + γv(s1 )
s2 −1 + γv(s2 ) −1 + γv(s2 ) 1 + γv(s4 ) 0 + γv(s1 ) −1 + γv(s2 )
s3 0 + γv(s1 ) 1 + γv(s4 ) −1 + γv(s3 ) −1 + γv(s3 ) 0 + γv(s3 )
s4 −1 + γv(s2 ) −1 + γv(s4 ) −1 + γv(s4 ) 0 + γv(s3 ) 1 + γv(s4 )
Shiyu Zhao 10 / 40
Value iteration algorithm - Example
Shiyu Zhao 11 / 40
Value iteration algorithm - Example
Shiyu Zhao 12 / 40
Outline
Shiyu Zhao 13 / 40
Policy iteration algorithm
. Algorithm description:
Given a random initial policy π0,
• Step 1: policy evaluation (PE). This step calculates the state value of πk by solving
$v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}$
• Step 2: policy improvement (PI). This step solves
$\pi_{k+1} = \arg\max_\pi (r_\pi + \gamma P_\pi v_{\pi_k})$
Shiyu Zhao 14 / 40
Policy iteration algorithm
. Q1: In the policy evaluation step, how to get the state value vπk
by solving the Bellman equation?
• Closed-form solution: $v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k}$
• Iterative solution:
$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \qquad j = 0, 1, 2, \ldots$
. Q2: In the policy improvement step, why is the new policy πk+1
better than πk ?
If πk+1 = arg maxπ (rπ + γPπ vπk ), then vπk+1 ≥ vπk for any k.
Shiyu Zhao 17 / 40
Policy iteration algorithm
As a result, vπk keeps increasing and will converge. Still need to prove it
converges to v ∗ .
Shiyu Zhao 18 / 40
Policy iteration algorithm
Shiyu Zhao 19 / 40
Policy iteration algorithm - Elementwise form
Stop when $j \to \infty$, when j is sufficiently large, or when $\|v_{\pi_k}^{(j+1)} - v_{\pi_k}^{(j)}\|$ is sufficiently small.
Shiyu Zhao 20 / 40
Policy iteration algorithm - Elementwise form
Shiyu Zhao 21 / 40
Policy iteration algorithm - Implementation
Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are
known. Initial guess π0 .
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the kth iteration, do
Policy evaluation:
    Initialization: an arbitrary initial guess $v_{\pi_k}^{(0)}$
    While $v_{\pi_k}^{(j)}$ has not converged, for the jth iteration, do
        For every state s ∈ S, do
            $v_{\pi_k}^{(j+1)}(s) = \sum_a \pi_k(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_{\pi_k}^{(j)}(s')\Big]$
Policy improvement:
    For every state s ∈ S, do
        For every action a ∈ A(s), do
            $q_{\pi_k}(s,a) = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_{\pi_k}(s')$
        $a_k^*(s) = \arg\max_a q_{\pi_k}(s,a)$
        $\pi_{k+1}(a|s) = 1$ if $a = a_k^*(s)$, and $\pi_{k+1}(a|s) = 0$ otherwise
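A compact sketch of this pseudocode (ours, with the same array conventions as the value-iteration sketch earlier) is:

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=1000):
    """Sketch of policy iteration. P[s,a,s'] = p(s'|s,a), R[s,a] = expected reward."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate v <- r_pi + gamma * P_pi v (truncated at eval_iters).
        v = np.zeros(n_states)
        r_pi = R[np.arange(n_states), policy]
        P_pi = P[np.arange(n_states), policy]
        for _ in range(eval_iters):
            v = r_pi + gamma * P_pi @ v
        # Policy improvement: greedy with respect to q_pi.
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy
```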
Shiyu Zhao 22 / 40
Policy iteration algorithm - Simple example
s1 s2 s1 s2
Shiyu Zhao 23 / 40
Policy iteration algorithm - Simple example
qπk(s,a):        aℓ                       a0                       ar
s1:     −1 + γ vπk(s1)      0 + γ vπk(s1)      1 + γ vπk(s2)
s2:      0 + γ vπk(s1)      1 + γ vπk(s2)     −1 + γ vπk(s2)
For the initial policy π0, the q-values are
qπ0(s,a):     aℓ        a0        ar
s1:       −10        −9      −7.1
s2:        −9      −7.1      −9.1
Now you know another powerful algorithm searching for optimal policies!
Now let’s apply it and see what we can find.
Shiyu Zhao 26 / 40
Policy iteration algorithm - Complicated example
(Figure: the policies and state values over successive policy-iteration steps on a 5×5 grid-world example; the state values increase across iterations and converge to the optimal values, with the largest values, around 10.0, near the target.)
Shiyu Zhao 28 / 40
Outline
Shiyu Zhao 29 / 40
Compare value iteration and policy iteration
Shiyu Zhao 31 / 40
Compare value iteration and policy iteration
$v_{\pi_1}^{(0)} = v_0$
value iteration $\leftarrow v_1 \leftarrow v_{\pi_1}^{(1)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(0)}$
$v_{\pi_1}^{(2)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(1)}$
$\vdots$
truncated policy iteration $\leftarrow \bar{v}_1 \leftarrow v_{\pi_1}^{(j)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(j-1)}$
$\vdots$
policy iteration $\leftarrow v_{\pi_1} \leftarrow v_{\pi_1}^{(\infty)} = r_{\pi_1} + \gamma P_{\pi_1} v_{\pi_1}^{(\infty)}$
Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are known. Initial
guess π0 .
Aim: Search for the optimal state value and an optimal policy.
While the policy has not converged, for the kth iteration, do
Policy evaluation:
    Policy evaluation:
        Initialization: select the initial guess as $v_k^{(0)} = v_{k-1}$. The maximum number of iterations is set to $j_{\text{truncate}}$.
        While $j < j_{\text{truncate}}$, do
            For every state s ∈ S, do
                $v_k^{(j+1)}(s) = \sum_a \pi_k(a|s)\Big[\sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k^{(j)}(s')\Big]$
        Set $v_k = v_k^{(j_{\text{truncate}})}$
    Policy improvement:
        For every state s ∈ S, do
            For every action a ∈ A(s), do
                $q_k(s,a) = \sum_r p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_k(s')$
            $a_k^*(s) = \arg\max_a q_k(s,a)$
            $\pi_{k+1}(a|s) = 1$ if $a = a_k^*$, and $\pi_{k+1}(a|s) = 0$ otherwise
Shiyu Zhao 34 / 40
Truncated policy iteration - Convergence
Consider the iterative algorithm for solving the policy evaluation step:
$v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \qquad j = 0, 1, 2, \ldots$
If the initial guess is selected as $v_{\pi_k}^{(0)} = v_{\pi_{k-1}}$, it holds that
$v_{\pi_k}^{(j+1)} \ge v_{\pi_k}^{(j)}$
for every j = 0, 1, 2, ....
Shiyu Zhao 35 / 40
Truncated policy iteration - Convergence
Figure: Illustration of the relationship among value iteration, policy iteration, and truncated policy iteration; all of the value sequences converge to the optimal state value v*.
. Setup: the same as the previous example. The initial policy is shown in the figure.
. Define $\|v_k - v^*\|$ as the state value error at time k. The stopping criterion is $\|v_k - v^*\| < 0.01$.
Shiyu Zhao 37 / 40
Truncated policy iteration - Example
(Figure: state value error versus iteration index for truncated policy iteration with different numbers of policy-evaluation iterations per step, e.g., truncated policy iteration-1 and truncated policy iteration-3.)
. The greater the number of policy-evaluation iterations per step (denoted x in the figure), the faster the value estimate converges.
. However, the benefit of increasing x drops quickly when x is large.
. In practice, one runs only a few iterations in the policy evaluation step.
Shiyu Zhao 39 / 40
Summary
Shiyu Zhao 40 / 40
Lecture 5: Monte Carlo Learning
Shiyu Zhao
Outline
Shiyu Zhao 1 / 50
Outline
1 Motivating example
Shiyu Zhao 2 / 50
Outline
1 Motivating example
Shiyu Zhao 3 / 50
Motivating example: Monte Carlo estimation
Shiyu Zhao 4 / 50
Motivating example: Monte Carlo estimation
. Method 1: Model-based
• Suppose the probabilistic model is known as p(X = 1) = 0.5 and p(X = −1) = 0.5.
Then, by definition,
$\mathbb{E}[X] = \sum_x x\,p(x) = 1 \times 0.5 + (-1) \times 0.5 = 0$
Shiyu Zhao 5 / 50
Motivating example: Monte Carlo estimation
. Method 2: Model-free
• Idea: Flip the coin many times, and then calculate the average of the
outcomes.
• Suppose we get a sample sequence: {x1 , x2 , . . . , xN }.
Then, the mean can be approximated as
$\mathbb{E}[X] \approx \bar{x} = \frac{1}{N}\sum_{j=1}^N x_j.$
Shiyu Zhao 6 / 50
Motivating example: Monte Carlo estimation
(Figure: the samples and their running average versus the sample index; the average converges toward the true mean 0 as the number of samples grows.)
Shiyu Zhao 7 / 50
Motivating example: Monte Carlo estimation
For a random variable X, suppose $\{x_j\}_{j=1}^N$ are iid samples. Let $\bar{x} = \frac{1}{N}\sum_{j=1}^N x_j$ be the average of the samples. Then,
$\mathbb{E}[\bar{x}] = \mathbb{E}[X], \qquad \mathrm{Var}[\bar{x}] = \frac{1}{N}\mathrm{Var}[X].$
As a result, x̄ is an unbiased estimate of E[X] and its variance decreases
to zero as N increases to infinity.
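The law of large numbers above is easy to check empirically. A minimal sketch (ours, not from the slides) that estimates E[X] for the coin-flip variable by averaging samples:

```python
import numpy as np

# Sketch: Monte Carlo estimation of E[X] for X in {+1, -1} with equal probability.
rng = np.random.default_rng(0)

for n in (10, 100, 10_000):
    samples = rng.choice([1, -1], size=n)
    print(n, samples.mean())   # the sample mean approaches the true mean 0 as n grows
```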
Shiyu Zhao 8 / 50
Motivating example: Monte Carlo estimation
. Summary:
Shiyu Zhao 9 / 50
Outline
1 Motivating example
Shiyu Zhao 10 / 50
Convert policy iteration to be model-free
Shiyu Zhao 11 / 50
Convert policy iteration to be model-free
Shiyu Zhao 12 / 50
Convert policy iteration to be model-free
Shiyu Zhao 13 / 50
Convert policy iteration to be model-free
• Suppose we have a set of episodes and hence the sampled returns $\{g^{(i)}(s,a)\}$. Then,
$q_{\pi_k}(s,a) = \mathbb{E}[G_t \mid S_t = s, A_t = a] \approx \frac{1}{N}\sum_{i=1}^N g^{(i)}(s,a).$
Shiyu Zhao 14 / 50
The MC Basic algorithm
• Step 1: policy evaluation. This step is to obtain qπk (s, a) for all
(s, a). Specifically, for each action-state pair (s, a), run an infinite
number of (or sufficiently many) episodes. The average of their returns
is used to approximate qπk (s, a).
• Step 2: policy improvement. This step is to solve $\pi_{k+1}(s) = \arg\max_\pi \sum_a \pi(a|s)\,q_{\pi_k}(s,a)$ for all s ∈ S. The greedy optimal policy is $\pi_{k+1}(a_k^*|s) = 1$, where $a_k^* = \arg\max_a q_{\pi_k}(s,a)$.
Shiyu Zhao 15 / 50
The MC Basic algorithm
While the value estimate has not converged, for the kth iteration, do
For every state s ∈ S, do
For every action a ∈ A(s), do
Collect sufficiently many episodes starting from (s, a) following πk
MC-based policy evaluation step:
qπk (s, a) = average return of all the episodes starting from (s, a)
Policy improvement step:
a∗k (s) = arg maxa qπk (s, a)
πk+1 (a|s) = 1 if a = a∗k , and πk+1 (a|s) = 0 otherwise
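To make the pseudocode concrete, here is a rough sketch (ours, not from the slides) of one MC Basic iteration; env.rollout and the episode format are assumptions about how an environment interface might look:

```python
import numpy as np

def mc_basic_iteration(env, policy, gamma=0.9, episodes_per_pair=50, horizon=100):
    """One policy-evaluation + policy-improvement step of MC Basic (sketch).
    policy[s] is the action taken at state s; env.rollout(s, a, policy, horizon)
    is assumed to return the list of rewards of one episode starting from (s, a)."""
    n_states, n_actions = env.n_states, env.n_actions
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            returns = []
            for _ in range(episodes_per_pair):
                rewards = env.rollout(s, a, policy, horizon)
                g = sum(gamma ** t * r for t, r in enumerate(rewards))
                returns.append(g)
            q[s, a] = np.mean(returns)        # MC-based policy evaluation
    return q.argmax(axis=1)                    # greedy policy improvement
```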
Shiyu Zhao 16 / 50
The MC Basic algorithm
Shiyu Zhao 17 / 50
Illustrative example 1: step by step
s1 s2 s3
s4 s5 s6
s7 s8 s9
Task:
• An initial policy is shown in the figure.
• Use MC Basic to find the optimal policy.
• rboundary = −1, rforbidden = −1, rtarget = 1, γ = 0.9.
Shiyu Zhao 18 / 50
Illustrative example 1: step by step
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 19 / 50
Illustrative example 1: step by step
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 20 / 50
Illustrative example 1: step by step
s1 s2 s3
s4 s5 s6
s7 s8 s9
• Starting from (s1, a1), the episode is $s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is
$q_{\pi_0}(s_1, a_1) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$
(The episodes and action values for (s1, a2) and (s1, a3) are obtained in the same way from the figure.)
s1 s2 s3
s4 s5 s6
s7 s8 s9
• Starting from (s1, a4), the episode is $s_1 \xrightarrow{a_4} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is
$q_{\pi_0}(s_1, a_4) = -1 + \gamma(-1) + \gamma^2(-1) + \cdots$
• Starting from (s1, a5), the episode is $s_1 \xrightarrow{a_5} s_1 \xrightarrow{a_1} s_1 \xrightarrow{a_1} \cdots$. Hence, the action value is
$q_{\pi_0}(s_1, a_5) = 0 + \gamma(-1) + \gamma^2(-1) + \cdots$
Shiyu Zhao 22 / 50
Illustrative example 1: step by step
s1 s2 s3
s4 s5 s6
s7 s8 s9
s1 s2 s3
s4 s5 s6
s7 s8 s9
Shiyu Zhao 24 / 50
Illustrative example 2: Episode length
Example setup:
• 5-by-5 grid world
• Reward setting: rboundary = −1, rforbidden = −10, rtarget = 1, γ = 0.9
Shiyu Zhao 25 / 50
Illustrative example 2: Episode length
(Figure: estimated state values and policies obtained with episode lengths 1, 2, 3, and 4; with very short episodes, only the states close to the target obtain nonzero value estimates.)
Shiyu Zhao 26 / 50
Illustrative example 2: Episode length
(Figure: estimated state values and policies obtained with episode lengths 14, 15, 30, and 100; as the episode length increases, the estimates spread to states farther from the target and approach the true state values.)
Shiyu Zhao 27 / 50
Illustrative example 2: Episode length
. Findings:
• When the episode length is short, only the states that are close to the
target have nonzero state values.
• As the episode length increases, the states that are closer to the target
have nonzero values earlier than those farther away.
• The episode length should be sufficiently long.
• The episode length does not have to be infinitely long.
Shiyu Zhao 28 / 50
Outline
1 Motivating example
Shiyu Zhao 29 / 50
Use data more efficiently
Shiyu Zhao 30 / 50
Use data more efficiently
Shiyu Zhao 31 / 50
Use data more efficiently
Shiyu Zhao 32 / 50
Update value estimate more efficiently
Shiyu Zhao 33 / 50
Update value estimate more efficiently
Shiyu Zhao 34 / 50
MC Exploring Starts
. If we use data and update estimate more efficiently, we get a new
algorithm called MC Exploring Starts:
Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)
Shiyu Zhao 35 / 50
MC Exploring Starts
Shiyu Zhao 36 / 50
MC Exploring Starts
Shiyu Zhao 37 / 50
Outline
1 Motivating example
Shiyu Zhao 38 / 50
Soft policies
Shiyu Zhao 39 / 50
ε-greedy policies
where Π denotes the set of all possible policies. The optimal policy here is
$\pi_{k+1}(a|s) = \begin{cases} 1, & a = a_k^*, \\ 0, & a \neq a_k^*, \end{cases}$
Shiyu Zhao 41 / 50
MC ε-Greedy algorithm
where Πε denotes the set of all ε-greedy policies with a fixed value of ε. The optimal policy here is
$\pi_{k+1}(a|s) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s)|-1}{|\mathcal{A}(s)|}\,\varepsilon, & a = a_k^*, \\[4pt] \dfrac{1}{|\mathcal{A}(s)|}\,\varepsilon, & a \neq a_k^*. \end{cases}$
In general, an ε-greedy policy with respect to a greedy action $a^*$ at state $s_t$ has the form
$\pi(a|s_t) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s_t)|-1}{|\mathcal{A}(s_t)|}\,\varepsilon, & a = a^*, \\[4pt] \dfrac{1}{|\mathcal{A}(s_t)|}\,\varepsilon, & a \neq a^*. \end{cases}$
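A small sketch (ours, not from the slides) of sampling an action from such an ε-greedy policy, where the greedy action is taken from a q-table row:

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """Sample an action for one state from the epsilon-greedy policy built on q_row.
    q_row: 1-D array of action values q(s, .) for the current state."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)     # each non-greedy action: eps/|A|
    probs[np.argmax(q_row)] += 1.0 - epsilon            # greedy action: 1 - (|A|-1)/|A| * eps
    return rng.choice(n_actions, p=probs)

rng = np.random.default_rng(0)
print(epsilon_greedy_action(np.array([0.1, 0.5, 0.2]), epsilon=0.1, rng=rng))
```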
Shiyu Zhao 43 / 50
Exploration ability
(Figure: the trajectory after 100, 1000, and 10000 steps under a highly exploratory ε-greedy policy, and the number of times each state-action pair is visited in a long run; all pairs are visited roughly uniformly, on the order of 7600-8100 times each.)
Shiyu Zhao 44 / 50
Exploration ability
(Figure: the same quantities for a policy with little exploration; many state-action pairs are visited only a few times or not at all.)
Shiyu Zhao 45 / 50
Estimate based on one episode
. Run the MC ε-Greedy algorithm as follows. In every iteration:
• In the episode-generation step, use the previous policy to generate a single episode of one million steps!
• In the remaining steps, use this single episode to update the policy.
• Two iterations can lead to the optimal ε-greedy policy.
(Figure: (a) the initial policy, (b) the policy after the first iteration, (c) the policy after the second iteration.)
Shiyu Zhao 47 / 50
Optimality
. Given an ε-greedy policy, what is its state value?
(Figure: state values of the optimal ε-greedy policies for ε = 0, 0.1, 0.2, and 0.5. For ε = 0 the values coincide with the optimal state values (up to 10.0 near the target); as ε grows the state values decrease and eventually become negative.)
. The optimal ε-greedy policies are not consistent with the greedy
optimal one! Why is that? Consider the target for example.
Shiyu Zhao 49 / 50
Summary
Key points:
• Mean estimation by the Monte Carlo methods
• Three algorithms:
• MC Basic
• MC Exploring Starts
• MC ε-Greedy
• Relationship among the three algorithms
• Optimality vs exploration of ε-greedy policies
Shiyu Zhao 50 / 50
Lecture 6:
Stochastic Approximation
and
Stochastic Gradient Descent
Shiyu Zhao
Outline
Shiyu Zhao 1 / 65
Introduction
Why?
• The ideas and expressions of TD algorithms are very different from those of the algorithms we have studied so far.
• Many students who see TD algorithms for the first time may wonder why these algorithms were designed in the first place and why they work effectively.
• There is a knowledge gap!
• There is a knowledge gap!
Shiyu Zhao 2 / 65
Introduction
In this lecture,
• We fill the knowledge gap between the previous and upcoming lectures
by introducing basic stochastic approximation (SA) algorithms.
• We will see in the next lecture that the temporal-difference algorithms
are special SA algorithms. As a result, it will be much easier to
understand these algorithms.
Shiyu Zhao 3 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 4 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 5 / 65
Motivating example: mean estimation, again
$\mathbb{E}[X] \approx \bar{x} := \frac{1}{N}\sum_{i=1}^N x_i.$
Shiyu Zhao 7 / 65
Motivating example: mean estimation
In particular, suppose
$w_{k+1} = \frac{1}{k}\sum_{i=1}^k x_i, \qquad k = 1, 2, \ldots$
and hence
$w_k = \frac{1}{k-1}\sum_{i=1}^{k-1} x_i, \qquad k = 2, 3, \ldots$
Then, $w_{k+1}$ can be expressed in terms of $w_k$ as
$w_{k+1} = \frac{1}{k}\sum_{i=1}^k x_i = \frac{1}{k}\Big(\sum_{i=1}^{k-1} x_i + x_k\Big) = \frac{1}{k}\big((k-1)w_k + x_k\big) = w_k - \frac{1}{k}(w_k - x_k).$
Therefore, we obtain the following iterative algorithm:
$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$
Shiyu Zhao 8 / 65
Motivating example: mean estimation
We can use
$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$
to calculate the mean x̄ incrementally:
$w_1 = x_1,$
$w_2 = w_1 - \tfrac{1}{1}(w_1 - x_1) = x_1,$
$w_3 = w_2 - \tfrac{1}{2}(w_2 - x_2) = x_1 - \tfrac{1}{2}(x_1 - x_2) = \tfrac{1}{2}(x_1 + x_2),$
$w_4 = w_3 - \tfrac{1}{3}(w_3 - x_3) = \tfrac{1}{3}(x_1 + x_2 + x_3),$
$\vdots$
$w_{k+1} = \frac{1}{k}\sum_{i=1}^k x_i.$
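The incremental rule can be checked with a few lines of code (ours, not from the slides):

```python
import numpy as np

# Sketch: incremental mean estimation w_{k+1} = w_k - (1/k) * (w_k - x_k).
rng = np.random.default_rng(0)
xs = rng.normal(loc=3.0, scale=1.0, size=1000)

w = 0.0
for k, x in enumerate(xs, start=1):
    w = w - (1.0 / k) * (w - x)      # identical to the running average of x_1..x_k

print(w, xs.mean())                  # the two numbers agree (up to floating-point error)
```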
Shiyu Zhao 9 / 65
Motivating example: mean estimation
Shiyu Zhao 10 / 65
Motivating example: mean estimation
wk+1 = wk − αk (wk − xk ),
Shiyu Zhao 11 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 12 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 13 / 65
Robbins-Monro algorithm
g(w) = ∇w J(w) = 0
Shiyu Zhao 15 / 65
Robbins-Monro algorithm – Problem statement
Shiyu Zhao 16 / 65
Robbins-Monro algorithm – The algorithm
wk+1 = wk − ak g̃(wk , ηk ), k = 1, 2, 3, . . .
where
• wk is the kth estimate of the root
• g̃(wk , ηk ) = g(wk ) + ηk is the kth noisy observation
• ak is a positive coefficient.
The function g(w) is a black box! This algorithm relies on data:
• Input sequence: {wk }
• Noisy output sequence: {g̃(wk , ηk )}
Philosophy: without model, we need data!
• Here, the model refers to the expression of the function.
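As a quick illustration (ours, not from the slides), the RM update can be run on the simple root-finding problem used in the next example, g(w) = w − 10, but with noisy observations of g:

```python
import numpy as np

# Sketch: Robbins-Monro iteration w_{k+1} = w_k - a_k * g_tilde(w_k) for g(w) = w - 10,
# observed through additive noise. The true root is w* = 10.
rng = np.random.default_rng(0)

w = 20.0
for k in range(1, 2001):
    g_tilde = (w - 10.0) + rng.normal(scale=1.0)   # noisy observation of g(w_k)
    w = w - (1.0 / k) * g_tilde                    # a_k = 1/k satisfies the RM conditions

print(w)   # close to 10
```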
Shiyu Zhao 17 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 18 / 65
Robbins-Monro algorithm – Illustrative examples
w1 = 20 =⇒ g(w1 ) = 10
w2 = w1 − a1 g(w1 ) = 20 − 0.5 ∗ 10 = 15 =⇒ g(w2 ) = 5
w3 = w2 − a2 g(w2 ) = 15 − 0.5 ∗ 5 = 12.5 =⇒ g(w3 ) = 2.5
..
.
wk → 10
Exercises:
• What if ak = 1?
• What if ak = 2?
Shiyu Zhao 19 / 65
Robbins-Monro algorithm – Illustrative examples
(Figure: the estimate wk and the observation noise ηk versus the iteration index k; wk converges to the true root despite the noisy observations.)
Shiyu Zhao 20 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 21 / 65
Robbins-Monro algorithm – Convergence properties
Shiyu Zhao 22 / 65
Robbins-Monro algorithm – Convergence properties
An illustrative example:
• g(w) = tanh(w − 1)
• The true root of g(w) = 0 is w∗ = 1.
• Parameters: w1 = 3, ak = 1/k, ηk ≡ 0 (no noise for the sake of
simplicity)
The RM algorithm in this case is
wk+1 = wk − ak g(wk )
Shiyu Zhao 23 / 65
Robbins-Monro algorithm – Convergence properties
Simulation result: wk converges to the true root w∗ = 1.
(Figure: the function g(w) = tanh(w − 1) and the iterates w1, w2, w3, w4, ... moving toward the root w* = 1.)
Shiyu Zhao 25 / 65
Robbins-Monro algorithm – Convergence properties
Shiyu Zhao 27 / 65
Robbins-Monro algorithm – Convergence properties
• It holds that $\lim_{n\to\infty}\Big(\sum_{k=1}^n \frac{1}{k} - \ln n\Big) = \kappa,$
If the three conditions are not satisfied, the algorithm may not work.
• For example, $g(w) = w^3 - 5$ does not satisfy the first condition on
gradient boundedness. If the initial guess is good, the algorithm can
converge (locally). Otherwise, it will diverge.
We will see that ak is often selected as a sufficiently small constant in
many RL algorithms. Although the second condition is not satisfied in
this case, the algorithm can still work effectively.
Shiyu Zhao 30 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 31 / 65
Robbins-Monro algorithm – Apply to mean estimation
Recall that
wk+1 = wk + αk (xk − wk ).
Shiyu Zhao 32 / 65
Robbins-Monro algorithm – Apply to mean estimation
1) Consider the function
$g(w) \doteq w - \mathbb{E}[X].$
Our aim is to solve g(w) = 0. If we can do that, then we obtain E[X].
2) The observation we can get is
$\tilde{g}(w, x) \doteq w - x,$
$w_{k+1} = (1 - \alpha_k)w_k + \beta_k \eta_k,$
where $\{\alpha_k\}_{k=1}^\infty$, $\{\beta_k\}_{k=1}^\infty$, $\{\eta_k\}_{k=1}^\infty$ are stochastic sequences. Here $\alpha_k \ge 0$, $\beta_k \ge 0$ for all k. Then, $w_k$ converges to zero with probability 1 if the following conditions are satisfied:
1) $\sum_{k=1}^\infty \alpha_k = \infty$, $\sum_{k=1}^\infty \alpha_k^2 < \infty$, and $\sum_{k=1}^\infty \beta_k^2 < \infty$ uniformly w.p.1;
Shiyu Zhao 37 / 65
Stochastic gradient descent
$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^n \nabla_w f(w_k, x_i).$
Shiyu Zhao 38 / 65
Stochastic gradient descent – Algorithm
wk+1 = wk − αk ∇w f (wk , xk ),
Shiyu Zhao 39 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 40 / 65
Stochastic gradient descent – Example and application
where
$f(w, X) = \|w - X\|^2 / 2, \qquad \nabla_w f(w, X) = w - X$
Exercises:
Shiyu Zhao 41 / 65
Stochastic gradient descent – Example and application
Answer:
• The GD algorithm for solving the above problem is
$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)] = w_k - \alpha_k \mathbb{E}[w_k - X].$
• The corresponding SGD algorithm replaces the expectation by one sample:
$w_{k+1} = w_k - \alpha_k (w_k - x_k).$
• Note:
• It is the same as the mean estimation algorithm we presented before.
• That mean estimation algorithm is a special SGD algorithm.
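A minimal sketch (ours, not from the slides) of this SGD view of mean estimation, with the diminishing step size αk = 1/k:

```python
import numpy as np

# Sketch: SGD minimizing J(w) = E[ ||w - X||^2 / 2 ]; the stochastic gradient is w - x_k.
rng = np.random.default_rng(0)
xs = rng.normal(loc=5.0, scale=2.0, size=5000)

w = -20.0                                 # an initial guess far from the true mean
for k, x in enumerate(xs, start=1):
    w = w - (1.0 / k) * (w - x)           # SGD step with alpha_k = 1/k

print(w, xs.mean())                       # w ends up close to the sample mean
```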
Shiyu Zhao 42 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 43 / 65
Stochastic gradient descent – Convergence
From GD to SGD:
Since
$\nabla_w f(w_k, x_k) \neq \mathbb{E}[\nabla_w f(w, X)],$
does $w_k \to w^*$ as $k \to \infty$ under SGD?
Shiyu Zhao 44 / 65
Stochastic gradient descent – Convergence
Let
g(w) = ∇w J(w) = E[∇w f (w, X)].
Shiyu Zhao 45 / 65
Stochastic gradient descent – Convergence
$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{\mathbb{E}[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - \mathbb{E}[\nabla_w f(w, X)]}_{\eta}.$
Shiyu Zhao 46 / 65
Stochastic gradient descent – Convergence
3) $\{x_k\}_{k=1}^\infty$ is iid;
Shiyu Zhao 47 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 48 / 65
Stochastic gradient descent – Convergence pattern
Shiyu Zhao 49 / 65
Stochastic gradient descent – Convergence pattern
$\nabla_w^2 f \ge c > 0$
Shiyu Zhao 50 / 65
Stochastic gradient descent – Convergence pattern
Note that
$\delta_k \le \frac{\big|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}[\nabla_w f(w_k, X)]}^{\text{true gradient}}\big|}{\underbrace{c\,|w_k - w^*|}_{\text{distance to the optimal solution}}}.$
Shiyu Zhao 51 / 65
Stochastic gradient descent – Convergence pattern
Shiyu Zhao 52 / 65
Stochastic gradient descent – Convergence pattern
Result:
(Figure: mean-estimation trajectories in the plane and the distance to the true mean versus iteration step, for SGD (m = 1) and MBGD with mini-batch sizes m = 5 and m = 50.)
• Although the initial guess of the mean is far away from the true value,
the SGD estimate can approach the neighborhood of the true value
fast.
• When the estimate is close to the true value, it exhibits certain
randomness but still approaches the true value gradually.
Shiyu Zhao 53 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 54 / 65
Stochastic gradient descent – A deterministic formulation
Shiyu Zhao 55 / 65
Stochastic gradient descent – A deterministic formulation
Suppose the set is large and we can only fetch a single number every
time. In this case, we can use the following iterative algorithm:
wk+1 = wk − αk ∇w f (wk , xk ).
Questions:
• Is this algorithm SGD? It does not involve any random variables or
expected values.
• How should we use the finite set of numbers {xi }ni=1 ? Should we sort
these numbers in a certain order and then use them one by one? Or
should we randomly sample a number from the set?
Shiyu Zhao 56 / 65
Stochastic gradient descent – A deterministic formulation
Answer: introduce a random variable X whose distribution over the finite set is $p(X = x_i) = 1/n$; then the deterministic formulation becomes the SGD formulation, provided the numbers are sampled randomly.
Shiyu Zhao 57 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 58 / 65
BGD, MBGD, and SGD
Suppose we would like to minimize $J(w) = \mathbb{E}[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of X. The BGD, MBGD, and SGD algorithms solving this problem are, respectively,
$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^n \nabla_w f(w_k, x_i), \qquad \text{(BGD)}$
$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j), \qquad \text{(MBGD)}$
$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \qquad \text{(SGD)}$
• In the BGD algorithm, all the samples are used in every iteration. When n is large, $\frac{1}{n}\sum_{i=1}^n \nabla_w f(w_k, x_i)$ is close to the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$.
• In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \ldots, n\}$ with size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by m iid samplings.
• In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time k.
Shiyu Zhao 59 / 65
BGD, MBGD, and SGD
Shiyu Zhao 60 / 65
BGD, MBGD, and SGD – Illustrative examples
Given some numbers $\{x_i\}_{i=1}^n$, our aim is to calculate the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. This problem can be equivalently stated as the following optimization problem:
$\min_w\ J(w) = \frac{1}{2n}\sum_{i=1}^n \|w - x_i\|^2$
where the MBGD update involves the mini-batch average $\bar{x}_k^{(m)} = \frac{1}{m}\sum_{j \in \mathcal{I}_k} x_j$.
Shiyu Zhao 61 / 65
BGD, MBGD, and SGD
Shiyu Zhao 62 / 65
BGD, MBGD, and SGD
Let αk = 1/k. Given 100 points, using different mini-batch sizes leads to
different convergence speed.
(Figure: trajectories and distance-to-mean curves for SGD (m = 1) and MBGD with m = 5 and m = 50; larger mini-batch sizes converge faster.)
Shiyu Zhao 63 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 64 / 65
Summary
wk+1 = wk − ak g̃(wk , ηk )
• SGD algorithm: minimize J(w) = E[f (w, X)] using {∇w f (wk , xk )}
wk+1 = wk − αk ∇w f (wk , xk ),
Shiyu Zhao
Lecture 8: Value Function Approximation
Introduction
(Figure: roadmap of the course. Chapter 2 (Bellman Equation) and Chapter 3 (Bellman Optimality Equation) are the fundamental tools. Chapter 4 (Value Iteration and Policy Iteration), Chapter 5 (Monte Carlo Learning), Chapter 6 (Stochastic Approximation), and Chapter 7 (Temporal-Difference Learning) use tabular representations. Chapter 8 (Value Function Approximation), Chapter 9 (Policy Function Approximation, or Policy Gradient), and Chapter 10 (Actor-Critic Methods) move from tabular to function representations.)
Shiyu Zhao 1 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 2 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 3 / 70
Motivating examples: curve fitting
So far in this book, state and action values are represented by tables.
        a1           a2           a3           a4           a5
s1      qπ(s1,a1)    qπ(s1,a2)    qπ(s1,a3)    qπ(s1,a4)    qπ(s1,a5)
...     ...          ...          ...          ...          ...
s9      qπ(s9,a1)    qπ(s9,a2)    qπ(s9,a3)    qπ(s9,a4)    qπ(s9,a5)
Shiyu Zhao 4 / 70
Motivating examples: curve fitting
Consider an example:
• Suppose there are one-dimensional states s1 , . . . , s|S| .
• Their state values are vπ (s1 ), . . . , vπ (s|S| ), where π is a given policy.
• Suppose |S| is very large and we hope to use a simple curve to
approximate these dots to save storage.
Shiyu Zhao 5 / 70
Motivating examples: curve fitting
The simplest curve is a straight line: v̂(s, w) = as + b = φᵀ(s)w, where
• w = [a, b]ᵀ is the parameter vector,
• φ(s) = [s, 1]ᵀ is the feature vector of s,
• v̂(s, w) is linear in w.
Shiyu Zhao 6 / 70
Motivating examples: curve fitting
" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w,
|{z} b
φT (s) | {z }
w
The benefits:
• The tabular representation needs to store |S| state values. Now, we
need to only store two parameters a and b.
• Every time we would like to use the value of s, we can calculate
φT (s)w.
• Such a benefit is not free. It comes with a cost: the state values can
not be represented accurately. This is why this method is called value
approximation.
Shiyu Zhao 7 / 70
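As a rough illustration of this storage/accuracy trade-off, the sketch below fits a straight line v̂(s, w) = φᵀ(s)w to synthetic state values (the values themselves are made up for illustration, not taken from the grid-world example):

```python
import numpy as np

# Synthetic "true" state values for |S| = 100 one-dimensional states.
states = np.arange(1, 101, dtype=float)
v_true = 0.05 * states + np.sin(states / 10.0)           # not a straight line

# Feature vector phi(s) = [s, 1]^T; fit w = [a, b]^T by least squares.
Phi = np.stack([states, np.ones_like(states)], axis=1)   # shape (100, 2)
w, *_ = np.linalg.lstsq(Phi, v_true, rcond=None)

v_hat = Phi @ w                                          # v_hat(s, w) = phi(s)^T w
print("parameters (a, b):", w)
print("max approximation error:", np.abs(v_hat - v_true).max())
# Only 2 numbers are stored instead of 100, at the cost of approximation error.
```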
Motivating examples: curve fitting
In this case,
• The dimensions of w and φ(s) increase, but the values may be fitted
more accurately.
• Although v̂(s, w) is nonlinear in s, it is linear in w. The nonlinearity is
contained in φ(s).
Shiyu Zhao 8 / 70
Motivating examples: curve fitting
Shiyu Zhao 9 / 70
Motivating examples: curve fitting
Quick summary:
• Idea: Approximate the state and action values using parameterized
functions: v̂(s, w) ≈ vπ (s) where w ∈ Rm is the parameter vector.
• Key difference: How to access and assign the value of v(s)
• Advantage:
1) Storage: The dimension of w may be much less than |S|.
2) Generalization: When a state s is visited, the parameter w is
updated so that the values of some other unvisited states can also
be updated.
Shiyu Zhao 10 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 11 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 12 / 70
Objective function
Shiyu Zhao 13 / 70
Objective function
Shiyu Zhao 14 / 70
Objective function
• Drawback:
• The states may not be equally important. For example, some states may be rarely visited by a policy. Hence, weighting all states equally does not reflect the real dynamics of the Markov process under the given policy.
Shiyu Zhao 15 / 70
Objective function
Shiyu Zhao 17 / 70
Objective function - Stationary distribution
Illustrative example:
• Given a policy shown in the figure.
• Let nπ (s) denote the number of times that s has been visited in a very
long episode generated by π.
• Then, d_π(s) can be approximated by
  d_π(s) ≈ n_π(s) / Σ_{s'∈S} n_π(s').
Figure: Long-run behavior of an ε-greedy policy with ε = 0.5 (the visit frequency of each state versus the step index over a long episode).
Shiyu Zhao 18 / 70
Objective function - Stationary distribution
The converged values can be predicted because they are the entries of d_π:
  d_π^T = d_π^T P_π.
It can be calculated that the left eigenvector for the eigenvalue of one is
  d_π = [0.0345, 0.1084, 0.1330, 0.7241]^T.
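A small sketch of both ways of obtaining d_π; the 4-state transition matrix below is a stand-in, not the actual P_π induced by the ε-greedy policy in the example:

```python
import numpy as np

# Placeholder 4-state transition matrix P_pi (rows sum to 1).
P = np.array([
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.2, 0.1, 0.1, 0.6],
    [0.1, 0.1, 0.1, 0.7],
])

# 1) Empirical estimate: count visit frequencies along a long trajectory.
rng = np.random.default_rng(0)
counts = np.zeros(4)
s = 0
for _ in range(100_000):
    counts[s] += 1
    s = rng.choice(4, p=P[s])
d_empirical = counts / counts.sum()

# 2) Analytical value: left eigenvector of P for eigenvalue 1, i.e. d^T = d^T P.
eigvals, eigvecs = np.linalg.eig(P.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_stationary = d / d.sum()

print("empirical  d_pi:", np.round(d_empirical, 4))
print("stationary d_pi:", np.round(d_stationary, 4))
```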
5 Deep Q-learning
6 Summary
Shiyu Zhao 20 / 70
Optimization algorithms
While we have the objective function, the next step is to optimize it.
• To minimize the objective function J(w), we can use the
gradient-descent algorithm:
wk+1 = wk − αk ∇w J(wk )
Shiyu Zhao 21 / 70
Optimization algorithms
Shiyu Zhao 22 / 70
Optimization algorithms
In particular,
• First, Monte Carlo learning with function approximation
Let g_t be the discounted return starting from s_t in the episode. Then, g_t can be used to approximate v_π(s_t). The algorithm becomes
  w_{t+1} = w_t + α_t [ g_t − v̂(s_t, w_t) ] ∇_w v̂(s_t, w_t).
Shiyu Zhao 23 / 70
Optimization algorithms
It can only estimate the state values of a given policy, but it is important for understanding the other algorithms introduced later.
Shiyu Zhao 24 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 25 / 70
Selection of function approximators
An important question that has not been answered: How to select the
function v̂(s, w)?
• The first approach, which was widely used before, is to use a linear
function
v̂(s, w) = φT (s)w
Shiyu Zhao 26 / 70
Linear function approximation
In this case, the gradient is
  ∇_w v̂(s, w) = φ(s).
Substituting it into the TD update yields
  w_{t+1} = w_t + α_t [ r_{t+1} + γ φᵀ(s_{t+1}) w_t − φᵀ(s_t) w_t ] φ(s_t),
which is the TD learning algorithm with linear function approximation (TD-Linear).
Shiyu Zhao 27 / 70
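A minimal sketch of the TD-Linear update, assuming a generic environment/policy interface and feature map (the names env.reset, env.step, policy, and phi are assumptions, not part of the slides):

```python
import numpy as np

def td_linear(env, policy, phi, dim, gamma=0.9, alpha=0.005, num_steps=100_000):
    """Estimate v_pi with a linear approximator v_hat(s, w) = phi(s)^T w.

    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, phi(s) -> feature vector of length `dim`.
    """
    w = np.zeros(dim)
    s = env.reset()
    for _ in range(num_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        # TD error with linear v_hat: r + gamma * phi(s')^T w - phi(s)^T w
        td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
        # Gradient of v_hat(s, w) with respect to w is phi(s)
        w += alpha * td_error * phi(s)
        s = env.reset() if done else s_next
    return w
```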
Linear function approximation
Shiyu Zhao 28 / 70
Linear function approximation
φ(s) = e_s ∈ R^{|S|}, where e_s is the vector whose sth entry is 1 and the others are 0. In this special case, linear function approximation reduces to the tabular representation.
Shiyu Zhao 29 / 70
Linear function approximation
Shiyu Zhao 30 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 31 / 70
Illustrative examples
Shiyu Zhao 32 / 70
Illustrative examples
Ground truth:
• The true state values and the 3D visualization
Experience samples:
• 500 episodes were generated following the given policy.
• Each episode has 500 steps and starts from a randomly selected
state-action pair following a uniform distribution.
Shiyu Zhao 33 / 70
Illustrative examples
[Figure: the state value estimation error versus the episode index.]
Shiyu Zhao 34 / 70
Illustrative examples
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of
the elements does not matter.
Shiyu Zhao 35 / 70
Illustrative examples
[Figure: the state value estimation error of TD-Linear (α = 0.0005) versus the episode index.]
• The trend is right, but there are errors due to limited approximation
ability!
• We are trying to use a plane to approximate a non-plane surface!
Shiyu Zhao 36 / 70
Illustrative examples
In this case,
v̂(s, w) = φᵀ(s)w = w₁ + w₂x + w₃y + w₄x² + w₅y² + w₆xy
Shiyu Zhao 37 / 70
Illustrative examples
[Figure: the state value estimation errors versus the episode index when higher-order feature vectors are used.]
5 Deep Q-learning
6 Summary
Shiyu Zhao 39 / 70
Summary of the story
5 Deep Q-learning
6 Summary
Shiyu Zhao 41 / 70
Theoretical analysis
• The algorithm
Shiyu Zhao 42 / 70
Theoretical analysis
Shiyu Zhao 43 / 70
Theoretical analysis
Shiyu Zhao 44 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 45 / 70
Sarsa with function approximation
Shiyu Zhao 46 / 70
Sarsa with function approximation
To search for optimal policies, we can combine policy evaluation and
policy improvement.
Pseudocode: Sarsa with function approximation
Aim: Search for a policy that can lead the agent to the target from an initial state-action pair (s_0, a_0).
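A sketch of the q-value update used at each step of Sarsa with a linear approximator q̂(s, a, w) = φ(s, a)ᵀw (the feature map phi and this exact function signature are assumptions, not the slides' pseudocode):

```python
import numpy as np

def sarsa_fa_update(w, phi, s, a, r, s_next, a_next, gamma=0.9, alpha=0.001):
    """One Sarsa update with linear q_hat(s, a, w) = phi(s, a)^T w.

    phi(s, a) is an assumed feature map returning a vector of the same length as w.
    """
    q_sa = phi(s, a) @ w
    q_next = phi(s_next, a_next) @ w
    td_error = r + gamma * q_next - q_sa
    # grad_w q_hat(s, a, w) = phi(s, a) in the linear case
    return w + alpha * td_error * phi(s, a)
```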
Illustrative example:
• Sarsa with linear function approximation.
• γ = 0.9, ε = 0.1, r_boundary = r_forbidden = −10, r_target = 1, α = 0.001.
[Figure: the learned policy in the grid world (left) and the total reward and length of every episode (right).]
Shiyu Zhao 48 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 49 / 70
Q-learning with function approximation
Shiyu Zhao 50 / 70
Q-learning with function approximation
Shiyu Zhao 51 / 70
Q-learning with function approximation
Illustrative example:
• Q-learning with linear function approximation.
• γ = 0.9, ε = 0.1, r_boundary = r_forbidden = −10, r_target = 1, α = 0.001.
[Figure: the learned policy in the grid world (left) and the total reward and episode length of every episode (right).]
Shiyu Zhao 52 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 53 / 70
Deep Q-learning
Shiyu Zhao 54 / 70
Deep Q-learning
Shiyu Zhao 55 / 70
Deep Q-learning
• For the sake of simplicity, we can assume that w in y is fixed (at least
for a while) when we calculate the gradient.
Shiyu Zhao 56 / 70
Deep Q-learning
Shiyu Zhao 57 / 70
Deep Q-learning
Shiyu Zhao 58 / 70
Deep Q-learning - Two networks
First technique:
• Two networks, a main network and a target network.
Why is it used?
• The mathematical reason has been explained when we calculate the
gradient.
Implementation details:
• Let w and wT denote the parameters of the main and target networks,
respectively. They are set to be the same initially.
• In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from
the replay buffer (will be explained later).
• The inputs of the networks include state s and action a. The target output is y_T ≐ r + γ max_{a∈A(s')} q̂(s', a, w_T). Then, we directly minimize the TD error, also called the loss function, (y_T − q̂(s, a, w))² over the mini-batch {(s, a, y_T)}.
Shiyu Zhao 59 / 70
Deep Q-learning - Experience replay
Another technique:
• Experience replay
Question: What is experience replay?
Answer:
• After we have collected some experience samples, we do NOT use these samples in the order they were collected.
• Instead, we store them in a set, called the replay buffer, B ≐ {(s, a, r, s')}.
• Every time we train the neural network, we draw a mini-batch of random samples from the replay buffer.
• The drawing of samples, called experience replay, should follow a uniform distribution (why?).
Shiyu Zhao 60 / 70
Deep Q-learning - Experience replay
Shiyu Zhao 61 / 70
Deep Q-learning - Experience replay
Answer (continued):
• However, the samples are not uniformly collected because they are generated consecutively by certain policies.
• To break the correlation between consecutive samples, we can use the experience replay technique and uniformly draw samples from the replay buffer.
• This is the mathematical reason why experience replay is necessary
and why the experience replay must be uniform.
Shiyu Zhao 62 / 70
Deep Q-learning - Experience replay
Shiyu Zhao 63 / 70
Deep Q-learning
Pseudocode: Deep Q-learning (off-policy version)
Aim: Learn an optimal target network to approximate the optimal action values
from the experience samples generated by a behavior policy πb .
Remarks:
• Why is there no policy update?
• Why not use the policy update equation that we derived?
• The network input and output are different from the DQN paper.
Shiyu Zhao 64 / 70
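A compact PyTorch-style sketch of the two techniques (target network and uniform experience replay). Note that, for simplicity, the sketch uses the common state-in / one-q-value-per-action-out convention rather than the (s, a)-input convention described in the slides; the network size, optimizer, and hyperparameters are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

class QNet(nn.Module):
    """q_hat(s, ., w): maps a state to one q-value per action (a simplification
    of the (s, a)-input description in the slides)."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 100), nn.ReLU(),
                                 nn.Linear(100, num_actions))

    def forward(self, s):
        return self.net(s)

def dqn_train_step(main_net, target_net, optimizer, replay_buffer,
                   batch_size=32, gamma=0.9):
    # Experience replay: uniformly draw a mini-batch {(s, a, r, s')} from the buffer.
    batch = random.sample(replay_buffer, batch_size)
    s = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])

    # Target y_T = r + gamma * max_a q_hat(s', a, w_T), with the target weights w_T fixed.
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values

    # Minimize the loss (y_T - q_hat(s, a, w))^2 over the mini-batch; only w is updated.
    q = main_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically, the target network is synchronized with the main network:
#     target_net.load_state_dict(main_net.state_dict())
```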
Deep Q-learning
Illustrative example:
• This example aims to learn optimal action values for every state-action
pair.
• Once the optimal action values are obtained, the optimal greedy policy
can be obtained immediately.
Shiyu Zhao 65 / 70
Deep Q-learning
Setup:
• One single episode is used to train the network.
• This episode is generated by an exploratory behavior policy shown in
Figure (a).
• The episode only has 1,000 steps! The tabular Q-learning requires
100,000 steps.
• A shallow neural network with one single hidden layer is used as a
nonlinear approximator of q̂(s, a, w). The hidden layer has 100 neurons.
See details in the book.
Shiyu Zhao 66 / 70
Deep Q-learning
[Figure: the grid-world policies (top) and the TD error and state value estimation error versus the iteration index (bottom).]
The TD error converges to zero. The state estimation error converges to zero.
Shiyu Zhao 67 / 70
Deep Q-learning
[Figure: the learned policies (top) and the TD error and state value estimation error versus the iteration index (bottom).]
The TD error converges to zero, but the state value estimation error does not converge to zero.
Shiyu Zhao 68 / 70
Outline
5 Deep Q-learning
6 Summary
Shiyu Zhao 69 / 70
Summary
Shiyu Zhao 70 / 70
Lecture 7: Temporal-Difference Learning
Shiyu Zhao
Outline
Shiyu Zhao 1 / 60
Introduction
Shiyu Zhao 2 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 3 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 4 / 60
Motivating example: stochastic algorithms
We next consider some stochastic problems and show how to use the RM
algorithm to solve them.
First, consider the simple mean estimation problem: calculate
  w = E[X]
based on i.i.d. samples {x} of X.
Second, consider estimating the mean of a function:
  w = E[v(X)].
To apply the RM algorithm, define
  g(w) = w − E[v(X)],
whose noisy observation is
  g̃(w, η) ≐ w − v(x) = (w − E[v(X)]) + (E[v(X)] − v(x)) ≐ g(w) + η.
Shiyu Zhao 6 / 60
Motivating example: stochastic algorithms
Third, consider estimating w = E[R + γv(X)], where R and X are random variables.
Quick summary:
• The above three examples are more and more complex.
• They can all be solved by the RM algorithm.
• We will see that the TD algorithms have similar expressions.
Shiyu Zhao 8 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 9 / 60
TD learning of state values
Note that
• TD learning often refers to a broad class of RL algorithms.
• TD learning also refers to a specific algorithm for estimating state
values as introduced below.
Shiyu Zhao 10 / 60
TD learning of state values – Algorithm description
The TD learning algorithm for estimating state values is
  v_{t+1}(s_t) = v_t(s_t) − α_t(s_t) [ v_t(s_t) − ( r_{t+1} + γ v_t(s_{t+1}) ) ],
  v_{t+1}(s) = v_t(s),  for all s ≠ s_t,
where t = 0, 1, 2, . . . , v_t(s_t) is the estimate of v_π(s_t), and α_t(s_t) is the learning rate of s_t at time t.
Shiyu Zhao 11 / 60
TD learning of state values – Algorithm properties
Here,
  v̄_t ≐ r_{t+1} + γ v_t(s_{t+1})
is called the TD target, and
  δ_t ≐ v_t(s_t) − v̄_t
is called the TD error. Note that the TD error is zero in expectation when the estimate equals the true state value:
  E[δ_{π,t} | S_t = s_t] = v_π(s_t) − E[ R_{t+1} + γ v_π(S_{t+1}) | S_t = s_t ] = 0.
Other properties:
• The TD algorithm in (3) only estimates the state value of a given
policy.
• It does not estimate the action values.
• It does not search for optimal policies.
• Later, we will see how to estimate action values and then search for
optimal policies.
• Nonetheless, the TD algorithm in (3) is fundamental for understanding
the core idea.
Shiyu Zhao 15 / 60
TD learning of state values – The idea of the algorithm
Shiyu Zhao 16 / 60
TD learning of state values – The idea of the algorithm
Since we can only obtain the samples r and s' of R and S', the noisy observation we have is
  g̃(v_π(s)) = v_π(s) − [ r + γ v_π(s') ].
Shiyu Zhao 18 / 60
TD learning of state values – The idea of the algorithm
where v_k(s) is the estimate of v_π(s) at the kth step; r_k, s'_k are the samples of R, S' obtained at the kth step.
Shiyu Zhao 19 / 60
TD learning of state values – The idea of the algorithm
where v_k(s) is the estimate of v_π(s) at the kth step; r_k, s'_k are the samples of R, S' obtained at the kth step.
Shiyu Zhao 20 / 60
TD learning of state values – Algorithm convergence
By the TD algorithm (1), v_t(s) converges with probability 1 to v_π(s) for all s ∈ S as t → ∞ if Σ_t α_t(s) = ∞ and Σ_t α_t²(s) < ∞ for all s ∈ S.
Remarks:
• This theorem says the state value can be found by the TD algorithm for a given policy π.
• Σ_t α_t(s) = ∞ and Σ_t α_t²(s) < ∞ must hold for all s ∈ S. At time step t, if s = s_t, which means that s is visited at time t, then α_t(s) > 0; otherwise, α_t(s) = 0 for all the other s ≠ s_t. This requires that every state be visited an infinite (or sufficiently large) number of times.
• The learning rate α is often selected as a small constant. In this case, the condition Σ_t α_t²(s) < ∞ is no longer valid. When α is constant, it can still be shown that the algorithm converges in the sense of expectation.
For the proof of the theorem, see my book.
Shiyu Zhao 21 / 60
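A minimal tabular sketch of this TD algorithm for policy evaluation, assuming a generic environment/policy interface (the names env.reset, env.step, and policy are illustrative assumptions):

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_states, gamma=0.9, alpha=0.05,
                          num_episodes=1000):
    """Estimate v_pi(s) for all s with the tabular TD algorithm.

    Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done),
    policy(s) -> a, with states indexed 0..num_states-1.
    """
    v = np.zeros(num_states)          # terminal-state values stay at 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # v_{t+1}(s_t) = v_t(s_t) - alpha * [v_t(s_t) - (r + gamma * v_t(s_{t+1}))]
            v[s] -= alpha * (v[s] - (r + gamma * v[s_next]))
            s = s_next
    return v
```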
TD learning of state values – Algorithm properties
While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC
learning?
Shiyu Zhao 22 / 60
TD learning of state values – Algorithm properties
While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC
learning?
1 Motivating examples
8 Summary
Shiyu Zhao 24 / 60
TD learning of action values – Sarsa
Shiyu Zhao 25 / 60
Sarsa – Algorithm
The Sarsa algorithm is
  q_{t+1}(s_t, a_t) = q_t(s_t, a_t) − α_t(s_t, a_t) [ q_t(s_t, a_t) − ( r_{t+1} + γ q_t(s_{t+1}, a_{t+1}) ) ],
  q_{t+1}(s, a) = q_t(s, a),  for all (s, a) ≠ (s_t, a_t),
where t = 0, 1, 2, . . . .
• q_t(s_t, a_t) is an estimate of q_π(s_t, a_t);
• α_t(s_t, a_t) is the learning rate depending on s_t, a_t.
Shiyu Zhao 26 / 60
Sarsa – Algorithm
• Why is this algorithm called Sarsa? That is because each step of the
algorithm involves (st , at , rt+1 , st+1 , at+1 ). Sarsa is the abbreviation of
state-action-reward-state-action.
Remarks:
• This theorem says the action value can be found by Sarsa for a given policy π.
Shiyu Zhao 28 / 60
Sarsa – Implementation
Shiyu Zhao 29 / 60
Sarsa – Implementation
Shiyu Zhao 30 / 60
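A tabular sketch of Sarsa combined with an ε-greedy policy improvement step; the environment interface and hyperparameters are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick an action from one row of the q-table with epsilon-greedy exploration."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def sarsa(env, num_states, num_actions, gamma=0.9, alpha=0.1, epsilon=0.1,
          num_episodes=500):
    """Assumed interfaces: env.reset() -> s, env.step(a) -> (s_next, r, done)."""
    rng = np.random.default_rng(0)
    q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(q[s], epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(q[s_next], epsilon, rng)
            # Sarsa update: the TD target is r + gamma * q(s', a')
            q[s, a] -= alpha * (q[s, a] - (r + gamma * q[s_next, a_next]))
            s, a = s_next, a_next
    return q
```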
Sarsa – Examples
Task description:
• The task is to find a good path from a specific starting state to the
target state.
• This task is different from all the previous tasks where we need to
find out the optimal policy for every state!
• Each episode starts from the top-left state and ends in the target state.
• In the future, pay attention to what the task is.
• r_target = 0, r_forbidden = r_boundary = −10, and r_other = −1. The learning rate is α = 0.1 and the value of ε is 0.1.
Shiyu Zhao 31 / 60
Sarsa – Examples
Results:
• The left figures above show the final policy obtained by Sarsa.
• The right figures show the total reward and length of every episode.
[Figure: the final policy in the grid world (left) and the total rewards and length of every episode (right).]
Shiyu Zhao 32 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 33 / 60
TD learning of action values: Expected Sarsa
where
  E[q_t(s_{t+1}, A)] = Σ_a π_t(a|s_{t+1}) q_t(s_{t+1}, a) ≐ v_t(s_{t+1}).
• It needs more computation. But it is beneficial in the sense that it reduces the estimation variance, because it reduces the random variables in Sarsa from {s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}} to {s_t, a_t, r_{t+1}, s_{t+1}}.
Shiyu Zhao 34 / 60
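A small sketch of how the Expected Sarsa TD target differs from the Sarsa target (the q-table, policy matrix, and sampled transition are placeholders):

```python
import numpy as np

def sarsa_target(q, r, s_next, a_next, gamma=0.9):
    # Sarsa uses the sampled next action a'
    return r + gamma * q[s_next, a_next]

def expected_sarsa_target(q, pi, r, s_next, gamma=0.9):
    # E[q(s', A)] = sum_a pi(a|s') q(s', a); no sampled a' is needed
    return r + gamma * np.dot(pi[s_next], q[s_next])
```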
TD learning of action values: Expected Sarsa
Illustrative example:
[Figure: the final policy in the grid world (left) and the total rewards and episode length of every episode (right).]
Episode index
Shiyu Zhao 35 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 36 / 60
TD learning of action values: n-step Sarsa
• n-step Sarsa needs (st , at , rt+1 , st+1 , at+1 , . . . , rt+n , st+n , at+n ).
• Since (rt+n , st+n , at+n ) has not been collected at time t, we are not able to
implement n-step Sarsa at step t. However, we can wait until time t + n to
update the q-value of (st , at ):
• Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its
performance is a blend of Sarsa and MC learning:
• If n is large, its performance is close to MC learning and hence has a large
variance but a small bias.
• If n is small, its performance is close to Sarsa and hence has a relatively
large bias due to the initial guess and relatively low variance.
• Finally, n-step Sarsa is also for policy evaluation. It can be combined with
the policy improvement step to search for optimal policies.
Shiyu Zhao 39 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 40 / 60
TD learning of optimal action values: Q-learning
Shiyu Zhao 41 / 60
Q-learning – Algorithm
Q-learning is very similar to Sarsa. They are different only in terms of the
TD target:
• The TD target in Q-learning is r_{t+1} + γ max_{a∈A} q_t(s_{t+1}, a).
• The TD target in Sarsa is rt+1 + γqt (st+1 , at+1 ).
Shiyu Zhao 42 / 60
Q-learning – Algorithm
Shiyu Zhao 43 / 60
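For comparison with the Sarsa sketch above, the only change in a tabular Q-learning step is the TD target (illustrative code; the behavior policy that generates (s, a, r, s') is not shown):

```python
import numpy as np

def q_learning_update(q, s, a, r, s_next, gamma=0.9, alpha=0.1):
    """One tabular Q-learning update; q estimates the optimal action values."""
    td_target = r + gamma * np.max(q[s_next])   # max over actions, not the sampled a'
    q[s, a] -= alpha * (q[s, a] - td_target)
    return q
```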
Off-policy vs on-policy
Shiyu Zhao 44 / 60
Off-policy vs on-policy
Shiyu Zhao 45 / 60
Off-policy vs on-policy
Shiyu Zhao 46 / 60
Off-policy vs on-policy
Sarsa is on-policy.
• First, Sarsa aims to solve the Bellman equation of a given policy π:
  q_π(s, a) = E[ R + γ q_π(S', A') | s, a ],  for all (s, a).
Shiyu Zhao 48 / 60
Off-policy vs on-policy
Q-learning is off-policy.
Shiyu Zhao 49 / 60
Q-learning – Implementation
Shiyu Zhao 51 / 60
Q-learning – Examples
Task description:
• The task in these examples is to find an optimal policy for all the
states.
• The reward setting is rboundary = rforbidden = −1, and rtarget = 1.
The discount rate is γ = 0.9. The learning rate is α = 0.1.
Ground truth: an optimal policy and the corresponding optimal state
values.
[Figure: (a) the estimated policy in the grid world; (b) the state value error versus the step in the episode (×10⁴).]
Shiyu Zhao 53 / 60
Q-learning – Examples
[Figure: the estimated policy (left) and the state value error versus the step in the episode, ×10⁴ (right).]
Shiyu Zhao 54 / 60
Q-learning – Examples
[Figures: two further examples showing the estimated policies (left) and the state value errors versus the step in the episode, ×10⁴ (right).]
1 Motivating examples
8 Summary
Shiyu Zhao 56 / 60
A unified point of view
Shiyu Zhao 58 / 60
Outline
1 Motivating examples
8 Summary
Shiyu Zhao 59 / 60
Summary
Shiyu Zhao 60 / 60
Lecture 9: Policy Gradient Methods
Shiyu Zhao
Introduction
[Course map: fundamental tools (Chapter 2: Bellman Equation; Chapter 3: Bellman Optimality Equation); algorithms/methods with a tabular representation (Chapter 4: Value Iteration & Policy Iteration; Chapter 5: Monte Carlo Learning; Chapter 6: Stochastic Approximation; Chapter 7: Temporal-Difference Learning); and, moving from tabular representation to function representation, Chapter 8: Value Function Approximation; Chapter 9: Policy Function Approximation (or Policy Gradient); Chapter 10: Actor-Critic Methods.]
Shiyu Zhao 1 / 43
Introduction
Shiyu Zhao 2 / 43
Outline
5 Summary
Shiyu Zhao 3 / 43
Outline
5 Summary
Shiyu Zhao 4 / 43
Basic idea of policy gradient
        a1           a2           a3           a4           a5
s1      π(a1|s1)     π(a2|s1)     π(a3|s1)     π(a4|s1)     π(a5|s1)
...     ...          ...          ...          ...          ...
s9      π(a1|s9)     π(a2|s9)     π(a3|s9)     π(a4|s9)     π(a5|s9)
Shiyu Zhao 5 / 43
Basic idea of policy gradient
Now, policies are represented by a parameterized function:
  π(a|s, θ),
where θ ∈ R^m is the parameter vector.
Shiyu Zhao 6 / 43
Basic idea of policy gradient
Shiyu Zhao 7 / 43
Basic idea of policy gradient
Shiyu Zhao 8 / 43
Basic idea of policy gradient
Shiyu Zhao 9 / 43
Basic idea of policy gradient
5 Summary
Shiyu Zhao 11 / 43
Metrics to define optimal policies - 1) The average value
The first metric is the average state value, or simply the average value:
  v̄_π ≐ Σ_{s∈S} d(s) v_π(s) = E[v_π(S)],
where S ∼ d and d(s) is a probability distribution (weight) over the states.
Shiyu Zhao 12 / 43
Metrics to define optimal policies - 1) The average value
Vector-product form:
  v̄_π = Σ_{s∈S} d(s) v_π(s) = dᵀ v_π,
where
  v_π = [. . . , v_π(s), . . . ]ᵀ ∈ R^{|S|},
  d = [. . . , d(s), . . . ]ᵀ ∈ R^{|S|}.
Shiyu Zhao 13 / 43
Metrics to define optimal policies - 1) The average value
d_0(s_0) = 1,  d_0(s ≠ s_0) = 0.
Shiyu Zhao 14 / 43
Metrics to define optimal policies - 1) The average value
d_π^T P_π = d_π^T.
The second metric is the average reward:
  r̄_π ≐ Σ_{s∈S} d_π(s) r_π(s) = E[r_π(S)],
where S ∼ d_π. Here,
  r_π(s) ≐ Σ_{a∈A} π(a|s) r(s, a)
is the expected immediate reward at state s.
There is also an equivalent definition!
Shiyu Zhao 17 / 43
Metrics to define optimal policies - Remarks
The equivalent definition is the long-run average reward:
  lim_{n→∞} (1/n) E[ Σ_{k=1}^{n} R_{t+k} | S_t = s_0 ] = Σ_{s∈S} d_π(s) r_π(s) = r̄_π.
Note that
• The starting state s0 does not matter.
• The two definitions of r̄π are equivalent.
See the proof in the book.
Shiyu Zhao 18 / 43
Metrics to define optimal policies - Remarks
Shiyu Zhao 19 / 43
Metrics to define optimal policies - Remarks
Shiyu Zhao 20 / 43
Metrics to define optimal policies - Remarks
r̄π = (1 − γ)v̄π .
Shiyu Zhao 21 / 43
Metrics to define optimal policies - Exercise
Exercise:
You will often see the following metric in the literature:
  J(θ) = E[ Σ_{t=0}^{∞} γ^t R_{t+1} ].
How does it relate to the metrics introduced above?
Shiyu Zhao 22 / 43
Metrics to define optimal policies - Exercise
  J(θ) = E[ Σ_{t=0}^{∞} γ^t R_{t+1} ]
       = Σ_{s∈S} d(s) E[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
       = Σ_{s∈S} d(s) v_π(s)
       = v̄_π
Shiyu Zhao 23 / 43
Outline
5 Summary
Shiyu Zhao 24 / 43
Gradients of the metrics
Shiyu Zhao 25 / 43
Gradients of the metrics
The gradients of these metrics have the unified form
  ∇_θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇_θ π(a|s, θ) q_π(s, a),
where
• J(θ) can be v̄π , r̄π , or v̄π0 .
• “=” may denote strict equality, approximation, or proportional to.
• η is a distribution or weight of the states.
Shiyu Zhao 26 / 43
Gradients of the metrics
∇_θ v̄_π = (1/(1−γ)) ∇_θ r̄_π,
∇_θ v̄_π^0 = Σ_{s∈S} ρ_π(s) Σ_{a∈A} ∇_θ π(a|s, θ) q_π(s, a).
Details are not given here. Interested readers can read my book.
Shiyu Zhao 27 / 43
Gradients of the metrics
Shiyu Zhao 28 / 43
Gradients of the metrics
∇_θ J(θ) = Σ_{s∈S} η(s) Σ_{a∈A} ∇_θ π(a|s, θ) q_π(s, a)
         = E[ ∇_θ ln π(A|S, θ) q_π(S, A) ],
and hence the true gradient can be approximated by a stochastic one based on samples of S and A.
Shiyu Zhao 29 / 43
Gradients of the metrics
Then, we have
  ∇_θ J = Σ_s d(s) Σ_a ∇_θ π(a|s, θ) q_π(s, a)
        = Σ_s d(s) Σ_a π(a|s, θ) ∇_θ ln π(a|s, θ) q_π(s, a)
        = E_{S∼d}[ Σ_a π(a|S, θ) ∇_θ ln π(a|S, θ) q_π(S, a) ]
        = E_{S∼d, A∼π}[ ∇_θ ln π(A|S, θ) q_π(S, A) ]
        ≐ E[ ∇_θ ln π(A|S, θ) q_π(S, A) ].
Shiyu Zhao 30 / 43
Gradients of the metrics
Some remarks:
• Such a form based on the softmax function can be realized by a neural
network whose input is s and parameter is θ. The network has |A|
outputs, each of which corresponds to π(a|s, θ) for an action a. The
activation function of the output layer should be softmax.
• Since π(a|s, θ) > 0 for all a, the parameterized policy is stochastic and
hence exploratory.
• There also exist deterministic policy gradient (DPG) methods.
Shiyu Zhao 32 / 43
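A small sketch of such a softmax parameterization and the ∇_θ ln π term it provides; the linear preference h(s, a, θ) = φ(s, a)ᵀθ and the feature matrix are illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(.|s, theta) from preferences h(s, a, theta) = phi(s, a)^T theta.

    `features` is a (num_actions, dim) matrix whose rows are phi(s, a) for each a.
    """
    h = features @ theta
    h = h - h.max()                    # subtract the max for numerical stability
    p = np.exp(h)
    return p / p.sum()

def grad_log_pi(theta, features, a):
    """grad_theta ln pi(a|s, theta) = phi(s, a) - sum_b pi(b|s, theta) phi(s, b)."""
    pi = softmax_policy(theta, features)
    return features[a] - pi @ features
```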
Outline
5 Summary
Shiyu Zhao 33 / 43
Gradient-ascent algorithm
Now, we are ready to present the first policy gradient algorithm to find
optimal policies!
• The gradient-ascent algorithm maximizing J(θ) is
  θ_{t+1} = θ_t + α ∇_θ J(θ_t) = θ_t + α E[ ∇_θ ln π(A|S, θ_t) q_π(S, A) ].
Shiyu Zhao 34 / 43
Gradient-ascent algorithm
Shiyu Zhao 35 / 43
Gradient-ascent algorithm
• How to sample S?
• S ∼ d, where the distribution d is a long-run behavior under π.
• How to sample A?
• A ∼ π(A|S, θ). Hence, at should be sampled following π(θt ) at st .
• Therefore, the policy gradient method is on-policy.
Shiyu Zhao 36 / 43
Gradient-ascent algorithm
Since
  ∇_θ ln π(a_t|s_t, θ_t) = ∇_θ π(a_t|s_t, θ_t) / π(a_t|s_t, θ_t),
the algorithm can be rewritten as
  θ_{t+1} = θ_t + α ( q_t(s_t, a_t) / π(a_t|s_t, θ_t) ) ∇_θ π(a_t|s_t, θ_t).
Shiyu Zhao 37 / 43
Gradient-ascent algorithm
π(a_t|s_t, θ_{t+1}) ≈ π(a_t|s_t, θ_t) + (∇_θ π(a_t|s_t, θ_t))ᵀ (θ_{t+1} − θ_t)
                    = π(a_t|s_t, θ_t) + α β_t (∇_θ π(a_t|s_t, θ_t))ᵀ (∇_θ π(a_t|s_t, θ_t))
                    = π(a_t|s_t, θ_t) + α β_t ‖∇_θ π(a_t|s_t, θ_t)‖²
Shiyu Zhao 38 / 43
Gradient-ascent algorithm
θ_{t+1} = θ_t + α ( q_t(s_t, a_t) / π(a_t|s_t, θ_t) ) ∇_θ π(a_t|s_t, θ_t)
        = θ_t + α β_t ∇_θ π(a_t|s_t, θ_t),   where β_t ≐ q_t(s_t, a_t) / π(a_t|s_t, θ_t).
Recall that the true action value q_π(s_t, a_t) in the gradient is replaced by an estimate q_t(s_t, a_t); when q_t(s_t, a_t) is obtained by Monte Carlo estimation, the algorithm is called REINFORCE.
Shiyu Zhao 40 / 43
REINFORCE algorithm
Shiyu Zhao 41 / 43
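A minimal sketch of a REINFORCE update, reusing the grad_log_pi helper from the softmax sketch above; the episode format and the per-state feature map phi(s) are assumptions:

```python
import numpy as np

def reinforce_update(theta, episode, phi, alpha=0.001, gamma=0.9):
    """One policy update from a single episode generated by following pi(theta).

    `episode` is a list of (s, a, r) tuples; `phi(s)` returns the
    (num_actions, dim) feature matrix used by the softmax policy above.
    """
    # Monte Carlo estimate q_t(s_t, a_t): the discounted return from step t onward.
    g = 0.0
    returns = []
    for (_, _, r) in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    for (s, a, _), q_t in zip(episode, returns):
        # theta <- theta + alpha * q_t * grad_theta ln pi(a_t | s_t, theta)
        theta = theta + alpha * q_t * grad_log_pi(theta, phi(s), a)
    return theta
```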
Outline
5 Summary
Shiyu Zhao 42 / 43
Summary
Shiyu Zhao 43 / 43