Mathematical Foundations of Reinforcement Learning: Lecture Slides

Lecture 1: Basic Concepts in Reinforcement Learning

Shiyu Zhao

School of Engineering, Westlake University


Contents

• First, introduce fundamental concepts in reinforcement learning (RL) by


examples.

• Second, formalize the concepts in the context of Markov decision processes.

Shiyu Zhao 1 / 26
A grid-world example

[Figure: two grid-world layouts, each with a start cell, forbidden cells, and a target cell.]

An illustrative example used throughout this course:


• Grid of cells: Accessible/forbidden/target cells, boundary.
• Very easy to understand and useful for illustration
Task:
• Given any starting area, find a “good” way to the target.
• How to define “good”? Avoid forbidden cells, detours, or boundary.

Shiyu Zhao 2 / 26
State

State: The status of the agent with respect to the environment.


• For the grid-world example, the location of the agent is the state. There are
nine possible locations and hence nine states: s1 , s2 , . . . , s9 .

s1 s2 s3

s4 s5 s6

s7 s8 s9

State space: the set of all states $\mathcal{S} = \{s_i\}_{i=1}^{9}$.

Shiyu Zhao 3 / 26
Action

Action: For each state, there are five possible actions: a1 , . . . , a5


• a1 : move upwards;
• a2 : move rightwards;
• a3 : move downwards;
• a4 : move leftwards;
• a5 : stay unchanged;

[Figure: the 3×3 grid of states s1, ..., s9 and the five possible actions at a state: a1 (up), a2 (right), a3 (down), a4 (left), a5 (stay).]

Action space of a state: the set of all possible actions of a state.


$\mathcal{A}(s_i) = \{a_i\}_{i=1}^{5}$.
Question: can different states have different sets of actions?
Shiyu Zhao 4 / 26
State transition

s1 s2 s3

s4 s5 s6

s7 s8 s9

When taking an action, the agent may move from one state to another. Such a
process is called state transition.
• At state s1 , if we choose action a2 , then what is the next state?
$$s_1 \xrightarrow{a_2} s_2$$

• At state s1 , if we choose action a1 , then what is the next state?


$$s_1 \xrightarrow{a_1} s_1$$

• State transition defines the interaction with the environment.


• Question: Can we define the state transition in other ways? Yes.
Shiyu Zhao 5 / 26
State transition

s1 s2 s3

s4 s5 s6

s7 s8 s9

Forbidden area: At state s5 , if we choose action a2 , then what is the next


state?
• Case 1: the forbidden area is accessible but with penalty. Then,
$$s_5 \xrightarrow{a_2} s_6$$

• Case 2: the forbidden area is inaccessible (e.g., surrounded by a wall)


$$s_5 \xrightarrow{a_2} s_5$$

We consider the first case, which is more general and challenging.

Shiyu Zhao 6 / 26
State transition

s1 s2 s3

s4 s5 s6

s7 s8 s9

Tabular representation: We can use a table to describe the state transition:

a1 (upwards) a2 (rightwards) a3 (downwards) a4 (leftwards) a5 (unchanged)


s1 s1 s2 s4 s1 s1
s2 s2 s3 s5 s1 s2
s3 s3 s3 s6 s2 s3
s4 s1 s5 s7 s4 s4
s5 s2 s6 s8 s4 s5
s6 s3 s6 s9 s5 s6
s7 s4 s8 s7 s7 s7
s8 s5 s9 s8 s7 s8
s9 s6 s9 s9 s8 s9

Can only represent deterministic cases.
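As a side illustration (not part of the original slides), this deterministic table can be stored as a nested Python dictionary; the state and action names follow the convention above.

```python
# Deterministic state-transition table for the 3x3 grid world above.
# next_state[s][a] is the state reached from s after taking a
# (a1=up, a2=right, a3=down, a4=left, a5=stay).
next_state = {
    "s1": {"a1": "s1", "a2": "s2", "a3": "s4", "a4": "s1", "a5": "s1"},
    "s2": {"a1": "s2", "a2": "s3", "a3": "s5", "a4": "s1", "a5": "s2"},
    "s3": {"a1": "s3", "a2": "s3", "a3": "s6", "a4": "s2", "a5": "s3"},
    "s4": {"a1": "s1", "a2": "s5", "a3": "s7", "a4": "s4", "a5": "s4"},
    "s5": {"a1": "s2", "a2": "s6", "a3": "s8", "a4": "s4", "a5": "s5"},
    "s6": {"a1": "s3", "a2": "s6", "a3": "s9", "a4": "s5", "a5": "s6"},
    "s7": {"a1": "s4", "a2": "s8", "a3": "s7", "a4": "s7", "a5": "s7"},
    "s8": {"a1": "s5", "a2": "s9", "a3": "s8", "a4": "s7", "a5": "s8"},
    "s9": {"a1": "s6", "a2": "s9", "a3": "s9", "a4": "s8", "a5": "s9"},
}

print(next_state["s1"]["a2"])  # -> "s2"
```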


Shiyu Zhao 7 / 26
State transition

s1 s2 s3

s4 s5 s6

s7 s8 s9

State transition probability: use probability to describe state transition!


• Intuition: At state s1 , if we choose action a2 , the next state is s2 .
• Math:

p(s2|s1, a2) = 1
p(si|s1, a2) = 0, ∀i ≠ 2

Here it is a deterministic case. The state transition could be stochastic (for


example, wind gust).

Shiyu Zhao 8 / 26
Policy

Policy tells the agent what actions to take at a state.


Intuitive representation: The arrows demonstrate a policy.

Based on this policy, we get the following paths with different starting points.

Shiyu Zhao 9 / 26
Policy

Mathematical representation: using conditional probability


For example, for state s1 :

π(a1 |s1 ) = 0
π(a2 |s1 ) = 1
π(a3 |s1 ) = 0
π(a4 |s1 ) = 0
π(a5 |s1 ) = 0

It is a deterministic policy.
Shiyu Zhao 10 / 26
Policy

There are stochastic policies.


For example:
[Figure: a grid-world policy in which the agent at s1 moves rightwards or downwards, each with probability 0.5.]

In this policy, for s1 :


π(a1 |s1 ) = 0
π(a2 |s1 ) = 0.5
π(a3 |s1 ) = 0.5
π(a4 |s1 ) = 0
π(a5 |s1 ) = 0

Shiyu Zhao 11 / 26
Policy


Tabular representation of a policy: how to use this table.


a1 (upwards) a2 (rightwards) a3 (downwards) a4 (leftwards ) a5 (unchanged)
s1 0 0.5 0.5 0 0
s2 0 0 1 0 0
s3 0 0 0 1 0
s4 0 1 0 0 0
s5 0 0 1 0 0
s6 0 0 1 0 0
s7 0 1 0 0 0
s8 0 1 0 0 0
s9 0 0 0 0 1

Can represent either deterministic or stochastic cases.
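A minimal Python sketch (my own illustration, not from the slides) of the tabular policy above and of sampling an action from it:

```python
import random

# Tabular policy: pi[s][a] = pi(a|s), matching the table above.
pi = {
    "s1": {"a1": 0.0, "a2": 0.5, "a3": 0.5, "a4": 0.0, "a5": 0.0},
    "s2": {"a1": 0.0, "a2": 0.0, "a3": 1.0, "a4": 0.0, "a5": 0.0},
    "s3": {"a1": 0.0, "a2": 0.0, "a3": 0.0, "a4": 1.0, "a5": 0.0},
    "s4": {"a1": 0.0, "a2": 1.0, "a3": 0.0, "a4": 0.0, "a5": 0.0},
    "s5": {"a1": 0.0, "a2": 0.0, "a3": 1.0, "a4": 0.0, "a5": 0.0},
    "s6": {"a1": 0.0, "a2": 0.0, "a3": 1.0, "a4": 0.0, "a5": 0.0},
    "s7": {"a1": 0.0, "a2": 1.0, "a3": 0.0, "a4": 0.0, "a5": 0.0},
    "s8": {"a1": 0.0, "a2": 1.0, "a3": 0.0, "a4": 0.0, "a5": 0.0},
    "s9": {"a1": 0.0, "a2": 0.0, "a3": 0.0, "a4": 0.0, "a5": 1.0},
}

def sample_action(state):
    """Draw an action a ~ pi(.|state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))  # "a2" or "a3", each with probability 0.5
```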


Shiyu Zhao 12 / 26
Reward

Reward is one of the most unique concepts of RL.


Reward: a real number we get after taking an action.
• A positive reward encourages the agent to take such actions.
• A negative reward punishes the agent for taking such actions.
Questions:
• What about a zero reward? No punishment.
• Can positive mean punishment? Yes.

Shiyu Zhao 13 / 26
Reward

s1 s2 s3

s4 s5 s6

s7 s8 s9

In the grid-world example, the rewards are designed as follows:


• If the agent attempts to get out of the boundary, let rbound = −1
• If the agent attempts to enter a forbidden cell, let rforbid = −1
• If the agent reaches the target cell, let rtarget = +1
• Otherwise, the agent gets a reward of r = 0.
Reward can be interpreted as a human-machine interface, through which we can
guide the agent to behave as we expect.
For example, with the above designed rewards, the agent will try to avoid
getting out of the boundary or stepping into the forbidden cells.
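The reward design above can be sketched in Python as follows. This is an illustration only; it assumes, consistently with the reward table on the next slide, that s6 and s7 are the forbidden cells and s9 is the target.

```python
# 3x3 grid, states s1..s9 laid out row by row.
FORBIDDEN = {"s6", "s7"}      # forbidden cells (assumed from the reward table)
TARGET = "s9"                 # target cell
R_BOUND = R_FORBID = -1.0
R_TARGET, R_OTHER = 1.0, 0.0

ROWS, COLS = 3, 3
MOVES = {"a1": (-1, 0), "a2": (0, 1), "a3": (1, 0), "a4": (0, -1), "a5": (0, 0)}

def reward(state, action):
    """r(s, a): depends only on the state and the action, as in the slides."""
    idx = int(state[1:]) - 1
    r, c = divmod(idx, COLS)
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS):   # attempt to leave the grid
        return R_BOUND
    attempted = f"s{nr * COLS + nc + 1}"
    if attempted in FORBIDDEN:
        return R_FORBID
    if attempted == TARGET:
        return R_TARGET
    return R_OTHER

print(reward("s8", "a2"))  # +1.0: moving right from s8 reaches the target s9
```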

Shiyu Zhao 14 / 26
Reward

s1 s2 s3

s4 s5 s6

s7 s8 s9

Tabular representation of reward transition: how to use the table?


a1 (upwards) a2 (rightwards) a3 (downwards) a4 (leftwards ) a5 (unchanged)
s1 rbound 0 0 rbound 0
s2 rbound 0 0 0 0
s3 rbound rbound rforbid 0 0
s4 0 0 rforbid rbound 0
s5 0 rforbid 0 0 0
s6 0 rbound rtarget 0 rforbid
s7 0 0 rbound rbound rforbid
s8 0 rtarget rbound rforbid 0
s9 rforbid rbound rbound 0 rtarget

Can only represent deterministic cases.


Shiyu Zhao 15 / 26
Reward

s1 s2 s3

s4 s5 s6

s7 s8 s9

Mathematical description: conditional probability


• Intuition: At state s1 , if we choose action a1 , the reward is −1.
• Math: p(r = −1|s1, a1) = 1 and p(r ≠ −1|s1, a1) = 0
Remarks:
• Here it is a deterministic case. The reward transition could be stochastic.
• For example, if you study hard, you will get rewards. But how much is
uncertain.
• The reward depends on the state and action, but not the next state (for
example, consider s1 , a1 and s1 , a5 ).

Shiyu Zhao 16 / 26
Trajectory and return

r=0
s1 s2 s3
r=0

s4 s5 s6
r=0

s7 s8 s9
r=1

A trajectory is a state-action-reward chain:


$$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$$

The return of this trajectory is the sum of all the rewards collected along the
trajectory:

return = 0 + 0 + 0 + 1 = 1

Shiyu Zhao 17 / 26
Trajectory and return

s1 s2 s3
r=0

s4 s5 s6
r=-1

s7 s8 s9
r=0 r=+1

A different policy gives a different trajectory:


$$s_1 \xrightarrow[r=0]{a_3} s_4 \xrightarrow[r=-1]{a_3} s_7 \xrightarrow[r=0]{a_2} s_8 \xrightarrow[r=+1]{a_2} s_9$$

The return of this path is:

return = 0 − 1 + 0 + 1 = 0

Shiyu Zhao 18 / 26
Trajectory and return

[Figure: the two trajectories above shown side by side in the grid world, together with their rewards.]

Which policy is better?


• Intuition: the first is better, because it avoids the forbidden areas.
• Mathematics: the first one is better, since it has a greater return!
• Return could be used to evaluate whether a policy is good or not (see details
in the next lecture)!

Shiyu Zhao 19 / 26
Discounted return

r=0
s1 s2 s3
r=0

s4 s5 s6
r=0

s7 s8 s9
r=+1 r=+1, r=+1, r=+1,...

A trajectory may be infinite:


a
2 3a 3 2 a5 5 a a a
s1 −→ s2 −→ s5 −→ s8 −→ s9 −→ s9 −→ s9 . . .

The return is

return = 0 + 0 + 0 + 1 + 1 + 1 + · · · = ∞

The definition is invalid since the return diverges!

Shiyu Zhao 20 / 26
Discounted return

r=0
s1 s2 s3
r=0

s4 s5 s6
r=0

s7 s8 s9
r=+1 r=+1, r=+1, r=+1,...

Need to introduce a discount rate γ ∈ [0, 1)


Discounted return:

$$\text{discounted return} = 0 + \gamma \cdot 0 + \gamma^2 \cdot 0 + \gamma^3 \cdot 1 + \gamma^4 \cdot 1 + \gamma^5 \cdot 1 + \ldots = \gamma^3 (1 + \gamma + \gamma^2 + \ldots) = \gamma^3 \frac{1}{1-\gamma}.$$
Roles: 1) the sum becomes finite; 2) balance the far and near future rewards:
• If γ is close to 0, the value of the discounted return is dominated by the
rewards obtained in the near future.
• If γ is close to 1, the value of the discounted return is dominated by the
rewards obtained in the far future.
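A quick numerical check of the discounted return above (a sketch with γ = 0.9; not part of the slides):

```python
gamma = 0.9

# Closed form: gamma^3 / (1 - gamma)
closed_form = gamma**3 / (1 - gamma)

# Truncated sum as an approximation of the infinite series 0,0,0,1,1,1,...
rewards = [0, 0, 0] + [1] * 1000
approx = sum(gamma**t * r for t, r in enumerate(rewards))

print(closed_form, approx)  # both approximately 7.29
```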
Shiyu Zhao 21 / 26
Episode

When interacting with the environment following a policy, the agent may stop
at some terminal states. The resulting trajectory is called an episode (or a
trial).

r=0
s1 s2 s3
r=0

s4 s5 s6
r=0

s7 s8 s9
r=1

Example: episode
$$s_1 \xrightarrow[r=0]{a_2} s_2 \xrightarrow[r=0]{a_3} s_5 \xrightarrow[r=0]{a_3} s_8 \xrightarrow[r=1]{a_2} s_9$$

An episode is usually assumed to be a finite trajectory. Tasks with episodes are


called episodic tasks.
Shiyu Zhao 22 / 26
Episode

Some tasks may have no terminal states, meaning the interaction with the
environment will never end. Such tasks are called continuing tasks.

In the grid-world example, should we stop after arriving at the target?

In fact, we can treat episodic and continuing tasks in a unified mathematical


way by converting episodic tasks to continuing tasks.
• Option 1: Treat the target state as a special absorbing state. Once the agent
reaches an absorbing state, it never leaves, and all subsequent rewards are
r = 0.
• Option 2: Treat the target state as a normal state with a policy. The agent
can still leave the target state and gain r = +1 when entering the target
state.
We consider option 2 in this course so that we don’t need to distinguish the
target state from the others and can treat it as a normal state.

Shiyu Zhao 23 / 26
Markov decision process (MDP)

Key elements of MDP:


• Sets:
• State: the set of states S
• Action: the set of actions A(s) associated with each state s ∈ S
• Reward: the set of rewards R(s, a)
• Probability distributions:
• State transition probability: at state s, taking action a, the probability of
transitioning to state s′ is p(s′|s, a)
• Reward probability: at state s, taking action a, the probability of getting
reward r is p(r|s, a)
• Policy: at state s, the probability to choose action a is π(a|s)
• Markov property: memoryless property

p(st+1 |at , st , . . . , a0 , s0 ) = p(st+1 |at , st ),


p(rt+1 |at , st , . . . , a0 , s0 ) = p(rt+1 |at , st ).

All the concepts introduced in this lecture can be put into the framework of MDPs.
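As an optional illustration (the container and the names are my own, not the slides' notation), the key elements of an MDP can be grouped in a small Python structure:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class MDP:
    """Container for the MDP elements listed above (a sketch, not a library)."""
    states: List[str]                                # S
    actions: Dict[str, List[str]]                    # A(s) for each state s
    trans_prob: Callable[[str, str, str], float]     # p(s'|s, a)
    reward_prob: Callable[[float, str, str], float]  # p(r|s, a)
    gamma: float                                     # discount rate

# A policy is simply pi(a|s), e.g. a dict of dicts:
def uniform_policy(mdp: MDP) -> Dict[str, Dict[str, float]]:
    return {s: {a: 1.0 / len(mdp.actions[s]) for a in mdp.actions[s]}
            for s in mdp.states}
```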
Shiyu Zhao 24 / 26
Markov decision process (MDP)

The grid world can be abstracted as a more general model, a Markov process.

[Figure: a Markov process with states s1, ..., s9 drawn as circles, connected by arrows labeled with transition probabilities (Prob=0.5 or Prob=1).]

The circles represent states and the links with arrows represent the state
transition.
A Markov decision process becomes a Markov process once the policy is given!

Shiyu Zhao 25 / 26
Summary

By using grid-world examples, we demonstrated the following key concepts:


• State
• Action
• State transition, state transition probability p(s′|s, a)
• Reward, reward probability p(r|s, a)
• Trajectory, episode, return, discounted return
• Markov decision process

Shiyu Zhao 26 / 26
Lecture 2: Bellman Equation

Shiyu Zhao

School of Engineering, Westlake University


Outline

In this lecture:
• A core concept: state value
• A fundamental tool: the Bellman equation

Shiyu Zhao 2 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 3 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 4 / 52
Motivating example 1: Why is return important?

• What is return? The (discounted) sum of the rewards obtained along a trajectory.
• Why is return important? See the following examples.

[Figure: three policies in a 2×2 grid world (states s1, s2, s3, s4; s2 is the forbidden cell and s4 is the target). Policy 1 (left) moves down from s1 (r=0); policy 2 (middle) moves right from s1 into the forbidden cell (r=−1); policy 3 (right) takes either of these two actions at s1 with probability 0.5 each. From s2, s3, and s4, every step collects r=1.]

• Question: From the starting point s1 , which policy is the “best”?


Which is the “worst”?
Intuition: the first is the best and the second is the worst, because of
the forbidden area.
• Question: can we use mathematics to describe such an intuition?
Answer: Return could be used to evaluate policies. See the following.
Shiyu Zhao 5 / 52
Motivating example 1: Why is return important?

[Figure: the three policies in the 2×2 grid world, as above.]

Based on policy 1 (left figure), starting from s1 , the discounted return is

$$\text{return}_1 = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \ldots = \gamma(1 + \gamma + \gamma^2 + \ldots) = \frac{\gamma}{1-\gamma}.$$

Shiyu Zhao 6 / 52
Motivating example 1: Why is return important?

[Figure: the three policies in the 2×2 grid world, as above.]

Exercise: Based on policy 2 (middle figure), starting from s1 , what is


the discounted return?
Answer:

$$\text{return}_2 = -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \ldots = -1 + \gamma(1 + \gamma + \gamma^2 + \ldots) = -1 + \frac{\gamma}{1-\gamma}.$$

Shiyu Zhao 7 / 52
Motivating example 1: Why is return important?

[Figure: the three policies in the 2×2 grid world, as above.]

Policy 3 is stochastic!
Exercise: Based on policy 3 (right figure), starting from s1 , the
discounted return is
Answer:

$$\text{return}_3 = 0.5\left(-1 + \frac{\gamma}{1-\gamma}\right) + 0.5\left(\frac{\gamma}{1-\gamma}\right) = -0.5 + \frac{\gamma}{1-\gamma}.$$

Shiyu Zhao 8 / 52
Motivating example 1: Why is return important?

[Figure: the three policies in the 2×2 grid world, as above.]

In summary, starting from s1 ,

return1 > return3 > return2

The above inequality suggests that the first policy is the best and the
second policy is the worst, which is exactly the same as our intuition.
Calculating return is important to evaluate a policy.
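A short numerical check of this ordering (a sketch with γ = 0.9; not part of the slides):

```python
gamma = 0.9
g = gamma / (1 - gamma)                  # gamma + gamma^2 + ... = 9.0

return1 = 0 + g                          # policy 1: rewards 0, 1, 1, 1, ...
return2 = -1 + g                         # policy 2: rewards -1, 1, 1, 1, ...
return3 = 0.5 * return2 + 0.5 * return1  # policy 3: average over the two branches

print(return1, return3, return2)         # 9.0  8.5  8.0  ->  return1 > return3 > return2
```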

Shiyu Zhao 9 / 52
Motivating example 2: How to calculate return?
While return is important, how to calculate it?

Method 1: by definition
Let vi denote the return obtained starting from si (i = 1, 2, 3, 4)
$$\begin{aligned} v_1 &= r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots \\ v_2 &= r_2 + \gamma r_3 + \gamma^2 r_4 + \ldots \\ v_3 &= r_3 + \gamma r_4 + \gamma^2 r_1 + \ldots \\ v_4 &= r_4 + \gamma r_1 + \gamma^2 r_2 + \ldots \end{aligned}$$

Shiyu Zhao 10 / 52
Motivating example 2: How to calculate return?
While return is important, how to calculate it?

Method 2:

v1 = r1 + γ(r2 + γr3 + . . . ) = r1 + γv2

v2 = r2 + γ(r3 + γr4 + . . . ) = r2 + γv3


v3 = r3 + γ(r4 + γr1 + . . . ) = r3 + γv4
v4 = r4 + γ(r1 + γr2 + . . . ) = r4 + γv1
• The returns rely on each other. Bootstrapping!
Shiyu Zhao 11 / 52
Motivating example 2: How to calculate return?

How to solve these equations? Write in the following matrix-vector form:

$$\underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{v} = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix} + \begin{bmatrix} \gamma v_2 \\ \gamma v_3 \\ \gamma v_4 \\ \gamma v_1 \end{bmatrix} = \underbrace{\begin{bmatrix} r_1 \\ r_2 \\ r_3 \\ r_4 \end{bmatrix}}_{r} + \gamma \underbrace{\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix}}_{P} \underbrace{\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \end{bmatrix}}_{v}$$

which can be rewritten as

v = r + γPv

This is the Bellman equation (for this specific deterministic problem)!!


• Though simple, it demonstrates the core idea: the value of one state
relies on the values of other states.
• The matrix-vector form makes it clearer how to solve for the state values.
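A numpy sketch (not from the slides) of solving v = r + γPv for this four-state example; the rewards r1 = 0, r2 = r3 = r4 = 1 are an illustrative choice of mine:

```python
import numpy as np

gamma = 0.9
# P from the equation above: s1 -> s2 -> s3 -> s4 -> s1.
P = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
# Illustrative immediate rewards r1..r4 (my own choice for the demo).
r = np.array([0.0, 1.0, 1.0, 1.0])

# v = r + gamma * P v  <=>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(4) - gamma * P, r)
print(v)  # returns v1..v4 obtained starting from each state
```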
Shiyu Zhao 12 / 52
Motivating example 2: How to calculate return?
Exercise: Consider the policy shown in the figure. Please write out the
relation among the returns (that is, write out the Bellman equation).

s1 s2
r=0 r=1

s3 s4
r=1 r=1

Answer:
v1 = 0 + γv3
v2 = 1 + γv4
v3 = 1 + γv4
v4 = 1 + γv4
Exercise: How to solve them? We can first calculate v4 , and then
v3 , v2 , v1 .
Shiyu Zhao 13 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 14 / 52
Some notations
Consider the following single-step process:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1}$$

• t, t + 1: discrete time instances


• St : state at time t
• At : the action taken at state St
• Rt+1 : the reward obtained after taking At
• St+1 : the state transited to after taking At
Note that St , At , Rt+1 are all random variables.
This step is governed by the following probability distributions:
• St → At is governed by π(At = a|St = s)
• St , At → Rt+1 is governed by p(Rt+1 = r|St = s, At = a)
• St , At → St+1 is governed by p(St+1 = s′|St = s, At = a)
At this moment, we assume we know the model (i.e., the probability
distributions)!
Shiyu Zhao 15 / 52
Some notations

Consider the following multi-step trajectory:


$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \ldots$$

The discounted return is

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$$

• γ ∈ [0, 1) is a discount rate.


• Gt is also a random variable since Rt+1 , Rt+2 , . . . are random
variables.

Shiyu Zhao 16 / 52
State value
The expectation (also called the expected value or mean) of Gt is defined as
the state-value function or simply state value:
vπ (s) = E[Gt |St = s]
Remarks:
• It is a function of s. It is a conditional expectation with the condition
that the state starts from s.
• It is based on the policy π. For a different policy, the state value may
be different.
• It represents the “value” of a state. If the state value is greater, then
the policy is better because greater cumulative rewards can be
obtained.
Q: What is the relationship between return and state value?
A: The state value is the mean of all possible returns that can be
obtained starting from a state. If everything - π(a|s), p(r|s, a), p(s′|s, a)
- is deterministic, then state value is the same as return.
Shiyu Zhao 17 / 52
State value

Example:

[Figure: the three policies in the 2×2 grid world, as in the motivating example.]

Recall the returns obtained from s1 for the three examples:


$$v_{\pi_1}(s_1) = 0 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = \gamma(1 + \gamma + \gamma^2 + \ldots) = \frac{\gamma}{1-\gamma}$$
$$v_{\pi_2}(s_1) = -1 + \gamma \cdot 1 + \gamma^2 \cdot 1 + \cdots = -1 + \gamma(1 + \gamma + \gamma^2 + \ldots) = -1 + \frac{\gamma}{1-\gamma}$$
$$v_{\pi_3}(s_1) = 0.5\left(-1 + \frac{\gamma}{1-\gamma}\right) + 0.5\left(\frac{\gamma}{1-\gamma}\right) = -0.5 + \frac{\gamma}{1-\gamma}$$
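The expectation can also be checked by simulation. The following sketch (my own, not from the slides) samples returns from s1 under policy 3 and averages them, truncating each trajectory at a finite horizon:

```python
import random

gamma, horizon, episodes = 0.9, 200, 100_000
tail = sum(gamma**t for t in range(1, horizon))   # gamma + gamma^2 + ...

def sample_return():
    """One discounted return from s1 under policy 3 (truncated at `horizon`)."""
    first = -1.0 if random.random() < 0.5 else 0.0   # right (r=-1) or down (r=0)
    return first + tail                              # every later step gives reward +1

estimate = sum(sample_return() for _ in range(episodes)) / episodes
exact = -0.5 + gamma / (1 - gamma)
print(round(estimate, 2), exact)   # both approximately 8.5
```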

Shiyu Zhao 18 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 19 / 52
Bellman equation

• While state value is important, how do we calculate it? The answer lies in
the Bellman equation.
• In a word, the Bellman equation describes the relationship among the
values of all states.
• Next, we derive the Bellman equation.
• There is some math.
• We already have the intuition.

Shiyu Zhao 20 / 52
Deriving the Bellman equation
Consider a random trajectory:
$$S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} R_{t+3},\ \ldots$$

The return $G_t$ can be written as

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \ldots) = R_{t+1} + \gamma G_{t+1}$$

Then, it follows from the definition of the state value that

vπ (s) = E[Gt |St = s]


= E[Rt+1 + γGt+1 |St = s]
= E[Rt+1 |St = s] + γE[Gt+1 |St = s]

Next, calculate the two terms, respectively.


Shiyu Zhao 21 / 52
Deriving the Bellman equation

First, calculate the first term E[Rt+1 |St = s]:


$$\mathbb{E}[R_{t+1}|S_t = s] = \sum_a \pi(a|s)\, \mathbb{E}[R_{t+1}|S_t = s, A_t = a] = \sum_a \pi(a|s) \sum_r p(r|s,a)\, r$$

Note that
• This is the mean of immediate rewards

Shiyu Zhao 22 / 52
Deriving the Bellman equation

Second, calculate the second term E[Gt+1 |St = s]:


$$\begin{aligned} \mathbb{E}[G_{t+1}|S_t = s] &= \sum_{s'} \mathbb{E}[G_{t+1}|S_t = s, S_{t+1} = s']\, p(s'|s) \\ &= \sum_{s'} \mathbb{E}[G_{t+1}|S_{t+1} = s']\, p(s'|s) \\ &= \sum_{s'} v_\pi(s')\, p(s'|s) \\ &= \sum_{s'} v_\pi(s') \sum_a p(s'|s,a)\, \pi(a|s) \end{aligned}$$

Note that
• This is the mean of future rewards
• $\mathbb{E}[G_{t+1}|S_t = s, S_{t+1} = s'] = \mathbb{E}[G_{t+1}|S_{t+1} = s']$ due to the memoryless
Markov property.

Shiyu Zhao 23 / 52
Deriving the Bellman equation

Therefore, we have

$$\begin{aligned} v_\pi(s) &= \mathbb{E}[R_{t+1}|S_t = s] + \gamma\, \mathbb{E}[G_{t+1}|S_t = s] \\ &= \underbrace{\sum_a \pi(a|s) \sum_r p(r|s,a)\, r}_{\text{mean of immediate rewards}} + \gamma \underbrace{\sum_a \pi(a|s) \sum_{s'} p(s'|s,a)\, v_\pi(s')}_{\text{mean of future rewards}} \\ &= \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}. \end{aligned}$$

Highlights:
• The above equation is called the Bellman equation, which characterizes
the relationship among the state-value functions of different states.
• It consists of two terms: the immediate reward term and the future
reward term.
• A set of equations: every state has an equation like this!!!
Shiyu Zhao 24 / 52
Deriving the Bellman equation

Highlights: symbols in the Bellman equation above

• vπ (s) and vπ (s′) are state values to be calculated. Bootstrapping!
• π(a|s) is a given policy. Solving the equation is called policy evaluation.
• p(r|s, a) and p(s′|s, a) represent the dynamic model. What if the model is known or unknown?
Shiyu Zhao 25 / 52
An illustrative example

s1 s2
r=0 r=1

s3 s4
r=1 r=1

Write out the Bellman equation according to the general expression:

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$
This example is simple because the policy is deterministic.
First, consider the state value of s1 :
• π(a = a3|s1) = 1 and π(a ≠ a3|s1) = 0.
• p(s′ = s3|s1, a3) = 1 and p(s′ ≠ s3|s1, a3) = 0.
• p(r = 0|s1, a3) = 1 and p(r ≠ 0|s1, a3) = 0.
Substituting them into the Bellman equation gives
vπ (s1 ) = 0 + γvπ (s3 )
Shiyu Zhao 26 / 52
An illustrative example

s1 s2
r=0 r=1

s3 s4
r=1 r=1

Write out the Bellman equation according to the general expression:

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$

Similarly, it can be obtained that

vπ (s1 ) = 0 + γvπ (s3 ),


vπ (s2 ) = 1 + γvπ (s4 ),
vπ (s3 ) = 1 + γvπ (s4 ),
vπ (s4 ) = 1 + γvπ (s4 ).
Shiyu Zhao 27 / 52
An illustrative example
How to solve them?

vπ (s1 ) = 0 + γvπ (s3 ),


vπ (s2 ) = 1 + γvπ (s4 ),
vπ (s3 ) = 1 + γvπ (s4 ),
vπ (s4 ) = 1 + γvπ (s4 ).

Solve the above equations one by one from the last to the first:
$$v_\pi(s_4) = \frac{1}{1-\gamma}, \quad v_\pi(s_3) = \frac{1}{1-\gamma}, \quad v_\pi(s_2) = \frac{1}{1-\gamma}, \quad v_\pi(s_1) = \frac{\gamma}{1-\gamma}.$$

Shiyu Zhao 28 / 52
An illustrative example

If γ = 0.9, then
$$v_\pi(s_4) = \frac{1}{1-0.9} = 10, \quad v_\pi(s_3) = 10, \quad v_\pi(s_2) = 10, \quad v_\pi(s_1) = \frac{0.9}{1-0.9} = 9.$$
What to do after we have calculated the state values? Be patient (we will
calculate action values and use them to improve the policy).

Shiyu Zhao 29 / 52
Exercise

[Figure: the 2×2 grid world with a stochastic policy: at s1, move rightwards to s2 (r=−1) or downwards to s3 (r=0), each with probability 0.5; s2, s3, and s4 move to or stay at s4 with r=1.]

Exercise:

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$

• write out the Bellman equations for each state.


• solve the state values from the Bellman equations.
• compare with the policy in the last example.

Shiyu Zhao 30 / 52
Exercise

Answer:

vπ (s1 ) = 0.5[0 + γvπ (s3 )] + 0.5[−1 + γvπ (s2 )],


vπ (s2 ) = 1 + γvπ (s4 ),
vπ (s3 ) = 1 + γvπ (s4 ),
vπ (s4 ) = 1 + γvπ (s4 ).

Solve the above equations one by one from the last to the first.
$$v_\pi(s_4) = \frac{1}{1-\gamma}, \quad v_\pi(s_3) = \frac{1}{1-\gamma}, \quad v_\pi(s_2) = \frac{1}{1-\gamma},$$
$$v_\pi(s_1) = 0.5[0 + \gamma v_\pi(s_3)] + 0.5[-1 + \gamma v_\pi(s_2)] = -0.5 + \frac{\gamma}{1-\gamma}.$$
Substituting γ = 0.9 yields

vπ (s4 ) = 10, vπ (s3 ) = 10, vπ (s2 ) = 10, vπ (s1 ) = −0.5 + 9 = 8.5.

Compare with the previous policy. This one is worse.

Shiyu Zhao 31 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 32 / 52
Matrix-vector form of the Bellman equation

Why consider the matrix-vector form?


• How to solve the Bellman equation?
One unknown relies on another unknown.

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$

• The above elementwise form is valid for every state s ∈ S. That means
there are |S| equations like this!
• If we put all the equations together, we have a set of linear equations,
which can be concisely written in a matrix-vector form.
• The matrix-vector form is very elegant and important.

Shiyu Zhao 33 / 52
Matrix-vector form of the Bellman equation

Recall that:

$$v_\pi(s) = \sum_a \pi(a|s) \left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]$$

Rewrite the Bellman equation as

$$v_\pi(s) = r_\pi(s) + \gamma \sum_{s'} p_\pi(s'|s)\, v_\pi(s') \tag{1}$$

where

$$r_\pi(s) \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)\, r, \qquad p_\pi(s'|s) \triangleq \sum_a \pi(a|s)\, p(s'|s,a)$$

Shiyu Zhao 34 / 52
Matrix-vector form of the Bellman equation

Suppose the states could be indexed as si (i = 1, . . . , n).


For state $s_i$, the Bellman equation is

$$v_\pi(s_i) = r_\pi(s_i) + \gamma \sum_{s_j} p_\pi(s_j|s_i)\, v_\pi(s_j)$$

Put all these equations for all the states together and rewrite to a
matrix-vector form

vπ = rπ + γPπ vπ

where
• vπ = [vπ (s1 ), . . . , vπ (sn )]T ∈ Rn
• rπ = [rπ (s1 ), . . . , rπ (sn )]T ∈ Rn
• Pπ ∈ Rn×n , where [Pπ ]ij = pπ (sj |si ), is the state transition matrix
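A numpy sketch (my own illustration; the nested-dictionary model layout is an assumption, not the slides' notation) of how rπ and Pπ can be assembled from π(a|s), p(r|s, a), and p(s′|s, a):

```python
import numpy as np

def build_r_and_P(states, policy, reward_probs, trans_probs):
    """Assemble r_pi and P_pi from the model.

    policy[s][a]       = pi(a|s)
    reward_probs[s][a] = {r: p(r|s,a)}
    trans_probs[s][a]  = {s_next: p(s_next|s,a)}
    """
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    r_pi = np.zeros(n)
    P_pi = np.zeros((n, n))
    for s in states:
        for a, pa in policy[s].items():
            # r_pi(s) = sum_a pi(a|s) sum_r p(r|s,a) r
            r_pi[idx[s]] += pa * sum(p * r for r, p in reward_probs[s][a].items())
            # [P_pi]_{s,s'} = sum_a pi(a|s) p(s'|s,a)
            for s_next, p in trans_probs[s][a].items():
                P_pi[idx[s], idx[s_next]] += pa * p
    return r_pi, P_pi
```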

Shiyu Zhao 35 / 52
Illustrative examples
If there are four states, vπ = rπ + γPπ vπ can be written out as

$$\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi} = \underbrace{\begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix}}_{r_\pi} + \gamma \underbrace{\begin{bmatrix} p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\ p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\ p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\ p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4) \end{bmatrix}}_{P_\pi} \underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}.$$

s1 s2
r=0 r=1

s3 s4
r=1 r=1

For this specific example:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$
Shiyu Zhao 36 / 52
Illustrative examples

[Figure: the 2×2 grid world with the stochastic policy: at s1, move rightwards (r=−1) or downwards (r=0) with probability 0.5 each; the other states move to or stay at s4 with r=1.]

For this specific example:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} 0.5(0) + 0.5(-1) \\ 1 \\ 1 \\ 1 \end{bmatrix} + \gamma \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}.$$
Shiyu Zhao 37 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 38 / 52
Solve state values

Why solve state values?


• Given a policy, finding out the corresponding state values is called
policy evaluation! It is a fundamental problem in RL. It is the
foundation to find better policies.
• It is important to understand how to solve the Bellman equation.

Shiyu Zhao 39 / 52
Solve state values
The Bellman equation in matrix-vector form is
vπ = rπ + γPπ vπ
• The closed-form solution is:
vπ = (I − γPπ )−1 rπ
In practice, we still need to use numerical tools to calculate the matrix
inverse.
Can we avoid the matrix inverse operation? Yes, by iterative
algorithms.

• An iterative solution is:


vk+1 = rπ + γPπ vk
This algorithm leads to a sequence {v0 , v1 , v2 , . . . }. We can show that
vk → vπ = (I − γPπ )−1 rπ , k→∞
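A numpy sketch (not from the slides) comparing the closed-form and iterative solutions for the 2×2 example policy of this lecture (s1 moves to s3 with r = 0; the other states move to or stay at s4 with r = 1; γ = 0.9):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.0, 1.0, 1.0, 1.0])
P_pi = np.array([[0, 0, 1, 0],     # s1 -> s3
                 [0, 0, 0, 1],     # s2 -> s4
                 [0, 0, 0, 1],     # s3 -> s4
                 [0, 0, 0, 1]],    # s4 -> s4
                dtype=float)

# Closed-form solution: v = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi v_k
v = np.zeros(4)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed)  # [ 9. 10. 10. 10.]
print(v)         # converges to the same values
```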
Shiyu Zhao 40 / 52
Solve state values (optional)

Proof.
Define the error as δk = vk − vπ. We only need to show δk → 0. Substituting vk+1 = δk+1 + vπ and vk = δk + vπ into vk+1 = rπ + γPπ vk gives

$$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi(\delta_k + v_\pi),$$

which can be rewritten as

$$\delta_{k+1} = -v_\pi + r_\pi + \gamma P_\pi \delta_k + \gamma P_\pi v_\pi = \gamma P_\pi \delta_k.$$

As a result,

$$\delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \cdots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$

Note that $0 \le P_\pi^k \le 1$ elementwise, that is, every entry of $P_\pi^k$ is no greater than 1 for any k = 0, 1, 2, . . . , because $P_\pi^k \mathbf{1} = \mathbf{1}$, where $\mathbf{1} = [1, \ldots, 1]^T$. On the other hand, since γ < 1, we know $\gamma^k \to 0$, and hence $\delta_{k+1} = \gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$ as k → ∞.

Shiyu Zhao 41 / 52
Solve state values
Examples: rboundary = rforbidden = −1, rtarget = +1, γ = 0.9
The following are two “good” policies and their state values. The two policies differ only at the top two states in the fourth column.

[Figure: the two policies shown as arrows in the 5×5 grid. Both yield the same state values:]

3.5  3.9  4.3  4.8  5.3
3.1  3.5  4.8  5.3  5.9
2.8  2.5  10.0 5.9  6.6
2.5  10.0 10.0 10.0 7.3
2.3  9.0  10.0 9.0  8.1

Shiyu Zhao 42 / 52
Solve state values
Examples: rboundary = rforbidden = −1, rtarget = +1, γ = 0.9
The following are two “bad” policies and the state values. The state
values are less than those of the good policies.
[Figure: the two “bad” policies shown as arrows in the 5×5 grid, with their state values.]

First bad policy:
-6.6  -7.3  -8.1  -9.0  -10.0
-8.5  -8.3  -8.1  -9.0  -10.0
-7.5  -8.3  -8.1  -9.0  -10.0
-7.5  -7.2  -9.1  -9.0  -10.0
-7.6  -7.3  -8.1  -9.0  -10.0

Second bad policy:
 0.0    0.0   0.0  -10.0 -10.0
-9.0  -10.0  -0.4   -0.5 -10.0
-10.0  -0.5   0.5   -0.5   0.0
 0.0   -1.0  -0.5   -0.5 -10.0
 0.0    0.0   0.0    0.0   0.0

Shiyu Zhao 43 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 44 / 52
Action value

From state value to action value:


• State value: the average return the agent can get starting from a state.
• Action value: the average return the agent can get starting from a
state and taking an action.
Why do we care about action values? Because we want to know which action is
better. This point will become clearer in the following lectures.
We will frequently use action values.

Shiyu Zhao 45 / 52
Action value

Definition:
qπ (s, a) = E[Gt |St = s, At = a]

• qπ (s, a) is a function of the state-action pair (s, a)


• qπ (s, a) depends on π
It follows from the properties of conditional expectation that

$$\underbrace{\mathbb{E}[G_t|S_t = s]}_{v_\pi(s)} = \sum_a \underbrace{\mathbb{E}[G_t|S_t = s, A_t = a]}_{q_\pi(s,a)}\, \pi(a|s)$$

Hence,

$$v_\pi(s) = \sum_a \pi(a|s)\, q_\pi(s,a) \tag{2}$$

Shiyu Zhao 46 / 52
Action value

Recall that the state value is given by

$$v_\pi(s) = \sum_a \pi(a|s) \underbrace{\left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]}_{q_\pi(s,a)} \tag{3}$$

By comparing (2) and (3), we have the action-value function as

$$q_\pi(s,a) = \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \tag{4}$$

(2) and (4) are the two sides of the same coin:
• (2) shows how to obtain state values from action values.
• (4) shows how to obtain action values from state values.
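A minimal sketch of equation (4) in Python (my own illustration, reusing the nested-dictionary model layout assumed earlier):

```python
def action_value(s, a, v, gamma, reward_probs, trans_probs, idx):
    """q_pi(s, a) = sum_r p(r|s,a) r + gamma * sum_{s'} p(s'|s,a) v_pi(s').

    v is a vector of state values, idx maps state names to positions in v.
    """
    immediate = sum(p * r for r, p in reward_probs[s][a].items())
    future = sum(p * v[idx[s_next]] for s_next, p in trans_probs[s][a].items())
    return immediate + gamma * future
```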

Shiyu Zhao 47 / 52
Illustrative example for action value

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Write out the action values for state s1 .

qπ (s1 , a2 ) = −1 + γvπ (s2 ),

Questions:
• qπ (s1 , a1 ), qπ (s1 , a3 ), qπ (s1 , a4 ), qπ (s1 , a5 ) =? Be careful!

Shiyu Zhao 48 / 52
Illustrative example for action value

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

For the other actions:

qπ (s1 , a1 ) = −1 + γvπ (s1 ),


qπ (s1 , a3 ) = 0 + γvπ (s3 ),
qπ (s1 , a4 ) = −1 + γvπ (s1 ),
qπ (s1 , a5 ) = 0 + γvπ (s1 ).

Shiyu Zhao 49 / 52
Illustrative example for action value

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Highlights:
• Action value is important since we care about which action to take.
• We can first calculate all the state values and then calculate the action
values.
• We can also directly calculate the action values with or without models.

Shiyu Zhao 50 / 52
Outline

1 Motivating examples

2 State value

3 Bellman equation: Derivation

4 Bellman equation: Matrix-vector form

5 Bellman equation: Solve the state values

6 Action value

7 Summary

Shiyu Zhao 51 / 52
Summary
Key concepts and results:
• State value: vπ (s) = E[Gt |St = s]
• Action value: qπ (s, a) = E[Gt |St = s, At = a]
• The Bellman equation (elementwise form):
$$v_\pi(s) = \sum_a \pi(a|s) \underbrace{\left[ \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v_\pi(s') \right]}_{q_\pi(s,a)} = \sum_a \pi(a|s)\, q_\pi(s,a)$$

• The Bellman equation (matrix-vector form):

vπ = rπ + γPπ vπ

• How to solve the Bellman equation: closed-form solution, iterative


solution
Shiyu Zhao 52 / 52
Optimal Policy
and
Bellman Optimality Equation

Shiyu Zhao
Outline

In this lecture:
• Core concepts: optimal state value and optimal policy
• A fundamental tool: the Bellman optimality equation (BOE)

Shiyu Zhao 2 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 3 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 4 / 50
Motivating examples

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Bellman equation:

vπ (s1 ) = −1 + γvπ (s2 ),


vπ (s2 ) = +1 + γvπ (s4 ),
vπ (s3 ) = +1 + γvπ (s4 ),
vπ (s4 ) = +1 + γvπ (s4 ).

State value: Let γ = 0.9. Then, it can be calculated that

vπ (s4 ) = vπ (s3 ) = vπ (s2 ) = 10, vπ (s1 ) = 8.


Shiyu Zhao 5 / 50
Motivating examples

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Action value: consider s1

qπ (s1 , a1 ) = −1 + γvπ (s1 ) = 6.2,


qπ (s1 , a2 ) = −1 + γvπ (s2 ) = 8,
qπ (s1 , a3 ) = 0 + γvπ (s3 ) = 9,
qπ (s1 , a4 ) = −1 + γvπ (s1 ) = 6.2,
qπ (s1 , a5 ) = 0 + γvπ (s1 ) = 7.2.

Shiyu Zhao 6 / 50
Motivating examples

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Question: While the policy is not good, how can we improve it?
Answer: by using action values.
The current policy π(a|s1 ) is
$$\pi(a|s_1) = \begin{cases} 1 & a = a_2 \\ 0 & a \neq a_2 \end{cases}$$

Shiyu Zhao 7 / 50
Motivating examples

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Observe the action values that we obtained just now:

qπ (s1 , a1 ) = 6.2, qπ (s1 , a2 ) = 8, qπ (s1 , a3 ) = 9,


qπ (s1 , a4 ) = 6.2, qπ (s1 , a5 ) = 7.2.

What if we select the greatest action value? Then, a new policy is obtained:

$$\pi_{\text{new}}(a|s_1) = \begin{cases} 1 & a = a^* \\ 0 & a \neq a^* \end{cases}$$

where $a^* = \arg\max_a q_\pi(s_1, a) = a_3$.
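The greedy selection above, sketched in Python (the numbers are the action values computed for s1 on the previous slides; the function name is my own):

```python
q_s1 = {"a1": 6.2, "a2": 8.0, "a3": 9.0, "a4": 6.2, "a5": 7.2}

def greedy_policy(q_values):
    """Return pi_new(a|s): probability 1 on the action with the largest q."""
    best = max(q_values, key=q_values.get)
    return {a: (1.0 if a == best else 0.0) for a in q_values}

print(greedy_policy(q_s1))  # a3 gets probability 1, all other actions 0
```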


Shiyu Zhao 8 / 50
Motivating examples

s1 s2
r=-1 r=1

s3 s4
r=1 r=1

Question: why can doing this improve the policy?

• Intuition: action values can be used to evaluate actions.


• Math: nontrivial and will be introduced in this lecture.

Shiyu Zhao 9 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 10 / 50
Optimal policy

The state value could be used to evaluate if a policy is good or not: if

vπ1 (s) ≥ vπ2 (s) for all s ∈ S

then π1 is “better” than π2 .


Definition
A policy π ∗ is optimal if vπ∗ (s) ≥ vπ (s) for all s and for any other policy
π.

The definition leads to many questions:


• Does the optimal policy exist?
• Is the optimal policy unique?
• Is the optimal policy stochastic or deterministic?
• How to obtain the optimal policy?
To answer these questions, we study the Bellman optimality equation.
Shiyu Zhao 11 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 12 / 50
Bellman optimality equation (BOE)

Bellman optimality equation (elementwise form):

$$\begin{aligned} v(s) &= \max_\pi \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S} \\ &= \max_\pi \sum_a \pi(a|s)\, q(s,a), \quad s \in \mathcal{S} \end{aligned}$$

Remarks:
• p(r|s, a), p(s′|s, a) are known.
• v(s), v(s′) are unknown and to be calculated.
• Is π(s) known or unknown?

Shiyu Zhao 13 / 50
Bellman optimality equation (BOE)

Bellman optimality equation (matrix-vector form):

$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$

where the elements corresponding to s or s′ are

$$[r_\pi]_s \triangleq \sum_a \pi(a|s) \sum_r p(r|s,a)\, r, \qquad [P_\pi]_{s,s'} = p(s'|s) \triangleq \sum_a \pi(a|s)\, p(s'|s,a)$$

Here maxπ is performed elementwise.

Shiyu Zhao 14 / 50
Bellman optimality equation (BOE)

Bellman optimality equation (matrix-vector form):

v = max(rπ + γPπ v)
π

BOE is tricky yet elegant!


• Why elegant? It describes the optimal policy and optimal state value
in an elegant way.
• Why tricky? There is a maximization on the right-hand side, and it may not
be straightforward to see how to compute it.
• Many questions to answer:
• Algorithm: how to solve this equation?
• Existence: does this equation have solutions?
• Uniqueness: is the solution to this equation unique?
• Optimality: how is it related to optimal policy?
Shiyu Zhao 15 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 16 / 50
Maximization on the right-hand side of BOE

BOE: elementwise form

$$v(s) = \max_\pi \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S}$$

BOE: matrix-vector form

$$v = \max_\pi (r_\pi + \gamma P_\pi v)$$


Example (How to solve two unknowns from one equation)
Consider two variables x, a ∈ ℝ. Suppose they satisfy

$$x = \max_a (2x - 1 - a^2).$$

This equation has two unknowns. To solve them, first consider the right-hand side. Regardless of the value of x, $\max_a (2x - 1 - a^2) = 2x - 1$, where the maximum is achieved at a = 0. Second, when a = 0, the equation becomes x = 2x − 1, which gives x = 1. Therefore, a = 0 and x = 1 are the solution of the equation.
Shiyu Zhao 17 / 50
Maximization on the right-hand side of BOE

Fix v(s′) on the right-hand side first and solve for π:

$$\begin{aligned} v(s) &= \max_\pi \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S} \\ &= \max_\pi \sum_a \pi(a|s)\, q(s,a) \end{aligned}$$

Example (How to solve $\max_\pi \sum_a \pi(a|s)\, q(s,a)$)
Suppose q1, q2, q3 ∈ ℝ are given. Find c1*, c2*, c3* solving

$$\max_{c_1, c_2, c_3} \ c_1 q_1 + c_2 q_2 + c_3 q_3,$$

where c1 + c2 + c3 = 1 and c1, c2, c3 ≥ 0.
Without loss of generality, suppose q3 ≥ q1, q2. Then, the optimal solution is c3* = 1 and c1* = c2* = 0. That is because, for any c1, c2, c3,

$$q_3 = (c_1 + c_2 + c_3)\, q_3 = c_1 q_3 + c_2 q_3 + c_3 q_3 \ge c_1 q_1 + c_2 q_2 + c_3 q_3.$$
Shiyu Zhao 18 / 50
Maximization on the right-hand side of BOE

Fix v(s′) on the right-hand side first and solve for π:

$$\begin{aligned} v(s) &= \max_\pi \sum_a \pi(a|s) \left( \sum_r p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right), \quad \forall s \in \mathcal{S} \\ &= \max_\pi \sum_a \pi(a|s)\, q(s,a) \end{aligned}$$

Inspired by the above example, considering that $\sum_a \pi(a|s) = 1$, we have

$$\max_\pi \sum_a \pi(a|s)\, q(s,a) = \max_{a \in \mathcal{A}(s)} q(s,a),$$

where the optimality is achieved when

$$\pi(a|s) = \begin{cases} 1 & a = a^* \\ 0 & a \neq a^* \end{cases}$$

where $a^* = \arg\max_a q(s,a)$.


Shiyu Zhao 19 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 20 / 50
Solve the Bellman optimality equation

The BOE is $v = \max_\pi(r_\pi + \gamma P_\pi v)$. Let

$$f(v) := \max_\pi(r_\pi + \gamma P_\pi v)$$

Then, the Bellman optimality equation becomes

$$v = f(v)$$

where

$$[f(v)]_s = \max_\pi \sum_a \pi(a|s)\, q(s,a), \quad s \in \mathcal{S}$$

Next, how to solve the equation?

Shiyu Zhao 21 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 22 / 50
Preliminaries: Contraction mapping theorem

Some concepts:
• Fixed point: x ∈ X is a fixed point of f : X → X if

f (x) = x

• Contraction mapping (or contractive function): f is a contraction


mapping if
‖f(x1) − f(x2)‖ ≤ γ‖x1 − x2‖

where γ ∈ (0, 1).


• γ must be strictly less than 1 so that limits such as γᵏ → 0 as k → ∞ hold.
• Here ‖·‖ can be any vector norm.

Shiyu Zhao 23 / 50
Preliminaries: Contraction mapping theorem

Examples to demonstrate the concepts.


Example
• x = f (x) = 0.5x, x ∈ R.
It is easy to verify that x = 0 is a fixed point since 0 = 0.5 × 0.
Moreover, f (x) = 0.5x is a contraction mapping because
‖0.5x1 − 0.5x2‖ = 0.5‖x1 − x2‖ ≤ γ‖x1 − x2‖ for any γ ∈ [0.5, 1).

• x = f(x) = Ax, where x ∈ Rⁿ, A ∈ Rⁿˣⁿ and ‖A‖ ≤ γ < 1.


It is easy to verify that x = 0 is a fixed point since 0 = A0. To see the
contraction property,
‖Ax1 − Ax2‖ = ‖A(x1 − x2)‖ ≤ ‖A‖ ‖x1 − x2‖ ≤ γ‖x1 − x2‖.
Therefore, f (x) = Ax is a contraction mapping.

Shiyu Zhao 24 / 50
Preliminaries: Contraction mapping theorem

Theorem (Contraction Mapping Theorem)


For any equation that has the form of x = f (x), if f is a contraction
mapping, then
• Existence: there exists a fixed point x∗ satisfying f (x∗ ) = x∗ .
• Uniqueness: The fixed point x∗ is unique.
• Algorithm: Consider a sequence {xk } where xk+1 = f (xk ), then
xk → x∗ as k → ∞. Moreover, the convergence rate is exponentially
fast.

For the proof of this theorem, see the book.

Shiyu Zhao 25 / 50
Preliminaries: Contraction mapping theorem

Examples:

• x = 0.5x, where f (x) = 0.5x and x ∈ R


x∗ = 0 is the unique fixed point. It can be solved iteratively by

xk+1 = 0.5xk

• x = Ax, where f(x) = Ax and x ∈ Rⁿ, A ∈ Rⁿˣⁿ and ‖A‖ < 1


x∗ = 0 is the unique fixed point. It can be solved iteratively by

xk+1 = Axk

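A minimal Python sketch of these two fixed-point iterations; the matrix A below is a made-up example whose norm is less than 1:

import numpy as np

x = 10.0
for _ in range(50):
    x = 0.5 * x          # x_{k+1} = 0.5 x_k
print(x)                 # close to 0: converges to the fixed point x* = 0

A = np.array([[0.2, 0.1],
              [0.0, 0.3]])   # e.g., the infinity norm of A is 0.3 < 1
x = np.ones(2)
for _ in range(50):
    x = A @ x            # x_{k+1} = A x_k
print(x)                 # close to [0, 0]: converges to the fixed point x* = 0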
Shiyu Zhao 26 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 27 / 50
Contraction property of BOE

Let’s come back to the Bellman optimality equation:

v = f(v) = maxπ (rπ + γPπ v)

Theorem (Contraction Property)


f (v) is a contraction mapping satisfying

‖f(v1) − f(v2)‖ ≤ γ‖v1 − v2‖

where γ is the discount rate!

For the proof of this theorem, see our book.

Shiyu Zhao 28 / 50
Solve the Bellman optimality equation

Applying the contraction mapping theorem gives the following results.

Theorem (Existence, Uniqueness, and Algorithm)


For the BOE v = f (v) = maxπ (rπ + γPπ v), there always exists a
solution v ∗ and the solution is unique. The solution could be solved
iteratively by
vk+1 = f(vk) = maxπ (rπ + γPπ vk)

This sequence {vk} converges to v∗ exponentially fast given any initial
guess v0 . The convergence rate is determined by γ.

Shiyu Zhao 29 / 50
Solve the Bellman optimality equation

The iterative algorithm:


Matrix-vector form:

vk+1 = f(vk) = maxπ (rπ + γPπ vk)

Elementwise form:
vk+1(s) = maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′) ]
        = maxπ Σa π(a|s) qk(s,a)
        = maxa qk(s,a)

Shiyu Zhao 30 / 50
Solve the Bellman optimality equation

Procedure summary:

• For any s, current estimated value vk (s)


• For any a ∈ A(s), calculate qk(s,a) = Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′)
• Calculate the greedy policy πk+1 for s as πk+1(a|s) = 1 if a = a∗k(s) and πk+1(a|s) = 0 if a ≠ a∗k(s), where a∗k(s) = arg maxa qk(s,a).


• Calculate vk+1 (s) = maxa qk (s, a)

The above algorithm is actually the value iteration algorithm as discussed


in the next lecture.
Shiyu Zhao 31 / 50
Example

[Figure: a three-cell grid world with states s1, s2, s3, where s2 is the target area]

Example: Manually solve the BOE.

• Why manually? To understand the algorithm better.

• Why such a simple example? It can be calculated manually.

Actions: aℓ, a0, ar represent going left, staying unchanged, and going right.

Reward: entering the target area: +1; trying to go beyond the boundary: −1.

Shiyu Zhao 32 / 50
Example

[Figure: a three-cell grid world with states s1, s2, s3, where s2 is the target area]

The values of q(s, a)

q-value table a` a0 ar
s1 −1 + γv(s1 ) 0 + γv(s1 ) 1 + γv(s2 )
s2 0 + γv(s1 ) 1 + γv(s2 ) 0 + γv(s3 )
s3 1 + γv(s2 ) 0 + γv(s3 ) −1 + γv(s3 )

Consider γ = 0.9

Shiyu Zhao 33 / 50
Example

Our objective is to find v ∗ (si ) and π ∗


k = 0:
v-value: select v0 (s1 ) = v0 (s2 ) = v0 (s3 ) = 0
q-value (using the previous table):
a` a0 ar
s1 −1 0 1
s2 0 1 0
s3 1 0 −1
Greedy policy (select the greatest q-value)

π(ar |s1 ) = 1, π(a0 |s2 ) = 1, π(a` |s3 ) = 1

v-value: v1 (s) = maxa q0 (s, a)

v1 (s1 ) = v1 (s2 ) = v1 (s3 ) = 1

Is this policy good? Yes!


Shiyu Zhao 34 / 50
Example

• k = 1:
Exercise: With v1(s) calculated in the last step, calculate the q-values by yourself.
q-value:
a` a0 ar
s1 −0.1 0.9 1.9
s2 0.9 1.9 0.9
s3 1.9 0.9 −0.1
Greedy policy (select the greatest q-value):

π(ar |s1 ) = 1, π(a0 |s2 ) = 1, π(a` |s3 ) = 1

The policy is the same as the previous one, which is already optimal.
v-value: v2 (s) = ...
• k = 2, 3, . . .
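A minimal Python sketch that reproduces this iteration for the three-state example, assuming the reward convention stated earlier (+1 for entering the target, −1 for attempting to cross the boundary, 0 otherwise) and γ = 0.9; the array encoding of states and actions below is only for illustration:

import numpy as np

gamma = 0.9
# States s1, s2, s3 encoded as 0, 1, 2; actions (left, stay, right) as 0, 1, 2.
next_state = np.array([[0, 0, 1],    # from s1: left hits the boundary, stay, right -> s2
                       [0, 1, 2],    # from s2 (the target)
                       [1, 2, 2]])   # from s3: right hits the boundary
reward = np.array([[-1, 0, 1],       # +1 for entering the target, -1 for hitting the boundary
                   [ 0, 1, 0],
                   [ 1, 0, -1]])

v = np.zeros(3)
for _ in range(200):
    q = reward + gamma * v[next_state]   # q_k(s, a)
    v_new = q.max(axis=1)                # v_{k+1}(s) = max_a q_k(s, a)
    if np.abs(v_new - v).max() < 1e-6:
        v = v_new
        break
    v = v_new

print(v)                  # approximately [10, 10, 10]
print(q.argmax(axis=1))   # greedy policy [2, 1, 0]: go right, stay, go left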
Shiyu Zhao 35 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 36 / 50
Policy optimality

Suppose v ∗ is the solution to the Bellman optimality equation. It satisfies

v∗ = maxπ (rπ + γPπ v∗)

Suppose

π∗ = arg maxπ (rπ + γPπ v∗)

Then
v ∗ = rπ∗ + γPπ∗ v ∗

Therefore, π ∗ is a policy and v ∗ = vπ∗ is the corresponding state value.


Is π∗ an optimal policy? Is v∗ the greatest state value that can be achieved?

Shiyu Zhao 37 / 50
Policy optimality

Theorem (Policy Optimality)


Suppose that v ∗ is the unique solution to v = maxπ (rπ + γPπ v), and vπ
is the state value function satisfying vπ = rπ + γPπ vπ for any given
policy π, then
v ∗ ≥ vπ , ∀π

For the proof, please see our book.


Now we understand why we study the BOE. That is because it describes
the optimal state value and optimal policy.

Shiyu Zhao 38 / 50
Optimal policy

What does an optimal policy π ∗ look like?

Theorem (Greedy Optimal Policy)

For any s ∈ S, the deterministic greedy policy

π∗(a|s) = 1 if a = a∗(s), and π∗(a|s) = 0 if a ≠ a∗(s),    (1)

is an optimal policy solving the BOE. Here,

a∗(s) = arg maxa q∗(s,a),

where q∗(s,a) := Σr p(r|s,a) r + γ Σs′ p(s′|s,a) v∗(s′).

Proof: simple. π∗(s) = arg maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) v∗(s′) ], where the bracketed term is q∗(s,a).
Shiyu Zhao 39 / 50
Outline

1 Motivating examples

2 Definition of optimal policy

3 BOE: Introduction

4 BOE: Maximization on the right-hand side

5 BOE: Rewrite as v = f (v)

6 Contraction mapping theorem

7 BOE: Solution

8 BOE: Optimality

9 Analyzing optimal policies

Shiyu Zhao 40 / 50
Analyzing optimal policies

What factors determine the optimal policy?


It can be clearly seen from the BOE
v(s) = maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) v(s′) ]

that there are three factors:


• Reward design: r
• System model: p(s′|s,a), p(r|s,a)
• Discount rate: γ
• v(s), v(s′), π(a|s) are unknowns to be calculated
Next, we use examples to show how changing r and γ can change the
optimal policy.

Shiyu Zhao 41 / 50
Analyzing optimal policies

The optimal policy and the corresponding optimal state value are
obtained by solving the BOE.

[Figure: the optimal policy (left) and the optimal state values (right) of the 5×5 grid world]

(a) rboundary = rforbidden = −1, rtarget = 1, γ = 0.9

The optimal policy dares to take risks: entering forbidden areas!!

Shiyu Zhao 42 / 50
Analyzing optimal policies

If we change γ = 0.9 to γ = 0.5

[Figure: the optimal policy (left) and the optimal state values (right) of the 5×5 grid world]

(b) The discount rate is γ = 0.5. Others are the same as (a).

The optimal policy becomes short-sighted! It avoids all the forbidden areas!

Shiyu Zhao 43 / 50
Analyzing optimal policies

If we change γ to 0

[Figure: the optimal policy (left) and the optimal state values (right) of the 5×5 grid world]

(c) The discount rate is γ = 0. Others are the same as (a).

The optimal policy becomes extremely short-sighted! It chooses the action with the greatest immediate reward and cannot reach the target!

Shiyu Zhao 44 / 50
Analyzing optimal policies

If we increase the punishment when entering forbidden areas


(rforbidden = −1 to rforbidden = −10)

[Figure: the optimal policy (left) and the optimal state values (right) of the 5×5 grid world]

(d) rforbidden = −10. Others are the same as (a).

The optimal policy would also avoid the forbidden areas.

Shiyu Zhao 45 / 50
Analyzing optimal policies

What if we change r → ar + b?
For example,

rboundary = rforbidden = −1, rtarget = 1

becomes

rboundary = rforbidden = 0, rtarget = 2, rotherstep = 1

The optimal policy remains the same!


What matters is not the absolute reward values! It is their relative values!

Shiyu Zhao 46 / 50
Analyzing optimal policies

Theorem (Optimal Policy Invariance)


Consider a Markov decision process with v ∗ ∈ R|S| as the optimal state value
satisfying v ∗ = maxπ (rπ + γPπ v ∗ ). If every reward r is changed by an affine
transformation to ar + b, where a, b ∈ R and a ≠ 0, then the corresponding
optimal state value v′ is also an affine transformation of v∗:

v′ = a v∗ + (b/(1 − γ)) 1,

where γ ∈ (0, 1) is the discount rate and 1 = [1, . . . , 1]T . Consequently, the
optimal policies are invariant to the affine transformation of the reward signals.

Shiyu Zhao 47 / 50
Analyzing optimal policies

Meaningless detour?
[Figure: two policies for a 2×2 grid world: (a) the optimal policy, whose state values are 9.0 and 10.0 in the top row and 10.0 and 10.0 in the bottom row; (b) a non-optimal policy that takes a detour, whose top-row state values are 9.0 and 8.1]

The policy in (a) is optimal, the policy in (b) is not.


Question: Why is the optimal policy not (b)? Why does the optimal policy not take meaningless detours? There is no punishment for taking detours!!
Due to the discount rate!
Policy (a): return = 1 + γ·1 + γ²·1 + · · · = 1/(1 − γ) = 10.
Policy (b): return = 0 + γ·0 + γ²·1 + γ³·1 + · · · = γ²/(1 − γ) = 8.1
Shiyu Zhao 48 / 50
Summary

Bellman optimality equation:


• Elementwise form:
v(s) = maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) v(s′) ],  ∀s ∈ S, where the bracketed term is q(s,a)

• Matrix-vector form:
v = maxπ (rπ + γPπ v)

Shiyu Zhao 49 / 50
Summary

Questions about the Bellman optimality equation:


• Existence: does this equation have solutions?
• Yes, by the contraction mapping Theorem
• Uniqueness: is the solution to this equation unique?
• Yes, by the contraction mapping Theorem
• Algorithm: how to solve this equation?
• Iterative algorithm suggested by the contraction mapping Theorem
• Optimality: why do we study this equation?
• Because its solution corresponds to the optimal state value and
optimal policy.
Finally, we understand why it is important to study the BOE!

Shiyu Zhao 50 / 50
Lecture 4: Value Iteration and Policy Iteration

Shiyu Zhao
Outline

1 Value iteration algorithm

2 Policy iteration algorithm

3 Truncated policy iteration algorithm

Shiyu Zhao 2 / 40
Outline

1 Value iteration algorithm

2 Policy iteration algorithm

3 Truncated policy iteration algorithm

Shiyu Zhao 3 / 40
Value iteration algorithm

. How to solve the Bellman optimality equation?

v = f(v) = maxπ (rπ + γPπ v)

. In the last lecture, we learned that the contraction mapping theorem suggests an iterative algorithm:

vk+1 = f(vk) = maxπ (rπ + γPπ vk),  k = 1, 2, 3, . . .

where v0 can be arbitrary.


. This algorithm can eventually find the optimal state value and an
optimal policy.
. This algorithm is called value iteration!
. We will see that the math about the BOE that we have learned finally
pays off!
Shiyu Zhao 4 / 40
Value iteration algorithm

The algorithm

vk+1 = f(vk) = maxπ (rπ + γPπ vk),  k = 1, 2, 3, . . .

can be decomposed into two steps.


• Step 1: policy update. This step is to solve

πk+1 = arg maxπ (rπ + γPπ vk)

where vk is given.
• Step 2: value update.

vk+1 = rπk+1 + γPπk+1 vk

Question: is vk a state value? No, because it is not ensured that vk


satisfies a Bellman equation.
Shiyu Zhao 5 / 40
Value iteration algorithm

. Next, we need to study the elementwise form in order to implement the


algorithm.

• Matrix-vector form is useful for theoretical analysis.


• Elementwise form is useful for implementation.

Shiyu Zhao 6 / 40
Value iteration algorithm - Elementwise form

. Step 1: Policy update


The elementwise form of

πk+1 = arg maxπ (rπ + γPπ vk)

is

πk+1(s) = arg maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′) ],  s ∈ S,

where the bracketed term is qk(s,a).

The optimal policy solving the above optimization problem is

πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 if a ≠ a∗k(s),

where a∗k(s) = arg maxa qk(s,a). πk+1 is called a greedy policy, since it
simply selects the greatest q-value.
Shiyu Zhao 7 / 40
Value iteration algorithm - Elementwise form

. Step 2: Value update


The elementwise form of

vk+1 = rπk+1 + γPπk+1 vk

is

vk+1(s) = Σa πk+1(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′) ],  s ∈ S,

where the bracketed term is qk(s,a).

Since πk+1 is greedy, the above equation is simply

vk+1(s) = maxa qk(s,a)

Shiyu Zhao 8 / 40
Value iteration algorithm - Pseudocode
. Procedure summary:

vk(s) → qk(s,a) → greedy policy πk+1(a|s) → new value vk+1(s) = maxa qk(s,a)

Pseudocode: Value iteration algorithm

Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are
known. Initial guess v0 .
Aim: Search the optimal state value and an optimal policy solving the Bellman
optimality equation.

While vk has not converged in the sense that ‖vk − vk−1‖ is greater than a
predefined small threshold, for the kth iteration, do
For every state s ∈ S, do
For every action a ∈ A(s), do
q-value: qk(s,a) = Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′)
Maximum action value: a∗k(s) = arg maxa qk(s,a)
Policy update: πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise
Value update: vk+1(s) = maxa qk(s,a)

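A minimal tabular implementation sketch of the pseudocode above, assuming the model is supplied as hypothetical arrays P[s, a, s′] = p(s′|s, a) and R[s, a] = Σr p(r|s, a) r:

import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6, max_iter=1000):
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        q = R + gamma * P @ v              # q_k[s, a] = R[s,a] + gamma * sum_s' P[s,a,s'] v_k(s')
        v_new = q.max(axis=1)              # value update
        if np.abs(v_new - v).max() < tol:  # stop when ||v_k - v_{k-1}|| is small
            v = v_new
            break
        v = v_new
    policy = q.argmax(axis=1)              # greedy (deterministic) policy
    return v, policy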
Shiyu Zhao 9 / 40
Value iteration algorithm - Example

. The reward setting is rboundary = rforbidden = −1, rtarget = 1. The


discount rate is γ = 0.9.

[Figure: the 2×2 grid-world example: (a) the environment, (b) the policy obtained at k = 0, (c) the policy obtained at k = 1]

q-table: The expression of q(s, a).

q-value a1 a2 a3 a4 a5
s1 −1 + γv(s1 ) −1 + γv(s2 ) 0 + γv(s3 ) −1 + γv(s1 ) 0 + γv(s1 )
s2 −1 + γv(s2 ) −1 + γv(s2 ) 1 + γv(s4 ) 0 + γv(s1 ) −1 + γv(s2 )
s3 0 + γv(s1 ) 1 + γv(s4 ) −1 + γv(s3 ) −1 + γv(s3 ) 0 + γv(s3 )
s4 −1 + γv(s2 ) −1 + γv(s4 ) −1 + γv(s4 ) 0 + γv(s3 ) 1 + γv(s4 )
Shiyu Zhao 10 / 40
Value iteration algorithm - Example

• k = 0: let v0 (s1 ) = v0 (s2 ) = v0 (s3 ) = v0 (s4 ) = 0


q-value a1 a2 a3 a4 a5
s1 −1 −1 0 −1 0
s2 −1 −1 1 0 −1
s3 0 1 −1 −1 0
s4 −1 −1 −1 0 1
Step 1: Policy update:

π1 (a5 |s1 ) = 1, π1 (a3 |s2 ) = 1, π1 (a2 |s3 ) = 1, π1 (a5 |s4 ) = 1

This policy is visualized in Figure (b).


Step 2: Value update:

v1 (s1 ) = 0, v1 (s2 ) = 1, v1 (s3 ) = 1, v1 (s4 ) = 1.

Shiyu Zhao 11 / 40
Value iteration algorithm - Example

• k = 1: since v1 (s1 ) = 0, v1 (s2 ) = 1, v1 (s3 ) = 1, v1 (s4 ) = 1, we have


q-table a1 a2 a3 a4 a5
s1 −1 + γ0 −1 + γ1 0 + γ1 −1 + γ0 0 + γ0
s2 −1 + γ1 −1 + γ1 1 + γ1 0 + γ0 −1 + γ1
s3 0 + γ0 1 + γ1 −1 + γ1 −1 + γ1 0 + γ1
s4 −1 + γ1 −1 + γ1 −1 + γ1 0 + γ1 1 + γ1
Step 1: Policy update:

π2 (a3 |s1 ) = 1, π2 (a3 |s2 ) = 1, π2 (a2 |s3 ) = 1, π2 (a5 |s4 ) = 1.

Step 2: Value update:

v2 (s1 ) = γ1, v2 (s2 ) = 1 + γ1, v2 (s3 ) = 1 + γ1, v2 (s4 ) = 1 + γ1.

This policy is visualized in Figure (c).


The policy is already optimal!!
• k = 2, 3, . . . : Stop when ‖vk − vk+1‖ is smaller than a predefined threshold.
Shiyu Zhao 12 / 40
Outline

1 Value iteration algorithm

2 Policy iteration algorithm

3 Truncated policy iteration algorithm

Shiyu Zhao 13 / 40
Policy iteration algorithm

. Algorithm description:
Given a random initial policy π0 ,
• Step 1: policy evaluation (PE)
This step is to calculate the state value of πk :

vπk = rπk + γPπk vπk

Note that vπk is a state value function.


• Step 2: policy improvement (PI)

πk+1 = arg maxπ (rπ + γPπ vπk)

The maximization is componentwise!

Shiyu Zhao 14 / 40
Policy iteration algorithm

. The algorithm leads to a sequence


PE PI PE PI PE PI
π0 −−→ vπ0 −−→ π1 −−→ vπ1 −−→ π2 −−→ vπ2 −−→ . . .

PE=policy evaluation, PI=policy improvement


. Questions:
• Q1: In the policy evaluation step, how to get the state value vπk by
solving the Bellman equation?
• Q2: In the policy improvement step, why is the new policy πk+1 better
than πk ?
• Q3: Why such an iterative algorithm can finally reach an optimal
policy?
• Q4: What is the relationship between this policy iteration algorithm
and the previous value iteration algorithm?
Shiyu Zhao 15 / 40
Policy iteration algorithm

. Q1: In the policy evaluation step, how to get the state value vπk
by solving the Bellman equation?

vπk = rπk + γPπk vπk

• Closed-form solution:

vπk = (I − γPπk )−1 rπk

• Iterative solution:

vπk^(j+1) = rπk + γPπk vπk^(j),  j = 0, 1, 2, . . .

. Already studied in the lecture on Bellman equation.


. Policy iteration is an iterative algorithm with another iterative
algorithm embedded in the policy evaluation step!
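A minimal sketch of both ways to solve the policy evaluation step, assuming hypothetical arrays P_pi[s, s′] = p(s′|s) under πk and r_pi[s] for the expected one-step reward under πk:

import numpy as np

def policy_evaluation_closed_form(P_pi, r_pi, gamma):
    n = len(r_pi)
    # v_pi = (I - gamma * P_pi)^(-1) r_pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_evaluation_iterative(P_pi, r_pi, gamma, tol=1e-8):
    v = np.zeros_like(r_pi)
    while True:
        v_new = r_pi + gamma * P_pi @ v    # v^(j+1) = r_pi + gamma * P_pi v^(j)
        if np.abs(v_new - v).max() < tol:
            return v_new
        v = v_new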
Shiyu Zhao 16 / 40
Policy iteration algorithm

. Q2: In the policy improvement step, why is the new policy πk+1
better than πk ?

Lemma (Policy Improvement)

If πk+1 = arg maxπ (rπ + γPπ vπk ), then vπk+1 ≥ vπk for any k.

See the proof in the book.

Shiyu Zhao 17 / 40
Policy iteration algorithm

. Q3: Why can such an iterative algorithm finally reach an optimal


policy?
Since every iteration would improve the policy, we know

vπ0 ≤ vπ1 ≤ vπ2 ≤ · · · ≤ vπk ≤ · · · ≤ v ∗ .

As a result, vπk keeps increasing and will converge. Still need to prove it
converges to v ∗ .

Theorem (Convergence of Policy Iteration)


The state value sequence {vπk}∞k=0 generated by the policy iteration algorithm converges to the optimal state value v∗. As a result, the policy sequence {πk}∞k=0 converges to an optimal policy.

Shiyu Zhao 18 / 40
Policy iteration algorithm

. Q4: What is the relationship between policy iteration and value


iteration?
Related to the answer to Q3 and will be explained in detail later.

Shiyu Zhao 19 / 40
Policy iteration algorithm - Elementwise form

Step 1: Policy evaluation


. Matrix-vector form: vπk^(j+1) = rπk + γPπk vπk^(j),  j = 0, 1, 2, . . .
. Elementwise form:

vπk^(j+1)(s) = Σa πk(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk^(j)(s′) ],  s ∈ S,

Stop when j → ∞, or when j is sufficiently large, or when ‖vπk^(j+1) − vπk^(j)‖ is sufficiently small.

Shiyu Zhao 20 / 40
Policy iteration algorithm - Elementwise form

Step 2: Policy improvement


. Matrix-vector form: πk+1 = arg maxπ (rπ + γPπ vπk )
. Elementwise form:

πk+1(s) = arg maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk(s′) ],  s ∈ S,

where the bracketed term is qπk(s,a), the action value under policy πk. Let

a∗k(s) = arg maxa qπk(s,a)

Then, the greedy policy is

πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 if a ≠ a∗k(s).

Shiyu Zhao 21 / 40
Policy iteration algorithm - Implementation

Pseudocode: Policy iteration algorithm

Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are
known. Initial guess π0 .
Aim: Search for the optimal state value and an optimal policy.

While the policy has not converged, for the kth iteration, do
Policy evaluation:
Initialization: an arbitrary initial guess vπk^(0)
While vπk^(j) has not converged, for the jth iteration, do
For every state s ∈ S, do
vπk^(j+1)(s) = Σa πk(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk^(j)(s′) ]
Policy improvement:
For every state s ∈ S, do
For every action a ∈ A(s), do
qπk(s,a) = Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk(s′)
a∗k(s) = arg maxa qπk(s,a)
πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise

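A minimal tabular sketch of the pseudocode above, with the same assumed model arrays P[s, a, s′] = p(s′|s, a) and R[s, a] as in the value iteration sketch:

import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_tol=1e-8):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # initial deterministic guess
    idx = np.arange(n_states)
    while True:
        # Policy evaluation: solve v = r_pi + gamma * P_pi v iteratively
        v = np.zeros(n_states)
        while True:
            v_new = R[idx, policy] + gamma * P[idx, policy] @ v
            if np.abs(v_new - v).max() < eval_tol:
                v = v_new
                break
            v = v_new
        # Policy improvement: greedy with respect to q_pi
        q = R + gamma * P @ v
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):      # policy has converged
            return v, policy
        policy = new_policy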
Shiyu Zhao 22 / 40
Policy iteration algorithm - Simple example

[Figure: the two-cell grid world with states s1 and s2: (a) the initial policy, (b) the optimal policy]

. The reward setting is rboundary = −1 and rtarget = 1. The discount


rate is γ = 0.9.
. Actions: a` , a0 , ar represent go left, stay unchanged, and go right.
. Aim: use policy iteration to find out the optimal policy.

Shiyu Zhao 23 / 40
Policy iteration algorithm - Simple example

. Iteration k = 0: Step 1: policy evaluation


π0 is selected as the policy in Figure (a). The Bellman equation is
vπ0 (s1 ) = −1 + γvπ0 (s1 ),
vπ0 (s2 ) = 0 + γvπ0 (s1 ).
• Solve the equations directly:

vπ0 (s1 ) = −10, vπ0 (s2 ) = −9.


• Solve the equations iteratively. Select the initial guess as vπ0^(0)(s1) = vπ0^(0)(s2) = 0:

vπ0^(1)(s1) = −1 + γ vπ0^(0)(s1) = −1,      vπ0^(1)(s2) = 0 + γ vπ0^(0)(s1) = 0,
vπ0^(2)(s1) = −1 + γ vπ0^(1)(s1) = −1.9,    vπ0^(2)(s2) = 0 + γ vπ0^(1)(s1) = −0.9,
vπ0^(3)(s1) = −1 + γ vπ0^(2)(s1) = −2.71,   vπ0^(3)(s2) = 0 + γ vπ0^(2)(s1) = −1.71,
...
Shiyu Zhao 24 / 40


Policy iteration algorithm - Simple example

. Iteration k = 0: Step 2: policy improvement


The expression of qπk (s, a):

qπk (s, a) a` a0 ar
s1 −1 + γvπk (s1 ) 0 + γvπk (s1 ) 1 + γvπk (s2 )
s2 0 + γvπk (s1 ) 1 + γvπk (s2 ) −1 + γvπk (s2 )

Substituting vπ0 (s1 ) = −10, vπ0 (s2 ) = −9 and γ = 0.9 gives

qπ0 (s, a) a` a0 ar
s1 −10 −9 −7.1
s2 −9 −7.1 −9.1

By seeking the greatest value of qπ0 , the improved policy is:

π1 (ar |s1 ) = 1, π1 (a0 |s2 ) = 1.

This policy is optimal after only one iteration! In your program, the iterations should continue until the stopping criterion is satisfied.
Shiyu Zhao 25 / 40
Policy iteration algorithm - Simple example

Exercise! Set the left cell as the target area.

Now you know another powerful algorithm searching for optimal policies!
Now let’s apply it and see what we can find.

Shiyu Zhao 26 / 40
Policy iteration algorithm - Complicated example

. Setting: rboundary = −1, rforbidden = −10, rtarget = 1, γ = 0.9.


. Let’s check out the intermediate policies and state values.
[Figure: the intermediate policies and state values: π0 and vπ0, π1 and vπ1, π2 and vπ2, π3 and vπ3]


Shiyu Zhao 27 / 40
Policy iteration algorithm - Complicated example

. Interesting pattern of the policies and state values


[Figure: the intermediate policies and state values πk and vπk for k = 4, 5, . . . , 9, 10]

Shiyu Zhao 28 / 40
Outline

1 Value iteration algorithm

2 Policy iteration algorithm

3 Truncated policy iteration algorithm

Shiyu Zhao 29 / 40
Compare value iteration and policy iteration

Policy iteration: start from π0


• Policy evaluation (PE):

vπk = rπk + γPπk vπk

• Policy improvement (PI):

πk+1 = arg maxπ (rπ + γPπ vπk)

Value iteration: start from v0


• Policy update (PU):

πk+1 = arg maxπ (rπ + γPπ vk)

• Value update (VU):

vk+1 = rπk+1 + γPπk+1 vk


Shiyu Zhao 30 / 40
Compare value iteration and policy iteration

. The two algorithms are very similar:


Policy iteration: π0 −(PE)→ vπ0 −(PI)→ π1 −(PE)→ vπ1 −(PI)→ π2 −(PE)→ vπ2 −(PI)→ . . .
Value iteration:  u0 −(PU)→ π1′ −(VU)→ u1 −(PU)→ π2′ −(VU)→ u2 −(PU)→ . . .

PE=policy evaluation. PI=policy improvement.


PU=policy update. VU=value update.

Shiyu Zhao 31 / 40
Compare value iteration and policy iteration

. Let’s compare the steps carefully:


Step          Policy iteration algorithm            Value iteration algorithm          Comments
1) Policy:    π0                                    N/A
2) Value:     vπ0 = rπ0 + γPπ0 vπ0                  v0 := vπ0
3) Policy:    π1 = arg maxπ (rπ + γPπ vπ0)          π1 = arg maxπ (rπ + γPπ v0)        The two policies are the same
4) Value:     vπ1 = rπ1 + γPπ1 vπ1                  v1 = rπ1 + γPπ1 v0                 vπ1 ≥ v1 since vπ1 ≥ vπ0
5) Policy:    π2 = arg maxπ (rπ + γPπ vπ1)          π2′ = arg maxπ (rπ + γPπ v1)
...           ...                                   ...

• They start from the same initial condition.


• The first three steps are the same.
• The fourth step becomes different:
• In policy iteration, solving vπ1 = rπ1 + γPπ1 vπ1 requires an iterative
algorithm (an infinite number of iterations)
• In value iteration, v1 = rπ1 + γPπ1 v0 is a one-step iteration
Shiyu Zhao 32 / 40
Compare value iteration and policy iteration

Consider the step of solving vπ1 = rπ1 + γPπ1 vπ1 :


vπ1^(0) = v0
vπ1^(1) = rπ1 + γPπ1 vπ1^(0)        ← value iteration takes v1 = vπ1^(1)
vπ1^(2) = rπ1 + γPπ1 vπ1^(1)
...
vπ1^(j) = rπ1 + γPπ1 vπ1^(j−1)      ← truncated policy iteration takes v̄1 = vπ1^(j)
...
vπ1^(∞) = rπ1 + γPπ1 vπ1^(∞)        ← policy iteration takes vπ1 = vπ1^(∞)

• The value iteration algorithm computes only one of these iterations.
• The policy iteration algorithm computes an infinite number of iterations.
• The truncated policy iteration algorithm computes a finite number of iterations (say j). The remaining iterations from j to ∞ are truncated.
Shiyu Zhao 33 / 40
Truncated policy iteration - Pseudocode

Pseudocode: Truncated policy iteration algorithm

Initialization: The probability model p(r|s, a) and p(s0 |s, a) for all (s, a) are known. Initial
guess π0 .
Aim: Search for the optimal state value and an optimal policy.

While the policy has not converged, for the kth iteration, do
Policy evaluation:
Initialization: select the initial guess as vk^(0) = vk−1. The maximum number of iterations is set to jtruncate.
While j < jtruncate, do
For every state s ∈ S, do
vk^(j+1)(s) = Σa πk(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk^(j)(s′) ]
Set vk = vk^(jtruncate)
Policy improvement:
For every state s ∈ S, do
For every action a ∈ A(s), do
qk(s,a) = Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vk(s′)
a∗k(s) = arg maxa qk(s,a)
πk+1(a|s) = 1 if a = a∗k(s), and πk+1(a|s) = 0 otherwise

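A minimal sketch of the truncated policy evaluation step, under the same assumed model arrays P[s, a, s′] and R[s, a] as before; the only difference from full policy evaluation is the fixed number of sweeps j_truncate and the warm start from the previous value estimate:

import numpy as np

def truncated_policy_evaluation(P, R, policy, v, gamma, j_truncate):
    # Run exactly j_truncate sweeps of v <- r_pi + gamma * P_pi v,
    # warm-started from the previous estimate v (i.e., v_{k-1}).
    idx = np.arange(len(policy))
    for _ in range(j_truncate):
        v = R[idx, policy] + gamma * P[idx, policy] @ v
    return v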
Shiyu Zhao 34 / 40
Truncated policy iteration - Convergence

. Will the truncation undermine convergence?

Proposition (Value Improvement)

Consider the iterative algorithm for solving the policy evaluation step:

vπk^(j+1) = rπk + γPπk vπk^(j),  j = 0, 1, 2, . . .

If the initial guess is selected as vπk^(0) = vπk−1, it holds that

vπk^(j+1) ≥ vπk^(j)

for every j = 0, 1, 2, . . . .

For the proof, see the book.

Shiyu Zhao 35 / 40
Truncated policy iteration - Convergence


Figure: Illustration of the relationship among value iteration, policy iteration, and
truncated policy iteration.

The convergence proof of PI is based on that of VI. Since VI converges,


we know PI converges.
Shiyu Zhao 36 / 40
Truncated policy iteration - Example

. Setup: The same as the previous example. Below is the initial policy.
[Figure: the initial policy in the 5×5 grid world]

. Define ‖vk − v∗‖ as the state value error at time k. The stop criterion is ‖vk − v∗‖ < 0.01.

Shiyu Zhao 37 / 40
Truncated policy iteration - Example

[Plot: state value error versus iteration number for truncated policy iteration-1, -3, -6, and -100]

. “Truncated policy iteration-x” where x = 1, 3, 6, 100 refers to a


truncated policy iteration algorithm where the policy evaluation step runs
x iterations.
Shiyu Zhao 38 / 40
Truncated policy iteration - Example

[Plot: state value error versus iteration number for truncated policy iteration-1, -3, -6, and -100]

. The greater the value of x is, the faster the value estimate converges.
. However, the benefit of increasing x drops quickly when x is large.
. In practice, it suffices to run a small number of iterations in the policy evaluation step.
Shiyu Zhao 39 / 40
Summary

. Value iteration: it is the iterative algorithm solving the Bellman


optimality equation: given an initial value v0 ,

vk+1 = maxπ (rπ + γPπ vk )


⇕
Policy update: πk+1 = arg maxπ (rπ + γPπ vk )
Value update: vk+1 = rπk+1 + γPπk+1 vk

. Policy iteration: given an initial policy π0 ,


Policy evaluation: vπk = rπk + γPπk vπk
Policy improvement: πk+1 = arg maxπ (rπ + γPπ vπk )

. Truncated policy iteration

Shiyu Zhao 40 / 40
Lecture 5: Monte Carlo Learning

Shiyu Zhao
Outline

1 Motivating example

2 The simplest MC-based RL algorithm


Algorithm: MC Basic

3 Use data more efficiently


Algorithm: MC Exploring Starts

4 MC without exploring starts


Algorithm: MC ε-Greedy

Shiyu Zhao 2 / 50
Outline

1 Motivating example

2 The simplest MC-based RL algorithm


Algorithm: MC Basic

3 Use data more efficiently


Algorithm: MC Exploring Starts

4 MC without exploring starts


Algorithm: MC ε-Greedy

Shiyu Zhao 3 / 50
Motivating example: Monte Carlo estimation

. How can we estimate something without models?


• The simplest idea: Monte Carlo estimation.

. Example: Flip a coin

The result (either head or tail) is denoted as a random variable X


• if the result is head, then X = +1
• if the result is tail, then X = −1

The aim is to compute E[X].

Shiyu Zhao 4 / 50
Motivating example: Monte Carlo estimation

. Method 1: Model-based
• Suppose the probabilistic model is known as

p(X = 1) = 0.5, p(X = −1) = 0.5

Then by definition
E[X] = Σx x p(x) = 1 × 0.5 + (−1) × 0.5 = 0

• Problem: it may be impossible to know the precise distribution!!

Shiyu Zhao 5 / 50
Motivating example: Monte Carlo estimation

. Method 2: Model-free

• Idea: Flip the coin many times, and then calculate the average of the
outcomes.
• Suppose we get a sample sequence: {x1 , x2 , . . . , xN }.
Then, the mean can be approximated as
E[X] ≈ x̄ = (1/N) Σ_{j=1}^N xj.

This is the idea of Monte Carlo estimation!
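A minimal Python sketch of this model-free estimate for the coin-flip example (the fair-coin sampler below is an assumption for illustration):

import numpy as np

rng = np.random.default_rng(0)
for N in (10, 100, 10000):
    samples = rng.choice([1, -1], size=N)   # fair coin: +1 for head, -1 for tail
    print(N, samples.mean())                # approaches E[X] = 0 as N grows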

Shiyu Zhao 6 / 50
Motivating example: Monte Carlo estimation

. Question: Is the Monte Carlo estimation accurate?


• When N is small, the approximation is inaccurate.
• As N increases, the approximation becomes more and more accurate.

[Plot: the samples and their running average; the average approaches E[X] = 0 as the sample index grows from 0 to 200]

Shiyu Zhao 7 / 50
Motivating example: Monte Carlo estimation

Law of Large Numbers

For a random variable X, suppose {xj}_{j=1}^N are some i.i.d. samples. Let x̄ = (1/N) Σ_{j=1}^N xj be the average of the samples. Then,

E[x̄] = E[X],
Var[x̄] = (1/N) Var[X].
As a result, x̄ is an unbiased estimate of E[X] and its variance decreases
to zero as N increases to infinity.

. The samples must be iid (independent and identically distributed)


. For the proof, see the book.

Shiyu Zhao 8 / 50
Motivating example: Monte Carlo estimation

. Summary:

• Monte Carlo estimation refers to a broad class of techniques that rely


on repeated random sampling to solve approximation problems.
• Why we care about Monte Carlo estimation? Because it does not
require the model!
• Why we care about mean estimation? Because state value and action
value are defined as expectations of random variables!

Shiyu Zhao 9 / 50
Outline

1 Motivating example

2 The simplest MC-based RL algorithm


Algorithm: MC Basic

3 Use data more efficiently


Algorithm: MC Exploring Starts

4 MC without exploring starts


Algorithm: MC ε-Greedy

Shiyu Zhao 10 / 50
Convert policy iteration to be model-free

The key to understand the algorithm is to understand how to convert the


policy iteration algorithm to be model-free.
• Should understand policy iteration well.
• Should understand the idea of Monte Carlo mean estimation.

Shiyu Zhao 11 / 50
Convert policy iteration to be model-free

Policy iteration has two steps in each iteration:


(
Policy evaluation: vπk = rπk + γPπk vπk
Policy improvement: πk+1 = arg maxπ (rπ + γPπ vπk )

The elementwise form of the policy improvement step is:

πk+1(s) = arg maxπ Σa π(a|s) [ Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk(s′) ]
        = arg maxπ Σa π(a|s) qπk(s,a),  s ∈ S

The key is qπk (s, a)!

Shiyu Zhao 12 / 50
Convert policy iteration to be model-free

Two expressions of action value:


• Expression 1 requires the model:
qπk(s,a) = Σr p(r|s,a) r + γ Σs′ p(s′|s,a) vπk(s′)

• Expression 2 does not require the model:

qπk (s, a) = E[Gt |St = s, At = a]

Idea to achieve model-free RL: We can use expression 2 to calculate


qπk (s, a) based on data (samples or experiences)!

Shiyu Zhao 13 / 50
Convert policy iteration to be model-free

The procedure of Monte Carlo estimation of action values:


• Starting from (s, a), following policy πk , generate an episode.
• The return of this episode is g(s, a)
• g(s, a) is a sample of Gt in

qπk (s, a) = E[Gt |St = s, At = a]

• Suppose we have a set of episodes and hence the returns {g^(i)(s,a)}. Then,

qπk(s,a) = E[Gt | St = s, At = a] ≈ (1/N) Σ_{i=1}^N g^(i)(s,a).

Fundamental idea: When model is unavailable, we can use data.

Shiyu Zhao 14 / 50
The MC Basic algorithm

. Description of the algorithm:


Given an initial policy π0 , there are two steps at the kth iteration.

• Step 1: policy evaluation. This step is to obtain qπk (s, a) for all
(s, a). Specifically, for each action-state pair (s, a), run an infinite
number of (or sufficiently many) episodes. The average of their returns
is used to approximate qπk (s, a).
• Step 2: policy improvement. This step is to solve
P
πk+1 (s) = arg maxπ a π(a|s)qπk (s, a) for all s ∈ S. The greedy
optimal policy is πk+1 (a∗k |s) = 1 where a∗k = arg maxa qπk (s, a).

Exactly the same as the policy iteration algorithm, except


• Estimate qπk (s, a) directly, instead of solving vπk (s).

Shiyu Zhao 15 / 50
The MC Basic algorithm

. Description of the algorithm:

Pseudocode: MC Basic algorithm (a model-free variant of policy iteration)

Initialization: Initial guess π0 .


Aim: Search for an optimal policy.

While the value estimate has not converged, for the kth iteration, do
For every state s ∈ S, do
For every action a ∈ A(s), do
Collect sufficiently many episodes starting from (s, a) following πk
MC-based policy evaluation step:
qπk (s, a) = average return of all the episodes starting from (s, a)
Policy improvement step:
a∗k (s) = arg maxa qπk (s, a)
πk+1 (a|s) = 1 if a = a∗k , and πk+1 (a|s) = 0 otherwise

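A minimal sketch of one iteration of MC Basic, assuming a hypothetical sampler env_step(s, a) → (reward, next_state) and a deterministic policy stored as an integer array pi:

import numpy as np

def mc_basic_iteration(env_step, pi, n_states, n_actions,
                       gamma=0.9, n_episodes=10, episode_len=50):
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            returns = []
            for _ in range(n_episodes):            # episodes starting from (s, a)
                g, discount = 0.0, 1.0
                state, action = s, a
                for _ in range(episode_len):
                    r, state = env_step(state, action)
                    g += discount * r
                    discount *= gamma
                    action = pi[state]             # then follow the current policy
                returns.append(g)
            q[s, a] = np.mean(returns)             # MC-based policy evaluation
    return q.argmax(axis=1), q                     # greedy policy improvement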
Shiyu Zhao 16 / 50
The MC Basic algorithm

• MC Basic is a variant of the policy iteration algorithm.


• The model-free algorithms are built up based on model-based ones. It
is, therefore, necessary to understand model-based algorithms first
before studying model-free algorithms.
• MC Basic is useful to reveal the core idea of MC-based model-free RL,
but not practical due to low efficiency.
• Why does MC Basic estimate action values instead of state values?
That is because state values cannot be used to improve policies
directly. When models are not available, we should directly estimate
action values.
• Since policy iteration is convergent, MC Basic is also guaranteed to converge given sufficiently many episodes.

Shiyu Zhao 17 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

Task:
• An initial policy is shown in the figure.
• Use MC Basic to find the optimal policy.
• rboundary = −1, rforbidden = −1, rtarget = 1, γ = 0.9.

Shiyu Zhao 18 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

Outline: given the current policy πk


• Step 1 - policy evaluation: calculate qπk (s, a)
How many state-action pairs? 9 states × 5 actions =45 state-action
pairs!
• Step 2 - policy improvement: select the greedy action
a∗ (s) = arg maxai qπk (s, a)

Shiyu Zhao 19 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

. Due to space limitation, we only show qπk (s1 , a)


. Step 1 - policy evaluation:

• Since the current policy is deterministic, one episode would be


sufficient to get the action value!
• If the current policy is stochastic, an infinite number of episodes (or at
least many) are required!

Shiyu Zhao 20 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

• Starting from (s1, a1), the episode is s1 −(a1)→ s1 −(a1)→ s1 −(a1)→ . . . Hence, the action value is

qπ0(s1, a1) = −1 + γ(−1) + γ²(−1) + . . .

• Starting from (s1, a2), the episode is s1 −(a2)→ s2 −(a3)→ s5 −(a3)→ . . . Hence, the action value is

qπ0(s1, a2) = 0 + γ0 + γ²0 + γ³(1) + γ⁴(1) + . . .

• Starting from (s1, a3), the episode is s1 −(a3)→ s4 −(a2)→ s5 −(a3)→ . . . Hence, the action value is

qπ0(s1, a3) = 0 + γ0 + γ²0 + γ³(1) + γ⁴(1) + . . .


Shiyu Zhao 21 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

• Starting from (s1, a4), the episode is s1 −(a4)→ s1 −(a1)→ s1 −(a1)→ . . . Hence, the action value is

qπ0(s1, a4) = −1 + γ(−1) + γ²(−1) + . . .

• Starting from (s1, a5), the episode is s1 −(a5)→ s1 −(a1)→ s1 −(a1)→ . . . Hence, the action value is

qπ0(s1, a5) = 0 + γ(−1) + γ²(−1) + . . .

Shiyu Zhao 22 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

. Step 2 - policy improvement:


• By observing the action values, we see that

qπ0 (s1 , a2 ) = qπ0 (s1 , a3 )

are the maximum.


• As a result, the policy can be improved as

π1 (a2 |s1 ) = 1 or π1 (a3 |s1 ) = 1.

In either way, the new policy for s1 becomes optimal.


One iteration is sufficient for this simple example!
Shiyu Zhao 23 / 50
Illustrative example 1: step by step

s1 s2 s3

s4 s5 s6

s7 s8 s9

Exercise: now update the policy for s3 using MC Basic!

Shiyu Zhao 24 / 50
Illustrative example 2: Episode length

Examine the impact of episode length:


• We need sample episodes, but the length of an episode cannot be
infinitely long.
• How long should be the episodes?

Example setup:
• 5-by-5 grid world
• Reward setting: rboundary = −1, rforbidden = −10, rtarget = 1, γ = 0.9

Shiyu Zhao 25 / 50
Illustrative example 2: Episode length

. Use MC Basic to search optimal policies with different episode lengths.


[Figure: the estimated state values and policies obtained with episode lengths 1, 2, 3, and 4]
Shiyu Zhao 26 / 50
Illustrative example 2: Episode length
[Figure: the estimated state values and policies obtained with episode lengths 14, 15, 30, and 100]
Shiyu Zhao 27 / 50
Illustrative example 2: Episode length
[Figure: the estimated state values and policy obtained with episode length 4, repeated here for reference]

. Findings:
• When the episode length is short, only the states that are close to the
target have nonzero state values.
• As the episode length increases, the states that are closer to the target
have nonzero values earlier than those farther away.
• The episode length should be sufficiently long.
• The episode length does not have to be infinitely long.

Shiyu Zhao 28 / 50
Outline

1 Motivating example

2 The simplest MC-based RL algorithm


Algorithm: MC Basic

3 Use data more efficiently


Algorithm: MC Exploring Starts

4 MC without exploring starts


Algorithm: MC ε-Greedy

Shiyu Zhao 29 / 50
Use data more efficiently

The MC Basic algorithm:


• Advantage: reveal the core idea clearly!
• Disadvantage: too simple to be practical.
However, MC Basic can be extended to be more efficient.

Shiyu Zhao 30 / 50
Use data more efficiently

. Consider a grid-world example, following a policy π, we can get an


episode such as
s1 −(a2)→ s2 −(a4)→ s1 −(a2)→ s2 −(a3)→ s5 −(a1)→ . . .

. Visit: every time a state-action pair appears in the episode, it is called


a visit of that state-action pair.
. Methods to use the data: Initial-visit method
• Just calculate the return and approximate qπ (s1 , a2 ).
• This is what the MC Basic algorithm does.
• Disadvantage: Not fully utilize the data.

Shiyu Zhao 31 / 50
Use data more efficiently

. The episode also visits other state-action pairs.


s1 −(a2)→ s2 −(a4)→ s1 −(a2)→ s2 −(a3)→ s5 −(a1)→ . . .   [original episode]
s2 −(a4)→ s1 −(a2)→ s2 −(a3)→ s5 −(a1)→ . . .             [episode starting from (s2, a4)]
s1 −(a2)→ s2 −(a3)→ s5 −(a1)→ . . .                       [episode starting from (s1, a2)]
s2 −(a3)→ s5 −(a1)→ . . .                                 [episode starting from (s2, a3)]
s5 −(a1)→ . . .                                           [episode starting from (s5, a1)]

Can estimate qπ (s1 , a2 ), qπ (s2 , a4 ), qπ (s2 , a3 ), qπ (s5 , a1 ),...


Data-efficient methods:
• first-visit method
• every-visit method

Shiyu Zhao 32 / 50
Update value estimate more efficiently

. Another aspect in MC-based RL is when to update the policy. There


are two methods.
• The first method is, in the policy evaluation step, to collect all the
episodes starting from a state-action pair and then use the average
return to approximate the action value.
• This is the one adopted by the MC Basic algorithm.
• The problem of this method is that the agent has to wait until all
episodes have been collected.
• The second method uses the return of a single episode to
approximate the action value.
• In this way, we can improve the policy episode-by-episode.

Shiyu Zhao 33 / 50
Update value estimate more efficiently

. Will the second method cause problems?


• One may say that the return of a single episode cannot accurately
approximate the corresponding action value.
• In fact, we have done that in the truncated policy iteration algorithm
introduced in the last chapter!
. Generalized policy iteration:
• Not a specific algorithm.
• It refers to the general idea or framework of switching between
policy-evaluation and policy-improvement processes.
• Many model-based and model-free RL algorithms fall into this
framework.

Shiyu Zhao 34 / 50
MC Exploring Starts
. If we use data and update estimate more efficiently, we get a new
algorithm called MC Exploring Starts:
Pseudocode: MC Exploring Starts (a sample-efficient variant of MC Basic)

Initialization: Initial guess π0 .


Aim: Search for an optimal policy.

For each episode, do


Episode generation: Randomly select a starting state-action pair (s0 , a0 ) and ensure
that all pairs can be possibly selected. Following the current policy, generate an episode
of length T : s0 , a0 , r1 , . . . , sT −1 , aT −1 , rT .
Policy evaluation and policy improvement:
Initialization: g ← 0
For each step of the episode, t = T − 1, T − 2, . . . , 0, do
g ← γg + rt+1
Use the first-visit method:
If (st , at ) does not appear in (s0 , a0 , s1 , a1 , . . . , st−1 , at−1 ), then
Returns(st , at ) ← Returns(st , at ) + g
q(st , at ) = average(Returns(st , at ))
π(a|st ) = 1 if a = arg maxa q(st , a)

Shiyu Zhao 35 / 50
MC Exploring Starts

. What is exploring starts?


• Exploring starts means we need to generate sufficiently many episodes
starting from every state-action pair.
• Both MC Basic and MC Exploring Starts need this assumption.

Shiyu Zhao 36 / 50
MC Exploring Starts

. Why do we need to consider exploring starts?


• In theory, only if every action value for every state is well explored, can
we select the optimal actions correctly.
On the contrary, if an action is not explored, this action may happen to
be the optimal one and hence be missed.
• In practice, exploring starts is difficult to achieve. For many
applications, especially those involving physical interactions with
environments, it is difficult to collect episodes starting from every
state-action pair.
Therefore, there is a gap between theory and practice.
Can we remove the requirement of exploring starts? We next show that
we can do that by using soft policies.

Shiyu Zhao 37 / 50
Outline

1 Motivating example

2 The simplest MC-based RL algorithm


Algorithm: MC Basic

3 Use data more efficiently


Algorithm: MC Exploring Starts

4 MC without exploring starts


Algorithm: MC ε-Greedy

Shiyu Zhao 38 / 50
Soft policies

. A policy is called soft if the probability to take any action is positive.


. Why introduce soft policies?
• With a soft policy, a few episodes that are sufficiently long can visit
every state-action pair for sufficiently many times.
• Then, we do not need to have a large number of episodes starting from
every state-action pair. Hence, the requirement of exploring starts can
thus be removed.

Shiyu Zhao 39 / 50
ε-greedy policies

. What soft policies will we use? Answer: ε-greedy policies


• What is an ε-greedy policy?

π(a|s) = 1 − (ε/|A(s)|)(|A(s)| − 1)   for the greedy action,
π(a|s) = ε/|A(s)|                     for each of the other |A(s)| − 1 actions,

where ε ∈ [0, 1] and |A(s)| is the number of actions for s.
• The chance to choose the greedy action is always greater than that of the other actions, because 1 − (ε/|A(s)|)(|A(s)| − 1) = 1 − ε + ε/|A(s)| ≥ ε/|A(s)|.
• Why use ε-greedy? Balance between exploitation and exploration
• When ε = 0, it becomes greedy! Less exploration but more
exploitation!
• When ε = 1, it becomes a uniform distribution. More exploration
but less exploitation.
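A minimal sketch of sampling an action from the ε-greedy policy defined above, constructed from one row q_s = q(s, ·) of an action-value table (the function name and arguments are illustrative):

import numpy as np

def epsilon_greedy_action(q_s, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    n = len(q_s)                              # |A(s)|
    probs = np.full(n, epsilon / n)           # epsilon/|A(s)| for every action
    probs[np.argmax(q_s)] += 1.0 - epsilon    # the greedy action gets 1 - eps*(n-1)/n in total
    return rng.choice(n, p=probs)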
Shiyu Zhao 40 / 50
MC ε-Greedy algorithm

. How to embed ε-greedy into the MC-based RL algorithms?


Originally, the policy improvement step in MC Basic and MC Exploring
Starts is to solve
πk+1(s) = arg max_{π∈Π} Σa π(a|s) qπk(s,a),

where Π denotes the set of all possible policies. The optimal policy here is

πk+1(a|s) = 1 if a = a∗k, and πk+1(a|s) = 0 if a ≠ a∗k,

where a∗k = arg maxa qπk (s, a).

Shiyu Zhao 41 / 50
MC ε-Greedy algorithm

. How to embed ε-greedy into the MC-based RL algorithms?


Now, the policy improvement step is changed to solve
πk+1(s) = arg max_{π∈Πε} Σa π(a|s) qπk(s,a),

where Πε denotes the set of all ε-greedy policies with a fixed value of ε. The optimal policy here is

πk+1(a|s) = 1 − ((|A(s)| − 1)/|A(s)|) ε  if a = a∗k,
πk+1(a|s) = (1/|A(s)|) ε                 if a ≠ a∗k.

• MC ε-Greedy is the same as that of MC Exploring Starts except that


the former uses ε-greedy policies.
• It does not require exploring starts, but still requires to visit all
state-action pairs in a different form.
Shiyu Zhao 42 / 50
MC ε-Greedy algorithm

Pseudocode: MC ε-Greedy (a variant of MC Exploring Starts)

Initialization: Initial guess π0 and the value of ε ∈ [0, 1]


Aim: Search for an optimal policy.

For each episode, do


Episode generation: Randomly select a starting state-action pair (s0 , a0 ). Following the
current policy, generate an episode of length T : s0 , a0 , r1 , . . . , sT −1 , aT −1 , rT .
Policy evaluation and policy improvement:
Initialization: g ← 0
For each step of the episode, t = T − 1, T − 2, . . . , 0, do
g ← γg + rt+1
Use the every-visit method:
Returns(st , at ) ← Returns(st , at ) + g
q(st , at ) = average(Returns(st , at ))
Let $a^* = \arg\max_a q(s_t, a)$ and
$$\pi(a|s_t) = \begin{cases} 1 - \dfrac{|\mathcal{A}(s_t)| - 1}{|\mathcal{A}(s_t)|}\varepsilon, & a = a^*, \\[6pt] \dfrac{1}{|\mathcal{A}(s_t)|}\varepsilon, & a \neq a^*. \end{cases}$$
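
Below is a minimal Python sketch of the above pseudocode on a toy, randomly generated MDP (an assumption made only for self-containedness; it is not the grid-world example from the slides).

```python
import numpy as np

# Toy MDP: deterministic transitions P[s, a] and rewards R[s, a], chosen at random.
rng = np.random.default_rng(1)
n_states, n_actions, gamma, epsilon = 4, 3, 0.9, 0.1
P = rng.integers(0, n_states, size=(n_states, n_actions))   # next state for each (s, a)
R = rng.normal(size=(n_states, n_actions))                   # reward r(s, a)

q = np.zeros((n_states, n_actions))
ret_sum = np.zeros((n_states, n_actions))
ret_cnt = np.zeros((n_states, n_actions))
policy = np.full((n_states, n_actions), 1.0 / n_actions)     # initial soft (uniform) policy

for episode in range(200):
    s, a = rng.integers(n_states), rng.integers(n_actions)   # random starting pair (s0, a0)
    traj = []
    for t in range(50):                                      # generate an episode of length T = 50
        traj.append((s, a, R[s, a]))
        s = P[s, a]
        a = rng.choice(n_actions, p=policy[s])
    g = 0.0
    for s_t, a_t, r_t1 in reversed(traj):                    # every-visit, backward accumulation
        g = gamma * g + r_t1
        ret_sum[s_t, a_t] += g
        ret_cnt[s_t, a_t] += 1
        q[s_t, a_t] = ret_sum[s_t, a_t] / ret_cnt[s_t, a_t]
        a_star = int(np.argmax(q[s_t]))                      # epsilon-greedy policy improvement
        policy[s_t, :] = epsilon / n_actions
        policy[s_t, a_star] = 1 - epsilon / n_actions * (n_actions - 1)

print(np.round(q, 2))
```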

Shiyu Zhao 43 / 50
Exploration ability

. Can a single episode visit all state-action pairs?

When ε = 1, the policy (uniform distribution) has the strongest


exploration ability.
[Figure: a single episode generated by the uniform policy (ε = 1) in the grid world after (a) 100 steps, (b) 1000 steps, and (c) 10000 steps, and (d) the number of times each state-action pair is visited; all pairs are visited a comparable number of times.]


Shiyu Zhao 44 / 50
Exploration ability

. Can a single episode visit all state-action pairs?

When ε is small, the exploration ability of the policy is also small.


[Figure: a single episode generated by a policy with a small ε after (a) 100 steps, (b) 1000 steps, and (c) 10000 steps, and (d) the number of times each state-action pair is visited; the visit counts are highly uneven.]

Shiyu Zhao 45 / 50
Estimate based on one episode
. Run the MC ε-Greedy algorithm as follows. In every iteration:
• In the episode generation step, use the previous policy to generate an
episode of one million steps!
• In the remaining steps, use this single episode to update the policy.
• Two iterations can lead to the optimal ε-greedy policy.

[Figure: (a) the initial policy, (b) the policy after the first iteration, and (c) the policy after the second iteration in the 5×5 grid world.]

Here, rboundary = −1, rforbidden = −10, rtarget = 1, γ = 0.9


Shiyu Zhao 46 / 50
Optimality vs exploration

. Compared to greedy policies,


• The advantage of ε-greedy policies is that they have stronger
exploration ability so that the exploring starts condition is not required.
• The disadvantage is that ε-greedy policies are not optimal in general
(we can only show that there always exist greedy policies that are
optimal).
• The final policy given by the MC ε-Greedy algorithm is only optimal
in the set Πε of all ε-greedy policies.
• ε cannot be too large.
. Next, we use examples to demonstrate. The setup is rboundary = −1,
rforbidden = −10, rtarget = 1, γ = 0.9

Shiyu Zhao 47 / 50
Optimality
. Given an ε-greedy policy, what is its state value?
[Figure: an ε-greedy policy and its state values evaluated for ε = 0, 0.1, 0.2, and 0.5. The state values decrease notably as ε increases; for ε = 0.2 and ε = 0.5 all state values, including that of the target state, are negative.]

. When ε increases, the optimality of the policy becomes worse!


. Why is the state value of the target state negative?
Shiyu Zhao 48 / 50
Consistency
. Find the optimal ε-greedy policies and their state values?
[Figure: the optimal ε-greedy policies and their state values for ε = 0, 0.1, 0.2, and 0.5. As ε increases, the optimal ε-greedy policy deviates from the greedy optimal policy and its state values drop.]

. The optimal ε-greedy policies are not consistent with the greedy
optimal one! Why is that? Consider the target for example.
Shiyu Zhao 49 / 50
Summary

Key points:
• Mean estimation by the Monte Carlo methods
• Three algorithms:
• MC Basic
• MC Exploring Starts
• MC ε-Greedy
• Relationship among the three algorithms
• Optimality vs exploration of ε-greedy policies

Shiyu Zhao 50 / 50
Lecture 6:
Stochastic Approximation
and
Stochastic Gradient Descent

Shiyu Zhao
Outline

Shiyu Zhao 1 / 65
Introduction

• In the last lecture, we introduced Monte-Carlo learning.


• In the next lecture, we will introduce temporal-difference (TD) learning.
• In this lecture, we press the pause button to get us better prepared.

Why?
• The ideas and expressions of TD algorithms are very different from the
algorithms we studied so far.
• Many students who see the TD algorithms for the first time may wonder
why these algorithms were designed in the first place and why they
work effectively.
• There is a knowledge gap!

Shiyu Zhao 2 / 65
Introduction

In this lecture,
• We fill the knowledge gap between the previous and upcoming lectures
by introducing basic stochastic approximation (SA) algorithms.
• We will see in the next lecture that the temporal-difference algorithms
are special SA algorithms. As a result, it will be much easier to
understand these algorithms.

Shiyu Zhao 3 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 4 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 5 / 65
Motivating example: mean estimation, again

Revisit the mean estimation problem:


• Consider a random variable X.
• Our aim is to estimate E[X].
• Suppose that we collected a sequence of iid samples {xi }N
i=1 .

• The expectation of X can be approximated by


$$E[X] \approx \bar{x} := \frac{1}{N}\sum_{i=1}^{N} x_i.$$

We already know from the last lecture:


• This approximation is the basic idea of Monte Carlo estimation.
• We know that x̄ → E[X] as N → ∞.
Why do we care about mean estimation so much?
• Many values in RL such as state/action values are defined as means.
Shiyu Zhao 6 / 65
Motivating example: mean estimation

New question: how to calculate the mean x̄?

$$E[X] \approx \bar{x} := \frac{1}{N}\sum_{i=1}^{N} x_i.$$

We have two ways.


• The first way, which is trivial, is to collect all the samples and then
calculate the average.
• The drawback of this way is that, if the samples are collected one
by one over a period of time, we have to wait until all the samples
have been collected.
• The second way can avoid this drawback because it calculates the
average in an incremental and iterative manner.

Shiyu Zhao 7 / 65
Motivating example: mean estimation

In particular, suppose
$$w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i, \quad k = 1, 2, \dots$$
and hence
$$w_k = \frac{1}{k-1}\sum_{i=1}^{k-1} x_i, \quad k = 2, 3, \dots$$
Then, $w_{k+1}$ can be expressed in terms of $w_k$ as
$$w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i = \frac{1}{k}\left(\sum_{i=1}^{k-1} x_i + x_k\right) = \frac{1}{k}\big((k-1)w_k + x_k\big) = w_k - \frac{1}{k}(w_k - x_k).$$
Therefore, we obtain the following iterative algorithm:
$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$
Shiyu Zhao 8 / 65
Motivating example: mean estimation

We can use
$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$$
to calculate the mean $\bar{x}$ incrementally:
$$\begin{aligned}
w_1 &= x_1, \\
w_2 &= w_1 - \frac{1}{1}(w_1 - x_1) = x_1, \\
w_3 &= w_2 - \frac{1}{2}(w_2 - x_2) = x_1 - \frac{1}{2}(x_1 - x_2) = \frac{1}{2}(x_1 + x_2), \\
w_4 &= w_3 - \frac{1}{3}(w_3 - x_3) = \frac{1}{3}(x_1 + x_2 + x_3), \\
&\;\;\vdots \\
w_{k+1} &= \frac{1}{k}\sum_{i=1}^{k} x_i.
\end{aligned}$$
Shiyu Zhao 9 / 65
Motivating example: mean estimation

Remarks about this algorithm:


$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$
• An advantage of this algorithm is that a mean estimate can be
obtained immediately once a sample is received. Then, the mean
estimate can be used for other purposes immediately.
• The mean estimate is not accurate in the beginning due to insufficient
samples (that is, wk ≠ E[X]). However, it is better than nothing. As
more samples are obtained, the estimate can be improved gradually
(that is wk → E[X] as k → ∞).
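
A minimal numerical check of this incremental algorithm (the sample distribution used here is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)   # iid samples of X with E[X] = 5

w = samples[0]                       # w_1 = x_1
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)    # w_{k+1} = w_k - (1/k)(w_k - x_k)

print(w, samples.mean())             # the two values coincide (up to floating-point error)
```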

Shiyu Zhao 10 / 65
Motivating example: mean estimation

Furthermore, consider an algorithm with a more general expression:

wk+1 = wk − αk (wk − xk ),

where 1/k is replaced by αk > 0.


• Does this algorithm still converge to the mean E[X]? We will show
that the answer is yes if {αk } satisfy some mild conditions.
• We will also show that this algorithm is a special SA algorithm and
also a special stochastic gradient descent algorithm.
• In the next lecture, we will see that the temporal-difference algorithms
have similar (but more complex) expressions.

Shiyu Zhao 11 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 12 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 13 / 65
Robbins-Monro algorithm

Stochastic approximation (SA):


• SA refers to a broad class of stochastic iterative algorithms solving
root finding or optimization problems.
• Compared to many other root-finding algorithms such as
gradient-based methods, SA is powerful in the sense that it does not
require knowledge of the expression of the objective function or its
derivative.
Robbins-Monro (RM) algorithm:
• This is a pioneering work in the field of stochastic approximation.
• The famous stochastic gradient descent algorithm is a special form of
the RM algorithm.
• It can be used to analyze the mean estimation algorithms introduced in
the beginning.
Shiyu Zhao 14 / 65
Robbins-Monro algorithm – Problem statement

Problem statement: Suppose we would like to find the root of the


equation
g(w) = 0,

where w ∈ R is the variable to be solved and g : R → R is a function.

• Many problems can be eventually converted to this root finding


problem. For example, suppose J(w) is an objective function to be
minimized. Then, the optimization problem can be converted to solving

g(w) = ∇w J(w) = 0

• Note that an equation like g(w) = c with c as a constant can also be


converted to the above equation by rewriting g(w) − c as a new
function.

Shiyu Zhao 15 / 65
Robbins-Monro algorithm – Problem statement

How to calculate the root of g(w) = 0?


• If the expression of g or its derivative is known, there are many
numerical algorithms that can solve this problem.
• What if the expression of the function g is unknown? For example, the
function is represented by an artificial neural network.

Shiyu Zhao 16 / 65
Robbins-Monro algorithm – The algorithm

The Robbins-Monro (RM) algorithm can solve this problem:

wk+1 = wk − ak g̃(wk , ηk ), k = 1, 2, 3, . . .

where
• wk is the kth estimate of the root
• g̃(wk , ηk ) = g(wk ) + ηk is the kth noisy observation
• ak is a positive coefficient.
The function g(w) is a black box! This algorithm relies on data:
• Input sequence: {wk }
• Noisy output sequence: {g̃(wk , ηk )}
Philosophy: without model, we need data!
• Here, the model refers to the expression of the function.
Shiyu Zhao 17 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 18 / 65
Robbins-Monro algorithm – Illustrative examples

Exercise: manually solve g(w) = w − 10 using the RM algorithm.


Set: w1 = 20, ak ≡ 0.5, ηk = 0 (i.e., no observation error)

w1 = 20 =⇒ g(w1 ) = 10
w2 = w1 − a1 g(w1 ) = 20 − 0.5 ∗ 10 = 15 =⇒ g(w2 ) = 5
w3 = w2 − a2 g(w2 ) = 15 − 0.5 ∗ 5 = 12.5 =⇒ g(w3 ) = 2.5
..
.
wk → 10

Exercises:
• What if ak = 1?
• What if ak = 2?
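
A quick numerical check of this exercise (a sketch assuming no observation noise; the helper function is ours, not from the lecture):

```python
# Run the RM iteration w_{k+1} = w_k - a * g(w_k) for g(w) = w - 10 with
# different constant step sizes, starting from w_1 = 20.
def rm_iterates(a, w1=20.0, n=6):
    w, out = w1, [w1]
    for _ in range(n):
        w = w - a * (w - 10.0)      # no observation error: g(w) = w - 10
        out.append(w)
    return out

print(rm_iterates(0.5))   # 20, 15, 12.5, ... -> converges to the root 10
print(rm_iterates(1.0))   # jumps to the root 10 in a single step
print(rm_iterates(2.0))   # oscillates between 20 and 0 and never converges
```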

Shiyu Zhao 19 / 65
Robbins-Monro algorithm – Illustrative examples

Another example: solve g(w) = w3 − 5 using the RM algorithm.


• The true root is $5^{1/3} \approx 1.71$.
• We can only observe $\tilde{g}(w) = g(w) + \eta$.
• Suppose $\eta_k$ is iid and obeys the standard normal distribution (zero mean, unit standard deviation).
• The initial guess is w1 = 0 and ak is selected to be ak = 1/k.
The evolution of wk is shown in the figure. As can be seen, the estimate
wk can converge to the true root.
[Figure: the evolution of the estimated root wk (top) and the observation noise ηk (bottom) over 50 iterations.]
Shiyu Zhao 20 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 21 / 65
Robbins-Monro algorithm – Convergence properties

Why can the RM algorithm find the root of g(w) = 0?

• First, we present an illustrative example.
• Second, we give a rigorous convergence analysis.

Shiyu Zhao 22 / 65
Robbins-Monro algorithm – Convergence properties

An illustrative example:
• g(w) = tanh(w − 1)
• The true root of g(w) = 0 is w∗ = 1.
• Parameters: w1 = 3, ak = 1/k, ηk ≡ 0 (no noise for the sake of
simplicity)
The RM algorithm in this case is

wk+1 = wk − ak g(wk )

since g̃(wk , ηk ) = g(wk ) when ηk = 0.

Shiyu Zhao 23 / 65
Robbins-Monro algorithm – Convergence properties
Simulation result: wk converges to the true root w∗ = 1.

[Figure: the function g(w) = tanh(w − 1) and the iterates w1, w2, w3, w4, . . . approaching the root w∗ = 1 from the right.]

Intuition: wk+1 is closer to w∗ than wk .


• When wk > w∗ , we have g(wk ) > 0. Then,
wk+1 = wk − ak g(wk ) < wk and hence wk+1 is closer to w∗ than wk .
• When wk < w∗ , we have g(wk ) < 0. Then,
wk+1 = wk − ak g(wk ) > wk and wk+1 is closer to w∗ than wk .
Shiyu Zhao 24 / 65
Robbins-Monro algorithm – Convergence properties

The above analysis is intuitive, but not rigorous. A rigorous convergence


result is given below.

Theorem (Robbins-Monro Theorem)

In the Robbins-Monro algorithm, if


1) $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
3) $E[\eta_k \mid \mathcal{H}_k] = 0$ and $E[\eta_k^2 \mid \mathcal{H}_k] < \infty$;


where Hk = {wk , wk−1 , . . . }, then wk converges with probability 1
(w.p.1) to the root w∗ satisfying g(w∗ ) = 0.

Shiyu Zhao 25 / 65
Robbins-Monro algorithm – Convergence properties

Explanation of the three conditions:


• $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$
This condition indicates that
• $g$ is monotonically increasing, which ensures that the root of $g(w) = 0$ exists and is unique;
• the gradient is bounded from above.
• $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$
• The condition $\sum_{k=1}^{\infty} a_k^2 < \infty$ ensures that $a_k$ converges to zero as $k \to \infty$.
• The condition $\sum_{k=1}^{\infty} a_k = \infty$ ensures that $a_k$ does not converge to zero too fast.
• $E[\eta_k \mid \mathcal{H}_k] = 0$ and $E[\eta_k^2 \mid \mathcal{H}_k] < \infty$
• A special yet common case is that $\{\eta_k\}$ is an iid stochastic sequence satisfying $E[\eta_k] = 0$ and $E[\eta_k^2] < \infty$. The observation error $\eta_k$ is not required to be Gaussian.
Shiyu Zhao 26 / 65
Robbins-Monro algorithm – Convergence properties

Examine the second condition more closely:



$$\sum_{k=1}^{\infty} a_k^2 < \infty, \qquad \sum_{k=1}^{\infty} a_k = \infty$$
• First, $\sum_{k=1}^{\infty} a_k^2 < \infty$ indicates that $a_k \to 0$ as $k \to \infty$.
• Why is this condition important?
Since
wk+1 − wk = −ak g̃(wk , ηk ),

• If ak → 0, then ak g̃(wk , ηk ) → 0 and hence wk+1 − wk → 0.


• We need the fact that wk+1 − wk → 0 if wk converges eventually.
• If wk → w∗ , then g(wk ) → 0 and g̃(wk , ηk ) is dominated by ηk .

Shiyu Zhao 27 / 65
Robbins-Monro algorithm – Convergence properties

Examine the second condition more closely:



$$\sum_{k=1}^{\infty} a_k^2 < \infty, \qquad \sum_{k=1}^{\infty} a_k = \infty$$
• Second, $\sum_{k=1}^{\infty} a_k = \infty$ indicates that $a_k$ should not converge to zero too fast.
• Why is this condition important?
Summarizing w2 = w1 − a1 g̃(w1 , η1 ), w3 = w2 − a2 g̃(w2 , η2 ), . . . ,
wk+1 = wk − ak g̃(wk , ηk ) leads to

$$w_1 - w_\infty = \sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k).$$
Suppose $w_\infty = w^*$. If $\sum_{k=1}^{\infty} a_k < \infty$, then $\sum_{k=1}^{\infty} a_k \tilde{g}(w_k, \eta_k)$ may be bounded. Then, if the initial guess $w_1$ is chosen arbitrarily far away from $w^*$, the above equality would be invalid.
Shiyu Zhao 28 / 65
Robbins-Monro algorithm – Convergence properties
What $\{a_k\}$ satisfies the two conditions $\sum_{k=1}^{\infty} a_k^2 < \infty$ and $\sum_{k=1}^{\infty} a_k = \infty$?
One typical sequence is
$$a_k = \frac{1}{k}.$$

• It holds that
$$\lim_{n\to\infty}\left(\sum_{k=1}^{n} \frac{1}{k} - \ln n\right) = \kappa,$$
where $\kappa \approx 0.577$ is called the Euler-Mascheroni constant (also called Euler's constant).
• It is notable that
$$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty.$$
The problem of evaluating $\sum_{k=1}^{\infty} 1/k^2$ has a specific name in number theory: the Basel problem.
Shiyu Zhao 29 / 65
Robbins-Monro algorithm – Convergence properties

If the three conditions are not satisfied, the algorithm may not work.
• For example, g(w) = w3 − 5 does not satisfy the first condition on
gradient boundedness. If the initial guess is good, the algorithm can
converge (locally). Otherwise, it will diverge.
We will see that ak is often selected as a sufficiently small constant in
many RL algorithms. Although the second condition is not satisfied in
this case, the algorithm can still work effectively.

Shiyu Zhao 30 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 31 / 65
Robbins-Monro algorithm – Apply to mean estimation

Recall that
wk+1 = wk + αk (xk − wk ).

is the mean estimation algorithm.


We know that
• If $\alpha_k = 1/k$, then $w_{k+1} = \frac{1}{k}\sum_{i=1}^{k} x_i$.
• If αk is not 1/k, the convergence was not analyzed.
Next, we show that this algorithm is a special case of the RM algorithm.
Then, its convergence naturally follows.

Shiyu Zhao 32 / 65
Robbins-Monro algorithm – Apply to mean estimation

1) Consider a function:
$$g(w) \doteq w - E[X].$$
Our aim is to solve $g(w) = 0$. If we can do that, then we can obtain $E[X]$.
2) The observation we can get is
$$\tilde{g}(w, x) \doteq w - x,$$
because we can only obtain samples of $X$. Note that
$$\tilde{g}(w, \eta) = w - x = w - x + E[X] - E[X] = (w - E[X]) + (E[X] - x) \doteq g(w) + \eta.$$
3) The RM algorithm for solving $g(w) = 0$ is
$$w_{k+1} = w_k - \alpha_k \tilde{g}(w_k, \eta_k) = w_k - \alpha_k (w_k - x_k),$$

which is exactly the mean estimation algorithm.


The convergence naturally follows.
Shiyu Zhao 33 / 65
Dvoretzky's convergence theorem (optional)

Theorem (Dvoretzky’s Theorem)


Consider a stochastic process

wk+1 = (1 − αk )wk + βk ηk ,

where $\{\alpha_k\}_{k=1}^{\infty}$, $\{\beta_k\}_{k=1}^{\infty}$, $\{\eta_k\}_{k=1}^{\infty}$ are stochastic sequences. Here $\alpha_k \geq 0$, $\beta_k \geq 0$ for all $k$. Then, $w_k$ converges to zero with probability 1 if the following conditions are satisfied:
1) $\sum_{k=1}^{\infty} \alpha_k = \infty$, $\sum_{k=1}^{\infty} \alpha_k^2 < \infty$, and $\sum_{k=1}^{\infty} \beta_k^2 < \infty$ uniformly w.p.1;
2) $E[\eta_k \mid \mathcal{H}_k] = 0$ and $E[\eta_k^2 \mid \mathcal{H}_k] \leq C$ w.p.1;


where Hk = {wk , wk−1 , . . . , ηk−1 , . . . , αk−1 , . . . , βk−1 , . . . }.

• A more general result than the RM theorem. It can be used to prove


the RM theorem
• It can also directly analyze the mean estimation problem.
• An extension of it can be used to analyze Q-learning and TD learning
algorithms.
Shiyu Zhao 34 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 35 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 36 / 65
Stochastic gradient descent

Next, we introduce stochastic gradient descent (SGD) algorithms:


• SGD is widely used in the field of machine learning and also in RL.
• SGD is a special RM algorithm.
• The mean estimation algorithm is a special SGD algorithm.

Suppose we aim to solve the following optimization problem:

min J(w) = E[f (w, X)]


w

• w is the parameter to be optimized.


• X is a random variable. The expectation is with respect to X.
• w and X can be either scalars or vectors. The function f (·) is a scalar.

Shiyu Zhao 37 / 65
Stochastic gradient descent

Method 1: gradient descent (GD)

wk+1 = wk − αk ∇w E[f (wk , X)] = wk − αk E[∇w f (wk , X)]

Drawback: the expected value is difficult to obtain.

Method 2: batch gradient descent (BGD)


$$E[\nabla_w f(w_k, X)] \approx \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$

Drawback: it requires many samples in each iteration for each wk .

Shiyu Zhao 38 / 65
Stochastic gradient descent – Algorithm

Method 3: stochastic gradient descent (SGD)

wk+1 = wk − αk ∇w f (wk , xk ),

• Compared to the gradient descent method: Replace the true gradient


E[∇w f (wk , X)] by the stochastic gradient ∇w f (wk , xk ).
• Compared to the batch gradient descent method: let n = 1.

Shiyu Zhao 39 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 40 / 65
Stochastic gradient descent – Example and application

We next consider an example:


 
$$\min_w J(w) = E[f(w, X)] = E\left[\frac{1}{2}\|w - X\|^2\right],$$
where
$$f(w, X) = \frac{1}{2}\|w - X\|^2, \qquad \nabla_w f(w, X) = w - X.$$

Exercises:

• Exercise 1: Show that the optimal solution is w∗ = E[X].
• Exercise 2: Write out the GD algorithm for solving this problem.
• Exercise 3: Write out the SGD algorithm for solving this problem.

Shiyu Zhao 41 / 65
Stochastic gradient descent – Example and application

Answer:
• The GD algorithm for solving the above problem is

wk+1 = wk − αk ∇w J(wk )
= wk − αk E[∇w f (wk , X)]
= wk − αk E[wk − X].

• The SGD algorithm for solving the above problem is

wk+1 = wk − αk ∇w f (wk , xk ) = wk − αk (wk − xk )

• Note:
• It is the same as the mean estimation algorithm we presented before.
• That mean estimation algorithm is a special SGD algorithm.
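
A minimal numerical sketch of the SGD iteration above for mean estimation in the plane (the sample distribution, the initial guess, and the step sizes are arbitrary choices):

```python
import numpy as np

# SGD iteration w_{k+1} = w_k - alpha_k (w_k - x_k) for minimizing E[||w - X||^2 / 2];
# the optimal solution is w* = E[X].
rng = np.random.default_rng(0)
samples = rng.uniform(-10, 10, size=(5000, 2))   # iid samples of X in R^2, E[X] = (0, 0)

w = np.array([20.0, -15.0])                      # initial guess far from the mean
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)                # stochastic gradient of ||w - x||^2 / 2 is (w - x)

print(w, samples.mean(axis=0))                   # both are close to E[X] = (0, 0)
```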

Shiyu Zhao 42 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 43 / 65
Stochastic gradient descent – Convergence

From GD to SGD:

wk+1 = wk −αk E[∇w f (wk , X)]



wk+1 = wk −αk ∇w f (wk , xk )

∇w f (wk , xk ) can be viewed as a noisy measurement of E[∇w f (w, X)]:

$$\nabla_w f(w_k, x_k) = E[\nabla_w f(w_k, X)] + \underbrace{\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]}_{\eta}.$$

Since
$$\nabla_w f(w_k, x_k) \neq E[\nabla_w f(w_k, X)],$$
does $w_k \to w^*$ as $k \to \infty$ under SGD?

Shiyu Zhao 44 / 65
Stochastic gradient descent – Convergence

We next show that SGD is a special RM algorithm. Then, the


convergence naturally follows.

The aim of SGD is to minimize

J(w) = E[f (w, X)]

This problem can be converted to a root-finding problem:

∇w J(w) = E[∇w f (w, X)] = 0

Let
g(w) = ∇w J(w) = E[∇w f (w, X)].

Then, the aim of SGD is to find the root of g(w) = 0.

Shiyu Zhao 45 / 65
Stochastic gradient descent – Convergence

What we can measure is

$$\tilde{g}(w, \eta) = \nabla_w f(w, x) = \underbrace{E[\nabla_w f(w, X)]}_{g(w)} + \underbrace{\nabla_w f(w, x) - E[\nabla_w f(w, X)]}_{\eta}.$$

Then, the RM algorithm for solving g(w) = 0 is

wk+1 = wk − ak g̃(wk , ηk ) = wk − ak ∇w f (wk , xk ).

• It is exactly the SGD algorithm.


• Therefore, SGD is a special RM algorithm.

Shiyu Zhao 46 / 65
Stochastic gradient descent – Convergence

Since SGD is a special RM algorithm, its convergence naturally follows.

Theorem (Convergence of SGD)

In the SGD algorithm, if


1) $0 < c_1 \leq \nabla_w^2 f(w, X) \leq c_2$;
2) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
3) $\{x_k\}_{k=1}^{\infty}$ is iid;

then wk converges to the root of ∇w E[f (w, X)] = 0 with probability 1.

For the proof see the book.

Shiyu Zhao 47 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 48 / 65
Stochastic gradient descent – Convergence pattern

Question: Since the stochastic gradient is random and hence the


approximation is inaccurate, is the convergence of SGD slow or random?
To answer this question, we consider the relative error between the
stochastic and batch gradients:

$$\delta_k \doteq \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w f(w_k, X)]|}.$$
Since $E[\nabla_w f(w^*, X)] = 0$, we further have
$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w f(w_k, X)] - E[\nabla_w f(w^*, X)]|} = \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{|E[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]|},$$
where the last equality is due to the mean value theorem and
w̃k ∈ [wk , w∗ ].

Shiyu Zhao 49 / 65
Stochastic gradient descent – Convergence pattern

Suppose f is strictly convex such that

∇2w f ≥ c > 0

for all w, X, where c is a positive bound.


Then, the denominator of δk becomes

$$\big|E[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]\big| = \big|E[\nabla_w^2 f(\tilde{w}_k, X)]\,(w_k - w^*)\big| = E[\nabla_w^2 f(\tilde{w}_k, X)]\,|w_k - w^*| \geq c\,|w_k - w^*|.$$
Substituting the above inequality into $\delta_k$ gives
$$\delta_k \leq \frac{|\nabla_w f(w_k, x_k) - E[\nabla_w f(w_k, X)]|}{c\,|w_k - w^*|}.$$

Shiyu Zhao 50 / 65
Stochastic gradient descent – Convergence pattern

Note that
$$\delta_k \leq \frac{\Big|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{E[\nabla_w f(w_k, X)]}^{\text{true gradient}}\Big|}{\underbrace{c\,|w_k - w^*|}_{\text{distance to the optimal solution}}}.$$

The above equation suggests an interesting convergence pattern of SGD.


• The relative error δk is inversely proportional to |wk − w∗ |.
• When |wk − w∗ | is large, δk is small and SGD behaves like GD.
• When wk is close to w∗ , the relative error may be large and the
convergence exhibits more randomness in the neighborhood of w∗ .

Shiyu Zhao 51 / 65
Stochastic gradient descent – Convergence pattern

Consider an illustrative example:


Setup: X ∈ R2 represents a random position in the plane. Its
distribution is uniform in the square area centered at the origin with the
side length as 20. The true mean is E[X] = 0. The mean estimation is
based on 100 iid samples $\{x_i\}_{i=1}^{100}$.

Shiyu Zhao 52 / 65
Stochastic gradient descent – Convergence pattern

Result:
[Figure: the sample data and the estimation trajectories of SGD (m = 1) and MBGD (m = 5, 50) in the plane (left), and the distance to the true mean versus the iteration step (right).]

• Although the initial guess of the mean is far away from the true value,
the SGD estimate can approach the neighborhood of the true value
fast.
• When the estimate is close to the true value, it exhibits certain
randomness but still approaches the true value gradually.
Shiyu Zhao 53 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 54 / 65
Stochastic gradient descent – A deterministic formulation

• The formulation of SGD we introduced above involves random


variables and expectation.
• One may often encounter a deterministic formulation of SGD without
involving any random variables.

Consider the optimization problem:


$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i),$$

• f (w, xi ) is a parameterized function.


• w is the parameter to be optimized.
• $\{x_i\}_{i=1}^n$ is a set of real numbers, where $x_i$ does not have to be a sample
of any random variable.

Shiyu Zhao 55 / 65
Stochastic gradient descent – A deterministic formulation

The gradient descent algorithm for solving this problem is


$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k) = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i).$$

Suppose the set is large and we can only fetch a single number every
time. In this case, we can use the following iterative algorithm:

wk+1 = wk − αk ∇w f (wk , xk ).

Questions:
• Is this algorithm SGD? It does not involve any random variables or
expected values.
• How should we use the finite set of numbers {xi }ni=1 ? Should we sort
these numbers in a certain order and then use them one by one? Or
should we randomly sample a number from the set?
Shiyu Zhao 56 / 65
Stochastic gradient descent – A deterministic formulation

A quick answer to the above questions is that we can introduce a random


variable manually and convert the deterministic formulation to the
stochastic formulation of SGD.
In particular, suppose X is a random variable defined on the set {xi }ni=1 .
Suppose its probability distribution is uniform such that

p(X = xi ) = 1/n

Then, the deterministic optimization problem becomes a stochastic one:


$$\min_w J(w) = \frac{1}{n}\sum_{i=1}^{n} f(w, x_i) = E[f(w, X)].$$
• The last equality in the above equation holds exactly rather than approximately.
Therefore, the algorithm is SGD.
• The estimate converges if $x_k$ is uniformly and independently sampled from
$\{x_i\}_{i=1}^n$. Note that $x_k$ may repeatedly take the same value in $\{x_i\}_{i=1}^n$ since it is
sampled randomly.
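
A minimal sketch of this point (the data set and the step sizes are arbitrary choices): sampling uniformly with replacement from a finite set turns the deterministic average cost into $E[f(w, X)]$, so the iteration below is a genuine SGD algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100)        # the finite set {x_i}, n = 100

w = 0.0
for k in range(1, 5001):
    x_k = data[rng.integers(len(data))]                # uniform, independent draw; repeats allowed
    w = w - (1.0 / k) * (w - x_k)                      # SGD step for f(w, x) = (w - x)^2 / 2

print(w, data.mean())                                  # w approaches the data average
```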
Shiyu Zhao 57 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 58 / 65
BGD, MBGD, and SGD

Suppose we would like to minimize $J(w) = E[f(w, X)]$ given a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The BGD, MBGD, and SGD algorithms solving this problem are, respectively,
$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n} \nabla_w f(w_k, x_i), \qquad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j\in \mathcal{I}_k} \nabla_w f(w_k, x_j), \qquad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k). \qquad \text{(SGD)}$$

• In the BGD algorithm, all the samples are used in every iteration. When $n$ is large, $\frac{1}{n}\sum_{i=1}^{n}\nabla_w f(w_k, x_i)$ is close to the true gradient $E[\nabla_w f(w_k, X)]$.
• In the MBGD algorithm, $\mathcal{I}_k$ is a subset of $\{1, \dots, n\}$ of size $|\mathcal{I}_k| = m$. The set $\mathcal{I}_k$ is obtained by $m$ iid samplings.
• In the SGD algorithm, $x_k$ is randomly sampled from $\{x_i\}_{i=1}^n$ at time $k$.
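
A minimal sketch (not the book's experiment) comparing the three algorithms on the mean-estimation problem; the data, step sizes, and batch sizes are arbitrary choices.

```python
import numpy as np

# Compare mini-batch sizes for min_w (1/2n) sum_i ||w - x_i||^2 with alpha_k = 1/k.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=(100, 2))                # 100 random points in the plane

def run(batch_size, n_iters=30):
    w = np.array([20.0, 20.0])
    for k in range(1, n_iters + 1):
        idx = rng.integers(len(x), size=batch_size)    # mini-batch drawn with replacement
        grad = (w - x[idx]).mean(axis=0)               # average of the gradients (w - x_j) over the batch
        w = w - (1.0 / k) * grad
    return np.linalg.norm(w - x.mean(axis=0))          # distance to the true mean

for m in (1, 5, 50, 100):                              # m = 1 is SGD; larger m behaves like BGD
    print(m, run(m))
```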
Shiyu Zhao 59 / 65
BGD, MBGD, and SGD

Compare MBGD with BGD and SGD:


• Compared to SGD, MBGD has less randomness because it uses more
samples instead of just one as in SGD.
• Compared to BGD, MBGD does not require using all the samples in
every iteration, making it more flexible and efficient.
• If m = 1, MBGD becomes SGD.
• If m = n, MBGD does NOT become BGD strictly speaking because
MBGD uses randomly fetched n samples whereas BGD uses all n
numbers. In particular, MBGD may use a value in {xi }ni=1 multiple
times whereas BGD uses each number once.

Shiyu Zhao 60 / 65
BGD, MBGD, and SGD – Illustrative examples

Given some numbers $\{x_i\}_{i=1}^n$, our aim is to calculate the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. This problem can be equivalently stated as the following optimization problem:
$$\min_w J(w) = \frac{1}{2n}\sum_{i=1}^{n} \|w - x_i\|^2.$$

The three algorithms for solving this problem are, respectively,
$$w_{k+1} = w_k - \alpha_k \frac{1}{n}\sum_{i=1}^{n}(w_k - x_i) = w_k - \alpha_k (w_k - \bar{x}), \qquad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m}\sum_{j\in \mathcal{I}_k}(w_k - x_j) = w_k - \alpha_k \big(w_k - \bar{x}_k^{(m)}\big), \qquad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k (w_k - x_k), \qquad \text{(SGD)}$$
where $\bar{x}_k^{(m)} = \frac{1}{m}\sum_{j\in \mathcal{I}_k} x_j$.
Shiyu Zhao 61 / 65
BGD, MBGD, and SGD

Furthermore, if αk = 1/k, the above equation can be solved as


$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} \bar{x} = \bar{x}, \qquad \text{(BGD)}$$
$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} \bar{x}_j^{(m)}, \qquad \text{(MBGD)}$$
$$w_{k+1} = \frac{1}{k}\sum_{j=1}^{k} x_j. \qquad \text{(SGD)}$$

• The estimate of BGD at each step is exactly the optimal solution $w^* = \bar{x}$.
• The estimate of MBGD approaches the mean faster than SGD because $\bar{x}_k^{(m)}$ is already an average.

Shiyu Zhao 62 / 65
BGD, MBGD, and SGD

Let αk = 1/k. Given 100 points, using different mini-batch sizes leads to
different convergence speed.

[Figure: the sample data and the estimation trajectories of SGD (m = 1) and MBGD (m = 5, 50) in the plane (left), and the distance to the true mean versus the iteration step (right).]

Figure: An illustrative example for mean estimation by different GD algorithms.

Shiyu Zhao 63 / 65
Outline
1 Motivating examples
2 Robbins-Monro algorithm
Algorithm description
Illustrative examples
Convergence analysis
Application to mean estimation
3 Stochastic gradient descent
Algorithm description
Examples and application
Convergence analysis
Convergence pattern
A deterministic formulation
4 BGD, MBGD, and SGD
5 Summary
Shiyu Zhao 64 / 65
Summary

• Mean estimation: compute E[X] using {xk }


$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k).$$
• RM algorithm: solve g(w) = 0 using {g̃(wk , ηk )}

wk+1 = wk − ak g̃(wk , ηk )

• SGD algorithm: minimize J(w) = E[f (w, X)] using {∇w f (wk , xk )}

wk+1 = wk − αk ∇w f (wk , xk ),

These results are useful:


• We will see in the next chapter that the temporal-difference learning
algorithms can be viewed as stochastic approximation algorithms and
hence have similar expressions.
• They are important optimization techniques that can be applied to
many other fields.
Shiyu Zhao 65 / 65
Lecture 8: Value Function Approximation

Shiyu Zhao
Introduction

[Figure: the map of this book. Chapters 2 and 3 (Bellman equation and Bellman optimality equation) are fundamental tools; Chapters 4 to 7 (value iteration and policy iteration, Monte Carlo learning, stochastic approximation, temporal-difference learning) use the tabular representation; Chapters 8 to 10 (value function approximation, policy function approximation / policy gradient, actor-critic methods) move from the tabular representation to the function representation.]
Shiyu Zhao 1 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 2 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 3 / 70
Motivating examples: curve fitting

So far in this book, state and action values are represented by tables.

• For example, action value:

a1 a2 a3 a4 a5
s1 qπ (s1 , a1 ) qπ (s1 , a2 ) qπ (s1 , a3 ) qπ (s1 , a4 ) qπ (s1 , a5 )
. . . . . .
. . . . . .
. . . . . .
s9 qπ (s9 , a1 ) qπ (s9 , a2 ) qπ (s9 , a3 ) qπ (s9 , a4 ) qπ (s9 , a5 )

• Advantage: intuitive and easy to analyze


• Disadvantage: difficult to handle large or continuous state or action
spaces. Two aspects: 1) storage; 2) generalization ability

Shiyu Zhao 4 / 70
Motivating examples: curve fitting

Consider an example:
• Suppose there are one-dimensional states s1 , . . . , s|S| .
• Their state values are vπ (s1 ), . . . , vπ (s|S| ), where π is a given policy.
• Suppose |S| is very large and we hope to use a simple curve to
approximate these dots to save storage.

Figure: An illustration of function approximation of samples.

Shiyu Zhao 5 / 70
Motivating examples: curve fitting

First, we use the simplest straight line to fit the dots.


Suppose the equation of the straight line is
" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w
|{z} b
φT (s) | {z }
w

where
• w is the parameter vector
• φ(s) is the feature vector of s
• v̂(s, w) is linear in w

Shiyu Zhao 6 / 70
Motivating examples: curve fitting

" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w,
|{z} b
φT (s) | {z }
w

The benefits:
• The tabular representation needs to store |S| state values. Now, we
only need to store two parameters a and b.
• Every time we would like to use the value of s, we can calculate
φT(s)w.
• Such a benefit is not free. It comes with a cost: the state values cannot
be represented accurately. This is why this method is called value
approximation.

Shiyu Zhao 7 / 70
Motivating examples: curve fitting

Second, we can also fit the points using a second-order curve:


 
$$\hat{v}(s, w) = as^2 + bs + c = \underbrace{[s^2, s, 1]}_{\phi^T(s)} \underbrace{\begin{bmatrix} a \\ b \\ c \end{bmatrix}}_{w} = \phi^T(s) w.$$

In this case,
• The dimensions of w and φ(s) increase, but the values may be fitted
more accurately.
• Although v̂(s, w) is nonlinear in s, it is linear in w. The nonlinearity is
contained in φ(s).
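
A minimal sketch of the curve-fitting idea with synthetic values (the data below are made up, only to illustrate how the feature vectors are built and why more features fit better):

```python
import numpy as np

# Approximate "state values" v(s) over one-dimensional states s by
# v_hat(s, w) = phi(s)^T w via least squares, for two feature choices.
s = np.arange(1, 51, dtype=float)                      # states s_1, ..., s_50
v = 0.05 * (s - 25) ** 2 + np.random.default_rng(0).normal(scale=0.5, size=s.size)

Phi1 = np.column_stack([s, np.ones_like(s)])           # phi(s) = [s, 1]^T      -> straight line
Phi2 = np.column_stack([s**2, s, np.ones_like(s)])     # phi(s) = [s^2, s, 1]^T -> quadratic curve

w1, *_ = np.linalg.lstsq(Phi1, v, rcond=None)          # best (a, b) for v_hat = a s + b
w2, *_ = np.linalg.lstsq(Phi2, v, rcond=None)

print(np.mean((Phi1 @ w1 - v) ** 2))                   # larger fitting error (2 parameters)
print(np.mean((Phi2 @ w2 - v) ** 2))                   # smaller fitting error (3 parameters)
```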

Shiyu Zhao 8 / 70
Motivating examples: curve fitting

Third, we can use even higher-order polynomial curves or other


complex curves to fit the dots.

• Advantage: It can approximate the values more accurately.


• Disadvantage: It needs more parameters.

Shiyu Zhao 9 / 70
Motivating examples: curve fitting

Quick summary:
• Idea: Approximate the state and action values using parameterized
functions: v̂(s, w) ≈ vπ (s) where w ∈ Rm is the parameter vector.
• Key difference: How to access and assign the value of v(s)
• Advantage:
1) Storage: The dimension of w may be much less than |S|.
2) Generalization: When a state s is visited, the parameter w is
updated so that the values of some other unvisited states can also
be updated.

Shiyu Zhao 10 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 11 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 12 / 70
Objective function

Introduce in a more formal way:


• Let vπ (s) and v̂(s, w) be the true state value and a function for
approximation.
• Our goal is to find an optimal w so that v̂(s, w) can best approximate
vπ (s) for every s.
• This is a policy evaluation problem. Later we will extend to policy
improvement.
• To find the optimal w, we need two steps.
• The first step is to define an objective function.
• The second step is to derive algorithms optimizing the objective
function.

Shiyu Zhao 13 / 70
Objective function

The objective function is

J(w) = E[(vπ (S) − v̂(S, w))2 ].

• Our goal is to find the best w that can minimize J(w).


• The expectation is with respect to the random variable S ∈ S. What is
the probability distribution of S?
• This is often confusing because we have not discussed the
probability distribution of states so far in this book.
• There are several ways to define the probability distribution of S.

Shiyu Zhao 14 / 70
Objective function

The first way is to use a uniform distribution.


• That is to treat all the states to be equally important by setting the
probability of each state as 1/|S|.
• In this case, the objective function becomes
$$J(w) = E[(v_\pi(S) - \hat{v}(S, w))^2] = \frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}} (v_\pi(s) - \hat{v}(s, w))^2.$$

• Drawback:
• The states may not be equally important. For example, some states
may be rarely visited by a policy. Hence, this way does not consider
the real dynamics of the Markov process under the given policy.

Shiyu Zhao 15 / 70
Objective function

The second way is to use the stationary distribution.


• Stationary distribution is an important concept that will be frequently
used in this course. In short, it describes the long-run behavior of a
Markov process.
• Let $\{d_\pi(s)\}_{s\in\mathcal{S}}$ denote the stationary distribution of the Markov
process under policy $\pi$. By definition, $d_\pi(s) \geq 0$ and $\sum_{s\in\mathcal{S}} d_\pi(s) = 1$.
• The objective function can be rewritten as
$$J(w) = E[(v_\pi(S) - \hat{v}(S, w))^2] = \sum_{s\in\mathcal{S}} d_\pi(s)\,(v_\pi(s) - \hat{v}(s, w))^2.$$

This function is a weighted squared error.


• Since more frequently visited states have higher values of dπ (s), their
weights in the objective function are also higher than those rarely
visited states.
Shiyu Zhao 16 / 70
Objective function – Stationary distribution

More explanation about stationary distribution:


• Distribution: Distribution of the state
• Stationary: Long-run behavior
• Summary: after the agent runs a long time following a policy, the
probability that the agent is at any state can be described by this
distribution.
Remarks:
• Stationary distribution is also called steady-state distribution, or
limiting distribution.
• It is critical to understand the value function approximation method.
• It is also important for the policy gradient method in the next lecture.

Shiyu Zhao 17 / 70
Objective function - Stationary distribution

Illustrative example:
• Given a policy shown in the figure.
• Let nπ (s) denote the number of times that s has been visited in a very
long episode generated by π.
• Then, dπ (s) can be approximated by
$$d_\pi(s) \approx \frac{n_\pi(s)}{\sum_{s'\in\mathcal{S}} n_\pi(s')}.$$
[Figure: a small grid world with a given policy (left), and the percentage of times each state s1, . . . , s4 is visited versus the step index of a long episode (right); the percentages converge to the entries of dπ.]

Figure: Long-run behavior of an ε-greedy policy with ε = 0.5.
Shiyu Zhao 18 / 70
Objective function - Stationary distribution

The converged values can be predicted because they are the entries of dπ :

$$d_\pi^T = d_\pi^T P_\pi$$

For this example, we have Pπ as


 
$$P_\pi = \begin{bmatrix} 0.3 & 0.1 & 0.6 & 0 \\ 0.1 & 0.3 & 0 & 0.6 \\ 0.1 & 0 & 0.3 & 0.6 \\ 0 & 0.1 & 0.1 & 0.8 \end{bmatrix}.$$

It can be calculated that the left eigenvector for the eigenvalue of one is
$$d_\pi = [0.0345,\ 0.1084,\ 0.1330,\ 0.7241]^T.$$

A comprehensive introduction can be found in the book.
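
A quick numerical check of the above (a sketch using NumPy's eigendecomposition; the code is ours, not from the book):

```python
import numpy as np

# The stationary distribution is the left eigenvector of P_pi for eigenvalue 1,
# i.e. the vector d satisfying d^T P_pi = d^T.
P = np.array([[0.3, 0.1, 0.6, 0.0],
              [0.1, 0.3, 0.0, 0.6],
              [0.1, 0.0, 0.3, 0.6],
              [0.0, 0.1, 0.1, 0.8]])

eigvals, eigvecs = np.linalg.eig(P.T)        # left eigenvectors of P = right eigenvectors of P^T
i = np.argmin(np.abs(eigvals - 1.0))         # pick the eigenvalue closest to 1
d = np.real(eigvecs[:, i])
d = d / d.sum()                              # normalize so that the entries sum to 1

print(d)                                     # approximately [0.0345, 0.1084, 0.1330, 0.7241]
print(np.allclose(d @ P, d))                 # verifies d^T P_pi = d^T
```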


Shiyu Zhao 19 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 20 / 70
Optimization algorithms

While we have the objective function, the next step is to optimize it.
• To minimize the objective function J(w), we can use the
gradient-descent algorithm:

wk+1 = wk − αk ∇w J(wk )

The true gradient is

∇w J(w) = ∇w E[(vπ (S) − v̂(S, w))2 ]


= E[∇w (vπ (S) − v̂(S, w))2 ]
= 2E[(vπ (S) − v̂(S, w))(−∇w v̂(S, w))]
= −2E[(vπ (S) − v̂(S, w))∇w v̂(S, w)]

The true gradient above involves the calculation of an expectation.

Shiyu Zhao 21 / 70
Optimization algorithms

We can use the stochastic gradient to replace the true gradient:

wt+1 = wt + αt (vπ (st ) − v̂(st , wt ))∇w v̂(st , wt ),

where st is a sample of S. Here, 2αk is merged to αk .


• This algorithm is not implementable because it requires the true state
value vπ , which is the unknown to be estimated.
• We can replace vπ (st ) with an approximation so that the algorithm is
implementable.

Shiyu Zhao 22 / 70
Optimization algorithms

In particular,
• First, Monte Carlo learning with function approximation
Let gt be the discounted return starting from st in the episode. Then,
gt can be used to approximate vπ (st ). The algorithm becomes

wt+1 = wt + αt (gt − v̂(st , wt ))∇w v̂(st , wt ).

• Second, TD learning with function approximation


By the spirit of TD learning, rt+1 + γv̂(st+1 , wt ) can be viewed as an
approximation of vπ (st ). Then, the algorithm becomes

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ).

Shiyu Zhao 23 / 70
Optimization algorithms

Pseudocode: TD learning with function approximation

Initialization: A function v̂(s, w) that is a differentiable in w. Initial pa-


rameter w0 .
Aim: Approximate the true state values of a given policy π.

For each episode generated following the policy π, do


For each step (st , rt+1 , st+1 ), do
In the general case,
wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )
In the linear case,
wt+1 = wt + αt [rt+1 + γφT(st+1)wt − φT(st)wt] φ(st)

This algorithm can only estimate the state values of a given policy, but it is important
for understanding the other algorithms introduced later.
Shiyu Zhao 24 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 25 / 70
Selection of function approximators

An important question that has not been answered: How to select the
function v̂(s, w)?
• The first approach, which was widely used before, is to use a linear
function
v̂(s, w) = φT (s)w

Here, φ(s) is the feature vector, which can be a polynomial basis,


Fourier basis, ... (see my book for details). We have seen in the
motivating example and will see again in the illustrative examples later.
• The second approach, which is widely used nowadays, is to use a
neural network as a nonlinear function approximator.
The input of the NN is the state, the output is v̂(s, w), and the
network parameter is w.

Shiyu Zhao 26 / 70
Linear function approximation

In the linear case where v̂(s, w) = φT (s)w, we have

∇w v̂(s, w) = φ(s).

Substituting the gradient into the TD algorithm

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )

yields

wt+1 = wt + αt [rt+1 + γφT(st+1)wt − φT(st)wt] φ(st),
which is the algorithm of TD learning with linear function approximation.


It is called TD-Linear in our course in short.
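
A minimal sketch of one TD-Linear update, assuming a state is a position pair (x, y) and the feature vector φ(s) = [1, x, y]^T used in the examples below; the sample values are hypothetical.

```python
import numpy as np

def phi(s):
    x, y = s
    return np.array([1.0, x, y])                       # phi(s) = [1, x, y]^T

def td_linear_step(w, s, r, s_next, alpha=0.0005, gamma=0.9):
    td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
    return w + alpha * td_error * phi(s)               # w_{t+1} = w_t + alpha_t [.] phi(s_t)

w = np.zeros(3)
w = td_linear_step(w, s=(2, 3), r=-1.0, s_next=(2, 4))  # one sample (s_t, r_{t+1}, s_{t+1})
print(w)
```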

Shiyu Zhao 27 / 70
Linear function approximation

• Disadvantages of linear function approximation:


• Difficult to select appropriate feature vectors.
• Advantages of linear function approximation:
• The theoretical properties of the TD algorithm in the linear case can
be much better understood than in the nonlinear case.
• Linear function approximation is still powerful in the sense that the
tabular representation is merely a special case of linear function
approximation.

Shiyu Zhao 28 / 70
Linear function approximation

We next show that the tabular representation is a special case of linear


function approximation.
• First, consider the special feature vector for state s:

φ(s) = es ∈ R|S| ,

where es is a vector with the sth entry as 1 and the others as 0.


• In this case,
v̂(s, w) = eTs w = w(s),

where w(s) is the sth entry of w.

Shiyu Zhao 29 / 70
Linear function approximation

Recall that the TD-Linear algorithm is

wt+1 = wt + αt [rt+1 + γφT(st+1)wt − φT(st)wt] φ(st),
 

• When φ(st ) = es , the above algorithm becomes

wt+1 = wt + αt (rt+1 + γwt (st+1 ) − wt (st )) est .

This is a vector equation that merely updates the st th entry of wt .


• Multiplying $e_{s_t}^T$ on both sides of the equation gives

wt+1 (st ) = wt (st ) + αt (rt+1 + γwt (st+1 ) − wt (st )) ,

which is exactly the tabular TD algorithm.

Shiyu Zhao 30 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 31 / 70
Illustrative examples

Consider a 5x5 grid-world example:


• Given a policy: π(a|s) = 0.2 for any s, a
• Our aim is to estimate the state values of this policy (policy evaluation
problem).
• There are 25 state values in total. We next show that we can use less
than 25 parameters to approximate these state values.
• Set rforbidden = rboundary = −1, rtarget = 1, and γ = 0.9.
[Figure: the 5×5 grid world used in this example.]

Shiyu Zhao 32 / 70
Illustrative examples

Ground truth:
• The true state values and the 3D visualization
[Figure: the true state values (listed below) and their 3D visualization.]
Row 1: -3.8, -3.8, -3.6, -3.1, -3.2
Row 2: -3.8, -3.8, -3.8, -3.1, -2.9
Row 3: -3.6, -3.9, -3.4, -3.2, -2.9
Row 4: -3.9, -3.6, -3.4, -2.9, -3.2
Row 5: -4.5, -4.2, -3.4, -3.4, -3.5

Experience samples:
• 500 episodes were generated following the given policy.
• Each episode has 500 steps and starts from a randomly selected
state-action pair following a uniform distribution.
Shiyu Zhao 33 / 70
Illustrative examples

For comparison, the results given by the tabular TD algorithm (called


TD-Table in short):
[Figure: the state value error (RMSE) of the TD-Table algorithm with α = 0.005 versus the episode index.]

Shiyu Zhao 34 / 70
Illustrative examples

We next show the results by the TD-Linear algorithm.


Feature vector selection:
 
$$\phi(s) = \begin{bmatrix} 1 \\ x \\ y \end{bmatrix} \in \mathbb{R}^3.$$

In this case, the approximated state value is

$$\hat{v}(s, w) = \phi^T(s) w = [1, x, y]\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = w_1 + w_2 x + w_3 y.$$

Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of
the elements does not matter.

Shiyu Zhao 35 / 70
Illustrative examples

Results by the TD-Linear algorithm:

[Figure: the state value error (RMSE) of TD-Linear with α = 0.0005 versus the episode index.]

• The trend is right, but there are errors due to limited approximation
ability!
• We are trying to use a plane to approximate a non-plane surface!

Shiyu Zhao 36 / 70
Illustrative examples

To enhance the approximation ability, we can use high-order feature


vectors and hence more parameters.
• For example, we can consider

φ(s) = [1, x, y, x2 , y 2 , xy]T ∈ R6 .

In this case,

v̂(s, w) = φT (s)w = w1 + w2 x + w3 y + w4 x2 + w5 y 2 + w6 xy

which corresponds to a quadratic surface.


• We can further increase the dimension of the feature vector:

φ(s) = [1, x, y, x2 , y 2 , xy, x3 , y 3 , x2 y, xy 2 ]T ∈ R10 .

Shiyu Zhao 37 / 70
Illustrative examples

Results by the TD-Linear algorithm with higher-order feature vectors:


[Figure: the state value error (RMSE) of TD-Linear with α = 0.0005 versus the episode index, for φ(s) ∈ R6 (top) and φ(s) ∈ R10 (bottom).]

More examples and types of features are given in the book.


Shiyu Zhao 38 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 39 / 70
Summary of the story

Up to now, we finished the story of TD learning with value function


approximation.
1) This story started from the objective function:

J(w) = E[(vπ (S) − v̂(S, w))2 ].

The objective function suggests that it is a policy evaluation


problem.
2) The gradient-descent algorithm is

wt+1 = wt + αt (vπ (st ) − v̂(st , wt ))∇w v̂(st , wt ),

3) The true value function, which is unknown, in the algorithm is


replaced by an approximation, leading to the algorithm:

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ).

Although this story is helpful to understand the basic idea, it is not


mathematically rigorous.
Shiyu Zhao 40 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 41 / 70
Theoretical analysis

• The algorithm

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt )

does not minimize the following objective function:

J(w) = E[(vπ (S) − v̂(S, w))2 ]

Shiyu Zhao 42 / 70
Theoretical analysis

Different objective functions:


• Objective function 1: True value error
$$J_E(w) = E[(v_\pi(S) - \hat{v}(S, w))^2] = \|\hat{v}(w) - v_\pi\|_D^2$$
• Objective function 2: Bellman error
$$J_{BE}(w) \doteq \|\hat{v}(w) - (r_\pi + \gamma P_\pi \hat{v}(w))\|_D^2 = \|\hat{v}(w) - T_\pi(\hat{v}(w))\|_D^2,$$
where $T_\pi(x) \doteq r_\pi + \gamma P_\pi x$

Shiyu Zhao 43 / 70
Theoretical analysis

• Objective function 3: Projected Bellman error

$$J_{PBE}(w) = \|\hat{v}(w) - M T_\pi(\hat{v}(w))\|_D^2,$$

where M is a projection matrix.

The TD-Linear algorithm minimizes the projected Bellman error.


Details can be found in the book.

Shiyu Zhao 44 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 45 / 70
Sarsa with function approximation

So far, we merely considered the problem of state value estimation. That


is we hope
v̂ ≈ vπ

To search for optimal policies, we need to estimate action values.

The Sarsa algorithm with value function approximation is


$$w_{t+1} = w_t + \alpha_t \big[r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t)\big] \nabla_w \hat{q}(s_t, a_t, w_t).$$

This is the same as the algorithm we introduced previously in this


lecture except that v̂ is replaced by q̂.

Shiyu Zhao 46 / 70
Sarsa with function approximation
To search for optimal policies, we can combine policy evaluation and
policy improvement.
Pseudocode: Sarsa with function approximation

Aim: Search a policy that can lead the agent to the target from an initial state-
action pair (s0 , a0 ).

For each episode, do


If the current st is not the target state, do
Take action at following πt (st ), generate rt+1 , st+1 , and then take
action at+1 following πt (st+1 )
Value update (parameter update):
wt+1 = wt + αt [rt+1 + γ q̂(st+1 , at+1 , wt ) − q̂(st , at , wt )] ∇w q̂(st , at , wt )
Policy update:
πt+1 (a|st ) = 1 − (ε/|A(st )|)(|A(st )| − 1) if a = arg maxa∈A(st ) q̂(st , a, wt+1 )
πt+1 (a|st ) = ε/|A(st )| otherwise
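
A minimal Python sketch of the value-update and policy-update steps above, assuming a linear approximator q̂(s, a, w) = φ(s, a)ᵀw; the feature vectors and sample values are hypothetical.

```python
import numpy as np

def q_hat(phi_sa, w):
    return phi_sa @ w                                   # linear approximation q_hat(s, a, w)

def sarsa_update(w, phi_t, r, phi_t1, alpha=0.001, gamma=0.9):
    td_error = r + gamma * q_hat(phi_t1, w) - q_hat(phi_t, w)
    return w + alpha * td_error * phi_t                 # grad_w q_hat = phi(s, a) in the linear case

def epsilon_greedy(q_values, epsilon=0.1):
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] = 1 - epsilon / n * (n - 1)
    return probs

# One illustrative update with made-up feature vectors for (s_t, a_t) and (s_{t+1}, a_{t+1}).
w = np.zeros(4)
phi_t  = np.array([1.0, 0.2, 0.3, 0.06])
phi_t1 = np.array([1.0, 0.2, 0.4, 0.08])
w = sarsa_update(w, phi_t, r=-1.0, phi_t1=phi_t1)
print(w)
```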
Shiyu Zhao 47 / 70
Sarsa with function approximation

Illustrative example:
• Sarsa with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.

[Figure: the total reward and the episode length of each episode during training (left), and the resulting trajectory in the 5×5 grid world (right).]

For details, please see the book.

Shiyu Zhao 48 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 49 / 70
Q-learning with function approximation

Similar to Sarsa, tabular Q-learning can also be extended to the case of


value function approximation.

The q-value update rule is


$$w_{t+1} = w_t + \alpha_t \Big[r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t)\Big] \nabla_w \hat{q}(s_t, a_t, w_t),$$

which is the same as Sarsa except that q̂(st+1 , at+1 , wt ) is replaced by


maxa∈A(st+1 ) q̂(st+1 , a, wt ).

Shiyu Zhao 50 / 70
Q-learning with function approximation

Pseudocode: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter vector w0 . Initial policy π0 . Small ε > 0.


Aim: Search a good policy that can lead the agent to the target from an initial
state-action pair (s0 , a0 ).

For each episode, do


If the current st is not the target state, do
Take action at following πt (st ), and generate rt+1 , st+1
Value update (parameter update):
wt+1 = wt + αt [rt+1 + γ maxa∈A(st+1 ) q̂(st+1 , a, wt ) − q̂(st , at , wt )] ∇w q̂(st , at , wt )
Policy update:
πt+1 (a|st ) = 1 − (ε/|A(st )|)(|A(st )| − 1) if a = arg maxa∈A(st ) q̂(st , a, wt+1 )
πt+1 (a|st ) = ε/|A(st )| otherwise

Shiyu Zhao 51 / 70
Q-learning with function approximation

Illustrative example:
• Q-learning with linear function approximation.
• γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, α = 0.001.

1 2 3 4 5
0
Total reward

-2000
2
-4000
0 100 200 300 400 500
3
Episode length

1000 4

5
0
0 100 200 300 400 500
Episode index

Shiyu Zhao 52 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 53 / 70
Deep Q-learning

Deep Q-learning or deep Q-network (DQN):


• One of the earliest and most successful algorithms that introduced deep
neural networks into RL.
• The role of the neural network is to be a nonlinear function approximator.
• It is different from the following algorithm:

wt+1 = wt + αt [rt+1 + γ maxa∈A(st+1 ) q̂(st+1 , a, wt ) − q̂(st , at , wt )] ∇w q̂(st , at , wt )

because of the way the network is trained.

Shiyu Zhao 54 / 70
Deep Q-learning

Deep Q-learning aims to minimize the objective function/loss function:

J(w) = E[(R + γ maxa∈A(S′) q̂(S′ , a, w) − q̂(S, A, w))²],

where (S, A, R, S 0 ) are random variables.


• This is actually the Bellman optimality error. That is because

q(s, a) = E[Rt+1 + γ maxa∈A(St+1 ) q(St+1 , a) | St = s, At = a],   ∀s, a.

The value of R + γ maxa∈A(S′) q̂(S′ , a, w) − q̂(S, A, w) should be zero
in the expectation sense.

Shiyu Zhao 55 / 70
Deep Q-learning

How to minimize the objective function? Gradient-descent!


• How to calculate the gradient of the objective function? Tricky!
• That is because, in this objective function

J(w) = E[(R + γ maxa∈A(S′) q̂(S′ , a, w) − q̂(S, A, w))²],

the parameter w not only appears in q̂(S, A, w) but also in

y ≐ R + γ maxa∈A(S′) q̂(S′ , a, w).

• For the sake of simplicity, we can assume that w in y is fixed (at least
for a while) when we calculate the gradient.

Shiyu Zhao 56 / 70
Deep Q-learning

To do that, we can introduce two networks.


• One is a main network representing q̂(s, a, w)
• The other is a target network q̂(s, a, wT ).
The objective function in this case degenerates to

J = E[(R + γ maxa∈A(S′) q̂(S′ , a, wT ) − q̂(S, A, w))²],

where wT is the target network parameter.

Shiyu Zhao 57 / 70
Deep Q-learning

When wT is fixed, the gradient of J can be easily obtained as

∇w J = E[(R + γ maxa∈A(S′) q̂(S′ , a, wT ) − q̂(S, A, w)) ∇w q̂(S, A, w)].

• The basic idea of deep Q-learning is to use the gradient-descent


algorithm to minimize the objective function.
• However, such an optimization process involves some important
techniques that deserve special attention.

Shiyu Zhao 58 / 70
Deep Q-learning - Two networks

First technique:
• Two networks, a main network and a target network.
Why is it used?
• The mathematical reason has been explained when we calculate the
gradient.
Implementation details:
• Let w and wT denote the parameters of the main and target networks,
respectively. They are set to be the same initially.
• In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from
the replay buffer (will be explained later).
• The inputs of the networks include state s and action a. The target
output is yT ≐ r + γ maxa∈A(s′ ) q̂(s′ , a, wT ). Then, we directly
minimize the TD error, also called the loss function, (yT − q̂(s, a, w))²
over the mini-batch {(s, a, yT )}.
Shiyu Zhao 59 / 70
Deep Q-learning - Experience replay

Another technique:
• Experience replay
Question: What is experience replay?
Answer:
• After we have collected some experience samples, we do NOT use
these samples in the order they were collected.
• Instead, we store them in a set called the replay buffer B ≐ {(s, a, r, s′ )}.
• Every time we train the neural network, we draw a mini-batch of
random samples from the replay buffer.
• The drawing of samples, called experience replay, should follow a
uniform distribution (why?).
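A minimal replay-buffer sketch in Python (illustrative, not the implementation used in the book): samples are stored as (s, a, r, s′) tuples and mini-batches are drawn uniformly at random.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # Old samples are discarded once the buffer is full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive samples.
        return random.sample(list(self.buffer), batch_size)
```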

Shiyu Zhao 60 / 70
Deep Q-learning - Experience replay

Question: Why is experience replay necessary in deep Q-learning? Why
must the replay follow a uniform distribution?
Answer: The answers lie in the objective function:

J = E[(R + γ maxa∈A(S′) q̂(S′ , a, w) − q̂(S, A, w))²]

• (S, A) ∼ d: (S, A) is an index and is treated as a single random variable.
• R ∼ p(R|S, A), S′ ∼ p(S′ |S, A): R and S′ are determined by the
system model.
• The distribution of the state-action pair (S, A) is assumed to be
uniform.

Shiyu Zhao 61 / 70
Deep Q-learning - Experience replay

Answer (continued):
• However, the samples are not uniformly collected because they are
generated consecutively by certain policies.
• To break the correlation between consecutive samples, we can use the
experience replay technique and uniformly draw samples from the
replay buffer.
• This is the mathematical reason why experience replay is necessary
and why the experience replay must be uniform.

Shiyu Zhao 62 / 70
Deep Q-learning - Experience replay

Revisit the tabular case:


• Question: Why doesn't tabular Q-learning require experience replay?
• Answer: There is no uniform-distribution requirement.
• Question: Why does deep Q-learning involve a distribution?
• Answer: The objective function in the deep case is a scalar average
over all (S, A). The tabular case does not involve any distribution of
S or A: the algorithm in the tabular case aims to solve a set of
equations, one for each (s, a) (the Bellman optimality equation).
• Question: Can we use experience replay in tabular Q-learning?
• Answer: Yes, we can, and it is more sample-efficient (why?).

Shiyu Zhao 63 / 70
Deep Q-learning
Pseudocode: Deep Q-learning (off-policy version)

Aim: Learn an optimal target network to approximate the optimal action values
from the experience samples generated by a behavior policy πb .

Store the experience samples generated by πb in a replay buffer B = {(s, a, r, s0 )}


For each iteration, do
Uniformly draw a mini-batch of samples from B
For each sample (s, a, r, s0 ), calculate the target value as yT = r +
γ maxa∈A(s0 ) q̂(s0 , a, wT ), where wT is the parameter of the target
network
Update the main network to minimize (yT − q̂(s, a, w))2 using the mini-
batch {(s, a, yT )}
Set wT = w every C iterations
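Below is a minimal sketch of one iteration of this pseudocode, assuming a small PyTorch network q_net (main) and q_target (target) whose forward pass maps a batch of state feature vectors to the q-values of all actions, plus a replay buffer like the one sketched earlier. All names are illustrative; this is not the original DQN implementation.

```python
import torch
import torch.nn.functional as F

def dqn_iteration(q_net, q_target, optimizer, buffer, batch_size=32, gamma=0.9):
    # Uniformly draw a mini-batch of (s, a, r, s') samples from the replay buffer.
    batch = buffer.sample(batch_size)
    s = torch.stack([torch.as_tensor(b[0], dtype=torch.float32) for b in batch])
    a = torch.tensor([b[1] for b in batch], dtype=torch.int64)
    r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
    s_next = torch.stack([torch.as_tensor(b[3], dtype=torch.float32) for b in batch])

    # Target value y_T computed with the (frozen) target network parameters w_T.
    with torch.no_grad():
        y_T = r + gamma * q_target(s_next).max(dim=1).values

    # Minimize (y_T - q_hat(s, a, w))^2 over the mini-batch; the gradient
    # flows only through the main network.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y_T)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C iterations: q_target.load_state_dict(q_net.state_dict())
```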

Remarks:
• Why is there no policy update?
• Why don't we use the policy update equation that we derived?
• The network input and output are different from those in the DQN paper.
Shiyu Zhao 64 / 70
Deep Q-learning

Illustrative example:
• This example aims to learn optimal action values for every state-action
pair.
• Once the optimal action values are obtained, the optimal greedy policy
can be obtained immediately.

Shiyu Zhao 65 / 70
Deep Q-learning

Setup:
• A single episode is used to train the network.
• This episode is generated by an exploratory behavior policy shown in
Figure (a).
• The episode has only 1,000 steps! Tabular Q-learning requires
100,000 steps.
• A shallow neural network with a single hidden layer of 100 neurons is used
as a nonlinear approximator of q̂(s, a, w).
See details in the book.

Shiyu Zhao 66 / 70
Deep Q-learning
[Figures: the behavior policy, an episode of 1,000 steps, and the obtained policy. Both the TD error (loss function) and the state value error (RMSE) converge to zero as the iteration index increases.]
Shiyu Zhao 67 / 70
Deep Q-learning

What if we use only a single episode of 100 steps? The data is insufficient.


[Figures: the behavior policy, an episode of 100 steps, and the final policy. The TD error converges to zero, but the state value error does not.]

Shiyu Zhao 68 / 70
Outline

1 Motivating examples: curve fitting

2 Algorithm for state value estimation


Objective function
Optimization algorithms
Selection of function approximators
Illustrative examples
Summary of the story
Theoretical analysis

3 Sarsa with function approximation

4 Q-learning with function approximation

5 Deep Q-learning

6 Summary

Shiyu Zhao 69 / 70
Summary

This lecture introduces the method of value function approximation.


• First, understand the basic idea.
• Second, understand the basic algorithms.

Shiyu Zhao 70 / 70
Lecture 7: Temporal-Difference Learning

Shiyu Zhao
Outline

Shiyu Zhao 1 / 60
Introduction

• This lecture introduces temporal-difference (TD) learning, which is one


of the most well-known methods in reinforcement learning (RL).
• Monte Carlo (MC) learning is the first model-free method. TD learning
is the second model-free method. TD has some advantages compared
to MC.
• We will see how the stochastic approximation methods studied in the
last lecture are useful.

Shiyu Zhao 2 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 3 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 4 / 60
Motivating example: stochastic algorithms

We next consider some stochastic problems and show how to use the RM
algorithm to solve them.
First, consider the simple mean estimation problem: calculate

w = E[X],

based on some iid samples {x} of X.


• By writing g(w) = w − E[X], we can reformulate the problem to a
root-finding problem
g(w) = 0.
• Since we can only obtain samples {x} of X, the noisy observation is
g̃(w, η) ≐ w − x = (w − E[X]) + (E[X] − x) = g(w) + η.

• Then, according to the last lecture, we know the RM algorithm for


solving g(w) = 0 is
wk+1 = wk − αk g̃(wk , ηk ) = wk − αk (wk − xk )
Shiyu Zhao 5 / 60
Motivating example: stochastic algorithms

Second, consider a little more complex problem. That is to estimate


the mean of a function v(X),

w = E[v(X)],

based on some iid random samples {x} of X.

• To solve this problem, we define

g(w) = w − E[v(X)],
g̃(w, η) ≐ w − v(x) = (w − E[v(X)]) + (E[v(X)] − v(x)) = g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The


corresponding RM algorithm is

wk+1 = wk − αk g̃(wk , ηk ) = wk − αk [wk − v(xk )]

Shiyu Zhao 6 / 60
Motivating example: stochastic algorithms

Third, consider an even more complex problem: calculate

w = E[R + γv(X)],

where R, X are random variables, γ is a constant, and v(·) is a function.


• Suppose we can obtain samples {x} and {r} of X and R. we define

g(w) = w − E[R + γv(X)],


g̃(w, η) ≐ w − [r + γv(x)]
= (w − E[R + γv(X)]) + (E[R + γv(X)] − [r + γv(x)])
= g(w) + η.

• Then, the problem becomes a root-finding problem: g(w) = 0. The


corresponding RM algorithm is

wk+1 = wk − αk g̃(wk , ηk ) = wk − αk [wk − (rk + γv(xk ))]


Shiyu Zhao 7 / 60
Motivating example: stochastic algorithms

Quick summary:
• The above three examples are more and more complex.
• They can all be solved by the RM algorithm.
• We will see that the TD algorithms have similar expressions.
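As a quick numerical sketch of the first case (mean estimation), assuming iid samples of X with true mean 5; the other two cases only change how the noisy observation is formed.

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.0                              # initial guess of E[X]
for k in range(1, 10001):
    x_k = rng.normal(5.0, 2.0)       # sample of X
    alpha_k = 1.0 / k                # satisfies sum(alpha)=inf, sum(alpha^2)<inf
    w = w - alpha_k * (w - x_k)      # RM iteration: w_{k+1} = w_k - alpha_k (w_k - x_k)
print(w)                             # converges to approximately 5
```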

Shiyu Zhao 8 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 9 / 60
TD learning of state values

Note that
• TD learning often refers to a broad class of RL algorithms.
• TD learning also refers to a specific algorithm for estimating state
values as introduced below.

Shiyu Zhao 10 / 60
TD learning of state values – Algorithm description

The data/experience required by the algorithm:


• (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) or {(st , rt+1 , st+1 )}t generated
following the given policy π.
The TD learning algorithm is
vt+1 (st ) = vt (st ) − αt (st ) [vt (st ) − (rt+1 + γvt (st+1 ))],   (1)
vt+1 (s) = vt (s),   ∀s ≠ st ,   (2)

where t = 0, 1, 2, . . . . Here, vt (st ) is the estimated state value of vπ (st );


αt (st ) is the learning rate of st at time t.
• At time t, only the value of the visited state st is updated whereas the
values of the unvisited states s 6= st remain unchanged.
• The update in (2) will be omitted when the context is clear.
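A minimal Python sketch of the update (1)-(2), assuming a finite state space indexed 0..n-1 and an experience stream of (s_t, r_{t+1}, s_{t+1}) tuples (names illustrative):

```python
import numpy as np

def td0_evaluate(experience, n_states, alpha=0.1, gamma=0.9):
    # experience: iterable of (s_t, r_{t+1}, s_{t+1}) generated by the policy pi.
    v = np.zeros(n_states)                          # initial guesses of v_pi
    for s, r, s_next in experience:
        td_target = r + gamma * v[s_next]
        v[s] = v[s] - alpha * (v[s] - td_target)    # only the visited state is updated
    return v
```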

Shiyu Zhao 11 / 60
TD learning of state values – Algorithm properties

The TD algorithm can be annotated as


vt+1 (st ) = vt (st ) − αt (st ) [vt (st ) − (rt+1 + γvt (st+1 ))],   (3)

where vt+1 (st ) is the new estimate and vt (st ) is the current estimate. Here,

v̄t ≐ rt+1 + γv(st+1 )

is called the TD target, and

δt ≐ v(st ) − [rt+1 + γv(st+1 )] = v(st ) − v̄t

is called the TD error.


It is clear that the new estimate vt+1 (st ) is a combination of the current
estimate vt (st ) and the TD error.
Shiyu Zhao 12 / 60
TD learning of state values – Algorithm properties

First, why is v̄t called the TD target?


That is because the algorithm drives v(st ) towards v̄t .
To see that,
 
vt+1 (st ) = vt (st ) − αt (st )[vt (st ) − v̄t ]
⇒ vt+1 (st ) − v̄t = [vt (st ) − v̄t ] − αt (st )[vt (st ) − v̄t ]
⇒ vt+1 (st ) − v̄t = [1 − αt (st )][vt (st ) − v̄t ]
⇒ |vt+1 (st ) − v̄t | = |1 − αt (st )| |vt (st ) − v̄t |

Since αt (st ) is a small positive number, we have

0 < 1 − αt (st ) < 1

Therefore,

|vt+1 (st )−v̄t | ≤ |vt (st )−v̄t |

which means v(st ) is driven towards v̄t !


Shiyu Zhao 13 / 60
TD learning of state values – Algorithm properties

Second, what is the interpretation of the TD error?

δt = v(st ) − [rt+1 + γv(st+1 )]

• It is a difference between two consequent time steps.


• It reflects the deficiency between vt and vπ . To see that, denote
δπ,t ≐ vπ (st ) − [rt+1 + γvπ (st+1 )]

Note that
 
E[δπ,t |St = st ] = vπ (st ) − E Rt+1 + γvπ (St+1 )|St = st = 0.

• If vt = vπ , then δt should be zero (in the expectation sense).


• Hence, if δt is not zero, then vt is not equal to vπ .
• The TD error can be interpreted as innovation, which means new
information obtained from the experience (st , rt+1 , st+1 ).
Shiyu Zhao 14 / 60
TD learning of state values – Algorithm properties

Other properties:
• The TD algorithm in (3) only estimates the state value of a given
policy.
• It does not estimate the action values.
• It does not search for optimal policies.
• Later, we will see how to estimate action values and then search for
optimal policies.
• Nonetheless, the TD algorithm in (3) is fundamental for understanding
the core idea.

Shiyu Zhao 15 / 60
TD learning of state values – The idea of the algorithm

Q: What does this TD algorithm do mathematically?


A: It solves the Bellman equation of a given policy π.

Shiyu Zhao 16 / 60
TD learning of state values – The idea of the algorithm

First, a new expression of the Bellman equation.


The definition of the state value of π is

vπ (s) = E[R + γG | S = s],   s ∈ S,   (4)

where G is the discounted return. Since

E[G|S = s] = Σa π(a|s) Σs′ p(s′ |s, a)vπ (s′ ) = E[vπ (S′ )|S = s],

where S′ is the next state, we can rewrite (4) as

vπ (s) = E[R + γvπ (S′ ) | S = s],   s ∈ S.   (5)

Equation (5) is another expression of the Bellman equation. It is


sometimes called the Bellman expectation equation, an important tool to
design and analyze TD algorithms.
Shiyu Zhao 17 / 60
TD learning of state values – The idea of the algorithm

Second, solve the Bellman equation in (5) using the RM algorithm.


In particular, by defining

g(v(s)) = v(s) − E[R + γvπ (S′ )|s],

we can rewrite (5) as

g(v(s)) = 0.

Since we can only obtain the samples r and s′ of R and S′, the noisy
observation we have is

g̃(v(s)) = v(s) − [r + γvπ (s′ )]
= (v(s) − E[R + γvπ (S′ )|s]) + (E[R + γvπ (S′ )|s] − [r + γvπ (s′ )]),

where the first term is g(v(s)) and the second term is the noise η.

Shiyu Zhao 18 / 60
TD learning of state values – The idea of the algorithm

Therefore, the RM algorithm for solving g(v(s)) = 0 is

vk+1 (s) = vk (s) − αk g̃(vk (s))
= vk (s) − αk [vk (s) − (rk + γvπ (s′k ))],   k = 1, 2, 3, . . .   (6)

where vk (s) is the estimate of vπ (s) at the kth step; rk , s′k are the
samples of R, S′ obtained at the kth step.

The RM algorithm in (6) has two assumptions that deserve special


attention.
• We must have the experience set {(s, r, s0 )} for k = 1, 2, 3, . . . .
• We assume that vπ (s0 ) is already known for any s0 .

Shiyu Zhao 19 / 60
TD learning of state values – The idea of the algorithm

Therefore, the RM algorithm for solving g(v(s)) = 0 is

vk+1 (s) = vk (s) − αk g̃(vk (s))
= vk (s) − αk [vk (s) − (rk + γvπ (s′k ))],   k = 1, 2, 3, . . .

where vk (s) is the estimate of vπ (s) at the kth step; rk , s′k are the
samples of R, S′ obtained at the kth step.

To remove the two assumptions in the RM algorithm, we can modify it.


• One modification is that {(s, r, s0 )} is changed to {(st , rt+1 , st+1 )} so
that the algorithm can utilize the sequential samples in an episode.
• Another modification is that vπ (s0 ) is replaced by an estimate of it
because we don’t know it in advance.

Shiyu Zhao 20 / 60
TD learning of state values – Algorithm convergence

Theorem (Convergence of TD Learning)

By the TD algorithm (1), vt (s) converges with probability 1 to vπ (s) for all
s ∈ S as t → ∞ if Σt αt (s) = ∞ and Σt αt² (s) < ∞ for all s ∈ S.

Remarks:
• This theorem says the state value can be found by the TD algorithm for a
given policy π.
• Σt αt (s) = ∞ and Σt αt² (s) < ∞ must hold for all s ∈ S. At time step
t, if s = st , which means that s is visited at time t, then αt (s) > 0;
otherwise, αt (s) = 0 for all the other s ≠ st . This requires that every state
be visited an infinite (or sufficiently large) number of times.
• The learning rate α is often selected as a small constant. In this case, the
condition Σt αt² (s) < ∞ is no longer valid. When α is constant, it can
still be shown that the algorithm converges in the expectation sense.
For the proof of the theorem, see my book.
Shiyu Zhao 21 / 60
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC
learning?

TD/Sarsa learning vs. MC learning:

• Online vs. offline: TD learning is online; it can update the state/action values
immediately after receiving a reward. MC learning is offline; it has to wait
until an episode has been completely collected.

• Continuing vs. episodic tasks: Since TD learning is online, it can handle both
episodic and continuing tasks. Since MC learning is offline, it can only handle
episodic tasks that have terminal states.

Table: Comparison between TD learning and MC learning.

Shiyu Zhao 22 / 60
TD learning of state values – Algorithm properties

While TD learning and MC learning are both model-free, what are the
advantages and disadvantages of TD learning compared to MC
learning?

TD/Sarsa learning vs. MC learning (continued):

• Bootstrapping vs. non-bootstrapping: TD bootstraps because the update of a
value relies on the previous estimate of this value; hence, it requires initial
guesses. MC does not bootstrap, because it can directly estimate state/action
values without any initial guess.

• Low vs. high estimation variance: TD has a lower estimation variance than MC
because it involves fewer random variables. For instance, Sarsa only requires
Rt+1 , St+1 , At+1 . By contrast, to estimate qπ (st , at ) by MC, we need samples of
Rt+1 + γRt+2 + γ²Rt+3 + . . . ; if the length of each episode is L, there are
|A|^L possible episodes.

Table: Comparison between TD learning and MC learning (continued).


Shiyu Zhao 23 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 24 / 60
TD learning of action values – Sarsa

• The TD algorithm introduced in the last section can only estimate


state values.
• Next, we introduce Sarsa, an algorithm that can directly estimate
action values.
• We will also see how to use Sarsa to find optimal policies.

Shiyu Zhao 25 / 60
Sarsa – Algorithm

First, our aim is to estimate the action values of a given policy π.


Suppose we have some experience {(st , at , rt+1 , st+1 , at+1 )}t .
We can use the following Sarsa algorithm to estimate the action values:
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γqt (st+1 , at+1 ))],
qt+1 (s, a) = qt (s, a),   ∀(s, a) ≠ (st , at ),

where t = 0, 1, 2, . . . .
• qt (st , at ) is an estimate of qπ (st , at );
• αt (st , at ) is the learning rate depending on st , at .
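A minimal Python sketch of the update above, assuming a q-table stored as a NumPy array of shape (|S|, |A|) (names illustrative):

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # q is a 2-D array indexed by (state, action); only q[s, a] changes.
    td_target = r + gamma * q[s_next, a_next]
    q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)
```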

Shiyu Zhao 26 / 60
Sarsa – Algorithm

• Why is this algorithm called Sarsa? That is because each step of the
algorithm involves (st , at , rt+1 , st+1 , at+1 ). Sarsa is the abbreviation of
state-action-reward-state-action.

• What is the relationship between Sarsa and the previous TD learning


algorithm? We can obtain Sarsa by replacing the state value estimate v(s)
in the TD algorithm with the action value estimate q(s, a). As a result,
Sarsa is an action-value version of the TD algorithm.

• What does the Sarsa algorithm do mathematically? The expression of


Sarsa suggests that it is a stochastic approximation algorithm solving the
following equation:

qπ (s, a) = E[R + γqπ (S′ , A′ )|s, a],   ∀s, a.

This is another expression of the Bellman equation expressed in terms of


action values. The proof is given in my book.
Shiyu Zhao 27 / 60
Sarsa – Algorithm

Theorem (Convergence of Sarsa learning)

By the Sarsa algorithm, qt (s, a) converges with probability 1 to the action
value qπ (s, a) as t → ∞ for all (s, a) if Σt αt (s, a) = ∞ and Σt αt² (s, a) < ∞
for all (s, a).

Remarks:

• This theorem says the action value can be found by Sarsa for a given
policy π.

Shiyu Zhao 28 / 60
Sarsa – Implementation

The ultimate goal of RL is to find optimal policies.


To do that, we can combine Sarsa with a policy improvement step.
The combined algorithm is also called Sarsa.

Pseudocode: Policy searching by Sarsa

For each episode, do


If the current st is not the target state, do
Collect the experience (st , at , rt+1 , st+1 , at+1 ): In particular, take ac-
tion at following πt (st ), generate rt+1 , st+1 , and then take action at+1
following πt (st+1 ).
Update q-value:
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γqt (st+1 , at+1 ))]
Update policy:
πt+1 (a|st ) = 1 − (ε/|A|)(|A| − 1) if a = arg maxa qt+1 (st , a)
πt+1 (a|st ) = ε/|A| otherwise
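The policy-update step can be sketched as follows, assuming the policy is stored as a NumPy array pi of shape (|S|, |A|); this is an illustrative helper, not the book's code.

```python
import numpy as np

def epsilon_greedy_update(pi, q, s, epsilon=0.1):
    # Give the greedy action probability 1 - (eps/|A|)(|A| - 1),
    # and every other action probability eps/|A|.
    n_actions = q.shape[1]
    greedy_a = int(np.argmax(q[s]))
    pi[s, :] = epsilon / n_actions
    pi[s, greedy_a] = 1.0 - (epsilon / n_actions) * (n_actions - 1)
```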

Shiyu Zhao 29 / 60
Sarsa – Implementation

Remarks about this algorithm:


• The policy of st is updated immediately after q(st , at ) is updated.
This is based on the idea of generalized policy iteration.
• The policy is ε-greedy instead of greedy to properly balance exploitation
and exploration.
Be clear about the core idea and complication:
• The core idea is simple: that is to use an algorithm to solve the
Bellman equation of a given policy.
• The complication emerges when we try to find optimal policies and
work efficiently.

Shiyu Zhao 30 / 60
Sarsa – Examples

Task description:
• The task is to find a good path from a specific starting state to the
target state.
• This task is different from all the previous tasks where we need to
find out the optimal policy for every state!
• Each episode starts from the top-left state and ends in the target
state.
• In the future, pay attention to what the task is.
• rtarget = 0, rforbidden = rboundary = −10, and rother = −1. The
learning rate is α = 0.1 and the value of ε is 0.1.

Shiyu Zhao 31 / 60
Sarsa – Examples

Results:

• The left figures above show the final policy obtained by Sarsa.

• Not all states have the optimal policy.

• The right figures show the total reward and length of every episode.

• The metric of total reward per episode will be frequently used.

[Figure: the final policy obtained by Sarsa (left); the total reward and episode length of each of the 500 episodes (right).]

Shiyu Zhao 32 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 33 / 60
TD learning of action values: Expected Sarsa

A variant of Sarsa is the Expected Sarsa algorithm:


qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γE[qt (st+1 , A)])],
qt+1 (s, a) = qt (s, a),   ∀(s, a) ≠ (st , at ),

where

E[qt (st+1 , A)] = Σa πt (a|st+1 )qt (st+1 , a) ≐ vt (st+1 )

is the expected value of qt (st+1 , a) under policy πt .


Compared to Sarsa:

• The TD target is changed from rt+1 + γqt (st+1 , at+1 ) as in Sarsa to


rt+1 + γE[qt (st+1 , A)] as in Expected Sarsa.

• It needs more computation. But it is beneficial in the sense that it reduces the
estimation variance, because it reduces the random variables involved from
{st , at , rt+1 , st+1 , at+1 } in Sarsa to {st , at , rt+1 , st+1 }.
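A short sketch of the Expected Sarsa update, under the same NumPy q-table and policy-array conventions as the earlier snippets:

```python
def expected_sarsa_update(q, pi, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Expectation of q(s_{t+1}, A) under the current policy pi(.|s_{t+1}),
    # i.e., an estimate of v(s_{t+1}).
    expected_q = (pi[s_next] * q[s_next]).sum()
    td_target = r + gamma * expected_q
    q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)
```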
Shiyu Zhao 34 / 60
TD learning of action values: Expected Sarsa

What does the algorithm do mathematically? Expected Sarsa is a stochastic


approximation algorithm for solving the following equation:
qπ (s, a) = E[Rt+1 + γEAt+1 ∼π(St+1 ) [qπ (St+1 , At+1 )] | St = s, At = a],   ∀s, a.

The above equation is another expression of the Bellman equation:

qπ (s, a) = E[Rt+1 + γvπ (St+1 )|St = s, At = a].

Illustrative example:
[Figure: the final policy obtained by Expected Sarsa (left); the total reward and episode length of each of the 500 episodes (right).]

Shiyu Zhao 35 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 36 / 60
TD learning of action values: n-step Sarsa

n-step Sarsa: can unify Sarsa and Monte Carlo learning


The definition of action value is

qπ (s, a) = E[Gt |St = s, At = a].


The discounted return Gt can be written in different forms as

Sarsa ←− Gt^(1) = Rt+1 + γqπ (St+1 , At+1 ),
         Gt^(2) = Rt+1 + γRt+2 + γ²qπ (St+2 , At+2 ),
         ...
n-step Sarsa ←− Gt^(n) = Rt+1 + γRt+2 + · · · + γ^n qπ (St+n , At+n ),
         ...
MC ←− Gt^(∞) = Rt+1 + γRt+2 + γ²Rt+3 + . . .

It should be noted that Gt = Gt^(1) = Gt^(2) = · · · = Gt^(n) = · · · = Gt^(∞) , where the
superscripts merely indicate the different decomposition structures of Gt .
Shiyu Zhao 37 / 60
TD learning of action values: n-step Sarsa

• Sarsa aims to solve

qπ (s, a) = E[Gt^(1) |s, a] = E[Rt+1 + γqπ (St+1 , At+1 )|s, a].

• MC learning aims to solve

qπ (s, a) = E[Gt^(∞) |s, a] = E[Rt+1 + γRt+2 + γ²Rt+3 + . . . |s, a].

• An intermediate algorithm called n-step Sarsa aims to solve

qπ (s, a) = E[Gt^(n) |s, a] = E[Rt+1 + γRt+2 + · · · + γ^n qπ (St+n , At+n )|s, a].

• The algorithm of n-step Sarsa is

qt+1 (st , at ) = qt (st , at )
− αt (st , at ) [qt (st , at ) − (rt+1 + γrt+2 + · · · + γ^n qt (st+n , at+n ))].

n-step Sarsa is more general because it becomes the (one-step) Sarsa


algorithm when n = 1 and the MC learning algorithm when n = ∞.
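A minimal sketch of the n-step target and update, assuming the rewards r_{t+1}, ..., r_{t+n} and the pair (s_{t+n}, a_{t+n}) have already been collected (i.e., the update is performed at time t+n), with the same q-table conventions as before:

```python
def n_step_sarsa_target(rewards, q, s_n, a_n, gamma=0.9):
    # rewards = [r_{t+1}, ..., r_{t+n}]; bootstraps with q(s_{t+n}, a_{t+n}).
    target = sum(gamma**k * r for k, r in enumerate(rewards))
    return target + gamma**len(rewards) * q[s_n, a_n]

def n_step_sarsa_update(q, s, a, rewards, s_n, a_n, alpha=0.1, gamma=0.9):
    td_target = n_step_sarsa_target(rewards, q, s_n, a_n, gamma)
    q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)
```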
Shiyu Zhao 38 / 60
TD learning of action values: n-step Sarsa

• n-step Sarsa needs (st , at , rt+1 , st+1 , at+1 , . . . , rt+n , st+n , at+n ).
• Since (rt+n , st+n , at+n ) has not been collected at time t, we are not able to
implement n-step Sarsa at step t. However, we can wait until time t + n to
update the q-value of (st , at ):

qt+n (st , at ) = qt+n−1 (st , at )
− αt+n−1 (st , at ) [qt+n−1 (st , at ) − (rt+1 + γrt+2 + · · · + γ^n qt+n−1 (st+n , at+n ))]

• Since n-step Sarsa includes Sarsa and MC learning as two extreme cases, its
performance is a blend of Sarsa and MC learning:
• If n is large, its performance is close to MC learning and hence has a large
variance but a small bias.
• If n is small, its performance is close to Sarsa and hence has a relatively
large bias due to the initial guess and relatively low variance.
• Finally, n-step Sarsa is also for policy evaluation. It can be combined with
the policy improvement step to search for optimal policies.
Shiyu Zhao 39 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 40 / 60
TD learning of optimal action values: Q-learning

• Next, we introduce Q-learning, one of the most widely used RL


algorithms.
• Sarsa can estimate the action values of a given policy. It must be
combined with a policy improvement step to find optimal policies.
• Q-learning can directly estimate optimal action values and hence
optimal policies.

Shiyu Zhao 41 / 60
Q-learning – Algorithm

The Q-learning algorithm is


 
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γ maxa∈A qt (st+1 , a))],
qt+1 (s, a) = qt (s, a),   ∀(s, a) ≠ (st , at ),

Q-learning is very similar to Sarsa. They are different only in terms of the
TD target:
• The TD target in Q-learning is rt+1 + γ maxa∈A qt (st+1 , a)
• The TD target in Sarsa is rt+1 + γqt (st+1 , at+1 ).
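A minimal sketch of the tabular Q-learning update (same NumPy q-table conventions as the earlier Sarsa snippet; no next action is needed):

```python
def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # The TD target bootstraps with the maximum q-value at the next state.
    td_target = r + gamma * q[s_next].max()
    q[s, a] = q[s, a] - alpha * (q[s, a] - td_target)
```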

Shiyu Zhao 42 / 60
Q-learning – Algorithm

What does Q-learning do mathematically?


It aims to solve
q(s, a) = E[Rt+1 + γ maxa q(St+1 , a) | St = s, At = a],   ∀s, a.

This is the Bellman optimality equation expressed in terms of action


values. See the proof in my book.

Shiyu Zhao 43 / 60
Off-policy vs on-policy

Before further studying Q-learning, we first introduce two important


concepts: on-policy learning and off-policy learning.
There exist two policies in a TD learning task:
• The behavior policy is used to generate experience samples.
• The target policy is constantly updated toward an optimal policy.
On-policy vs off-policy:
• When the behavior policy is the same as the target policy, such kind of
learning is called on-policy.
• When they are different, the learning is called off-policy.

Shiyu Zhao 44 / 60
Off-policy vs on-policy

Advantages of off-policy learning:

• It can search for optimal policies based on the experience samples


generated by any other policies.
• As an important special case, the behavior policy can be selected to
be exploratory. For example, if we would like to estimate the action
values of all state-action pairs, we can use an exploratory policy to
generate episodes visiting every state-action pair sufficiently many
times.

Shiyu Zhao 45 / 60
Off-policy vs on-policy

How to judge if a TD algorithm is on-policy or off-policy?


• First, check what the algorithm does mathematically.
• Second, check what things are required to implement the algorithm.
It deserves special attention because it is one of the most confusing
problems to beginners.

Shiyu Zhao 46 / 60
Off-policy vs on-policy

Sarsa is on-policy.
• First, Sarsa aims to solve the Bellman equation of a given policy π:

qπ (s, a) = E [R + γqπ (S 0 , A0 )|s, a] , ∀s, a.

where R ∼ p(R|s, a), S 0 ∼ p(S 0 |s, a), A0 ∼ π(A0 |S 0 ).


• Second, the algorithm is
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γqt (st+1 , at+1 ))],

which requires (st , at , rt+1 , st+1 , at+1 ):


• If (st , at ) is given, then rt+1 and st+1 do not depend on any policy!
• at+1 is generated following πt (st+1 )!
• πt is both the target and behavior policy.
Shiyu Zhao 47 / 60
Off-policy vs on-policy

Monte Carlo learning is on-policy.

• First, the MC method aims to solve

qπ (s, a) = E [Rt+1 + γRt+2 + . . . |St = s, At = a] , ∀s, a.

where the sample is generated following a given policy π.


• Second, the implementation of the MC method is

q(s, a) ≈ rt+1 + γrt+2 + . . .

• A policy is used to generate samples, which is further used to estimate


the action values of the policy. Based on the action values, we can
improve the policy.

Shiyu Zhao 48 / 60
Off-policy vs on-policy

Q-learning is off-policy.

• First, Q-learning aims to solve the Bellman optimality equation


q(s, a) = E[Rt+1 + γ maxa q(St+1 , a) | St = s, At = a],   ∀s, a.

• Second, the algorithm is


 
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γ maxa∈A qt (st+1 , a))],

which requires (st , at , rt+1 , st+1 ).


• If (st , at ) is given, then rt+1 and st+1 do not depend on any policy!
• The behavior policy to generate at from st can be anything. The
target policy will converge to the optimal policy.

Shiyu Zhao 49 / 60
Q-learning – Implementation

Since Q-learning is off-policy, it can be implemented in an either off-policy or


on-policy fashion.

Pseudocode: Policy searching by Q-learning (on-policy version)

For each episode, do


If the current st is not the target state, do
Collect the experience (st , at , rt+1 , st+1 ): In particular, take action at
following πt (st ), generate rt+1 , st+1 .
Update q-value:
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γ maxa qt (st+1 , a))]
Update policy:
πt+1 (a|st ) = 1 − (ε/|A|)(|A| − 1) if a = arg maxa qt+1 (st , a)
πt+1 (a|st ) = ε/|A| otherwise

See the book for more detailed pseudocode.


Shiyu Zhao 50 / 60
Q-learning – Algorithm

Pseudocode: Optimal policy search by Q-learning (off-policy version)

For each episode {s0 , a0 , r1 , s1 , a1 , r2 , . . . } generated by πb , do


For each step t = 0, 1, 2, . . . of the episode, do
Update q-value:
qt+1 (st , at ) = qt (st , at ) − αt (st , at ) [qt (st , at ) − (rt+1 + γ maxa qt (st+1 , a))]
Update target policy:
πT,t+1 (a|st ) = 1 if a = arg maxa qt+1 (st , a)
πT,t+1 (a|st ) = 0 otherwise

See the book for more detailed pseudocode.

Shiyu Zhao 51 / 60
Q-learning – Examples

Task description:
• The task in these examples is to find an optimal policy for all the
states.
• The reward setting is rboundary = rforbidden = −1, and rtarget = 1.
The discount rate is γ = 0.9. The learning rate is α = 0.1.
Ground truth: an optimal policy and the corresponding optimal state
values.
[Figure (a): the optimal policy. Figure (b): the optimal state values, by row:
row 1: 5.8 5.6 6.2 6.5 5.8
row 2: 6.5 7.2 8.0 7.2 6.5
row 3: 7.2 8.0 10.0 8.0 7.2
row 4: 8.0 10.0 10.0 10.0 8.0
row 5: 7.2 9.0 10.0 9.0 8.1]


Shiyu Zhao 52 / 60
Q-learning – Examples

The behavior policy and the generated experience (10^5 steps):

[Figure: (a) the behavior policy; (b) the generated episode.]

The policy found by off-policy Q-learning:

[Figure: (a) the estimated policy; (b) the state value error, which converges to zero as the number of steps in the episode grows.]

Shiyu Zhao 53 / 60
Q-learning – Examples

The importance of exploration: episodes of 10^5 steps.

If the policy is not sufficiently exploratory, the samples are not good.

[Figure: (a) the behavior policy with ε = 0.5; (b) the generated episode; (c) the Q-learning result, showing the state value error versus the step in the episode.]

Shiyu Zhao 54 / 60
Q-learning – Examples
[Figure: (a) the behavior policy with ε = 0.1; (b) the generated episode; (c) the Q-learning result, showing the state value error versus the step in the episode.]

[Figure: (a) the behavior policy with ε = 0.1; (b) the generated episode; (c) the Q-learning result, showing the state value error versus the step in the episode.]
Shiyu Zhao 55 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 56 / 60
A unified point of view

All the algorithms we introduced in this lecture can be expressed in a unified


expression:

qt+1 (st , at ) = qt (st , at ) − αt (st , at )[qt (st , at ) − q̄t ],

where q̄t is the TD target.


Different TD algorithms have different q̄t .

Algorithm: Expression of q̄t

Sarsa: q̄t = rt+1 + γqt (st+1 , at+1 )
n-step Sarsa: q̄t = rt+1 + γrt+2 + · · · + γ^n qt (st+n , at+n )
Expected Sarsa: q̄t = rt+1 + γ Σa πt (a|st+1 )qt (st+1 , a)
Q-learning: q̄t = rt+1 + γ maxa qt (st+1 , a)
Monte Carlo: q̄t = rt+1 + γrt+2 + . . .

The MC method can also be expressed in this unified expression by setting


αt (st , at ) = 1 and hence qt+1 (st , at ) = q̄t .
Shiyu Zhao 57 / 60
A unified point of view

All the algorithms can be viewed as stochastic approximation algorithms


solving the Bellman equation or Bellman optimality equation:

Algorithm: Equation aimed to solve

Sarsa (BE): qπ (s, a) = E[Rt+1 + γqπ (St+1 , At+1 )|St = s, At = a]
n-step Sarsa (BE): qπ (s, a) = E[Rt+1 + γRt+2 + · · · + γ^n qπ (St+n , At+n )|St = s, At = a]
Expected Sarsa (BE): qπ (s, a) = E[Rt+1 + γEAt+1 [qπ (St+1 , At+1 )] | St = s, At = a]
Q-learning (BOE): q(s, a) = E[Rt+1 + γ maxa q(St+1 , a) | St = s, At = a]
Monte Carlo (BE): qπ (s, a) = E[Rt+1 + γRt+2 + . . . |St = s, At = a]

Shiyu Zhao 58 / 60
Outline

1 Motivating examples

2 TD learning of state values

3 TD learning of action values: Sarsa

4 TD learning of action values: Expected Sarsa

5 TD learning of action values: n-step Sarsa

6 TD learning of optimal action values: Q-learning

7 A unified point of view

8 Summary

Shiyu Zhao 59 / 60
Summary

• Introduced various TD learning algorithms


• Their expressions, math interpretations, implementation, relationship,
examples
• Unified point of view

Shiyu Zhao 60 / 60
Lecture 9: Policy Gradient Methods

Shiyu Zhao
Introduction

[Figure: the map of this book. Fundamental tools: Chapter 2 (Bellman Equation), Chapter 3 (Bellman Optimality Equation). Algorithms/methods: Chapter 4 (Value Iteration & Policy Iteration), Chapter 5 (Monte Carlo Learning), Chapter 6 (Stochastic Approximation), Chapter 7 (Temporal-Difference Learning). From tabular representation to function representation: Chapter 8 (Value Function Approximation), Chapter 9 (Policy Function Approximation, or Policy Gradient), Chapter 10 (Actor-Critic Methods).]
Shiyu Zhao 1 / 43
Introduction

In this lecture, we will move


• from value-based methods to policy-based methods
• from value function approximation to policy function approximation

Shiyu Zhao 2 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 3 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 4 / 43
Basic idea of policy gradient

Previously, policies have been represented by tables:


• The action probabilities of all states are stored in a table π(a|s). Each
entry of the table is indexed by a state and an action.

     a1          a2          a3          a4          a5
s1   π(a1 |s1 )  π(a2 |s1 )  π(a3 |s1 )  π(a4 |s1 )  π(a5 |s1 )
...  ...         ...         ...         ...         ...
s9   π(a1 |s9 )  π(a2 |s9 )  π(a3 |s9 )  π(a4 |s9 )  π(a5 |s9 )

• We can directly access or change a value in the table.

Shiyu Zhao 5 / 43
Basic idea of policy gradient

Now, policies can be represented by parameterized functions:

π(a|s, θ)

where θ ∈ Rm is a parameter vector.


• The function can be, for example, a neural network, whose input is s,
output is the probability to take each action, and parameter is θ.
• Advantage: when the state space is large, the tabular representation
is inefficient in terms of storage and generalization, whereas a
parameterized function can be much more efficient.
• The function representation is also sometimes written as π(a, s, θ),
πθ (a|s), or πθ (a, s).

Shiyu Zhao 6 / 43
Basic idea of policy gradient

Differences between tabular and function representations:


• First, how to define optimal policies?
• When represented as a table, a policy π is optimal if it can maximize
every state value.
• When represented by a function, a policy π is optimal if it can
maximize certain scalar metrics.

Shiyu Zhao 7 / 43
Basic idea of policy gradient

Differences between tabular and function representations:


• Second, how to access the probability of an action?
• In the tabular case, the probability of taking a at s can be directly
accessed by looking up the tabular policy.
• In the case of function representation, we need to calculate the value
of π(a|s, θ) given the function structure and the parameter.

Shiyu Zhao 8 / 43
Basic idea of policy gradient

Differences between tabular and function representations:


• Third, how to update policies?
• When represented by a table, a policy π can be updated by directly
changing the entries in the table.
• When represented by a parameterized function, a policy π cannot be
updated in this way anymore. Instead, it can only be updated by
changing the parameter θ.

Shiyu Zhao 9 / 43
Basic idea of policy gradient

The basic idea of the policy gradient is simple:

• First, metrics (or objective functions) J(θ) that define optimal
policies.
• Second, gradient-based optimization algorithms to search for optimal
policies:

θt+1 = θt + α∇θ J(θt )

Although the idea is simple, the complication emerges when we try to


answer the following questions.
• What appropriate metrics should be used?
• How to calculate the gradients of the metrics?
These questions will be answered in detail in this lecture.
Shiyu Zhao 10 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 11 / 43
Metrics to define optimal policies - 1) The average value

There are two metrics.


The first metric is the average state value or simply called average
value. In particular, the metric is defined as
v̄π = Σs∈S d(s)vπ (s)

• v̄π is a weighted average of the state values.


• d(s) ≥ 0 is the weight for state s.
• Since Σs∈S d(s) = 1, we can interpret d(s) as a probability
distribution. Then, the metric can be written as

v̄π = E[vπ (S)]

where S ∼ d.

Shiyu Zhao 12 / 43
Metrics to define optimal policies - 1) The average value

Vector-product form:
v̄π = Σs∈S d(s)vπ (s) = dT vπ

where

vπ = [. . . , vπ (s), . . . ]T ∈ R|S|
d = [. . . , d(s), . . . ]T ∈ R|S| .

This expression is particularly useful when we analyze its gradient.

Shiyu Zhao 13 / 43
Metrics to define optimal policies - 1) The average value

How to select the distribution d? There are two cases.


The first case is that d is independent of the policy π.
• This case is relatively simple because the gradient of the metric is
easier to calculate.
• In this case, we specifically denote d as d0 and v̄π as v̄π0 .
• How to select d0 ?
• One trivial way is to treat all the states equally important and hence
select d0 (s) = 1/|S|.
• Another important case is that we are only interested in a specific
state s0 . For example, the episodes in some tasks always start from
the same state s0 . Then, we only care about the long-term return
starting from s0 . In this case,

d0 (s0 ) = 1, d0 (s 6= s0 ) = 0.
Shiyu Zhao 14 / 43
Metrics to define optimal policies - 1) The average value

How to select the distribution d? There are two cases.


The second case is that d depends on the policy π.

• A common way to select d as dπ (s), which is the stationary


distribution under π. Details of stationary distribution can be found in
the last lecture and the book.
• One basic property of dπ is that it satisfies

dTπ Pπ = dTπ ,

where Pπ is the state transition probability matrix.


• The interpretation of selecting dπ is as follows.
• If one state is frequently visited in the long run, it is more
important and deserves more weight.
• If a state is hardly visited, then we give it less weight.
Shiyu Zhao 15 / 43
Metrics to define optimal policies - 2) The average reward

The second metric is average one-step reward or simply average


reward. In particular, the metric is
r̄π ≐ Σs∈S dπ (s)rπ (s) = E[rπ (S)],

where S ∼ dπ . Here,

rπ (s) ≐ Σa∈A π(a|s)r(s, a)

is the average of the one-step immediate reward that can be obtained
starting from state s, and

r(s, a) = E[R|s, a] = Σr r p(r|s, a)

• The weight dπ is the stationary distribution.


• As its name suggests, r̄π is simply a weighted average of the one-step
immediate rewards.
Shiyu Zhao 16 / 43
Metrics to define optimal policies - 2) The average reward

An equivalent definition!

• Suppose an agent follows a given policy and generate a trajectory with


the rewards as (Rt+1 , Rt+2 , . . . ).
• The average single-step reward along this trajectory is
lim_{n→∞} (1/n) E[Rt+1 + Rt+2 + · · · + Rt+n | St = s0 ]
= lim_{n→∞} (1/n) E[ Σ_{k=1}^{n} Rt+k | St = s0 ],

where s0 is the starting state of the trajectory.

Shiyu Zhao 17 / 43
Metrics to define optimal policies - Remarks

An important property is that

lim_{n→∞} (1/n) E[ Σ_{k=1}^{n} Rt+k | St = s0 ] = lim_{n→∞} (1/n) E[ Σ_{k=1}^{n} Rt+k ]
= Σs dπ (s)rπ (s)
= r̄π

Note that
• The starting state s0 does not matter.
• The two definitions of r̄π are equivalent.
See the proof in the book.

Shiyu Zhao 18 / 43
Metrics to define optimal policies - Remarks

Remark 1 about the metrics:


• All these metrics are functions of π.
• Since π is parameterized by θ, these metrics are functions of θ.
• In other words, different values of θ can generate different metric
values.
• Therefore, we can search for the optimal values of θ to maximize these
metrics.
This is the basic idea of policy gradient methods.

Shiyu Zhao 19 / 43
Metrics to define optimal policies - Remarks

Remark 2 about the metrics:


• One complication is that the metrics can be defined in either the
discounted case where γ ∈ (0, 1) or the undiscounted case where
γ = 1.
• We only consider the discounted case so far in this book. For details
about the undiscounted case, see the book.

Shiyu Zhao 20 / 43
Metrics to define optimal policies - Remarks

Remark 3 about the metrics:


• Intuitively, r̄π is more short-sighted because it merely considers the
immediate rewards, whereas v̄π considers the total reward over all steps.
• However, the two metrics are equivalent to each other.
In the discounted case where γ < 1, it holds that

r̄π = (1 − γ)v̄π .

See the proof in the book.

Shiyu Zhao 21 / 43
Metrics to define optimal policies - Exercise

Exercise:
You will see the following metric often in the literature:

J(θ) = E[ Σ_{t=0}^{∞} γ^t Rt+1 ]

What is its relationship to the metrics we introduced just now?

Shiyu Zhao 22 / 43
Metrics to define optimal policies - Exercise

J(θ) = E[ Σ_{t=0}^{∞} γ^t Rt+1 ]
t=0

Answer: First, clarify and understand this metric.


• It starts from S0 ∼ d and then A0 , R1 , S1 , A1 , R2 , S2 , . . .
• At ∼ π(St ) and Rt+1 , St+1 ∼ p(Rt+1 |St , At ), p(St+1 |St , At )
Then, we know this metric is the same as the average value because

J(θ) = E[ Σ_{t=0}^{∞} γ^t Rt+1 ] = Σs∈S d(s) E[ Σ_{t=0}^{∞} γ^t Rt+1 | S0 = s ]
= Σs∈S d(s)vπ (s)
= v̄π

Shiyu Zhao 23 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 24 / 43
Gradients of the metrics

Given a metric, we next


• derive its gradient
• and then, apply gradient-based methods to optimize the metric.
The gradient calculation is one of the most complicated parts of policy
gradient methods! That is because
• first, we need to distinguish different metrics v̄π , r̄π , v̄π0
• second, we need to distinguish the discounted and undiscounted cases.
The calculation of the gradients:
• We will not discuss the details in this lecture.
• Interested readers may see my book for details.

Shiyu Zhao 25 / 43
Gradients of the metrics

Summary of the results about the gradients:


∇θ J(θ) = Σs∈S η(s) Σa∈A ∇θ π(a|s, θ)qπ (s, a)

where
• J(θ) can be v̄π , r̄π , or v̄π0 .
• “=” may denote strict equality, approximation, or proportional to.
• η is a distribution or weight of the states.

Shiyu Zhao 26 / 43
Gradients of the metrics

Some specific results:


∇θ r̄π ≃ Σs dπ (s) Σa ∇θ π(a|s, θ)qπ (s, a),

∇θ v̄π = (1/(1 − γ)) ∇θ r̄π ,

∇θ v̄π0 = Σs∈S ρπ (s) Σa∈A ∇θ π(a|s, θ)qπ (s, a)

Details are not given here. Interested readers can read my book.

Shiyu Zhao 27 / 43
Gradients of the metrics

A compact and useful form of the gradient:


∇θ J(θ) = Σs∈S η(s) Σa∈A ∇θ π(a|s, θ)qπ (s, a)
= E[∇θ ln π(A|S, θ)qπ (S, A)],

where S ∼ η and A ∼ π(A|S, θ).

Why is this expression useful?


• Because we can use samples to approximate the gradient!

∇θ J ≈ ∇θ ln π(a|s, θ)qπ (s, a)

Shiyu Zhao 28 / 43
Gradients of the metrics

∇θ J(θ) = Σs∈S η(s) Σa∈A ∇θ π(a|s, θ)qπ (s, a)
= E[∇θ ln π(A|S, θ)qπ (S, A)]

How to prove the above equation?


Consider the function ln π where ln is the natural logarithm. It is easy to
see that
∇θ ln π(a|s, θ) = ∇θ π(a|s, θ) / π(a|s, θ),

and hence

∇θ π(a|s, θ) = π(a|s, θ)∇θ ln π(a|s, θ).

Shiyu Zhao 29 / 43
Gradients of the metrics

Then, we have
∇θ J = Σs d(s) Σa ∇θ π(a|s, θ)qπ (s, a)
= Σs d(s) Σa π(a|s, θ)∇θ ln π(a|s, θ)qπ (s, a)
= ES∼d [ Σa π(a|S, θ)∇θ ln π(a|S, θ)qπ (S, a) ]
= ES∼d,A∼π [∇θ ln π(A|S, θ)qπ (S, A)]
≐ E[∇θ ln π(A|S, θ)qπ (S, A)]

Shiyu Zhao 30 / 43
Gradients of the metrics

Some remarks: Because we need to calculate ln π(a|s, θ), we must
ensure that for all s, a, θ,

π(a|s, θ) > 0.

• This can be achieved by using softmax functions, which normalize
the entries in a vector from (−∞, +∞) to (0, 1).
• For example, for any vector x = [x1 , . . . , xn ]T ,

zi = e^{xi} / Σ_{j=1}^{n} e^{xj},

where zi ∈ (0, 1) and Σ_{i=1}^{n} zi = 1.
• Then, the policy function has the form of

π(a|s, θ) = e^{h(s,a,θ)} / Σ_{a′∈A} e^{h(s,a′,θ)},

where h(s, a, θ) is another function.
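A minimal sketch of such a softmax policy in Python, assuming the preferences h(s, a, θ) for all actions at the current state are given as a vector h_s (hypothetical names, not the book's code):

```python
import numpy as np

def softmax_policy(h_s):
    # h_s contains h(s, a, theta) for every action a at the given state s.
    z = np.exp(h_s - np.max(h_s))   # subtract the max for numerical stability
    return z / z.sum()              # pi(a|s, theta) > 0 for all a and sums to 1

probs = softmax_policy(np.array([1.0, 2.0, 0.5, 0.0, -1.0]))
a = np.random.choice(len(probs), p=probs)   # sample an action from pi(.|s, theta)
```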


Shiyu Zhao 31 / 43
Gradients of the metrics

Some remarks:
• Such a form based on the softmax function can be realized by a neural
network whose input is s and parameter is θ. The network has |A|
outputs, each of which corresponds to π(a|s, θ) for an action a. The
activation function of the output layer should be softmax.
• Since π(a|s, θ) > 0 for all a, the parameterized policy is stochastic and
hence exploratory.
• There also exist deterministic policy gradient (DPG) methods.

Shiyu Zhao 32 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 33 / 43
Gradient-ascent algorithm

Now, we are ready to present the first policy gradient algorithm to find
optimal policies!
• The gradient-ascent algorithm maximizing J(θ) is

θt+1 = θt + α∇θ J(θ)
= θt + αE[∇θ ln π(A|S, θt )qπ (S, A)]

• The true gradient can be replaced by a stochastic one:

θt+1 = θt + α∇θ ln π(at |st , θt )qπ (st , at )

Shiyu Zhao 34 / 43
Gradient-ascent algorithm

• Furthermore, since qπ is unknown, it can be approximated:

θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at )

There are different methods to approximate qπ (st , at )


• In this lecture, Monte-Carlo based method, REINFORCE
• In the next lecture, TD method and more

Shiyu Zhao 35 / 43
Gradient-ascent algorithm

Remark 1: How to do sampling?


ES∼d,A∼π [∇θ ln π(A|S, θt )qπ (S, A)] −→ ∇θ ln π(a|s, θt )qπ (s, a)

• How to sample S?
• S ∼ d, where the distribution d is a long-run behavior under π.
• How to sample A?
• A ∼ π(A|S, θ). Hence, at should be sampled following π(θt ) at st .
• Therefore, the policy gradient method is on-policy.

Shiyu Zhao 36 / 43
Gradient-ascent algorithm

Remark 2: How to interpret this algorithm?

Since

∇θ ln π(at |st , θt ) = ∇θ π(at |st , θt ) / π(at |st , θt ),

the algorithm can be rewritten as

θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at )
= θt + α ( qt (st , at ) / π(at |st , θt ) ) ∇θ π(at |st , θt ),

where the coefficient βt ≐ qt (st , at ) / π(at |st , θt ).

Therefore, we have the important expression of the algorithm:

θt+1 = θt + αβt ∇θ π(at |st , θt )

Shiyu Zhao 37 / 43
Gradient-ascent algorithm

It is a gradient-ascent algorithm for maximizing π(at |st , θ):

θt+1 = θt + αβt ∇θ π(at |st , θt )

Intuition: When αβt is sufficiently small


• If βt > 0, the probability of choosing (st , at ) is enhanced:

π(at |st , θt+1 ) > π(at |st , θt )

The greater βt is, the stronger the enhancement is.


• If βt < 0, then π(at |st , θt+1 ) < π(at |st , θt ).
Math: When θt+1 − θt is sufficiently small, we have

π(at |st , θt+1 ) ≈ π(at |st , θt ) + (∇θ π(at |st , θt ))T (θt+1 − θt )
= π(at |st , θt ) + αβt (∇θ π(at |st , θt ))T (∇θ π(at |st , θt ))
= π(at |st , θt ) + αβt ‖∇θ π(at |st , θt )‖²
Shiyu Zhao 38 / 43
Gradient-ascent algorithm

 
θt+1 = θt + α ( qt (st , at ) / π(at |st , θt ) ) ∇θ π(at |st , θt ),   where βt ≐ qt (st , at ) / π(at |st , θt ).

The coefficient βt can well balance exploration and exploitation.


• First, βt is proportional to qt (st , at ).
• If qt (st , at ) is great, then βt is great.
• Therefore, the algorithm intends to enhance actions with greater
values.
• Second, βt is inversely proportional to π(at |st , θt ).
• If π(at |st , θt ) is small, then βt is large.
• Therefore, the algorithm intends to explore actions that have low
probabilities.
Shiyu Zhao 39 / 43
REINFORCE algorithm

Recall that

θt+1 = θt + α∇θ ln π(at |st , θt )qπ (st , at )

is replaced by

θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at )

where qt (st , at ) is an approximation of qπ (st , at ).


• If qπ (st , at ) is approximated by Monte Carlo estimation, the algorithm
has a specific name, REINFORCE.
• REINFORCE is one of the earliest and simplest policy gradient algorithms.
• Many other policy gradient algorithms such as the actor-critic methods
can be obtained by extending REINFORCE (next lecture).

Shiyu Zhao 40 / 43
REINFORCE algorithm

Pseudocode: Policy Gradient by Monte Carlo (REINFORCE)

Initialization: A parameterized function π(a|s, θ), γ ∈ (0, 1), and α > 0.


Aim: Search for an optimal policy maximizing J(θ).

For the kth iteration, do


Select s0 and generate an episode following π(θk ). Suppose the
episode is {s0 , a0 , r1 , . . . , sT −1 , aT −1 , rT }.
For t = 0, 1, . . . , T − 1, do
Value update: qt (st , at ) = Σ_{k=t+1}^{T} γ^{k−t−1} rk
Policy update: θt+1 = θt + α∇θ ln π(at |st , θt )qt (st , at )
θk = θT
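A compact sketch of one REINFORCE iteration, assuming a hypothetical softmax policy with linear preferences h(s, a, θ) = θᵀφ(s, a), for which ∇θ ln π(a|s, θ) has the closed form φ(s, a) − Σb π(b|s, θ)φ(s, b); the feature map phi and the episode data are assumed to be provided by the caller.

```python
import numpy as np

def policy_probs(theta, s, phi, n_actions):
    h = np.array([phi(s, b) @ theta for b in range(n_actions)])
    z = np.exp(h - h.max())
    return z / z.sum()

def grad_log_pi(theta, s, a, phi, n_actions):
    # For a linear softmax policy: grad ln pi(a|s) = phi(s,a) - sum_b pi(b|s) phi(s,b).
    probs = policy_probs(theta, s, phi, n_actions)
    return phi(s, a) - sum(p * phi(s, b) for b, p in enumerate(probs))

def reinforce_iteration(theta, episode, phi, n_actions, alpha=0.001, gamma=0.9):
    # episode = [(s_0, a_0, r_1), ..., (s_{T-1}, a_{T-1}, r_T)], generated by pi(theta).
    returns, g = [], 0.0
    for _, _, r in reversed(episode):      # q_t(s_t, a_t) = sum_{k>t} gamma^{k-t-1} r_k
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    for (s, a, _), q_t in zip(episode, returns):
        theta = theta + alpha * grad_log_pi(theta, s, a, phi, n_actions) * q_t
    return theta
```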

Shiyu Zhao 41 / 43
Outline

1 Basic idea of policy gradient

2 Metrics to define optimal policies

3 Gradients of the metrics

4 Gradient-ascent algorithm (REINFORCE)

5 Summary

Shiyu Zhao 42 / 43
Summary

Contents of this lecture:


• Metrics for optimality
• Gradients of the metrics
• Gradient-ascent algorithm
• A special case: REINFORCE
Next lecture: Actor-critic

Shiyu Zhao 43 / 43
