
Reinforcement Learning

Lecture 4: Dynamic programming

Chris G. Willcocks
Durham University
Lecture overview
Lecture covers Chapter 4 in Sutton & Barto [1] and adaptations from David Silver [2]
1 Introduction
  • definition
  • examples
  • planning in an MDP
2 Policy evaluation
  • definition
  • synchronous algorithm
3 Policy iteration
  • policy improvement
  • definition
  • modified policy iteration
4 Value iteration
  • definition
  • summary and extensions
Introduction: dynamic programming definition

Definition: Dynamic programming

Dynamic programming is an optimisation method for sequential problems. DP algorithms are able to solve complex 'planning' problems: given a complete MDP, dynamic programming can find an optimal policy. This is achieved with two principles:

1. Breaking the problem down into subproblems
2. Caching and reusing optimal solutions to subproblems to find the overall optimal solution

A minimal code sketch of these two principles follows below.

[Figure: planning example, "what's the optimal policy?"]
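As an illustrative sketch (Python, not taken from the lecture), a memoised Fibonacci function breaks the computation into subproblems and caches their solutions:

from functools import lru_cache

@lru_cache(maxsize=None)                 # principle 2: cache and reuse subproblem solutions
def fib(n: int) -> int:
    if n < 2:                            # base cases
        return n
    return fib(n - 1) + fib(n - 2)       # principle 1: break the problem into subproblems

print(fib(50))                           # 12586269025, in linear rather than exponential time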
Introduction: dynamic programming examples

Famous examples:

• Dijkstra's algorithm
• Backpropagation
• Doing basic math

...so it's really just recursion and common sense! A sketch of the caching idea in Dijkstra's algorithm follows below.

[Figure: weighted graph with shortest-path distances being updated by Dijkstra's algorithm]
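Dijkstra's algorithm caches the best distance found so far to each node and reuses those cached distances when relaxing outgoing edges. A minimal sketch over a hypothetical adjacency-list graph (not the graph from the slide):

import heapq

def dijkstra(graph, source):
    # graph: dict mapping node -> list of (neighbour, edge_weight) pairs
    dist = {node: float('inf') for node in graph}     # cached best-known distances
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:                               # stale queue entry, skip it
            continue
        for v, w in graph[u]:                         # reuse dist[u] to relax each edge
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(queue, (dist[v], v))
    return dist

print(dijkstra({'a': [('b', 2), ('c', 6)], 'b': [('c', 3)], 'c': []}, 'a'))
# {'a': 0, 'b': 2, 'c': 5}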
Introduction: planning in an MDP

Dynamic programming for planning in MDPs

In reinforcement learning, we want to use dynamic programming to solve MDPs. So given an MDP ⟨S, A, P, R, γ⟩ and a policy π:

First, we want to find the value function v_π for that policy:
• This is done by policy evaluation (the prediction problem)

Then, once we are able to evaluate a policy, we want to find the optimal value function v_∗ and hence the best policy (the control problem). This is done with two strategies:
1. Policy iteration
2. Value iteration

Follow along in Colab.
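The code on the following slides assumes a Gym-style tabular model where env.P[s][a] is a list of (probability, next_state, reward, done) tuples. As a hypothetical stand-in for the frozen lake environment, a hand-built two-state MDP in the same format might look like this:

import numpy as np

class TwoStateMDP:
    # toy deterministic MDP in the tabular format used by the lecture code:
    # P[state][action] = [(prob, next_state, reward, done), ...]
    num_states = nS = 2          # both attribute names used in the lecture code
    num_actions = nA = 2
    P = {
        0: {0: [(1.0, 0, 0.0, False)],    # action 0: stay in state 0, no reward
            1: [(1.0, 1, 1.0, True)]},    # action 1: reach terminal state 1, reward 1
        1: {0: [(1.0, 1, 0.0, True)],     # terminal state: self-loops with no reward
            1: [(1.0, 1, 0.0, True)]},
    }

# a uniform random policy: one row of action probabilities per state
random_policy = np.ones((TwoStateMDP.num_states, TwoStateMDP.num_actions)) / TwoStateMDP.num_actions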
Policy evaluation: definition

Definition: Policy evaluation

We want to evaluate a given policy π. We achieve this by iteratively applying the Bellman expectation equation, v_1 → v_2 → ... → v_π.

Example: frozen lake environment

[Figure: FrozenLake 4×4 grid environment with 16 numbered states]
Policy evaluation: synchronous algorithm

Algorithm: policy evaluation

import numpy as np

def policy_evaluation(env, policy, gamma, theta):
    V = np.zeros(env.num_states)          # table of state values, initialised to zero
    while True:
        delta = 0
        for s in range(env.num_states):
            # back up V[s] with one application of the Bellman expectation equation
            Vs = 0
            for a, a_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    Vs += a_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(V[s] - Vs))
            V[s] = Vs
        if delta < theta:                  # stop when the largest update in a sweep is tiny
            break
    return V

[Figure: frozen lake value grid initialised to all zeros]
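As a usage sketch (with the hypothetical TwoStateMDP and random_policy from earlier, not the frozen lake environment):

env = TwoStateMDP()
V = policy_evaluation(env, random_policy, gamma=1.0, theta=1e-8)
print(V)   # approximately [1.0, 0.0]: half the time the agent moves straight to the goal,
           # otherwise it stays put, so v(0) = 0.5 * 1 + 0.5 * v(0), i.e. v(0) = 1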
Policy evaluation: synchronous algorithm

Recap: Bellman expectation equation

v_π(s) = Σ_{a∈A} π(a|s) q_π(s,a)
       = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Each sweep of the algorithm applies this backup to every state; the slides step through successive sweeps on the frozen lake with γ = 1:

for s in range(env.num_states):
    Vs = 0
    for a, a_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            Vs += a_prob * prob * (reward + gamma * V[next_state])
    V[s] = Vs

[Figure: value grids for the uniform random policy after iterations 1, 2, 3, 4 and ∞; non-zero values appear first next to the goal (0.25 after one sweep) and propagate backwards until convergence, reaching roughly 0.44 in the state beside the goal]
Policy iteration: greedy policy improvement

[Figure: starting from the random policy, its converged value function (iteration ∞) is made greedy by taking max over actions a in every state, giving an improved policy]
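A sketch of this greedy improvement step in code (the function name and interface are assumptions, kept consistent with the earlier policy_evaluation code):

import numpy as np

def policy_improvement(env, V, gamma):
    # build a deterministic policy that is greedy with respect to the value function V
    policy = np.zeros((env.num_states, env.num_actions))
    for s in range(env.num_states):
        q_s = np.zeros(env.num_actions)
        for a in range(env.num_actions):
            # one-step look-ahead: expected return of taking action a, then following V
            for prob, next_state, reward, done in env.P[s][a]:
                q_s[a] += prob * (reward + gamma * V[next_state])
        policy[s, np.argmax(q_s)] = 1.0     # put all probability on the best action
    return policy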
Policy iteration: definition

Definition: Policy iteration

Given a policy π (e.g. starting with a random policy), iteratively alternate between evaluating the policy,

v_π(s) = E[R_{t+1} + γ R_{t+2} + ... | S_t = s],

and improving it greedily,

π' = greedy(v_π).

This always converges to the optimal policy π_∗. That is, if the improvements stop,

q_π(s, π'(s)) = max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s),

then the Bellman optimality equation has been satisfied, v_π(s) = max_{a∈A} q_π(s, a), and therefore v_π(s) = v_∗(s) for all s ∈ S.

[Figure: example of learning a better policy]
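Putting evaluation and greedy improvement together, a sketch of the full policy iteration loop (built from the policy_evaluation and policy_improvement sketches above; the name policy_iteration is an assumption):

import numpy as np

def policy_iteration(env, gamma, theta=1e-8):
    # start from the uniform random policy and alternate evaluation and improvement
    policy = np.ones((env.num_states, env.num_actions)) / env.num_actions
    while True:
        V = policy_evaluation(env, policy, gamma, theta)    # evaluate the current policy
        new_policy = policy_improvement(env, V, gamma)      # act greedily on its value function
        if np.array_equal(new_policy, policy):              # improvement stopped: policy is optimal
            return policy, V
        policy = new_policy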
Policy iteration: modified policy iteration

Algorithm: modified policy iteration

What if we don't run iterative policy evaluation all the way to convergence? What if we just do a crude evaluation with a small number of sweeps, e.g. k = 3?

Does it still converge?
• Yes! It still converges to the optimal policy
• the special case k = 1 is equivalent to value iteration

A sketch of such a truncated evaluation follows below.

[Figure: partially evaluated frozen lake value grid after 3 sweeps]
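A sketch of the truncated evaluation this refers to: the same Bellman expectation backup as before, but limited to k in-place sweeps rather than run to convergence (the function name and the in-place V argument are assumptions):

def truncated_policy_evaluation(env, policy, V, gamma, k):
    # k sweeps of the Bellman expectation backup; no convergence check
    for _ in range(k):
        for s in range(env.num_states):
            Vs = 0
            for a, a_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    Vs += a_prob * prob * (reward + gamma * V[next_state])
            V[s] = Vs
    return V

# inside policy iteration, V = truncated_policy_evaluation(env, policy, V, gamma, k=3)
# would replace the full policy_evaluation call; with k = 1 this matches value iteration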
Value iteration: definition

Bellman optimality equation

Recall the definition of the optimal value function according to the Bellman optimality equation:

v_∗(s) = max_a q_∗(s, a)
       = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_∗(s') )

We can iteratively apply this update with the one-step look-ahead to learn v_∗(s).

Algorithm: value iteration

def value_iteration(env, gamma, theta):
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            v_s = V[s]
            # one-step look-ahead: compute q(s, a) for every action
            q_s = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    q_s[a] += prob * (reward + gamma * V[next_state])
            V[s] = max(q_s)                        # back up the best action value
            delta = max(delta, abs(V[s] - v_s))
        if delta < theta:
            break
    policy = greedily_from(env, V, gamma)          # extract the greedy policy from v_∗
    return policy, V
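The greedily_from helper is not defined on the slide; a plausible sketch is the same one-step greedy extraction used in policy improvement:

import numpy as np

def greedily_from(env, V, gamma):
    # extract the deterministic greedy policy from a learned value function
    policy = np.zeros((env.nS, env.nA))
    for s in range(env.nS):
        q_s = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[s][a]:
                q_s[a] += prob * (reward + gamma * V[next_state])
        policy[s, np.argmax(q_s)] = 1.0
    return policy

# usage with the hypothetical TwoStateMDP from earlier:
# policy, V = value_iteration(TwoStateMDP(), gamma=0.9, theta=1e-8)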
Take Away Points

Summary

In summary, dynamic programming:

• solves the planning problem, but not the full reinforcement learning problem
• requires a complete model of the environment
• policy evaluation solves the prediction problem
• there’s a spectrum between policy iteration and value iteration
• these solve the control problem
Extensions:
• Asynchronous DP (read section 4.5 of Sutton & Barto [1])
• Play with the interactive demo by Andrej Karpathy

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). MIT Press, 2018. Available online.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/, 2015.
