RL Lecture 4: Dynamic Programming
Chris G. Willcocks
Durham University
Lecture overview
This lecture covers Chapter 4 of Sutton & Barto [1], with adaptations from David Silver [2].
1 Introduction
  • definition
  • examples
  • planning in an MDP
2 Policy evaluation
  • definition
  • synchronous algorithm
3 Policy iteration
  • policy improvement
  • definition
  • modified policy iteration
4 Value iteration
  • definition
  • summary and extensions
Introduction: dynamic programming definition

Dynamic programming (DP) solves complex problems by breaking them into subproblems, solving each subproblem once, and reusing the cached solutions. It applies when a problem has optimal substructure and overlapping subproblems; MDPs have both, so in RL, DP refers to algorithms that compute value functions and optimal policies given a complete model of the environment.
Introduction: dynamic programming examples

Famous examples:
• Dijkstra's algorithm
• Backpropagation
• Doing basic math

...so it's really just recursion and common sense! (A small sketch of this idea follows.)

[Figure: weighted graph being solved by Dijkstra's algorithm, with the source distance 0, unvisited nodes at ∞, and edge weights labelled]
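To make the "recursion plus reuse" idea concrete, here is a minimal Python sketch (my own example, not from the slides): a memoised Fibonacci, where caching each subproblem turns exponential recursion into linear work.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    # Each subproblem fib(0..n) is solved once and cached, so the
    # exponential naive recursion collapses to O(n) work.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025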
Introduction planning in an MDP
Then, when we’re able to evaluate the policy, we want find the best policy
v∗ (the control problem). This is done with two strategies:
1. Policy iteration
2. Value iteration
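The code sketches in the remaining sections assume a tiny hand-made MDP encoded as plain Python data (this encoding is my own, purely for illustration): a transition model P mapping (state, action) to a list of (probability, next_state, reward) triples, plus a discount factor.

# Hypothetical 3-state, 2-action MDP; state 2 is absorbing.
n_states, n_actions = 3, 2
gamma = 0.9

P = {
    (0, 0): [(1.0, 0, 0.0)],
    (0, 1): [(1.0, 1, 1.0)],
    (1, 0): [(0.5, 0, 0.0), (0.5, 2, 2.0)],
    (1, 1): [(1.0, 2, 1.0)],
    (2, 0): [(1.0, 2, 0.0)],
    (2, 1): [(1.0, 2, 0.0)],
}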
Policy evaluation: definition

[Figure: 4×4 gridworld with nonterminal states numbered 1-14 and terminal states in two opposite corners]
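For reference, the quantity being computed is the state-value function of the fixed policy π; in Sutton & Barto's notation it satisfies the Bellman expectation equation, and iterative policy evaluation simply turns this into an update rule:

v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right]

v_{k+1}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]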
Policy evaluation: synchronous algorithm

[Figures over four slides: successive synchronous sweeps of iterative policy evaluation, showing vk converging on the small gridworld for a fixed policy]
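A minimal synchronous implementation in Python (a sketch using the toy MDP encoding above, not the lecture's code; policy is an array of action probabilities per state):

import numpy as np

def policy_evaluation(P, policy, n_states, n_actions, gamma, theta=1e-8):
    # Synchronous sweeps: every state is backed up from the *old* value
    # array, and we stop when the largest change falls below theta.
    v = np.zeros(n_states)
    while True:
        v_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward in P[(s, a)]:
                    v_new[s] += policy[s, a] * prob * (reward + gamma * v[s_next])
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:
            return v

# Example: evaluate the uniform-random policy on the toy MDP above.
# uniform = np.full((n_states, n_actions), 1.0 / n_actions)
# print(policy_evaluation(P, uniform, n_states, n_actions, gamma))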
Policy iteration: greedy policy improvement

[Figure: example state values and the corresponding greedy policy]
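Improvement acts greedily with respect to the current value function: for each state we do a one-step lookahead to get q(s, a) and pick the maximising action. A sketch in the same toy encoding (my code, not the lecture's):

import numpy as np

def greedy_policy(P, v, n_states, n_actions, gamma):
    # Deterministic policy that is greedy with respect to v:
    # one-step lookahead q(s, a), then argmax over actions.
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward in P[(s, a)]:
                q[a] += prob * (reward + gamma * v[s_next])
        policy[s, np.argmax(q)] = 1.0   # ties broken arbitrarily
    return policy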
Policy iteration: definition

Policy iteration alternates policy evaluation of the current policy with greedy policy improvement, producing the sequence π0 → vπ0 → π1 → vπ1 → ...; when the greedy policy stops changing, it is optimal.
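Combining the two previous sketches gives the full loop (again my own sketch, reusing policy_evaluation and greedy_policy from above):

import numpy as np

def policy_iteration(P, n_states, n_actions, gamma):
    # Evaluate, improve, repeat, until the greedy policy is stable.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)  # start uniform
    while True:
        v = policy_evaluation(P, policy, n_states, n_actions, gamma)
        new_policy = greedy_policy(P, v, n_states, n_actions, gamma)
        if np.array_equal(new_policy, policy):
            return new_policy, v
        policy = new_policy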
Policy iteration: modified policy iteration

Does policy evaluation really need to converge to vπ before we improve? Modified policy iteration truncates the evaluation phase, for example stopping after k sweeps or after an ε-small change, and improves the policy straight away.

[Figure: the greedy policy π after iteration 3 of truncated evaluation]

Algorithm: modified policy iteration
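A minimal sketch of the idea in the same toy encoding, with k truncated evaluation sweeps per improvement step (my code, not the slide's algorithm box):

import numpy as np

def modified_policy_iteration(P, n_states, n_actions, gamma, k=3, theta=1e-8):
    # Truncated evaluation: only k synchronous sweeps under the current
    # policy, then improve greedily; stop once values settle.
    policy = np.full((n_states, n_actions), 1.0 / n_actions)
    v = np.zeros(n_states)
    while True:
        v_old = v.copy()
        for _ in range(k):
            v_new = np.zeros(n_states)
            for s in range(n_states):
                for a in range(n_actions):
                    for prob, s_next, reward in P[(s, a)]:
                        v_new[s] += policy[s, a] * prob * (reward + gamma * v[s_next])
            v = v_new
        policy = greedy_policy(P, v, n_states, n_actions, gamma)  # from the earlier sketch
        if np.max(np.abs(v - v_old)) < theta:
            return policy, v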
Value iteration: definition

Value iteration collapses evaluation and improvement into a single update: each sweep applies the Bellman optimality backup directly, and iterating to convergence yields v∗ (from which the greedy policy is optimal). It can be seen as modified policy iteration with a single evaluation sweep per improvement.
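The update rule is the Bellman optimality backup, shown here together with a minimal sketch in the same toy encoding (my code, not the lecture's):

v_{k+1}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_k(s') \right]

import numpy as np

def value_iteration(P, n_states, n_actions, gamma, theta=1e-8):
    # Replace v(s) with the best one-step lookahead over actions,
    # sweeping all states until the largest change is below theta.
    v = np.zeros(n_states)
    while True:
        v_new = np.zeros(n_states)
        for s in range(n_states):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward in P[(s, a)]:
                    q[a] += prob * (reward + gamma * v[s_next])
            v_new[s] = np.max(q)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new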
Take Away Points

Summary:
• dynamic programming solves the planning problem, but not the full reinforcement learning problem
• it requires a complete model of the environment
• policy evaluation solves the prediction problem
• there's a spectrum between policy iteration and value iteration; both solve the control problem

Extensions:
• Asynchronous DP (read Section 4.5 of Sutton & Barto [1])
• Play with the interactive demo by Andrej Karpathy
References I

[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press.
[2] D. Silver. UCL Course on Reinforcement Learning. Lecture slides.