
Reinforcement Learning

Lecture 4: Dynamic programming

Chris G. Willcocks
Durham University
Lecture overview
Lecture covers Chapter 4 in Sutton & Barto [1] and adaptations from David Silver [2]
1 Introduction
  • definition
  • examples
  • planning in an MDP
2 Policy evaluation
  • definition
  • synchronous algorithm
3 Policy iteration
  • policy improvement
  • definition
  • modified policy iteration
4 Value iteration
  • definition
  • summary and extensions
Introduction: dynamic programming definition

Definition: Dynamic programming

Dynamic programming is an optimisation method for sequential problems. DP algorithms are able to solve complex 'planning' problems: given a complete MDP, dynamic programming can find an optimal policy. This is achieved with two principles:

1. Breaking the problem down into subproblems
2. Caching and reusing optimal solutions to subproblems to find the overall optimal solution

A minimal code sketch of these two principles follows below.

[Figure: planning example, "what's the optimal policy?"]
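As an illustrative sketch (Python, not taken from the lecture), a memoised Fibonacci function breaks the computation into subproblems and caches their solutions:

from functools import lru_cache

@lru_cache(maxsize=None)                 # principle 2: cache and reuse subproblem solutions
def fib(n: int) -> int:
    if n < 2:                            # base cases
        return n
    return fib(n - 1) + fib(n - 2)       # principle 1: break the problem into subproblems

print(fib(50))                           # 12586269025, in linear rather than exponential time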
Introduction: dynamic programming examples

Famous examples:

• Dijkstra's algorithm
• Backpropagation
• Doing basic math

...so it's really just recursion and common sense! A sketch of the caching idea in Dijkstra's algorithm follows below.

[Figure: weighted graph with shortest-path distances being updated by Dijkstra's algorithm]
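Dijkstra's algorithm caches the best distance found so far to each node and reuses those cached distances when relaxing outgoing edges. A minimal sketch over a hypothetical adjacency-list graph (not the graph from the slide):

import heapq

def dijkstra(graph, source):
    # graph: dict mapping node -> list of (neighbour, edge_weight) pairs
    dist = {node: float('inf') for node in graph}     # cached best-known distances
    dist[source] = 0
    queue = [(0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist[u]:                               # stale queue entry, skip it
            continue
        for v, w in graph[u]:                         # reuse dist[u] to relax each edge
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(queue, (dist[v], v))
    return dist

print(dijkstra({'a': [('b', 2), ('c', 6)], 'b': [('c', 3)], 'c': []}, 'a'))
# {'a': 0, 'b': 2, 'c': 5}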
Introduction: planning in an MDP

Dynamic programming for planning in MDPs

In reinforcement learning, we want to use dynamic programming to solve MDPs. So given an MDP ⟨S, A, P, R, γ⟩ and a policy π:

First, we want to find the value function v_π for that policy:
• This is done by policy evaluation (the prediction problem)

Then, once we are able to evaluate a policy, we want to find the optimal value function v_∗ and hence the best policy (the control problem). This is done with two strategies:
1. Policy iteration
2. Value iteration

Follow along in Colab.
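The code on the following slides assumes a Gym-style tabular model where env.P[s][a] is a list of (probability, next_state, reward, done) tuples. As a hypothetical stand-in for the frozen lake environment, a hand-built two-state MDP in the same format might look like this:

import numpy as np

class TwoStateMDP:
    # toy deterministic MDP in the tabular format used by the lecture code:
    # P[state][action] = [(prob, next_state, reward, done), ...]
    num_states = nS = 2          # both attribute names used in the lecture code
    num_actions = nA = 2
    P = {
        0: {0: [(1.0, 0, 0.0, False)],    # action 0: stay in state 0, no reward
            1: [(1.0, 1, 1.0, True)]},    # action 1: reach terminal state 1, reward 1
        1: {0: [(1.0, 1, 0.0, True)],     # terminal state: self-loops with no reward
            1: [(1.0, 1, 0.0, True)]},
    }

# a uniform random policy: one row of action probabilities per state
random_policy = np.ones((TwoStateMDP.num_states, TwoStateMDP.num_actions)) / TwoStateMDP.num_actions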
Policy evaluation: definition

Definition: Policy evaluation

We want to evaluate a given policy π. We achieve this by iteratively applying the Bellman expectation equation, v_1 → v_2 → ... → v_π.

Example: frozen lake environment

[Figure: FrozenLake 4×4 grid environment with 16 numbered states]
Policy evaluation: synchronous algorithm

Algorithm: policy evaluation

import numpy as np

def policy_evaluation(env, policy, gamma, theta):
    V = np.zeros(env.num_states)          # table of state values, initialised to zero
    while True:
        delta = 0
        for s in range(env.num_states):
            # back up V[s] with one application of the Bellman expectation equation
            Vs = 0
            for a, a_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    Vs += a_prob * prob * (reward + gamma * V[next_state])
            delta = max(delta, abs(V[s] - Vs))
            V[s] = Vs
        if delta < theta:                  # stop when the largest update in a sweep is tiny
            break
    return V

[Figure: frozen lake value grid initialised to all zeros]
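As a usage sketch (with the hypothetical TwoStateMDP and random_policy from earlier, not the frozen lake environment):

env = TwoStateMDP()
V = policy_evaluation(env, random_policy, gamma=1.0, theta=1e-8)
print(V)   # approximately [1.0, 0.0]: half the time the agent moves straight to the goal,
           # otherwise it stays put, so v(0) = 0.5 * 1 + 0.5 * v(0), i.e. v(0) = 1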
Policy evaluation: synchronous algorithm

Recap: Bellman expectation equation

v_π(s) = Σ_{a∈A} π(a|s) q_π(s,a)
       = Σ_{a∈A} π(a|s) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_π(s') )

Each sweep of the algorithm applies this backup to every state; the slides step through successive sweeps on the frozen lake with γ = 1:

for s in range(env.num_states):
    Vs = 0
    for a, a_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            Vs += a_prob * prob * (reward + gamma * V[next_state])
    V[s] = Vs

[Figure: value grids for the uniform random policy after iterations 1, 2, 3, 4 and ∞; non-zero values appear first next to the goal (0.25 after one sweep) and propagate backwards until convergence, reaching roughly 0.44 in the state beside the goal]
Policy iteration: greedy policy improvement

[Figure: starting from the random policy, its converged value function (iteration ∞) is made greedy by taking max over actions a in every state, giving an improved policy]
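A sketch of this greedy improvement step in code (the function name and interface are assumptions, kept consistent with the earlier policy_evaluation code):

import numpy as np

def policy_improvement(env, V, gamma):
    # build a deterministic policy that is greedy with respect to the value function V
    policy = np.zeros((env.num_states, env.num_actions))
    for s in range(env.num_states):
        q_s = np.zeros(env.num_actions)
        for a in range(env.num_actions):
            # one-step look-ahead: expected return of taking action a, then following V
            for prob, next_state, reward, done in env.P[s][a]:
                q_s[a] += prob * (reward + gamma * V[next_state])
        policy[s, np.argmax(q_s)] = 1.0     # put all probability on the best action
    return policy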
Policy iteration: definition

Definition: Policy iteration

Given a policy π (e.g. starting with a random policy), iteratively alternate between evaluating the policy,

v_π(s) = E[R_{t+1} + γ R_{t+2} + ... | S_t = s],

and improving it greedily,

π' = greedy(v_π).

This always converges to the optimal policy π_∗. That is, if the improvements stop,

q_π(s, π'(s)) = max_{a∈A} q_π(s, a) = q_π(s, π(s)) = v_π(s),

then the Bellman optimality equation has been satisfied, v_π(s) = max_{a∈A} q_π(s, a), and therefore v_π(s) = v_∗(s) for all s ∈ S.

[Figure: example of learning a better policy]
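Putting evaluation and greedy improvement together, a sketch of the full policy iteration loop (built from the policy_evaluation and policy_improvement sketches above; the name policy_iteration is an assumption):

import numpy as np

def policy_iteration(env, gamma, theta=1e-8):
    # start from the uniform random policy and alternate evaluation and improvement
    policy = np.ones((env.num_states, env.num_actions)) / env.num_actions
    while True:
        V = policy_evaluation(env, policy, gamma, theta)    # evaluate the current policy
        new_policy = policy_improvement(env, V, gamma)      # act greedily on its value function
        if np.array_equal(new_policy, policy):              # improvement stopped: policy is optimal
            return policy, V
        policy = new_policy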
Policy iteration: modified policy iteration

Algorithm: modified policy iteration

What if we don't run iterative policy evaluation all the way to convergence? What if we just do a crude evaluation with a small number of sweeps, e.g. k = 3?

Does it still converge?
• Yes! It still converges to the optimal policy
• the special case k = 1 is equivalent to value iteration

A sketch of such a truncated evaluation follows below.

[Figure: partially evaluated frozen lake value grid after 3 sweeps]
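A sketch of the truncated evaluation this refers to: the same Bellman expectation backup as before, but limited to k in-place sweeps rather than run to convergence (the function name and the in-place V argument are assumptions):

def truncated_policy_evaluation(env, policy, V, gamma, k):
    # k sweeps of the Bellman expectation backup; no convergence check
    for _ in range(k):
        for s in range(env.num_states):
            Vs = 0
            for a, a_prob in enumerate(policy[s]):
                for prob, next_state, reward, done in env.P[s][a]:
                    Vs += a_prob * prob * (reward + gamma * V[next_state])
            V[s] = Vs
    return V

# inside policy iteration, V = truncated_policy_evaluation(env, policy, V, gamma, k=3)
# would replace the full policy_evaluation call; with k = 1 this matches value iteration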
Value iteration: definition

Bellman optimality equation

Recall the definition of the optimal value function according to the Bellman optimality equation:

v_∗(s) = max_a q_∗(s, a)
       = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} v_∗(s') )

We can iteratively apply this update with the one-step look-ahead to learn v_∗(s).

Algorithm: value iteration

def value_iteration(env, gamma, theta):
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for s in range(env.nS):
            v_s = V[s]
            # one-step look-ahead: compute q(s, a) for every action
            q_s = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    q_s[a] += prob * (reward + gamma * V[next_state])
            V[s] = max(q_s)                        # back up the best action value
            delta = max(delta, abs(V[s] - v_s))
        if delta < theta:
            break
    policy = greedily_from(env, V, gamma)          # extract the greedy policy from v_∗
    return policy, V
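The greedily_from helper is not defined on the slide; a plausible sketch is the same one-step greedy extraction used in policy improvement:

import numpy as np

def greedily_from(env, V, gamma):
    # extract the deterministic greedy policy from a learned value function
    policy = np.zeros((env.nS, env.nA))
    for s in range(env.nS):
        q_s = np.zeros(env.nA)
        for a in range(env.nA):
            for prob, next_state, reward, done in env.P[s][a]:
                q_s[a] += prob * (reward + gamma * V[next_state])
        policy[s, np.argmax(q_s)] = 1.0
    return policy

# usage with the hypothetical TwoStateMDP from earlier:
# policy, V = value_iteration(TwoStateMDP(), gamma=0.9, theta=1e-8)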
Take Away Points

Summary

In summary, dynamic programming:

• solves the planning problem, but not the full reinforcement learning problem
• requires a complete model of the environment
• policy evaluation solves the prediction problem
• there’s a spectrum between policy iteration and value iteration
• these solve the control problem
Extensions:
• Asynchronous DP (read section 4.5 of Sutton & Barto [1])
• Play with the interactive demo by Andrej Karpathy

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). MIT Press, 2018. Available online.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/, 2015.
