Lecture 26
Reinforcement Learning
Barnabás Póczos
Contents
RL Books
Introduction to Reinforcement Learning
Reinforcement Learning Applications
Finance
Portfolio optimization
Trading
Inventory optimization
Control
Elevator, Air conditioning, power grid, …
Robotics
Games
Go, Chess, Backgammon
Computer games
Chatbots
…
Reinforcement Learning Framework
Diagram: the agent-environment interaction loop. At each step the agent takes an action; the environment returns the next state and a reward.
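As a concrete illustration of this loop, a minimal Python sketch; Env and RandomAgent are hypothetical stand-ins (not from the lecture), and the reset/step interface only mimics common RL toolkits:

import random

class Env:
    def reset(self):
        return 0                                  # initial state s0
    def step(self, action):
        next_state = random.randint(0, 4)         # toy random dynamics
        reward = 1.0 if action == 1 else 0.0      # toy reward
        done = (next_state == 4)                  # episode ends in state 4
        return next_state, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([0, 1])              # two available actions

env, agent = Env(), RandomAgent()
state, done, total_reward = env.reset(), False, 0.0
while not done:
    action = agent.act(state)                     # agent picks action a_t from state s_t
    state, reward, done = env.step(action)        # environment returns s_{t+1}, r_{t+1}
    total_reward += reward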
Markov Decision Processes
RL framework + the Markov assumption: the next state and reward depend only on the current state and action, not on the earlier history
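Written out in the standard notation (not copied from the slide):
\[
P(s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) \;=\; P(s_{t+1}=s',\, r_{t+1}=r \mid s_t, a_t).
\]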
Discount Rates
An issue: with an infinite horizon, the sum of future rewards can be infinite.
Solution: discount rewards that arrive k steps in the future by γ^k, with discount rate 0 ≤ γ < 1.
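For reference, the standard discounted return that the discount rate γ defines (assumed notation, not copied from the slide):
\[
R_t \;=\; r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots \;=\; \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1,
\]
which stays finite whenever the per-step rewards are bounded.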
RL is different from Supervised/Unsupervised Learning
State-Value Function
Backup Diagram:
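In the standard notation (a reconstruction, since the slide's formula is not in the extracted text), the state-value function of a policy π is
\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right].
\]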
Bellman Equation
Proof of Bellman Equation:
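The equation being proved is, in its standard form (the slide's own derivation is not reproduced here):
\[
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^{\pi}(s')\right].
\]
It follows by splitting the return into the first reward plus γ times the return from the next state and taking expectations.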
Action-Value Function
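Analogously to V, the action-value function of a policy π is (standard definition, reconstructed):
\[
Q^{\pi}(s,a) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right].
\]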
Relation between Q and V Functions
Q from V:
V from Q:
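The two relations referred to above, in standard form (reconstructed, not copied from the slide):
\[
Q^{\pi}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^{\pi}(s')\right],
\qquad
V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s)\, Q^{\pi}(s,a).
\]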
The Optimal Value Function and Optimal Policy
Partial ordering between policies:
V*(s) is the maximum expected discounted reward that one can achieve from state s with optimal play
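In formulas (standard definitions, reconstructed): a policy π is at least as good as π' iff its value is at least as large in every state, and V* is the value of the best policy:
\[
\pi \ge \pi' \;\Longleftrightarrow\; V^{\pi}(s) \ge V^{\pi'}(s) \ \text{for all } s,
\qquad
V^{*}(s) \;=\; \max_{\pi} V^{\pi}(s).
\]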
The Optimal Action-Value Function
Similarly, the optimal action-value function:
Important Properties:
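In standard notation (reconstructed), the optimal action-value function and the properties presumably listed here are:
\[
Q^{*}(s,a) \;=\; \max_{\pi} Q^{\pi}(s,a),
\qquad
V^{*}(s) \;=\; \max_{a} Q^{*}(s,a),
\qquad
Q^{*}(s,a) \;=\; \mathbb{E}\!\left[r_{t+1} + \gamma V^{*}(s_{t+1}) \mid s_t = s,\ a_t = a\right].
\]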
The Existence of the Optimal Policy
Calculating the Value of Policy π
π1: always choosing Action 1
Goal
Calculating the Value of Policy π
π2: always choosing Action 2
Goal
As before:
Calculating the Value of Policy π
π3: mixed
Goal
Comparing the 3 policies
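The values compared here can be computed exactly by solving the linear Bellman system V = R^π + γ P^π V. A minimal Python sketch with a made-up two-state, two-action MDP (the numbers are purely illustrative, not the lecture's example):

import numpy as np

# Hypothetical 2-state MDP with two actions (purely illustrative numbers).
# P[a][s, s'] = transition probability, R[a][s] = expected immediate reward.
gamma = 0.9
P = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}

def evaluate(policy):
    # policy[s] = action chosen in state s; solve (I - gamma * P^pi) V = R^pi for V^pi.
    n = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(n)])
    R_pi = np.array([R[policy[s]][s] for s in range(n)])
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

print(evaluate([0, 0]))   # pi1: always the first action
print(evaluate([1, 1]))   # pi2: always the second action
print(evaluate([0, 1]))   # pi3: a mixed (state-dependent) choice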
Bellman optimality equation for V*
Just as we derived the Bellman equations for V and Q, we can derive Bellman equations for V* and Q* as well
Backup Diagram:
Bellman optimality equation for V*
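In standard form (reconstructed), the Bellman optimality equation for V* is
\[
V^{*}(s) \;=\; \max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V^{*}(s')\right].
\]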
Bellman optimality equation for Q*
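In standard form (reconstructed), the Bellman optimality equation for Q* is
\[
Q^{*}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q^{*}(s',a')\right].
\]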
Greedy Policy for V
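The greedy policy with respect to a value function V is (standard form, reconstructed)
\[
\pi(s) \;=\; \arg\max_{a} \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma V(s')\right];
\]
acting greedily with respect to V* yields an optimal policy.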
The Optimal Value Function and Optimal Policy
RL Tasks
Policy evaluation:
Policy improvement:
Policy Evaluation
Policy Evaluation with Bellman Operator
Bellman equation:
Iteration:
Theorem:
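A minimal Python sketch of this iteration, assuming a tabular representation where P_pi is the |S|×|S| transition matrix under π and R_pi the vector of expected one-step rewards (names are illustrative, not the lecture's):

import numpy as np

def policy_evaluation(P_pi, R_pi, gamma=0.9, tol=1e-8):
    # Repeatedly apply the Bellman operator V <- R^pi + gamma * P^pi V.
    # The operator is a gamma-contraction, so the iteration converges to V^pi.
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new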
Policy Improvement
Policy Improvement
Theorem:
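The theorem referenced here is presumably the standard policy improvement theorem:
\[
\text{If } Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s) \ \text{for all } s, \ \text{then } V^{\pi'}(s) \ge V^{\pi}(s) \ \text{for all } s.
\]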
Proof of Policy Improvement
Proof:
Finding the Optimal Policy
Finding the Optimal Policy
Model-based approaches:
First we will discuss methods that need to know the model:
Policy Iteration
Value Iteration
Model-free approaches:
Policy Iteration
1. Initialization
2. Policy Evaluation
3. Policy Improvement
One drawback of policy iteration is that each iteration involves policy evaluation
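A minimal policy iteration sketch in Python, assuming the same kind of hypothetical tabular P and R arrays as in the earlier sketch (illustrative, not the lecture's code):

import numpy as np

def policy_iteration(P, R, n_states, n_actions, gamma=0.9):
    # Alternate exact policy evaluation and greedy policy improvement
    # until the policy stops changing.
    policy = np.zeros(n_states, dtype=int)          # 1. initialization: arbitrary policy
    while True:
        # 2. policy evaluation: solve V = R^pi + gamma * P^pi V exactly
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # 3. policy improvement: act greedily with respect to V
        Q = np.array([[R[a][s] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy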
Value Iteration
Main idea:
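A minimal value iteration sketch under the same hypothetical tabular setup, repeatedly applying the Bellman optimality operator:

import numpy as np

def value_iteration(P, R, n_states, n_actions, gamma=0.9, tol=1e-8):
    # Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ].
    V = np.zeros(n_states)
    while True:
        Q = np.array([[R[a][s] + gamma * P[a][s] @ V for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
        V = V_new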
Model-Free Methods
Monte Carlo Policy Evaluation
Monte Carlo Policy Evaluation
Without knowing the model
Monte Carlo Estimation of V(s)
Empirical average: Let us use N simulations starting from state s, following policy π. The observed rewards are:
Let
Similarly,
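The missing formulas are presumably the usual empirical average: with observed returns R^{(1)}(s), …, R^{(N)}(s),
\[
\hat{V}^{\pi}(s) \;=\; \frac{1}{N}\sum_{i=1}^{N} R^{(i)}(s) \;\longrightarrow\; V^{\pi}(s) \quad \text{as } N \to \infty,
\]
and similarly Q^π(s,a) can be estimated by averaging the returns of simulations started from state s with first action a.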
A better MC method
From one single trajectory we can get lots of R estimates:
s0 → s1 → s2 → … → sT
with rewards r1, r2, r3, r4, …
R(s0), R(s1), R(s2), …
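A minimal Python sketch of this idea, computing a discounted return for every state visited in one trajectory and folding them into running averages (illustrative names, not the lecture's code):

from collections import defaultdict

def mc_update_from_trajectory(states, rewards, V_sum, V_count, gamma=0.9):
    # states = [s0, ..., s_{T-1}], rewards = [r1, ..., rT].
    # Work backwards so every return R(s_t) = r_{t+1} + gamma*r_{t+2} + ... costs O(1).
    G = 0.0
    for t in reversed(range(len(states))):
        G = rewards[t] + gamma * G           # return observed from state s_t
        V_sum[states[t]] += G
        V_count[states[t]] += 1

V_sum, V_count = defaultdict(float), defaultdict(int)
mc_update_from_trajectory(['s0', 's1', 's2'], [1.0, 0.0, 2.0], V_sum, V_count)
V = {s: V_sum[s] / V_count[s] for s in V_sum}    # Monte Carlo estimate of V(s)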
Temporal Differences method
We already know the MC estimation of V:
Temporal Differences method
Temporal difference:
Benefits:
No need for a model! (Dynamic programming with Bellman operators needs one!)
No need to wait for the end of the episode! (MC methods do!)
We use an estimator for creating another estimator (= bootstrapping) … and still it works
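A minimal TD(0) sketch in Python, assuming a tabular value function stored in a dict and a learning rate alpha (names are illustrative):

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One TD(0) step: move V(s) towards the bootstrapped target r + gamma*V(s_next).
    # The bracketed quantity below is the temporal difference (TD error).
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return V

V = {}
V = td0_update(V, 's0', 1.0, 's1')   # update after observing the transition (s0, r=1, s1)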
Comparisons: DP, MC, TD
They all estimate V
DP:
Estimate comes from the Bellman equation
It needs to know the model
TD:
Expectation is approximated with random samples
Doesn’t need to wait for the end of the episodes.
MC:
Expectation is approximated with random samples
It needs to wait for the end of the episodes
MDP Backup Diagrams
White circle: state
Black circle: action
T: terminal state
Diagram: backup trees rooted at the current state s_t, with episodes ending in terminal states T
Monte Carlo Backup Diagram
Diagram: a single sampled trajectory s_t, r_{t+1}, s_{t+1}, r_{t+2}, s_{t+2}, … followed all the way to a terminal state T
Temporal Differences Backup Diagram
Diagram: a single sampled transition s_t, r_{t+1}, s_{t+1}; the backup stops there and bootstraps from the estimate V(s_{t+1})
Dynamic Programming Backup Diagram
Diagram: a full one-step backup from s_t, branching over all actions and all successor states s_{t+1} with rewards r_{t+1}
TD for function Q
This was our TD estimate for V:
Finding The Optimal Policy with TD
Finding The Optimal Policy with TD
We already know the Bellman equation for Q*:
DP update:
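For reference, the standard forms of these two updates (reconstructed, not copied from the slides): the DP update applies the Bellman optimality operator, and the TD (Q-learning) update replaces the expectation with a single sampled transition and a learning rate α:
\[
Q_{k+1}(s,a) \;=\; \sum_{s'} P(s' \mid s, a)\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right],
\]
\[
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right].
\]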
Q Learning Algorithm
Q(s,a) := arbitrary initial values
For each episode
   s := s0; t := 0
   For each step of the episode (until s is terminal)
      t := t+1
      Choose action a in s using a policy derived from Q (e.g., ε-greedy)
      Execute action a; observe reward r and next state s'
      Q(s,a) := Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
      s := s'
   End For
End For
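A minimal tabular Q-learning sketch in Python, assuming an environment with the reset/step interface used in the earlier sketches (illustrative, not the lecture's code):

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning with an epsilon-greedy behaviour policy.
    Q = defaultdict(float)                               # Q[(s, a)], initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:                # explore
                a = random.choice(actions)
            else:                                        # exploit: greedy in Q
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # Q-learning update: bootstrap with the greedy value of the next state
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q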
Q Learning Algorithm