
Introduction to Machine Learning

Reinforcement Learning

Barnabás Póczos
Contents

• Markov Decision Processes:
  • State-Value Function, Action-Value Function
  • Bellman Equation
  • Policy Evaluation, Policy Improvement, Optimal Policy
• Dynamic Programming:
  • Policy Iteration
  • Value Iteration
• Model-Free Methods:
  • MC Tree Search
  • TD Learning

2
RL Books
Introduction to
Reinforcement Learning

4
Reinforcement Learning
Applications
• Finance
• Portfolio optimization
• Trading
• Inventory optimization
• Control
• Elevator, air conditioning, power grid, …
• Robotics
• Games
• Go, Chess, Backgammon
• Computer games
• Chatbots
• …
5
Reinforcement Learning
Framework
[Diagram: Agent-Environment interaction loop]

6
Markov Decision Processes
RL Framework + Markov assumption

7
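One standard way to write the Markov assumption (notation assumed here, not taken from the slide): the next state depends only on the current state and action,

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t)

so the RL framework together with this assumption defines a Markov Decision Process (S, A, P, R, \gamma).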
Discount Rates

An issue:

Solution:

8
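A sketch of the usual issue and its resolution, in assumed notation: an undiscounted sum of rewards over an infinite horizon may diverge, so one works with the discounted return

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1,

which stays bounded whenever the rewards are bounded.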
RL is different from
Supervised/Unsupervised learning

9
State-Value Function

Bellman Equation of V state-value function:

Backup Diagram:

10
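A standard form of the definition and of the Bellman equation, assuming transition probabilities P(s'|s,a), expected rewards R(s,a,s') and the discounted return R_t from above:

V^\pi(s) = \mathbb{E}_\pi\left[ R_t \mid s_t = s \right]

V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

In words: the value of a state is the expected immediate reward plus the discounted value of the successor state, averaged over the policy and the transition dynamics (this is what the backup diagram depicts).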
Bellman Equation
Proof of Bellman Equation:

11
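A sketch of the usual argument, under the assumptions above (not a verbatim reproduction of the slide):

V^\pi(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s \right]
         = \mathbb{E}_\pi\left[ r_{t+1} \mid s_t = s \right] + \gamma\, \mathbb{E}_\pi\left[ V^\pi(s_{t+1}) \mid s_t = s \right]
         = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

using R_t = r_{t+1} + \gamma R_{t+1}, the tower property of conditional expectation, and the Markov property.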
Action-Value Function

Bellman Equation of the Q Action-Value function:

Proof: similar to the proof of the Bellman Equation of V state-value function.


Backup Diagram:

12
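In the same assumed notation, the action-value function and its Bellman equation read:

Q^\pi(s,a) = \mathbb{E}_\pi\left[ R_t \mid s_t = s, a_t = a \right]

Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a') \right]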
Relation between Q and V Functions

Q from V:

V from Q:

13
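In the same assumed notation, the two directions are:

Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s,a)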
The Optimal Value Function
and Optimal Policy
Partial ordering between policies:

Some policies are not comparable!

Optimal policy and optimal state-value function:

V*(s) shows the maximum expected discounted reward that one can
achieve from state s with optimal play.
14
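One common way to write these definitions (assumed notation):

\pi \ge \pi' \iff V^\pi(s) \ge V^{\pi'}(s) \ \text{ for all } s

V^*(s) = \max_\pi V^\pi(s), \qquad \pi^* \in \arg\max_\pi V^\pi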
The Optimal Action-Value
Function
Similarly, the optimal action-value function:

Important Properties:

15
The Existence of the Optimal Policy

Theorem: For any Markov Decision Process

(*) There is always a deterministic optimal policy for any MDP


16
Example

Goal = Terminal state


• 4 states
• 2 possible actions in each state (e.g., in A: 1) go to B or 2) go to C)
• P(s’ | s, a) = (0.9, 0.1): with probability 10% we go in the wrong direction

17
Calculating the Value of Policy π
π1: always choosing Action 1

Goal

18
Calculating the Value of Policy π
π2: always choosing Action 2

Goal

Similarly as before:

19
Calculating the Value of Policy π
π3: mixed

Goal

20
Comparing the 3 policies

21
Bellman optimality equation for V*
Similarly to how we derived the Bellman equations for V and Q,
we can derive Bellman equations for V* and Q* as well

We proved this for V:

Theorem: Bellman optimality equation for V*:

Backup Diagram:

22
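A standard form of the Bellman optimality equation for V*, in the assumed notation:

V^*(s) = \max_{a} \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Unlike the Bellman equation for a fixed policy, the max over actions makes this a nonlinear equation in V*.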
Bellman optimality equation for V*

Proof of Bellman optimality equation for V*:

23
Bellman optimality equation for Q*

Bellman optimality equation for Q*:

Proof: similar to the proof of the Bellman Equation of V*.


Backup Diagram:

24
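A standard form of the Bellman optimality equation for Q*, in the assumed notation:

Q^*(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

together with V^*(s) = \max_a Q^*(s,a).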
Greedy Policy for V

Equivalently (greedy policy for a given V(s) function):

25
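A standard form of the greedy policy, in the assumed notation:

\pi_{\text{greedy}}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V(s') \right]

or, when an action-value function is available, \pi_{\text{greedy}}(s) = \arg\max_a Q(s,a).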
The Optimal Value Function
and Optimal Policy

Bellman optimality equation for V*:

This is a nonlinear equation!

Theorem: A greedy policy for V* is an optimal policy. Let us denote it by π*.

Theorem: A greedy optimal policy can be obtained from the optimal value function:

26
RL Tasks

• Policy evaluation:

• Policy improvement

• Finding an optimal policy

27
Policy Evaluation

28
Policy Evaluation with Bellman Operator
Bellman equation:

This equation can be used as a fixed-point equation to evaluate policy π


Bellman operator (one step with π, then using V):

Iteration:

Theorem:
29
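A minimal sketch of this fixed-point iteration in Python, assuming a tabular problem with a known model; the data layout (P[s][a] as a list of (prob, next_state, reward) triples, policy[s][a] as action probabilities) and all names are illustrative, not from the slides.

import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman operator for pi until V converges to V_pi."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # one step with pi, then bootstrap with the current V
                    V_new[s] += pi_sa * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:   # stop once the fixed point is (numerically) reached
            return V_new
        V = V_new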
Policy Improvement

30
Policy Improvement

Theorem:

31
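The standard statement of the policy improvement theorem, written in the assumed notation (the slide's exact wording is not reproduced here):

\text{If } Q^\pi(s, \pi'(s)) \ge V^\pi(s) \ \text{ for all } s, \text{ then } V^{\pi'}(s) \ge V^\pi(s) \ \text{ for all } s.

In particular, the policy that acts greedily with respect to V^\pi is at least as good as \pi, and strictly better in some state unless \pi is already optimal.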
Proof of Policy Improvement
Proof:

32
Finding the Optimal Policy

33
Finding the Optimal Policy
Model-based approaches:
First we will discuss methods that need to know the model:

• Policy Iteration
• Value Iteration

Model-free approaches:

• Monte Carlo Method


• TD Learning
34
Policy Iteration
1. Initialization

2. Policy Evaluation

35
Policy Iteration
3. Policy Improvement

One drawback of policy iteration is that each iteration involves a full policy evaluation.
36
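A compact sketch of the policy iteration loop, reusing the policy_evaluation sketch above and the same assumed model layout (all names illustrative):

import numpy as np

def q_from_v(P, V, s, gamma):
    # action values at state s computed from V and the model
    return [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))]

def policy_iteration(P, gamma=0.9):
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions       # 1. initialization (uniform policy)
    while True:
        V = policy_evaluation(P, policy, gamma)                # 2. policy evaluation (full)
        new_policy = np.zeros_like(policy)
        for s in range(n_states):
            new_policy[s, int(np.argmax(q_from_v(P, V, s, gamma)))] = 1.0   # 3. greedy improvement
        if np.array_equal(new_policy, policy):                 # policy stable => optimal
            return policy, V
        policy = new_policy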
Value Iteration
Main idea:

The greedy operator:

The value iteration update:

37
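A minimal sketch of value iteration under the same assumed model layout (names illustrative): apply the greedy operator directly, without a full policy evaluation per step.

import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        # greedy (Bellman optimality) operator applied to the current V
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # read off a greedy (hence optimal) deterministic policy from the converged V
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy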
Model Free Methods

38
Monte Carlo Policy Evaluation

39
Monte Carlo Policy Evaluation
Without knowing the model

40
Monte Carlo Estimation of V(s)
• Empirical average: Let us use N simulations
  starting from state s following policy π. The
  observed rewards are:

• Let

• This is the so-called “Monte Carlo” method.


• MC can estimate V(s) without knowing the model
41
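Written out (assumed notation), the empirical average over the N observed returns R_1, …, R_N is

\hat{V}(s) = \frac{1}{N} \sum_{i=1}^{N} R_i \approx V^\pi(s)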
Online Averages (=Running averages)
• If we don’t want to store the N sample points:

Similarly,

42
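A standard form of the running-average update (assumed notation): after observing the k-th return R_k,

\hat{V}_k(s) = \hat{V}_{k-1}(s) + \frac{1}{k}\left( R_k - \hat{V}_{k-1}(s) \right)

and with a constant step size \alpha this becomes \hat{V}(s) \leftarrow \hat{V}(s) + \alpha \left( R - \hat{V}(s) \right).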
A better MC method
• From a single trajectory we can get many estimates of R:

  s_0 → s_1 → s_2 → … → s_T  with rewards r_1, r_2, r_3, r_4, …,
  giving return estimates R(s_0), R(s_1), R(s_2), …

• Warning: These R(s_i) random variables might be dependent!

43
Temporal Differences method
We already know the MC estimation of V:

Here is another estimate:

44
Temporal Differences method

Instead of waiting for R_k, we estimate it using V_{k-1}

• Temporal difference:
• Benefits
  • No need for a model! (Dynamic Programming with Bellman operators needs one!)
  • No need to wait for the end of the episode! (MC methods do)
  • We use an estimator to create another estimator (= bootstrapping) … and still it works
45
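A standard form of the TD(0) update sketched above, assuming a step size \alpha:

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the temporal difference \delta_t.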
Comparisons: DP, MC, TD
• They all estimate V
• DP:
  • Estimate comes from the Bellman equation
  • It needs to know the model
• TD:
  • Expectation is approximated with random samples
  • Doesn’t need to wait for the end of the episodes
• MC:
  • Expectation is approximated with random samples
  • It needs to wait for the end of the episodes

46
MDP Backup Diagrams
• White circle: state
• Black circle: action
• T: terminal state

[Diagram: MDP backup tree rooted at s_t, branching over actions and successor states down to terminal states T]

47
Monte Carlo Backup Diagram

[Diagram: a Monte Carlo backup follows one complete sampled trajectory s_t, r_{t+1}, s_{t+1}, r_{t+2}, s_{t+2}, … until a terminal state T]

48
Temporal Differences Backup Diagram

[Diagram: a TD backup uses a single sampled step s_t, r_{t+1}, s_{t+1} and then bootstraps from V(s_{t+1})]

49
Dynamic Programming Backup Diagram
[Diagram: a DP backup is one step deep but full width: it averages over all actions and all successor states s_{t+1} with rewards r_{t+1}]

50
TD for function Q
This was our TD estimate for V:

We can use the same for Q(s,a):

51
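One standard instantiation of this idea (the on-policy, SARSA-style form), assuming a step size \alpha and the next action a_{t+1} chosen by the current policy:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]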
Finding The Optimal Policy with TD

52
Finding The Optimal Policy with TD
• We already know the Bellman equation for Q*:

• DP update:

• TD update for Q [= Q-Learning]:

53
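A standard form of the Q-Learning update, assuming a step size \alpha:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]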
Q Learning Algorithm

• Q(s,a) arbitrary
• For each episode
  • s := s0; t := 0
  • For each time step t in the actual episode
    • t := t + 1
    • Choose action a according to a policy π (e.g., epsilon-greedy)
    • Execute action a
    • Observe reward r and new state s’
    • Update Q(s,a) with the TD update for Q (previous slide)
    • s := s’
  • End For
• End For

54
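A minimal runnable sketch of the algorithm above in Python; the environment interface (reset() -> s, step(a) -> (s', r, done), env.actions) and all parameter values are assumptions for illustration, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                                  # Q(s,a) arbitrary (zero) initialization

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)               # explore
        return max(env.actions, key=lambda a: Q[(s, a)])    # exploit

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)                           # choose action from the behavior policy
            s_next, r, done = env.step(a)                   # execute it, observe r and s'
            # off-policy TD target: greedy value at s', regardless of the policy being followed
            best_next = max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Because the target uses the max over a' rather than the action the behavior policy will actually take, the learned Q approximates Q*, which is exactly the off-policy property described on the next slide.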
Q Learning Algorithm

• Q-learning learns an optimal policy no matter which policy the agent is actually following (i.e., which action a it selects for any state s), as long as there is no bound on the number of times it tries an action in any state (i.e., it does not always do the same subset of actions in a state).

• Because it learns an optimal policy no matter which policy it is carrying out, it is called an off-policy method.

55
