
Introduction to Machine Learning

Reinforcement Learning

Barnabás Póczos
Contents

• Markov Decision Processes:
  • State-Value Function, Action-Value Function
  • Bellman Equation
  • Policy Evaluation, Policy Improvement, Optimal Policy
• Dynamic Programming:
  • Policy Iteration
  • Value Iteration
• Model-Free Methods:
  • MC Tree Search
  • TD Learning

2
RL Books
Introduction to
Reinforcement Learning

4
Reinforcement Learning
Applications
• Finance
• Portfolio optimization
• Trading
• Inventory optimization
• Control
• Elevator, air conditioning, power grid, …
• Robotics
• Games
• Go, Chess, Backgammon
• Computer games
• Chatbots
• …
5
Reinforcement Learning
Framework
[Diagram: Agent-Environment interaction loop]

6
Markov Decision Processes
RL Framework + Markov assumption

7
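One standard way to write the Markov assumption (notation assumed here, not taken from the slide): the next state depends only on the current state and action,

P(s_{t+1} = s' \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} = s' \mid s_t, a_t)

so the RL framework together with this assumption defines a Markov Decision Process (S, A, P, R, \gamma).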
Discount Rates

An issue:

Solution:

8
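A sketch of the usual issue and its resolution, in assumed notation: an undiscounted sum of rewards over an infinite horizon may diverge, so one works with the discounted return

R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}, \qquad 0 \le \gamma < 1,

which stays bounded whenever the rewards are bounded.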
RL is different from
Supervised/Unsupervised learning

9
State-Value Function

Bellman Equation of V state-value function:

Backup Diagram:

10
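A standard form of the definition and of the Bellman equation, assuming transition probabilities P(s'|s,a), expected rewards R(s,a,s') and the discounted return R_t from above:

V^\pi(s) = \mathbb{E}_\pi\left[ R_t \mid s_t = s \right]

V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

In words: the value of a state is the expected immediate reward plus the discounted value of the successor state, averaged over the policy and the transition dynamics (this is what the backup diagram depicts).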
Bellman Equation
Proof of Bellman Equation:

11
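A sketch of the usual argument, under the assumptions above (not a verbatim reproduction of the slide):

V^\pi(s) = \mathbb{E}_\pi\left[ r_{t+1} + \gamma R_{t+1} \mid s_t = s \right]
         = \mathbb{E}_\pi\left[ r_{t+1} \mid s_t = s \right] + \gamma\, \mathbb{E}_\pi\left[ V^\pi(s_{t+1}) \mid s_t = s \right]
         = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

using R_t = r_{t+1} + \gamma R_{t+1}, the tower property of conditional expectation, and the Markov property.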
Action-Value Function

Bellman Equation of the Q Action-Value function:

Proof: similar to the proof of the Bellman Equation of V state-value function.


Backup Diagram:

12
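In the same assumed notation, the action-value function and its Bellman equation read:

Q^\pi(s,a) = \mathbb{E}_\pi\left[ R_t \mid s_t = s, a_t = a \right]

Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \sum_{a'} \pi(a' \mid s')\, Q^\pi(s',a') \right]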
Relation between Q and V Functions

Q from V:

V from Q:

13
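In the same assumed notation, the two directions are:

Q^\pi(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^\pi(s') \right]

V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s,a)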
The Optimal Value Function
and Optimal Policy
Partial ordering between policies:

Some policies are not comparable!

Optimal policy and optimal state-value function:

V*(s) shows the maximum expected discounted reward that one can
achieve from state s with optimal play.
14
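One common way to write these definitions (assumed notation):

\pi \ge \pi' \iff V^\pi(s) \ge V^{\pi'}(s) \ \text{ for all } s

V^*(s) = \max_\pi V^\pi(s), \qquad \pi^* \in \arg\max_\pi V^\pi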
The Optimal Action-Value
Function
Similarly, the optimal action-value function:

Important Properties:

15
The Existence of the Optimal Policy

Theorem: For any Markov Decision Process

(*) There is always a deterministic optimal policy for any MDP


16
Example

Goal = Terminal state


• 4 states
• 2 possible actions in each state (e.g., in A: 1) go to B or 2) go to C)
• P(s’ | s, a) = (0.9, 0.1): with probability 10% we go in the wrong direction

17
Calculating the Value of Policy π
π1: always choosing Action 1

Goal

18
Calculating the Value of Policy π
π2: always choosing Action 2

Goal

Similarly as before:

19
Calculating the Value of Policy π
π3: mixed

Goal

20
Comparing the 3 policies

21
Bellman optimality equation for V*
Similarly to how we derived the Bellman equations for V and Q,
we can derive Bellman equations for V* and Q* as well

We proved this for V:

Theorem: Bellman optimality equation for V*:

Backup Diagram:

22
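A standard form of the Bellman optimality equation for V*, in the assumed notation:

V^*(s) = \max_{a} \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V^*(s') \right]

Unlike the Bellman equation for a fixed policy, the max over actions makes this a nonlinear equation in V*.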
Bellman optimality equation for V*

Proof of Bellman optimality equation for V*:

23
Bellman optimality equation for Q*

Bellman optimality equation for Q*:

Proof: similar to the proof of the Bellman Equation of V*.


Backup Diagram:

24
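A standard form of the Bellman optimality equation for Q*, in the assumed notation:

Q^*(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]

together with V^*(s) = \max_a Q^*(s,a).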
Greedy Policy for V

Equivalently (greedy policy for a given V(s) function):

25
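A standard form of the greedy policy, in the assumed notation:

\pi_{\text{greedy}}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma V(s') \right]

or, when an action-value function is available, \pi_{\text{greedy}}(s) = \arg\max_a Q(s,a).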
The Optimal Value Function
and Optimal Policy

Bellman optimality equation for V*:

This is a nonlinear equation!

Theorem: A greedy policy for V* is an optimal policy. Let us denote it by π*.

Theorem: A greedy optimal policy can be obtained from the optimal value function:

26
RL Tasks

• Policy evaluation:

• Policy improvement

• Finding an optimal policy

27
Policy Evaluation

28
Policy Evaluation with Bellman Operator
Bellman equation:

This equation can be used as a fixed-point equation to evaluate policy π


Bellman operator (one step with π, then using V):

Iteration:

Theorem:
29
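A minimal sketch of this fixed-point iteration in Python, assuming a tabular problem with a known model; the data layout (P[s][a] as a list of (prob, next_state, reward) triples, policy[s][a] as action probabilities) and all names are illustrative, not from the slides.

import numpy as np

def policy_evaluation(P, policy, gamma=0.9, tol=1e-8):
    """Iteratively apply the Bellman operator for pi until V converges to V_pi."""
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # one step with pi, then bootstrap with the current V
                    V_new[s] += pi_sa * prob * (reward + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < tol:   # stop once the fixed point is (numerically) reached
            return V_new
        V = V_new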
Policy Improvement

30
Policy Improvement

Theorem:

31
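The standard statement of the policy improvement theorem, written in the assumed notation (the slide's exact wording is not reproduced here):

\text{If } Q^\pi(s, \pi'(s)) \ge V^\pi(s) \ \text{ for all } s, \text{ then } V^{\pi'}(s) \ge V^\pi(s) \ \text{ for all } s.

In particular, the policy that acts greedily with respect to V^\pi is at least as good as \pi, and strictly better in some state unless \pi is already optimal.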
Proof of Policy Improvement
Proof:

32
Finding the Optimal Policy

33
Finding the Optimal Policy
Model-based approaches:
First we will discuss methods that need to know the model:

• Policy Iteration
• Value Iteration

Model-free approaches:

• Monte Carlo Method


• TD Learning
34
Policy Iteration
1. Initialization

2. Policy Evaluation

35
Policy Iteration
3. Policy Improvement

One drawback of policy iteration is that each iteration involves a full policy evaluation.
36
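A compact sketch of the policy iteration loop, reusing the policy_evaluation sketch above and the same assumed model layout (all names illustrative):

import numpy as np

def q_from_v(P, V, s, gamma):
    # action values at state s computed from V and the model
    return [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in range(len(P[s]))]

def policy_iteration(P, gamma=0.9):
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions       # 1. initialization (uniform policy)
    while True:
        V = policy_evaluation(P, policy, gamma)                # 2. policy evaluation (full)
        new_policy = np.zeros_like(policy)
        for s in range(n_states):
            new_policy[s, int(np.argmax(q_from_v(P, V, s, gamma)))] = 1.0   # 3. greedy improvement
        if np.array_equal(new_policy, policy):                 # policy stable => optimal
            return policy, V
        policy = new_policy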
Value Iteration
Main idea:

The greedy operator:

The value iteration update:

37
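A minimal sketch of value iteration under the same assumed model layout (names illustrative): apply the greedy operator directly, without a full policy evaluation per step.

import numpy as np

def value_iteration(P, gamma=0.9, tol=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        # greedy (Bellman optimality) operator applied to the current V
        V_new = np.array([
            max(sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # read off a greedy (hence optimal) deterministic policy from the converged V
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in range(n_states)]
    return V, policy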
Model Free Methods

38
Monte Carlo Policy Evaluation

39
Monte Carlo Policy Evaluation
Without knowing the model

40
Monte Carlo Estimation of V(s)
• Empirical average: Let us use N simulations
  starting from state s following policy π. The
  observed rewards are:

• Let

• This is the so-called “Monte Carlo” method.


• MC can estimate V(s) without knowing the model
41
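Written out (assumed notation), the empirical average over the N observed returns R_1, …, R_N is

\hat{V}(s) = \frac{1}{N} \sum_{i=1}^{N} R_i \approx V^\pi(s)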
Online Averages (=Running averages)
• If we don’t want to store the N sample points:

Similarly,

42
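A standard form of the running-average update (assumed notation): after observing the k-th return R_k,

\hat{V}_k(s) = \hat{V}_{k-1}(s) + \frac{1}{k}\left( R_k - \hat{V}_{k-1}(s) \right)

and with a constant step size \alpha this becomes \hat{V}(s) \leftarrow \hat{V}(s) + \alpha \left( R - \hat{V}(s) \right).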
A better MC method
• From a single trajectory we can get many estimates of R:

  s_0 → s_1 → s_2 → … → s_T  with rewards r_1, r_2, r_3, r_4, …,
  giving return estimates R(s_0), R(s_1), R(s_2), …

• Warning: These R(s_i) random variables might be dependent!

43
Temporal Differences method
We already know the MC estimation of V:

Here is another estimate:

44
Temporal Differences method

Instead of waiting for R_k, we estimate it using V_{k-1}

• Temporal difference:
• Benefits
  • No need for a model! (Dynamic Programming with Bellman operators needs one!)
  • No need to wait for the end of the episode! (MC methods do)
  • We use an estimator to create another estimator (= bootstrapping) … and still it works
45
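A standard form of the TD(0) update sketched above, assuming a step size \alpha:

V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]

where the bracketed term is the temporal difference \delta_t.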
Comparisons: DP, MC, TD
• They all estimate V
• DP:
  • Estimate comes from the Bellman equation
  • It needs to know the model
• TD:
  • Expectation is approximated with random samples
  • Doesn’t need to wait for the end of the episodes
• MC:
  • Expectation is approximated with random samples
  • It needs to wait for the end of the episodes

46
MDP Backup Diagrams
• White circle: state
• Black circle: action
• T: terminal state

[Diagram: MDP backup tree rooted at s_t, branching over actions and successor states down to terminal states T]

47
Monte Carlo Backup Diagram

[Diagram: a Monte Carlo backup follows one complete sampled trajectory s_t, r_{t+1}, s_{t+1}, r_{t+2}, s_{t+2}, … until a terminal state T]

48
Temporal Differences Backup Diagram

[Diagram: a TD backup uses a single sampled step s_t, r_{t+1}, s_{t+1} and then bootstraps from V(s_{t+1})]

49
Dynamic Programming Backup Diagram
[Diagram: a DP backup is one step deep but full width: it averages over all actions and all successor states s_{t+1} with rewards r_{t+1}]

50
TD for function Q
This was our TD estimate for V:

We can use the same for Q(s,a):

51
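One standard instantiation of this idea (the on-policy, SARSA-style form), assuming a step size \alpha and the next action a_{t+1} chosen by the current policy:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]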
Finding The Optimal Policy with TD

52
Finding The Optimal Policy with TD
• We already know the Bellman equation for Q*:

• DP update:

• TD update for Q [= Q-Learning]:

53
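A standard form of the Q-Learning update, assuming a step size \alpha:

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]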
Q Learning Algorithm

• Q(s,a) arbitrary
• For each episode
  • s := s0; t := 0
  • For each time step t in the actual episode
    • t := t + 1
    • Choose action a according to a policy π (e.g., epsilon-greedy)
    • Execute action a
    • Observe reward r and new state s’
    • Update Q(s,a) with the TD update for Q (previous slide)
    • s := s’
  • End For
• End For

54
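A minimal runnable sketch of the algorithm above in Python; the environment interface (reset() -> s, step(a) -> (s', r, done), env.actions) and all parameter values are assumptions for illustration, not part of the slides.

import random
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                                  # Q(s,a) arbitrary (zero) initialization

    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)               # explore
        return max(env.actions, key=lambda a: Q[(s, a)])    # exploit

    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)                           # choose action from the behavior policy
            s_next, r, done = env.step(a)                   # execute it, observe r and s'
            # off-policy TD target: greedy value at s', regardless of the policy being followed
            best_next = max(Q[(s_next, a2)] for a2 in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Because the target uses the max over a' rather than the action the behavior policy will actually take, the learned Q approximates Q*, which is exactly the off-policy property described on the next slide.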
Q Learning Algorithm

• Q-learning learns an optimal policy no matter which policy the agent is actually following (i.e., which action a it selects for any state s), as long as there is no bound on the number of times it tries an action in any state (i.e., it does not always do the same subset of actions in a state).

• Because it learns an optimal policy no matter which policy it is carrying out, it is called an off-policy method.

55
