EE675 Lecture 10
In Lecture 9 we learnt about Markov decision processes, the value function, and the action-value function.
Towards the end of the class we learnt about the Bellman expectation equations.
Today we will prove the uniqueness of the solution of the Bellman expectation equation, and we
will define the optimal policy and discuss its existence.
1 Fixing a policy
Consider a coin-toss Markov decision process where a head leads from state $S$ to state $S+1$
and a tail leads to state $S-1$. We have two coins, blue and red: the blue coin lands heads with probability
$p_{\text{blue}}$ and the red coin lands heads with probability $p_{\text{red}}$. Below is the pictorial view of the MDP.
Figure 1: MDP
Under any deterministic policy, where for each state we have decided which coin we are going to toss, the
MDP is converted into a Markov chain. For example, if we have decided to toss the
red coin in state 4, then $p_{45} = p_{\text{red}}$ and $p_{43} = 1 - p_{\text{red}}$. Similarly, for each state we can define the transition
probabilities and rewards.
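As a concrete illustration, here is a small Python sketch of how fixing a deterministic policy turns the coin-toss MDP into a Markov chain. The number of states, the coin biases, and the clipping at the boundary states are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

# Sketch (assumed values): coin-toss MDP on states 0..n-1 under a deterministic
# policy that fixes one coin per state. Heads moves to S+1, tails to S-1,
# clipped at the boundaries for simplicity.
n_states = 6
p_heads = {"blue": 0.7, "red": 0.4}                      # assumed coin biases
policy = ["red", "blue", "red", "red", "blue", "blue"]   # assumed coin choice per state

P = np.zeros((n_states, n_states))
for s in range(n_states):
    p = p_heads[policy[s]]
    s_up = min(s + 1, n_states - 1)    # head: go to S+1 (clipped at the top state)
    s_down = max(s - 1, 0)             # tail: go to S-1 (clipped at the bottom state)
    P[s, s_up] += p
    P[s, s_down] += 1 - p

# Once the policy is fixed, P is an ordinary Markov-chain transition matrix:
assert np.allclose(P.sum(axis=1), 1.0)
```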
For a stochastic policy, once we fix the policy, i.e., we know the probability with which we take
each action in each state, we can find the transition probabilities and expected rewards for each state as follows.
\[
P^{\pi}_{ss'} = P^{\pi}(s' \mid s) = \sum_{a} P(s' \mid s, a)\,\pi(a \mid s) \qquad \text{(Transition probabilities)} \tag{1}
\]
\[
R_s^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \int_{r} r\,P(r \mid s, a)\,dr
\]
\[
R_s^{\pi} = \mathbb{E}_{\pi}[R_{t+1} \mid S_t = s] = \sum_{a} \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]\,\pi(a \mid s) = \sum_{a} R_s^{a}\,\pi(a \mid s) \qquad \text{(Reward for policy $\pi$)} \tag{2}
\]
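Equations (1) and (2) are plain action-averaged sums, so once $P(s' \mid s, a)$, $R_s^a$ and $\pi(a \mid s)$ are stored as arrays they can be computed directly. The sketch below uses small random arrays purely as assumed placeholders.

```python
import numpy as np

# Assumed placeholder data: P_sas[s, a, s'] = P(s'|s, a), R_sa[s, a] = R_s^a,
# pi[s, a] = pi(a|s). Shapes and values are illustrative only.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

P_sas = rng.random((n_states, n_actions, n_states))
P_sas /= P_sas.sum(axis=2, keepdims=True)     # each P(. | s, a) is a distribution
R_sa = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)           # each pi(. | s) is a distribution

# Equation (1): P^pi_{ss'} = sum_a P(s'|s, a) pi(a|s)
P_pi = np.einsum("sap,sa->sp", P_sas, pi)
# Equation (2): R^pi_s = sum_a R_s^a pi(a|s)
R_pi = np.einsum("sa,sa->s", R_sa, pi)

assert np.allclose(P_pi.sum(axis=1), 1.0)     # P^pi is again a stochastic matrix
```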
2 Vector form of Bellman expectation equation
Let us use the $R^{\pi}_{s}$ and $P^{\pi}_{ss'}$ notation in vector form:
\[
V_{\pi} =
\begin{bmatrix}
V_{\pi}(s_1) \\
V_{\pi}(s_2) \\
\vdots \\
V_{\pi}(s_n)
\end{bmatrix}_{n \times 1},
\qquad
R^{\pi} =
\begin{bmatrix}
R^{\pi}_{s_1} \\
R^{\pi}_{s_2} \\
\vdots \\
R^{\pi}_{s_n}
\end{bmatrix}_{n \times 1},
\qquad
P^{\pi} =
\begin{bmatrix}
P^{\pi}_{s_1 s_1} & \cdots & P^{\pi}_{s_1 s_n} \\
\vdots & \ddots & \vdots \\
P^{\pi}_{s_n s_1} & \cdots & P^{\pi}_{s_n s_n}
\end{bmatrix}_{n \times n}.
\]
so the vector form of the Bellman expectation equation is:
\[
V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi} \tag{3}
\]
3 Uniqueness of Vπ
If we have $n$ states, we have $n$ unknowns, $[V_{\pi}(s)]_{s \in S}$, and $n$ linear equations.
Let us write them in $Ax = b$ form.
\[
V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi}
\]
\[
V_{\pi} - \gamma P^{\pi} V_{\pi} = R^{\pi}
\]
\[
(I - \gamma P^{\pi})\,V_{\pi} = R^{\pi} \tag{4}
\]
\[
V_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi} \tag{5}
\]
Equation 4 is of the form $Ax = b$ where $A = (I - \gamma P^{\pi})$, $x = V_{\pi}$ and $b = R^{\pi}$. $V_{\pi}$ has a unique solution
if $A$ is invertible, and $A$ is invertible iff $\lambda \neq 0$ for every eigenvalue $\lambda$ of $A$.
We know that $P^{\pi}$ is a transition probability matrix: every row sums to 1 and every element
is non-negative. This makes $P^{\pi}$ a stochastic matrix.
For a stochastic matrix, every eigenvalue satisfies $|\lambda| \le 1$ (in particular $|\lambda_{\max}| \le 1$).
Let $x$ be an eigenvector of $P^{\pi}$ with eigenvalue $\lambda_{P^{\pi}}$, and consider
\[
A x = (I - \gamma P^{\pi})\,x = Ix - \gamma P^{\pi} x = x - \gamma \lambda_{P^{\pi}} x = (1 - \gamma \lambda_{P^{\pi}})\,x, \tag{6}
\]
where $|1 - \gamma \lambda_{P^{\pi}}| \ge 1 - \gamma\,|\lambda_{P^{\pi}}| > 0$, since $\gamma < 1$ and $|\lambda_{P^{\pi}}| \le 1$.
From equation 6 we can see that every eigenvalue of the matrix $A$ is nonzero, hence
the matrix is invertible, therefore $V_{\pi}$ is unique.
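As a quick numerical sanity check of this argument, the sketch below builds a random stochastic matrix to stand in for $P^{\pi}$ (the MDP itself is an assumption here), confirms that every eigenvalue of $A = I - \gamma P^{\pi}$ is nonzero, and solves the linear system to obtain the unique $V_{\pi}$ of equation (5).

```python
import numpy as np

n, gamma = 5, 0.9                                 # assumed state count and discount
rng = np.random.default_rng(1)

P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)           # rows sum to 1: stochastic matrix
R_pi = rng.random(n)                              # assumed reward vector

A = np.eye(n) - gamma * P_pi
assert np.all(np.abs(np.linalg.eigvals(A)) > 0)   # all eigenvalues nonzero, A is invertible

V_pi = np.linalg.solve(A, R_pi)                   # equation (5), without forming the inverse
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)   # satisfies equation (3)
```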
4 Optimal policy
• There can be many optimal policies for a state $S$: for example, for state (0,0) we can change
its action to any other action and still end up with an optimal policy for that state.
– Each state $s$ may have multiple policies which are optimal for that state.
– Collect the set of optimal policies for a state $s$, and repeat this for each state.
– Take the intersection of all those sets.
– If that intersection is non-empty, then any policy in it is an optimal policy.
– Only if that intersection is non-empty does the optimal policy for the whole MDP exist (see the sketch after this list).
• Bounded reward: we assume the rewards are bounded so that, together with $\gamma < 1$, the value functions are finite and well defined.
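The intersection step above can be phrased in a few lines of Python. The per-state optimal sets below are hypothetical placeholders; in practice they would come from evaluating $V_{\pi}(s)$ for the candidate policies.

```python
# Toy sketch (all values are hypothetical): a deterministic policy is a tuple
# with one action per state; optimal_from[s] is the set of policies that are
# optimal when starting from state s. An optimal policy for the whole MDP
# exists iff the intersection of these per-state sets is non-empty.
optimal_from = {
    0: {("blue", "blue"), ("blue", "red")},   # hypothetical optimal set for state 0
    1: {("blue", "red"), ("red", "red")},     # hypothetical optimal set for state 1
}

globally_optimal = set.intersection(*optimal_from.values())
print(globally_optimal)   # {('blue', 'red')} -- optimal from every starting state
# If this intersection were empty, no single policy would be optimal for the whole MDP.
```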