
EE625A: Introduction to Reinforcement Learning

Lecture 10: Bellman Expectation equations and Optimal Policy


13-02-2023
Lecturer: Prof. Subrahmanya Swamy Peruru Scribe: Amit Kumar Yadav

In Lecture 9 we learnt about Markov decision processes, the value function, and the action-value function.
Towards the end of the class we learnt about the Bellman expectation equations.
Today we will prove the uniqueness of the solution of the Bellman expectation equation, define the
optimal policy, and discuss its existence.

1 Fixing a policy
Consider a coin-toss Markov decision process where, from state S, a head leads to state S+1 and a
tail leads to state S-1. We have two coins, blue and red: the blue coin lands heads with probability
p_blue and the red coin lands heads with probability p_red. Below is a pictorial view of the MDP.

Figure 1: MDP

Under any deterministic policy, where for each state we have decided which coin to toss, the
MDP is converted into a Markov chain. For example, if we have decided to toss the red coin in
state 4, then p_{45} = p_red and p_{43} = 1 - p_red. Similarly, for each state we can define the
transition probabilities and rewards.
For a stochastic policy, once we fix the policy, i.e., we know the probability with which each
action is taken in each state, we can find the transition probability for each state as follows.

P_{ss'}^{\pi} = P(s' \mid s) = \sum_{a} P(s' \mid s, a)\, \pi(a \mid s)        (Transition probabilities)   (1)

R_s^a = E[R_{t+1} \mid s_t = s, A_t = a] = \sum_{r} r\, P(r \mid s, a)

R_s^{\pi} = E_{\pi}[R_{t+1} \mid s_t = s] = \sum_{a} E[R_{t+1} \mid s_t = s, A_t = a]\, \pi(a \mid s)        (Reward for policy \pi)   (2)
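As an illustration of equations (1) and (2), here is a minimal sketch (Python/NumPy) that converts the coin-toss MDP into a Markov chain under a fixed stochastic policy. The number of states, the values of p_blue and p_red, the absorbing boundary states, and the reward of +1 for reaching the top state are all assumptions made for the example; the notes do not specify them.

```python
import numpy as np

# Assumed setup: 5 states (0..4), states 0 and 4 absorbing, made-up coin biases.
N = 5
p_heads = {"blue": 0.7, "red": 0.4}        # assumed values of p_blue and p_red

# P[a][s, s'] = P(s'|s, a): heads moves from s to s+1, tails to s-1.
P = {a: np.zeros((N, N)) for a in p_heads}
R = {a: np.zeros(N) for a in p_heads}      # R[a][s] = E[R_{t+1} | s_t = s, A_t = a]
for a, p in p_heads.items():
    for s in range(1, N - 1):
        P[a][s, s + 1] = p
        P[a][s, s - 1] = 1 - p
    P[a][0, 0] = P[a][N - 1, N - 1] = 1.0  # absorbing boundary states (assumption)
    R[a][N - 2] = p                        # assumed reward: +1 on reaching state N-1

# A stochastic policy pi(a|s): toss the blue coin with probability 0.5 in every state.
pi = {a: np.full(N, 0.5) for a in p_heads}

# Equation (1): P^pi_{ss'} = sum_a P(s'|s,a) pi(a|s)
P_pi = sum(pi[a][:, None] * P[a] for a in p_heads)
# Equation (2): R^pi_s = sum_a R^a_s pi(a|s)
R_pi = sum(pi[a] * R[a] for a in p_heads)

print(P_pi.round(2))   # each row sums to 1, so the fixed policy gives a Markov chain
print(R_pi.round(2))
```

A deterministic policy is the special case where pi(a|s) puts probability 1 on one coin in each state; choosing the red coin everywhere gives p_{s,s+1} = p_red, as in the deterministic example above.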

2 Vector form of Bellman expectation equation
Let us use the R_s^{\pi}, P_{ss'}^{\pi} notation for the vector form.

V_{\pi}(s) = E_{\pi}[G_t \mid s_t = s] = E_{\pi}[R_{t+1} + \gamma G_{t+1} \mid s_t = s]

V_{\pi}(s) = E_{\pi}[R_{t+1} \mid s_t = s] + \gamma E_{\pi}[G_{t+1} \mid s_t = s]

V_{\pi}(s) = R_s^{\pi} + \gamma \sum_{s'} P_{ss'}^{\pi}\, E_{\pi}[G_{t+1} \mid s_t = s, s_{t+1} = s']

V_{\pi}(s) = R_s^{\pi} + \gamma \sum_{s'} P_{ss'}^{\pi}\, V_{\pi}(s')

Stacking these equations over all states, define

V_{\pi} = \begin{bmatrix} V_{\pi}(s_1) \\ V_{\pi}(s_2) \\ \vdots \\ V_{\pi}(s_n) \end{bmatrix}_{n \times 1}, \qquad
R^{\pi} = \begin{bmatrix} R_{s_1}^{\pi} \\ R_{s_2}^{\pi} \\ \vdots \\ R_{s_n}^{\pi} \end{bmatrix}_{n \times 1}, \qquad
P^{\pi} = \begin{bmatrix} P_{s_1 s_1}^{\pi} & \cdots & P_{s_1 s_n}^{\pi} \\ \vdots & \ddots & \vdots \\ P_{s_n s_1}^{\pi} & \cdots & P_{s_n s_n}^{\pi} \end{bmatrix}_{n \times n}
So the vector form of the Bellman expectation equation is:

V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi}        (3)

3 Uniqueness of Vπ
If we have n states, we have n unknowns, [V_{\pi}(s)]_{s \in S}, and n linear equations.
Let us write them in Ax = b form.

V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi}
V_{\pi} - \gamma P^{\pi} V_{\pi} = R^{\pi}
(I - \gamma P^{\pi}) V_{\pi} = R^{\pi}        (4)
V_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}        (5)

Equation (4) is of the form Ax = b, where A = I - \gamma P^{\pi} and b = R^{\pi}. V_{\pi} has a unique
solution if A is invertible, and A is invertible if and only if none of its eigenvalues is zero.
We know that P^{\pi} is a transition probability matrix: every element is non-negative and each row
sums to 1, which makes P^{\pi} a stochastic matrix.

For a stochastic matrix, every eigenvalue satisfies \lambda \le 1.
Let x be an eigenvector of P^{\pi} with eigenvalue \lambda_{P^{\pi}}. Then consider

Ax = (I - \gamma P^{\pi})x = Ix - \gamma P^{\pi} x
   = x - \gamma \lambda_{P^{\pi}} x
   = (1 - \gamma \lambda_{P^{\pi}}) x,    where 1 - \gamma \lambda_{P^{\pi}} > 0 since \gamma < 1 and \lambda_{P^{\pi}} \le 1        (6)

From equation (6) we see that every eigenvalue of A is of the form 1 - \gamma \lambda_{P^{\pi}} and is never zero;
hence A is invertible and therefore V_{\pi} is unique.
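As a small numerical check of equations (3)-(6), the following sketch builds A = I - \gamma P^{\pi} for a hypothetical 3-state chain (the numbers and the discount factor \gamma = 0.9 are made up for illustration), confirms that no eigenvalue of A is zero, and solves the linear system for V_{\pi}.

```python
import numpy as np

gamma = 0.9                                   # assumed discount factor

# A hypothetical 3-state chain standing in for P^pi and R^pi (numbers are made up).
P_pi = np.array([[1.0, 0.0, 0.0],
                 [0.6, 0.0, 0.4],
                 [0.0, 0.0, 1.0]])
R_pi = np.array([0.0, 0.4, 0.0])

# Equation (4): (I - gamma P^pi) V_pi = R^pi is A x = b with A = I - gamma P^pi.
A = np.eye(3) - gamma * P_pi
print(np.linalg.eigvals(A))                   # eigenvalues 1 - gamma*lambda; none are zero

# Equation (5): V_pi = A^{-1} R^pi (np.linalg.solve avoids forming the inverse).
V_pi = np.linalg.solve(A, R_pi)
print(np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi))   # True: satisfies equation (3)
```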

4 Discussion on optimal policy


For a given state, an optimal policy is one for which the expected return is maximum.
Hence, for a given state s, an optimal policy is \arg\max_{\pi} V_{\pi}(s).

4.1 Finding shortest path in grid to a Goal G


Consider a grid with one of the cells as the goal state. We want to reach the goal cell from any cell
in the fewest steps possible. Let us formulate this problem as an MDP.
• State = Cell
• Action = Up/Down/Left/Right
• The state transition will be deterministic.
• Reward of -1 in every state for any action.
• Zero reward in the terminal state, i.e., the goal state.
For state S the following is an optimal policy.

Figure 2: Optimal Policy
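To complement Figure 2, here is a minimal sketch that computes the optimal values and an optimal policy for this formulation on an assumed 4x4 grid with the goal at cell (0, 0); the actual grid and goal location in the figure may differ. Since every step costs -1 and the goal gives 0, the optimal value of a cell is minus its shortest-path distance to the goal, which BFS from the goal gives directly.

```python
from collections import deque

# Assumed 4x4 grid with the goal at cell (0, 0); the grid in Figure 2 may differ.
ROWS, COLS, GOAL = 4, 4, (0, 0)
ACTIONS = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}

# BFS from the goal gives each cell's shortest distance to the goal; with a reward
# of -1 per step and 0 at the goal, the optimal value is V*(cell) = -distance(cell).
dist = {GOAL: 0}
queue = deque([GOAL])
while queue:
    r, c = queue.popleft()
    for dr, dc in ACTIONS.values():
        nb = (r + dr, c + dc)
        if 0 <= nb[0] < ROWS and 0 <= nb[1] < COLS and nb not in dist:
            dist[nb] = dist[(r, c)] + 1
            queue.append(nb)

# An optimal policy: in every cell, any action that moves one step closer to the goal.
policy = {cell: [a for a, (dr, dc) in ACTIONS.items()
                 if dist.get((cell[0] + dr, cell[1] + dc), d) == d - 1]
          for cell, d in dist.items() if cell != GOAL}

print({cell: -d for cell, d in dist.items()})   # optimal values V*(cell)
print(policy)                                   # several cells have more than one optimal action
```

The printed policy shows that several cells have more than one optimal action, which is exactly the observation made in the first bullet below.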

• There can be many optimal policies for the state S. For example, for state (0,0) we can change
its action to any other action and still end up with a policy that is optimal for state S.

• Conceptually, we can do the following to find the best policy (a small sketch of this procedure follows the list):

– Each state s may have multiple policies which are optimal for that state.
– Collect the set of optimal policies for a state s, and repeat this for each state.
– Take the intersection of all those sets.
– If the intersection is non-empty, any policy in it is an optimal policy for the whole MDP.
– Only if that intersection is non-empty does an optimal policy for the whole MDP exist.
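Here is a minimal sketch of the procedure above for a tiny 2-state, 2-action MDP with made-up transition probabilities and rewards (all numbers are assumptions for illustration). It evaluates every deterministic policy using equation (5), collects the per-state sets of optimal policies, and takes their intersection.

```python
import itertools
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2
# A tiny MDP with made-up numbers: P[a][s, s'] = P(s'|s, a), R[a][s] = E[R_{t+1}|s, a].
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.3, 0.7], [0.6, 0.4]])}
R = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 2.0])}

def evaluate(policy):
    """V_pi via equation (5), for a deterministic policy given as a tuple of actions."""
    P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
    R_pi = np.array([R[policy[s]][s] for s in range(n_states)])
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Enumerate all deterministic policies and evaluate each one.
policies = list(itertools.product(range(n_actions), repeat=n_states))
values = {pol: evaluate(pol) for pol in policies}

# For each state, collect the set of policies optimal for that state, then intersect.
best_per_state = []
for s in range(n_states):
    v_max = max(v[s] for v in values.values())
    best_per_state.append({pol for pol, v in values.items() if np.isclose(v[s], v_max)})
optimal = set.intersection(*best_per_state)

print(optimal)   # non-empty here, so an optimal policy for the whole MDP exists
```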

4.2 Sufficient conditions for existence of an optimal policy


• Finite number of states

• Finite action space

• Bounded reward.
