EE675 Lecture 10
In Lecture 9 we learnt about Markov decision processes, the value function, and the action-value function.
Towards the end of the class we learnt about the Bellman expectation equations.
Today we will prove the uniqueness of the solution of the Bellman expectation equation, and we
will define the optimal policy and discuss its existence.
1 Fixing a policy
Consider a coin-toss Markov decision process where a head leads from state $S$ to state $S+1$
and a tail leads to state $S-1$. We have two coins, blue and red: the blue coin lands heads with probability
$p_{\text{blue}}$ and the red coin lands heads with probability $p_{\text{red}}$. Below is the pictorial view of the MDP.
Figure 1: MDP
Under any deterministic policy, where for each state we have decided which coin we are going to toss, the
MDP is converted into a Markov chain. For example, if we have decided to toss the
red coin in state 4, then $p_{45} = p_{\text{red}}$ and $p_{43} = 1 - p_{\text{red}}$. Similarly, for each state we can define the transition
probabilities and rewards.
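As a concrete illustration, here is a small Python sketch of how fixing a deterministic policy turns the coin-toss MDP into a Markov chain. The number of states, the coin biases, and the clipping at the boundary states are illustrative assumptions rather than values from the lecture.

```python
import numpy as np

# Sketch (assumed values): coin-toss MDP on states 0..n-1 under a deterministic
# policy that fixes one coin per state. Heads moves to S+1, tails to S-1,
# clipped at the boundaries for simplicity.
n_states = 6
p_heads = {"blue": 0.7, "red": 0.4}                      # assumed coin biases
policy = ["red", "blue", "red", "red", "blue", "blue"]   # assumed coin choice per state

P = np.zeros((n_states, n_states))
for s in range(n_states):
    p = p_heads[policy[s]]
    s_up = min(s + 1, n_states - 1)    # head: go to S+1 (clipped at the top state)
    s_down = max(s - 1, 0)             # tail: go to S-1 (clipped at the bottom state)
    P[s, s_up] += p
    P[s, s_down] += 1 - p

# Once the policy is fixed, P is an ordinary Markov-chain transition matrix:
assert np.allclose(P.sum(axis=1), 1.0)
```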
For a stochastic policy, once we fix the policy, i.e., we know the probability with which we take
each action in each state, we can find the transition probabilities and expected rewards for each state as follows.
\[
P^{\pi}_{ss'} = P^{\pi}(s' \mid s) = \sum_{a} P(s' \mid s, a)\,\pi(a \mid s) \qquad \text{(Transition probabilities)} \tag{1}
\]
\[
R_s^{a} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \int_{r} r\,P(r \mid s, a)\,dr
\]
\[
R_s^{\pi} = \mathbb{E}_{\pi}[R_{t+1} \mid S_t = s] = \sum_{a} \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]\,\pi(a \mid s) = \sum_{a} R_s^{a}\,\pi(a \mid s) \qquad \text{(Reward for policy $\pi$)} \tag{2}
\]
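Equations (1) and (2) are plain action-averaged sums, so once $P(s' \mid s, a)$, $R_s^a$ and $\pi(a \mid s)$ are stored as arrays they can be computed directly. The sketch below uses small random arrays purely as assumed placeholders.

```python
import numpy as np

# Assumed placeholder data: P_sas[s, a, s'] = P(s'|s, a), R_sa[s, a] = R_s^a,
# pi[s, a] = pi(a|s). Shapes and values are illustrative only.
n_states, n_actions = 4, 2
rng = np.random.default_rng(0)

P_sas = rng.random((n_states, n_actions, n_states))
P_sas /= P_sas.sum(axis=2, keepdims=True)     # each P(. | s, a) is a distribution
R_sa = rng.random((n_states, n_actions))
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=1, keepdims=True)           # each pi(. | s) is a distribution

# Equation (1): P^pi_{ss'} = sum_a P(s'|s, a) pi(a|s)
P_pi = np.einsum("sap,sa->sp", P_sas, pi)
# Equation (2): R^pi_s = sum_a R_s^a pi(a|s)
R_pi = np.einsum("sa,sa->s", R_sa, pi)

assert np.allclose(P_pi.sum(axis=1), 1.0)     # P^pi is again a stochastic matrix
```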
2 Vector form of Bellman expectation equation
Let us use the $R^{\pi}_{s}$ and $P^{\pi}_{ss'}$ notation in vector form:
\[
V_{\pi} =
\begin{bmatrix}
V_{\pi}(s_1) \\
V_{\pi}(s_2) \\
\vdots \\
V_{\pi}(s_n)
\end{bmatrix}_{n \times 1},
\qquad
R^{\pi} =
\begin{bmatrix}
R^{\pi}_{s_1} \\
R^{\pi}_{s_2} \\
\vdots \\
R^{\pi}_{s_n}
\end{bmatrix}_{n \times 1},
\qquad
P^{\pi} =
\begin{bmatrix}
P^{\pi}_{s_1 s_1} & \cdots & P^{\pi}_{s_1 s_n} \\
\vdots & \ddots & \vdots \\
P^{\pi}_{s_n s_1} & \cdots & P^{\pi}_{s_n s_n}
\end{bmatrix}_{n \times n}.
\]
so the vector form of the Bellman expectation equation is:
\[
V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi} \tag{3}
\]
3 Uniqueness of Vπ
If we have $n$ states, we have $n$ unknowns, $[V_{\pi}(s)]_{s \in S}$, and $n$ linear equations.
Let us write them in $Ax = b$ form.
\[
V_{\pi} = R^{\pi} + \gamma P^{\pi} V_{\pi}
\]
\[
V_{\pi} - \gamma P^{\pi} V_{\pi} = R^{\pi}
\]
\[
(I - \gamma P^{\pi})\,V_{\pi} = R^{\pi} \tag{4}
\]
\[
V_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi} \tag{5}
\]
Equation 4 is of the form $Ax = b$ where $A = (I - \gamma P^{\pi})$, $x = V_{\pi}$ and $b = R^{\pi}$. $V_{\pi}$ has a unique solution
if $A$ is invertible, and $A$ is invertible iff $\lambda \neq 0$ for every eigenvalue $\lambda$ of $A$.
We know that $P^{\pi}$ is a transition probability matrix: every row sums to 1 and every element
is non-negative. This makes $P^{\pi}$ a stochastic matrix.
For a stochastic matrix, every eigenvalue satisfies $|\lambda| \le 1$ (in particular $|\lambda_{\max}| \le 1$).
Let $x$ be an eigenvector of $P^{\pi}$ with eigenvalue $\lambda_{P^{\pi}}$, and consider
\[
A x = (I - \gamma P^{\pi})\,x = Ix - \gamma P^{\pi} x = x - \gamma \lambda_{P^{\pi}} x = (1 - \gamma \lambda_{P^{\pi}})\,x, \tag{6}
\]
where $|1 - \gamma \lambda_{P^{\pi}}| \ge 1 - \gamma\,|\lambda_{P^{\pi}}| > 0$, since $\gamma < 1$ and $|\lambda_{P^{\pi}}| \le 1$.
From equation 6 we can see that every eigenvalue of the matrix $A$ is nonzero, hence
the matrix is invertible, therefore $V_{\pi}$ is unique.
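As a quick numerical sanity check of this argument, the sketch below builds a random stochastic matrix to stand in for $P^{\pi}$ (the MDP itself is an assumption here), confirms that every eigenvalue of $A = I - \gamma P^{\pi}$ is nonzero, and solves the linear system to obtain the unique $V_{\pi}$ of equation (5).

```python
import numpy as np

n, gamma = 5, 0.9                                 # assumed state count and discount
rng = np.random.default_rng(1)

P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)           # rows sum to 1: stochastic matrix
R_pi = rng.random(n)                              # assumed reward vector

A = np.eye(n) - gamma * P_pi
assert np.all(np.abs(np.linalg.eigvals(A)) > 0)   # all eigenvalues nonzero, A is invertible

V_pi = np.linalg.solve(A, R_pi)                   # equation (5), without forming the inverse
assert np.allclose(V_pi, R_pi + gamma * P_pi @ V_pi)   # satisfies equation (3)
```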
4 Optimal policy
• There can be many optimal policies for a state $S$: for example, for state (0,0) we can change
its action to any other action and still end up with an optimal policy for that state.
– Each state $s$ may have multiple policies which are optimal for that state.
– Collect the set of optimal policies for a state $s$, and repeat this for each state.
– Take the intersection of all those sets.
– If that intersection is non-empty, then any policy in it is an optimal policy.
– Only if that intersection is non-empty does the optimal policy for the whole MDP exist (see the sketch after this list).
• Bounded reward: we assume the rewards are bounded so that, together with $\gamma < 1$, the value functions are finite and well defined.
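The intersection step above can be phrased in a few lines of Python. The per-state optimal sets below are hypothetical placeholders; in practice they would come from evaluating $V_{\pi}(s)$ for the candidate policies.

```python
# Toy sketch (all values are hypothetical): a deterministic policy is a tuple
# with one action per state; optimal_from[s] is the set of policies that are
# optimal when starting from state s. An optimal policy for the whole MDP
# exists iff the intersection of these per-state sets is non-empty.
optimal_from = {
    0: {("blue", "blue"), ("blue", "red")},   # hypothetical optimal set for state 0
    1: {("blue", "red"), ("red", "red")},     # hypothetical optimal set for state 1
}

globally_optimal = set.intersection(*optimal_from.values())
print(globally_optimal)   # {('blue', 'red')} -- optimal from every starting state
# If this intersection were empty, no single policy would be optimal for the whole MDP.
```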