SF2863 Systems Engineering, 7.5 HP - Intro to Markov Decision Processes
The owner of the unit clearly has an incentive to keep it working well.
By performing maintenance the unit is more likely to keep working well, but
there is also a cost associated with the maintenance (the costs and transition probabilities are given below).
The four possible policies are

d(R1) = (1, 1),   d(R2) = (1, 2),   d(R3) = (2, 1),   d(R4) = (2, 2),

where the first component is the decision in state 1 and the second the decision in state 2. The corresponding stationary distributions are

π(1) = (7/8, 1/8),   π(2) = (2/3, 1/3),   π(3) = (7/11, 4/11),   π(4) = (1/3, 2/3).

Then
g(R1) = C11 π1(1) + C21 π2(1) = (7/8)(−350) + (1/8)(−50) = −312.5
g(R2) = C11 π1(2) + C22 π2(2) = (2/3)(−350) + (1/3)(−250) ≈ −316.7
g(R3) = C12 π1(3) + C21 π2(3) = (7/11)(−400) + (4/11)(−50) ≈ −272.7
g(R4) = C12 π1(4) + C22 π2(4) = (1/3)(−400) + (2/3)(−250) = −300
So Policy R2 is the best.
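As a cross-check of this enumeration, here is a minimal Python sketch (not from the slides; the array names and the helper average_cost are my own) that computes the stationary distribution and the expected cost per time step g(R) for each of the four policies:

    import numpy as np

    # Transition matrices P(k) and costs C_ik from the slides:
    # state i = 1 (working well), 2 (working poorly); decision k = 1, 2.
    P = {1: np.array([[0.9, 0.1], [0.7, 0.3]]),
         2: np.array([[0.6, 0.4], [0.2, 0.8]])}
    C = {(1, 1): -350, (1, 2): -400, (2, 1): -50, (2, 2): -250}

    def average_cost(policy):
        """policy[i] = decision in state i+1; returns (stationary dist, g)."""
        # Build the transition matrix row by row according to the policy.
        P_R = np.vstack([P[policy[i]][i] for i in range(2)])
        # Stationary distribution: solve pi P = pi together with sum(pi) = 1.
        A = np.vstack([(P_R.T - np.eye(2))[:-1], np.ones(2)])
        pi = np.linalg.solve(A, np.array([0.0, 1.0]))
        g = sum(pi[i] * C[(i + 1, policy[i])] for i in range(2))
        return pi, g

    for R in [(1, 1), (1, 2), (2, 1), (2, 2)]:
        pi, g = average_cost(R)
        print(R, pi.round(3), round(g, 1))   # (1, 2) gives g = -316.7, the smallest value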
Optimal solution
Note that we solved the problem by computing expected costs for all
possible policies.
This works well for small problems, but for larger problems a more
efficient method should be used.
Linear Programming can be used to solve the problem, but we will
consider a faster algorithm based on policy improvement.
HOW? We will consider the expected cost for n time steps and then
take an average.
Let v_i^n(R) be the total expected cost of a system starting in state i and
evolving for n time steps.
Then, from stochastic dynamic programming we obtain the recursion

v_i^n(R) = C_ik + Σ_{j=0}^{M} p_ij(k) v_j^{n−1}(R),   where d_i(R) = k.
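As an illustration of the recursion (a sketch under the assumption v_i^0(R) = 0; the variable names are my own), one can iterate it numerically and watch v_i^n(R)/n settle at g(R):

    import numpy as np

    # Data from the slides; policy R2: decision 1 in state 1, decision 2 in state 2.
    P_R = np.array([[0.9, 0.1],   # row for state 1 under decision 1
                    [0.2, 0.8]])  # row for state 2 under decision 2
    C_R = np.array([-350.0, -250.0])  # C_11 and C_22

    v = np.zeros(2)          # v_i^0(R) = 0
    for n in range(1, 2001):
        v = C_R + P_R @ v    # v_i^n = C_ik + sum_j p_ij(k) v_j^{n-1}
    print(v / 2000)          # both components approach g(R2) = -316.7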
Policy Improvement Algorithm, brief derivation
Now taking the average and letting n → ∞ we have that v_i^n(R)/n goes
to g(R), independently of i.
For large n we can write v_i^n(R) ≈ n·g(R) + v_i(R). Inserting this in the
recursion gives the value determination equations below, and suggests
improving the current policy R_n by minimizing

C_ik + Σ_{j=0}^{M} p_ij(k) v_j(R_n) − v_i(R_n)

with respect to k.
Let k̂_i be the minimizing value of k (for each i), and define the next
policy by

d_i(R_{n+1}) = k̂_i.
It can be shown that g(Rn+1 ) ≤ g(Rn ), and if Rn+1 = Rn , then the
policy is optimal.
Policy Improvement Algorithm
It is based on two steps, starting with some policy R0, n = 0:

Value determination: Solve for g(R_n), v_0(R_n), ..., v_M(R_n) (assuming v_M(R_n) = 0)
from the Value Determination Equations (VDE), where k = d_i(R_n),

g(R_n) = C_ik + Σ_{j=0}^{M} p_ij(k) v_j(R_n) − v_i(R_n),   i = 0, 1, ..., M.

Policy improvement: For each i, solve

min_{k = 1, 2, ..., K}  C_ik + Σ_{j=0}^{M} p_ij(k) v_j(R_n) − v_i(R_n),   i = 0, 1, ..., M,

let d_i(R_{n+1}) = k̂_i be given by the minimizing k, and increase n by one.
If R_{n+1} = R_n, the policy is optimal and the algorithm stops.
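The following Python sketch implements these two steps for the maintenance example (my own code, not from the slides; states 1 and 2 are indexed 0 and 1, and v_2(R_n) = 0 is used as normalization):

    import numpy as np

    P = {1: np.array([[0.9, 0.1], [0.7, 0.3]]),
         2: np.array([[0.6, 0.4], [0.2, 0.8]])}
    C = {(0, 1): -350.0, (0, 2): -400.0, (1, 1): -50.0, (1, 2): -250.0}

    def value_determination(policy):
        """Solve g + v_i - sum_j p_ij(k) v_j = C_ik with the normalization v_1 = 0."""
        # Unknowns: (g, v_0); v_1 is fixed to 0 and drops out.
        A = np.zeros((2, 2)); b = np.zeros(2)
        for i in range(2):
            k = policy[i]
            A[i, 0] = 1.0                      # coefficient of g
            A[i, 1] = (i == 0) - P[k][i, 0]    # coefficient of v_0
            b[i] = C[(i, k)]
        g, v0 = np.linalg.solve(A, b)
        return g, np.array([v0, 0.0])

    def improve(v):
        """Pick the k minimizing C_ik + sum_j p_ij(k) v_j - v_i for each state i."""
        return tuple(min((1, 2), key=lambda k: C[(i, k)] + P[k][i] @ v - v[i])
                     for i in range(2))

    policy = (1, 1)                            # start with R0: decision 1 in both states
    while True:
        g, v = value_determination(policy)
        new_policy = improve(v)
        if new_policy == policy:
            break
        policy = new_policy
    print(policy, round(g, 1))                 # (1, 2) and g = -316.7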
Apply the algorithm to the maintenance example, starting with R0 where d_1(R0) = d_2(R0) = 1. The VDE are

g(R0) = C_ik + Σ_{j=1}^{2} p_ij(k) v_j(R0) − v_i(R0),   i = 1, 2,

where

C = (C11 C12 C21 C22) = (−350 −400 −50 −250),
P(1) = [0.9 0.1; 0.7 0.3],   P(2) = [0.6 0.4; 0.2 0.8].

This gives

g(R0) = C11 + p11(1) v1(R0) + p12(1) v2(R0) − v1(R0),   i = 1, k = d_1(R0) = 1,
g(R0) = C21 + p21(1) v1(R0) + p22(1) v2(R0) − v2(R0),   i = 2, k = d_2(R0) = 1.
With the normalization v2(R0) = 0 this gives

g(R0) = −350 − 0.1 v1(R0)   (i = 1),
g(R0) = −50 + 0.7 v1(R0)    (i = 2),

so v1(R0) = −375 and g(R0) = −312.5.

The improvement step minimizes C_ik + Σ_j p_ij(k) v_j(R0) − v_i(R0) over k:

For i = 1: k = 1 gives −312.5 and k = 2 gives −250, so k̂_1 = 1.
For i = 2: k = 1 gives −312.5 and k = 2 gives −325, so k̂_2 = 2.

Hence d_1(R1) = 1 and d_2(R1) = 2.
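The 2×2 system in the value determination step can also be checked directly, e.g. with a short numpy solve (an illustration, not part of the slides):

    import numpy as np
    # Unknowns (g, v1) with v2 = 0:  g + 0.1*v1 = -350  and  g - 0.7*v1 = -50
    A = np.array([[1.0, 0.1], [1.0, -0.7]])
    b = np.array([-350.0, -50.0])
    print(np.linalg.solve(A, b))   # [-312.5, -375.0]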
The next iteration uses R1 with d_1(R1) = 1 and d_2(R1) = 2. The VDE are

g(R1) = C_ik + Σ_{j=1}^{2} p_ij(k) v_j(R1) − v_i(R1),   i = 1, 2,

where

C = (C11 C12 C21 C22) = (−350 −400 −50 −250),
P(1) = [0.9 0.1; 0.7 0.3],   P(2) = [0.6 0.4; 0.2 0.8].

This gives

g(R1) = C11 + p11(1) v1(R1) + p12(1) v2(R1) − v1(R1),   i = 1, k = d_1(R1) = 1,
g(R1) = C22 + p21(2) v1(R1) + p22(2) v2(R1) − v2(R1),   i = 2, k = d_2(R1) = 2.
With v2(R1) = 0 this gives

g(R1) = −350 − 0.1 v1(R1)   (i = 1),
g(R1) = −250 + 0.2 v1(R1)   (i = 2),

so v1(R1) ≈ −333.3 and g(R1) ≈ −316.7.

The improvement step gives:

For i = 1: k = 1 gives −316.7 and k = 2 gives −266.7, so k̂_1 = 1.
For i = 2: k = 1 gives −283.3 and k = 2 gives −316.7, so k̂_2 = 2.
The updated policy R2 is such that d(R2) = (1, 2) = d(R1).
The algorithm has converged and the optimal policy is the same as
determined before: do maintenance only when the unit is working well.
The algorithm needed one iteration to find the optimal policy and one
iteration to verify convergence.
Why discounting?
A discount factor α is used to decrease the weight given to costs
incurred further into the future: each additional time step is weighted by an extra factor α.
With discounting, the value determination equations become (k = d_i(R_n))

V_i(R_n) = C_ik + α Σ_{j=0}^{M} p_ij(k) V_j(R_n),   i = 0, 1, ..., M.
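In matrix form (a standard rewriting, not spelled out on the slides), with P_R the transition matrix and C_R the cost vector under the policy R_n, these equations read V = C_R + α P_R V, i.e. V = (I − α P_R)^{−1} C_R, which has a unique solution whenever 0 ≤ α < 1.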
For the example, with R0 as before (d_1(R0) = d_2(R0) = 1):

V_i(R0) = C_ik + α Σ_{j=1}^{2} p_ij(k) V_j(R0),   i = 1, 2,

where

C = (C11 C12 C21 C22) = (−350 −400 −50 −250),
P(1) = [0.9 0.1; 0.7 0.3],   P(2) = [0.6 0.4; 0.2 0.8].

This gives

V1(R0) = C11 + α [p11(1) V1(R0) + p12(1) V2(R0)],   i = 1, k = d_1(R0) = 1,
V2(R0) = C21 + α [p21(1) V1(R0) + p22(1) V2(R0)],   i = 2, k = d_2(R0) = 1.
With α = 0.9 this becomes

V1(R0) = −350 + 0.9 [0.9 V1(R0) + 0.1 V2(R0)],
V2(R0) = −50 + 0.9 [0.7 V1(R0) + 0.3 V2(R0)].
Solving these two equations gives V1(R0) ≈ −3170.7 and V2(R0) ≈ −2804.9.

The improvement step minimizes C_ik + α Σ_j p_ij(k) V_j(R0) over k:

For i = 1: k = 1 gives ≈ −3170.7 and k = 2 gives ≈ −3122.0, so k̂_1 = 1.
For i = 2: k = 1 gives ≈ −2804.9 and k = 2 gives ≈ −2840.2, so k̂_2 = 2.

Hence d_1(R1) = 1 and d_2(R1) = 2.
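The linear system above can be checked with a short numpy solve (an illustration, not from the slides; values are rounded):

    import numpy as np
    # (I - alpha*P_R0) V = C_R0  for the policy R0 = (1, 1) and alpha = 0.9
    alpha, P_R0 = 0.9, np.array([[0.9, 0.1], [0.7, 0.3]])
    C_R0 = np.array([-350.0, -50.0])
    V = np.linalg.solve(np.eye(2) - alpha * P_R0, C_R0)
    print(V.round(1))   # approximately [-3170.7 -2804.9]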
For the next iteration, with R1 such that d_1(R1) = 1 and d_2(R1) = 2:

V_i(R1) = C_ik + α Σ_{j=1}^{2} p_ij(k) V_j(R1),   i = 1, 2,

where

C = (C11 C12 C21 C22) = (−350 −400 −50 −250),
P(1) = [0.9 0.1; 0.7 0.3],   P(2) = [0.6 0.4; 0.2 0.8].

With α = 0.9 this gives

V1(R1) = C11 + 0.9 [p11(1) V1(R1) + p12(1) V2(R1)],   i = 1, k = d_1(R1) = 1,
V2(R1) = C22 + 0.9 [p21(2) V1(R1) + p22(2) V2(R1)],   i = 2, k = d_2(R1) = 2.
Then

V1(R1) = −350 + 0.9 [0.9 V1(R1) + 0.1 V2(R1)],
V2(R1) = −250 + 0.9 [0.2 V1(R1) + 0.8 V2(R1)].
Solving these two equations gives V1(R1) ≈ −3257 and V2(R1) ≈ −2986.

The improvement step again gives k̂_1 = 1 and k̂_2 = 2, so R2 = R1 and the algorithm has converged.
The algorithm needed one iteration to find the optimal policy and one
iteration to verify convergence.
The optimal expected cost is −3257 if we start with the unit working well,
and −2986 if we start with the unit working poorly.
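A minimal Python sketch of the discounted policy-improvement iteration (my own code, assuming α = 0.9 and the data above; states 1 and 2 are indexed 0 and 1) reproduces these numbers:

    import numpy as np

    alpha = 0.9
    P = {1: np.array([[0.9, 0.1], [0.7, 0.3]]),
         2: np.array([[0.6, 0.4], [0.2, 0.8]])}
    C = {(0, 1): -350.0, (0, 2): -400.0, (1, 1): -50.0, (1, 2): -250.0}

    policy = (1, 1)                       # start with R0
    while True:
        # Value determination: solve V = C_R + alpha * P_R V
        P_R = np.vstack([P[policy[i]][i] for i in range(2)])
        C_R = np.array([C[(i, policy[i])] for i in range(2)])
        V = np.linalg.solve(np.eye(2) - alpha * P_R, C_R)
        # Policy improvement: minimize C_ik + alpha * sum_j p_ij(k) V_j
        new_policy = tuple(min((1, 2), key=lambda k: C[(i, k)] + alpha * P[k][i] @ V)
                           for i in range(2))
        if new_policy == policy:
            break
        policy = new_policy
    print(policy, V.round(0))             # (1, 2) and V approximately [-3257, -2986]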
The costs are roughly 10 times larger than in the non-discounted case,
since we now account for the whole discounted future instead of the cost per time step.

Note that Σ_{k=0}^{∞} α^k = 1/(1 − α) = 10 for α = 0.9.
For a discount factor α close to one it is quite likely that the optimal
policy will be the same as for the problem without discounting.
The smaller the discount factor α is, the more likely it is that the
optimal policy will focus on minimizing the immediate cost in each
state rather than the long-term effects.
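For example (an illustration, not worked out on the slides): in the extreme case α → 0 only the immediate cost matters, so the best decision in each state is the one minimizing C_ik, i.e. k = 2 in both states (−400 < −350 and −250 < −50). This myopic policy (2, 2) differs from the long-run optimal policy (1, 2).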