Lecture 2: Markov Decision Processes: David Silver
1 Markov Processes
2 Markov Reward Processes
3 Markov Decision Processes
4 Extensions to MDPs
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning in which the environment is fully observable.
Markov Property
The future is independent of the past given the present
Definition
A state s_t is Markov if and only if
\[ \mathbb{P}[s_{t+1} \mid s_t] = \mathbb{P}[s_{t+1} \mid s_1, \dots, s_t] \]
The state transition probability from state s to successor state s' is P_{ss'} = P[s_{t+1} = s' | s_t = s], and the state transition matrix P collects the transition probabilities from all states s (rows) to all successor states s' (columns):
\[ P = \begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix} \]
where each row of the matrix sums to 1.
Markov Process
Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩, where S is a finite set of states and P is a state transition probability matrix.
[Figure: Student Markov Chain — a transition graph over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook and Sleep, with edges labelled by transition probabilities; the probabilities are listed in the transition matrix P below]
Sample episodes for the Student Markov Chain, starting from s_1 = C1; each episode is a state sequence s_1, s_2, ..., s_T:
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
State transition matrix for the Student Markov Chain (rows = from state, columns = to state; blank entries are zero):

P =
         C1     C2     C3     Pass   Pub    FB     Sleep
  C1            0.5                         0.5
  C2                   0.8                         0.2
  C3                          0.6    0.4
  Pass                                             1.0
  Pub    0.2    0.4    0.4
  FB     0.1                                0.9
  Sleep                                            1.0
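To make the dynamics concrete, here is a minimal sketch (assuming NumPy) that encodes the matrix above and samples episodes starting from C1 until the terminal Sleep state. The state names and probabilities are taken from the figure and matrix above; the sampling code itself is not from the lecture.

```python
import numpy as np

STATES = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

# Student Markov Chain transition matrix; each row sums to 1.
P = np.array([
    # C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])

def sample_episode(rng, start="C1"):
    """Follow the chain from `start` until the terminal Sleep state."""
    s = STATES.index(start)
    episode = [STATES[s]]
    while STATES[s] != "Sleep":
        s = rng.choice(len(STATES), p=P[s])
        episode.append(STATES[s])
    return episode

rng = np.random.default_rng(0)
print(sample_episode(rng))  # one possible episode, e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```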
Sleep
0.1
r = -1
r=0
0.5
Class 1
1.0
0.2
0.5
r = -2
Class 2
0.2
0.8
r = -2
0.4
Pub
r = +1
0.4
Class 3
0.4
0.6
r = -2
Pass
r = +10
Return
Definition
The return v_t is the total discounted reward from time-step t,
\[ v_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \]
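As a quick check of the sum above, here is a small sketch in plain Python (the function name and the choice of γ = 1/2 are illustrative, not from the lecture) that computes the discounted return of a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    """Return v_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    v = 0.0
    for k, r in enumerate(rewards):
        v += (gamma ** k) * r
    return v

# Rewards along the episode C1 C2 C3 Pass Sleep in the Student MRP: -2, -2, -2, +10
print(discounted_return([-2, -2, -2, 10], gamma=0.5))  # -2 - 1 - 0.5 + 1.25 = -2.25
```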
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward
It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.
Value Function
Definition
The state value function V(s) of an MRP is the expected return starting from state s,
\[ V(s) = \mathbb{E}[v_t \mid s_t = s] \]
Sample returns for the Student MRP, starting from s_1 = C1, with v_1 the discounted return of each sample episode:
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
[Figure: State-value function for the Student MRP with γ = 0: V(Class 1) = -2, V(Class 2) = -2, V(Class 3) = -2, V(Pass) = +10, V(Pub) = +1, V(Facebook) = -1, V(Sleep) = 0]
[Figure: State-value function for the Student MRP with γ = 0.9: V(Class 1) = -5.0, V(Class 2) = 0.9, V(Class 3) = 4.1, V(Pass) = 10, V(Pub) = 1.9, V(Facebook) = -7.6, V(Sleep) = 0]
[Figure: State-value function for the Student MRP with γ = 1: V(Class 1) = -13, V(Class 2) = 1.5, V(Class 3) = 4.3, V(Pass) = 10, V(Pub) = 0.8, V(Facebook) = -23, V(Sleep) = 0]
Bellman Equation for MRPs
The value function can be decomposed into the immediate reward plus the discounted value of the successor state:
\[
\begin{aligned}
V(s) &= \mathbb{E}[v_t \mid s_t = s] \\
&= \mathbb{E}[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_t = s] \\
&= \mathbb{E}[r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \dots) \mid s_t = s] \\
&= \mathbb{E}[r_{t+1} + \gamma v_{t+1} \mid s_t = s] \\
&= \mathbb{E}[r_{t+1} + \gamma V(s_{t+1}) \mid s_t = s]
\end{aligned}
\]
Equivalently, V(s) = E[r + γ V(s') | s]: a one-step backup from s, averaging over its successor states s',
\[ V(s) = R_s + \gamma \sum_{s' \in S} P_{ss'} V(s') \]
[Figure: Bellman equation verified on the Student MRP with γ = 1, e.g. at Class 3: 4.3 = -2 + 0.6 × 10 + 0.4 × 0.8]
The Bellman equation can be expressed concisely in matrix form,
\[ V = R + \gamma P V \]
where V is a column vector with one entry per state:
\[
\begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix}
=
\begin{bmatrix} R_1 \\ \vdots \\ R_n \end{bmatrix}
+ \gamma
\begin{bmatrix} P_{11} & \dots & P_{1n} \\ \vdots & & \vdots \\ P_{n1} & \dots & P_{nn} \end{bmatrix}
\begin{bmatrix} V(1) \\ \vdots \\ V(n) \end{bmatrix}
\]
The Bellman equation is linear and can be solved directly:
\[ V = R + \gamma P V \quad\Rightarrow\quad (I - \gamma P) V = R \quad\Rightarrow\quad V = (I - \gamma P)^{-1} R \]
Computational complexity is O(n^3) for n states
Direct solution is only possible for small MRPs (see the sketch after this list)
There are many iterative methods for large MRPs, e.g.
Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning
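As a concrete check of the closed-form solution above, here is a minimal sketch (assuming NumPy) that solves the Student MRP directly; the matrix and rewards are read off the figures above, and γ = 0.9 matches one of the state-value figures.

```python
import numpy as np

# Student MRP (states in the order C1, C2, C3, Pass, Pub, FB, Sleep).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
gamma = 0.9

# Direct solution V = (I - gamma * P)^{-1} R, which is O(n^3) in the number of states.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V.round(1))  # roughly [-5.0, 0.9, 4.1, 10.0, 1.9, -7.6, 0.0], matching the gamma = 0.9 figure
```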
Markov Decision Process
Definition
A Markov Decision Process is a Markov reward process with decisions: a tuple ⟨S, A, P, R, γ⟩, where A is a finite set of actions, P^a_{ss'} = P[s_{t+1} = s' | s_t = s, a_t = a] and R^a_s = E[r_{t+1} | s_t = s, a_t = a].
[Figure: Student MDP — actions Facebook (r = -1), Quit (r = 0), Sleep (r = 0), Study (r = -2, r = -2, r = +10) and Pub (r = +1, returning to the classes with probabilities 0.2, 0.4, 0.4)]
Policies (1)
Definition
A policy π is a distribution over actions given states,
\[ \pi(s, a) = \mathbb{P}[a \mid s] \]
Policies (2)
Given an MDP and a fixed policy π, the state sequence is a Markov process ⟨S, P^π⟩ and the state and reward sequence is a Markov reward process ⟨S, P^π, R^π, γ⟩, where
\[ P^{\pi}_{ss'} = \sum_{a \in A} \pi(s, a) P^{a}_{ss'}, \qquad R^{\pi}_{s} = \sum_{a \in A} \pi(s, a) R^{a}_{s} \]
Value Function
Definition
The state-value function V^π(s) of an MDP is the expected return starting from state s, and then following policy π,
\[ V^{\pi}(s) = \mathbb{E}_{\pi}[v_t \mid s_t = s] \]
Definition
The action-value function Q^π(s, a) is the expected return starting from state s, taking action a, and then following policy π,
\[ Q^{\pi}(s, a) = \mathbb{E}_{\pi}[v_t \mid s_t = s, a_t = a] \]
[Figure: State-value function for the Student MDP under the uniform random policy π(s, a) = 0.5 with γ = 1: V^π(Facebook) = -2.3, V^π(Class 1) = -1.3, V^π(Class 2) = 2.7, V^π(Class 3) = 7.4]
Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state. Averaging over the actions selected by π (a one-step backup from V^π(s) to Q^π(s, a)):
\[ V^{\pi}(s) = \sum_{a \in A} \pi(s, a) \, Q^{\pi}(s, a) \]
Averaging over the environment's transitions (a one-step backup from Q^π(s, a) over r and s'):
\[ Q^{\pi}(s, a) = R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} V^{\pi}(s') \]
Composing the two backups gives the Bellman expectation equation for V^π,
\[ V^{\pi}(s) = \sum_{a \in A} \pi(s, a) \left( R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} V^{\pi}(s') \right) \]
and for Q^π,
\[ Q^{\pi}(s, a) = R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} \sum_{a' \in A} \pi(s', a') \, Q^{\pi}(s', a') \]
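A minimal sketch of iterative policy evaluation, which applies the Bellman expectation equation for V^π repeatedly until convergence. The tabular array layout (P[a, s, s'], R[s, a], pi[s, a]) and the function name are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate V(s) <- sum_a pi(s,a) [ R(s,a) + gamma * sum_s' P(a,s,s') V(s') ].

    P:  shape (n_actions, n_states, n_states), P[a, s, s'] = P^a_{ss'}
    R:  shape (n_states, n_actions),           R[s, a]     = R^a_s
    pi: shape (n_states, n_actions),           pi[s, a]    = pi(s, a)
    """
    V = np.zeros(R.shape[0])
    while True:
        # Q[s, a] = R^a_s + gamma * sum_s' P^a_{ss'} V(s')
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = np.sum(pi * Q, axis=1)           # V^pi(s) = sum_a pi(s,a) Q^pi(s,a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```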
[Figure: Bellman expectation equation verified on the Student MDP under the uniform random policy, e.g. at Class 3: 7.4 = 0.5 × (1 + 0.2 × (-1.3) + 0.4 × 2.7 + 0.4 × 7.4) + 0.5 × 10]
Optimal Value Function
Definition
The optimal state-value function V*(s) is the maximum value function over all policies, V*(s) = max_π V^π(s). The optimal action-value function Q*(s, a) is the maximum action-value function over all policies, Q*(s, a) = max_π Q^π(s, a).
[Figure: Optimal state-value function V* for the Student MDP (γ = 1), e.g. V*(Class 3) = 10]
[Figure: Optimal action-value function Q* for the Student MDP (γ = 1): Q* = 5 and 6 at the Facebook state (facebook / quit), 5 and 6 at Class 1 (facebook / study), 0 and 8 at Class 2 (sleep / study), 8.4 and 10 at Class 3 (pub / study)]
Optimal Policy
Define a partial ordering over policies
\[ \pi \geq \pi' \quad \text{if} \quad V^{\pi}(s) \geq V^{\pi'}(s), \ \forall s \]
Theorem
For any Markov Decision Process
There exists an optimal policy π* that is better than or equal to all other policies, π* ≥ π, ∀π
All optimal policies achieve the optimal value function, V^{π*}(s) = V*(s)
All optimal policies achieve the optimal action-value function, Q^{π*}(s, a) = Q*(s, a)
An optimal policy can be found by maximising over Q*(s, a):
\[ \pi_*(s, a) = \begin{cases} 1 & \text{if } a = \operatorname{arg\,max}_{a \in A} Q_*(s, a) \\ 0 & \text{otherwise} \end{cases} \]
There is always a deterministic optimal policy for any MDP
If we know Q*(s, a), we immediately have the optimal policy
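A minimal sketch of reading off such a deterministic policy by acting greedily with respect to a tabular Q*; the array layout Q[s, a] and the function name are illustrative assumptions.

```python
import numpy as np

def greedy_policy(Q):
    """Given a tabular Q*[s, a], return pi*(s, a): 1 for the argmax action, 0 otherwise."""
    n_states, n_actions = Q.shape
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), np.argmax(Q, axis=1)] = 1.0
    return pi
```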
[Figure: Optimal policy for the Student MDP, shown on the Q* values above — π* selects study at each class and quit at the Facebook state]
Bellman Optimality Equation
The optimal value functions are recursively related by the Bellman optimality equations. Maximising over actions at the root of the backup diagram:
\[ V_*(s) = \max_{a} Q_*(s, a) \]
Averaging over the environment's transitions:
\[ Q_*(s, a) = R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} V_*(s') \]
Composing the two backups gives the Bellman optimality equation for V*,
\[ V_*(s) = \max_{a} \left( R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} V_*(s') \right) \]
and for Q*,
\[ Q_*(s, a) = R^{a}_{s} + \gamma \sum_{s' \in S} P^{a}_{ss'} \max_{a'} Q_*(s', a') \]
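Unlike the MRP case, the Bellman optimality equation is non-linear and has no closed-form solution, so it is solved iteratively. Below is a minimal sketch of value iteration, which turns the equation for V* into an update rule; the tabular arrays P[a, s, s'] and R[s, a] follow the same illustrative convention as the policy-evaluation sketch above.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate V(s) <- max_a [ R(s,a) + gamma * sum_s' P(a,s,s') V(s') ].

    P: shape (n_actions, n_states, n_states), R: shape (n_states, n_actions).
    Returns the converged V* and the corresponding Q*.
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
        V_new = Q.max(axis=1)                          # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new
```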
[Figure: Bellman optimality equation illustrated on the Student MDP (γ = 1), e.g. V*(Class 3) = 10]
Extensions to MDPs
Infinite MDPs
Continuous time
Requires partial differential equations
Hamilton-Jacobi-Bellman (HJB) equation
Limiting case of the Bellman equation as the time-step → 0
POMDPs
A POMDP is an MDP with hidden states.
It is a hidden Markov model with actions.
Definition
A Partially Observable Markov Decision Process is a tuple
⟨S, A, O, P, R, Z, γ⟩
S is a finite set of states
A is a finite set of actions
O is a finite set of observations
P is a state transition probability matrix, P^a_{ss'} = P[s' | s, a]
R is a reward function, R^a_s = E[r | s, a]
Z is an observation function, Z^a_{s'o} = P[o | s', a]
γ is a discount factor, γ ∈ [0, 1]
Belief States
Definition
A history h_t is a sequence of actions, observations and rewards,
\[ h_t = a_1, o_1, r_1, \dots, a_t, o_t, r_t \]
Definition
A belief state b(h_t) is a probability distribution over states, conditioned on the history h_t,
\[ b(h_t) = \left( \mathbb{P}[s_t = s^1 \mid h_t], \dots, \mathbb{P}[s_t = s^n \mid h_t] \right) \]
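The belief state can be maintained recursively from the POMDP model. The following is a small sketch (assuming NumPy) of the standard Bayes-filter update b'(s') ∝ Z(o | s', a) Σ_s P(s' | s, a) b(s); the array layout and names are illustrative, and the lecture itself does not give this code.

```python
import numpy as np

def belief_update(b, a, o, P, Z):
    """Update a belief over states after taking action a and observing o.

    b: belief over states, shape (n_states,)
    P: transition probabilities, P[a, s, s'] = P[s' | s, a]
    Z: observation probabilities, Z[a, s', o] = P[o | s', a]
    """
    predicted = b @ P[a]                  # sum_s P[s' | s, a] * b(s)
    unnormalised = Z[a, :, o] * predicted
    return unnormalised / unnormalised.sum()
```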
Reductions of POMDPs
The history ht satisfies the Markov property
The belief state b(ht ) satisfies the Markov property
[Figure: History tree and belief tree — a POMDP can be unrolled into an infinite tree of histories (a1, a1o1, a1o1a1, ...) or, equivalently, an infinite tree of belief states (P(s), P(s|a1), P(s|a1o1), ...)]
Ergodic MDP
Definition
An MDP is ergodic if the Markov chain induced by any policy is ergodic.
For any policy π, an ergodic MDP has an average reward per time-step ρ^π that is independent of the start state,
\[ \rho^{\pi} = \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right] \]
Average Reward Value Function
The value function of an undiscounted, ergodic MDP can be expressed in terms of average reward: Ṽ^π(s) is the extra reward accumulated due to starting from state s,
\[ \tilde{V}^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=1}^{\infty} (r_{t+k} - \rho^{\pi}) \,\middle|\, s_t = s \right] \]
There is a corresponding average reward Bellman equation,
\[
\begin{aligned}
\tilde{V}^{\pi}(s) &= \mathbb{E}_{\pi}\left[ (r_{t+1} - \rho^{\pi}) + \sum_{k=1}^{\infty} (r_{t+k+1} - \rho^{\pi}) \,\middle|\, s_t = s \right] \\
&= \mathbb{E}_{\pi}\left[ (r_{t+1} - \rho^{\pi}) + \tilde{V}^{\pi}(s_{t+1}) \,\middle|\, s_t = s \right]
\end{aligned}
\]
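As a small illustration (assuming NumPy; not from the lecture), the average reward of an ergodic Markov reward process can be computed from the stationary distribution of its transition matrix, which is the same for every start state:

```python
import numpy as np

def average_reward(P, R):
    """rho = sum_s d(s) R(s), where d is the stationary distribution solving d = d P."""
    n = P.shape[0]
    # Solve d P = d together with sum(d) = 1 as a least-squares linear system.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(d @ R)
```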
Questions?