
Lecture 2: Markov Decision Processes

Lecture 2: Markov Decision Processes


David Silver

Lecture 2: Markov Decision Processes

1 Markov Processes

2 Markov Reward Processes

3 Markov Decision Processes

4 Extensions to MDPs

Lecture 2: Markov Decision Processes


Markov Processes
Introduction

Introduction to MDPs

Markov decision processes formally describe an environment


for reinforcement learning
Where the environment is fully observable
i.e. The current state completely characterises the process
Almost all RL problems can be formalised as MDPs, e.g.
Optimal control primarily deals with continuous MDPs
Partially observable problems can be converted into MDPs
Bandits are MDPs with one state

Lecture 2: Markov Decision Processes


Markov Processes
Markov Property

Markov Property
The future is independent of the past given the present
Definition
A state s_t is Markov if and only if

    P[s_{t+1} | s_t] = P[s_{t+1} | s_1, ..., s_t]

The state captures all relevant information from the history


Once the state is known, the history may be thrown away
i.e. The state is a sufficient statistic of the future

Lecture 2: Markov Decision Processes


Markov Processes
Markov Property

State Transition Matrix


For a Markov state s and successor state s', the state transition
probability is defined by

    P_{ss'} = P[s' | s]

The state transition matrix P defines transition probabilities from all
states s to all successor states s',

        ( P_11 ... P_1n )
    P = (  ...      ... )
        ( P_n1 ... P_nn )

where each row of the matrix sums to 1.
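For illustration only (not code from the slides), a transition matrix can be stored as a row-stochastic array; the 2-state chain below is hypothetical:

    import numpy as np

    # Hypothetical 2-state Markov chain (illustrative only).
    P = np.array([[0.7, 0.3],    # P[s'=1|s=1], P[s'=2|s=1]
                  [0.4, 0.6]])   # P[s'=1|s=2], P[s'=2|s=2]

    assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to 1

    # One-step evolution of a distribution over states: mu' = mu P
    mu = np.array([1.0, 0.0])    # start in state 1 with certainty
    print(mu @ P)                # distribution over states after one step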

Lecture 2: Markov Decision Processes


Markov Processes
Markov Chains

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of
random states s_1, s_2, ... with the Markov property.

Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
  S is a (finite) set of states
  P is a state transition probability matrix, P_{ss'} = P[s' | s]

Lecture 2: Markov Decision Processes


Markov Processes
Markov Chains

Example: Student Markov Chain

[Figure: Student Markov Chain. States: Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep (terminal).
Transitions: C1 -> C2 0.5, C1 -> FB 0.5; C2 -> C3 0.8, C2 -> Sleep 0.2; C3 -> Pass 0.6, C3 -> Pub 0.4;
Pass -> Sleep 1.0; Pub -> C1 0.2, Pub -> C2 0.4, Pub -> C3 0.4; FB -> FB 0.9, FB -> C1 0.1; Sleep -> Sleep 1.0.]

Lecture 2: Markov Decision Processes


Markov Processes
Markov Chains

Example: Student Markov Chain Episodes


Sample episodes for the Student Markov Chain starting from s_1 = C1:

    s_1, s_2, ..., s_T

  C1 C2 C3 Pass Sleep
  C1 FB FB C1 C2 Sleep
  C1 C2 C3 Pub C2 C3 Pass Sleep
  C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

[Figure: Student Markov Chain diagram, as on the previous slide.]

Lecture 2: Markov Decision Processes


Markov Processes
Markov Chains

Example: Student Markov Chain Transition Matrix

[Figure: Student Markov Chain diagram, as on the previous slides.]

P (rows = from, columns = to; omitted entries, shown as ".", are 0):

              C1    C2    C3    Pass  Pub   FB    Sleep
    C1        .     0.5   .     .     .     0.5   .
    C2        .     .     0.8   .     .     .     0.2
    C3        .     .     .     0.6   0.4   .     .
    Pass      .     .     .     .     .     .     1.0
    Pub       0.2   0.4   0.4   .     .     .     .
    FB        0.1   .     .     .     .     0.9   .
    Sleep     .     .     .     .     .     .     1.0
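A minimal sketch of this chain in numpy, with P entered as read off the table above (state order C1, C2, C3, Pass, Pub, FB, Sleep); the episode sampler is an illustrative addition, not code from the lecture:

    import numpy as np

    states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
    P = np.array([
        #  C1   C2   C3  Pass  Pub   FB  Sleep
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (terminal, absorbing)
    ])
    assert np.allclose(P.sum(axis=1), 1.0)

    def sample_episode(rng, start="C1", terminal="Sleep"):
        """Sample one episode s1, s2, ..., sT from the chain."""
        s = states.index(start)
        episode = [states[s]]
        while states[s] != terminal:
            s = rng.choice(len(states), p=P[s])
            episode.append(states[s])
        return episode

    rng = np.random.default_rng(0)
    print(sample_episode(rng))   # one random walk from C1 to Sleep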

Lecture 2: Markov Decision Processes


Markov Reward Processes
MRP

Markov Reward Process

A Markov reward process is a Markov chain with values.


Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
  S is a finite set of states
  P is a state transition probability matrix, P_{ss'} = P[s' | s]
  R is a reward function, R_s = E[r | s]
  γ is a discount factor, γ ∈ [0, 1]

Lecture 2: Markov Decision Processes


Markov Reward Processes
MRP

Example: Student MRP

[Figure: Student MRP. Same transition structure as the Student Markov Chain, with rewards:
Class 1 r = -2, Class 2 r = -2, Class 3 r = -2, Pass r = +10, Pub r = +1, Facebook r = -1, Sleep r = 0.]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Return

Return
Definition
The return v_t is the total discounted reward from time-step t,

    v_t = r_{t+1} + γ r_{t+2} + ... = Σ_{k=0}^∞ γ^k r_{t+k+1}

The discount γ ∈ [0, 1] is the present value of future rewards
  The value of receiving reward r after k + 1 time-steps is γ^k r
  This values immediate reward above delayed reward:
    γ close to 0 leads to "myopic" evaluation
    γ close to 1 leads to "far-sighted" evaluation
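A small sketch of this definition (illustrative; the reward sequence below is hypothetical):

    def discounted_return(rewards, gamma):
        """v_t = r_{t+1} + gamma*r_{t+2} + ... = sum_k gamma^k * r_{t+k+1}."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # Rewards received after time t in one hypothetical episode.
    rewards = [-2, -2, -2, +10]
    print(discounted_return(rewards, gamma=0.0))   # myopic: -2.0 (immediate reward only)
    print(discounted_return(rewards, gamma=1.0))   # far-sighted: 4.0 (undiscounted sum)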

Lecture 2: Markov Decision Processes


Markov Reward Processes
Return

Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more
interest than delayed rewards
Animal/human behaviour shows preference for immediate
reward
It is sometimes possible to use undiscounted Markov reward
processes (i.e. γ = 1), e.g. if all sequences terminate.

Lecture 2: Markov Decision Processes


Markov Reward Processes
Value Function

Value Function

The value function V(s) gives the long-term value of state s

Definition
The state-value function V(s) of an MRP is the expected return
starting from state s,

    V(s) = E[v_t | s_t = s]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Value Function

Example: Student Markov Chain Returns


Sample returns for the Student Markov Chain, undiscounted (γ = 1),
starting from s_1 = C1:

    v_1 = r_2 + γ r_3 + ... + γ^{T-2} r_T = r_2 + r_3 + ... + r_T

  C1 C2 C3 Pass Sleep                v_1 = -2 - 2 - 2 + 10                 = +4
  C1 FB FB C1 C2 Sleep               v_1 = -2 - 1 - 1 - 2 - 2              = -8
  C1 C2 C3 Pub C2 C3 Pass Sleep      v_1 = -2 - 2 - 2 + 1 - 2 - 2 + 10     = +1
  C1 FB FB C1 C2 C3 Pub C1 FB FB
  FB C1 C2 C3 Pub C2 Sleep           v_1 = -2 - 1 - 1 - 2 - 2 - 2 + 1 - 2
                                           - 1 - 1 - 1 - 2 - 2 - 2 + 1 - 2 = -21

[Figure: Student Markov Chain diagram, as on the previous slides.]
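These returns can be checked mechanically; a sketch using the per-state rewards from the Student MRP figure, summing one reward per state visited before the terminal Sleep state:

    R = {"C1": -2, "C2": -2, "C3": -2, "Pass": +10, "Pub": +1, "FB": -1, "Sleep": 0}

    episodes = [
        ["C1", "C2", "C3", "Pass", "Sleep"],
        ["C1", "FB", "FB", "C1", "C2", "Sleep"],
        ["C1", "C2", "C3", "Pub", "C2", "C3", "Pass", "Sleep"],
        ["C1", "FB", "FB", "C1", "C2", "C3", "Pub", "C1",
         "FB", "FB", "FB", "C1", "C2", "C3", "Pub", "C2", "Sleep"],
    ]

    for ep in episodes:
        # Undiscounted return from s1 (gamma = 1); Sleep is terminal with reward 0.
        v1 = sum(R[s] for s in ep[:-1])
        print(v1)   # +4, -8, +1, -21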

Lecture 2: Markov Decision Processes


Markov Reward Processes
Value Function

Example: State-Value Function for Student MRP (1)


V(s) for γ = 0

[Figure: Student MRP with state values equal to the immediate rewards:
Facebook -1, Class 1 -2, Class 2 -2, Class 3 -2, Pub +1, Pass +10, Sleep 0.]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Value Function

Example: State-Value Function for Student MRP (2)


V(s) for γ = 0.9

[Figure: Student MRP with state values:
Facebook -7.6, Class 1 -5.0, Class 2 0.9, Class 3 4.1, Pub 1.9, Pass 10, Sleep 0.]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Value Function

Example: State-Value Function for Student MRP (3)


V(s) for γ = 1

[Figure: Student MRP with state values:
Facebook -23, Class 1 -13, Class 2 1.5, Class 3 4.3, Pub 0.8, Pass 10, Sleep 0.]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Bellman Equation

Bellman Equation for MRPs


The value function can be decomposed into two parts:
  immediate reward r_{t+1}
  discounted value of successor state γ V(s_{t+1})

    V(s) = E[v_t | s_t = s]
         = E[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... | s_t = s]
         = E[r_{t+1} + γ (r_{t+2} + γ r_{t+3} + ...) | s_t = s]
         = E[r_{t+1} + γ v_{t+1} | s_t = s]
         = E[r_{t+1} + γ V(s_{t+1}) | s_t = s]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Bellman Equation

Bellman Equation for MRPs (2)



    V(s) = E[r + γ V(s') | s]

[Backup diagram: state s with value V(s); reward r; successor states s' with values V(s').]

    V(s) = R_s + γ Σ_{s'∈S} P_{ss'} V(s')

Lecture 2: Markov Decision Processes


Markov Reward Processes
Bellman Equation

Example: Bellman Equation for Student MRP


4.3 = -2 + 0.6*10 + 0.4*0.8       (Bellman equation for Class 3, γ = 1)

[Figure: Student MRP with γ = 1 state values:
Facebook -23, Class 1 -13, Class 2 1.5, Class 3 4.3, Pub 0.8, Pass 10, Sleep 0.]

Lecture 2: Markov Decision Processes


Markov Reward Processes
Bellman Equation

Bellman Equation in Matrix Form


The Bellman equation can be expressed concisely using matrices,

    V = R + γ P V

where V is a column vector with one entry per state,

    ( V(1) )   ( R_1 )       ( P_11 ... P_1n ) ( V(1) )
    (  ... ) = (  ... ) + γ  (  ...      ... ) (  ... )
    ( V(n) )   ( R_n )       ( P_n1 ... P_nn ) ( V(n) )

Lecture 2: Markov Decision Processes


Markov Reward Processes
Bellman Equation

Solving the Bellman Equation


The Bellman equation is a linear equation.
It can be solved directly:

    V = R + γ P V
    (I - γP) V = R
    V = (I - γP)^{-1} R

Computational complexity is O(n³) for n states
Direct solution only possible for small MRPs
There are many iterative methods for large MRPs, e.g.
  Dynamic programming
  Monte-Carlo evaluation
  Temporal-Difference learning
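A sketch of the direct solution for the Student MRP (P and R entered from the earlier figures, γ = 0.9); it should approximately reproduce the values on the γ = 0.9 slide:

    import numpy as np

    # State order: C1, C2, C3, Pass, Pub, FB, Sleep
    P = np.array([
        [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],
        [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],
        [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
        [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],
        [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],
        [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    ])
    R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])
    gamma = 0.9

    # V = (I - gamma P)^{-1} R  -- direct solution, O(n^3) in the number of states
    V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
    print(np.round(V, 1))   # approximately [-5.0, 0.9, 4.1, 10.0, 1.9, -7.6, 0.0]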

Lecture 2: Markov Decision Processes


Markov Decision Processes
MDP

Markov Decision Process


A Markov decision process (MDP) is a Markov reward process with
decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
  S is a finite set of states
  A is a finite set of actions
  P is a state transition probability matrix, P^a_{ss'} = P[s' | s, a]
  R is a reward function, R^a_s = E[r | s, a]
  γ is a discount factor, γ ∈ [0, 1].

Lecture 2: Markov Decision Processes


Markov Decision Processes
MDP

Example: Student MDP


[Figure: Student MDP. States: Facebook, Class 1, Class 2, Class 3, Sleep (terminal). Actions and rewards:
  Facebook state: Facebook (r = -1, stay), Quit (r = 0, to Class 1)
  Class 1: Facebook (r = -1, to Facebook), Study (r = -2, to Class 2)
  Class 2: Sleep (r = 0, terminal), Study (r = -2, to Class 3)
  Class 3: Study (r = +10, terminal), Pub (r = +1, then to Class 1, 2, 3 with probabilities 0.2, 0.4, 0.4)]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Policies

Policies (1)

Definition
A policy π is a distribution over actions given states,

    π(s, a) = P[a | s]

A policy fully defines the behaviour of an agent
MDP policies depend on the current state (not the history)
  i.e. Policies are stationary (time-independent),
    a_t ~ π(s_t, ·), ∀t > 0

Lecture 2: Markov Decision Processes


Markov Decision Processes
Policies

Policies (2)

Given an MDP M = ⟨S, A, P, R, γ⟩ and a policy π
  The state sequence s_1, s_2, ... is a Markov process ⟨S, P^π⟩
  The state and reward sequence s_1, r_2, s_2, ... is a Markov reward
  process ⟨S, P^π, R^π, γ⟩
where

    P^π_{ss'} = Σ_{a∈A} π(s, a) P^a_{ss'}

    R^π_s = Σ_{a∈A} π(s, a) R^a_s
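A sketch of this averaging in numpy; the array layout (P as [A, S, S], R and π as [S, A]) is my own convention, not the lecture's:

    import numpy as np

    def induced_mrp(P, R, pi):
        """Average MDP dynamics over a policy pi.

        P  : array [A, S, S], P[a, s, s2] = P(s2 | s, a)
        R  : array [S, A],    R[s, a]     = expected reward for action a in state s
        pi : array [S, A],    pi[s, a]    = probability of choosing a in s
        Returns the induced MRP dynamics (P_pi [S, S], R_pi [S]).
        """
        P_pi = np.einsum("sa,asz->sz", pi, P)   # P_pi[s, s'] = sum_a pi(s,a) P^a_{ss'}
        R_pi = (pi * R).sum(axis=1)             # R_pi[s]     = sum_a pi(s,a) R^a_s
        return P_pi, R_pi

    # Tiny hypothetical 2-state, 2-action MDP under the uniform random policy.
    P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay
                  [[0.0, 1.0], [1.0, 0.0]]])   # action 1: switch
    R = np.array([[0.0, 1.0], [0.0, 1.0]])     # reward 1 for choosing "switch"
    pi = np.full((2, 2), 0.5)
    print(induced_mrp(P, R, pi))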

Lecture 2: Markov Decision Processes


Markov Decision Processes
Value Functions

Value Function
Definition
The state-value function V^π(s) of an MDP is the expected return
starting from state s, and then following policy π,

    V^π(s) = E_π[v_t | s_t = s]

Definition
The action-value function Q^π(s, a) is the expected return
starting from state s, taking action a, and then following policy π,

    Q^π(s, a) = E_π[v_t | s_t = s, a_t = a]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Value Functions

Example: State-Value Function for Student MDP


V^π(s) for π(s, a) = 0.5, γ = 1

[Figure: Student MDP with state values under the uniform random policy:
Facebook -2.3, Class 1 -1.3, Class 2 2.7, Class 3 7.4 (Sleep terminal, value 0).]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation


The state-value function can again be decomposed into immediate
reward plus discounted value of successor state,

    V^π(s) = E_π[r_{t+1} + γ V^π(s_{t+1}) | s_t = s]

The action-value function can similarly be decomposed,

    Q^π(s, a) = E_π[r_{t+1} + γ Q^π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation for V^π

[Backup diagram: state s with value V^π(s); actions a with values Q^π(s, a).]

    V^π(s) = Σ_{a∈A} π(s, a) Q^π(s, a)

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation for Q^π

[Backup diagram: state-action pair (s, a) with value Q^π(s, a); reward r; successor states s' with values V^π(s').]

    Q^π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} V^π(s')

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation for V^π (2)

[Backup diagram: state s, actions a, rewards r, successor states s'.]

    V^π(s) = Σ_{a∈A} π(s, a) ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} V^π(s') )

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation for Q^π (2)

[Backup diagram: state-action pair (s, a), reward r, successor states s', successor actions a'.]

    Q^π(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} Σ_{a'∈A} π(s', a') Q^π(s', a')

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Example: Bellman Expectation Equation in Student MDP


7.4 = 0.5 * (1 + 0.2*(-1.3) + 0.4*2.7 + 0.4*7.4) + 0.5 * 10
(Bellman expectation equation for Class 3 under the uniform random policy, γ = 1)

[Figure: Student MDP with state values Facebook -2.3, Class 1 -1.3, Class 2 2.7, Class 3 7.4.]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Expectation Equation

Bellman Expectation Equation (Matrix Form)

The Bellman expectation equation can be expressed concisely
using the induced MRP,

    V^π = R^π + γ P^π V^π

with direct solution

    V^π = (I - γ P^π)^{-1} R^π
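As an illustrative sketch (my own encoding of the Student MDP, read off the earlier figure, not code from the lecture), iterative policy evaluation under the uniform random policy with γ = 1 should approximately reproduce the values -2.3, -1.3, 2.7 and 7.4 shown earlier:

    # Student MDP: state -> action -> (reward, {successor: probability}).
    mdp = {
        "FB": {"Facebook": (-1, {"FB": 1.0}),    "Quit":  (0,  {"C1": 1.0})},
        "C1": {"Facebook": (-1, {"FB": 1.0}),    "Study": (-2, {"C2": 1.0})},
        "C2": {"Sleep":    (0,  {"Sleep": 1.0}), "Study": (-2, {"C3": 1.0})},
        "C3": {"Study":    (10, {"Sleep": 1.0}),
               "Pub":      (1,  {"C1": 0.2, "C2": 0.4, "C3": 0.4})},
    }
    gamma = 1.0
    V = {s: 0.0 for s in list(mdp) + ["Sleep"]}   # Sleep is terminal, value 0

    for _ in range(1000):   # sweep the Bellman expectation backup until it settles
        for s, actions in mdp.items():
            # pi(s, a) = 0.5 for each of the two actions available in s
            V[s] = sum(0.5 * (r + gamma * sum(p * V[s2] for s2, p in trans.items()))
                       for r, trans in actions.values())

    print({s: round(v, 1) for s, v in V.items()})   # expect FB -2.3, C1 -1.3, C2 2.7, C3 7.4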

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Optimal Value Function


Definition
The optimal state-value function V*(s) is the maximum value
function over all policies,

    V*(s) = max_π V^π(s)

The optimal action-value function Q*(s, a) is the maximum
action-value function over all policies,

    Q*(s, a) = max_π Q^π(s, a)

The optimal value function specifies the best possible
performance in the MDP.
An MDP is "solved" when we know the optimal value function.

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Example: Optimal Value Function for Student MDP


V*(s) for γ = 1

[Figure: Student MDP with optimal state values:
Facebook 6, Class 1 6, Class 2 8, Class 3 10 (Sleep terminal, value 0).]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Example: Optimal Action-Value Function for Student MDP


Q*(s, a) for γ = 1

[Figure: Student MDP with optimal action values:
  Facebook state: Facebook (r = -1) Q* = 5, Quit (r = 0) Q* = 6
  Class 1: Facebook (r = -1) Q* = 5, Study (r = -2) Q* = 6
  Class 2: Sleep (r = 0) Q* = 0, Study (r = -2) Q* = 8
  Class 3: Study (r = +10) Q* = 10, Pub (r = +1) Q* = 8.4]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Optimal Policy
Define a partial ordering over policies,

    π ≥ π'  if  V^π(s) ≥ V^π'(s), ∀s

Theorem
For any Markov Decision Process
  There exists an optimal policy π* that is better than or equal
  to all other policies, π* ≥ π, ∀π
  All optimal policies achieve the optimal value function,
  V^{π*}(s) = V*(s)
  All optimal policies achieve the optimal action-value function,
  Q^{π*}(s, a) = Q*(s, a)

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Finding an Optimal Policy

An optimal policy can be found by maximising over Q*(s, a),

    π*(s, a) = { 1  if a = argmax_{a∈A} Q*(s, a)
               { 0  otherwise

There is always a deterministic optimal policy for any MDP
If we know Q*(s, a), we immediately have the optimal policy
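A minimal sketch of this greedy extraction; the nested-dictionary layout for Q* is my own assumption:

    def greedy_policy(Q):
        """Q: dict state -> dict action -> value. Returns a deterministic policy."""
        return {s: max(actions, key=actions.get) for s, actions in Q.items()}

    # Q* values for the Student MDP as shown on the earlier slide.
    Q_star = {
        "FB": {"Facebook": 5, "Quit": 6},
        "C1": {"Facebook": 5, "Study": 6},
        "C2": {"Sleep": 0, "Study": 8},
        "C3": {"Pub": 8.4, "Study": 10},
    }
    print(greedy_policy(Q_star))   # {'FB': 'Quit', 'C1': 'Study', 'C2': 'Study', 'C3': 'Study'}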

Lecture 2: Markov Decision Processes


Markov Decision Processes
Optimal Value Functions

Example: Optimal Policy for Student MDP


π*(s, a) for γ = 1

[Figure: Student MDP with the arcs of the optimal policy highlighted: Quit from the
Facebook state, Study from Class 1, Class 2 and Class 3; Q* values as on the previous slide.]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Bellman Optimality Equation for V*

The optimal value functions are recursively related by the Bellman
optimality equations:

[Backup diagram: state s with value V*(s); actions a with values Q*(s, a).]

    V*(s) = max_a Q*(s, a)

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Bellman Optimality Equation for Q*

[Backup diagram: state-action pair (s, a) with value Q*(s, a); reward r; successor states s' with values V*(s').]

    Q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} V*(s')

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Bellman Optimality Equation for V* (2)

[Backup diagram: state s, actions a, rewards r, successor states s'.]

    V*(s) = max_a ( R^a_s + γ Σ_{s'∈S} P^a_{ss'} V*(s') )

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Bellman Optimality Equation for Q* (2)

[Backup diagram: state-action pair (s, a), reward r, successor states s', successor actions a'.]

    Q*(s, a) = R^a_s + γ Σ_{s'∈S} P^a_{ss'} max_{a'} Q*(s', a')

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Example: Bellman Optimality Equation in Student MDP


6 = max {-2 + 8, -1 + 6}       (Bellman optimality equation for Class 1: Study vs. Facebook)

[Figure: Student MDP with optimal state values Facebook 6, Class 1 6, Class 2 8, Class 3 10.]

Lecture 2: Markov Decision Processes


Markov Decision Processes
Bellman Optimality Equation

Solving the Bellman Optimality Equation

Bellman Optimality Equation is non-linear


No closed form solution (in general)
Many iterative solution methods
Value Iteration
Policy Iteration
Q-learning
Sarsa
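A sketch of one such method, value iteration, on the same hand-encoded Student MDP used earlier (illustrative, not the lecture's code); it repeatedly applies the Bellman optimality backup:

    # Student MDP: state -> action -> (reward, {successor: probability}); Sleep is terminal.
    mdp = {
        "FB": {"Facebook": (-1, {"FB": 1.0}),    "Quit":  (0,  {"C1": 1.0})},
        "C1": {"Facebook": (-1, {"FB": 1.0}),    "Study": (-2, {"C2": 1.0})},
        "C2": {"Sleep":    (0,  {"Sleep": 1.0}), "Study": (-2, {"C3": 1.0})},
        "C3": {"Study":    (10, {"Sleep": 1.0}),
               "Pub":      (1,  {"C1": 0.2, "C2": 0.4, "C3": 0.4})},
    }
    gamma = 1.0
    V = {s: 0.0 for s in list(mdp) + ["Sleep"]}

    for _ in range(100):
        new_V = dict(V)
        for s, actions in mdp.items():
            # V(s) <- max_a [ R^a_s + gamma * sum_s' P^a_{ss'} V(s') ]
            new_V[s] = max(r + gamma * sum(p * V[s2] for s2, p in trans.items())
                           for r, trans in actions.values())
        V = new_V

    print(V)   # expect the optimal values: FB 6, C1 6, C2 8, C3 10, Sleep 0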

Lecture 2: Markov Decision Processes


Extensions to MDPs

Extensions to MDPs

Infinite and continuous MDPs


Partially observable MDPs
Undiscounted, average reward MDPs

Lecture 2: Markov Decision Processes


Extensions to MDPs
Infinite MDPs

Infinite MDPs

The following extensions are all possible:


Countably infinite state and/or action spaces
Straightforward

Continuous state and/or action spaces


Conceptually similar, but must deal with measure theory

Continuous time
Requires partial differential equations
Hamilton-Jacobi-Bellman (HJB) equation
Limiting case of the Bellman equation as the time-step → 0

Lecture 2: Markov Decision Processes


Extensions to MDPs
Partially Observable MDPs

POMDPs
A POMDP is an MDP with hidden states.
It is a hidden Markov model with actions.
Definition
A Partially Observable Markov Decision Process is a tuple
⟨S, A, O, P, R, Z, γ⟩
  S is a finite set of states
  A is a finite set of actions
  O is a finite set of observations
  P is a state transition probability matrix, P^a_{ss'} = P[s' | s, a]
  R is a reward function, R^a_s = E[r | s, a]
  Z is an observation function, Z^a_s = P[o | s, a]
  γ is a discount factor, γ ∈ [0, 1].

Lecture 2: Markov Decision Processes


Extensions to MDPs
Partially Observable MDPs

Belief States
Definition
A history h_t is a sequence of actions, observations and rewards,

    h_t = a_1, o_1, r_1, ..., a_t, o_t, r_t

Definition
A belief state b(h_t) is a probability distribution over states,
conditioned on the history h_t,

    b(h_t) = ( P[s_t = s^1 | h_t], ..., P[s_t = s^n | h_t] )

Lecture 2: Markov Decision Processes


Extensions to MDPs
Partially Observable MDPs

Reductions of POMDPs
The history h_t satisfies the Markov property
The belief state b(h_t) satisfies the Markov property

[Figure: a history tree (root, then nodes a1, a2; a1o1, a1o2, a2o1, a2o2; a1o1a1, a1o1a2; ...)
and the corresponding belief state tree (P(s); P(s|a1), P(s|a2); P(s|a1o1), P(s|a1o2),
P(s|a2o1), P(s|a2o2); P(s|a1o1a1), P(s|a1o1a2); ...).]

A POMDP can be reduced to an (infinite) history tree
A POMDP can be reduced to an (infinite) belief state tree

Lecture 2: Markov Decision Processes


Extensions to MDPs
Average Reward MDPs

Ergodic Markov Process


An ergodic Markov process is
  Recurrent: each state is visited an infinite number of times
  Aperiodic: each state is visited without any systematic period

Theorem
An ergodic Markov process has a limiting stationary distribution d(s)
with the property

    d(s) = Σ_{s'∈S} d(s') P_{s's}
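A sketch of finding such a stationary distribution by power iteration; the 3-state chain below is hypothetical and purely illustrative:

    import numpy as np

    # Hypothetical ergodic 3-state chain (recurrent and aperiodic).
    P = np.array([[0.5, 0.25, 0.25],
                  [0.2, 0.6,  0.2 ],
                  [0.3, 0.3,  0.4 ]])

    d = np.full(3, 1.0 / 3)            # any initial distribution works
    for _ in range(1000):
        d = d @ P                      # d(s) = sum_s' d(s') P_{s's}

    print(d, np.allclose(d, d @ P))    # stationary: d = d P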

Lecture 2: Markov Decision Processes


Extensions to MDPs
Average Reward MDPs

Ergodic MDP

Definition
An MDP is ergodic if the Markov chain induced by any policy is
ergodic.
For any policy π, an ergodic MDP has an average reward per
time-step ρ^π that is independent of start state,

    ρ^π = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t ]

Lecture 2: Markov Decision Processes


Extensions to MDPs
Average Reward MDPs

Average Reward Value Function


The value function of an undiscounted, ergodic MDP can be
expressed in terms of average reward.
Ṽ^π(s) is the extra reward due to starting from state s,

    Ṽ^π(s) = E_π[ Σ_{k=1}^∞ (r_{t+k} - ρ^π) | s_t = s ]

There is a corresponding average reward Bellman equation,

    Ṽ^π(s) = E_π[ (r_{t+1} - ρ^π) + Σ_{k=1}^∞ (r_{t+k+1} - ρ^π) | s_t = s ]
            = E_π[ (r_{t+1} - ρ^π) + Ṽ^π(s_{t+1}) | s_t = s ]

Lecture 2: Markov Decision Processes


Extensions to MDPs
Average Reward MDPs

Questions?

"The only stupid question is the one you were afraid to
ask but never did."
    - Rich Sutton
