Markov Decision Processes: Stochastic, Sequential Environments

This document describes Markov decision processes (MDPs) using the example of a game show with questions of increasing difficulty and increasing monetary rewards. It explains how to model the game show as an MDP and how to find the optimal strategy by comparing the expected utility of continuing with the payoff for quitting at each level. The key steps are: 1) representing the game as states with actions and transition probabilities; 2) computing the expected utility of continuing versus quitting at each level; 3) choosing to continue whenever the expected utility of continuing exceeds the payoff for quitting. It then gives an overview of solving MDPs with value iteration or policy iteration to find the policy with the highest expected utility.


Markov Decision Processes

Stochastic, sequential environments
Game show
A series of questions with increasing level of difficulty and increasing payoff
Decision: at each step, take your earnings and quit, or go for the next question
If you answer wrong, you lose everything
(State diagram: the $100 question Q1, $1,000 question Q2, $10,000 question Q3, and $50,000 question Q4. After a correct answer you may quit with your accumulated winnings: $100 before Q2, $1,100 before Q3, $11,100 before Q4. Answering Q4 correctly pays $61,100; any incorrect answer pays $0.)
Game show
Consider the $50,000 question
Probability of guessing correctly: 1/10
Quit or go for the question?
What is the expected payoff for continuing?
0.1 × $61,100 + 0.9 × $0 = $6,110
What is the optimal decision?
Game show
What should we do in Q3?
Payoff for quitting: $1,100
Expected payoff for continuing: 0.5 × $11,100 = $5,550
What about Q2?
$100 for quitting vs. 0.75 × $5,550 ≈ $4,162 for continuing
What about Q1?
(Same state diagram, now annotated with the probability of a correct answer at each question: 9/10 for Q1, 3/4 for Q2, 1/2 for Q3, 1/10 for Q4, and the resulting expected utilities U(Q1) = $3,746, U(Q2) = $4,162, U(Q3) = $5,550, U(Q4) = $11,100.)
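This backward-induction calculation can be reproduced with a few lines of code. Below is a minimal Python sketch assuming the payoffs and success probabilities from the diagram; the variable names are illustrative, not part of the original example.

```python
# Game-show decisions: at each level, either quit with the accumulated winnings
# or attempt the next question (a wrong answer ends the game with $0).
quit_payoff = {"Q1": 0, "Q2": 100, "Q3": 1_100, "Q4": 11_100}
p_correct   = {"Q1": 0.9, "Q2": 0.75, "Q3": 0.5, "Q4": 0.1}
final_prize = 61_100          # payoff for answering Q4 correctly

U = {}                        # expected utility of reaching each question
U["Q4"] = max(quit_payoff["Q4"], p_correct["Q4"] * final_prize)
for q, nxt in [("Q3", "Q4"), ("Q2", "Q3"), ("Q1", "Q2")]:
    U[q] = max(quit_payoff[q], p_correct[q] * U[nxt])

print(U)  # {'Q4': 11100, 'Q3': 5550.0, 'Q2': 4162.5, 'Q1': 3746.25}
```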
Markov Decision Processes
Components:
States s, beginning with initial state s_0
Actions a
Each state s has actions A(s) available from it
Transition model P(s' | s, a)
Markov assumption: the probability of going to s' from s depends only on s and a, and not on any other past actions or states
Reward function R(s)
Policy π(s): the action that an agent takes in any given state; the policy is the solution to an MDP
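As an illustration, these components can be collected into a simple data structure. The sketch below is one possible representation in Python; the field names and the dictionary-of-lists encoding of the transition model are choices made here, not something prescribed by the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]
    actions: Dict[str, List[str]]                 # A(s): actions available in s
    # transitions[(s, a)] is a list of (s', probability) pairs -- the model P(s' | s, a)
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]
    rewards: Dict[str, float]                     # R(s)
    gamma: float = 1.0                            # discount factor

# A policy is simply a mapping from states to actions:
Policy = Dict[str, str]
```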
Overview
First, we will look at how to solve MDPs, i.e., find the optimal policy when the transition model and the reward function are known
Second, we will consider reinforcement learning, where we don't know the rules of the environment or the consequences of our actions
Grid world
R(s) = -0.04 for every non-terminal state
Transition model:
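The transition-model figure is not reproduced in this text. In the standard version of this grid-world example the agent moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1, staying in place if the move would hit a wall or leave the grid. A minimal sketch under that assumption (the function and grid details are illustrative):

```python
# Stochastic grid-world transition model (assumed 0.8 / 0.1 / 0.1 slip model).
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action, is_free):
    """Return {next_state: probability} for taking `action` in `state`.

    `is_free(cell)` should return True if the cell is inside the grid and not a wall.
    Bumping into a wall leaves the agent where it is.
    """
    outcomes = {}
    for direction, prob in [(action, 0.8),
                            (PERPENDICULAR[action][0], 0.1),
                            (PERPENDICULAR[action][1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not is_free(nxt):
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes
```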
Grid world
Optimal policy when R(s) = -0.04 for every non-terminal state

Grid world
Optimal policies for other values of R(s):
Solving MDPs
MDP components:
States s
Actions a
Transition model P(s' | s, a)
Reward function R(s)
The solution:
Policy π(s): mapping from states to actions
How to find the optimal policy?
Maximizing expected utility
The optimal policy should maximize the expected utility over all possible state sequences produced by following that policy:

$$\sum_{\text{state sequences starting from } s_0} P(\text{sequence})\, U(\text{sequence})$$

How to define the utility of a state sequence?
Sum of rewards of individual states
Problem: infinite state sequences
Utilities of state sequences
Normally, we would define the utility of a state sequence as the sum of the rewards of the individual states
Problem: infinite state sequences
Solution: discount the individual state rewards by a factor γ between 0 and 1:

$$U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \le \frac{R_{\max}}{1-\gamma}, \qquad 0 < \gamma < 1$$

Sooner rewards count more than later rewards
Makes sure the total utility stays bounded
Helps algorithms converge
Utilities of states
Expected utility obtained by policy π starting in state s:

$$U^{\pi}(s) = \sum_{\text{state sequences starting from } s} P(\text{sequence})\, U(\text{sequence})$$

The true utility of a state, denoted U(s), is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s
Reminiscent of minimax values of states
Finding the utilities of states
(Figure: an expectimax-style tree with U(s) at a max node over actions and chance nodes over successor states s', weighted by P(s' | s, a).)

What is the expected utility of taking action a in state s?

$$\sum_{s'} P(s' \mid s, a)\, U(s')$$

How do we choose the optimal action?

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

What is the recursive expression for U(s) in terms of the utilities of its successor states?

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$
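In code, the expected utility of an action and the greedy choice among actions might look as follows; a minimal sketch built on the MDP structure sketched earlier (the names are illustrative):

```python
def q_value(mdp, U, s, a):
    """Expected utility of taking action a in state s: sum over s' of P(s'|s,a) * U(s')."""
    return sum(p * U[s2] for s2, p in mdp.transitions[(s, a)])

def best_action(mdp, U, s):
    """Greedy action with respect to U: argmax over a in A(s) of the expected utility."""
    return max(mdp.actions[s], key=lambda a: q_value(mdp, U, s, a))
```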
The Bellman equation
Recursive relationship between the utilities of successive states: in state s the agent receives reward R(s), chooses the optimal action a, ends up in s' with probability P(s' | s, a), and gets utility U(s') (discounted by γ):

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
The Bellman equation
Recursive relationship between the utilities of successive states:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

For N states, we get N equations in N unknowns
Solving them solves the MDP
We could try to solve them through expectimax search, but that would run into trouble with infinite sequences
Instead, we solve them algebraically
Two methods: value iteration and policy iteration
Method 1: Value iteration
Start out with every U(s) = 0
Iterate until convergence
During the i-th iteration, update the utility of each state according to this rule:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$

In the limit of infinitely many iterations, guaranteed to find the correct utility values
In practice, we don't need an infinite number of iterations
Value iteration
What effect does the update have?

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$
Value iteration demo
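The demo itself is not included in this text. As a substitute, here is a minimal value-iteration sketch that applies the update rule above, using the MDP structure from before; the convergence threshold and the tiny example MDP are made up for illustration.

```python
def value_iteration(mdp, tol=1e-6):
    """Repeatedly apply U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')."""
    U = {s: 0.0 for s in mdp.states}
    while True:
        U_new = {}
        for s in mdp.states:
            if not mdp.actions[s]:                       # terminal state: no successors
                U_new[s] = mdp.rewards[s]
                continue
            best = max(sum(p * U[s2] for s2, p in mdp.transitions[(s, a)])
                       for a in mdp.actions[s])
            U_new[s] = mdp.rewards[s] + mdp.gamma * best
        if max(abs(U_new[s] - U[s]) for s in mdp.states) < tol:
            return U_new
        U = U_new

# Tiny illustrative MDP (made-up numbers), just to exercise the function:
toy = MDP(
    states=["A", "B"],
    actions={"A": ["go", "stay"], "B": []},              # B is terminal
    transitions={("A", "go"): [("B", 0.8), ("A", 0.2)],
                 ("A", "stay"): [("A", 1.0)]},
    rewards={"A": -0.1, "B": 1.0},
    gamma=0.9,
)
print(value_iteration(toy))   # U(A) converges to about 0.76, U(B) = 1.0
```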
Method 2: Policy iteration
Start with some initial policy π_0 and alternate between the following steps:
Policy evaluation: calculate U^{π_i}(s) for every state s
Policy improvement: calculate a new policy π_{i+1} based on the updated utilities:

$$\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')$$
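A minimal sketch of this alternation, reusing the MDP structure and the best_action helper from earlier; the use of a fixed number of iterative evaluation sweeps is an illustrative simplification.

```python
def policy_iteration(mdp, eval_sweeps=50):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    # Start with an arbitrary initial policy pi_0.
    pi = {s: mdp.actions[s][0] for s in mdp.states if mdp.actions[s]}
    while True:
        # Policy evaluation: estimate U^pi by repeated fixed-policy updates.
        U = {s: 0.0 for s in mdp.states}
        for _ in range(eval_sweeps):
            for s in mdp.states:
                if not mdp.actions[s]:                   # terminal state
                    U[s] = mdp.rewards[s]
                else:
                    U[s] = mdp.rewards[s] + mdp.gamma * sum(
                        p * U[s2] for s2, p in mdp.transitions[(s, pi[s])])
        # Policy improvement: make the policy greedy with respect to U.
        new_pi = {s: best_action(mdp, U, s) for s in pi}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```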
Policy evaluation
Given a fixed policy π, calculate U^π(s) for every state s
The Bellman equation for the optimal policy:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

How does it need to change if our policy is fixed?

$$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$$

Can solve a linear system to get all the utilities!
Alternatively, can apply the following update:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')$$
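Because the max disappears for a fixed policy, the utilities satisfy a linear system that can be solved directly, for example with NumPy. A minimal sketch using the MDP structure from earlier (the matrix construction and terminal-state handling are illustrative assumptions):

```python
import numpy as np

def evaluate_policy_exact(mdp, pi):
    """Solve U = R + gamma * P_pi U, i.e. the linear system (I - gamma * P_pi) U = R."""
    n = len(mdp.states)
    index = {s: i for i, s in enumerate(mdp.states)}
    P_pi = np.zeros((n, n))
    R = np.array([mdp.rewards[s] for s in mdp.states])
    for s in mdp.states:
        if not mdp.actions[s]:            # terminal state: row of zeros gives U(s) = R(s)
            continue
        for s2, p in mdp.transitions[(s, pi[s])]:
            P_pi[index[s], index[s2]] = p
    U = np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R)
    return {s: U[index[s]] for s in mdp.states}
```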
