Markov Decision Processes: Stochastic, Sequential Environments

This document describes Markov decision processes (MDPs) using the example of a game show with questions of increasing difficulty and increasing monetary rewards. It explains how to model the game show as an MDP and how to find the optimal strategy by comparing the expected utility of continuing with the payoff for quitting at each level. The key steps are: 1) representing the game as states with actions and transition probabilities; 2) computing the expected utility of continuing versus quitting at each level; 3) choosing to continue whenever the expected utility of continuing exceeds the payoff for quitting. It then gives an overview of solving MDPs with value iteration or policy iteration to find the policy with the highest expected utility.


Markov Decision Processes

Stochastic, sequential environments
Game show
A series of questions with increasing level of difficulty and increasing payoff
Decision: at each step, take your earnings and quit, or go for the next question
If you answer wrong, you lose everything
(State diagram: the $100 question Q1, $1,000 question Q2, $10,000 question Q3, and $50,000 question Q4. After a correct answer you may quit with your accumulated winnings: $100 before Q2, $1,100 before Q3, $11,100 before Q4. Answering Q4 correctly pays $61,100; any incorrect answer pays $0.)
Game show
Consider the $50,000 question
Probability of guessing correctly: 1/10
Quit or go for the question?
What is the expected payoff for continuing?
0.1 × $61,100 + 0.9 × $0 = $6,110
What is the optimal decision?
Game show
What should we do in Q3?
Payoff for quitting: $1,100
Expected payoff for continuing: 0.5 × $11,100 = $5,550
What about Q2?
$100 for quitting vs. 0.75 × $5,550 ≈ $4,162 for continuing
What about Q1?
(Same state diagram, now annotated with the probability of a correct answer at each question: 9/10 for Q1, 3/4 for Q2, 1/2 for Q3, 1/10 for Q4, and the resulting expected utilities U(Q1) = $3,746, U(Q2) = $4,162, U(Q3) = $5,550, U(Q4) = $11,100.)
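This backward-induction calculation can be reproduced with a few lines of code. Below is a minimal Python sketch assuming the payoffs and success probabilities from the diagram; the variable names are illustrative, not part of the original example.

```python
# Game-show decisions: at each level, either quit with the accumulated winnings
# or attempt the next question (a wrong answer ends the game with $0).
quit_payoff = {"Q1": 0, "Q2": 100, "Q3": 1_100, "Q4": 11_100}
p_correct   = {"Q1": 0.9, "Q2": 0.75, "Q3": 0.5, "Q4": 0.1}
final_prize = 61_100          # payoff for answering Q4 correctly

U = {}                        # expected utility of reaching each question
U["Q4"] = max(quit_payoff["Q4"], p_correct["Q4"] * final_prize)
for q, nxt in [("Q3", "Q4"), ("Q2", "Q3"), ("Q1", "Q2")]:
    U[q] = max(quit_payoff[q], p_correct[q] * U[nxt])

print(U)  # {'Q4': 11100, 'Q3': 5550.0, 'Q2': 4162.5, 'Q1': 3746.25}
```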
Markov Decision Processes
Components:
States s, beginning with initial state s_0
Actions a
Each state s has actions A(s) available from it
Transition model P(s' | s, a)
Markov assumption: the probability of going to s' from s depends only on s and a, and not on any other past actions or states
Reward function R(s)
Policy π(s): the action that an agent takes in any given state; the policy is the solution to an MDP
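As an illustration, these components can be collected into a simple data structure. The sketch below is one possible representation in Python; the field names and the dictionary-of-lists encoding of the transition model are choices made here, not something prescribed by the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]
    actions: Dict[str, List[str]]                 # A(s): actions available in s
    # transitions[(s, a)] is a list of (s', probability) pairs -- the model P(s' | s, a)
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]
    rewards: Dict[str, float]                     # R(s)
    gamma: float = 1.0                            # discount factor

# A policy is simply a mapping from states to actions:
Policy = Dict[str, str]
```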
Overview
First, we will look at how to solve MDPs, i.e., find the optimal policy when the transition model and the reward function are known
Second, we will consider reinforcement learning, where we don't know the rules of the environment or the consequences of our actions
Grid world
R(s) = -0.04 for every non-terminal state
Transition model:
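The transition-model figure is not reproduced in this text. In the standard version of this grid-world example the agent moves in the intended direction with probability 0.8 and slips to each perpendicular direction with probability 0.1, staying in place if the move would hit a wall or leave the grid. A minimal sketch under that assumption (the function and grid details are illustrative):

```python
# Stochastic grid-world transition model (assumed 0.8 / 0.1 / 0.1 slip model).
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action, is_free):
    """Return {next_state: probability} for taking `action` in `state`.

    `is_free(cell)` should return True if the cell is inside the grid and not a wall.
    Bumping into a wall leaves the agent where it is.
    """
    outcomes = {}
    for direction, prob in [(action, 0.8),
                            (PERPENDICULAR[action][0], 0.1),
                            (PERPENDICULAR[action][1], 0.1)]:
        dx, dy = MOVES[direction]
        nxt = (state[0] + dx, state[1] + dy)
        if not is_free(nxt):
            nxt = state
        outcomes[nxt] = outcomes.get(nxt, 0.0) + prob
    return outcomes
```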
Grid world
Optimal policy when R(s) = -0.04 for every non-terminal state

Grid world
Optimal policies for other values of R(s):
Solving MDPs
MDP components:
States s
Actions a
Transition model P(s' | s, a)
Reward function R(s)
The solution:
Policy π(s): mapping from states to actions
How to find the optimal policy?
Maximizing expected utility
The optimal policy should maximize the expected utility over all possible state sequences produced by following that policy:

$$\sum_{\text{state sequences starting from } s_0} P(\text{sequence})\, U(\text{sequence})$$

How to define the utility of a state sequence?
Sum of rewards of individual states
Problem: infinite state sequences
Utilities of state sequences
Normally, we would define the utility of a state sequence as the sum of the rewards of the individual states
Problem: infinite state sequences
Solution: discount the individual state rewards by a factor γ between 0 and 1:

$$U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \le \frac{R_{\max}}{1-\gamma}, \qquad 0 < \gamma < 1$$

Sooner rewards count more than later rewards
Makes sure the total utility stays bounded
Helps algorithms converge
Utilities of states
Expected utility obtained by policy π starting in state s:

$$U^{\pi}(s) = \sum_{\text{state sequences starting from } s} P(\text{sequence})\, U(\text{sequence})$$

The true utility of a state, denoted U(s), is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s
Reminiscent of minimax values of states
Finding the utilities of states
(Figure: an expectimax-style tree with U(s) at a max node over actions and chance nodes over successor states s', weighted by P(s' | s, a).)

What is the expected utility of taking action a in state s?

$$\sum_{s'} P(s' \mid s, a)\, U(s')$$

How do we choose the optimal action?

$$\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

What is the recursive expression for U(s) in terms of the utilities of its successor states?

$$U(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a)\, U(s')$$
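In code, the expected utility of an action and the greedy choice among actions might look as follows; a minimal sketch built on the MDP structure sketched earlier (the names are illustrative):

```python
def q_value(mdp, U, s, a):
    """Expected utility of taking action a in state s: sum over s' of P(s'|s,a) * U(s')."""
    return sum(p * U[s2] for s2, p in mdp.transitions[(s, a)])

def best_action(mdp, U, s):
    """Greedy action with respect to U: argmax over a in A(s) of the expected utility."""
    return max(mdp.actions[s], key=lambda a: q_value(mdp, U, s, a))
```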
The Bellman equation
Recursive relationship between the utilities of successive states: in state s the agent receives reward R(s), chooses the optimal action a, ends up in s' with probability P(s' | s, a), and gets utility U(s') (discounted by γ):

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$
The Bellman equation
Recursive relationship between the utilities of successive states:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

For N states, we get N equations in N unknowns
Solving them solves the MDP
We could try to solve them through expectimax search, but that would run into trouble with infinite sequences
Instead, we solve them algebraically
Two methods: value iteration and policy iteration
Method 1: Value iteration
Start out with every U(s) = 0
Iterate until convergence
During the i-th iteration, update the utility of each state according to this rule:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$

In the limit of infinitely many iterations, guaranteed to find the correct utility values
In practice, we don't need an infinite number of iterations
Value iteration
What effect does the update have?

$$U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')$$
Value iteration demo
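The demo itself is not included in this text. As a substitute, here is a minimal value-iteration sketch that applies the update rule above, using the MDP structure from before; the convergence threshold and the tiny example MDP are made up for illustration.

```python
def value_iteration(mdp, tol=1e-6):
    """Repeatedly apply U_{i+1}(s) = R(s) + gamma * max_a sum_{s'} P(s'|s,a) U_i(s')."""
    U = {s: 0.0 for s in mdp.states}
    while True:
        U_new = {}
        for s in mdp.states:
            if not mdp.actions[s]:                       # terminal state: no successors
                U_new[s] = mdp.rewards[s]
                continue
            best = max(sum(p * U[s2] for s2, p in mdp.transitions[(s, a)])
                       for a in mdp.actions[s])
            U_new[s] = mdp.rewards[s] + mdp.gamma * best
        if max(abs(U_new[s] - U[s]) for s in mdp.states) < tol:
            return U_new
        U = U_new

# Tiny illustrative MDP (made-up numbers), just to exercise the function:
toy = MDP(
    states=["A", "B"],
    actions={"A": ["go", "stay"], "B": []},              # B is terminal
    transitions={("A", "go"): [("B", 0.8), ("A", 0.2)],
                 ("A", "stay"): [("A", 1.0)]},
    rewards={"A": -0.1, "B": 1.0},
    gamma=0.9,
)
print(value_iteration(toy))   # U(A) converges to about 0.76, U(B) = 1.0
```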
Method 2: Policy iteration
Start with some initial policy π_0 and alternate between the following steps:
Policy evaluation: calculate U^{π_i}(s) for every state s
Policy improvement: calculate a new policy π_{i+1} based on the updated utilities:

$$\pi_{i+1}(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U^{\pi_i}(s')$$
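A minimal sketch of this alternation, reusing the MDP structure and the best_action helper from earlier; the use of a fixed number of iterative evaluation sweeps is an illustrative simplification.

```python
def policy_iteration(mdp, eval_sweeps=50):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    # Start with an arbitrary initial policy pi_0.
    pi = {s: mdp.actions[s][0] for s in mdp.states if mdp.actions[s]}
    while True:
        # Policy evaluation: estimate U^pi by repeated fixed-policy updates.
        U = {s: 0.0 for s in mdp.states}
        for _ in range(eval_sweeps):
            for s in mdp.states:
                if not mdp.actions[s]:                   # terminal state
                    U[s] = mdp.rewards[s]
                else:
                    U[s] = mdp.rewards[s] + mdp.gamma * sum(
                        p * U[s2] for s2, p in mdp.transitions[(s, pi[s])])
        # Policy improvement: make the policy greedy with respect to U.
        new_pi = {s: best_action(mdp, U, s) for s in pi}
        if new_pi == pi:
            return pi, U
        pi = new_pi
```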
Policy evaluation
Given a fixed policy π, calculate U^π(s) for every state s
The Bellman equation for the optimal policy:

$$U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')$$

How does it need to change if our policy is fixed?

$$U^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, U^{\pi}(s')$$

Can solve a linear system to get all the utilities!
Alternatively, can apply the following update:

$$U_{i+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi_i(s))\, U_i(s')$$
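Because the max disappears for a fixed policy, the utilities satisfy a linear system that can be solved directly, for example with NumPy. A minimal sketch using the MDP structure from earlier (the matrix construction and terminal-state handling are illustrative assumptions):

```python
import numpy as np

def evaluate_policy_exact(mdp, pi):
    """Solve U = R + gamma * P_pi U, i.e. the linear system (I - gamma * P_pi) U = R."""
    n = len(mdp.states)
    index = {s: i for i, s in enumerate(mdp.states)}
    P_pi = np.zeros((n, n))
    R = np.array([mdp.rewards[s] for s in mdp.states])
    for s in mdp.states:
        if not mdp.actions[s]:            # terminal state: row of zeros gives U(s) = R(s)
            continue
        for s2, p in mdp.transitions[(s, pi[s])]:
            P_pi[index[s], index[s2]] = p
    U = np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, R)
    return {s: U[index[s]] for s in mdp.states}
```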
