Examples of (Deep) Reinforcement Learning
2015: Playing Atari games at a human level
[Human-level control through deep reinforcement learning. Mnih et al. Nature 2015]
Examples of (Deep) Reinforcement Learning
2016: Playing Go (and beating human champion)
[Mastering the game of Go with deep neural networks and tree search. Silver et al. Nature 2016]
Examples of (Deep) Reinforcement Learning
2022: Training Language Assistants with Human Feedback
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, Daya Guo et al., 2025
Turing Award 2024 (the “Nobel Prize” of computing)
ACM has named Andrew G. Barto and Richard S. Sutton as the recipients of the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning.
Non-deterministic search
Example: Grid World
A maze-like problem
The agent lives in a grid
Walls block the agent’s path
Andrey Markov (1856–1922)
This is just like search, where the successor function could only depend on the current state (not the history).
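In symbols, this is the standard memoryless (Markov) property; the notation S_t, A_t for the state and action at time t is a standard convention, not taken from the slides:
\[
P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, \ldots, S_0 = s_0)
\;=\; P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)
\]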
Markov Decision Processes (Grid World)
An MDP is defined by a tuple (S,A,T,R)
Why is it called Markov Decision Process?
Decision: the agent decides what action to take in each time step
Process: the system (environment + agent) is changing over time
The Grid World problem as an MDP
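A minimal sketch of the Grid World transition model in Python, assuming the usual noisy-movement model (intended direction with probability 0.8, each perpendicular direction with probability 0.1, consistent with the Noise = 0.2 setting used later in these slides); the grid layout, exit labels, and function names are illustrative, not from the slides:

from typing import Dict, Tuple

State = Tuple[int, int]          # (row, col) grid cell
Action = str                     # 'N', 'S', 'E', 'W'

# Illustrative 3x4 layout: '#' is a wall, '+1'/'-1' mark terminal exit cells
GRID = [
    [' ', ' ', ' ', '+1'],
    [' ', '#', ' ', '-1'],
    [' ', ' ', ' ', ' '],
]
NOISE = 0.2                      # probability mass that goes sideways
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SIDEWAYS = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def in_bounds(s: State) -> bool:
    r, c = s
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != '#'

def step(s: State, a: Action) -> State:
    """Deterministic move; bumping into a wall or the boundary leaves the agent in place."""
    r, c = s
    dr, dc = MOVES[a]
    s2 = (r + dr, c + dc)
    return s2 if in_bounds(s2) else s

def T(s: State, a: Action) -> Dict[State, float]:
    """Transition distribution P(s' | s, a) with the 0.8 / 0.1 / 0.1 noise model."""
    dist: Dict[State, float] = {}
    outcomes = [(step(s, a), 1.0 - NOISE)]
    for side in SIDEWAYS[a]:
        outcomes.append((step(s, side), NOISE / 2))
    for s2, p in outcomes:
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

# Example: transition distribution for going North from the bottom-left cell
print(T((2, 0), 'N'))   # {(1, 0): 0.8, (2, 1): 0.1, (2, 0): 0.1}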
Policies
A policy π gives an action for each state, π : S → A
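As a tiny illustration (the particular grid cells and action names are hypothetical, not from the slides), a deterministic policy can be stored as a plain mapping from states to actions:

# A deterministic policy pi: S -> A as a dictionary (illustrative 3x4 grid)
policy = {
    (2, 0): 'N', (1, 0): 'N', (0, 0): 'E',
    (0, 1): 'E', (0, 2): 'E',              # head toward the +1 exit
}

def act(state):
    """Return the policy's action for a state."""
    return policy[state]

print(act((2, 0)))   # 'N'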
[Racing MDP transition diagram: states Cool, Warm, Overheated; actions Slow and Fast; same model as in the “Recall: Racing MDP” slide below]
Racing Search Tree
MDP Search Trees
Each MDP state projects an expectimax-like search tree
s is a state
(s, a) is a q-state
(s, a, s’) is called a transition
Why discount?
Reward now is better than later
Also helps our algorithms converge
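Concretely, with discount factor γ a reward received t steps in the future is worth γ^t as much now, so the utility of a reward sequence is the standard discounted sum (consistent with the discount γ used throughout these slides):
\[
U([r_0, r_1, r_2, \ldots]) \;=\; r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \;=\; \sum_{t \ge 0} \gamma^t r_t
\]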
(Setup: a row of grid cells a–e; exiting at the far left yields reward 10, exiting at the far right yields reward 1, discounted by γ per step.)
Quiz 2: For γ = 0.1, what is the optimal policy?  Answer: ← ← →
Quiz 3: For which γ are West and East equally good when in state d?
Answer: 1·γ = 10·γ³, i.e. γ = 1/√10 ≈ 0.316
Infinite Utilities?!
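One standard way to see why discounting keeps utilities finite (the bound itself is not spelled out in the surviving slide text): with rewards bounded in magnitude by R_max and 0 < γ < 1, the geometric series gives
\[
\Bigl|\sum_{t \ge 0} \gamma^t r_t\Bigr| \;\le\; \sum_{t \ge 0} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty
\]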
Recap: Defining MDPs
Markov decision processes:
Set of states S
Start state s0
Set of actions A
Transitions P(s’|s,a) (or T(s,a,s’))
Rewards R(s,a,s’) (and discount γ)
Gridworld Q* Values
Noise = 0.2, Discount = 0.9, Living reward = 0
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
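In symbols, using the slides’ notation T(s,a,s’), R(s,a,s’) and discount γ, the standard Bellman optimality equations that formalize “take the correct first action, then keep being optimal” are:
\[
V^*(s) \;=\; \max_a Q^*(s,a)
\]
\[
Q^*(s,a) \;=\; \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]
\]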
Values of States
Recall: Racing MDP
A robot car wants to travel far, quickly
Three states: Cool, Warm, Overheated
Two actions: Slow, Fast
Going faster gets double reward
Transitions and rewards:
Cool, Slow → Cool (1.0), reward +1
Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
Warm, Fast → Overheated (1.0), reward −10
Overheated is a terminal state
Racing Search Tree
Racing Search Tree
We’re doing way too much work with expectimax!
[Gridworld value iteration demo: value displays after k = 1, 2, 3, …, 12 and k = 100 iterations; Noise = 0.2, Discount = 0.9, Living reward = 0]
Computing Time-Limited Values
Value Iteration
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero
Given a vector of Vk(s) values, do one ply of expectimax from each state: compute Vk+1(s) from the values Vk(s’) one level down
Value iteration computes the time-limited values Vk this way, level by level
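In equation form (the standard value iteration update, written with the slides’ T, R, γ notation), the one-ply expectimax backup is:
\[
V_{k+1}(s) \;=\; \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V_k(s')\,\bigr]
\]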
Value Iteration (again)
Init: ∀s: V(s) = 0
Iterate: ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
Then: V = Vnew
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
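A minimal, runnable sketch of this loop in Python (a generic implementation of the update above; the dictionary-based MDP encoding and the function name are illustrative, not from the slides):

# Generic value iteration: T[s][a] is a list of (next_state, prob, reward) triples.
def value_iteration(T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in T}                      # Init: V(s) = 0 for all states
    for _ in range(iterations):
        V_new = {}
        for s, actions in T.items():
            if not actions:                      # terminal state: no actions, value 0
                V_new[s] = 0.0
                continue
            # One ply of expectimax: max over actions of expected reward + discounted next value
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for outcomes in actions.values()
            )
        V = V_new                                # V <- V_new (batch update)
    return V

# Tiny example: one non-terminal state 'A' that can 'stay' (reward 1) or 'exit' (reward 10)
toy = {
    'A': {'stay': [('A', 1.0, 1.0)], 'exit': [('End', 1.0, 10.0)]},
    'End': {},                                   # terminal
}
print(value_iteration(toy, gamma=0.9))           # {'A': 10.0, 'End': 0.0}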
Example: Value Iteration (Racing MDP, assume no discount!)
V0 = (Cool: 0, Warm: 0, Overheated: 0)
V1(Cool): S: 1; F: .5*2 + .5*2 = 2  →  2
V1(Warm): S: .5*1 + .5*1 = 1; F: -10  →  1
V1 = (2, 1, 0)
V2(Cool): S: 1 + 2 = 3; F: .5*(2+2) + .5*(2+1) = 3.5  →  3.5
V2(Warm): S: .5*(1+2) + .5*(1+1) = 2.5; F: -10  →  2.5
V2 = (3.5, 2.5, 0)
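To check these numbers, here is a self-contained sketch that runs two rounds of the same backup on the Racing MDP with no discount (γ = 1); the dictionary encoding and names are illustrative:

# Racing MDP: state -> action -> list of (next_state, prob, reward)
racing = {
    'Cool': {
        'Slow': [('Cool', 1.0, 1)],
        'Fast': [('Cool', 0.5, 2), ('Warm', 0.5, 2)],
    },
    'Warm': {
        'Slow': [('Cool', 0.5, 1), ('Warm', 0.5, 1)],
        'Fast': [('Overheated', 1.0, -10)],
    },
    'Overheated': {},                      # terminal
}

def backup(V, mdp, gamma):
    """One ply of expectimax (the V_{k+1} update) for every state."""
    return {
        s: max(sum(p * (r + gamma * V[s2]) for s2, p, r in out)
               for out in acts.values()) if acts else 0.0
        for s, acts in mdp.items()
    }

V = {s: 0.0 for s in racing}               # V0
V = backup(V, racing, gamma=1.0)           # V1
print(V)                                   # {'Cool': 2.0, 'Warm': 1.0, 'Overheated': 0.0}
V = backup(V, racing, gamma=1.0)           # V2
print(V)                                   # {'Cool': 3.5, 'Warm': 2.5, 'Overheated': 0.0}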
Convergence*
How do we know the Vk vectors are going to converge?
(assuming 0 < γ < 1)
Proof Sketch:
For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
That last layer is at best all Rmax, and at worst all Rmin
But everything that far out is discounted by γ^k
So Vk and Vk+1 differ by at most γ^k max|R|
So as k increases, the values converge
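Spelled out as an inequality (a standard way to state the sketch above):
\[
\max_s \bigl|V_{k+1}(s) - V_k(s)\bigr| \;\le\; \gamma^{k}\, \max_{s,a,s'} |R(s,a,s')| \;\to\; 0 \quad \text{as } k \to \infty
\]
Since these differences shrink geometrically, the sequence of Vk values is Cauchy and converges (to V*, as noted above).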