Examples of (Deep) Reinforcement Learning
2015: Playing Atari games at a human level
[Human-level control through deep reinforcement learning. Mnih et al. Nature 2015]
Examples of (Deep) Reinforcement Learning
2016: Playing Go (and beating human champion)
[Mastering the game of Go with deep neural networks and tree search. Silver et al. Nature 2016]
Examples of (Deep) Reinforcement Learning
2022: Training Language Assistants with Human Feedback
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, Daya Guo et al., 2025
Turing Award 2024 (the “Nobel Prize” of computing)
ACM has named Andrew G. Barto and Richard S. Sutton as the recipients of the 2024 ACM A.M. Turing Award for developing the conceptual and algorithmic foundations of reinforcement learning.
Non-deterministic search
Example: Grid World
A maze-like problem
The agent lives in a grid
Walls block the agent’s path
Andrey Markov (1856–1922)
This is just like search, where the successor function could only depend on the current state (not the history).
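In symbols, this is the standard memoryless (Markov) property; the notation S_t, A_t for the state and action at time t is a standard convention, not taken from the slides:
\[
P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, \ldots, S_0 = s_0)
\;=\; P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)
\]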
Markov Decision Processes (Grid World)
An MDP is defined by a tuple (S,A,T,R)
Why is it called Markov Decision Process?
Decision: the agent decides what action to take in each time step
Process: the system (environment + agent) is changing over time
The Grid World problem as an MDP
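A minimal sketch of the Grid World transition model in Python, assuming the usual noisy-movement model (intended direction with probability 0.8, each perpendicular direction with probability 0.1, consistent with the Noise = 0.2 setting used later in these slides); the grid layout, exit labels, and function names are illustrative, not from the slides:

from typing import Dict, Tuple

State = Tuple[int, int]          # (row, col) grid cell
Action = str                     # 'N', 'S', 'E', 'W'

# Illustrative 3x4 layout: '#' is a wall, '+1'/'-1' mark terminal exit cells
GRID = [
    [' ', ' ', ' ', '+1'],
    [' ', '#', ' ', '-1'],
    [' ', ' ', ' ', ' '],
]
NOISE = 0.2                      # probability mass that goes sideways
MOVES = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
SIDEWAYS = {'N': 'EW', 'S': 'EW', 'E': 'NS', 'W': 'NS'}

def in_bounds(s: State) -> bool:
    r, c = s
    return 0 <= r < len(GRID) and 0 <= c < len(GRID[0]) and GRID[r][c] != '#'

def step(s: State, a: Action) -> State:
    """Deterministic move; bumping into a wall or the boundary leaves the agent in place."""
    r, c = s
    dr, dc = MOVES[a]
    s2 = (r + dr, c + dc)
    return s2 if in_bounds(s2) else s

def T(s: State, a: Action) -> Dict[State, float]:
    """Transition distribution P(s' | s, a) with the 0.8 / 0.1 / 0.1 noise model."""
    dist: Dict[State, float] = {}
    outcomes = [(step(s, a), 1.0 - NOISE)]
    for side in SIDEWAYS[a]:
        outcomes.append((step(s, side), NOISE / 2))
    for s2, p in outcomes:
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

# Example: transition distribution for going North from the bottom-left cell
print(T((2, 0), 'N'))   # {(1, 0): 0.8, (2, 1): 0.1, (2, 0): 0.1}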
Policies
A policy π gives an action for each state, π : S → A
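As a tiny illustration (the particular grid cells and action names are hypothetical, not from the slides), a deterministic policy can be stored as a plain mapping from states to actions:

# A deterministic policy pi: S -> A as a dictionary (illustrative 3x4 grid)
policy = {
    (2, 0): 'N', (1, 0): 'N', (0, 0): 'E',
    (0, 1): 'E', (0, 2): 'E',              # head toward the +1 exit
}

def act(state):
    """Return the policy's action for a state."""
    return policy[state]

print(act((2, 0)))   # 'N'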
[Racing MDP transition diagram: states Cool, Warm, Overheated; actions Slow and Fast; same model as in the “Recall: Racing MDP” slide below]
Racing Search Tree
MDP Search Trees
Each MDP state projects an expectimax-like search tree
s is a state
(s, a) is a q-state
(s, a, s’) is called a transition
Why discount?
Reward now is better than later
Also helps our algorithms converge
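Concretely, with discount factor γ a reward received t steps in the future is worth γ^t as much now, so the utility of a reward sequence is the standard discounted sum (consistent with the discount γ used throughout these slides):
\[
U([r_0, r_1, r_2, \ldots]) \;=\; r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots \;=\; \sum_{t \ge 0} \gamma^t r_t
\]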
(Setup: a row of grid cells a–e; exiting at the far left yields reward 10, exiting at the far right yields reward 1, discounted by γ per step.)
Quiz 2: For γ = 0.1, what is the optimal policy?  Answer: ← ← →
Quiz 3: For which γ are West and East equally good when in state d?
Answer: 1·γ = 10·γ³, i.e. γ = 1/√10 ≈ 0.316
Infinite Utilities?!
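One standard way to see why discounting keeps utilities finite (the bound itself is not spelled out in the surviving slide text): with rewards bounded in magnitude by R_max and 0 < γ < 1, the geometric series gives
\[
\Bigl|\sum_{t \ge 0} \gamma^t r_t\Bigr| \;\le\; \sum_{t \ge 0} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma} \;<\; \infty
\]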
Recap: Defining MDPs
Markov decision processes:
Set of states S
Start state s0
Set of actions A
Transitions P(s’|s,a) (or T(s,a,s’))
Rewards R(s,a,s’) (and discount γ)
Gridworld Q* Values
Noise = 0.2, Discount = 0.9, Living reward = 0
The Bellman Equations
How to be optimal:
Step 1: Take correct first action
Step 2: Keep being optimal
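In symbols, using the slides’ notation T(s,a,s’), R(s,a,s’) and discount γ, the standard Bellman optimality equations that formalize “take the correct first action, then keep being optimal” are:
\[
V^*(s) \;=\; \max_a Q^*(s,a)
\]
\[
Q^*(s,a) \;=\; \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V^*(s')\,\bigr]
\]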
Values of States
Recall: Racing MDP
A robot car wants to travel far, quickly
Three states: Cool, Warm, Overheated
Two actions: Slow, Fast
Going faster gets double reward
Transitions and rewards:
Cool, Slow → Cool (1.0), reward +1
Cool, Fast → Cool (0.5) or Warm (0.5), reward +2
Warm, Slow → Cool (0.5) or Warm (0.5), reward +1
Warm, Fast → Overheated (1.0), reward −10
Overheated is a terminal state
Racing Search Tree
Racing Search Tree
We’re doing way too much work with expectimax!
[Gridworld value iteration demo: value displays after k = 1, 2, 3, …, 12 and k = 100 iterations; Noise = 0.2, Discount = 0.9, Living reward = 0]
Computing Time-Limited Values
Value Iteration
Value Iteration
Start with V0(s) = 0: no time steps left means an expected reward sum of zero
Given a vector of Vk(s) values, do one ply of expectimax from each state: compute Vk+1(s) from the values Vk(s’) one level down
Value iteration computes the time-limited values Vk this way, level by level
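In equation form (the standard value iteration update, written with the slides’ T, R, γ notation), the one-ply expectimax backup is:
\[
V_{k+1}(s) \;=\; \max_a \sum_{s'} T(s,a,s')\,\bigl[\,R(s,a,s') + \gamma\, V_k(s')\,\bigr]
\]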
Value Iteration (again)
Init: ∀s: V(s) = 0
Iterate: ∀s: Vnew(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ]
Then: V = Vnew
Note: can even directly assign to V(s), which will not compute the sequence of Vk but will still converge to V*
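A minimal, runnable sketch of this loop in Python (a generic implementation of the update above; the dictionary-based MDP encoding and the function name are illustrative, not from the slides):

# Generic value iteration: T[s][a] is a list of (next_state, prob, reward) triples.
def value_iteration(T, gamma=0.9, iterations=100):
    V = {s: 0.0 for s in T}                      # Init: V(s) = 0 for all states
    for _ in range(iterations):
        V_new = {}
        for s, actions in T.items():
            if not actions:                      # terminal state: no actions, value 0
                V_new[s] = 0.0
                continue
            # One ply of expectimax: max over actions of expected reward + discounted next value
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                for outcomes in actions.values()
            )
        V = V_new                                # V <- V_new (batch update)
    return V

# Tiny example: one non-terminal state 'A' that can 'stay' (reward 1) or 'exit' (reward 10)
toy = {
    'A': {'stay': [('A', 1.0, 1.0)], 'exit': [('End', 1.0, 10.0)]},
    'End': {},                                   # terminal
}
print(value_iteration(toy, gamma=0.9))           # {'A': 10.0, 'End': 0.0}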
Example: Value Iteration (Racing MDP, assume no discount!)
V0 = (Cool: 0, Warm: 0, Overheated: 0)
V1(Cool): S: 1; F: .5*2 + .5*2 = 2  →  2
V1(Warm): S: .5*1 + .5*1 = 1; F: -10  →  1
V1 = (2, 1, 0)
V2(Cool): S: 1 + 2 = 3; F: .5*(2+2) + .5*(2+1) = 3.5  →  3.5
V2(Warm): S: .5*(1+2) + .5*(1+1) = 2.5; F: -10  →  2.5
V2 = (3.5, 2.5, 0)
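To check these numbers, here is a self-contained sketch that runs two rounds of the same backup on the Racing MDP with no discount (γ = 1); the dictionary encoding and names are illustrative:

# Racing MDP: state -> action -> list of (next_state, prob, reward)
racing = {
    'Cool': {
        'Slow': [('Cool', 1.0, 1)],
        'Fast': [('Cool', 0.5, 2), ('Warm', 0.5, 2)],
    },
    'Warm': {
        'Slow': [('Cool', 0.5, 1), ('Warm', 0.5, 1)],
        'Fast': [('Overheated', 1.0, -10)],
    },
    'Overheated': {},                      # terminal
}

def backup(V, mdp, gamma):
    """One ply of expectimax (the V_{k+1} update) for every state."""
    return {
        s: max(sum(p * (r + gamma * V[s2]) for s2, p, r in out)
               for out in acts.values()) if acts else 0.0
        for s, acts in mdp.items()
    }

V = {s: 0.0 for s in racing}               # V0
V = backup(V, racing, gamma=1.0)           # V1
print(V)                                   # {'Cool': 2.0, 'Warm': 1.0, 'Overheated': 0.0}
V = backup(V, racing, gamma=1.0)           # V2
print(V)                                   # {'Cool': 3.5, 'Warm': 2.5, 'Overheated': 0.0}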
Convergence*
How do we know the Vk vectors are going to converge?
(assuming 0 < γ < 1)
Proof Sketch:
For any state, Vk and Vk+1 can be viewed as depth-(k+1) expectimax results in nearly identical search trees
The difference is that on the bottom layer, Vk+1 has actual rewards while Vk has zeros
That last layer is at best all Rmax, and at worst all Rmin
But everything that far out is discounted by γ^k
So Vk and Vk+1 differ by at most γ^k max|R|
So as k increases, the values converge
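Spelled out as an inequality (a standard way to state the sketch above):
\[
\max_s \bigl|V_{k+1}(s) - V_k(s)\bigr| \;\le\; \gamma^{k}\, \max_{s,a,s'} |R(s,a,s')| \;\to\; 0 \quad \text{as } k \to \infty
\]
Since these differences shrink geometrically, the sequence of Vk values is Cauchy and converges (to V*, as noted above).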