Algorithms To Solve An MDP

The document discusses algorithms for solving Markov decision processes (MDPs), including value iteration, policy iteration, and Q-learning. Value iteration and policy iteration are model-based methods that use dynamic programming to iteratively estimate state values or policies. Q-learning is a model-free method that learns through trial-and-error interactions with an environment to estimate state-action values.


ALGORITHMS TO SOLVE AN MDP
TO SOLVE AN MDP REFERS TO ESTIMATING THE OPTIMAL
STATE VALUES OR THE OPTIMAL POLICY, USING WHICH OUR
AGENT CAN MAKE SMART DECISIONS REGARDING THE
ACTIONS TO BE TAKEN IN THE ENVIRONMENT.
HOWEVER, SINCE STATE VALUES ARE THE EXPECTED RETURNS
FROM A STATE OVER A LARGE NUMBER OF FUTURE ACTIONS
(OR TILL THE END OF A LONG EPISODE), WE CANNOT DIRECTLY
COMPUTE THEM BY ENUMERATING ALL POSSIBLE PATHWAYS.
HENCE WE DEPEND ON ITERATIVE OR RECURSIVE SOLUTIONS
(DYNAMIC PROGRAMMING), WHERE THE VALUE OF ONE
QUANTITY CAN BE ESTIMATED IN TERMS OF THE VALUE OF THE
SAME QUANTITY AT THE NEXT TIME STEP / STATE.
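
As a concrete form of this recursion (standard RL notation; the equation itself is not written out on this slide), the Bellman expectation equation expresses the value of a state under a policy π in terms of the value of the next state:

V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma \, V^{\pi}(S_{t+1}) \;\middle|\; S_t = s \,\right]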
MODEL-BASED AND MODEL-FREE METHODS
MODEL-BASED METHODS (VALUE ITERATION, POLICY ITERATION) USE A KNOWN MODEL OF THE ENVIRONMENT'S DYNAMICS, WHILE MODEL-FREE METHODS (Q-LEARNING) LEARN FROM TRIAL-AND-ERROR INTERACTION WITH THE ENVIRONMENT.
Value Iteration
In value iteration, we start by assuming random state values for each state. Using these random values, we iterate through the different state-action pairs with dynamic programming and correspondingly update the state values till convergence.

We use the following update rule, derived from the Bellman equation:

V(s) \leftarrow \max_a \sum_{s'} p(s' \mid s, a) \left[ r + \gamma \, V(s') \right]

Let's create a simple game

Consider a 3x4 grid as our environment in which our agent, Superhero, is trapped. He needs to reach the green cell, from where he can gain higher powers and get out; but if he falls into the red cell, he dies in the deathtrap.

We need to view this as an RL model:

11 STATES (12 CELLS MINUS ONE BLOCKED CELL), 4 ACTIONS, +1 / -1 REWARDS
MAY THE BEST STATE WIN (I.E., BE SELECTED)!
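
As a rough sketch, this grid could be encoded in Python as below. The layout (blocked cell in the middle of the second row, green cell top-right, red cell just below it), the discount factor of 0.9 and the "move as intended with probability 0.8, slip sideways with probability 0.1 each" dynamics are assumptions chosen to be consistent with the calculations on the following slides; all names are illustrative.

# Illustrative encoding of the 3x4 Superhero grid (assumed layout and dynamics).
ROWS, COLS = 3, 4
BLOCKED = {(1, 1)}                        # assumed position of the blocked cell
TERMINALS = {(0, 3): +1.0, (1, 3): -1.0}  # green cell (+1) and red deathtrap (-1)
GAMMA = 0.9                               # discount factor used in the worked example
ACTIONS = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
STATES = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in BLOCKED]  # 11 states

def move(state, delta):
    """Apply a move; bounce back if it would leave the grid or hit the blocked cell."""
    r, c = state[0] + delta[0], state[1] + delta[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS and (r, c) not in BLOCKED else state

def transitions(state, action):
    """Return [(probability, next_state)]: 0.8 intended direction, 0.1 each perpendicular slip."""
    d = ACTIONS[action]
    left_slip, right_slip = (-d[1], d[0]), (d[1], -d[0])
    return [(0.8, move(state, d)), (0.1, move(state, left_slip)), (0.1, move(state, right_slip))]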
Step 1: Initialisation
In the first step, we have no clue about the state values, so we either allot them randomly or just bring them all to 0:

0   0   0   0
0       0   0
0   0   0   0

(The gap in the middle row is the blocked cell of the grid.)
Step 2: Q value estimation
After the first update, the terminal cells take their rewards, and we find our state values to be as follows:

0   0   0   +1
0       0   -1
0   0   0    0

Next, we use these updated values to find the Q values for all state-action pairs in the model. First, let's focus on one particular state: (1,3), the cell just to the left of the green goal. The Q values for this state are (discount factor 0.9; the three terms in each line correspond to the agent actually ending up right, down or left of its cell):

Q((1,3), RIGHT) = 0.8*(0 + 0.9*1) + 0.1*(0 + 0.9*0) + 0.1*(0 + 0.9*0) = 0.72
Q((1,3), DOWN)  = 0.1*(0 + 0.9*1) + 0.8*(0 + 0.9*0) + 0.1*(0 + 0.9*0) = 0.09
Q((1,3), LEFT)  = 0.1*(0 + 0.9*1) + 0.1*(0 + 0.9*0) + 0.8*(0 + 0.9*0) = 0.09

The state value of (1,3) is then updated to the maximum of these Q values, 0.72.
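
The arithmetic above can be checked with a few lines of Python; this simply re-evaluates Q = Σ p · (r + γ·V(next)) with the slide's numbers (0.8/0.1 probabilities, γ = 0.9, immediate reward 0):

# Re-computing the slide's Q values for state (1,3); V = 1 for the goal cell, 0 for the others.
gamma = 0.9
v_landing = {"right": 1.0, "down": 0.0, "left": 0.0}   # values of the cells the agent can land in

def q(p_right, p_down, p_left):
    probs = {"right": p_right, "down": p_down, "left": p_left}
    return sum(p * (0 + gamma * v_landing[cell]) for cell, p in probs.items())

print(q(0.8, 0.1, 0.1))  # intended RIGHT -> 0.72
print(q(0.1, 0.8, 0.1))  # intended DOWN  -> 0.09
print(q(0.1, 0.1, 0.8))  # intended LEFT  -> 0.09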
Step 3: State Value Updation
Now, we repeat steps 2 and 3 with our updated state value table, and in the further iterations we get:

After the second iteration:
0      0      0.72    1
0      0      0      -1
0      0      0       0

After the third iteration:
0      0.52   0.78    1
0      0      0.43   -1
0      0      0       0

We keep repeating these iterations until the difference between the current and the previous iteration is negligible, i.e. less than a chosen threshold. The state value function then converges, and this convergent function is our final optimal state value function!
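
Continuing the illustrative grid encoding sketched earlier, the whole value iteration loop with a convergence threshold could look like this (a sketch, not the slides' own code):

def value_iteration(theta=1e-4):
    """Sweep over all states, backing up V(s) = max_a sum_s' p(s'|s,a) * (r + GAMMA * V(s')),
    until the largest change in a sweep falls below the threshold theta."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                new_v = TERMINALS[s]   # terminal cells keep their reward as their value
            else:
                new_v = max(sum(p * (0 + GAMMA * V[s2]) for p, s2 in transitions(s, a))
                            for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V

V_star = value_iteration()   # e.g. V_star[(0, 2)] is the converged value of the cell next to the goal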
Step 4: Optimal Policy
With our optimal state value function, we can easily construct a deterministic optimal policy: from any state, the chosen action should move the agent towards the neighbouring state with the maximum state value among all the possibilities.

Here, since the value of state (1,3) is the largest among all the neighbouring cells of state (1,2), the policy asks the Superhero Pranay to move right from cell (1,2), and similarly from state (1,3) to state (1,4), where he wins!
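
With the converged values, the deterministic policy can be read off using a one-step lookahead over the stochastic moves of the same illustrative grid encoding, which is the usual way of implementing "move towards the neighbour with the highest value":

def greedy_policy(V):
    """For each non-terminal state, pick the action with the best expected backed-up value."""
    policy = {}
    for s in STATES:
        if s in TERMINALS:
            continue
        policy[s] = max(ACTIONS, key=lambda a: sum(p * (0 + GAMMA * V[s2])
                                                   for p, s2 in transitions(s, a)))
    return policy

pi_star = greedy_policy(value_iteration())   # e.g. pi_star[(0, 2)] == "RIGHT": head for the green cell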
Policy Iteration
Unlike value iteration, in policy iteration we start by assuming a random policy for the agent, as we do not know which action would be optimal.

Then, for this policy, we calculate the state values, which are initially all random. After calculating the new state values, we update our policy accordingly and iterate till convergence.

Policy iteration has 2 main steps:
1) Policy Evaluation
2) Policy Improvement
Policy Evaluation
Now, having fixed a random policy, we can initialise our state values to 0 and, using our infamous Bellman equation, compute the state values for the next iteration:

V(s) = r(s) + \gamma \sum_{s'} p(s' \mid s, \pi(s)) \, V(s')

Example: Let our random policy be to take North every time from S1, S2 and S3. Under this policy, the state values work out to:

V[S1] = 0      V[S4] = -2
V[S2] = 2      V[S5] = 1
V[S3] = 1      V[S6] = -0.5

(Figure: state diagram and transition probabilities.)
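
A generic policy-evaluation sweep matching the equation above might look like this; the actual transition probabilities for S1–S6 are only given in the slide's figure, so states, reward, trans_prob and policy are placeholder inputs:

def policy_evaluation(states, reward, trans_prob, policy, gamma=0.9, theta=1e-6):
    """Repeatedly apply V(s) = r(s) + gamma * sum_s' p(s'|s, pi(s)) * V(s') until convergence.
    trans_prob maps (state, action) -> {next_state: probability}."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = reward[s] + gamma * sum(p * V[s2]
                                            for s2, p in trans_prob[(s, policy[s])].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            return V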


Policy Improvement
After computing the state values, before jumping straight to the next iteration, we first check whether our policy is optimal. We do that by changing our policy so that, from each state, the agent follows the action leading to the state with the maximum state value.

In this case, we can check that for state S2 the optimal action according to the Bellman equation would be to move South, hence we change our policy to {N, S, N}.
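
Putting evaluation and improvement together, with the same placeholder inputs as the evaluation sketch above, policy iteration could be written as:

def policy_iteration(states, actions, reward, trans_prob, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until the policy stops changing."""
    policy = {s: actions[0] for s in states}      # arbitrary (e.g. "all North") initial policy
    while True:
        V = policy_evaluation(states, reward, trans_prob, policy, gamma)
        stable = True
        for s in states:
            best = max(actions, key=lambda a: sum(p * V[s2]
                                                  for s2, p in trans_prob[(s, a)].items()))
            if best != policy[s]:                 # improvement step: switch to the better action
                policy[s] = best
                stable = False
        if stable:
            return policy, V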
Q-Learning
Fundamental algorithm for deep RL networks
Q-learning is a popular model-free reinforcement
learning algorithm used to find an optimal policy for
an agent in an environment without requiring a model of
the environment's dynamics.

QUITE INTERESTING IS ALL I’LL SAY


Step 1: Initialisation
In the first step, we have no clue about the state-action Q values, so we either allot them randomly or bring them all to 0.

Consider another simple grid game: a 3x3 grid where the player starts in the Start square and wants to reach the Goal square as their final destination, while avoiding danger cells.

This problem has 9 states, since the player can be positioned in any of the 9 squares of the grid, and it has 4 actions. So we construct a Q-table with 9 rows and 4 columns.
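
A minimal sketch of that initialisation (the numbering of states and actions is illustrative):

import numpy as np

N_STATES, N_ACTIONS = 9, 4              # 9 grid squares, 4 moves (up / down / left / right)
Q = np.zeros((N_STATES, N_ACTIONS))     # Q-table: one row per state, one column per action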
Step 2: Action!
Unlike model-based methods, here in the Q-learning algorithm we need to actually interact with the environment to learn something about it. Hence, from whichever state we are in, we need to pick an action.
How do we choose the action?

Here, we use the concept of the exploration-exploitation trade-off.


Step 2: Action!
EXPLORATION
When we first start learning, we have no idea which actions are 'good' and which are 'bad'. So we go through a process of discovery where we randomly try different actions and observe the rewards. Hence, exploration refers to the policy of randomly choosing an action from any particular state so that we "explore" the different options.

EXPLOITATION
On the other end of the spectrum, when the model is fully trained, we have already explored all possible actions, so we can pick the best actions, which will yield the maximum return. Hence, exploitation refers to the policy of "exploiting" our knowledge of the best action and following that rather than taking risks.

Seeing that both exploration and exploitation have their benefits, we need to find a perfect balance between the two.
Step 2: Action!
Here we use a policy called the ε-greedy policy.
Whenever the agent picks an action in a state, it selects a random action with probability ε, and with probability 1 − ε it selects the best action known so far.

In the first few iterations, we have learnt close to nothing about the model, hence it is better to explore the different options rather than rely on our little knowledge of the model. So ε is close to 1 in the beginning, encouraging exploration. Towards the end of an episode, or after a large number of iterations, we have learnt almost everything about the model, and we can focus on that knowledge to extract the best actions and gain greater rewards. Hence, after a very large number of iterations, ε boils down to 0, encouraging exploitation.
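
A sketch of ε-greedy action selection with a decaying ε; the exact decay schedule (multiplicative, with a floor) is an assumption, since the slides only say that ε starts near 1 and falls towards 0:

import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """With probability epsilon pick a random action (explore); otherwise the best known one (exploit)."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[state]))

rng = np.random.default_rng(0)
epsilon, eps_min, eps_decay = 1.0, 0.01, 0.995    # assumed schedule
# after each episode: epsilon = max(eps_min, epsilon * eps_decay)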
Step 3: Q value estimation
Now that we have picked our action, we need to build our intuition behind the update formula.
We know that after picking the action, the reward we get (say r) plays a role in the Q value. For an optimum Q value, we can say that:

The Q value of a particular state-action pair is the reward after taking that action, plus the optimum return expected from the next state's actions.

Here, a3 is the action which has the highest Q value of all the actions from state S3.

Hence, we say that Q value = reward + gamma * (maximum Q value over all actions in the next state).

Therefore, we should be training our model such that the two values are close; in other words, the ERROR between them is reduced.
Step 3: Q value estimation
Therefore, we need to add the error term to our estimate so that the two ways of calculating the state-action value:

1. the state-action value from the current state,
2. the immediate reward from taking one step plus the state-action value from the next state,

provide very similar results.

Hence, the error term is added, scaled by a learning rate α:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

(As before, a3 is the action which has the highest Q value of all the actions from state S3.)

The algorithm incrementally updates its Q-value estimates in a way that reduces this error. The process is then repeated for the next state after we take the particular action, till CONVERGENCE.
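
Combining ε-greedy action selection and the update above into one training loop could look like the sketch below. The env object with reset() and step() returning (next_state, reward, done) is an assumed Gym-style interface, and the hyperparameter values are illustrative, not from the slides:

import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=1.0, eps_min=0.01, eps_decay=0.995):
    """Tabular Q-learning: act epsilon-greedily, then nudge Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng()
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (explore vs. exploit)
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # TD error between the two estimates of Q(s, a); no bootstrap term from terminal states
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
        epsilon = max(eps_min, epsilon * eps_decay)   # gradually shift from exploration to exploitation
    return Q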
Step 3: Putting it all together

Note that when we compare the two ways of computing the Q value for a particular state-action pair, we take the action in the next state which has the maximum Q value. However, it is not necessary that in the next iteration we actually follow that action, because we may instead be exploring with probability ε. Hence, since the algorithm does not follow exactly the same policy as the one it uses to compute its Q values, it is called an OFF-POLICY algorithm.
THANK YOU!
