Markov Decision Processes Analysis
1 INTRODUCTION
This report summarizes the analysis done on two Markov Decision Process (MDP) problems. In an MDP, an agent interacts with an environment: at each time step, the agent takes an action, which changes the environment state and yields some reward or penalty. The goal is to find an optimal policy that maximizes the total reward, either for reaching a goal state or for acting over a certain time horizon. I have chosen Frozen Lake (grid world) and Forest Management (non-grid world) as the two MDP problems. Both are solved using the Policy Iteration, Value Iteration, and Q-Learning algorithms. The sizes of the two problems and the parameters of the algorithms are varied and analyzed in this report. I have used the OpenAI Gym and Markov Decision Process (MDP) Toolbox Python libraries for this analysis.
Forest Management Problem - This is a non-grid-world problem in which the agent has two actions, Wait (0) and Cut (1). The action is decided based on two objectives: first, to maintain an old forest for wildlife, and second, to earn a reward by selling wood. The problem has a stochastic component: there is a probability p that a fire burns the forest down. The states are [0, 1, ..., S-1], where 0 is the youngest state and S-1 is the oldest. If the agent takes action Cut, the forest goes back to state 0 and the agent gets a reward for selling the wood (r2 when cutting the oldest forest); if the agent Waits for the larger reward r1 of preserving old forest for wildlife (paid in the oldest state), the forest keeps ageing, but with probability p a fire resets it to state 0. This problem is interesting because it is a real-world problem and because of its stochastic nature: in some states Waiting is ideal, and in others Cutting for an immediate reward is better.
There are two types of algorithms analyzed in this report: model-based algorithms and model-free algorithms. In model-based learning, the agent interacts with the environment and builds state-transition and reward models; the agent then iterates over these models using Value Iteration or Policy Iteration to find the optimal policy. In model-free learning, on the other hand, the agent does not learn an explicit state-transition and reward model; it interacts with the environment and derives the optimal policy directly (Q-Learning, SARSA).
A few function definitions before discussing the algorithms:
V(s), the value function, is the expected total reward for an agent starting in state s. It represents how good an environment state is for the agent. The value function with the highest reward is the optimal value function V*(s).
Q(s,a), the Q-function, gives the expected total reward for an agent that is in state s and takes action a; in other words, how good it is for the agent to pick action a when in state s. The optimal Q-function is Q*(s,a).
The Bellman equation defines a recursive way to calculate the optimal Q-function (dynamic programming paradigm); see the equations below.
π*(s), the optimal policy, gives the optimal action for state s with the maximum total reward.
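For reference, the standard Bellman optimality equations tying these quantities together (written with P as the transition model and R as the reward model, matching the (A, S, S) matrices used later in this report) are:

    Q^*(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]
    V^*(s) = \max_{a} Q^*(s,a), \qquad \pi^*(s) = \arg\max_{a} Q^*(s,a)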
Value Iteration (VI) - The VI algorithm iteratively improves V(s) and computes the optimal state values. First we initialize V(s) to arbitrary values, then update Q(s,a) and V(s) until they converge. VI is guaranteed to converge to the optimal policy π∗.
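A minimal sketch of the VI update, assuming the (A, S, S) transition and reward arrays described later (this is an illustrative implementation, not the exact MDPToolbox code):

    import numpy as np

    def value_iteration(P, R, gamma=0.999, epsilon=1e-4):
        # P, R: (A, S, S) transition-probability and reward arrays (assumed shapes)
        S = P.shape[1]
        V = np.zeros(S)
        while True:
            # Q[a, s]: expected immediate reward plus discounted value of the next state
            Q = (P * (R + gamma * V)).sum(axis=2)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < epsilon:   # epsilon stopping criterion
                return V_new, Q.argmax(axis=0)        # converged values and greedy policy
            V = V_new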
Policy Iteration (PI) - In contrast to VI, instead of repeatedly improving the V(s) estimate, PI re-defines the policy π at each step and computes V(s) according to this new policy until the policy converges. Since the optimal policy often converges before the value function, PI typically takes fewer iterations than VI. PI is also guaranteed to converge to the optimal policy π∗.
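A minimal sketch of one PI round, again assuming (A, S, S) arrays (MDPToolbox also offers an iterative policy-evaluation mode instead of the exact linear solve shown here):

    import numpy as np

    def policy_iteration(P, R, gamma=0.999):
        # P, R: (A, S, S) transition-probability and reward arrays (assumed shapes)
        S = P.shape[1]
        policy = np.zeros(S, dtype=int)
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
            P_pi = P[policy, np.arange(S)]
            r_pi = (P_pi * R[policy, np.arange(S)]).sum(axis=1)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
            # Policy improvement: act greedily with respect to the evaluated V
            new_policy = (P * (R + gamma * V)).sum(axis=2).argmax(axis=0)
            if np.array_equal(new_policy, policy):
                return V, policy                      # policy has converged
            policy = new_policy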
Q-Learning (QL) - QL is an example of model-free learning: the agent has no state-transition or reward model and discovers the optimal policy by trial and error (exploration and exploitation). In QL, we approximate the Q-function using the Q(s,a) samples observed so far; this approach is called Temporal-Difference learning. There are three Greek-letter hyperparameters. Alpha, the learning rate, controls how much we trust newly observed samples when estimating the Q-function. Gamma defines how important future rewards are. Epsilon balances exploration and exploitation: with probability epsilon we take a random action. For Q-Learning we decay both alpha and epsilon over time.
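A minimal tabular Q-learning sketch with epsilon-greedy exploration and decayed alpha/epsilon, assuming the Gym 0.26-style reset/step API (the actual training loop used for the experiments has an episode structure, described later):

    import numpy as np

    def q_learning(env, gamma=0.999, alpha=0.1, epsilon=1.0,
                   alpha_decay=0.999, epsilon_decay=0.9999, n_steps=1_000_000):
        # Tabular Q-learning: learn Q(s, a) from sampled transitions only (model-free)
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        s, _ = env.reset()
        for _ in range(n_steps):
            # Explore with probability epsilon, otherwise exploit the current Q table
            a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            # Temporal-difference update towards the observed sample
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2 if not (terminated or truncated) else env.reset()[0]
            alpha *= alpha_decay       # trust accumulated estimates more over time
            epsilon *= epsilon_decay   # shift from exploration to exploitation
        return Q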
For the Frozen Lake (FL) problem, I did my analysis on grid sizes of 4x4 (16 states), 8x8 (64 states), and 32x32 (1024 states). I used OpenAI Gym to create the environment, and from the environment I computed the transition and reward models as (A, S, S) matrices. These models are then fed into the MDPToolbox Value Iteration method with combinations of gamma and epsilon to see the impact of these hyperparameters. Note that we are using the stochastic FL; there is no negative reward per step, and the reward for reaching the goal is 1 for the experiments in the table below. I have also done an analysis for each algorithm with a negative reward at each step (which impacts Q-Learning convergence time), explained in the respective subsections.
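A sketch of that pipeline for the 8x8 map, assuming the registered FrozenLake8x8-v1 environment (the 32x32 map was generated separately and is not shown here):

    import gym
    import numpy as np
    import mdptoolbox.mdp

    env = gym.make("FrozenLake8x8-v1").unwrapped
    S, A = env.observation_space.n, env.action_space.n

    # Build (A, S, S) transition and reward matrices from the environment model env.P
    P = np.zeros((A, S, S))
    R = np.zeros((A, S, S))
    for s in range(S):
        for a in range(A):
            for prob, s2, reward, _done in env.P[s][a]:
                P[a, s, s2] += prob
                R[a, s, s2] = reward

    # Feed the models into MDPToolbox Value Iteration
    vi = mdptoolbox.mdp.ValueIteration(P, R, discount=0.999, epsilon=1e-4)
    vi.run()
    print(vi.iter, round(vi.time, 3), max(vi.V))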
For tuning the VI hyperparameters, I used gamma values [0.3, 0.6, 0.9, 0.99, 0.999] and epsilon values [1e-2, 1e-4, 1e-5, 1e-6]. The table below shows the impact of gamma and epsilon on VI for all three problem sizes by comparing time, iterations, and rewards.
Gamma, the discount factor, is the per-time-step discount applied to future rewards. It varies between 0 and 1, and the higher the gamma, the more you value future rewards.
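In other words, the quantity being maximized is the standard discounted return

    G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

so a gamma close to 1 weights distant rewards almost as much as immediate ones.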
Gamma impact on iterations, time, and reward - From the table we can see that as we increase gamma, the number of iterations increases. This is because the reward for reaching the goal takes more iterations to decay close to 0 (the point where jumping into a hole is equivalent to the future reward). An interesting observation appears when we compare iterations across problem sizes for different gamma values: for small gamma, the 4x4, 8x8, and 32x32 problems take an equal number of iterations, but for higher values like 0.99 and 0.999 the larger problems take more iterations and time (time is directly proportional to problem size). The reason is that the future reward is discounted by a very small factor, and as the problem grows, the number of holes and the complexity increase, so there is more variation in the value function and it takes more iterations to converge. Iterations are directly proportional to time. Also, since gamma discounts the reward at each step, the gamma value is directly proportional to the reward.
Epsilon, the stopping criterion - On each iteration the maximum change in V(s) is compared against epsilon; if the change falls below epsilon, V(s) is considered to have converged to V*(s).
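Written out, this stopping rule after sweep k is

    \max_{s} \left| V_{k+1}(s) - V_{k}(s) \right| < \epsilon

(MDPToolbox's ValueIteration applies a discount-scaled variant of this threshold internally, but the idea is the same).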
Epsilon impact on iterations, time, and reward - Epsilon is inversely proportional to time, iterations, and reward for all three problem sizes. This is because epsilon is the stopping criterion: if epsilon is high, the value function converges early; if epsilon is small, the algorithm runs for more iterations to improve the value function and reduce the maximum change in V(s) per iteration. Interestingly, for epsilon values 0.01 and 0.0001 with gamma 0.999, we see PI takes fewer iterations for the complex 32x32 problem than for 4x4 and 8x8; this is because PI runs VI at each step, and for the 32x32 problem the policy converges faster with a high epsilon value.
Final chosen gamma and epsilon for VI - We will use a gamma of 0.999 and an epsilon of 0.0001, since the PI implementation in MDPToolbox uses an epsilon of 0.0001 internally, which makes for an equal comparison.
Frozen Lake 8x8, VI convergence plot and policy analysis - From the plots on the left and the table above, we can see that FL 8x8 with gamma 0.999 and epsilon 0.0001 converges in 405 iterations of VI. In the plots, the Max value function and the error plateau and converge. The converged policy is shown in the image below on the left. As our environment is stochastic, all the arrows in row 1 point up and those in column 7 point right, so that there is no chance for the agent to slip into the red holes; likewise, the squares surrounding the red holes all point outward. To explore a different strategy, I set a -1 reward for each step and a 1000 reward at the green goal state (7,7). The resulting policy is the one on the right. It took more than double the iterations (918) and five times more time (0.5 s compared to 0.1 s). This shows that the negative-reward strategy negatively impacts the VI algorithm.
By comparing the two policies side by side, we can see that the policy with the -1 step reward is more directly aimed at the goal: some arrows in row 1 and column 7 now point towards the reward. We can also see that square (6,3) chooses to go up rather than away from the reward, with an equal chance of falling into the hole. The value function is negative in the squares surrounded by holes.
For tuning the PI hyperparameters, I used 11 different values of gamma [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999]. The epsilon value for the policy-iteration method in MDPToolbox is 0.0001 (the VI run at each step under the current policy uses this epsilon for convergence).
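A sketch of that sweep, reusing the P and R matrices built in the VI sketch above (eval_type=1 selects MDPToolbox's iterative policy evaluation, which matches the description here; treating that as the exact configuration used is an assumption):

    import mdptoolbox.mdp

    # Sweep the discount factor for Policy Iteration and record iterations, time, and Max V
    for gamma in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999]:
        pi = mdptoolbox.mdp.PolicyIteration(P, R, discount=gamma, eval_type=1)
        pi.run()
        print(gamma, pi.iter, round(pi.time, 3), max(pi.V))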
Gamma (discount factor) impact on iterations, time, and reward - Gamma basically allows the algorithm to see into the future: if gamma is high, we value the future more, and the reward (Max V) will be high. We see the same behavior in the table above; as we increase gamma, time, iterations, and reward all increase. For 4x4 and 8x8, the iteration count does not change much across gammas, but for 32x32 it varies from 5 to 35; this is because that environment has more holes and more states to explore, and the iterations increase with the stochastic, complex nature of the problem. The reward (Max V) is always higher for the 32x32 problem, as it depends on how many times VI runs, and it increases with the iterations.
Frozen Lake 32x32, PI convergence plot and policy analysis - We selected gamma 0.999 and epsilon 0.0001, the same as for VI. From the plots on the left and the table above, we can see that FL 32x32 with gamma 0.999 and epsilon 0.0001 converges in 35 iterations of PI (compared to 188 iterations for VI). This is because PI runs VI at every iteration, and the policy converges faster. In the plots, the Max value function and the error plateau and converge. The converged policy is shown in the image below on the left; we can see how the reward of 1 propagates to the other states, the region reached by that reward is huge, and the value is roughly equal for most states. To explore a different strategy, I set a -0.01 reward for each step and a 1000 reward at the goal state. The resulting policy is the one on the right. It took approximately 150 iterations compared to 33 earlier. This shows that for model-based learning, negative rewards negatively impact policy convergence.
We also see the same behavior (the top squares have lower values than the bottom squares, as the top squares are harder to navigate given the hole layout). With a negative reward on each step, the policy tries to move towards the goal safely (same as VI); the states at the bottom appear safer to move through under the 0.33-probability stochastic behavior, which is why we see a good value function in those states.
For Q-Learning hyperparameter tuning, gamma had similar effects to those seen for VI and PI, so I used a gamma of 0.999, the same as for PI and VI. For alpha I used [0.1, 0.2], for alpha_decay [0.9, 0.999], and for epsilon_decay [0.9, 0.99, 0.9999]; I used the default epsilon and a million iterations for each problem size, as seen in the table below.
Gamma, the discount factor, was discussed in the VI section. Alpha, the learning rate, controls how much we value each newly observed Q(s,a) sample. Epsilon is the random-action probability: based on this value, we explore the environment by taking random actions. alpha_decay and epsilon_decay decay alpha and epsilon over time, so that we can exploit the observed Q table (low alpha, less epsilon randomness). From the table we can see that a million iterations are far too few to learn this problem at all sizes. A higher alpha increases the learning rate, which is good for the 4x4 and 8x8 problems, but for a large problem we might need more iterations to see its effect; interestingly, the highest reward for 32x32 comes with the lower alpha. Epsilon decay has the highest impact, because the problem has 4 actions and a large state space; 32x32 and 8x8 show better rewards with a decay of 0.9999. So we chose alpha = 0.1 (default), alpha_decay = 0.999 (decay alpha slowly and trust observed values), and epsilon_decay = 0.9999 (high exploration). We also ran the Q-Learning algorithm for up to 1000 episodes of 1e5 iterations each, resetting alpha and epsilon at each episode (the epsilon reset helped jump out of local minima), with the following convergence criterion: stop if N_VAR_L10_MeanV (the numpy variance of the mean V over the last 10 episodes) minus P_VAR_L10_MeanV (the same variance over the previous 10 episodes) is less than 0.0001 and the maximum of Max V is greater than 0.99 (which corresponds to the optimal policy). A sketch of this check is given below.
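A hedged sketch of that episode-level stopping rule (the history lists and the exact comparison are reconstructed from the description above, so treat the names as illustrative):

    import numpy as np

    def stop_training(mean_v_history, max_v_history, tol=1e-4):
        # mean_v_history: mean of max_a Q(s, a) after each episode; max_v_history: max of it
        if len(mean_v_history) < 20:
            return False
        new_var = np.var(mean_v_history[-10:])      # variance over the last 10 episodes
        prev_var = np.var(mean_v_history[-20:-10])  # variance over the 10 episodes before
        # Stop once episode-to-episode variation has flattened out and the best state
        # value is essentially that of the optimal policy (> 0.99 for Frozen Lake)
        return abs(new_var - prev_var) < tol and max(max_v_history) > 0.99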
Frozen Lake 8x8, QL convergence plot and policy analysis - From the plots on the left, we can see that FL 8x8 converges in 600 episodes of 1e5 iterations each. This is high because Q-Learning is a model-free algorithm and takes time to explore. I tried two approaches: one with negative rewards (-0.01 reward for each step and a 100 reward at the goal state) and one where I decreased the epsilon decay to 0.9 instead of 0.9999 after 420 episodes. With negative rewards, the Mean V converged around 400 episodes and the Max V around 200 episodes, which is less than the 600 and 400 episodes for non-negative rewards (plots on the left); as expected, negative rewards make Q-Learning move faster towards the goal. Changing the decay helped it converge earlier, but produced a policy with a suboptimal value function.
Policy comparison of VI, PI, and QL (8x8 FL) - Looking at the three policies side by side, we know the PI and VI policies are guaranteed to converge to the optimal policy, and we can see that the QL policy also looks very similar to them; the Max V(s) for its policy is also very close. PI and VI give the same direction for every square, but since QL tries random actions (being model-free), squares (3,3) and (6,3) differ for QL; it seems to move towards the high rewards.
Conclusion - For the FL problem, comparing the model-based algorithms VI and PI, PI takes fewer iterations to converge but more time, as it does more computation at each iteration. PI and VI converge and give the optimal policy, as guaranteed. Model-free learning, on the other hand, takes a lot of time (hours) and a very large number of iterations because the problem is stochastic and has 4 actions, but it does find the optimal policy. The negative-reward strategy helps QL converge faster for all sizes (400 episodes compared to 600, i.e. 33% faster).
For Forest Management (FM), I analyzed two problem sizes: 25 states and 625 states. I kept the rewards r1 and r2 and the fire probability at their default values (4, 2, and 0.1). I used MDPToolbox to generate and solve these MDP problems. We will discuss both problem sizes side by side for each algorithm.
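A minimal sketch of generating and solving the 25-state problem with MDPToolbox (the 625-state run only changes S):

    import mdptoolbox.example
    import mdptoolbox.mdp

    # Forest Management MDP: S states, default rewards r1=4, r2=2, fire probability p=0.1
    P, R = mdptoolbox.example.forest(S=25, r1=4, r2=2, p=0.1)

    vi = mdptoolbox.mdp.ValueIteration(P, R, discount=0.999, epsilon=1e-4)
    vi.run()
    print(vi.policy)   # one 0 (Wait) / 1 (Cut) decision per state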
From the below table, we can see the policy doesn’t change at all (both 25 and 625 states) for the selected
gamma (changing gamma changes the policy, as we can look farther by making gamma high). This is because
it’s a chain state problem (V(s) same for all C states) and FM problem converges to ideal policy with first
iteration, and it runs max iterations based on gamma and epsilon selected.
The reason the rewards are also the same for 25 and 625 states is that the Wait and Cut states have the same rewards in both problems, and the reported reward is the Max V value; with the same gamma (0.999) and epsilon, the highest Max V converges to the same value for different problem sizes. To check what is going on, I first ran VI with gamma 0.1 for 25 and 625 states (r1 = 4, r2 = 2) and inspected the value function and policy: only the first and last states were W. This is because V(s) for W in the last state is 4 + 0.1 * V*(s') ≈ 4.39, while for the second-to-last state Waiting gives V(s) = r(s) + 0.1 * 4 (assuming W in the last state), which is less than the discounted Cut reward at that state, hence Vmax(s) there is achieved by C. When we increase gamma to 0.3, the second-to-last state gets V(s) = r(s) + 0.3 * 4 for Waiting, which is greater than 1, and hence the last two states become W. So as we increase gamma, the W region grows backwards from the last state, because the Waiting reward is only earned in the last state and propagates from there. Also, the value function is the same for all C states, as they all get the same Cut reward plus gamma times the value of state 0. The first state is always W, as there is no reward for C at state 0. This reasoning also explains why the 25-state and 625-state problems have exactly the same number of W states counted from the end (when we select the same gamma).
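To make the arithmetic explicit, the last-state recursion under Wait can be written as follows (a sketch using the MDPToolbox convention that Waiting in the oldest state pays r1 and keeps the forest there with probability 1 - p):

    V(s_{S-1}) = r_1 + \gamma \left[ (1-p)\, V(s_{S-1}) + p\, V(s_0) \right]

Neglecting the small V(s_0) term gives V(s_{S-1}) \approx r_1 / (1 - \gamma(1-p)) = 4 / (1 - 0.1 \cdot 0.9) \approx 4.4, consistent with the ≈4.39 quoted above.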
From the policy plots on the upper right (red means Wait and blue means Cut), we can see that the first state is always W, since s = 0 has no Cut reward, and that the remaining Waits in both the 25- and 625-state problems are all at the end; everything in between is Cut, i.e., VI takes the Cut rewards instead of waiting. Interestingly, both policies have one W at the beginning and 20 W at the end (because of the same gamma, as explained earlier), which also confirms why policies for different sizes converge to the same Max V: the last state always has the same, largest V(s). (The VI convergence plot is shown in the PI section.)
From the table below, we can see that the policy changes with gamma, as expected. We use the default epsilon of 0.0001 (the MDPToolbox default) for PI. The number of iterations is lower than for VI, but the convergence time is higher for PI, which is expected as PI runs VI at each iteration (more computation per iteration).
High randomness, for 1M iterations, impacts the reward negatively.
We can also see from the table that, in comparison to PI and VI, the Q-Learning policy changes, and for the 25- and 625-state problems it takes a different number of iterations to converge, as it is model-free learning. From the converged policy plots on the right, we can see that for 25 states there are 4 W's and the rest Cuts, whereas for 625 states there are 6 W's and the rest Cuts, which differs from PI and VI, where the number of W's was equal for the small and large problems for the selected epsilon and gamma values (reasoned in the VI section). To plot the convergence of QL, I ran QL for 2400 episodes of 1e5 iterations each, with the same convergence criteria as FL. From the convergence plots, top 2 (25 states) and bottom 2 (625 states), we can see that Mean V and Max V converge to the same value; Max V converges faster, and Mean V takes more episodes to converge.
Conclusion - Forest Management is a state-chain problem, due to which the number of W states depends on gamma (detailed explanation in the VI section). The first state is always W, as the reward for C at state 0 is 0, and for VI and PI the number of iterations depends on the reward, epsilon, and gamma rather than on the size of the problem. For Q-Learning, the state-chain problem is even harder: it converges to a suboptimal policy in comparison to PI and VI, and it takes many iterations and a lot of time to converge. Unlike the FL problem, for the FM problem QL does not find a policy similar to VI and PI; instead, there are Wait states in the middle, because the state-chain nature of the problem is hard for Q-Learning.
REFERENCES
[1] Gym is a standard API for reinforcement learning, and a diverse collection of reference environments. Gym Documentation. (n.d.). Retrieved November 26, 2022, from https://www.gymlibrary.dev/
[2] Markov Decision Process (MDP) Toolbox. Markov Decision Process (MDP) Toolbox - Python Markov Decision Process Toolbox 4.0-b4 documentation. (n.d.). Retrieved November 26, 2022, from https://pymdptoolbox.readthedocs.io/en/latest/api/mdptoolbox.html