Markov Decision Processes Analysis
1 INTRODUCTION
This report summarizes the analysis done on two Markov Decision Process (MDP) problems. In an MDP, an agent interacts with an environment: at each time step, the agent takes an action, which changes the environment state and yields some reward or penalty. The goal is to find an optimal policy that maximizes the total reward, either for reaching a goal state or for acting over a certain time horizon. I have chosen Frozen Lake (grid world) and Forest Management (non-grid world) as the two MDP problems. Both are solved using the Policy Iteration, Value Iteration, and Q-Learning algorithms. The sizes of the two problems and the parameters of the algorithms are varied and analyzed in this report. I have used the OpenAI Gym and Markov Decision Process (MDP) Toolbox Python libraries for this analysis.
Forest Management Problem - This is a non-grid-world problem in which the agent has two actions, Wait (0) and Cut (1). The action is decided based on two objectives: first, to maintain an old forest for wildlife, and second, to earn a reward by selling wood. The problem has a stochastic component: there is a probability p that a fire burns the forest down. The states are [0, 1, ..., S-1], where 0 is the youngest state and S-1 is the oldest. If the agent takes action Cut, the forest goes back to state 0 and the agent gets a reward for selling the wood (r2 when cutting the oldest forest); if the agent Waits for the larger reward r1 of preserving old forest for wildlife (paid in the oldest state), the forest keeps ageing, but with probability p a fire resets it to state 0. This problem is interesting because it is a real-world problem and because of its stochastic nature: in some states Waiting is ideal, and in others Cutting for an immediate reward is better.
There are two types of algorithms analyzed in this report: model-based algorithms and model-free algorithms. In model-based learning, the agent interacts with the environment and builds state-transition and reward models; the agent then iterates over these models using Value Iteration or Policy Iteration to find the optimal policy. In model-free learning, on the other hand, the agent does not learn an explicit state-transition and reward model; it interacts with the environment and derives the optimal policy directly (Q-Learning, SARSA).
A few function definitions before discussing the algorithms:
V(s), the value function, is the expected total reward for an agent starting in state s. It represents how good an environment state is for the agent. The value function with the highest reward is the optimal value function V*(s).
Q(s,a), the Q-function, gives the expected total reward for an agent that is in state s and takes action a; in other words, how good it is for the agent to pick action a when in state s. The optimal Q-function is Q*(s,a).
The Bellman equation defines a recursive way to calculate the optimal Q-function (dynamic programming paradigm); see the equations below.
π*(s), the optimal policy, gives the optimal action for state s with the maximum total reward.
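For reference, the standard Bellman optimality equations tying these quantities together (written with P as the transition model and R as the reward model, matching the (A, S, S) matrices used later in this report) are:

    Q^*(s,a) = \sum_{s'} P(s' \mid s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q^*(s',a') \right]
    V^*(s) = \max_{a} Q^*(s,a), \qquad \pi^*(s) = \arg\max_{a} Q^*(s,a)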
Value Iteration (VI) - The VI algorithm iteratively improves V(s) and computes the optimal state values. First we initialize V(s) to arbitrary values, then update Q(s,a) and V(s) until they converge. VI is guaranteed to converge to the optimal policy π∗.
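A minimal sketch of the VI update, assuming the (A, S, S) transition and reward arrays described later (this is an illustrative implementation, not the exact MDPToolbox code):

    import numpy as np

    def value_iteration(P, R, gamma=0.999, epsilon=1e-4):
        # P, R: (A, S, S) transition-probability and reward arrays (assumed shapes)
        S = P.shape[1]
        V = np.zeros(S)
        while True:
            # Q[a, s]: expected immediate reward plus discounted value of the next state
            Q = (P * (R + gamma * V)).sum(axis=2)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < epsilon:   # epsilon stopping criterion
                return V_new, Q.argmax(axis=0)        # converged values and greedy policy
            V = V_new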
Policy Iteration (PI) - In contrast to VI, instead of repeatedly improving the V(s) estimate, PI re-defines the policy π at each step and computes V(s) according to this new policy until the policy converges. Since the optimal policy often converges before the value function, PI typically takes fewer iterations than VI. PI is also guaranteed to converge to the optimal policy π∗.
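A minimal sketch of one PI round, again assuming (A, S, S) arrays (MDPToolbox also offers an iterative policy-evaluation mode instead of the exact linear solve shown here):

    import numpy as np

    def policy_iteration(P, R, gamma=0.999):
        # P, R: (A, S, S) transition-probability and reward arrays (assumed shapes)
        S = P.shape[1]
        policy = np.zeros(S, dtype=int)
        while True:
            # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy
            P_pi = P[policy, np.arange(S)]
            r_pi = (P_pi * R[policy, np.arange(S)]).sum(axis=1)
            V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
            # Policy improvement: act greedily with respect to the evaluated V
            new_policy = (P * (R + gamma * V)).sum(axis=2).argmax(axis=0)
            if np.array_equal(new_policy, policy):
                return V, policy                      # policy has converged
            policy = new_policy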
Q-Learning (QL) - QL is an example of model-free learning: the agent has no state-transition or reward model and discovers the optimal policy by trial and error (exploration and exploitation). In QL, we approximate the Q-function using the Q(s,a) samples observed so far; this approach is called Temporal-Difference learning. There are three Greek-letter hyperparameters. Alpha, the learning rate, controls how much we trust newly observed samples when estimating the Q-function. Gamma defines how important future rewards are. Epsilon balances exploration and exploitation: with probability epsilon we take a random action. For Q-Learning we decay both alpha and epsilon over time.
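A minimal tabular Q-learning sketch with epsilon-greedy exploration and decayed alpha/epsilon, assuming the Gym 0.26-style reset/step API (the actual training loop used for the experiments has an episode structure, described later):

    import numpy as np

    def q_learning(env, gamma=0.999, alpha=0.1, epsilon=1.0,
                   alpha_decay=0.999, epsilon_decay=0.9999, n_steps=1_000_000):
        # Tabular Q-learning: learn Q(s, a) from sampled transitions only (model-free)
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        s, _ = env.reset()
        for _ in range(n_steps):
            # Explore with probability epsilon, otherwise exploit the current Q table
            a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            # Temporal-difference update towards the observed sample
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
            s = s2 if not (terminated or truncated) else env.reset()[0]
            alpha *= alpha_decay       # trust accumulated estimates more over time
            epsilon *= epsilon_decay   # shift from exploration to exploitation
        return Q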
For the Frozen Lake (FL) problem, I did my analysis on grid sizes of 4x4 (16 states), 8x8 (64 states), and 32x32 (1024 states). I used OpenAI Gym to create the environment, and from the environment I computed the transition and reward models as (A, S, S) matrices. These models are then fed into the MDPToolbox Value Iteration method with combinations of gamma and epsilon to see the impact of these hyperparameters. Note that we are using the stochastic FL; there is no negative reward per step, and the reward for reaching the goal is 1 for the experiments in the table below. I have also done an analysis for each algorithm with a negative reward at each step (which impacts Q-Learning convergence time), explained in the respective subsections.
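A sketch of that pipeline for the 8x8 map, assuming the registered FrozenLake8x8-v1 environment (the 32x32 map was generated separately and is not shown here):

    import gym
    import numpy as np
    import mdptoolbox.mdp

    env = gym.make("FrozenLake8x8-v1").unwrapped
    S, A = env.observation_space.n, env.action_space.n

    # Build (A, S, S) transition and reward matrices from the environment model env.P
    P = np.zeros((A, S, S))
    R = np.zeros((A, S, S))
    for s in range(S):
        for a in range(A):
            for prob, s2, reward, _done in env.P[s][a]:
                P[a, s, s2] += prob
                R[a, s, s2] = reward

    # Feed the models into MDPToolbox Value Iteration
    vi = mdptoolbox.mdp.ValueIteration(P, R, discount=0.999, epsilon=1e-4)
    vi.run()
    print(vi.iter, round(vi.time, 3), max(vi.V))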
For tuning the VI hyperparameters, I used gamma values [0.3, 0.6, 0.9, 0.99, 0.999] and epsilon values [1e-2, 1e-4, 1e-5, 1e-6]. The table below shows the impact of gamma and epsilon on VI for all three problem sizes by comparing time, iterations, and rewards.
Gamma, the discount factor, is the per-time-step discount applied to future rewards. It varies between 0 and 1, and the higher the gamma, the more you value future rewards.
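In other words, the quantity being maximized is the standard discounted return

    G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

so a gamma close to 1 weights distant rewards almost as much as immediate ones.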
Gamma impact on iterations, time, and reward - From the table we can see that as we increase gamma, the number of iterations increases. This is because the reward for reaching the goal takes more iterations to decay close to 0 (the point where jumping into a hole is equivalent to the future reward). An interesting observation appears when we compare iterations across problem sizes for different gamma values: for small gamma, the 4x4, 8x8, and 32x32 problems take an equal number of iterations, but for higher values like 0.99 and 0.999 the larger problems take more iterations and time (time is directly proportional to problem size). The reason is that the future reward is discounted by a very small factor, and as the problem grows, the number of holes and the complexity increase, so there is more variation in the value function and it takes more iterations to converge. Iterations are directly proportional to time. Also, since gamma discounts the reward at each step, the gamma value is directly proportional to the reward.
Epsilon, the stopping criterion - On each iteration the maximum change in V(s) is compared against epsilon; if the change falls below epsilon, V(s) is considered to have converged to V*(s).
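Written out, this stopping rule after sweep k is

    \max_{s} \left| V_{k+1}(s) - V_{k}(s) \right| < \epsilon

(MDPToolbox's ValueIteration applies a discount-scaled variant of this threshold internally, but the idea is the same).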
Epsilon impact on iterations, time, and reward - Epsilon is inversely proportional to time, iterations, and reward for all three problem sizes. This is because epsilon is the stopping criterion: if epsilon is high, the value function converges early; if epsilon is small, the algorithm runs for more iterations to improve the value function and reduce the maximum change in V(s) per iteration. Interestingly, for epsilon values 0.01 and 0.0001 with gamma 0.999, we see PI takes fewer iterations for the complex 32x32 problem than for 4x4 and 8x8; this is because PI runs VI at each step, and for the 32x32 problem the policy converges faster with a high epsilon value.
Final chosen gamma and epsilon for VI - We will use a gamma of 0.999 and an epsilon of 0.0001, since the PI implementation in MDPToolbox uses an epsilon of 0.0001 internally, which makes for an equal comparison.
Frozen Lake 8x8, VI convergence plot and policy analysis - From the plots on the left and the table above, we can see that FL 8x8 with gamma 0.999 and epsilon 0.0001 converges in 405 iterations of VI. In the plots, the Max value function and the error plateau and converge. The converged policy is shown in the image below on the left. As our environment is stochastic, all the arrows in row 1 point up and those in column 7 point right, so that there is no chance for the agent to slip into the red holes; likewise, the squares surrounding the red holes all point outward. To explore a different strategy, I set a -1 reward for each step and a 1000 reward at the green goal state (7,7). The resulting policy is the one on the right. It took more than double the iterations (918) and five times more time (0.5 s compared to 0.1 s). This shows that the negative-reward strategy negatively impacts the VI algorithm.
By comparing the two policies side by side, we can see that the policy with the -1 step reward is more directly aimed at the goal: some arrows in row 1 and column 7 now point towards the reward. We can also see that square (6,3) chooses to go up rather than away from the reward, with an equal chance of falling into the hole. The value function is negative in the squares surrounded by holes.
For tuning the PI hyperparameters, I used 11 different values of gamma [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999]. The epsilon value for the policy-iteration method in MDPToolbox is 0.0001 (the VI run at each step under the current policy uses this epsilon for convergence).
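A sketch of that sweep, reusing the P and R matrices built in the VI sketch above (eval_type=1 selects MDPToolbox's iterative policy evaluation, which matches the description here; treating that as the exact configuration used is an assumption):

    import mdptoolbox.mdp

    # Sweep the discount factor for Policy Iteration and record iterations, time, and Max V
    for gamma in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99, 0.999]:
        pi = mdptoolbox.mdp.PolicyIteration(P, R, discount=gamma, eval_type=1)
        pi.run()
        print(gamma, pi.iter, round(pi.time, 3), max(pi.V))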
Gamma (discount factor) impact on iterations, time, and reward - Gamma basically allows the algorithm to see into the future: if gamma is high, we value the future more, and the reward (Max V) will be high. We see the same behavior in the table above; as we increase gamma, time, iterations, and reward all increase. For 4x4 and 8x8, the iteration count does not change much across gammas, but for 32x32 it varies from 5 to 35; this is because that environment has more holes and more states to explore, and the iterations increase with the stochastic, complex nature of the problem. The reward (Max V) is always higher for the 32x32 problem, as it depends on how many times VI runs, and it increases with the iterations.
Frozen Lake 32x32, PI convergence plot and policy analysis - We selected gamma 0.999 and epsilon 0.0001, the same as for VI. From the plots on the left and the table above, we can see that FL 32x32 with gamma 0.999 and epsilon 0.0001 converges in 35 iterations of PI (compared to 188 iterations for VI). This is because PI runs VI at every iteration, and the policy converges faster. In the plots, the Max value function and the error plateau and converge. The converged policy is shown in the image below on the left; we can see how the reward of 1 propagates to the other states, the region reached by that reward is huge, and the value is roughly equal for most states. To explore a different strategy, I set a -0.01 reward for each step and a 1000 reward at the goal state. The resulting policy is the one on the right. It took approximately 150 iterations compared to 33 earlier. This shows that for model-based learning, negative rewards negatively impact policy convergence.
We also see the same behavior (the top squares have lower values than the bottom squares, as the top squares are harder to navigate given the hole layout). With a negative reward on each step, the policy tries to move towards the goal safely (same as VI); the states at the bottom appear safer to move through under the 0.33-probability stochastic behavior, which is why we see a good value function in those states.
For Q-Learning hyperparameter tuning, gamma had similar effects to those seen for VI and PI, so I used a gamma of 0.999, the same as for PI and VI. For alpha I used [0.1, 0.2], for alpha_decay [0.9, 0.999], and for epsilon_decay [0.9, 0.99, 0.9999]; I used the default epsilon and a million iterations for each problem size, as seen in the table below.
Gamma, the discount factor, was discussed in the VI section. Alpha, the learning rate, controls how much we value each newly observed Q(s,a) sample. Epsilon is the random-action probability: based on this value, we explore the environment by taking random actions. alpha_decay and epsilon_decay decay alpha and epsilon over time, so that we can exploit the observed Q table (low alpha, less epsilon randomness). From the table we can see that a million iterations are far too few to learn this problem at all sizes. A higher alpha increases the learning rate, which is good for the 4x4 and 8x8 problems, but for a large problem we might need more iterations to see its effect; interestingly, the highest reward for 32x32 comes with the lower alpha. Epsilon decay has the highest impact, because the problem has 4 actions and a large state space; 32x32 and 8x8 show better rewards with a decay of 0.9999. So we chose alpha = 0.1 (default), alpha_decay = 0.999 (decay alpha slowly and trust observed values), and epsilon_decay = 0.9999 (high exploration). We also ran the Q-Learning algorithm for up to 1000 episodes of 1e5 iterations each, resetting alpha and epsilon at each episode (the epsilon reset helped jump out of local minima), with the following convergence criterion: stop if N_VAR_L10_MeanV (the numpy variance of the mean V over the last 10 episodes) minus P_VAR_L10_MeanV (the same variance over the previous 10 episodes) is less than 0.0001 and the maximum of Max V is greater than 0.99 (which corresponds to the optimal policy). A sketch of this check is given below.
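A hedged sketch of that episode-level stopping rule (the history lists and the exact comparison are reconstructed from the description above, so treat the names as illustrative):

    import numpy as np

    def stop_training(mean_v_history, max_v_history, tol=1e-4):
        # mean_v_history: mean of max_a Q(s, a) after each episode; max_v_history: max of it
        if len(mean_v_history) < 20:
            return False
        new_var = np.var(mean_v_history[-10:])      # variance over the last 10 episodes
        prev_var = np.var(mean_v_history[-20:-10])  # variance over the 10 episodes before
        # Stop once episode-to-episode variation has flattened out and the best state
        # value is essentially that of the optimal policy (> 0.99 for Frozen Lake)
        return abs(new_var - prev_var) < tol and max(max_v_history) > 0.99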
Frozen Lake 8x8, QL convergence plot and policy analysis - From the plots on the left, we can see that FL 8x8 converges in 600 episodes of 1e5 iterations each. This is high because Q-Learning is a model-free algorithm and takes time to explore. I tried two approaches: one with negative rewards (-0.01 reward for each step and a 100 reward at the goal state) and one where I decreased the epsilon decay to 0.9 instead of 0.9999 after 420 episodes. With negative rewards, the Mean V converged around 400 episodes and the Max V around 200 episodes, which is less than the 600 and 400 episodes for non-negative rewards (plots on the left); as expected, negative rewards make Q-Learning move faster towards the goal. Changing the decay helped it converge earlier, but produced a policy with a suboptimal value function.
Policy comparison of VI, PI, and QL (8x8 FL) - Looking at the three policies side by side, we know the PI and VI policies are guaranteed to converge to the optimal policy, and we can see that the QL policy also looks very similar to them; the Max V(s) for its policy is also very close. PI and VI give the same direction for every square, but since QL tries random actions (being model-free), squares (3,3) and (6,3) differ for QL; it seems to move towards the high rewards.
Conclusion - For the FL problem, comparing the model-based algorithms VI and PI, PI takes fewer iterations to converge but more time, as it does more computation at each iteration. PI and VI converge and give the optimal policy, as guaranteed. Model-free learning, on the other hand, takes a lot of time (hours) and a very large number of iterations because the problem is stochastic and has 4 actions, but it does find the optimal policy. The negative-reward strategy helps QL converge faster for all sizes (400 episodes compared to 600, i.e. 33% faster).
For Forest Management (FM), I analyzed two problem sizes: 25 states and 625 states. I kept the rewards r1 and r2 and the fire probability at their default values (4, 2, and 0.1). I used MDPToolbox to generate and solve these MDP problems. We will discuss both problem sizes side by side for each algorithm.
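A minimal sketch of generating and solving the 25-state problem with MDPToolbox (the 625-state run only changes S):

    import mdptoolbox.example
    import mdptoolbox.mdp

    # Forest Management MDP: S states, default rewards r1=4, r2=2, fire probability p=0.1
    P, R = mdptoolbox.example.forest(S=25, r1=4, r2=2, p=0.1)

    vi = mdptoolbox.mdp.ValueIteration(P, R, discount=0.999, epsilon=1e-4)
    vi.run()
    print(vi.policy)   # one 0 (Wait) / 1 (Cut) decision per state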
From the below table, we can see the policy doesn’t change at all (both 25 and 625 states) for the selected
gamma (changing gamma changes the policy, as we can look farther by making gamma high). This is because
it’s a chain state problem (V(s) same for all C states) and FM problem converges to ideal policy with first
iteration, and it runs max iterations based on gamma and epsilon selected.
The reason the rewards are also the same for 25 and 625 states is that the Wait and Cut states have the same rewards in both problems, and the reported reward is the Max V value; with the same gamma (0.999) and epsilon, the highest Max V converges to the same value for different problem sizes. To check what is going on, I first ran VI with gamma 0.1 for 25 and 625 states (r1 = 4, r2 = 2) and inspected the value function and policy: only the first and last states were W. This is because V(s) for W in the last state is 4 + 0.1 * V*(s') ≈ 4.39, while for the second-to-last state Waiting gives V(s) = r(s) + 0.1 * 4 (assuming W in the last state), which is less than the discounted Cut reward at that state, hence Vmax(s) there is achieved by C. When we increase gamma to 0.3, the second-to-last state gets V(s) = r(s) + 0.3 * 4 for Waiting, which is greater than 1, and hence the last two states become W. So as we increase gamma, the W region grows backwards from the last state, because the Waiting reward is only earned in the last state and propagates from there. Also, the value function is the same for all C states, as they all get the same Cut reward plus gamma times the value of state 0. The first state is always W, as there is no reward for C at state 0. This reasoning also explains why the 25-state and 625-state problems have exactly the same number of W states counted from the end (when we select the same gamma).
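To make the arithmetic explicit, the last-state recursion under Wait can be written as follows (a sketch using the MDPToolbox convention that Waiting in the oldest state pays r1 and keeps the forest there with probability 1 - p):

    V(s_{S-1}) = r_1 + \gamma \left[ (1-p)\, V(s_{S-1}) + p\, V(s_0) \right]

Neglecting the small V(s_0) term gives V(s_{S-1}) \approx r_1 / (1 - \gamma(1-p)) = 4 / (1 - 0.1 \cdot 0.9) \approx 4.4, consistent with the ≈4.39 quoted above.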
From the policy plots on the upper right (red means Wait and blue means Cut), we can see that the first state is always W, since s = 0 has no Cut reward, and that the remaining Waits in both the 25- and 625-state problems are all at the end; everything in between is Cut, i.e., VI takes the Cut rewards instead of waiting. Interestingly, both policies have one W at the beginning and 20 W at the end (because of the same gamma, as explained earlier), which also confirms why policies for different sizes converge to the same Max V: the last state always has the same, largest V(s). (The VI convergence plot is shown in the PI section.)
From the table below, we can see that the policy changes with gamma, as expected. We use the default epsilon of 0.0001 (the MDPToolbox default) for PI. The number of iterations is lower than for VI, but the convergence time is higher for PI, which is expected as PI runs VI at each iteration (more computation per iteration).
High randomness, for 1M iterations, impacts the reward negatively.
We can also see from the table that, in comparison to PI and VI, the Q-Learning policy changes, and for the 25- and 625-state problems it takes a different number of iterations to converge, as it is model-free learning. From the converged policy plots on the right, we can see that for 25 states there are 4 W's and the rest Cuts, whereas for 625 states there are 6 W's and the rest Cuts, which differs from PI and VI, where the number of W's was equal for the small and large problems for the selected epsilon and gamma values (reasoned in the VI section). To plot the convergence of QL, I ran QL for 2400 episodes of 1e5 iterations each, with the same convergence criteria as FL. From the convergence plots, top 2 (25 states) and bottom 2 (625 states), we can see that Mean V and Max V converge to the same value; Max V converges faster, and Mean V takes more episodes to converge.
Conclusion - Forest Management is a state-chain problem, due to which the number of W states depends on gamma (detailed explanation in the VI section). The first state is always W, as the reward for C at state 0 is 0, and for VI and PI the number of iterations depends on the reward, epsilon, and gamma rather than on the size of the problem. For Q-Learning, the state-chain problem is even harder: it converges to a suboptimal policy in comparison to PI and VI, and it takes many iterations and a lot of time to converge. Unlike the FL problem, for the FM problem QL does not find a policy similar to VI and PI; instead, there are Wait states in the middle, because the state-chain nature of the problem is hard for Q-Learning.
REFERENCES
[1] Gym is a standard API for reinforcement learning, and a diverse collection of reference environments. Gym Documentation. (n.d.). Retrieved November 26, 2022, from https://www.gymlibrary.dev/
[2] Markov Decision Process (MDP) Toolbox. Markov Decision Process (MDP) Toolbox - Python Markov Decision Process Toolbox 4.0-b4 documentation. (n.d.). Retrieved November 26, 2022, from https://pymdptoolbox.readthedocs.io/en/latest/api/mdptoolbox.html