Algorithms To Solve An MDP
Solving an MDP
To solve an MDP means estimating the optimal state values or the optimal policy, using which our agent can make smart decisions about the actions to take in the environment.
However, since state values are the expected returns from a state over a large number of future actions (i.e. till the end of a long episode), we cannot compute them directly by enumerating all possible pathways.
Hence we depend on iterative or recursive solutions (dynamic programming), where the value of one quantity can be estimated in terms of the value of the same quantity at the next time step / state.
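This recursive relationship is the Bellman equation. As a reference (the notation below is an assumption of this write-up, not taken from the slides), the optimal state value satisfies

V^*(s) = \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]

where P is the transition model, R the reward, and γ the discount factor.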
Model-Based and Model-Free Methods
Value iteration and policy iteration are model-based methods: they rely on knowing the environment's transition model. Q-learning, on the other hand, is a model-free method: it learns purely from interaction with the environment.
Value Iteration
In value iteration, we start by assuming random state values for each state; using these values, we iterate through the different state-action pairs with dynamic programming and correspondingly update the state values until convergence.
[Grid figure: initial state values, all set to 0]
Step 2: Q value estimation
Hence, in the next iteration, we find our state values to be as follows:
[Grid figure: updated state values (row 0 0 0 1); the cell adjacent to the goal now has value 1]
Next, we use these updated values to find the Q values for all state-action pairs in the model.
First, let's focus on one particular state: (1,3).
Now, the Q values for this state:
[Grid figures: Q values for the actions from state (1,3) and the state values over successive iterations, with the goal cell at +1, the danger cell at -1, and intermediate values such as 0.43, 0.52, 0.72 and 0.78]
Hence, since the value of state (1,3) is the largest among all the neighbouring cells of state (1,2), the policy asks Superhero Pranay to move right from cell (1,2), and similarly from state (1,3) to state (1,4), where he wins!
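To make the procedure concrete, here is a minimal Python sketch of value iteration on a small gridworld; the grid size, transition rules, terminal rewards and discount factor below are illustrative assumptions, not the exact setup from the slides.

import numpy as np

# Illustrative 3x4 gridworld (assumed layout): terminal values +1 and -1.
ROWS, COLS, GAMMA = 3, 4, 0.9
TERMINALS = {(0, 3): 1.0, (1, 3): -1.0}          # goal and danger cells
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(state, action):
    """Next state for a deterministic move; stay put if it would leave the grid."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state

V = np.zeros((ROWS, COLS))        # Step 1: all state values start at 0
for _ in range(100):              # iterate until (approximate) convergence
    V_new = V.copy()
    for r in range(ROWS):
        for c in range(COLS):
            if (r, c) in TERMINALS:
                V_new[r, c] = TERMINALS[(r, c)]
                continue
            # Step 2: compute a Q value for every action, keep the best one
            V_new[r, c] = max(GAMMA * V[step((r, c), a)] for a in ACTIONS)
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-6:
        break

The greedy policy then picks, in each cell, the action with the largest Q value; in the slides' example that greedy choice is what sends Superhero Pranay right from (1,2) towards the goal.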
Policy Iteration
Unlike value iteration, in policy iteration we start by assuming a random policy for the agent, since we do not know which action would be optimal.
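The slides do not spell out the full loop here, but the standard scheme alternates policy evaluation and policy improvement. A minimal sketch, reusing np, ROWS, COLS, GAMMA, TERMINALS, ACTIONS and step() from the value iteration sketch above (all of which remain illustrative assumptions):

import random

# Start from a random policy: one arbitrary action per non-terminal state.
policy = {(r, c): random.choice(ACTIONS)
          for r in range(ROWS) for c in range(COLS) if (r, c) not in TERMINALS}

for _ in range(50):
    # Policy evaluation: estimate V for the current, fixed policy.
    V = np.zeros((ROWS, COLS))
    for terminal, value in TERMINALS.items():
        V[terminal] = value
    for _ in range(100):
        for s, a in policy.items():
            V[s] = GAMMA * V[step(s, a)]
    # Policy improvement: act greedily with respect to the evaluated values.
    stable = True
    for s in policy:
        best = max(ACTIONS, key=lambda a: V[step(s, a)])
        if best != policy[s]:
            policy[s], stable = best, False
    if stable:   # the policy no longer changes, so it is optimal
        break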
Q-Learning
Consider another simple grid game: a 3x3 grid where the player starts in the Start square and wants to reach the Goal square as their final destination, while avoiding danger cells.
This problem has 9 states, since the player can be positioned in any of the 9 squares of the grid, and 4 actions. So we construct a Q-table with 9 rows and 4 columns.
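As a sketch, the Q-table can be represented as a 9 x 4 array of zeros, one row per grid square and one column per action; the row-major indexing below is an assumption for illustration.

import numpy as np

N_STATES, N_ACTIONS = 9, 4                  # 3x3 grid squares, 4 moves
q_table = np.zeros((N_STATES, N_ACTIONS))   # every Q value starts at 0

def state_index(row, col):
    """Map a grid position to its Q-table row (row-major ordering assumed)."""
    return row * 3 + col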
Step 2: Action!
Unlike model-based methods, in the Q-learning algorithm we need to actually interact with the environment to learn something about it. Hence, from whichever state we are in, we need to pick an action.
How do we choose the action?
We use an ε-greedy strategy: with probability ε we pick a random action (exploration), and with probability 1 - ε we pick the action with the highest Q value in the current state (exploitation).
In the first few iterations we have learnt close to nothing about the model, so it is better to explore the different options rather than rely on our little knowledge about it. Hence ε is close to 1 in the beginning, encouraging exploration.
Towards the end of an episode, or after a high number of iterations, we have learnt almost everything about the model, so we can focus on that knowledge to extract the best actions and gain greater rewards. Hence, after a very large number of iterations, ε decays towards 0, encouraging exploitation.
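A minimal sketch of ε-greedy action selection with a decaying ε; the decay schedule and constants here are illustrative assumptions.

import numpy as np

def epsilon_greedy(q_table, state, epsilon):
    """Pick a random action with probability epsilon, else the current best one."""
    if np.random.random() < epsilon:
        return np.random.randint(q_table.shape[1])    # explore
    return int(np.argmax(q_table[state]))             # exploit

# ε starts near 1 (explore) and decays towards 0 (exploit) as learning proceeds.
epsilon, decay, min_epsilon = 1.0, 0.995, 0.01
for iteration in range(1000):
    # ... pick an action with epsilon_greedy, act, update the Q-table ...
    epsilon = max(min_epsilon, epsilon * decay)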
Step 3: Q value estimation
Now that we have picked our action, let's build some intuition behind the update formula.
After taking the action, the reward we receive (say r) plays a role in the Q value: for an optimal Q value, we can say that Q(s, a) should approach
r + γ · max_a' Q(s', a'),
where s' is the state we land in and the maximum is taken over the actions a' available there.
This process is then repeated for the next state, after we take the chosen action, until CONVERGENCE.
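Putting this into code, here is a minimal sketch of the Q-learning update step, reusing q_table and epsilon_greedy from the sketches above; the learning rate ALPHA, the discount GAMMA and the env_step function are assumptions made for illustration.

ALPHA, GAMMA = 0.1, 0.9    # learning rate and discount factor (assumed values)

def q_learning_update(q_table, state, action, reward, next_state, done):
    """Nudge Q(s, a) towards the target r + GAMMA * max_a' Q(s', a')."""
    target = reward if done else reward + GAMMA * np.max(q_table[next_state])
    q_table[state, action] += ALPHA * (target - q_table[state, action])

# One learning step (env_step is a hypothetical environment function):
# action = epsilon_greedy(q_table, state, epsilon)
# next_state, reward, done = env_step(state, action)
# q_learning_update(q_table, state, action, reward, next_state, done)
# state = next_state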
Step 3: Putting it all together
Note that when we compare the two ways of computing the Q value for a particular state-action pair (the current estimate and the target built from the next state), the target takes the action in the next state which has the maximum Q value. However, it is not necessary that in the next iteration we actually follow that action, because we may also be exploring with a probability of ε. Since the algorithm does not follow exactly the same policy that it uses to compute the Q-value targets, it is called an OFF-POLICY algorithm.
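To make the distinction concrete, here is a small illustrative fragment reusing names from the sketches above (next_state, reward, epsilon and so on are assumed to come from an interaction step):

# Target (learning) policy: greedy, i.e. the max Q value in the next state.
greedy_next_action = int(np.argmax(q_table[next_state]))
target = reward + GAMMA * q_table[next_state, greedy_next_action]

# Behaviour policy: ε-greedy, so the action actually executed next can differ
# from greedy_next_action whenever we explore; this mismatch between the two
# policies is what makes Q-learning OFF-POLICY.
executed_next_action = epsilon_greedy(q_table, next_state, epsilon)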
THANK YOU!