Markov Decision Process
EXAMPLE: an agent starts in the field marked START of a 4×3 grid and can move in any direction between the fields. Its run ends when it reaches one of the fields (4,2) or (4,3), with the result (−1 or +1, respectively) marked in those fields.
[Figure: the 4×3 grid environment, with START at (1,1), an inaccessible field at (2,2), reward +1 at (4,3) and −1 at (4,2).]
If the problem were fully deterministic, and the agent's knowledge of its position complete, then the problem would reduce to action planning. For example, for the environment above the correct solution would be the action plan: U-U-R-R-R. Equally good would be the plan: R-R-U-U-R. If the individual actions did not cost anything (i.e. only the final state mattered), then equally good would also be the plan R-R-R-L-L-L-U-U-R-R-R, and many others.
[Figure: the stochastic transition model; each action moves the agent in the intended direction with probability 0.8, and sideways (perpendicular to the intended direction) with probability 0.1 to each side.]
With this transition model we can compute the expected values of sequences of the agent's moves. In general there is no guarantee that, after executing any of the above sequences, the agent will indeed end up in the desired terminal state.
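To make this concrete, here is a minimal Python sketch (the helper names are mine, not from the notes) that propagates the probability distribution over states while executing the fixed plan U-U-R-R-R under the 0.8/0.1/0.1 transition model, with the obstacle at (2,2) and the terminal fields (4,2) and (4,3):

# Propagate the state distribution for the fixed plan U-U-R-R-R in the 4x3
# grid (obstacle at (2,2), terminal fields (4,2) and (4,3)), assuming the
# 0.8/0.1/0.1 transition model described above.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1, (4, 2): -1}
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
NOISE = {a: [(0.8, a)] for a in MOVES}          # intended direction
NOISE['U'] += [(0.1, 'L'), (0.1, 'R')]          # sideways slips
NOISE['D'] += [(0.1, 'L'), (0.1, 'R')]
NOISE['L'] += [(0.1, 'U'), (0.1, 'D')]
NOISE['R'] += [(0.1, 'U'), (0.1, 'D')]

def step(state, direction):
    # Deterministic effect of one move; bumping into the obstacle or the
    # border leaves the agent in place.
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def run_plan(plan, start=(1, 1)):
    # Probability distribution over states after executing the fixed plan.
    dist = {start: 1.0}
    for action in plan:
        new_dist = {}
        for state, p in dist.items():
            if state in TERMINALS:              # the run has already ended here
                new_dist[state] = new_dist.get(state, 0.0) + p
                continue
            for q, actual in NOISE[action]:
                s2 = step(state, actual)
                new_dist[s2] = new_dist.get(s2, 0.0) + p * q
        dist = new_dist
    return dist

dist = run_plan("UURRR")
print("P(ending in (4,3)) =", dist.get((4, 3), 0.0))
print("expected result (unfinished runs counted as 0) =",
      sum(p * TERMINALS.get(s, 0) for s, p in dist.items()))

Running this shows that even the "correct" plan U-U-R-R-R ends in (4,3) with a probability of only about 1/3.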
In contrast to action planning algorithms, here the agent should work out its strategy not as a specific sequence of actions, but as a policy: a scheme determining the action the agent should take in any specific state it may find itself in.
[Figure: a conservative optimal policy for the 4×3 environment, marked with an arrow in each non-terminal field.]
This policy obviously results from the assumption that moves cost nothing. If the agent's outcome depended not only on the final state, but also on the number of moves, then such a conservative policy would probably no longer be optimal.
Computing such a policy, i.e. a complete mapping from states to actions, is called a Markov decision problem (MDP) if the probabilities of the transitions resulting from the agent's actions depend only on the agent's current state and not on its history. Such problems are said to have the Markov property.
The solution to an MDP is a policy π(s) mapping states to actions.
Note that under uncertainty each pass of the agent through the environment according to the policy may result in a different sequence of states, and possibly a different outcome. The optimal policy π*(s) is the policy achieving the greatest expected utility.
In MDP problems states do not have utilities, except for the terminal states. We can, however, speak of the utility of a sequence (history) of states U_h([s_0, s_1, ..., s_n]) if it corresponds to an actual sequence of the agent's actions and leads to a final state. It then equals the final result obtained.
Previously we defined the optimal policy based on the expected utility of a sequence of states. But determining the optimal policy depends on one important factor: do we have an infinite time horizon, or is it limited to some finite number of steps? In the latter case the specific horizon value will likely affect the optimal policy; if so, we say the optimal policy is nonstationary. For infinite-horizon problems the optimal policy is stationary.
Computing optimal policies for finite-horizon problems is harder, and we will consider only infinite-horizon problems.
With discounting, the utility of an infinite state sequence is defined as U_h([s_0, s_1, s_2, ...]) = Σ_t γ^t R(s_t), where 0 ≤ γ < 1 is the discount factor and R(s) is the reward in state s. For γ < 1 and R ≤ Rmax the utilities so defined are always finite.
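This follows from a standard geometric-series bound (not spelled out in the notes): assuming R(s) ≤ Rmax for every state and 0 ≤ γ < 1,

U_h([s_0, s_1, s_2, ...]) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ Σ_{t=0}^{∞} γ^t Rmax = Rmax / (1 − γ)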
In the case of infinite action sequences other approaches are possible besides discounting. For example, the average reward per step can be used as the utility of a sequence.
It turns out that in many practical cases the utility functions are additive. For example, when considering the cost function in state space search, we implicitly assumed that it was additive: the cost incurred in a state was simply the cost in the previous state plus the cost of the move.
[Figure: the 4×3 environment with computed state utilities (e.g. 0.762 at (1,2) and 0.660 at (3,2)) next to the same environment with only the terminal rewards +1 and −1 marked.]
However, in MDP problems states do not have utilities, except for the final states. The "utility" of an intermediate state depends on the agent's policy, i.e. on what it intends to do in that state. At the same time, the agent's policy depends on the "utilities" of the states.
We can therefore introduce state utilities based on policies.
The utility of a state s under a policy π is the expected discounted sum of rewards obtained when starting from s and executing π:

U^π(s) = E[ Σ_t γ^t R(S_t) ]

where S_t denotes the random variable signifying the state the agent will be in at step t after starting from state s and executing the policy π.
For the utility of a state U(s) we will take its utility computed with respect to the optimal policy: U(s) = U^{π*}(s).
These utilities satisfy the Bellman equation:

U(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) U(s′)

where P(s′|s,a) is the probability that the agent will reach the state s′ if she executes the action a in the state s.
If in some problem the final states were achieved with known utilities in exactly
n steps, then we could solve the Bellman equation by first determining the
utilities for the states at step n − 1, then at step n − 2, etc., until reaching the
start state. Problems of such type are called n-step decision problems, and
solving them is relatively easy.
For problems which cannot be stated as such n-step decision problems, we can compute approximate values of the state utilities with an iterative procedure called value iteration:
U_{t+1}(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) U_t(s′)
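The update translates directly into code. Below is a minimal sketch for the 4×3 environment (not from the notes; the step reward R(s) = −0.04 for non-terminal states and γ = 1 are illustrative assumptions), which iterates the update until the largest change falls below a simple threshold. With these parameters the computed utilities should come out close to the values shown in the earlier figure (e.g. 0.762 for (1,2)).

# Value iteration for the 4x3 grid world: a minimal sketch. The 0.8/0.1/0.1
# transition model follows the figures above; the step reward R(s) = -0.04
# for non-terminal states and gamma = 1 are illustrative assumptions.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
SLIPS = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}

def step(s, d):
    x, y = s
    dx, dy = MOVES[d]
    n = (x + dx, y + dy)
    return s if n in WALLS or not (1 <= n[0] <= 4 and 1 <= n[1] <= 3) else n

def transitions(s, a):
    left, right = SLIPS[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

def R(s):
    return TERMINALS.get(s, -0.04)

def value_iteration(gamma=1.0, eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = R(s)        # a terminal state keeps its reward
            else:
                U_new[s] = R(s) + gamma * max(
                    sum(p * U[s2] for p, s2 in transitions(s, a)) for a in MOVES)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                # simple convergence threshold
            return U

for s, u in sorted(value_iteration().items()):
    print(s, round(u, 3))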
[Figure: utility estimates for selected states, (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2), plotted against the number of value iterations (0 to 30).]
As we saw in the example, the value iteration procedure converged nicely in all states. The question is: is it always this way?
It turns out that it is. The value iteration algorithm always leads to stable values of the state utilities, which are the unique solution of the Bellman equations.
The number of iterations of the algorithm needed to reach an arbitrary error level ε is given by the following formula, where Rmax is the upper bound on the reward values:

N = ⌈ log(2Rmax / (ε(1 − γ))) / log(1/γ) ⌉
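For example, for illustrative parameter values (chosen arbitrarily, not taken from the notes) the bound can be evaluated as:

from math import ceil, log

# Illustrative parameter values (chosen arbitrarily, not from the notes).
Rmax, gamma, eps = 1.0, 0.9, 0.001
N = ceil(log(2 * Rmax / (eps * (1 - gamma))) / log(1 / gamma))
print(N)   # iterations sufficient to guarantee an error below eps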
• In practice, the following termination condition can be used for the value iteration: ||U_{i+1} − U_i|| < ε(1 − γ)/γ
• In practice, the optimal policy is often reached much earlier than the point at which the utility values stabilize to within the desired small error.
Precisely because the optimal policy is often relatively insensitive to the specific values of the utilities, it can be computed by a similar iterative process, called policy iteration. It works by selecting an arbitrary initial policy π_0 and initial utilities, and then computing the utilities determined by the current policy, according to the following formula:
U_{t+1}(s) = R(s) + γ Σ_{s′} P(s′|s, π_t(s)) U_{t+1}(s′)
alternating it with a subsequent update of the policy:

π_{t+1}(s) = argmax_a Σ_{s′} P(s′|s,a) U_{t+1}(s′)
In the above formulas π_t(s) denotes the action designated by the current policy for the state s. The first formula gives a set of linear equations, which can be solved exactly for U_{t+1} in O(n³) time (yielding the exact utilities for the current approximate policy).
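A minimal sketch of this scheme in Python with numpy (my own helper names; the 0.8/0.1/0.1 model of the 4×3 world as before, with R(s) = −0.04 and γ = 0.95 as arbitrary illustrative choices; a discount strictly below 1 keeps the linear system nonsingular for every policy):

import numpy as np

# Policy iteration with exact policy evaluation for the 4x3 grid world.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
IDX = {s: i for i, s in enumerate(STATES)}
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
SLIPS = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}
GAMMA = 0.95

def step(s, d):
    x, y = s
    dx, dy = MOVES[d]
    n = (x + dx, y + dy)
    return s if n in WALLS or not (1 <= n[0] <= 4 and 1 <= n[1] <= 3) else n

def transitions(s, a):
    left, right = SLIPS[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

def R(s):
    return TERMINALS.get(s, -0.04)

def evaluate(policy):
    # Solve U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') exactly (O(n^3)).
    n = len(STATES)
    A, b = np.eye(n), np.zeros(n)
    for s in STATES:
        i = IDX[s]
        b[i] = R(s)
        if s in TERMINALS:
            continue               # a terminal state keeps just its reward
        for p, s2 in transitions(s, policy[s]):
            A[i, IDX[s2]] -= GAMMA * p
    return dict(zip(STATES, np.linalg.solve(A, b)))

def improve(U):
    # pi(s) = argmax_a sum_s' P(s'|s,a) U(s')
    return {s: max(MOVES, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

policy = {s: 'U' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
while True:
    U = evaluate(policy)
    new_policy = improve(U)
    if new_policy == policy:
        break
    policy = new_policy
print(policy)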
The policy iteration algorithm terminates when the policy update step makes no change. Since for a finite state space there exists only a finite number of policies, the algorithm is certain to terminate.
For small state spaces (n being the number of states in the O(n³) bound) the above procedure is often the most efficient. For large state spaces, however, the O(n³) cost makes it run very slowly. In such cases a modified policy iteration can be used, which works by iteratively updating the utilities, instead of computing them exactly each time, using a simplified Bellman update given by the formula:
U_{t+1}(s) = R(s) + γ Σ_{s′} P(s′|s, π_t(s)) U_t(s′)
Compared with the original Bellman equation, the maximization over actions has been dropped, since the actions are determined by the current policy. This procedure is thus simpler, and several such update steps can be made before the next policy iteration step (updating the policy).
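A sketch of such a simplified evaluation step as a standalone helper (the argument names mirror the earlier sketches; the number of inner updates k is an arbitrary choice):

def modified_policy_evaluation(U, policy, states, terminals, R, transitions,
                               gamma=0.95, k=5):
    # Perform k simplified Bellman updates
    #   U(s) <- R(s) + gamma * sum_s' P(s'|s, pi(s)) U(s')
    # instead of solving the linear system exactly.
    U = dict(U)
    for _ in range(k):
        U = {s: R(s) if s in terminals
                else R(s) + gamma * sum(p * U[s2]
                                        for p, s2 in transitions(s, policy[s]))
             for s in states}
    return U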
In the general case the agent may not be able to determine the state it ended up in after taking an action, or rather can only determine it with a certain probability. Such problems are called partially observable Markov decision problems (POMDP). In these problems the agent must compute the expected utility of its actions, taking into account both their various possible outcomes and the various pieces of new information (still incomplete) that it may acquire, depending on the state it ends up in.
The task is to compute a policy that allows the agent to reach the goal with the highest probability. In the course of acting, the agent will change her belief state, due both to the newly received information and to the actions she executes herself.
The key to solving a POMDP is the observation that the optimal action depends only on the agent's belief state. Since the agent does not know her actual state (and in fact will never learn it), her optimal policy must be a mapping π*(b) from belief states to actions. The subsequent belief states can be computed using the formula:
b′(s′) = α P(e|s′) Σ_s P(s′|s,a) b(s)
where P(e|s′) is the probability of receiving the observation e in state s′, and α is an auxiliary constant normalizing the new belief state to sum to 1.
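A minimal sketch of this update in Python (the sensor model P_e and the transitions helper are hypothetical stand-ins in the spirit of the earlier sketches; note that the normalizing sum is exactly P(e|a,b), as derived later):

def belief_update(b, a, e, states, transitions, P_e):
    # b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) b(s)
    #   b           -- current belief: dict mapping state -> probability
    #   transitions -- transitions(s, a) -> list of (probability, next state)
    #   P_e         -- hypothetical sensor model: P_e(e, s2) = P(e|s2)
    unnorm = {}
    for s2 in states:
        predicted = sum(p * b.get(s, 0.0)
                        for s in states
                        for p, nxt in transitions(s, a) if nxt == s2)
        unnorm[s2] = P_e(e, s2) * predicted
    total = sum(unnorm.values())   # this normalizer is exactly P(e|a,b)
    return {s2: v / total for s2, v in unnorm.items()}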
The work cycle of a POMDP agent, assuming she has already computed her complete optimal policy π*(b), is then as follows:
• execute the action a = π*(b) designated for the current belief state b,
• receive the observation e,
• update the belief state b according to the formula above, and repeat.
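As a sketch in code (all names here are hypothetical stand-ins: pi_star is the precomputed policy over belief states, execute and observe interact with the environment, and belief_update implements the formula above):

def pomdp_agent_loop(b, pi_star, execute, observe, belief_update, steps=100):
    # pi_star(b): the precomputed optimal policy over belief states
    # execute(a): perform the action (the resulting state is not observed)
    # observe():  return the new, possibly incomplete, observation
    for _ in range(steps):
        a = pi_star(b)                 # act according to the current belief
        execute(a)
        e = observe()
        b = belief_update(b, a, e)     # update the belief as in the formula above
    return b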
Since the MDP model operates on probability distributions and allows such problems to be solved, a POMDP can be transformed into an equivalent MDP defined over the belief space. In this space we work with the probability P(b′|b,a) that the agent reaches the belief state b′ when she currently has the belief state b and executes the action a. For a problem with n states, the beliefs b are n-element real-valued vectors.
Note that the belief state space obtained this way is a continuous space, unlike the original state space. Furthermore, it is typically multi-dimensional; for example, for the 4×3 environment it is 11-dimensional.
The value iteration and policy iteration algorithms given above are not directly applicable to such problems. Solving them is computationally very hard (PSPACE-hard).
P(e|a,b) = Σ_{s′} P(e|a,s′,b) P(s′|a,b)
         = Σ_{s′} P(e|s′) P(s′|a,b)
         = Σ_{s′} P(e|s′) Σ_s P(s′|s,a) b(s)

P(b′|b,a) = P(b′|a,b) = Σ_e P(b′|e,a,b) P(e|a,b)
          = Σ_e P(b′|e,a,b) Σ_{s′} P(e|s′) Σ_s P(s′|s,a) b(s)

where

P(b′|e,a,b) = 1 if b′(s′) = α P(e|s′) Σ_s P(s′|s,a) b(s), and 0 otherwise
and all the elements defined above constitute a totally observable Markov
decision process (MDP) over the belief state space.
It can be proved that the optimal policy π*(b) for this MDP is also the optimal policy for the original POMDP problem.
A sketch of the algorithm: we define a policy π(b) over regions of the belief space, where within each region the policy designates a single action. An iterative process analogous to value or policy iteration then updates the region boundaries, and may introduce new regions.
The optimal policy computed with this algorithm for the above example is:
[ L, U, U, R, U, U, (R, U, U)* ]
(the cyclically repeating R-U-U sequence is necessary because, under the uncertainty, the agent can never be sure of having reached the terminal state). The expected value of this solution is 0.38, which is significantly better than for the naive policy proposed earlier (0.08).