Markov Decision Process
EXAMPLE: an agent starts in the field marked START of a 4×3 grid and can move in any direction between the fields. Its run ends when it reaches one of the fields (4,2) or (4,3), with the result (−1 or +1, respectively) marked in those fields.
[Figure: the 4×3 grid environment, with START at (1,1), an inaccessible field at (2,2), reward +1 at (4,3) and −1 at (4,2).]
If the problem were fully deterministic, and the agent's knowledge of its position complete, then the problem would reduce to action planning. For example, for the environment above the correct solution would be the action plan: U-U-R-R-R. Equally good would be the plan: R-R-U-U-R. If the individual actions did not cost anything (i.e. only the final state mattered), then equally good would also be the plan R-R-R-L-L-L-U-U-R-R-R, and many others.
[Figure: the stochastic transition model; each action moves the agent in the intended direction with probability 0.8, and sideways (perpendicular to the intended direction) with probability 0.1 to each side.]
With this transition model we can compute the expected values of sequences of the agent's moves. In general there is no guarantee that, after executing any of the above sequences, the agent will indeed end up in the desired terminal state.
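To make this concrete, here is a minimal Python sketch (the helper names are mine, not from the notes) that propagates the probability distribution over states while executing the fixed plan U-U-R-R-R under the 0.8/0.1/0.1 transition model, with the obstacle at (2,2) and the terminal fields (4,2) and (4,3):

# Propagate the state distribution for the fixed plan U-U-R-R-R in the 4x3
# grid (obstacle at (2,2), terminal fields (4,2) and (4,3)), assuming the
# 0.8/0.1/0.1 transition model described above.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1, (4, 2): -1}
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
NOISE = {a: [(0.8, a)] for a in MOVES}          # intended direction
NOISE['U'] += [(0.1, 'L'), (0.1, 'R')]          # sideways slips
NOISE['D'] += [(0.1, 'L'), (0.1, 'R')]
NOISE['L'] += [(0.1, 'U'), (0.1, 'D')]
NOISE['R'] += [(0.1, 'U'), (0.1, 'D')]

def step(state, direction):
    # Deterministic effect of one move; bumping into the obstacle or the
    # border leaves the agent in place.
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    if nxt in WALLS or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

def run_plan(plan, start=(1, 1)):
    # Probability distribution over states after executing the fixed plan.
    dist = {start: 1.0}
    for action in plan:
        new_dist = {}
        for state, p in dist.items():
            if state in TERMINALS:              # the run has already ended here
                new_dist[state] = new_dist.get(state, 0.0) + p
                continue
            for q, actual in NOISE[action]:
                s2 = step(state, actual)
                new_dist[s2] = new_dist.get(s2, 0.0) + p * q
        dist = new_dist
    return dist

dist = run_plan("UURRR")
print("P(ending in (4,3)) =", dist.get((4, 3), 0.0))
print("expected result (unfinished runs counted as 0) =",
      sum(p * TERMINALS.get(s, 0) for s, p in dist.items()))

Running this shows that even the "correct" plan U-U-R-R-R ends in (4,3) with a probability of only about 1/3.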
In contrast to action planning algorithms, here the agent should work out its strategy not as a specific sequence of actions, but as a policy: a scheme determining the action the agent should take in any specific state it may find itself in.
[Figure: a conservative optimal policy for the 4×3 environment, marked with an arrow in each non-terminal field.]
This policy obviously results from the assumption that moves cost nothing. If the agent's outcome depended not only on the final state, but also on the number of moves, then such a conservative policy would probably no longer be optimal.
Computing such a policy, i.e. a complete mapping from states to actions, is called a Markov decision problem (MDP) if the probabilities of the transitions resulting from the agent's actions depend only on the agent's current state and not on its history. Such problems are said to have the Markov property.
The solution to an MDP is a policy π(s) mapping states to actions.
Note that under uncertainty each pass of the agent through the environment according to the policy may result in a different sequence of states, and possibly a different outcome. The optimal policy π*(s) is the policy achieving the greatest expected utility.
In MDP problems states do not have utilities, except for the terminal states. We can, however, speak of the utility of a sequence (history) of states U_h([s_0, s_1, ..., s_n]) if it corresponds to an actual sequence of the agent's actions and leads to a final state. It then equals the final result obtained.
Previously we defined the optimal policy based on the expected utility of a sequence of states. But determining the optimal policy depends on one important factor: do we have an infinite time horizon, or is it limited to some finite number of steps? In the latter case the specific horizon value will likely affect the optimal policy; if so, we say the optimal policy is nonstationary. For infinite-horizon problems the optimal policy is stationary.
Computing optimal policies for finite-horizon problems is harder, and we will consider only infinite-horizon problems.
With discounting, the utility of an infinite state sequence is defined as U_h([s_0, s_1, s_2, ...]) = Σ_t γ^t R(s_t), where 0 ≤ γ < 1 is the discount factor and R(s) is the reward in state s. For γ < 1 and R ≤ Rmax the utilities so defined are always finite.
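This follows from a standard geometric-series bound (not spelled out in the notes): assuming R(s) ≤ Rmax for every state and 0 ≤ γ < 1,

U_h([s_0, s_1, s_2, ...]) = Σ_{t=0}^{∞} γ^t R(s_t) ≤ Σ_{t=0}^{∞} γ^t Rmax = Rmax / (1 − γ)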
In the case of infinite action sequences other approaches are possible besides discounting. For example, the average reward per step can be used as the utility of a sequence.
It turns out that in many practical cases the utility functions are additive. For example, when considering the cost function in state space search, we implicitly assumed that it was additive: the cost incurred in a state was simply the cost in the previous state plus the cost of the move.
[Figure: the 4×3 environment with computed state utilities (e.g. 0.762 at (1,2) and 0.660 at (3,2)) next to the same environment with only the terminal rewards +1 and −1 marked.]
However, in MDP problems states do not have utilities, except for the final states. The "utility" of an intermediate state depends on the agent's policy, i.e. on what it intends to do in that state. At the same time, the agent's policy depends on the "utilities" of the states.
We can therefore introduce state utilities based on policies.
The utility of a state s under a policy π is the expected discounted sum of rewards obtained when starting from s and executing π:

U^π(s) = E[ Σ_t γ^t R(S_t) ]

where S_t denotes the random variable signifying the state the agent will be in at step t after starting from state s and executing the policy π.
For the utility of a state U(s) we will take its utility computed with respect to the optimal policy: U(s) = U^{π*}(s).
These utilities satisfy the Bellman equation:

U(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) U(s′)

where P(s′|s,a) is the probability that the agent will reach the state s′ if she executes the action a in the state s.
If in some problem the final states were achieved with known utilities in exactly
n steps, then we could solve the Bellman equation by first determining the
utilities for the states at step n − 1, then at step n − 2, etc., until reaching the
start state. Problems of such type are called n-step decision problems, and
solving them is relatively easy.
For problems which cannot be stated as such n-step decision problems, we can compute approximate values of the state utilities with an iterative procedure called value iteration:
U_{t+1}(s) = R(s) + γ max_a Σ_{s′} P(s′|s,a) U_t(s′)
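The update translates directly into code. Below is a minimal sketch for the 4×3 environment (not from the notes; the step reward R(s) = −0.04 for non-terminal states and γ = 1 are illustrative assumptions), which iterates the update until the largest change falls below a simple threshold. With these parameters the computed utilities should come out close to the values shown in the earlier figure (e.g. 0.762 for (1,2)).

# Value iteration for the 4x3 grid world: a minimal sketch. The 0.8/0.1/0.1
# transition model follows the figures above; the step reward R(s) = -0.04
# for non-terminal states and gamma = 1 are illustrative assumptions.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
SLIPS = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}

def step(s, d):
    x, y = s
    dx, dy = MOVES[d]
    n = (x + dx, y + dy)
    return s if n in WALLS or not (1 <= n[0] <= 4 and 1 <= n[1] <= 3) else n

def transitions(s, a):
    left, right = SLIPS[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

def R(s):
    return TERMINALS.get(s, -0.04)

def value_iteration(gamma=1.0, eps=1e-6):
    U = {s: 0.0 for s in STATES}
    while True:
        U_new, delta = {}, 0.0
        for s in STATES:
            if s in TERMINALS:
                U_new[s] = R(s)        # a terminal state keeps its reward
            else:
                U_new[s] = R(s) + gamma * max(
                    sum(p * U[s2] for p, s2 in transitions(s, a)) for a in MOVES)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                # simple convergence threshold
            return U

for s, u in sorted(value_iteration().items()):
    print(s, round(u, 3))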
[Figure: utility estimates for selected states, (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), (4,2), plotted against the number of value iterations (0 to 30).]
As we saw in the example, the value iteration procedure converged nicely in all states. The question is: is it always this way?
It turns out that it is. The value iteration algorithm always leads to stable values of the state utilities, which are the unique solution of the Bellman equations.
The number of iterations of the algorithm needed to reach an arbitrary error level ε is given by the following formula, where Rmax is the upper bound on the reward values:

N = ⌈ log(2Rmax / (ε(1 − γ))) / log(1/γ) ⌉
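For example, for illustrative parameter values (chosen arbitrarily, not taken from the notes) the bound can be evaluated as:

from math import ceil, log

# Illustrative parameter values (chosen arbitrarily, not from the notes).
Rmax, gamma, eps = 1.0, 0.9, 0.001
N = ceil(log(2 * Rmax / (eps * (1 - gamma))) / log(1 / gamma))
print(N)   # iterations sufficient to guarantee an error below eps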
• In practice, the following termination condition can be used for the value iteration: ||U_{i+1} − U_i|| < ε(1 − γ)/γ
• In practice, the optimal policy is often reached much earlier than the point at which the utility values stabilize to within the desired small error.
Precisely because the optimal policy is often relatively insensitive to the specific values of the utilities, it can be computed by a similar iterative process, called policy iteration. It works by selecting an arbitrary initial policy π_0 and initial utilities, and then computing the utilities determined by the current policy, according to the following formula:
U_{t+1}(s) = R(s) + γ Σ_{s′} P(s′|s, π_t(s)) U_{t+1}(s′)
alternating it with a subsequent update of the policy:

π_{t+1}(s) = argmax_a Σ_{s′} P(s′|s,a) U_{t+1}(s′)
In the above formulas π_t(s) denotes the action designated by the current policy for the state s. The first formula gives a set of linear equations, which can be solved exactly for U_{t+1} in O(n³) time (yielding the exact utilities for the current approximate policy).
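A minimal sketch of this scheme in Python with numpy (my own helper names; the 0.8/0.1/0.1 model of the 4×3 world as before, with R(s) = −0.04 and γ = 0.95 as arbitrary illustrative choices; a discount strictly below 1 keeps the linear system nonsingular for every policy):

import numpy as np

# Policy iteration with exact policy evaluation for the 4x3 grid world.
WALLS = {(2, 2)}
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
STATES = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) not in WALLS]
IDX = {s: i for i, s in enumerate(STATES)}
MOVES = {'U': (0, 1), 'D': (0, -1), 'L': (-1, 0), 'R': (1, 0)}
SLIPS = {'U': ('L', 'R'), 'D': ('L', 'R'), 'L': ('U', 'D'), 'R': ('U', 'D')}
GAMMA = 0.95

def step(s, d):
    x, y = s
    dx, dy = MOVES[d]
    n = (x + dx, y + dy)
    return s if n in WALLS or not (1 <= n[0] <= 4 and 1 <= n[1] <= 3) else n

def transitions(s, a):
    left, right = SLIPS[a]
    return [(0.8, step(s, a)), (0.1, step(s, left)), (0.1, step(s, right))]

def R(s):
    return TERMINALS.get(s, -0.04)

def evaluate(policy):
    # Solve U(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) U(s') exactly (O(n^3)).
    n = len(STATES)
    A, b = np.eye(n), np.zeros(n)
    for s in STATES:
        i = IDX[s]
        b[i] = R(s)
        if s in TERMINALS:
            continue               # a terminal state keeps just its reward
        for p, s2 in transitions(s, policy[s]):
            A[i, IDX[s2]] -= GAMMA * p
    return dict(zip(STATES, np.linalg.solve(A, b)))

def improve(U):
    # pi(s) = argmax_a sum_s' P(s'|s,a) U(s')
    return {s: max(MOVES, key=lambda a: sum(p * U[s2] for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

policy = {s: 'U' for s in STATES if s not in TERMINALS}   # arbitrary initial policy
while True:
    U = evaluate(policy)
    new_policy = improve(U)
    if new_policy == policy:
        break
    policy = new_policy
print(policy)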
The policy iteration algorithm terminates when the policy update step makes no change. Since for a finite state space there exists only a finite number of policies, the algorithm is certain to terminate.
For small state spaces (n being the number of states in the O(n³) bound) the above procedure is often the most efficient. For large state spaces, however, the O(n³) cost makes it run very slowly. In such cases a modified policy iteration can be used, which works by iteratively updating the utilities, instead of computing them exactly each time, using a simplified Bellman update given by the formula:
U_{t+1}(s) = R(s) + γ Σ_{s′} P(s′|s, π_t(s)) U_t(s′)
Compared with the original Bellman equation, the maximization over actions has been dropped, since the actions are determined by the current policy. This procedure is thus simpler, and several such update steps can be made before the next policy iteration step (updating the policy).
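A sketch of such a simplified evaluation step as a standalone helper (the argument names mirror the earlier sketches; the number of inner updates k is an arbitrary choice):

def modified_policy_evaluation(U, policy, states, terminals, R, transitions,
                               gamma=0.95, k=5):
    # Perform k simplified Bellman updates
    #   U(s) <- R(s) + gamma * sum_s' P(s'|s, pi(s)) U(s')
    # instead of solving the linear system exactly.
    U = dict(U)
    for _ in range(k):
        U = {s: R(s) if s in terminals
                else R(s) + gamma * sum(p * U[s2]
                                        for p, s2 in transitions(s, policy[s]))
             for s in states}
    return U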
In the general case the agent may not be able to determine the state it ended up in after taking an action, or rather can only determine it with a certain probability. Such problems are called partially observable Markov decision problems (POMDP). In these problems the agent must compute the expected utility of its actions, taking into account both their various possible outcomes and the various pieces of new information (still incomplete) that it may acquire, depending on the state it ends up in.
The task is to compute a policy that allows the agent to reach the goal with the highest probability. In the course of acting, the agent will change her belief state, due both to the newly received information and to the actions she executes herself.
The key to solving a POMDP is the observation that the optimal action depends only on the agent's belief state. Since the agent does not know her actual state (and in fact will never learn it), her optimal policy must be a mapping π*(b) from belief states to actions. The subsequent belief states can be computed using the formula:
b′(s′) = α P(e|s′) Σ_s P(s′|s,a) b(s)
where P(e|s′) is the probability of receiving the observation e in state s′, and α is an auxiliary constant normalizing the new belief state to sum to 1.
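A minimal sketch of this update in Python (the sensor model P_e and the transitions helper are hypothetical stand-ins in the spirit of the earlier sketches; note that the normalizing sum is exactly P(e|a,b), as derived later):

def belief_update(b, a, e, states, transitions, P_e):
    # b'(s') = alpha * P(e|s') * sum_s P(s'|s,a) b(s)
    #   b           -- current belief: dict mapping state -> probability
    #   transitions -- transitions(s, a) -> list of (probability, next state)
    #   P_e         -- hypothetical sensor model: P_e(e, s2) = P(e|s2)
    unnorm = {}
    for s2 in states:
        predicted = sum(p * b.get(s, 0.0)
                        for s in states
                        for p, nxt in transitions(s, a) if nxt == s2)
        unnorm[s2] = P_e(e, s2) * predicted
    total = sum(unnorm.values())   # this normalizer is exactly P(e|a,b)
    return {s2: v / total for s2, v in unnorm.items()}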
The work cycle of a POMDP agent, assuming she has already computed her complete optimal policy π*(b), is then as follows:
• execute the action a = π*(b) designated for the current belief state b,
• receive the observation e,
• update the belief state b according to the formula above, and repeat.
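As a sketch in code (all names here are hypothetical stand-ins: pi_star is the precomputed policy over belief states, execute and observe interact with the environment, and belief_update implements the formula above):

def pomdp_agent_loop(b, pi_star, execute, observe, belief_update, steps=100):
    # pi_star(b): the precomputed optimal policy over belief states
    # execute(a): perform the action (the resulting state is not observed)
    # observe():  return the new, possibly incomplete, observation
    for _ in range(steps):
        a = pi_star(b)                 # act according to the current belief
        execute(a)
        e = observe()
        b = belief_update(b, a, e)     # update the belief as in the formula above
    return b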
Since the MDP model operates on probability distributions and allows such problems to be solved, a POMDP can be transformed into an equivalent MDP defined over the belief space. In this space we work with the probability P(b′|b,a) that the agent reaches the belief state b′ when she currently has the belief state b and executes the action a. For a problem with n states, the beliefs b are n-element real-valued vectors.
Note that the belief state space obtained this way is a continuous space, unlike the original state space. Furthermore, it is typically multi-dimensional; for example, for the 4×3 environment it is 11-dimensional.
The value iteration and policy iteration algorithms given above are not directly applicable to such problems. Solving them is computationally very hard (PSPACE-hard).
P(e|a,b) = Σ_{s′} P(e|a,s′,b) P(s′|a,b)
         = Σ_{s′} P(e|s′) P(s′|a,b)
         = Σ_{s′} P(e|s′) Σ_s P(s′|s,a) b(s)

P(b′|b,a) = P(b′|a,b) = Σ_e P(b′|e,a,b) P(e|a,b)
          = Σ_e P(b′|e,a,b) Σ_{s′} P(e|s′) Σ_s P(s′|s,a) b(s)

where

P(b′|e,a,b) = 1 if b′(s′) = α P(e|s′) Σ_s P(s′|s,a) b(s), and 0 otherwise
and all the elements defined above constitute a totally observable Markov
decision process (MDP) over the belief state space.
It can be proved that the optimal policy π*(b) for this MDP is also the optimal policy for the original POMDP problem.
A sketch of the algorithm: we define a policy π(b) over regions of the belief space, where within each region the policy designates a single action. An iterative process analogous to value or policy iteration then updates the region boundaries, and may introduce new regions.
The optimal policy computed with this algorithm for the above example is:
[ L, U, U, R, U, U, (R, U, U)* ]
(the cyclically repeating R-U-U sequence is necessary because, under the uncertainty, the agent can never be sure of having reached the terminal state). The expected value of this solution is 0.38, which is significantly better than for the naive policy proposed earlier (0.08).