
Unit 03: The Reinforcement Learning Problem &

Unit 04: Finite Markov Decision Processes


Syllabus: The Agent–Environment Interface, Goals and Rewards, Returns, Unified Notation for
Episodic and Continuing Tasks, Value Functions, Optimal Value Functions, Optimality and
Approximation

3.1 The Agent–Environment Interface


MDPs are meant to be a straightforward framing of the problem of learning from interaction to achieve a
goal. The learner and decision maker is called the agent. The thing it interacts with, comprising everything
outside the agent, is called the environment.
These interact continually, the agent selecting actions and the environment responding to these actions and
presenting new situations to the agent. The environment also gives rise to rewards, special numerical values
that the agent seeks to maximize over time through its choice of actions.

More specifically, the agent and environment interact at each of a sequence of discrete time steps, t = 0, 1, 2, 3, . . .. At each time step t, the agent receives some representation of the environment’s state, St ∊ S, and on that basis selects an action, At ∊ A(s). One time step later, in part as a consequence of its action, the agent receives a numerical reward, Rt+1 ∊ R ⊂ ℝ, and finds itself in a new state, St+1. The MDP and agent together thereby give rise to a sequence or trajectory that begins like this:

S0, A0, R1, S1, A1, R2, S2, A2, R3, . . .
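The interaction loop can be sketched in a few lines of code. This is only an illustrative sketch: the env and agent objects and their reset, step, and select_action methods are assumptions made for the example, not part of any particular library.

# A minimal sketch of the agent–environment loop. The env and agent objects
# and their method names are hypothetical.
def run_episode(env, agent, max_steps=1000):
    trajectory = []                                   # holds (S_t, A_t, R_{t+1}) triples
    state = env.reset()                               # S_0
    for t in range(max_steps):
        action = agent.select_action(state)           # A_t drawn from the agent's policy
        next_state, reward, done = env.step(action)   # environment responds with R_{t+1}, S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # terminal state reached
            break
    return trajectory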

In a finite MDP, the sets of states, actions, and rewards (S, A, and R) all have a finite number of elements.
In this case, the random variables Rt and St have well-defined discrete probability distributions that depend only on the preceding state and action. That is, for particular values s′ ∊ S and r ∊ R of these random variables, there is a probability of those values occurring at time t, given particular values of the preceding state and action:

p(s′, r | s, a) ≐ Pr{St = s′, Rt = r | St−1 = s, At−1 = a}, for all s′, s ∊ S, r ∊ R, and a ∊ A(s).

The function p defines the dynamics of the MDP. The dot over the equals sign in the equation reminds us that it is a definition (in this case of the function p). The dynamics function p : S × R × S × A → [0, 1] is an ordinary deterministic function of four arguments. The ‘|’ in the middle of it comes from the notation for conditional probability, but here it just reminds us that p specifies a probability distribution for each choice of s and a, that is, that

Σs′∊S Σr∊R p(s′, r | s, a) = 1, for all s ∊ S, a ∊ A(s).

In a Markov decision process, the probabilities given by p completely characterize the environment’s
dynamics. That is, the probability of each possible value for St and Rt depends on the immediately preceding
state and action, St−1 and At−1, and, given them, not at all on earlier states and actions.
The state must include information about all aspects of the past agent–environment interaction that make a
difference for the future. If it does, then the state is said to have the Markov property.
From the four-argument dynamics function, p, one can compute anything else one might want to know about the environment, such as the state-transition probabilities (which we denote, with a slight abuse of notation, as a three-argument function p : S × S × A → [0, 1]):

p(s′ | s, a) ≐ Pr{St = s′ | St−1 = s, At−1 = a} = Σr∊R p(s′, r | s, a).

We can also compute the expected rewards for state–action pairs as a two-argument function r : S × A → ℝ:

r(s, a) ≐ E[Rt | St−1 = s, At−1 = a] = Σr∊R r Σs′∊S p(s′, r | s, a),

and the expected rewards for state–action–next-state triples as a three-argument function r : S × A × S → ℝ:

r(s, a, s′) ≐ E[Rt | St−1 = s, At−1 = a, St = s′] = Σr∊R r · p(s′, r | s, a) / p(s′ | s, a).
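As a concrete illustration, the sketch below stores the four-argument dynamics p for a tiny made-up two-state MDP as a dictionary and derives the three quantities above from it. The dictionary layout and the example MDP are assumptions made for this sketch.

# p[(s, a)] lists (next_state, reward, probability) triples for a hypothetical MDP.
p = {
    ("s0", "go"):   [("s0", 0.0, 0.3), ("s1", 1.0, 0.7)],
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
}

def transition_prob(p, s, a, s_next):
    # p(s'|s,a) = sum over r of p(s', r | s, a)
    return sum(prob for ns, r, prob in p[(s, a)] if ns == s_next)

def expected_reward(p, s, a):
    # r(s,a) = sum over s', r of r * p(s', r | s, a)
    return sum(r * prob for ns, r, prob in p[(s, a)])

def expected_reward_triple(p, s, a, s_next):
    # r(s,a,s') = sum over r of r * p(s', r | s, a) / p(s'|s,a)
    return sum(r * prob for ns, r, prob in p[(s, a)] if ns == s_next) / transition_prob(p, s, a, s_next)

print(transition_prob(p, "s0", "go", "s1"))         # 0.7
print(expected_reward(p, "s0", "go"))               # 0.7
print(expected_reward_triple(p, "s0", "go", "s1"))  # 1.0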

The MDP framework is abstract and flexible and can be applied to many different problems in many different
ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary
successive stages of decision making and acting.
Actions can be any decisions we want to learn how to make, and states can be anything we can know that might
be useful in making them.
In particular, the boundary between agent and environment is typically not the same as the physical boundary
of a robot’s or an animal’s body. Usually, the boundary is drawn closer to the agent than that.
The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be
outside of it and thus part of its environment. We do not assume that everything in the environment is
unknown to the agent.
In fact, in some cases the agent may know everything about how its environment works and still face a difficult
reinforcement learning task, just as we may know exactly how a puzzle like Rubik’s cube works, but still be
unable to solve it. The agent–environment boundary represents the limit of the agent’s absolute control, not
of its knowledge.
The agent–environment boundary can be located at different places for different purposes. In a complicated
robot, many different agents may be operating at once, each with its own boundary.
The MDP framework is a considerable abstraction of the problem of goal-directed learning from interaction. It
proposes that whatever the details of the sensory, memory, and control apparatus, and whatever objective one
is trying to achieve, any problem of learning goal-directed behavior can be reduced to three signals passing
back and forth between an agent and its environment: one signal to represent the choices made by the agent
(the actions), one signal to represent the basis on which the choices are made (the states), and one signal to
define the agent’s goal (the rewards). This framework may not be sufficient to represent all decision-learning
problems usefully, but it has proved to be widely useful and applicable.

3.2 Goals and Rewards


In reinforcement learning, the purpose or goal of the agent is formalized in terms of a special signal, called the
reward, passing from the environment to the agent. At each time step, the reward is a simple number, Rt ∊ ℝ.
Informally, the agent’s goal is to maximize the total amount of reward it receives. This means maximizing not immediate reward, but cumulative reward in the long run. This idea is captured by the reward hypothesis: all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).
Although formulating goals in terms of reward signals might at first appear limiting, in practice it has proved
to be flexible and widely applicable.
For example, to make a robot learn to walk, researchers have provided reward on each time step proportional
to the robot’s forward motion. In making a robot learn how to escape from a maze, the reward is often −1 for
every time step that passes prior to escape; this encourages the agent to escape as quickly as possible.
For an agent to learn to play checkers or chess, the natural rewards are +1 for winning, −1 for losing, and 0 for
drawing and for all nonterminal positions.
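The maze and game rewards described above can be written as one-line reward functions; this is only an illustrative sketch with made-up encodings of the outcomes.

def maze_reward(escaped):
    # -1 on every time step that passes prior to escape
    return 0.0 if escaped else -1.0

def game_reward(outcome):
    # +1 for winning, -1 for losing, 0 for draws and nonterminal positions
    return {"win": 1.0, "loss": -1.0, "draw": 0.0, "nonterminal": 0.0}[outcome]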
In the above examples, the agent always learns to maximize its reward. If we want it to do something for us, we
must provide rewards to it in such a way that in maximizing them the agent will also achieve our goals.
It is thus critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward
signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do.
For example, a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals
such as taking its opponent’s pieces or gaining control of the center of the board. If achieving these sorts of
subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal.
The reward signal is your way of communicating to the agent what you want achieved, not how you want it
achieved.

3.3 Returns and Episodes


The agent’s goal is to maximize the cumulative reward it receives in the long run. In general, we seek to
maximize the expected return, where the return, denoted Gt, is defined as some specific function of the reward
sequence. In the simplest case the return is the sum of the rewards:

Gt ≐ Rt+1 + Rt+2 + Rt+3 + · · · + RT,    (3.7)

where T is a final time step. This approach makes sense in applications in which there is a natural notion of
final time step, that is, when the agent–environment interaction breaks naturally into subsequences, which we
call episodes. Each episode ends in a special state called the terminal state, followed by a reset to a standard
starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as
ending in different ways, such as winning and losing a game, the next episode begins independently of how the
previous one ended. Thus, the episodes can all be considered to end in the same terminal state, with different
rewards for the different outcomes. Tasks with episodes of this kind are called episodic tasks.
On the other hand, in many cases the agent–environment interaction does not break naturally into identifiable
episodes, but goes on continually without limit.
We call these continuing tasks. The return formulation (3.7) is problematic for continuing tasks because the
final time step would be T = ∞, and the return, which is what we are trying to maximize, could easily be infinite.
The additional concept that we need is that of discounting. According to this approach, the agent tries to select
actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it
chooses At to maximize the expected discounted return:

Gt ≐ Rt+1 + γRt+2 + γ²Rt+3 + · · · = Σk=0 to ∞ γ^k Rt+k+1,    (3.8)

where γ is a parameter, 0 ≤ γ ≤ 1, called the discount rate.


The discount rate determines the present value of future rewards: a reward received k time steps in the future
is worth only γ^(k−1) times what it would be worth if it were received immediately. If γ < 1, the infinite sum in (3.8)
has a finite value as long as the reward sequence {Rk} is bounded.
If γ = 0, the agent is “myopic” in being concerned only with maximizing immediate rewards: its objective in
this case is to learn how to choose At so as to maximize only Rt+1. If each of the agent’s actions happened to
influence only the immediate reward, not future rewards as well, then a myopic agent could maximize (3.8) by
separately maximizing each immediate reward.
As γ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes
more farsighted. Returns at successive time steps are related to each other in a way that is important for the
theory and algorithms of reinforcement learning:

Gt ≐ Rt+1 + γRt+2 + γ²Rt+3 + γ³Rt+4 + · · ·
   = Rt+1 + γ(Rt+2 + γRt+3 + γ²Rt+4 + · · ·)
   = Rt+1 + γGt+1.    (3.9)

Note that although the return (3.8) is a sum of an infinite number of terms, it is still finite if the reward is nonzero and constant, provided γ < 1. For example, if the reward is a constant +1, then the return is

Gt = Σk=0 to ∞ γ^k = 1 / (1 − γ).
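A short numerical sketch of these definitions follows; the reward sequence and discount rate are made up for the example.

def discounted_return(rewards, gamma):
    # G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma):
    # Same quantity via the recursion G_t = R_{t+1} + gamma * G_{t+1},
    # working backward from the end of the reward sequence.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 5.0]                    # hypothetical R_1, ..., R_T
print(discounted_return(rewards, 0.9))            # 6.355
print(discounted_return_recursive(rewards, 0.9))  # 6.355 (identical)
# A constant reward of +1 with gamma = 0.9 approaches 1 / (1 - 0.9) = 10:
print(discounted_return([1.0] * 1000, 0.9))       # ~10.0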

3.4 Unified Notation for Episodic and Continuing Tasks


We have two kinds of reinforcement learning tasks, one in which the agent–environment interaction naturally
breaks down into a sequence of separate episodes (episodic tasks), and one in which it does not (continuing
tasks). The former case is mathematically easier because each action affects only the finite number of rewards
subsequently received during the episode.
It is therefore useful to establish one notation that enables us to talk precisely about both cases simultaneously.
To be precise about episodic tasks requires some additional notation. Rather than one long sequence of time
steps, we need to consider a series of episodes, each of which consists of a finite sequence of time steps.
We number the time steps of each episode starting anew from zero. Therefore, we have to refer not just to St, the state representation at time t, but to St,i, the state representation at time t of episode i.

We are almost always considering a particular episode, or stating something that is true for all episodes.
Accordingly, in practice we almost always abuse notation slightly by dropping the explicit reference to episode
number. That is, we write St to refer to St,i, and so on.
We need one other convention to obtain a single notation that covers both episodic and continuing tasks. We
have defined the return as a sum over a finite number of terms in one case (3.7) and as a sum over an infinite
number of terms in the other (3.8). These two can be unified by considering episode termination to be the
entering of a special absorbing state that transitions only to itself and that generates only rewards of zero. For
example, consider a state-transition diagram in which the solid square represents the special absorbing state corresponding to the end of an episode. Starting from S0, we get the reward sequence +1, +1, +1, 0, 0, 0, . . . . Summing these, we get the same return whether
we sum over the first T rewards (here T = 3) or over the full infinite sequence. This remains true even if we
introduce discounting.
Thus, we can define the return in general using the convention of omitting episode numbers when they are not needed, and including the possibility that γ = 1 if the sum remains defined (e.g., because all episodes terminate). Alternatively, we can write

Gt ≐ Σk=t+1 to T γ^(k−t−1) Rk,

including the possibility that T = ∞ or γ = 1 (but not both). We use these conventions throughout the rest of
the book to simplify notation and to express the close parallels between episodic and continuing tasks.
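The absorbing-state convention is easy to check numerically: padding an episodic reward sequence with the zero rewards emitted by the absorbing state leaves the return unchanged, with or without discounting. The helper and numbers below are illustrative only.

def discounted_return(rewards, gamma):
    # G_t as the (possibly discounted) sum of a reward sequence
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode = [1.0, 1.0, 1.0]              # rewards up to the terminal state (T = 3)
padded = episode + [0.0] * 100         # absorbing state generates only zero rewards
print(discounted_return(episode, 0.9) == discounted_return(padded, 0.9))   # True
print(discounted_return(episode, 1.0) == discounted_return(padded, 1.0))   # True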

3.5 Policies and Value Functions


Almost all reinforcement learning algorithms involve estimating value functions—functions of states (or of
state–action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform
a given action in a given state).
The notion of “how good” here is defined in terms of future rewards that can be expected, or, to be precise, in
terms of expected return. Accordingly, value functions are defined with respect to particular ways of acting,
called policies.
Formally, a policy is a mapping from states to probabilities of selecting each possible action. If the agent is
following policy π at time t, then π(a|s) is the probability that At = a if St = s. Like p, π is an ordinary function;
the “|” in the middle of π (a|s) merely reminds us that it defines a probability distribution over a ∊ A(s) for each
s ∊ S.
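As a simple illustration, a stochastic policy over a finite MDP can be stored as a table of action probabilities and sampled from directly; the states, actions, and probabilities below are hypothetical.

import random

# pi[s] maps each available action a to the probability pi(a|s); each row sums to 1.
pi = {
    "s0": {"go": 0.8, "stay": 0.2},
    "s1": {"go": 0.5, "stay": 0.5},
}

def sample_action(pi, state):
    # Draw A_t from the distribution pi(.|S_t).
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action(pi, "s0"))   # "go" with probability 0.8, "stay" with probability 0.2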
Reinforcement learning methods specify how the agent’s policy is changed as a result of its experience.
The value of a state s under a policy π, denoted vπ(s), is the expected return when starting in s and following π thereafter. For MDPs, we can define vπ formally by

vπ(s) ≐ Eπ[Gt | St = s] = Eπ[ Σk=0 to ∞ γ^k Rt+k+1 | St = s ], for all s ∊ S,

where Eπ[·] denotes the expected value of a random variable given that the agent follows policy π, and t is any
time step. Note that the value of the terminal state, if any, is always zero. We call the function vπ the state-value
function for policy π.
Similarly, we define the value of taking action a in state s under a policy π, denoted qπ(s, a), as the expected return starting from s, taking the action a, and thereafter following policy π:

qπ(s, a) ≐ Eπ[Gt | St = s, At = a] = Eπ[ Σk=0 to ∞ γ^k Rt+k+1 | St = s, At = a ].

We call qπ the action-value function for policy π.


A fundamental property of value functions used throughout reinforcement learning and dynamic
programming is that they satisfy recursive relationships similar to that which we have already established for
the return (3.9). For any policy π and any state s, the following consistency condition holds between the value
of s and the value of its possible successor states:

vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ], for all s ∊ S.    (3.14)

Equation (3.14) is the Bellman equation for vπ. It expresses a relationship between the value of a state and
the values of its successor states.
The Bellman equation averages over all the possibilities, weighting each by its probability of occurring. It states
that the value of the start state must equal the (discounted) value of the expected next state, plus the reward
expected along the way.
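One standard way to use the Bellman equation (3.14) is iterative policy evaluation: start from arbitrary values and repeatedly apply the right-hand side as an update until the values stop changing. The sketch below assumes the dictionary forms of p and pi used in the earlier examples; it is an illustrative sketch, not a prescribed implementation.

def policy_evaluation(p, pi, states, gamma=0.9, tol=1e-8):
    # p[(s, a)] -> list of (next_state, reward, probability) triples
    # pi[s]     -> dict mapping each action a to pi(a|s)
    v = {s: 0.0 for s in states}                 # arbitrary initial estimate of v_pi
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: average over actions, next states, and rewards
            new_v = sum(
                prob_a * sum(prob * (r + gamma * v[s_next])
                             for s_next, r, prob in p[(s, a)])
                for a, prob_a in pi[s].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# Example with the hypothetical p and pi from the earlier sketches:
# v = policy_evaluation(p, pi, states=["s0", "s1"])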

3.6 Optimal Policies and Optimal Value Functions


Solving a reinforcement learning task means, roughly, finding a policy that achieves a lot of reward over the
long run. For finite MDPs, we can precisely define an optimal policy in the following way. Value functions define
a partial ordering over policies.

A policy π is defined to be better than or equal to a policy π’ if its expected return is greater than or equal to that of π’ for all states: π ≥ π’ if and only if vπ(s) ≥ vπ’(s) for all s ∊ S.

There is always at least one policy that is better than or equal to all other policies. This is an optimal policy.
Although there may be more than one, we denote all the optimal policies by π*. They share the same state-value function, called the optimal state-value function, denoted v*, and defined as

v*(s) ≐ maxπ vπ(s), for all s ∊ S.

Optimal policies also share the same optimal action-value function, denoted q*, and defined as

q*(s, a) ≐ maxπ qπ(s, a),
for all s ∊ S and a ∊ A(s). For the state–action pair (s, a), this function gives the expected return for taking action
a in state s and thereafter following an optimal policy. Thus, we can write q* in terms of v* as follows:

q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ].

Because v* is the value function for a policy, it must satisfy the self-consistency condition given by the Bellman
equation for state values. Because it is the optimal value function, however, v*’s consistency condition can be
written in a special form without reference to any specific policy. This is the Bellman equation for v*, or the
Bellman optimality equation. Intuitively, the Bellman optimality equation expresses the fact that the value of a
state under an optimal policy must equal the expected return for the best action from that state:

v*(s) = maxa∊A(s) qπ*(s, a)
      = maxa E[ Rt+1 + γ v*(St+1) | St = s, At = a ]
      = maxa Σs′,r p(s′, r | s, a) [ r + γ v*(s′) ].

The last two equations are two forms of the Bellman optimality equation for v*. The Bellman optimality
equation for q* is

q*(s, a) = E[ Rt+1 + γ maxa′ q*(St+1, a′) | St = s, At = a ]
         = Σs′,r p(s′, r | s, a) [ r + γ maxa′ q*(s′, a′) ].

Backup diagrams show graphically the spans of future states and actions considered in the Bellman optimality equations for v* and q*. In the backup diagram for v*, the agent first chooses among its actions, taking the maximum, and the environment then responds with a next state and reward; in the diagram for q*, the environment responds first and the maximum is taken over the actions available in the next state.

The Bellman optimality equation is actually a system of equations, one for each state, so if there are n states,
then there are n equations in n unknowns. If the dynamics p of the environment are known, then in principle
one can solve this system of equations for v* using any one of a variety of methods for solving systems of
nonlinear equations.
Once one has v*, it is relatively easy to determine an optimal policy. For each state s, there will be one or more
actions at which the maximum is obtained in the Bellman optimality equation. Any policy that assigns nonzero
probability only to these actions is an optimal policy.
Another way of saying this is that any policy that is greedy with respect to the optimal evaluation function v*
is an optimal policy.
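When the dynamics p are known and the state set is small, one common numerical approach is value iteration, which repeatedly applies the Bellman optimality update and then acts greedily with respect to the resulting values. The sketch below reuses the dictionary form of p assumed earlier and is only one illustrative way to do this.

def value_iteration(p, states, actions, gamma=0.9, tol=1e-8):
    # Repeatedly apply the Bellman optimality update for v* until convergence.
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * v[s_next]) for s_next, r, prob in p[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

def greedy_policy(p, v, states, actions, gamma=0.9):
    # Any policy that is greedy with respect to v* is an optimal policy.
    return {
        s: max(actions, key=lambda a: sum(prob * (r + gamma * v[s_next])
                                          for s_next, r, prob in p[(s, a)]))
        for s in states
    }

# Example with the hypothetical two-state MDP used earlier:
# v_star = value_iteration(p, ["s0", "s1"], ["go", "stay"])
# pi_star = greedy_policy(p, v_star, ["s0", "s1"], ["go", "stay"])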
Having q* makes choosing optimal actions even easier. With q*, the agent does not even have to do a one-step-
ahead search: for any state s, it can simply find any action that maximizes q*(s, a). The action-value function
effectively caches the results of all one-step-ahead searches. It provides the optimal expected long-term return
as a value that is locally and immediately available for each state–action pair.
Hence, at the cost of representing a function of state–action pairs, instead of just of states, the optimal action
value function allows optimal actions to be selected without having to know anything about possible successor
states and their values, that is, without having to know anything about the environment’s dynamics.
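With q* in hand, action selection reduces to a single argmax per state and never consults the dynamics; a minimal sketch, assuming q* is stored as a dictionary keyed by (state, action):

def best_action(q_star, state, actions):
    # Pick any action maximizing q*(s, a); no one-step-ahead search is needed.
    return max(actions, key=lambda a: q_star[(state, a)])

q_star = {("s0", "go"): 4.2, ("s0", "stay"): 3.1}   # hypothetical action values
print(best_action(q_star, "s0", ["go", "stay"]))    # "go"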
3.7 Optimality and Approximation
A well-defined notion of optimality organizes the approach to learning and provides a way to understand the
theoretical properties of various learning algorithms, but it is an ideal that agents can only approximate. Even
if we have a complete and accurate model of the environment’s dynamics, it is usually not possible to simply
compute an optimal policy by solving the Bellman optimality equation.
For example, board games such as chess are a tiny fraction of human experience, yet large, custom-designed
computers still cannot compute the optimal moves.
A critical aspect of the problem facing the agent is always the computational power available to it, in particular,
the amount of computation it can perform in a single time step.
The memory available is also an important constraint. A large amount of memory is often required to build up
approximations of value functions, policies, and models. In tasks with small, finite state sets, it is possible to
form these approximations using arrays or tables with one entry for each state (or state–action pair).
This we call the tabular case, and the corresponding methods we call tabular methods. In many cases of
practical interest, however, there are far more states than could possibly be entries in a table. In these cases
the functions must be approximated, using some sort of more compact parameterized function representation.
Our framing of the reinforcement learning problem forces us to settle for approximations. However, it also
presents us with some unique opportunities for achieving useful approximations.
For example, in approximating optimal behavior, there may be many states that the agent faces with such a
low probability that selecting suboptimal actions for them has little impact on the amount of reward the agent
receives.
The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that
put more effort into learning to make good decisions for frequently encountered states, at the expense of less
effort for infrequently encountered states. This is one key property that distinguishes reinforcement learning
from other approaches to approximately solving MDPs.
