
Monte Carlo Methods

Monte Carlo Methods

Paul Alexander Bilokon, PhD

Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB

2023.01.24
Monte Carlo Methods

Estimating value functions

I In Markov decision processes (MDPs) we assumed that the agent had complete
knowledge of the environment. In particular, the agent knew the dynamics function
p(s′, r | s, a).
I Monte Carlo methods require only experience — sample sequences of states,
actions, and rewards from actual or simulated interaction with the environment.
I Learning from actual experience requires no prior knowledge of the environment’s
dynamics, yet can still attain optimal behaviour.
I Learning from simulated experience requires a model that need only generate sample
transitions, not the complete probability distributions of all possible transitions required
for dynamic programming (DP).
I In many cases it is straightforward to generate experience sampled according to
desired probability distributions, but infeasible to obtain the distributions in explicit form.
Monte Carlo Methods

Monte Carlo methods

I Monte Carlo methods are ways of solving the reinforcement learning (RL) problem
based on averaging sample returns.
I To ensure that well-defined returns are available, here we define Monte Carlo methods
only for episodic tasks: we assume experience is divided into episodes, and that all
episodes eventually terminate no matter what actions are selected.
I Only on the completion of an episode are value estimates and policies changed.
I Monte Carlo methods can thus be incremental in an episode-by-episode sense, but
not in a step-by-step (online) sense.
I The term “Monte Carlo” is often used more broadly for any estimation method whose
operation involves a significant random component.
I Here we use it specifically for methods based on averaging complete returns (as
opposed to methods that learn from partial returns).
Monte Carlo Methods

Sampling and averaging

I Monte Carlo methods sample and average returns for each state-action pair much like
the bandit methods sample and average rewards for each action.
I The main difference is that now there are multiple states, each acting like a different
bandit problem (like a contextual bandit) and the different bandit problems are
interrelated.
I That is, the return after taking an action in one state depends on the actions taken in
later states in the same episode.
I Because all the action selections are undergoing learning, the problem becomes
nonstationary from the point of view of the earlier state.
Monte Carlo Methods

First-visit MC prediction for estimating V ≈ vπ

1. Input: a policy π to be evaluated


2. Initialise:
I V (s ) ∈ R, arbitrarily, for all s ∈ S
I Returns(s ) ← an empty list, for all s ∈ S
3. Loop forever (for each episode):
I Generate an episode following π : S0 , A0 , R1 , S1 , A1 , R2 , . . . , ST −1 , AT −1 , RT
I G←0
I Loop for each step of episode, t = T − 1, T − 2, . . . , 0:
I G ← γG + Rt +1
I Unless St appears in S0 , S1 , . . . , St −1 :
Append G to Returns(St )
V (St ) ← average(Returns(St ))
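
To make the listing concrete, here is a minimal Python sketch of first-visit MC prediction. The generate_episode(policy) helper and the (state, action, reward) episode format are illustrative assumptions, not part of the slides; gamma plays the role of γ.

    from collections import defaultdict

    def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
        """First-visit MC prediction: estimate V ~ v_pi by averaging first-visit returns.

        `generate_episode(policy)` is assumed to return one terminated episode as a list
        of (state, action, reward) triples [(S0, A0, R1), (S1, A1, R2), ...].
        """
        returns = defaultdict(list)   # Returns(s): first-visit returns observed for s
        V = defaultdict(float)        # V(s), initialised arbitrarily (here to 0)

        for _ in range(num_episodes):
            episode = generate_episode(policy)
            states = [s for (s, _, _) in episode]
            G = 0.0
            # Work backwards through the episode: t = T-1, T-2, ..., 0
            for t in reversed(range(len(episode))):
                s, _, r = episode[t]
                G = gamma * G + r
                # First-visit check: record G only if s does not occur earlier in the episode
                if s not in states[:t]:
                    returns[s].append(G)
                    V[s] = sum(returns[s]) / len(returns[s])
        return V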
Monte Carlo Methods

Every-visit MC prediction for estimating V ≈ vπ

1. Input: a policy π to be evaluated


2. Initialise:
I V (s ) ∈ R, arbitrarily, for all s ∈ S
I Returns(s ) ← an empty list, for all s ∈ S
3. Loop forever (for each episode):
I Generate an episode following π : S0 , A0 , R1 , S1 , A1 , R2 , . . . , ST −1 , AT −1 , RT
I G←0
I Loop for each step of episode, t = T − 1, T − 2, . . . , 0:
I G ← γG + Rt +1
I Append G to Returns(St )
I V (St ) ← average(Returns(St ))
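
For comparison, the every-visit variant simply drops the first-visit check; a minimal sketch under the same assumed generate_episode interface as above:

    from collections import defaultdict

    def every_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
        """Every-visit MC prediction: as the first-visit sketch, minus the visit check."""
        returns, V = defaultdict(list), defaultdict(float)
        for _ in range(num_episodes):
            episode = generate_episode(policy)   # assumed [(S0, A0, R1), (S1, A1, R2), ...]
            G = 0.0
            for t in reversed(range(len(episode))):
                s, _, r = episode[t]
                G = gamma * G + r
                returns[s].append(G)             # append at every visit, not just the first
                V[s] = sum(returns[s]) / len(returns[s])
        return V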
Monte Carlo Methods

Convergence

I Both first-visit MC and every-visit MC converge to vπ (s ) as the number of visits (or first
visits) to s goes to infinity.
I This is easy to see in the case of first-visit MC. In this case each return is an
independent, identically distributed estimate of vπ (s ) with finite variance. By the law of
large numbers the sequence of averages of these estimates converges to their
expected value.
I Each average is itself an unbiased estimate, and the standard deviation of its error falls
as 1/√n, where n is the number of returns averaged (see the formula following this list).
I Every-visit MC is less straightforward, but its estimates also converge quadratically to
vπ (s ) [SS96].
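
In symbols (a standard law-of-large-numbers argument filling in the step the slide alludes to): if Vₙ(s) is the average of n i.i.d. first-visit returns with mean vπ(s) and variance σ²(s), then

\[
V_n(s) = \frac{1}{n}\sum_{i=1}^{n} G^{(i)}, \qquad
\mathbb{E}\big[V_n(s)\big] = v_\pi(s), \qquad
\sqrt{\operatorname{Var}\big[V_n(s)\big]} = \frac{\sigma(s)}{\sqrt{n}}.
\]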
Monte Carlo Methods

Monte Carlo estimation of action values

I With a model, state values alone are enough to determine a policy: just look ahead
one step and choose whichever action leads to the best combination of reward and
next state.
I Without a model state values alone are insufficient.
I One must estimate the value of each action in order for the values to be useful in
suggesting a policy.
I We need Monte Carlo methods to estimate q∗ .
I To achieve this, first consider the policy evaluation problem for action (rather than
state) values.
Monte Carlo Methods

The policy evaluation problem for action values

I The policy evaluation problem for action values is to estimate qπ (s , a ), the expected
return when starting in state s, taking action a, and thereafter following policy π .
I The Monte Carlo methods for this are essentially the same as just presented for state
values, except now we talk about visits to a state–action pair rather than to a state.
I A state–action pair (s , a ) is said to be visited in an episode if ever the state s is
visited and action a is taken in it.
I The first-visit MC method averages the returns following the first time in each episode
that the state was visited and the action was selected.
I The every-visit MC method estimates the value of a state–action pair as the average of
the returns that have followed all the visits to it.
I These methods converge quadratically, as before, to the true expected values as the
number of visits to each state–action pair approaches infinity.
Monte Carlo Methods

Maintaining exploration

I Many state–action pairs may never be visited.


I If π is a deterministic policy, then in following π one will observe returns only for one of
the actions from each state.
I With no returns to average, the Monte Carlo estimates of the other actions will not
improve with experience.
I But the purpose of learning action values is to help in choosing among the actions
available in each state.
I We need to estimate the value of all the actions from each state, not just the one we
currently favour.
Monte Carlo Methods

Exploring starts

I This is the problem of maintaining exploration discussed in the context of the


k -armed bandit problem.
I For policy evaluation to work for action values, we must assure continual exploration.
I One way is by specifying that the episodes start in a state–action pair, and that every
pair has a nonzero probability of being selected as the start.
I This guarantees that all state–action pairs will be visited an infinite number of times in
the limit of an infinite number of episodes.
I We call this the assumption of exploring starts.
I The assumption of exploring starts is sometimes useful, but cannot be relied upon in
general, particularly when learning directly from actual interaction with an environment.
I The most common alternative approach to assuring that all state–action pairs are
encountered is to consider only policies that are stochastic with a nonzero probability
of selecting all actions in each state.
Monte Carlo Methods

Monte Carlo control

I We can now consider how Monte Carlo estimation can be used in control, that is, to
approximate optimal policies.
I We proceed according to the same pattern as with DP, that is, according to the idea of
generalised policy iteration (GPI).
I In GPI one maintains both an approximate policy and an approximate value function.
I The value function is repeatedly altered to more closely approximate the value function
for the current policy, and the policy is repeatedly improved with respect to the current
value function, as suggested by the diagram.
I These two kinds of changes work against each other to some extent, as each creates
a moving target for the other, but together they cause both policy and value function to
approach optimality.
Monte Carlo Methods

Monte Carlo version of classical policy iteration (i)

I To begin, let us consider a Monte Carlo version of classical policy iteration.


I In this method, we perform alternating complete steps of policy evaluation and policy
improvement, beginning with an arbitrary policy π0 and ending with the optimal policy
and optimal action-value function:

\[
\pi_0 \xrightarrow{E} q_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} q_{\pi_1} \xrightarrow{I} \pi_2 \xrightarrow{E} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} q_*,
\]

where the arrow labelled E denotes a complete policy evaluation and the arrow labelled I
denotes a complete policy improvement.
I In policy evaluation many episodes are experienced, with the approximate action-value
function approaching the true function asymptotically.
Monte Carlo Methods

Monte Carlo version of classical policy iteration (ii)


I Policy improvement is done by making the policy greedy with respect to the current
value function. In this case we have an action-value function, and therefore no model
is needed to construct the greedy policy.
I For any action-value function q, the corresponding greedy policy is the one that, for
each s ∈ S , deterministically chooses an action with maximal action-value:

\[
\pi(s) := \arg\max_a q(s, a).
\]

I Policy improvement then can be done by constructing each πk +1 as the greedy policy
with respect to qπk . The policy improvement theorem then applies to πk and πk +1
because, for all s ∈ S ,

\[
\begin{aligned}
q_{\pi_k}(s, \pi_{k+1}(s)) &= q_{\pi_k}\big(s, \arg\max_a q_{\pi_k}(s, a)\big) \\
&= \max_a q_{\pi_k}(s, a) \\
&\ge q_{\pi_k}(s, \pi_k(s)) \\
&\ge v_{\pi_k}(s).
\end{aligned}
\]
The theorem assures us that each πk +1 is uniformly better than πk , or just as good as
πk , in which case they are both optimal policies.
I This in turn assures us that the overall process converges to the optimal policy and
optimal value function.
Monte Carlo Methods

Avoiding the infinite number of episodes

I We can avoid the infinite number of episodes nominally required for policy evaluation.
I We give up trying to complete policy evaluation before returning to policy improvement.
I On each evaluation step we move the value function toward qπk , but we do not expect
to actually get close except over many steps.
I We used this idea when we first introduced generalised policy iteration (GPI).
I One extreme form of the idea is value iteration, in which only one iteration of iterative
policy evaluation is performed between each step of policy improvement.
I The in-place version of value iteration is even more extreme; there we alternate
between improvement and evaluation steps for single states.
I For Monte Carlo policy evaluation it is natural to alternate between evaluation and
improvement on an episode-by-episode basis. After each episode, the observed
returns are used for policy evaluation, and then the policy is improved at all the states
visited in the episode.
Monte Carlo Methods

Monte Carlo ES (Exploring Starts), for estimating π ≈ π∗

I Initialise:
I π (s ) ∈ A(s ) (arbitrarily), for all s ∈ S
I Q (s , a ) ∈ R (arbitrarily), for all s ∈ S , a ∈ A(s )
I Returns(s , a ) ← empty list, for all s ∈ S , a ∈ A(s )
I Loop forever (for each episode):
I Choose S0 ∈ S , A0 ∈ A(S0 ) randomly such that all pairs have probability > 0
I Generate an episode from S0 , A0 , following π : S0 , A0 , R1 , . . . , ST −1 , AT −1 , RT
I G←0
I Loop for each step of episode, t = T − 1, T − 2, . . . , 0:
I G ← γG + Rt +1
I Unless the pair St , At appears in S0 , A0 , S1 , A1 , . . . , St −1 , At −1 :
Append G to Returns(St , At )
Q (St , At ) ← average(Returns(St , At ))
π (St ) ← arg maxa Q (St , a )
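
A minimal Python sketch of Monte Carlo ES. The environment interface (env.states, env.actions(s)) and the generate_episode(env, s0, a0, policy) helper returning (state, action, reward) triples are illustrative assumptions, not part of the slides.

    import random
    from collections import defaultdict

    def mc_es(env, generate_episode, num_episodes, gamma=1.0):
        """Monte Carlo ES (Exploring Starts): estimate pi ~ pi* by greedifying Q after each episode."""
        policy = {s: random.choice(env.actions(s)) for s in env.states}   # arbitrary initial policy
        Q = defaultdict(float)
        returns = defaultdict(list)

        for _ in range(num_episodes):
            # Exploring start: every (state, action) pair has nonzero probability of starting an episode
            s0 = random.choice(env.states)
            a0 = random.choice(env.actions(s0))
            episode = generate_episode(env, s0, a0, policy)

            G = 0.0
            pairs = [(s, a) for (s, a, _) in episode]
            for t in reversed(range(len(episode))):
                s, a, r = episode[t]
                G = gamma * G + r
                if (s, a) not in pairs[:t]:                  # first visit to (s, a) in this episode
                    returns[(s, a)].append(G)
                    Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                    policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])  # greedy improvement
        return policy, Q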
Monte Carlo Methods

Notes on the algorithm

I In Monte Carlo ES, all the returns for each state–action pair are accumulated and
averaged, irrespective of what policy was in force when they were observed.
I It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it
did, then the value function would eventually converge to the value function for that
policy, and that in turn would cause the policy to change.
I Stability is achieved only when both the policy and the value function are optimal.
I Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but has not yet been formally proved. This is
one of the most fundamental open theoretical questions in RL. See [Tsi02] for a partial
solution.
Monte Carlo Methods

On-policy versus off-policy methods

I How can we avoid the assumption of exploring starts?


I The only general way to ensure that all actions are selected infinitely often is for the
agent to continue to select them.
I There are two approaches to ensuring this, resulting in what we call on-policy
methods and off-policy methods.
I On-policy methods attempt to evaluate or improve the policy that is used to make
decisions, whereas off-policy methods evaluate or improve a policy different from
that used to generate the data.
Monte Carlo Methods

Soft strategies

I In on-policy control methods the policy is generally soft, meaning that π (a | s ) > 0 for
all s ∈ S and all a ∈ A(s ), but gradually shifted closer and closer to a deterministic
optimal policy.
I The on-policy method we present uses ε-greedy policies, meaning that most of the
time they choose an action that has maximal estimated action value, but with
probability ε they instead select an action at random.
I That is, all nongreedy actions are given the minimal probability of selection, ε/|A(s)|,
and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
I The ε-greedy policies are examples of ε-soft policies, defined as policies for which
π(a | s) ≥ ε/|A(s)| for all states and actions, for some ε > 0.
I Among ε-soft policies, ε-greedy policies are in some sense those that are closest to
greedy (a minimal sketch of the resulting action probabilities follows).
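
The sketch below makes the ε-greedy probability assignment concrete; the function name, the dict-of-Q-values input, and breaking ties via max() are illustrative assumptions.

    def epsilon_greedy_probs(q_values, epsilon):
        """Return the epsilon-greedy distribution over actions, given a dict {action: Q(s, a)}.

        Every action gets at least epsilon / |A(s)|, so the policy is epsilon-soft;
        the greedy action additionally receives the remaining 1 - epsilon.
        """
        n = len(q_values)
        greedy_action = max(q_values, key=q_values.get)
        probs = {a: epsilon / n for a in q_values}
        probs[greedy_action] += 1.0 - epsilon
        return probs

    # Example: three actions, action 'b' is greedy, epsilon = 0.1
    # -> {'a': 0.0333..., 'b': 0.9333..., 'c': 0.0333...}
    print(epsilon_greedy_probs({'a': 0.1, 'b': 0.5, 'c': -0.2}, epsilon=0.1))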
Monte Carlo Methods

On-policy first-visit MC control (for ε-soft policies, estimates π ≈ π∗ )

I Algorithm parameter: small ε > 0


I Initialise:
I π ← an arbitrary ε-soft policy
I Q (s , a ) ∈ R (arbitrarily), for all s ∈ S , a ∈ A(s )
I Returns(s , a ) ← empty list, for all s ∈ S , a ∈ A(s )
I Repeat forever (for each episode):
I Generate an episode following π : S0 , A0 , R1 , . . . , ST −1 , AT −1 , RT
I G←0
I Loop for each step of episode, t = T − 1, T − 2, . . . , 0:
I G ← γG + Rt +1
I Unless the pair St , At appears in S0 , A0 , S1 , A1 , . . . , St −1 , At −1 :
Append G to Returns(St , At )
Q (St , At ) ← average(Returns(St , At ))
A∗ ← argmax_a Q(St, a)    (with ties broken arbitrarily)
For all a ∈ A(St):
    π(a | St) ← 1 − ε + ε/|A(St)|   if a = A∗
    π(a | St) ← ε/|A(St)|           if a ≠ A∗
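
A minimal Python sketch of this on-policy control loop. The environment interface (env.states, env.actions(s)) and the generate_episode helper, which is assumed to sample actions from the ε-soft policy's probabilities and return (state, action, reward) triples, are illustrative assumptions.

    import random
    from collections import defaultdict

    def on_policy_first_visit_mc_control(env, generate_episode, num_episodes,
                                         epsilon=0.1, gamma=1.0):
        """On-policy first-visit MC control with an epsilon-soft policy.

        `policy[s]` maps each action in A(s) to its selection probability.
        """
        policy = {s: {a: 1.0 / len(env.actions(s)) for a in env.actions(s)} for s in env.states}
        Q = defaultdict(float)
        returns = defaultdict(list)

        for _ in range(num_episodes):
            episode = generate_episode(env, policy)
            pairs = [(s, a) for (s, a, _) in episode]
            G = 0.0
            for t in reversed(range(len(episode))):
                s, a, r = episode[t]
                G = gamma * G + r
                if (s, a) not in pairs[:t]:                      # first visit to (s, a)
                    returns[(s, a)].append(G)
                    Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                    actions = env.actions(s)
                    a_star = max(actions, key=lambda act: Q[(s, act)])   # ties broken by max()
                    for act in actions:                          # epsilon-greedy update of pi(.|s)
                        policy[s][act] = (1 - epsilon + epsilon / len(actions)
                                          if act == a_star else epsilon / len(actions))
        return policy, Q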
Monte Carlo Methods

Off-policy methods

I All learning control methods face a dilemma: they seek to learn action values
conditional on subsequent optimal behaviour, but they need to behave non-optimally in
order to explore all actions (to find the optimal actions).
I How can they learn about the optimal policy while behaving according to an
exploratory policy?
I The on-policy approach is a compromise — it learns action values not for the optimal
policy, but for a near-optimal policy that still explores.
I A more straightforward approach is to use two policies, one that is learned about and
that becomes the optimal policy (the target policy), and one that is more exploratory
and is used to generate behaviour (the behaviour policy).
I In this case we say that learning is from data “off” the target policy, and the overall
process is termed off-policy learning.
Monte Carlo Methods

On-policy versus off-policy methods

I On-policy methods are generally simpler and are considered first.


I Off-policy methods require additional concepts and notation, and because the data is
due to a different policy, off-policy methods are often of greater variance and are
slower to converge.
I On the other hand, off-policy methods are often more powerful and general.
I They include on-policy methods as the special case in which the target and behaviour
policies are the same.
I Off-policy methods also have a variety of additional uses in applications. For example,
they can often be applied to learn from data generated by a conventional non-learning
controller, or from a human expert.
Monte Carlo Methods

Evaluating the target policy

I Suppose that both target and behaviour policies are fixed.


I Suppose we wish to estimate vπ or qπ , but all we have are episodes following another
policy b, where b ≠ π.
I In this case, π is the target policy, b is the behaviour policy, and both policies are
considered fixed and given.
I In order to use episodes from b to estimate values for π , we require that every action
taken under π is also taken, at least occasionally, under b. That is, we require that
π (a | s ) > 0 ⇒ b (a | s ) > 0. This is called the assumption of coverage.
I It follows from coverage that b must be stochastic in states where it is not identical to
π.
I The target policy π , on the other hand, may be deterministic, and, in fact, this is a case
of particular interest in control applications.
I In control, the target policy is typically the deterministic greedy policy with respect to
the current estimate of the action–value function.
I This policy becomes a deterministic optimal policy while the behaviour policy remains
stochastic and more exploratory, for example, an ε-greedy policy.
I For now, we consider the prediction problem, in which π is unchanging and given.
Monte Carlo Methods

Importance sampling (i)


I Almost all off-policy methods utilise importance sampling, a general technique for
estimating expected values under one distribution given samples from another.
I We apply importance sampling to off-policy learning by weighting returns according to
the relative probability of their trajectories occurring under the target and behaviour
policies, called the importance sampling ratio.
I Given a starting state St , the probability of the subsequent state–action trajectory,
At , St +1 , At +1 , . . . , ST , occurring under any policy π is
\[
\begin{aligned}
\Pr\{A_t, S_{t+1}, A_{t+1}, \ldots, S_T \mid S_t, A_{t:T-1} \sim \pi\}
&= \pi(A_t \mid S_t)\, p(S_{t+1} \mid S_t, A_t)\, \pi(A_{t+1} \mid S_{t+1}) \cdots p(S_T \mid S_{T-1}, A_{T-1}) \\
&= \prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k),
\end{aligned}
\]

where p is the state-transition probability function

\[
p(s' \mid s, a) := \Pr\{S_t = s' \mid S_{t-1} = s, A_{t-1} = a\} = \sum_{r \in \mathcal{R}} p(s', r \mid s, a).
\]

I Thus, the relative probability of the trajectory under the target and behaviour policies
(the importance sampling ratio) is

\[
\rho_{t:T-1} := \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}
= \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.
\]
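
In code, the cancellation means ρ can be computed from the two policies' action probabilities alone; a minimal sketch, assuming pi_prob(action, state) and b_prob(action, state) return the target and behaviour probabilities (these names are illustrative assumptions):

    def importance_sampling_ratio(episode, pi_prob, b_prob, t=0):
        """Compute rho_{t:T-1} = prod_k pi(A_k | S_k) / b(A_k | S_k) for one episode.

        `episode` is assumed to be a list of (state, action, reward) triples; coverage
        requires b_prob > 0 wherever pi_prob > 0.
        """
        rho = 1.0
        for state, action, _ in episode[t:]:
            rho *= pi_prob(action, state) / b_prob(action, state)
        return rho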
Monte Carlo Methods

Importance sampling (ii)

I Although the trajectory probabilities depend on the MDP’s transition probabilities,


which are generally unknown, they appear identically in both the numerator and
denominator, and thus cancel.
I The importance sampling ratio ends up depending only on the two policies and the
sequence, not on the MDP.
I Recall that we wish to estimate the expected returns (values) under the target policy,
but all we have are returns Gt due to the behaviour policy.
I These returns have the wrong expectation

E [Gt | St = s ] = vb (s )

and so cannot be averaged to obtain vπ .


I This is where importance sampling comes in. The ratio ρt :T −1 transforms the returns
to have the right expected value:

E [ρt :T −1 Gt | St = s ] = vπ (s ).
Monte Carlo Methods

The MC algorithm

I Now we are ready to give a MC algorithm that averages returns from a batch of
observed episodes following policy b to estimate vπ (s ).
I It is convenient here to number time steps in a way that increases across episode
boundaries. That is, if the first episode of the batch ends in a terminal state at time
100, then the next episode begins at time t = 101.
I This enables us to use time-step numbers to refer to particular steps in particular
episodes.
I In particular, we can define the set of all time steps in which state s is visited, denoted
T (s ). This is for an every-visit method; for a first-visit method, T (s ) would only
include time steps that were first visits to s within their episodes.
I Also, let T (t ) denote the first time of termination following time t, and Gt denote the
return after t up through T (t ). Then {Gt }t ∈T (s ) are the returns that pertain to state s,
and {ρt :T (t )−1 }t ∈T (s ) are the corresponding importance-sampling ratios.
Monte Carlo Methods

Ordinary vs weighted importance sampling

I To estimate vπ(s), we simply scale the returns by the ratios and average the results:

\[
V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{|\mathcal{T}(s)|}.
\]

I When importance sampling is done as a simple average in this way it is called
ordinary importance sampling.
I An important alternative is weighted importance sampling, which uses a weighted
average, defined as

\[
V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},
\]

or zero if the denominator is zero.
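
A minimal sketch of the two estimators, assuming we have already collected, for a given state s, a list of (G_t, ρ_{t:T(t)−1}) pairs from episodes generated by b (the data format is an assumption for illustration):

    def ordinary_is_estimate(returns_and_ratios):
        """Ordinary importance sampling: sum of rho * G divided by the number of returns."""
        if not returns_and_ratios:
            return 0.0
        return sum(rho * g for g, rho in returns_and_ratios) / len(returns_and_ratios)

    def weighted_is_estimate(returns_and_ratios):
        """Weighted importance sampling: sum of rho * G divided by the sum of the ratios
        (defined as zero when the denominator is zero)."""
        denom = sum(rho for _, rho in returns_and_ratios)
        if denom == 0.0:
            return 0.0
        return sum(rho * g for g, rho in returns_and_ratios) / denom

    # Example: two returns for state s, with ratios 0.5 and 2.0
    data = [(1.0, 0.5), (3.0, 2.0)]
    print(ordinary_is_estimate(data))   # (0.5*1.0 + 2.0*3.0) / 2   = 3.25
    print(weighted_is_estimate(data))   # (0.5*1.0 + 2.0*3.0) / 2.5 = 2.6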
Monte Carlo Methods
Bibliography

[SS96] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, March 1996.

[Tsi02] John N. Tsitsiklis. On the convergence of optimistic policy iteration. Journal of Machine Learning Research, 3:59–72, 2002.
