4 Monte Carlo Methods
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.24
Monte Carlo Methods
I In Markov decision processes (MDPs) we assumed that the agent had complete
knowledge of the environment. In particular, the agent knew the dynamics function
p(r, s′ | s, a).
I Monte Carlo methods require only experience — sample sequences of states,
actions, and rewards from actual or simulated interaction with the environment.
I Learning from actual experience requires no prior knowledge of the environment’s
dynamics, yet can still attain optimal behaviour.
I Learning from simulated experience requires a model that need only generate sample
transitions, not the complete probability distributions of all possible transitions required
for dynamic programming (DP).
I In many cases it is straightforward to generate experience sampled according to
desired probability distributions, but infeasible to obtain the distributions in explicit form.
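I To make the distinction concrete, here is a minimal Python sketch (all names, states, and numbers are hypothetical) contrasting a distribution model, which DP needs in full, with a sample model, which is all Monte Carlo needs:

```python
import random

# Hypothetical two-state example. A distribution model gives the full
# dynamics p(r, s' | s, a); a sample model only needs to draw one transition.

# Distribution model: maps (s, a) to a list of (s_next, reward, probability).
p = {
    ("s0", "a"): [("s1", 1.0, 0.7), ("s0", 0.0, 0.3)],
    ("s1", "a"): [("terminal", 5.0, 1.0)],
}

def sample_model(s, a):
    """Sample model: draw a single (s', r) according to p(r, s' | s, a)."""
    transitions = p[(s, a)]
    weights = [prob for (_, _, prob) in transitions]
    s_next, r, _ = random.choices(transitions, weights=weights, k=1)[0]
    return s_next, r

# DP requires the whole table `p`; Monte Carlo only needs calls to `sample_model`.
print(sample_model("s0", "a"))
```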
Monte Carlo Methods
I Monte Carlo methods are ways of solving the reinforcement learning (RL) problem
based on averaging sample returns.
I To ensure that well-defined returns are available, here we define Monte Carlo methods
only for episodic tasks: we assume experience is divided into episodes, and that all
episodes eventually terminate no matter what actions are selected.
I Only on the completion of an episode are value estimates and policies changed.
I Monte Carlo methods can thus be incremental in an episode-by-episode sense, but
not in a step-by-step (online) sense.
I The term “Monte Carlo” is often used more broadly for any estimation method whose
operation involves a significant random component.
I Here we use it specifically for methods based on averaging complete returns (as
opposed to methods that learn from partial returns).
Monte Carlo Methods
I Monte Carlo methods sample and average returns for each state-action pair much like
the bandit methods sample and average rewards for each action.
I The main difference is that now there are multiple states, each acting like a different
bandit problem (like a contextual bandit) and the different bandit problems are
interrelated.
I That is, the return after taking an action in one state depends on the actions taken in
later states in the same episode.
I Because all the action selections are undergoing learning, the problem becomes
nonstationary from the point of view of the earlier state.
Monte Carlo Methods
Convergence
I Both first-visit MC and every-visit MC converge to vπ (s ) as the number of visits (or first
visits) to s goes to infinity.
I This is easy to see in the case of first-visit MC. In this case each return is an
independent, identically distributed estimate of vπ (s ) with finite variance. By the law of
large numbers the sequence of averages of these estimates converges to their
expected value.
I Each average is itself an unbiased estimate, and the standard deviation of its error falls as 1/√n, where n is the number of returns averaged.
I Every-visit MC is less straightforward, but its estimates also converge quadratically to
vπ (s ) [SS96].
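I The 1/√n behaviour above can be checked numerically; the following is a minimal sketch, assuming for illustration that the returns from a state are i.i.d. standard normal draws (so the true value is 0):

```python
import random
import statistics

random.seed(0)

def mc_estimate(n):
    """Average of n i.i.d. sampled returns (a stand-in for first-visit returns)."""
    return sum(random.gauss(0.0, 1.0) for _ in range(n)) / n

# Standard deviation of the estimation error over many repetitions,
# for increasing numbers n of averaged returns.
for n in (10, 100, 1000):
    errors = [mc_estimate(n) for _ in range(2000)]
    print(n, statistics.pstdev(errors))  # roughly 1/sqrt(n): ~0.32, ~0.10, ~0.03
```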
Monte Carlo Methods
I With a model, state values alone are enough to determine a policy: just look ahead
one step and choose whichever action leads to the best combination of reward and
next state.
I Without a model, state values alone are insufficient.
I One must estimate the value of each action in order for the values to be useful in
suggesting a policy.
I We need Monte Carlo methods to estimate q∗ .
I To achieve this, first consider the policy evaluation problem for action (rather than
state) values.
Monte Carlo Methods
I The policy evaluation problem for action values is to estimate qπ (s , a ), the expected
return when starting in state s, taking action a, and thereafter following policy π .
I The Monte Carlo methods for this are essentially the same as just presented for state
values, except now we talk about visits to a state–action pair rather than to a state.
I A state–action pair (s , a ) is said to be visited in an episode if ever the state s is
visited and action a is taken in it.
I The first-visit MC method averages the returns following the first time in each episode
that the state was visited and the action was selected.
I The every-visit MC method estimates the value of a state–action pair as the average of
the returns that have followed all the visits to it.
I These methods converge quadratically, as before, to the true expected values as the
number of visits to each state–action pair approaches infinity.
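I As an illustration, here is a minimal first-visit MC prediction sketch for action values; `generate_episode(policy)` is an assumed helper that returns a list of (state, action, reward) triples for one episode:

```python
from collections import defaultdict

def first_visit_mc_q(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate q_pi(s, a) by averaging the returns that follow the first
    visit to each state-action pair within an episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = {}

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(S0, A0, R1), (S1, A1, R2), ...]
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        # Traverse the episode backwards, accumulating the return G.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: skip if (s, a) also occurs earlier in the episode.
            if (s, a) not in pairs[:t]:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```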
Monte Carlo Methods
Maintaining exploration
Exploring starts
I We can now consider how Monte Carlo estimation can be used in control, that is, to
approximate optimal policies.
I We proceed according to the same pattern as with DP, that is, according to the idea of
generalised policy iteration (GPI).
I In GPI one maintains both an approximate policy and an approximate value function.
I The value function is repeatedly altered to more closely approximate the value function
for the current policy, and the policy is repeatedly improved with respect to the current
value function, as suggested by the diagram.
I These two kinds of changes work against each other to some extent, as each creates
a moving target for the other, but together they cause both policy and value function to
approach optimality.
Monte Carlo Methods
$$\pi_0 \xrightarrow{\;E\;} q_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} q_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} q_*,$$
where $\xrightarrow{\;E\;}$ denotes a complete policy evaluation and $\xrightarrow{\;I\;}$ denotes a complete policy
improvement.
I In policy evaluation many episodes are experienced, with the approximate action-value
function approaching the true function asymptotically.
Monte Carlo Methods
$$\pi(s) := \arg\max_a q(s, a).$$
I Policy improvement then can be done by constructing each πk +1 as the greedy policy
with respect to qπk . The policy improvement theorem then applies to πk and πk +1
because, for all s ∈ S ,
$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\!\left(s, \arg\max_a q_{\pi_k}(s, a)\right) = \max_a q_{\pi_k}(s, a) \geq q_{\pi_k}(s, \pi_k(s)) \geq v_{\pi_k}(s).$$
The theorem assures us that each πk +1 is uniformly better than πk , or just as good as
πk , in which case they are both optimal policies.
I This in turn assures us that the overall process converges to the optimal policy and
optimal value function.
Monte Carlo Methods
I We can avoid the infinite number of episodes nominally required for policy evaluation.
I We give up trying to complete policy evaluation before returning to policy improvement.
I On each evaluation step we move the value function toward qπk , but we do not expect
to actually get close except over many steps.
I We used this idea when we first introduced generalised policy iteration (GPI).
I One extreme form of the idea is value iteration, in which only one iteration of iterative
policy evaluation is performed between each step of policy improvement.
I The in-place version of value iteration is even more extreme; there we alternate
between improvement and evaluation steps for single states.
I For Monte Carlo policy evaluation it is natural to alternate between evaluation and
improvement on an episode-by-episode basis. After each episode, the observed
returns are used for policy evaluation, and then the policy is improved at all the states
visited in the episode.
Monte Carlo Methods
Initialise:
    π(s) ∈ A(s) (arbitrarily), for all s ∈ S
    Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
Loop forever (for each episode):
    Choose S0 ∈ S, A0 ∈ A(S0) randomly such that all pairs have probability > 0
    Generate an episode from S0, A0, following π: S0, A0, R1, . . . , S_{T−1}, A_{T−1}, R_T
    G ← 0
    Loop for each step of episode, t = T−1, T−2, . . . , 0:
        G ← γG + R_{t+1}
        Unless the pair (St, At) appears in S0, A0, S1, A1, . . . , S_{t−1}, A_{t−1}:
            Append G to Returns(St, At)
            Q(St, At) ← average(Returns(St, At))
            π(St) ← argmax_a Q(St, a)
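I A Python sketch of the Monte Carlo ES loop above; `states`, `actions(s)` and `generate_episode_from(s0, a0, policy)` are assumed helpers for some episodic environment (hypothetical names):

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode_from,
                        num_episodes, gamma=1.0):
    """Monte Carlo ES: exploring starts plus greedy policy improvement.

    `generate_episode_from(s0, a0, policy)` is assumed to return
    [(S0, A0, R1), (S1, A1, R2), ...] for an episode that starts with the
    given state-action pair and follows `policy` thereafter.
    """
    policy = {s: random.choice(actions(s)) for s in states}  # arbitrary initial policy
    Q = defaultdict(float)
    returns = defaultdict(list)

    for _ in range(num_episodes):
        # Exploring start: every state-action pair has positive probability.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode_from(s0, a0, policy)

        G = 0.0
        pairs = [(s, a) for (s, a, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                      # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return policy, Q
```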
Monte Carlo Methods
I In Monte Carlo ES (Monte Carlo with Exploring Starts), all the returns for each state–action pair are
accumulated and averaged, irrespective of what policy was in force when they were observed.
I It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it
did, then the value function would eventually converge to the value function for that
policy, and that in turn would cause the policy to change.
I Stability is achieved only when both the policy and the value function are optimal.
I Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but has not yet been formally proved. This is
one of the most fundamental open theoretical questions in RL. See [Tsi02] for a partial
solution.
Monte Carlo Methods
Soft strategies
I In on-policy control methods the policy is generally soft, meaning that π (a | s ) > 0 for
all s ∈ S and all a ∈ A(s ), but gradually shifted closer and closer to a deterministic
optimal policy.
I The on-policy method we present uses ε-greedy policies, meaning that most of the
time they choose an action that has maximal estimated action value, but with
probability ε they instead select an action at random.
I That is, all nongreedy actions are given the minimal probability of selection, ε/|A(s)|,
and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
I The ε-greedy policies are examples of ε-soft policies, defined as policies for which
π(a | s) ≥ ε/|A(s)| for all states and actions, for some ε > 0.
I Among ε-soft policies, ε-greedy policies are in some sense those that are closest to
greedy.
Monte Carlo Methods
$$\pi(a \mid S_t) \leftarrow \begin{cases} 1 - \varepsilon + \varepsilon/|\mathcal{A}(S_t)|, & \text{if } a = A^*; \\ \varepsilon/|\mathcal{A}(S_t)|, & \text{if } a \neq A^*. \end{cases}$$
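I A minimal sketch of this ε-greedy update and the corresponding action selection, with `Q` a dict keyed by (state, action) and `actions` the list of available actions (names are illustrative):

```python
import random

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Return pi(a | state) as a dict, following the epsilon-greedy rule above."""
    a_star = max(actions, key=lambda a: Q[(state, a)])      # greedy action A*
    probs = {a: epsilon / len(actions) for a in actions}    # every action gets eps/|A(s)|
    probs[a_star] += 1.0 - epsilon                          # greedy action gets the rest
    return probs

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(Q, state, actions, epsilon)
    return random.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```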
Monte Carlo Methods
Off-policy methods
I All learning control methods face a dilemma: they seek to learn action values
conditional on subsequent optimal behaviour, but they need to behave non-optimally in
order to explore all actions (to find the optimal actions).
I How can they learn about the optimal policy while behaving according to an
exploratory policy?
I The on-policy approach is a compromise — it learns action values not for the optimal
policy, but for a near-optimal policy that still explores.
I A more straightforward approach is to use two policies, one that is learned about and
that becomes the optimal policy (the target policy), and one that is more exploratory
and is used to generate behaviour (the behaviour policy).
I In this case we say that learning is from data “off” the target policy, and the overall
process is termed off-policy learning.
Monte Carlo Methods
I Thus, the relative probability of the trajectory under the target and behaviour policies
(the importance sampling ratio) is
$$\rho_{t:T-1} := \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
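I Since the dynamics terms cancel, the ratio depends only on the two policies. A minimal sketch, assuming `target_prob(a, s)` and `behaviour_prob(a, s)` return π(a | s) and b(a | s) respectively:

```python
def importance_sampling_ratio(episode, t, target_prob, behaviour_prob):
    """Compute rho_{t:T-1} for one episode, given as (state, action, reward) triples.

    Assumes coverage: behaviour_prob(a, s) > 0 whenever target_prob(a, s) > 0.
    """
    rho = 1.0
    for (s, a, _) in episode[t:]:
        rho *= target_prob(a, s) / behaviour_prob(a, s)
    return rho
```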
Monte Carlo Methods
I The returns Gt generated by following the behaviour policy b have the wrong expectation,
$$\mathbb{E}[G_t \mid S_t = s] = v_b(s),$$
but weighting them by the importance sampling ratio corrects this:
$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s).$$
Monte Carlo Methods
The MC algorithm
I Now we are ready to give an MC algorithm that averages returns from a batch of
observed episodes following policy b to estimate vπ (s ).
I It is convenient here to number time steps in a way that increases across episode
boundaries. That is, if the first episode of the batch ends in a terminal state at time
100, then the next episode begins at time t = 101.
I This enables us to use time-step numbers to refer to particular steps in particular
episodes.
I In particular, we can define the set of all time steps in which state s is visited, denoted
T (s ). This is for an every-visit method; for a first-visit method, T (s ) would only
include time steps that were first visits to s within their episodes.
I Also, let T (t ) denote the first time of termination following time t, and Gt denote the
return after t up through T (t ). Then {Gt }t ∈T (s ) are the returns that pertain to state s,
and {ρt :T (t )−1 }t ∈T (s ) are the corresponding importance-sampling ratios.
Monte Carlo Methods
I To estimate vπ (s ), we simply scale the returns by the ratios and average the results:
$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}.$$
I When importance sampling is done as a simple average in this way it is called
ordinary importance sampling.
I An important alternative is weighted importance sampling, which uses a weighted
average, defined as
$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},$$
or zero if the denominator is zero.
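I A sketch of both estimators for a single state s, given parallel lists of the returns Gt and the ratios ρt:T(t)−1 for the time steps in T(s) (reusing the hypothetical ratio helper above):

```python
def ordinary_importance_sampling(returns, ratios):
    """V(s) = sum(rho_t * G_t) / |T(s)|: a simple average of the scaled returns."""
    if not returns:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

def weighted_importance_sampling(returns, ratios):
    """V(s) = sum(rho_t * G_t) / sum(rho_t), or zero if the denominator is zero."""
    denom = sum(ratios)
    if denom == 0.0:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom
```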
Monte Carlo Methods
Bibliography