4 Monte Carlo Methods
Thalesians Ltd
Level39, One Canada Square, Canary Wharf, London E14 5AB
2023.01.24
Monte Carlo Methods
I In Markov decision processes (MDPs) we assumed that the agent had complete
knowledge of the environment. In particular, the agent knew the dynamics function
p(r, s′ | s, a).
I Monte Carlo methods require only experience — sample sequences of states,
actions, and rewards from actual or simulated interaction with the environment.
I Learning from actual experience requires no prior knowledge of the environment’s
dynamics, yet can still attain optimal behaviour.
I Learning from simulated experience requires a model that need only generate sample
transitions, not the complete probability distributions of all possible transitions required
for dynamic programming (DP).
I In many cases it is straightforward to generate experience sampled according to
desired probability distributions, but infeasible to obtain the distributions in explicit form.
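I To make the distinction concrete, here is a minimal Python sketch (all names, states, and numbers are hypothetical) contrasting a distribution model, which DP needs in full, with a sample model, which is all Monte Carlo needs:

```python
import random

# Hypothetical two-state example. A distribution model gives the full
# dynamics p(r, s' | s, a); a sample model only needs to draw one transition.

# Distribution model: maps (s, a) to a list of (s_next, reward, probability).
p = {
    ("s0", "a"): [("s1", 1.0, 0.7), ("s0", 0.0, 0.3)],
    ("s1", "a"): [("terminal", 5.0, 1.0)],
}

def sample_model(s, a):
    """Sample model: draw a single (s', r) according to p(r, s' | s, a)."""
    transitions = p[(s, a)]
    weights = [prob for (_, _, prob) in transitions]
    s_next, r, _ = random.choices(transitions, weights=weights, k=1)[0]
    return s_next, r

# DP requires the whole table `p`; Monte Carlo only needs calls to `sample_model`.
print(sample_model("s0", "a"))
```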
Monte Carlo Methods
I Monte Carlo methods are ways of solving the reinforcement learning (RL) problem
based on averaging sample returns.
I To ensure that well-defined returns are available, here we define Monte Carlo methods
only for episodic tasks: we assume experience is divided into episodes, and that all
episodes eventually terminate no matter what actions are selected.
I Only on the completion of an episode are value estimates and policies changed.
I Monte Carlo methods can thus be incremental in an episode-by-episode sense, but
not in a step-by-step (online) sense.
I The term “Monte Carlo” is often used more broadly for any estimation method whose
operation involves a significant random component.
I Here we use it specifically for methods based on averaging complete returns (as
opposed to methods that learn from partial returns).
Monte Carlo Methods
I Monte Carlo methods sample and average returns for each state-action pair much like
the bandit methods sample and average rewards for each action.
I The main difference is that now there are multiple states, each acting like a different
bandit problem (like a contextual bandit) and the different bandit problems are
interrelated.
I That is, the return after taking an action in one state depends on the actions taken in
later states in the same episode.
I Because all the action selections are undergoing learning, the problem becomes
nonstationary from the point of view of the earlier state.
Monte Carlo Methods
Convergence
I Both first-visit MC and every-visit MC converge to vπ (s ) as the number of visits (or first
visits) to s goes to infinity.
I This is easy to see in the case of first-visit MC. In this case each return is an
independent, identically distributed estimate of vπ (s ) with finite variance. By the law of
large numbers the sequence of averages of these estimates converges to their
expected value.
I Each average is itself an unbiased estimate, and the standard deviation of its error falls as 1/√n, where n is the number of returns averaged.
I Every-visit MC is less straightforward, but its estimates also converge quadratically to
vπ (s ) [SS96].
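I The 1/√n behaviour above can be checked numerically; the following is a minimal sketch, assuming for illustration that the returns from a state are i.i.d. standard normal draws (so the true value is 0):

```python
import random
import statistics

random.seed(0)

def mc_estimate(n):
    """Average of n i.i.d. sampled returns (a stand-in for first-visit returns)."""
    return sum(random.gauss(0.0, 1.0) for _ in range(n)) / n

# Standard deviation of the estimation error over many repetitions,
# for increasing numbers n of averaged returns.
for n in (10, 100, 1000):
    errors = [mc_estimate(n) for _ in range(2000)]
    print(n, statistics.pstdev(errors))  # roughly 1/sqrt(n): ~0.32, ~0.10, ~0.03
```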
Monte Carlo Methods
I With a model, state values alone are enough to determine a policy: just look ahead
one step and choose whichever action leads to the best combination of reward and
next state.
I Without a model, state values alone are insufficient.
I One must estimate the value of each action in order for the values to be useful in
suggesting a policy.
I We need Monte Carlo methods to estimate q∗ .
I To achieve this, first consider the policy evaluation problem for action (rather than
state) values.
Monte Carlo Methods
I The policy evaluation problem for action values is to estimate qπ (s , a ), the expected
return when starting in state s, taking action a, and thereafter following policy π .
I The Monte Carlo methods for this are essentially the same as just presented for state
values, except now we talk about visits to a state–action pair rather than to a state.
I A state–action pair (s , a ) is said to be visited in an episode if ever the state s is
visited and action a is taken in it.
I The first-visit MC method averages the returns following the first time in each episode
that the state was visited and the action was selected.
I The every-visit MC method estimates the value of a state–action pair as the average of
the returns that have followed all the visits to it.
I These methods converge quadratically, as before, to the true expected values as the
number of visits to each state–action pair approaches infinity.
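I As an illustration, here is a minimal first-visit MC prediction sketch for action values; `generate_episode(policy)` is an assumed helper that returns a list of (state, action, reward) triples for one episode:

```python
from collections import defaultdict

def first_visit_mc_q(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate q_pi(s, a) by averaging the returns that follow the first
    visit to each state-action pair within an episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    Q = {}

    for _ in range(num_episodes):
        episode = generate_episode(policy)      # [(S0, A0, R1), (S1, A1, R2), ...]
        pairs = [(s, a) for (s, a, _) in episode]
        G = 0.0
        # Traverse the episode backwards, accumulating the return G.
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: skip if (s, a) also occurs earlier in the episode.
            if (s, a) not in pairs[:t]:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return Q
```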
Monte Carlo Methods
Maintaining exploration
Exploring starts
I We can now consider how Monte Carlo estimation can be used in control, that is, to
approximate optimal policies.
I We proceed according to the same pattern as with DP, that is, according to the idea of
generalised policy iteration (GPI).
I In GPI one maintains both an approximate policy and an approximate value function.
I The value function is repeatedly altered to more closely approximate the value function
for the current policy, and the policy is repeatedly improved with respect to the current
value function, as suggested by the diagram.
I These two kinds of changes work against each other to some extent, as each creates
a moving target for the other, but together they cause both policy and value function to
approach optimality.
Monte Carlo Methods
$$\pi_0 \xrightarrow{\;E\;} q_{\pi_0} \xrightarrow{\;I\;} \pi_1 \xrightarrow{\;E\;} q_{\pi_1} \xrightarrow{\;I\;} \pi_2 \xrightarrow{\;E\;} \cdots \xrightarrow{\;I\;} \pi_* \xrightarrow{\;E\;} q_*,$$
where $\xrightarrow{\;E\;}$ denotes a complete policy evaluation and $\xrightarrow{\;I\;}$ denotes a complete policy
improvement.
I In policy evaluation many episodes are experienced, with the approximate action-value
function approaching the true function asymptotically.
Monte Carlo Methods
$$\pi(s) := \arg\max_a q(s, a).$$
I Policy improvement then can be done by constructing each πk +1 as the greedy policy
with respect to qπk . The policy improvement theorem then applies to πk and πk +1
because, for all s ∈ S ,
$$q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\!\left(s, \arg\max_a q_{\pi_k}(s, a)\right) = \max_a q_{\pi_k}(s, a) \geq q_{\pi_k}(s, \pi_k(s)) \geq v_{\pi_k}(s).$$
The theorem assures us that each πk +1 is uniformly better than πk , or just as good as
πk , in which case they are both optimal policies.
I This in turn assures us that the overall process converges to the optimal policy and
optimal value function.
Monte Carlo Methods
I We can avoid the infinite number of episodes nominally required for policy evaluation.
I We give up trying to complete policy evaluation before returning to policy improvement.
I On each evaluation step we move the value function toward qπk , but we do not expect
to actually get close except over many steps.
I We used this idea when we first introduced generalised policy iteration (GPI).
I One extreme form of the idea is value iteration, in which only one iteration of iterative
policy evaluation is performed between each step of policy improvement.
I The in-place version of value iteration is even more extreme; there we alternate
between improvement and evaluation steps for single states.
I For Monte Carlo policy evaluation it is natural to alternate between evaluation and
improvement on an episode-by-episode basis. After each episode, the observed
returns are used for policy evaluation, and then the policy is improved at all the states
visited in the episode.
Monte Carlo Methods
Initialise:
    π(s) ∈ A(s) (arbitrarily), for all s ∈ S
    Q(s, a) ∈ ℝ (arbitrarily), for all s ∈ S, a ∈ A(s)
    Returns(s, a) ← empty list, for all s ∈ S, a ∈ A(s)
Loop forever (for each episode):
    Choose S0 ∈ S, A0 ∈ A(S0) randomly such that all pairs have probability > 0
    Generate an episode from S0, A0, following π: S0, A0, R1, . . . , S_{T−1}, A_{T−1}, R_T
    G ← 0
    Loop for each step of episode, t = T−1, T−2, . . . , 0:
        G ← γG + R_{t+1}
        Unless the pair (St, At) appears in S0, A0, S1, A1, . . . , S_{t−1}, A_{t−1}:
            Append G to Returns(St, At)
            Q(St, At) ← average(Returns(St, At))
            π(St) ← argmax_a Q(St, a)
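I A Python sketch of the Monte Carlo ES loop above; `states`, `actions(s)` and `generate_episode_from(s0, a0, policy)` are assumed helpers for some episodic environment (hypothetical names):

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode_from,
                        num_episodes, gamma=1.0):
    """Monte Carlo ES: exploring starts plus greedy policy improvement.

    `generate_episode_from(s0, a0, policy)` is assumed to return
    [(S0, A0, R1), (S1, A1, R2), ...] for an episode that starts with the
    given state-action pair and follows `policy` thereafter.
    """
    policy = {s: random.choice(actions(s)) for s in states}  # arbitrary initial policy
    Q = defaultdict(float)
    returns = defaultdict(list)

    for _ in range(num_episodes):
        # Exploring start: every state-action pair has positive probability.
        s0 = random.choice(states)
        a0 = random.choice(actions(s0))
        episode = generate_episode_from(s0, a0, policy)

        G = 0.0
        pairs = [(s, a) for (s, a, _) in episode]
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in pairs[:t]:                      # first-visit check
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return policy, Q
```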
Monte Carlo Methods
I In Monte Carlo ES (Monte Carlo with Exploring Starts), all the returns for each state–action pair are
accumulated and averaged, irrespective of what policy was in force when they were observed.
I It is easy to see that Monte Carlo ES cannot converge to any suboptimal policy. If it
did, then the value function would eventually converge to the value function for that
policy, and that in turn would cause the policy to change.
I Stability is achieved only when both the policy and the value function are optimal.
I Convergence to this optimal fixed point seems inevitable as the changes to the
action-value function decrease over time, but has not yet been formally proved. This is
one of the most fundamental open theoretical questions in RL. See [Tsi02] for a partial
solution.
Monte Carlo Methods
Soft strategies
I In on-policy control methods the policy is generally soft, meaning that π (a | s ) > 0 for
all s ∈ S and all a ∈ A(s ), but gradually shifted closer and closer to a deterministic
optimal policy.
I The on-policy method we present uses ε-greedy policies, meaning that most of the
time they choose an action that has maximal estimated action value, but with
probability ε they instead select an action at random.
I That is, all nongreedy actions are given the minimal probability of selection, ε/|A(s)|,
and the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
I The ε-greedy policies are examples of ε-soft policies, defined as policies for which
π(a | s) ≥ ε/|A(s)| for all states and actions, for some ε > 0.
I Among ε-soft policies, ε-greedy policies are in some sense those that are closest to
greedy.
Monte Carlo Methods
$$\pi(a \mid S_t) \leftarrow \begin{cases} 1 - \varepsilon + \varepsilon/|\mathcal{A}(S_t)|, & \text{if } a = A^*; \\ \varepsilon/|\mathcal{A}(S_t)|, & \text{if } a \neq A^*. \end{cases}$$
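I A minimal sketch of this ε-greedy update and the corresponding action selection, with `Q` a dict keyed by (state, action) and `actions` the list of available actions (names are illustrative):

```python
import random

def epsilon_greedy_probs(Q, state, actions, epsilon):
    """Return pi(a | state) as a dict, following the epsilon-greedy rule above."""
    a_star = max(actions, key=lambda a: Q[(state, a)])      # greedy action A*
    probs = {a: epsilon / len(actions) for a in actions}    # every action gets eps/|A(s)|
    probs[a_star] += 1.0 - epsilon                          # greedy action gets the rest
    return probs

def epsilon_greedy_action(Q, state, actions, epsilon):
    """Sample an action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(Q, state, actions, epsilon)
    return random.choices(actions, weights=[probs[a] for a in actions], k=1)[0]
```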
Monte Carlo Methods
Off-policy methods
I All learning control methods face a dilemma: they seek to learn action values
conditional on subsequent optimal behaviour, but they need to behave non-optimally in
order to explore all actions (to find the optimal actions).
I How can they learn about the optimal policy while behaving according to an
exploratory policy?
I The on-policy approach is a compromise — it learns action values not for the optimal
policy, but for a near-optimal policy that still explores.
I A more straightforward approach is to use two policies, one that is learned about and
that becomes the optimal policy (the target policy), and one that is more exploratory
and is used to generate behaviour (the behaviour policy).
I In this case we say that learning is from data “off” the target policy, and the overall
process is termed off-policy learning.
Monte Carlo Methods
I Thus, the relative probability of the trajectory under the target and behaviour policies
(the importance sampling ratio) is
$$\rho_{t:T-1} := \frac{\prod_{k=t}^{T-1} \pi(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)}{\prod_{k=t}^{T-1} b(A_k \mid S_k)\, p(S_{k+1} \mid S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$
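I Since the dynamics terms cancel, the ratio depends only on the two policies. A minimal sketch, assuming `target_prob(a, s)` and `behaviour_prob(a, s)` return π(a | s) and b(a | s) respectively:

```python
def importance_sampling_ratio(episode, t, target_prob, behaviour_prob):
    """Compute rho_{t:T-1} for one episode, given as (state, action, reward) triples.

    Assumes coverage: behaviour_prob(a, s) > 0 whenever target_prob(a, s) > 0.
    """
    rho = 1.0
    for (s, a, _) in episode[t:]:
        rho *= target_prob(a, s) / behaviour_prob(a, s)
    return rho
```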
Monte Carlo Methods
I The returns Gt generated by following the behaviour policy b have the wrong expectation,
$$\mathbb{E}[G_t \mid S_t = s] = v_b(s),$$
but weighting them by the importance sampling ratio corrects this:
$$\mathbb{E}[\rho_{t:T-1} G_t \mid S_t = s] = v_\pi(s).$$
Monte Carlo Methods
The MC algorithm
I Now we are ready to give an MC algorithm that averages returns from a batch of
observed episodes following policy b to estimate vπ (s ).
I It is convenient here to number time steps in a way that increases across episode
boundaries. That is, if the first episode of the batch ends in a terminal state at time
100, then the next episode begins at time t = 101.
I This enables us to use time-step numbers to refer to particular steps in particular
episodes.
I In particular, we can define the set of all time steps in which state s is visited, denoted
T (s ). This is for an every-visit method; for a first-visit method, T (s ) would only
include time steps that were first visits to s within their episodes.
I Also, let T (t ) denote the first time of termination following time t, and Gt denote the
return after t up through T (t ). Then {Gt }t ∈T (s ) are the returns that pertain to state s,
and {ρt :T (t )−1 }t ∈T (s ) are the corresponding importance-sampling ratios.
Monte Carlo Methods
I To estimate vπ (s ), we simply scale the returns by the ratios and average the results:
$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}.$$
I When importance sampling is done as a simple average in this way it is called
ordinary importance sampling.
I An important alternative is weighted importance sampling, which uses a weighted
average, defined as
$$V(s) := \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}},$$
or zero if the denominator is zero.
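I A sketch of both estimators for a single state s, given parallel lists of the returns Gt and the ratios ρt:T(t)−1 for the time steps in T(s) (reusing the hypothetical ratio helper above):

```python
def ordinary_importance_sampling(returns, ratios):
    """V(s) = sum(rho_t * G_t) / |T(s)|: a simple average of the scaled returns."""
    if not returns:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / len(returns)

def weighted_importance_sampling(returns, ratios):
    """V(s) = sum(rho_t * G_t) / sum(rho_t), or zero if the denominator is zero."""
    denom = sum(ratios)
    if denom == 0.0:
        return 0.0
    return sum(rho * g for rho, g in zip(ratios, returns)) / denom
```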
Monte Carlo Methods
Bibliography