
Monte Carlo Methods

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Sep 17 and 23, 2021



Agenda

§ Understand how to evaluate policies in the model-free setting using Monte Carlo methods
§ Understand how Monte Carlo methods are used in the model-free setting for control of Reinforcement Learning problems


Resources

§ Reinforcement Learning by David Silver [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ Monte Carlo Simulation by Nando de Freitas [Link]
§ SB (Sutton and Barto): Chapter 5


Model Free Setting


§ Like the previous few lectures, here also we will deal with prediction
and control problems but this time it will be in a model-free setting
§ In model-free setting we do not have the full knowledge of the MDP
§ Model-free prediction: Estimate the value function of an unknown
MDP
§ Model-free control: Optimise the value function of an unknown
MDP
§ Model-free methods require only experience - sample sequences of
states, actions, and rewards (S1 , A1 , R2 , · · · ) from actual or
simulated interaction with an environment.
§ Actual experince requires no knowledge of the environment’s
dynamics.
§ Simulated experience ‘requires’ models to generate samples only. No
knowledge of the complete probability distributions of state
transitions is required. In many cases this is easy to do.

Monte Carlo

§ What is the probability that a dart thrown uniformly at random in the unit square will hit the red area?

[Figure: the unit square with corners (0,0), (1,0), (0,1), (1,1) containing a shaded red region; sampled dart locations are marked with ×]

§ When the red region is a simple shape, the probability can be computed exactly: it equals the area of the red region.
§ For an arbitrary red region, throw many darts uniformly at random and estimate

P(area) ≈ (# darts in red area) / (# darts)
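A minimal sketch of this estimator in Python (the red region below is a hypothetical choice, a quarter disc; the ratio estimator is what the slide describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_darts = 100_000

# Throw darts uniformly at random in the unit square.
x = rng.uniform(0.0, 1.0, n_darts)
y = rng.uniform(0.0, 1.0, n_darts)

# Hypothetical red region: quarter disc of radius 1 centred at (0,0).
in_red = x**2 + y**2 <= 1.0

# Monte Carlo estimate: (# darts in red area) / (# darts).
p_hat = in_red.mean()
print(p_hat, np.pi / 4)  # estimate vs. exact area of the quarter disc
```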

History of Monte Carlo

§ The bomb and ENIAC

Image taken from: www.livescience.com


Image taken from: www.digitaltrends.com


Monte Carlo for Expectation Calculation


§ Let's say we want to compute E[f(x)] = ∫ f(x) p(x) dx
§ Draw i.i.d. samples {x^(i)}_{i=1}^{N} from the probability density p(x)

Image taken from: Nando de Freitas: MLSS 08

§ Approximate p(x) ≈ (1/N) Σ_{i=1}^{N} δ_{x^(i)}(x)   [δ_{x^(i)}(x) is an impulse at x^(i) on the x axis]
§ Then

E[f(x)] = ∫ f(x) p(x) dx ≈ ∫ f(x) (1/N) Σ_{i=1}^{N} δ_{x^(i)}(x) dx = (1/N) Σ_{i=1}^{N} ∫ f(x) δ_{x^(i)}(x) dx = (1/N) Σ_{i=1}^{N} f(x^(i))
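A short numerical check of this sample-mean approximation (a sketch; the choices of p and f below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Illustrative choice: p(x) is a standard normal density, f(x) = x**2.
samples = rng.standard_normal(N)   # x^(i) ~ p(x), i.i.d.
estimate = np.mean(samples**2)     # (1/N) * sum_i f(x^(i))
print(estimate)                    # close to E[x^2] = 1 for a standard normal
```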

Monte Carlo Policy Evaluation

§ Learn vπ from episodes of experience under policy π

S1 , A1 , R2 , S2 , A2 , R3 , · · · , Sk , Ak , Rk ∼ π

§ Recall that the return is the total discounted reward:

Gt = Rt+1 + γ Rt+2 + · · · + γ^(T−t−1) RT

§ Recall that the value function is the expected return:

vπ (s) = E [Gt |St = s]

§ Monte-Carlo policy evaluation uses empirical mean return instead of expected return


First Visit Monte Carlo Policy Evaluation

§ To evaluate state s, i.e. to learn vπ(s):
§ The first time-step t that state s is visited in an episode,
§ Increment counter N(s) ← N(s) + 1
§ Increment total return S(s) ← S(s) + Gt
§ Value is estimated by the mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
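A minimal sketch of this procedure (assuming episodes are already generated under π and given as lists of (state, reward) pairs; dropping the first-visit check turns it into the every-visit variant described next):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi from episodes; each episode is a list of (state, reward)
    pairs, where reward is the reward received after leaving that state."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return S(s)
    V = {}
    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # First-visit: only the first occurrence of each state contributes.
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```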


Every Visit Monte Carlo Policy Evaluation

§ To evaluate state s, i.e. to learn vπ(s):
§ Every time-step t that state s is visited in an episode,
§ Increment counter N(s) ← N(s) + 1
§ Increment total return S(s) ← S(s) + Gt
§ Value is estimated by the mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞


Blackjack Example
States (200 of them):
Current sum (12-21)
Dealer’s showing card (ace-10)
Do I have a “useable” ace? (yes-no)
Action stick: Stop receiving cards (and terminate)
Action twist: Take another card (no replacement)
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise
Transitions: automatically twist if sum of cards < 12
Slide courtesy: David Silver [Deepmind]


Blackjack Value Function after Monte-Carlo Learning

Policy: stick if sum of cards ≥ 20, otherwise twist


Slide courtesy: David Silver [Deepmind]
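A sketch of how such an estimate can be reproduced, assuming the Gymnasium Blackjack-v1 environment (its observation is a (player sum, dealer showing card, usable ace) tuple and action 1 means twist/hit, 0 means stick; the episode count is arbitrary). It is essentially the first-visit routine above run under the fixed stick-on-20 policy:

```python
import gymnasium as gym
from collections import defaultdict

env = gym.make("Blackjack-v1")
N, S, V = defaultdict(int), defaultdict(float), {}
n_episodes = 500_000  # arbitrary; more episodes give a smoother value surface

for _ in range(n_episodes):
    obs, _ = env.reset()
    episode, done = [], False
    while not done:
        action = 0 if obs[0] >= 20 else 1   # policy: stick if sum >= 20, otherwise twist
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, reward))
        obs, done = next_obs, terminated or truncated
    # Undiscounted returns; states never repeat within a blackjack episode,
    # so first-visit and every-visit estimates coincide here.
    G = 0.0
    for state, reward in reversed(episode):
        G += reward
        N[state] += 1
        S[state] += G
        V[state] = S[state] / N[state]
```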


Monte Carlo Control

§ We will now see how Monte Carlo estimation can be used in control.
§ This is mostly like generalized policy iteration (GPI), where one maintains both an approximate policy and an approximate value function.
§ Policy evaluation is done as Monte Carlo evaluation.
§ Then, we can do greedy policy improvement.
§ What is the problem?

π'(s) = arg max_{a∈A} [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ]

Monte Carlo Control

§ Greedy policy improvement over v(s) requires a model of the MDP:

π'(s) = arg max_{a∈A} [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ]

§ Greedy policy improvement over q(s, a) is model-free:

π'(s) = arg max_{a∈A} q(s, a)

§ How can we do Monte Carlo policy evaluation for q(s, a)?
§ Essentially the same as Monte Carlo evaluation for state values. Start at a state s, pick an action a and then follow the policy.
§ After a few such episodes, average the returns to get an estimate of q(s, a).


Monte Carlo Control

§ What are some concerns?
§ First visit/Every visit!!
§ Suppose you start at a state s and take action a. You reach state s1 and then, following the policy π at s1, you take the action a1 = π(s1). Can you take the rest of the trajectory as a sample to estimate q(s1, a1)?
§ Practically you can, but convergence cannot be guaranteed. The reason is that this strategy draws a disproportionately large number of actions corresponding to π. So, each sample is considered only for the starting s and a.
§ How do we make sure we have q(s, a) estimates for all s and a? Especially because of the above, 'exploring starts' becomes important.


Monte Carlo Control

§ Many state-action pairs may never be visited.
§ For a deterministic policy, with no returns to average, the Monte Carlo estimates of many actions will not improve with experience.
§ This is the general problem of maintaining exploration.
§ One way to do this is by specifying that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start.
§ This assumption is called 'exploring starts'.
§ Monte Carlo with Exploring Starts is an 'on-policy' method. On-policy methods evaluate or improve the policy by drawing samples from the same policy.
§ Off-policy methods evaluate or improve a policy different from that used to generate the samples.


Monte Carlo Control

§ Before going to off-policy methods, let us look into an on-policy Monte Carlo control method that does not use exploring starts.
§ The assumption of exploring starts is sometimes useful, but it cannot be relied upon in general, particularly when learning directly from actual interaction with an environment.
§ The easiest alternative is to consider stochastic policies with a nonzero probability of selecting all actions in each state.
§ Instead of getting a greedy policy in the policy improvement step, an ε-greedy policy is obtained.
§ It means that most of the time the action with the maximum estimated action value is chosen, but sometimes (with probability ε) an action is chosen at random.
§ The probability of choosing each nongreedy action is ε/|A(s)|, whereas the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
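A small sketch of these probabilities (hypothetical helper; q_s maps each action available in s to its current action-value estimate):

```python
def epsilon_greedy_probs(q_s, epsilon):
    """Return {action: probability} for an epsilon-greedy policy at one state."""
    actions = list(q_s)
    n = len(actions)
    greedy = max(actions, key=lambda a: q_s[a])
    probs = {a: epsilon / n for a in actions}   # each action gets eps/|A(s)|
    probs[greedy] += 1.0 - epsilon              # greedy gets 1 - eps + eps/|A(s)|
    return probs

# Example: two actions, epsilon = 0.1 -> greedy gets 0.95, the other 0.05.
print(epsilon_greedy_probs({"stick": 1.2, "twist": 0.3}, 0.1))
```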

Monte Carlo Control

§ An ε-greedy policy is an example of a bigger class of policies known as ε-soft policies, where π(a|s) ≥ ε/|A(s)| for all states and actions, for some ε > 0.
§ Among ε-soft policies, the ε-greedy policy is, in some sense, closest to greedy.
§ By using the ε-greedy policy improvement strategy, we achieve the best policy among ε-soft policies, while eliminating the assumption of 'exploring starts'.
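Putting the last few slides together, a sketch of on-policy (every-visit) Monte Carlo control with an ε-greedy policy; env is assumed to expose a Gymnasium-style reset()/step() interface and a finite list of actions:

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, actions, n_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)   # Q[(s, a)] estimates
    N = defaultdict(int)     # visit counts for incremental averaging
    for _ in range(n_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        obs, _ = env.reset()
        episode, done = [], False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(obs, a)])
            next_obs, reward, terminated, truncated, _ = env.step(action)
            episode.append((obs, action, reward))
            obs, done = next_obs, terminated or truncated
        # Policy evaluation: every-visit Monte Carlo update of Q.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        # Policy improvement is implicit: the next episode acts
        # epsilon-greedily with respect to the updated Q.
    return Q
```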


Off-policy Methods

§ All methods trying to learn control face a dilemma.
▶ They seek to learn action values conditional on subsequent optimal behavior.
▶ But they need to behave non-optimally in order to explore all actions (to find the optimal actions).
§ The on-policy approach is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
§ Off-policy methods address this by using two policies for two different purposes.
▶ One that is learned about and that becomes the optimal policy - the target policy.
▶ One that is more exploratory and is used to generate behavior - the behavior policy.


Off-policy Prediction

§ Estimate vπ or qπ of the target policy π, but we have episodes from another policy µ, the behavior policy.
§ Almost all off-policy methods utilize concepts from sampling theory for such operations.


Rejection Sampling
Set i = 1
Repeat until i = N
  1. Sample x^(i) ∼ q(x) and u ∼ U(0, 1)
  2. If u < p(x^(i)) / (M q(x^(i))), then accept x^(i) and increment counter i by 1. Otherwise, reject.
(Here M is a constant chosen so that p(x) ≤ M q(x) for all x.)
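A sketch of this loop in Python (the target p, proposal q and bound M below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000

# Illustrative densities: target p = standard normal, proposal q = standard Cauchy.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
M = 1.6                                     # chosen so that p(x) <= M * q(x) for all x

accepted = []
while len(accepted) < N:
    x = rng.standard_cauchy()               # sample x ~ q(x)
    u = rng.uniform()                       # sample u ~ U(0, 1)
    if u < p(x) / (M * q(x)):               # accept with probability p(x) / (M q(x))
        accepted.append(x)

print(np.mean(accepted), np.var(accepted))  # roughly 0 and 1 for the normal target
```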


Importance Sampling

§ What is bad about rejection sampling?
§ Many wasted samples! Why?
§ Importance sampling is a classical way to address this. You keep all the samples from the proposal/behavior distribution; you just weigh them.
§ Let's say we want to compute E_{x∼p(·)}[f(x)] = ∫ f(x) p(x) dx

E_{x∼p(·)}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx
                 = E_{x∼q(·)}[ f(x) p(x)/q(x) ]
                 ≈ (1/N) Σ_{i=1}^{N} f(x^(i)) p(x^(i))/q(x^(i)),   x^(i) ∼ q(·)

§ p(x^(i))/q(x^(i)) is called the importance weight.
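Continuing the earlier sketch (same illustrative p, q and an illustrative f, not from the slides), the importance-sampling estimate keeps every proposal sample and reweights it:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Estimate E_{x~p}[f(x)] with p = standard normal, f(x) = x**2,
# using samples drawn from a proposal q = standard Cauchy.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
f = lambda x: x**2

x = rng.standard_cauchy(N)     # x^(i) ~ q(.)
w = p(x) / q(x)                # importance weights p(x^(i)) / q(x^(i))
estimate = np.mean(f(x) * w)   # (1/N) * sum_i f(x^(i)) w^(i)
print(estimate)                # close to E_{x~p}[x^2] = 1
```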

Normalized Importance Sampling

To avoid numerical instability, the denominator N is replaced by the sum of the importance weights:

E_{x∼p(·)}[f(x)] ≈ [ Σ_{x^(i)∼q(·)} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼q(·)} p(x^(i))/q(x^(i)) ]
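With the same illustrative weights as the previous sketch, the normalized (self-normalized) estimate is:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_cauchy(200_000)   # x^(i) ~ q(.)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
w = p(x) / q(x)

# Normalized importance sampling: divide by the sum of the weights instead of N.
estimate = np.sum(x**2 * w) / np.sum(w)
print(estimate)                    # again close to E_{x~p}[x^2] = 1
```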


MC Control with Importance Sampling

§ What are the samples x^(i)? What are p(·) and q(·) in our case? And what is f(x^(i))?

E_{x∼p(·)}[f(x)] ≈ [ Σ_{x^(i)∼q(·)} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼q(·)} p(x^(i))/q(x^(i)) ]

§ x^(i) are the trajectories.
§ p(x^(i)) is the probability of the trajectory x^(i) given that the trajectory follows the target policy.
§ q(x^(i)) is the probability of the trajectory x^(i) given that the trajectory follows the behavior policy.
§ f(x^(i)) is the return.


MC Control with Importance Sampling

§ How is a trajectory represented?
§ Refresher from the very first lecture:
• Goal in the RL problem: to maximize the total reward "in expectation" over the long run.
• τ ≝ (s1, a1, s2, a2, …),   p(τ) = p(s1) Π_t p(a_t|s_t) p(s_{t+1}|s_t, a_t)
• max E_{τ∼p(τ)} [ Σ_t R(s_t, a_t) ]
(CS60077 / Reinforcement Learning | Introduction, (c) Abir Das)
§ Let some trajectory x^(i) be (s1, a1, s2, a2, · · ·)
§ p(x^(i)) = p(s1) π(a1|s1) p(s2|s1, a1) π(a2|s2) p(s3|s2, a2) · · ·
§ q(x^(i)) = p(s1) µ(a1|s1) p(s2|s1, a1) µ(a2|s2) p(s3|s2, a2) · · ·
§ The unknown initial-state and transition probabilities cancel in the ratio:

p(x^(i)) / q(x^(i)) = [ p(s1) π(a1|s1) p(s2|s1, a1) π(a2|s2) p(s3|s2, a2) · · · ] / [ p(s1) µ(a1|s1) p(s2|s1, a1) µ(a2|s2) p(s3|s2, a2) · · · ]
                    = [ π(a1|s1) π(a2|s2) · · · ] / [ µ(a1|s1) µ(a2|s2) · · · ]
                    = Π_{t=1}^{T_i} π(a_t|s_t) / µ(a_t|s_t)

MC Control with Importance Sampling

E_{x∼π}[f(x)] ≈ [ Σ_{x^(i)∼µ} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼µ} p(x^(i))/q(x^(i)) ]

vπ(s) = E[G | S1 = s]
      ≈ [ Σ_{i=1}^{N} G^(i) Π_{t=1}^{T_i} π(a_t^(i)|s_t^(i)) / µ(a_t^(i)|s_t^(i)) ] / [ Σ_{i=1}^{N} Π_{t=1}^{T_i} π(a_t^(i)|s_t^(i)) / µ(a_t^(i)|s_t^(i)) ]

§ This was the evaluation step; then do the greedy policy improvement.
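A sketch of this weighted (normalized) importance-sampling evaluation, assuming the episodes were generated under a known behavior policy µ and that both policies are given as functions returning action probabilities:

```python
from collections import defaultdict

def off_policy_mc_evaluation(episodes, target_pi, behavior_mu, gamma=1.0):
    """Estimate v_pi(s) for the starting state of each episode, using episodes
    generated under behavior_mu. Each episode is a list of (state, action, reward)
    tuples; target_pi(a, s) and behavior_mu(a, s) return action probabilities."""
    num = defaultdict(float)   # sum over episodes of rho^(i) * G^(i)
    den = defaultdict(float)   # sum over episodes of rho^(i)
    for episode in episodes:
        start_state = episode[0][0]
        # Return of the episode and importance ratio of the whole trajectory.
        G, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            G += discount * r
            discount *= gamma
            rho *= target_pi(a, s) / behavior_mu(a, s)
        num[start_state] += rho * G
        den[start_state] += rho
    return {s: num[s] / den[s] for s in num if den[s] > 0}
```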
