
Monte Carlo Methods

CS60077: Reinforcement Learning

Abir Das

IIT Kharagpur

Sep 17 and 23, 2021



Agenda

§ Understand how to evaluate policies in the model-free setting using Monte Carlo methods
§ Understand how Monte Carlo methods are used in the model-free setting for control of Reinforcement Learning problems


Resources

§ Reinforcement Learning by David Silver [Link]
§ Reinforcement Learning by Balaraman Ravindran [Link]
§ Monte Carlo Simulation by Nando de Freitas [Link]
§ SB (Sutton and Barto): Chapter 5


Model Free Setting


§ Like the previous few lectures, here also we will deal with prediction
and control problems but this time it will be in a model-free setting
§ In model-free setting we do not have the full knowledge of the MDP
§ Model-free prediction: Estimate the value function of an unknown
MDP
§ Model-free control: Optimise the value function of an unknown
MDP
§ Model-free methods require only experience - sample sequences of
states, actions, and rewards (S1 , A1 , R2 , · · · ) from actual or
simulated interaction with an environment.
§ Actual experince requires no knowledge of the environment’s
dynamics.
§ Simulated experience ‘requires’ models to generate samples only. No
knowledge of the complete probability distributions of state
transitions is required. In many cases this is easy to do.

Monte Carlo

§ What is the probability that a dart thrown uniformly at random in the unit square will hit the red area?

[Figure: the unit square with corners (0,0), (1,0), (0,1), (1,1) containing a shaded red region; sampled dart locations are marked with ×]

§ When the red region is a simple shape, the probability can be computed exactly: it equals the area of the red region.
§ For an arbitrary red region, throw many darts uniformly at random and estimate

P(area) ≈ (# darts in red area) / (# darts)
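A minimal sketch of this estimator in Python (the red region below is a hypothetical choice, a quarter disc; the ratio estimator is what the slide describes):

```python
import numpy as np

rng = np.random.default_rng(0)
n_darts = 100_000

# Throw darts uniformly at random in the unit square.
x = rng.uniform(0.0, 1.0, n_darts)
y = rng.uniform(0.0, 1.0, n_darts)

# Hypothetical red region: quarter disc of radius 1 centred at (0,0).
in_red = x**2 + y**2 <= 1.0

# Monte Carlo estimate: (# darts in red area) / (# darts).
p_hat = in_red.mean()
print(p_hat, np.pi / 4)  # estimate vs. exact area of the quarter disc
```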

History of Monte Carlo

§ The bomb and ENIAC

Image taken from: www.livescience.com


Image taken from: www.digitaltrends.com


Monte Carlo for Expectation Calculation


§ Let's say we want to compute E[f(x)] = ∫ f(x) p(x) dx
§ Draw i.i.d. samples {x^(i)}_{i=1}^{N} from the probability density p(x)

Image taken from: Nando de Freitas: MLSS 08

§ Approximate p(x) ≈ (1/N) Σ_{i=1}^{N} δ_{x^(i)}(x)   [δ_{x^(i)}(x) is an impulse at x^(i) on the x axis]
§ Then

E[f(x)] = ∫ f(x) p(x) dx ≈ ∫ f(x) (1/N) Σ_{i=1}^{N} δ_{x^(i)}(x) dx = (1/N) Σ_{i=1}^{N} ∫ f(x) δ_{x^(i)}(x) dx = (1/N) Σ_{i=1}^{N} f(x^(i))
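A short numerical check of this sample-mean approximation (a sketch; the choices of p and f below are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200_000

# Illustrative choice: p(x) is a standard normal density, f(x) = x**2.
samples = rng.standard_normal(N)   # x^(i) ~ p(x), i.i.d.
estimate = np.mean(samples**2)     # (1/N) * sum_i f(x^(i))
print(estimate)                    # close to E[x^2] = 1 for a standard normal
```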

Monte Carlo Policy Evaluation

§ Learn vπ from episodes of experience under policy π

S1 , A1 , R2 , S2 , A2 , R3 , · · · , Sk , Ak , Rk ∼ π

§ Recall that the return is the total discounted reward:

Gt = Rt+1 + γ Rt+2 + · · · + γ^(T−t−1) RT

§ Recall that the value function is the expected return:

vπ (s) = E [Gt |St = s]

§ Monte-Carlo policy evaluation uses empirical mean return instead of expected return


First Visit Monte Carlo Policy Evaluation

§ To evaluate state s, i.e. to learn vπ(s):
§ The first time-step t that state s is visited in an episode,
§ Increment counter N(s) ← N(s) + 1
§ Increment total return S(s) ← S(s) + Gt
§ Value is estimated by the mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞
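A minimal sketch of this procedure (assuming episodes are already generated under π and given as lists of (state, reward) pairs; dropping the first-visit check turns it into the every-visit variant described next):

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """Estimate v_pi from episodes; each episode is a list of (state, reward)
    pairs, where reward is the reward received after leaving that state."""
    N = defaultdict(int)      # visit counter N(s)
    S = defaultdict(float)    # total return S(s)
    V = {}
    for episode in episodes:
        # Compute the return G_t for every time-step by scanning backwards.
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        # First-visit: only the first occurrence of each state contributes.
        seen = set()
        for state, G in returns:
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```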


Every Visit Monte Carlo Policy Evaluation

§ To evaluate state s, i.e. to learn vπ(s):
§ Every time-step t that state s is visited in an episode,
§ Increment counter N(s) ← N(s) + 1
§ Increment total return S(s) ← S(s) + Gt
§ Value is estimated by the mean return V(s) = S(s)/N(s)
§ By the law of large numbers, V(s) → vπ(s) as N(s) → ∞


Blackjack Example
States (200 of them):
Current sum (12-21)
Dealer’s showing card (ace-10)
Do I have a “useable” ace? (yes-no)
Action stick: Stop receiving cards (and terminate)
Action twist: Take another card (no replacement)
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise
Transitions: automatically twist if sum of cards < 12
Slide courtesy: David Silver [Deepmind]


Blackjack Value Function after Monte-Carlo Learning

Policy: stick if sum of cards ≥ 20, otherwise twist


Slide courtesy: David Silver [Deepmind]
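A sketch of how such an estimate can be reproduced, assuming the Gymnasium Blackjack-v1 environment (its observation is a (player sum, dealer showing card, usable ace) tuple and action 1 means twist/hit, 0 means stick; the episode count is arbitrary). It is essentially the first-visit routine above run under the fixed stick-on-20 policy:

```python
import gymnasium as gym
from collections import defaultdict

env = gym.make("Blackjack-v1")
N, S, V = defaultdict(int), defaultdict(float), {}
n_episodes = 500_000  # arbitrary; more episodes give a smoother value surface

for _ in range(n_episodes):
    obs, _ = env.reset()
    episode, done = [], False
    while not done:
        action = 0 if obs[0] >= 20 else 1   # policy: stick if sum >= 20, otherwise twist
        next_obs, reward, terminated, truncated, _ = env.step(action)
        episode.append((obs, reward))
        obs, done = next_obs, terminated or truncated
    # Undiscounted returns; states never repeat within a blackjack episode,
    # so first-visit and every-visit estimates coincide here.
    G = 0.0
    for state, reward in reversed(episode):
        G += reward
        N[state] += 1
        S[state] += G
        V[state] = S[state] / N[state]
```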


Monte Carlo Control

§ We will now see how Monte Carlo estimation can be used in control.
§ This is mostly like generalized policy iteration (GPI), where one maintains both an approximate policy and an approximate value function.
§ Policy evaluation is done as Monte Carlo evaluation.
§ Then, we can do greedy policy improvement.
§ What is the problem?

π'(s) = arg max_{a∈A} [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ]

Monte Carlo Control

§ Greedy policy improvement over v(s) requires a model of the MDP:

π'(s) = arg max_{a∈A} [ r(s, a) + γ Σ_{s'∈S} p(s'|s, a) vπ(s') ]

§ Greedy policy improvement over q(s, a) is model-free:

π'(s) = arg max_{a∈A} q(s, a)

§ How can we do Monte Carlo policy evaluation for q(s, a)?
§ Essentially the same as Monte Carlo evaluation for state values. Start at a state s, pick an action a and then follow the policy.
§ After a few such episodes, average the returns to get an estimate of q(s, a).


Monte Carlo Control

§ What are some concerns?
§ First visit/Every visit!!
§ Suppose you start at a state s and take action a. You reach state s1 and then, following the policy π at s1, you take the action a1 = π(s1). Can you take the rest of the trajectory as a sample to estimate q(s1, a1)?
§ Practically you can, but convergence cannot be guaranteed. The reason is that this strategy draws a disproportionately large number of actions corresponding to π. So, each sample is considered only for the starting s and a.
§ How do we make sure we have q(s, a) estimates for all s and a? Especially because of the above, 'exploring starts' becomes important.


Monte Carlo Control

§ Many state-action pairs may never be visited.
§ For a deterministic policy, with no returns to average, the Monte Carlo estimates of many actions will not improve with experience.
§ This is the general problem of maintaining exploration.
§ One way to do this is by specifying that the episodes start in a state-action pair, and that every pair has a nonzero probability of being selected as the start.
§ This assumption is called 'exploring starts'.
§ Monte Carlo with Exploring Starts is an 'on-policy' method. On-policy methods evaluate or improve the policy by drawing samples from the same policy.
§ Off-policy methods evaluate or improve a policy different from that used to generate the samples.


Monte Carlo Control

§ Before going to off-policy methods, let us look into an on-policy Monte Carlo control method that does not use exploring starts.
§ The assumption of exploring starts is sometimes useful, but it cannot be relied upon in general, particularly when learning directly from actual interaction with an environment.
§ The easiest alternative is to consider stochastic policies with a nonzero probability of selecting all actions in each state.
§ Instead of getting a greedy policy in the policy improvement step, an ε-greedy policy is obtained.
§ It means that most of the time the action with the maximum estimated action value is chosen, but sometimes (with probability ε) an action is chosen at random.
§ The probability of choosing each nongreedy action is ε/|A(s)|, whereas the remaining bulk of the probability, 1 − ε + ε/|A(s)|, is given to the greedy action.
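A small sketch of these probabilities (hypothetical helper; q_s maps each action available in s to its current action-value estimate):

```python
def epsilon_greedy_probs(q_s, epsilon):
    """Return {action: probability} for an epsilon-greedy policy at one state."""
    actions = list(q_s)
    n = len(actions)
    greedy = max(actions, key=lambda a: q_s[a])
    probs = {a: epsilon / n for a in actions}   # each action gets eps/|A(s)|
    probs[greedy] += 1.0 - epsilon              # greedy gets 1 - eps + eps/|A(s)|
    return probs

# Example: two actions, epsilon = 0.1 -> greedy gets 0.95, the other 0.05.
print(epsilon_greedy_probs({"stick": 1.2, "twist": 0.3}, 0.1))
```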

Monte Carlo Control

§ An ε-greedy policy is an example of a bigger class of policies known as ε-soft policies, where π(a|s) ≥ ε/|A(s)| for all states and actions, for some ε > 0.
§ Among ε-soft policies, the ε-greedy policy is, in some sense, closest to greedy.
§ By using the ε-greedy policy improvement strategy, we achieve the best policy among ε-soft policies, while eliminating the assumption of 'exploring starts'.
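Putting the last few slides together, a sketch of on-policy (every-visit) Monte Carlo control with an ε-greedy policy; env is assumed to expose a Gymnasium-style reset()/step() interface and a finite list of actions:

```python
import random
from collections import defaultdict

def mc_control_epsilon_greedy(env, actions, n_episodes, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)   # Q[(s, a)] estimates
    N = defaultdict(int)     # visit counts for incremental averaging
    for _ in range(n_episodes):
        # Generate one episode with the current epsilon-greedy policy.
        obs, _ = env.reset()
        episode, done = [], False
        while not done:
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(obs, a)])
            next_obs, reward, terminated, truncated, _ = env.step(action)
            episode.append((obs, action, reward))
            obs, done = next_obs, terminated or truncated
        # Policy evaluation: every-visit Monte Carlo update of Q.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        # Policy improvement is implicit: the next episode acts
        # epsilon-greedily with respect to the updated Q.
    return Q
```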


Off-policy Methods

§ All methods trying to learn control face a dilemma.
▶ They seek to learn action values conditional on subsequent optimal behavior.
▶ But they need to behave non-optimally in order to explore all actions (to find the optimal actions).
§ The on-policy approach is actually a compromise: it learns action values not for the optimal policy, but for a near-optimal policy that still explores.
§ Off-policy methods address this by using two policies for two different purposes.
▶ One that is learned about and that becomes the optimal policy - the target policy.
▶ One that is more exploratory and is used to generate behavior - the behavior policy.


Off-policy Prediction

§ Estimate vπ or qπ of the target policy π, but we have episodes from another policy µ, the behavior policy.
§ Almost all off-policy methods utilize concepts from sampling theory for such operations.


Rejection Sampling
Set i = 1
Repeat until i = N
  1. Sample x^(i) ∼ q(x) and u ∼ U(0, 1)
  2. If u < p(x^(i)) / (M q(x^(i))), then accept x^(i) and increment counter i by 1. Otherwise, reject.
(Here M is a constant chosen so that p(x) ≤ M q(x) for all x.)
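A sketch of this loop in Python (the target p, proposal q and bound M below are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000

# Illustrative densities: target p = standard normal, proposal q = standard Cauchy.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
M = 1.6                                     # chosen so that p(x) <= M * q(x) for all x

accepted = []
while len(accepted) < N:
    x = rng.standard_cauchy()               # sample x ~ q(x)
    u = rng.uniform()                       # sample u ~ U(0, 1)
    if u < p(x) / (M * q(x)):               # accept with probability p(x) / (M q(x))
        accepted.append(x)

print(np.mean(accepted), np.var(accepted))  # roughly 0 and 1 for the normal target
```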


Importance Sampling

§ What is bad about rejection sampling?
§ Many wasted samples! Why?
§ Importance sampling is a classical way to address this. You keep all the samples from the proposal/behavior distribution; you just weigh them.
§ Let's say we want to compute E_{x∼p(·)}[f(x)] = ∫ f(x) p(x) dx

E_{x∼p(·)}[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) (p(x)/q(x)) q(x) dx
                 = E_{x∼q(·)}[ f(x) p(x)/q(x) ]
                 ≈ (1/N) Σ_{i=1}^{N} f(x^(i)) p(x^(i))/q(x^(i)),   x^(i) ∼ q(·)

§ p(x^(i))/q(x^(i)) is called the importance weight.
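Continuing the earlier sketch (same illustrative p, q and an illustrative f, not from the slides), the importance-sampling estimate keeps every proposal sample and reweights it:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Estimate E_{x~p}[f(x)] with p = standard normal, f(x) = x**2,
# using samples drawn from a proposal q = standard Cauchy.
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
f = lambda x: x**2

x = rng.standard_cauchy(N)     # x^(i) ~ q(.)
w = p(x) / q(x)                # importance weights p(x^(i)) / q(x^(i))
estimate = np.mean(f(x) * w)   # (1/N) * sum_i f(x^(i)) w^(i)
print(estimate)                # close to E_{x~p}[x^2] = 1
```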

Normalized Importance Sampling

To avoid numerical instability, the denominator N is replaced by the sum of the importance weights:

E_{x∼p(·)}[f(x)] ≈ [ Σ_{x^(i)∼q(·)} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼q(·)} p(x^(i))/q(x^(i)) ]
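With the same illustrative weights as the previous sketch, the normalized (self-normalized) estimate is:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_cauchy(200_000)   # x^(i) ~ q(.)
p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q = lambda x: 1.0 / (np.pi * (1.0 + x**2))
w = p(x) / q(x)

# Normalized importance sampling: divide by the sum of the weights instead of N.
estimate = np.sum(x**2 * w) / np.sum(w)
print(estimate)                    # again close to E_{x~p}[x^2] = 1
```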


MC Control with Importance Sampling

§ What are the samples x^(i)? What are p(·) and q(·) in our case? And what is f(x^(i))?

E_{x∼p(·)}[f(x)] ≈ [ Σ_{x^(i)∼q(·)} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼q(·)} p(x^(i))/q(x^(i)) ]

§ x^(i) are the trajectories.
§ p(x^(i)) is the probability of the trajectory x^(i) given that the trajectory follows the target policy.
§ q(x^(i)) is the probability of the trajectory x^(i) given that the trajectory follows the behavior policy.
§ f(x^(i)) is the return.


MC Control with Importance Sampling

§ How is a trajectory represented?
§ Refresher from the very first lecture:
• Goal in the RL problem: to maximize the total reward "in expectation" over the long run.
• τ ≝ (s1, a1, s2, a2, …),   p(τ) = p(s1) Π_t p(a_t|s_t) p(s_{t+1}|s_t, a_t)
• max E_{τ∼p(τ)} [ Σ_t R(s_t, a_t) ]
(CS60077 / Reinforcement Learning | Introduction, (c) Abir Das)
§ Let some trajectory x^(i) be (s1, a1, s2, a2, · · ·)
§ p(x^(i)) = p(s1) π(a1|s1) p(s2|s1, a1) π(a2|s2) p(s3|s2, a2) · · ·
§ q(x^(i)) = p(s1) µ(a1|s1) p(s2|s1, a1) µ(a2|s2) p(s3|s2, a2) · · ·
§ The unknown initial-state and transition probabilities cancel in the ratio:

p(x^(i)) / q(x^(i)) = [ p(s1) π(a1|s1) p(s2|s1, a1) π(a2|s2) p(s3|s2, a2) · · · ] / [ p(s1) µ(a1|s1) p(s2|s1, a1) µ(a2|s2) p(s3|s2, a2) · · · ]
                    = [ π(a1|s1) π(a2|s2) · · · ] / [ µ(a1|s1) µ(a2|s2) · · · ]
                    = Π_{t=1}^{T_i} π(a_t|s_t) / µ(a_t|s_t)

MC Control with Importance Sampling

E_{x∼π}[f(x)] ≈ [ Σ_{x^(i)∼µ} f(x^(i)) p(x^(i))/q(x^(i)) ] / [ Σ_{x^(i)∼µ} p(x^(i))/q(x^(i)) ]

vπ(s) = E[G | S1 = s]
      ≈ [ Σ_{i=1}^{N} G^(i) Π_{t=1}^{T_i} π(a_t^(i)|s_t^(i)) / µ(a_t^(i)|s_t^(i)) ] / [ Σ_{i=1}^{N} Π_{t=1}^{T_i} π(a_t^(i)|s_t^(i)) / µ(a_t^(i)|s_t^(i)) ]

§ This was the evaluation step; then do the greedy policy improvement.
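A sketch of this weighted (normalized) importance-sampling evaluation, assuming the episodes were generated under a known behavior policy µ and that both policies are given as functions returning action probabilities:

```python
from collections import defaultdict

def off_policy_mc_evaluation(episodes, target_pi, behavior_mu, gamma=1.0):
    """Estimate v_pi(s) for the starting state of each episode, using episodes
    generated under behavior_mu. Each episode is a list of (state, action, reward)
    tuples; target_pi(a, s) and behavior_mu(a, s) return action probabilities."""
    num = defaultdict(float)   # sum over episodes of rho^(i) * G^(i)
    den = defaultdict(float)   # sum over episodes of rho^(i)
    for episode in episodes:
        start_state = episode[0][0]
        # Return of the episode and importance ratio of the whole trajectory.
        G, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            G += discount * r
            discount *= gamma
            rho *= target_pi(a, s) / behavior_mu(a, s)
        num[start_state] += rho * G
        den[start_state] += rho
    return {s: num[s] / den[s] for s in num if den[s] > 0}
```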
