05 MC Methods
Abir Das
IIT Kharagpur
Agenda
Resources
Monte Carlo
§ What is the probability that a dart thrown uniformly at random in the unit square will hit the red area?

[Figure: unit square with corners (0,0), (1,0), (0,1), (1,1), containing a red region; P(area) = ?]
Abir Das (IIT Kharagpur) CS60077 Sep 17 and 23, 2021 5 / 32
[Figure: darts sampled uniformly in the unit square; 8 of the 19 darts land in the red area]

$$P(\text{area}) \approx \frac{\#\ \text{darts in the area}}{\#\ \text{darts}}$$
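The counting estimate above is easy to check numerically. As a minimal sketch (the slide does not specify the red region, so the quarter disc $x^2 + y^2 \le 1$, whose true area is $\pi/4$, is assumed purely for illustration):

```python
import random

def estimate_area(inside, n_darts=100_000, seed=0):
    """Monte Carlo estimate of P(area): throw darts uniformly at random
    in the unit square and count the fraction that land inside."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_darts):
        x, y = rng.random(), rng.random()
        if inside(x, y):
            hits += 1
    return hits / n_darts

# Assumed red region: the quarter disc x^2 + y^2 <= 1 (true area pi/4 ~ 0.785).
p = estimate_area(lambda x, y: x * x + y * y <= 1.0)
```

With 100,000 darts the estimate concentrates near the true area, illustrating why the hit ratio converges to P(area).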
§ Approximate $p(x) \approx \frac{1}{N}\sum_{i=1}^{N} \delta_{x^{(i)}}(x)$ [$\delta_{x^{(i)}}(x)$ is an impulse at $x^{(i)}$ on the $x$ axis]

§ $\mathbb{E}[f(x)] = \int f(x)\,p(x)\,dx \approx \int f(x)\,\frac{1}{N}\sum_{i=1}^{N}\delta_{x^{(i)}}(x)\,dx = \frac{1}{N}\sum_{i=1}^{N}\underbrace{\int f(x)\,\delta_{x^{(i)}}(x)\,dx}_{f(x^{(i)})} = \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$
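The sample-average estimator above can be sketched in a few lines. The target density and $f$ below are my own choices for illustration (a standard normal and $f(x) = x^2$, so the true value $\mathbb{E}[x^2] = 1$):

```python
import random

def mc_expectation(f, sampler, n=200_000, seed=1):
    """Estimate E[f(x)] by the sample average (1/N) * sum_i f(x_i),
    where x_i are i.i.d. draws from the target distribution."""
    rng = random.Random(seed)
    return sum(f(sampler(rng)) for _ in range(n)) / n

# E[x^2] under a standard normal is the variance, i.e. 1.
est = mc_expectation(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
```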
§ Sample an episode under policy π: $S_1, A_1, R_2, S_2, A_2, R_3, \cdots, S_k, A_k, R_k \sim \pi$

§ $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$
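First-visit Monte-Carlo prediction averages, for each state, the returns $G_t$ observed on the first visit to that state across episodes. A minimal sketch (the two-state toy episodes at the end are my own illustration, not from the slides; each pair is a state and the reward received on leaving it):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit MC prediction: V(s) is the average of the returns G_t
    observed on the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:              # episode: list of (state, reward)
        g = 0.0
        first_visit_returns = {}
        # Walk backwards accumulating G_t = R_{t+1} + gamma * G_{t+1};
        # the forward-most (first) visit's return overwrites later ones.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            first_visit_returns[state] = g
        for state, g_t in first_visit_returns.items():
            returns[state].append(g_t)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two toy episodes over states "A" and "B".
episodes = [[("A", 0.0), ("B", 2.0)], [("A", 1.0)]]
v = first_visit_mc(episodes)
```

Here state "B" is visited once with return 2, and "A" sees returns 2 and 1 across the two episodes, so its estimate is 1.5.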
Blackjack Example
States (200 of them):
Current sum (12-21)
Dealer’s showing card (ace-10)
Do I have a “useable” ace? (yes-no)
Action stick: Stop receiving cards (and terminate)
Action twist: Take another card (no replacement)
Reward for stick:
+1 if sum of cards > sum of dealer cards
0 if sum of cards = sum of dealer cards
-1 if sum of cards < sum of dealer cards
Reward for twist:
-1 if sum of cards > 21 (and terminate)
0 otherwise
Transitions: automatically twist if sum of cards < 12
Slide courtesy: David Silver [Deepmind]
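The episode structure above can be simulated directly and evaluated with plain Monte-Carlo averaging. The sketch below uses simplified card and dealer rules assumed for illustration (infinite deck, dealer sticks at 17, player follows the fixed policy "stick iff sum ≥ 20"); it is not the exact environment from the slide:

```python
import random

CARDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]  # ace=1; J, Q, K count 10

def hand_value(cards):
    """Best total, counting one ace as 11 when that does not bust."""
    total = sum(cards)
    if 1 in cards and total + 10 <= 21:
        return total + 10
    return total

def play_episode(rng, stick_at=20):
    """One episode under 'stick iff sum >= stick_at'; reward +1 / 0 / -1."""
    player = [rng.choice(CARDS), rng.choice(CARDS)]
    dealer = [rng.choice(CARDS), rng.choice(CARDS)]
    while hand_value(player) < stick_at:          # twist
        player.append(rng.choice(CARDS))
        if hand_value(player) > 21:
            return -1                             # bust: -1 and terminate
    while hand_value(dealer) < 17:                # assumed dealer rule
        dealer.append(rng.choice(CARDS))
    p, d = hand_value(player), hand_value(dealer)
    if d > 21 or p > d:
        return 1
    return 0 if p == d else -1

rng = random.Random(0)
avg = sum(play_episode(rng) for _ in range(50_000)) / 50_000
```

Averaging the episode rewards gives the MC estimate of the policy's start-state value; sticking only at 20 or 21 busts often, so the average comes out clearly negative.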
Blackjack Value Function after Monte-Carlo Learning
Off-policy Methods
Off-policy Prediction
Rejection Sampling
§ Set i = 1.
§ Repeat until i = N:
  1. Sample $x^{(i)} \sim q(x)$ and $u \sim U(0,1)$.
  2. If $u < \frac{p(x^{(i)})}{M\,q(x^{(i)})}$, then accept $x^{(i)}$ and increment the counter i by 1. Otherwise, reject.
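The two steps above translate directly into code. A minimal sketch, where the target Beta(2, 2) density and the Uniform(0, 1) proposal are my own choices (its density peaks at 1.5, so M = 1.5 satisfies $p(x) \le M\,q(x)$):

```python
import random

def rejection_sample(p, q_sample, q_pdf, M, n, seed=0):
    """Draw n samples from p via rejection: propose x ~ q and accept
    with probability p(x) / (M * q(x)); requires p(x) <= M * q(x)."""
    rng = random.Random(seed)
    samples, proposals = [], 0
    while len(samples) < n:
        x = q_sample(rng)
        proposals += 1
        if rng.random() < p(x) / (M * q_pdf(x)):
            samples.append(x)
    return samples, proposals

# Assumed target: Beta(2, 2) density 6x(1-x) on [0, 1]; proposal: Uniform(0, 1).
p = lambda x: 6.0 * x * (1.0 - x)
samples, proposals = rejection_sample(p, lambda rng: rng.random(),
                                      lambda x: 1.0, M=1.5, n=20_000)
mean = sum(samples) / len(samples)
```

Note that `proposals` exceeds `n`: on average only a fraction 1/M of proposals are accepted, which is exactly the waste the next slide complains about.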
Importance Sampling
§ What is bad about rejection sampling?
§ Many wasted samples! why?
§ Importance sampling is a classical way to address this. You keep all
the samples from the proposal/behavior distribution, you just weigh
them.
§ Let's say we want to compute $\mathbb{E}_{x\sim p(\cdot)}[f(x)] = \int f(x)\,p(x)\,dx$:

$$\mathbb{E}_{x\sim p(\cdot)}[f(x)] = \int f(x)\,p(x)\,dx = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x\sim q(\cdot)}\!\left[f(x)\,\frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{\substack{i=1 \\ x^{(i)}\sim q(\cdot)}}^{N} f(x^{(i)})\,\frac{p(x^{(i)})}{q(x^{(i)})}$$

§ $\frac{p(x^{(i)})}{q(x^{(i)})}$ is called the importance weight.
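The derivation maps directly onto code: sample from q, weigh each $f(x^{(i)})$ by the importance weight, and average. The densities below (target N(1, 1), proposal N(0, 2), f(x) = x, so the true answer is $\mathbb{E}_p[x] = 1$) are assumptions for illustration:

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def importance_sampling(f, p_pdf, q_pdf, q_sample, n=100_000, seed=0):
    """Estimate E_{x~p}[f(x)] from samples of q, each weighted by p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)
        total += f(x) * p_pdf(x) / q_pdf(x)   # keep every sample, just weigh it
    return total / n

# Assumed setup: target p = N(1, 1), proposal q = N(0, 2); true E_p[x] = 1.
est = importance_sampling(
    f=lambda x: x,
    p_pdf=lambda x: normal_pdf(x, 1.0, 1.0),
    q_pdf=lambda x: normal_pdf(x, 0.0, 2.0),
    q_sample=lambda rng: rng.gauss(0.0, 2.0),
)
```

Unlike rejection sampling, no sample is discarded; low-probability-under-p samples simply contribute with small weight.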
§ What are the samples $x^{(i)}$? What are the $p(\cdot)$ and $q(\cdot)$ in our case? And what is $f(x^{(i)})$?

$$\mathbb{E}_{x\sim p(\cdot)}[f(x)] \approx \frac{\sum_{x^{(i)}\sim q(\cdot)} f(x^{(i)})\,\frac{p(x^{(i)})}{q(x^{(i)})}}{\sum_{x^{(i)}\sim q(\cdot)} \frac{p(x^{(i)})}{q(x^{(i)})}}$$

§ In our case the samples are drawn under the behavior policy µ and the target distribution is the one induced by π:

$$\mathbb{E}_{x\sim \pi}[f(x)] \approx \frac{\sum_{x^{(i)}\sim \mu} f(x^{(i)})\,\frac{\pi(x^{(i)})}{\mu(x^{(i)})}}{\sum_{x^{(i)}\sim \mu} \frac{\pi(x^{(i)})}{\mu(x^{(i)})}}$$
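This normalized form is the self-normalized (weighted) importance-sampling estimator: dividing by the sum of weights means the weights only need to be known up to a constant. A sketch under an assumed setup (target N(1, 1), proposal N(0, 2), f(x) = x, true value 1; the weight function drops the densities' normalizing constants on purpose):

```python
import math
import random

def self_normalized_is(f, weight, q_sample, n=100_000, seed=0):
    """Self-normalized importance sampling: weighted sum of f divided by
    the sum of weights, so unnormalized weights suffice."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        x = q_sample(rng)
        w = weight(x)
        num += f(x) * w
        den += w
    return num / den

# Assumed setup: target N(1, 1), proposal N(0, 2); the weight below is
# p(x)/q(x) up to a multiplicative constant, which cancels in the ratio.
w = lambda x: math.exp(-((x - 1.0) ** 2) / 2.0 + (x ** 2) / 8.0)
est = self_normalized_is(lambda x: x, w, lambda rng: rng.gauss(0.0, 2.0))
```

The estimator is slightly biased for finite N but typically has much lower variance than the unnormalized version, which is why it is the form used for off-policy evaluation below.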
$$v_\pi(s) = \mathbb{E}\left[G \mid S_1 = s\right] \approx \frac{\sum_{i=1}^{N} G^{(i)} \prod_{t=1}^{T_i} \frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{\mu(a_t^{(i)} \mid s_t^{(i)})}}{\sum_{i=1}^{N} \prod_{t=1}^{T_i} \frac{\pi(a_t^{(i)} \mid s_t^{(i)})}{\mu(a_t^{(i)} \mid s_t^{(i)})}}$$
§ This was the evaluation step; then do the greedy policy improvement.
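The weighted estimator above can be sketched directly: per episode, accumulate the return and the product of per-step ratios π/µ, then take the weighted average. The toy single-state MDP at the end is my own illustration (two actions, reward 1 − a, behavior µ uniform, target π always picking action 0, so $v_\pi = 1$):

```python
import random

def off_policy_mc_value(episodes, pi, mu, gamma=1.0):
    """Weighted importance-sampling estimate of v_pi for the start state,
    from episodes [(s_t, a_t, r_{t+1}), ...] generated under behavior mu."""
    num = den = 0.0
    for episode in episodes:
        g, rho, discount = 0.0, 1.0, 1.0
        for s, a, r in episode:
            rho *= pi(a, s) / mu(a, s)   # product of per-step ratios pi/mu
            g += discount * r
            discount *= gamma
        num += rho * g
        den += rho
    return num / den

# Toy one-step episodes: behavior mu uniform over actions {0, 1}; reward 1 - a.
rng = random.Random(0)
episodes = [[("s", (a := rng.randrange(2)), 1.0 - a)] for _ in range(10_000)]
v = off_policy_mc_value(
    episodes,
    pi=lambda a, s: 1.0 if a == 0 else 0.0,  # target: always action 0
    mu=lambda a, s: 0.5,
)
```

Episodes where µ took action 1 get weight zero under π, and the surviving episodes all return 1, so the weighted average recovers $v_\pi = 1$ exactly; greedy improvement over such estimates then gives the control loop.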