Lnotes 04
Last lecture, we introduced methods for model-free policy evaluation, so we can complete line 4 in a model-free way. However, in order to make this entire algorithm model-free, we must also find a way to do line 5, the policy improvement step, in a model-free way. By definition, we have

Qπ(s, a) = R(s, a) + γ ∑s′∈S P(s′|s, a) Vπ(s′).

Thus, we can get a model-free policy iteration algorithm (Algorithm 2) by substituting this value into the model-based policy iteration algorithm and using state-action values throughout.
There are a few caveats to this algorithm due to the substitution we made in line 5:
1. If policy π is deterministic or doesn't take every action a with some positive probability, then we
cannot actually compute the argmax in line 5.
2. The policy evaluation algorithm gives us an estimate of Qπ , so it is not clear whether line 5 will
monotonically improve the policy like in the model-based case.
In the previous section, we saw that one caveat to the model-free policy iteration algorithm is that the
policy π needs to take every action a with some positive probability, so the value of each state-action
pair can be determined. In other words, the policy π needs to explore actions, even if they might be
suboptimal with respect to our current Q-value estimates.
In order to explore actions that are suboptimal with respect to our current Q-value estimates, we'll need a systematic way to balance exploration of suboptimal actions with exploitation of the optimal, or greedy, action. One naive strategy is to take a random action with small probability ε and take the greedy action the rest of the time. This type of exploration strategy is called an ε-greedy policy. Mathematically, an ε-greedy policy with respect to the state-action value Qπ(s, a) takes the following form:
π(a|s) = a                      with probability ε/|A|,
         arg maxa Qπ(s, a)      with probability 1 − ε.
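As a concrete illustration, here is a minimal sketch of sampling an action from such an ε-greedy policy given a tabular Q estimate; the NumPy array layout and the function name are illustrative choices, not something specified in the notes.

```python
import numpy as np

def epsilon_greedy_action(Q, s, eps, rng):
    """Sample an action from an eps-greedy policy with respect to Q.

    Q is assumed to be a |S| x |A| array of state-action value estimates.
    With probability eps a uniformly random action is returned (so each
    action has probability at least eps/|A|); otherwise the greedy action
    arg max_a Q[s, a] is taken.
    """
    num_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(num_actions))   # explore
    return int(np.argmax(Q[s]))                 # exploit

# Example usage on a toy Q table with 10 states and 4 actions.
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))
a = epsilon_greedy_action(Q, s=3, eps=0.1, rng=rng)
```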
We saw in the second lecture via the policy improvement theorem that if we take the greedy action with respect to the current values and then follow a policy π thereafter, this policy is an improvement on the policy π. A natural question then is whether an ε-greedy policy with respect to the values of an ε-greedy policy π is an improvement on policy π. This would help us address our second caveat of the generalized policy iteration algorithm, Algorithm 2. Fortunately, there is an analogue of the policy improvement theorem for the ε-greedy policy, which we state and derive below.
Theorem 5.1 (Monotonic ε-greedy Policy Improvement). Let πi be an ε-greedy policy. Then, the ε-greedy policy with respect to Qπi, denoted πi+1, is a monotonic improvement on policy πi. In other words, Vπi+1 ≥ Vπi.
Proof. We first show that Qπi(s, πi+1(s)) ≥ Vπi(s) for all states s.

Qπi(s, πi+1(s)) = ∑a∈A πi+1(a|s) Qπi(s, a)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′) · (1 − ε)/(1 − ε)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′) · [∑a∈A (πi(a|s) − ε/|A|)] / (1 − ε)
  ≥ (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) ∑a∈A [(πi(a|s) − ε/|A|) / (1 − ε)] Qπi(s, a)
  = ∑a∈A πi(a|s) Qπi(s, a)
  = Vπi(s)
The first equality follows from the fact that the first action we take is with respect to policy πi+1, and we then follow policy πi thereafter. The fourth equality follows because 1 − ε = ∑a∈A (πi(a|s) − ε/|A|).
Now, from the policy improvement theorem, we have that Qπi(s, πi+1(s)) ≥ Vπi(s) implies Vπi+1(s) ≥ Vπi(s) for all states s, as desired.
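The crucial step in the derivation is the inequality, which replaces the max over actions by a weighted average with nonnegative weights. As a sanity check (not part of the original argument), the following sketch verifies numerically that for any fixed values Qπi(s, ·) and any ε-greedy policy πi, the ε-greedy policy πi+1 built from those values satisfies ∑a πi+1(a|s) Qπi(s, a) ≥ ∑a πi(a|s) Qπi(s, a):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, num_actions = 0.1, 5

for _ in range(1000):
    q = rng.normal(size=num_actions)            # Q^{pi_i}(s, .) for one fixed state s

    # pi_i: an arbitrary eps-greedy policy, so every action has prob >= eps/|A|.
    w = rng.dirichlet(np.ones(num_actions))     # how pi_i spreads its remaining mass
    pi_i = eps / num_actions + (1 - eps) * w

    # pi_{i+1}: eps-greedy with respect to q (all extra mass on the argmax).
    pi_next = np.full(num_actions, eps / num_actions)
    pi_next[np.argmax(q)] += 1 - eps

    # Check Q^{pi_i}(s, pi_{i+1}(s)) >= V^{pi_i}(s) for this state.
    assert pi_next @ q >= pi_i @ q - 1e-12
```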
Thus, the monotonic ε-greedy policy improvement result shows us that our policy does in fact improve if we act ε-greedily with respect to the current ε-greedy policy.
We introduced ε-greedy strategies above as a naive way to balance exploration of new actions with exploitation of current knowledge; however, we can further refine this balance by introducing a new class of exploration strategies that allow us to make convergence guarantees about our algorithms. This class of strategies is called Greedy in the Limit of Infinite Exploration (GLIE).
Definition 5.1 (Greedy in the Limit of Infinite Exploration (GLIE)). A policy π is greedy in the limit of infinite exploration (GLIE) if it satisfies the following two properties:
1. All state-action pairs are visited an infinite number of times, i.e., for all s ∈ S, a ∈ A,
   Ni(s, a) → ∞ as i → ∞,
   where Ni(s, a) is the number of times action a is taken at state s up to and including episode i.
2. The behavior policy converges to the policy that is greedy with respect to the learned Q-function, i.e., for all s ∈ S, a ∈ A,
   lim i→∞ πi(a|s) = arg maxa′ q(s, a′) with probability 1.
One example of a GLIE strategy is an ε-greedy policy where ε is decayed to zero with εi = 1/i, where i is the episode number. We can see that since εi > 0 for all i, we will explore with some positive probability at every time step, hence satisfying the first GLIE condition. Since εi → 0 as i → ∞, we also have that the policy is greedy in the limit, hence satisfying the second GLIE condition.
Now, as stated before, GLIE strategies can help us arrive at convergence guarantees for our model-free
control methods. In particular, we have the following result:
Theorem 5.2. GLIE Monte Carlo control converges to the optimal state-action value function. That is, Q(s, a) → q(s, a).
In other words, if the ε-greedy strategy used in Algorithm 3 is GLIE, then the Q values derived from the algorithm will converge to the optimal Q-function.
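Algorithm 3 is not reproduced here, but a minimal sketch of GLIE Monte Carlo control with εi = 1/i might look as follows. The gym-style env.reset() / env.step(a) → (s′, r, done) interface, the attribute env.num_actions, hashable states, and the every-visit incremental update are assumptions made for illustration, not details taken from the notes.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0, seed=0):
    """GLIE Monte Carlo control with eps_i = 1/i and every-visit updates."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.num_actions))   # Q[s] -> values over actions
    N = defaultdict(lambda: np.zeros(env.num_actions))   # visit counts

    for i in range(1, num_episodes + 1):
        eps = 1.0 / i                                     # GLIE schedule: eps_i = 1/i

        # Roll out one episode with the current eps-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(env.num_actions))    # explore
            else:
                a = int(np.argmax(Q[s]))                  # exploit
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Every-visit incremental Monte Carlo update of Q.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]            # running-mean update

    return Q
```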
Algorithm 4 SARSA
1: procedure SARSA(ε, αt)
2: Initialize Q(s, a) for all s ∈ S, a ∈ A arbitrarily, except Q(terminal, ·) = 0
3: π ← ε-greedy policy with respect to Q
4: for each episode do
5: Set s1 as the starting state
6: Choose action a1 from policy π(s1)
7: loop until episode terminates
8: Take action at and observe reward rt and next state st+1
9: Choose action at+1 from policy π(st+1)
10: Q(st, at) ← Q(st, at) + αt[rt + γQ(st+1, at+1) − Q(st, at)]
11: π ← ε-greedy with respect to Q (policy improvement)
12: t ← t + 1
13: Return Q, π
The update in line 10 uses the values (s, a, r, s′, a′), which gives the algorithm its name. SARSA is an on-policy method because the actions a and a′ used in the update equation are both derived from the policy that is being followed at the time of the update.
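For concreteness, a minimal tabular SARSA sketch in the spirit of Algorithm 4 is given below; the environment attributes env.num_states and env.num_actions and the env.step(a) → (s′, r, done) interface are assumed for illustration and are not specified in the notes.

```python
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular SARSA with an eps-greedy behavior policy derived from Q."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))

    def act(s):
        # eps-greedy action selection with respect to the current Q.
        if rng.random() < eps:
            return int(rng.integers(env.num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next)
            # On-policy TD target: uses the action a' actually chosen by the policy.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next

    return Q
```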
Just like in Monte Carlo, we can arrive at the convergence of SARSA given one extra condition on the
step-sizes as stated below:
Theorem 5.3. SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, i.e., Q(s, a) → q(s, a), if the following two conditions hold:
1. The sequence of policies πt is GLIE.
2. The step-sizes αt satisfy the Robbins-Monro conditions, i.e.,
   ∑t=1…∞ αt = ∞,
   ∑t=1…∞ αt² < ∞.
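For example, the commonly used schedule αt = 1/t satisfies both conditions (a standard fact, not stated in the notes): ∑t 1/t = ∞ because the harmonic series diverges, while ∑t 1/t² = π²/6 < ∞. By contrast, a constant step-size αt = α > 0 satisfies the first condition but not the second.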
Exercise 5.1. What is the benefit to performing the policy improvement step after each update in line 11 of Algorithm 4? What would be the benefit to performing the policy improvement step less frequently?
Additionally, πb doesn't need to be the same at each step, but we do need to know the probability πb(a|s) at every step. As in Monte Carlo, we need the two policies to have the same support. That is, if πe(a|s) × Vπe(s′) > 0, then πb(a|s) > 0.
5.6 Q-learning
Now, we return to finding an off-policy method for TD-style control. In the above formulation, we again leveraged importance sampling, but in the control case we do not need to rely on this. Instead, we can maintain state-action Q estimates and bootstrap the value of the best future action. Our SARSA update took the form

Q(st, at) ← Q(st, at) + αt[rt + γQ(st+1, at+1) − Q(st, at)],

but we can instead bootstrap the Q value at the next state to get the following update:

Q(st, at) ← Q(st, at) + αt[rt + γ maxa′ Q(st+1, a′) − Q(st, at)].

This gives rise to Q-learning, which is detailed in Algorithm 5. Now that we take a maximum over the actions at the next state, this action is not necessarily the same as the one we would derive from the current policy. Therefore, Q-learning is considered an off-policy algorithm.
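A minimal tabular Q-learning sketch corresponding to this update is shown below; the assumed environment interface mirrors the SARSA sketch above and is not taken from the notes.

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: behave eps-greedily, bootstrap with max_a' Q(s', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: eps-greedy with respect to the current Q.
            if rng.random() < eps:
                a = int(rng.integers(env.num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: max over next actions, regardless of the action
            # the behavior policy will actually take at s_next.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    return Q
```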
6 Maximization Bias
Finally, we are going to discuss a phenomenon known as maximization bias. We'll first examine maximization bias in a small example.
In an effort to answer this question, we flip each coin once. We then pick the coin that yields more money as the answer to question 1. We answer question 2 with however much that coin gave us. For example, if coin 1 landed on heads and coin 2 landed on tails, we would answer question 1 with coin 1, and question 2 with one dollar.
Let's examine the possible scenarios for the outcome of this procedure. If at least one of the coins is heads, then our answer to question 2 is one dollar. If both coins are tails, then our answer is negative one dollar. Thus, the expected value of our answer to question 2 is (3/4) × (1) + (1/4) × (−1) = 0.5. This gives us a higher estimate of the expected value of flipping the better coin than the true expected value of flipping that coin. In other words, we're systematically going to think the coins are better than they actually are.
This problem comes from the fact that we are using our estimate to both choose the better coin and estimate its value. We can alleviate this by separating these two steps. One method for doing this would be to change the procedure as follows: after choosing the better coin, flip the better coin again and use this value as your answer for question 2. The expected value of this answer is now 0, which is the same as the true expected value of flipping either coin.
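A short simulation of the coin example makes the bias visible; the +1/−1 payoffs follow the text, while the sample size and the use of NumPy are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Each coin pays +1 (heads) or -1 (tails) with equal probability; true value is 0.
flips = rng.choice([1, -1], size=(n, 2))

# Single-estimator procedure: pick the coin that looked better and report its flip.
single = flips.max(axis=1)

# Double-estimator procedure: pick the better-looking coin, then flip it again.
best = flips.argmax(axis=1)
refresh = rng.choice([1, -1], size=(n, 2))
double = refresh[np.arange(n), best]

print(single.mean())   # ~0.5  (biased upward)
print(double.mean())   # ~0.0  (unbiased)
```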
Thus, our state value estimate is at least as large as the true value of state s, so we are systematically overestimating the value of the state in the presence of finite samples.
Algorithm 6 Double Q-Learning
1: procedure Double Q-Learning(ε, α, γ)
2: Initialize Q1(s, a), Q2(s, a) for all s ∈ S, a ∈ A, set t ← 0
3: π ← ε-greedy policy with respect to Q1 + Q2
4: loop
5: Sample action at from policy π at state st
6: Take action at and observe reward rt and next state st+1
7: if (with 0.5 probability) then
8: Q1(st, at) ← Q1(st, at) + α(rt + γQ2(st+1, arg maxa′ Q1(st+1, a′)) − Q1(st, at))
9: else
10: Q2(st, at) ← Q2(st, at) + α(rt + γQ1(st+1, arg maxa′ Q2(st+1, a′)) − Q2(st, at))
11: π ← ε-greedy policy with respect to Q1 + Q2 (policy improvement)
12: t ← t + 1
13: return π, Q1 + Q2
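A sketch of the Algorithm 6 update in a tabular setting is given below, under the same assumed environment interface as the earlier sketches; one Q-table selects the argmax action while the other evaluates it.

```python
import numpy as np

def double_q_learning(env, num_steps, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular double Q-learning: Q1 and Q2 take turns selecting and evaluating."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((env.num_states, env.num_actions))
    Q2 = np.zeros((env.num_states, env.num_actions))

    s = env.reset()
    for _ in range(num_steps):
        # eps-greedy behavior with respect to Q1 + Q2.
        if rng.random() < eps:
            a = int(rng.integers(env.num_actions))
        else:
            a = int(np.argmax(Q1[s] + Q2[s]))
        s_next, r, done = env.step(a)

        if rng.random() < 0.5:
            a_star = int(np.argmax(Q1[s_next]))                            # Q1 selects ...
            target = r + (0.0 if done else gamma * Q2[s_next, a_star])     # ... Q2 evaluates
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(np.argmax(Q2[s_next]))                            # Q2 selects ...
            target = r + (0.0 if done else gamma * Q1[s_next, a_star])     # ... Q1 evaluates
            Q2[s, a] += alpha * (target - Q2[s, a])

        s = env.reset() if done else s_next

    return Q1 + Q2
```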
Double Q-learning can significantly speed up training time by eliminating suboptimal actions more quickly than normal Q-learning. Sutton and Barto [1] have a nice example of this in a toy MDP in Section 6.7.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed., MIT Press, 2017. Draft. http://incompleteideas.net/book/the-book-2nd.html.