Lnotes 04
Last lecture, we introduced methods for model-free policy evaluation, so we can complete line 4 in a model-free way. However, in order to make this entire algorithm model-free, we must also find a way to do line 5, the policy improvement step, in a model-free way. By definition, we have

Qπ(s, a) = R(s, a) + γ ∑s′∈S P(s′|s, a) Vπ(s′).

Thus, we can get a model-free policy iteration algorithm (Algorithm 2) by substituting this value into the model-based policy iteration algorithm and using state-action values throughout.
There are a few caveats to this algorithm due to the substitution we made in line 5:
1. If policy π is deterministic or doesn't take every action a with some positive probability, then we
cannot actually compute the argmax in line 5.
2. The policy evaluation algorithm gives us an estimate of Qπ , so it is not clear whether line 5 will
monotonically improve the policy like in the model-based case.
In the previous section, we saw that one caveat to the model-free policy iteration algorithm is that the
policy π needs to take every action a with some positive probability, so the value of each state-action
pair can be determined. In other words, the policy π needs to explore actions, even if they might be
suboptimal with respect to our current Q-value estimates.
In order to explore actions that are suboptimal with respect to our current Q-value estimates, we'll need a systematic way to balance exploration of suboptimal actions with exploitation of the optimal, or greedy, action. One naive strategy is to take a random action with small probability ε and take the greedy action the rest of the time. This type of exploration strategy is called an ε-greedy policy. Mathematically, an ε-greedy policy with respect to the state-action value Qπ(s, a) takes the following form:
π(a|s) = a                      with probability ε/|A|,
         arg maxa Qπ(s, a)      with probability 1 − ε.
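As a concrete illustration, here is a minimal sketch of sampling an action from such an ε-greedy policy given a tabular Q estimate; the NumPy array layout and the function name are illustrative choices, not something specified in the notes.

```python
import numpy as np

def epsilon_greedy_action(Q, s, eps, rng):
    """Sample an action from an eps-greedy policy with respect to Q.

    Q is assumed to be a |S| x |A| array of state-action value estimates.
    With probability eps a uniformly random action is returned (so each
    action has probability at least eps/|A|); otherwise the greedy action
    arg max_a Q[s, a] is taken.
    """
    num_actions = Q.shape[1]
    if rng.random() < eps:
        return int(rng.integers(num_actions))   # explore
    return int(np.argmax(Q[s]))                 # exploit

# Example usage on a toy Q table with 10 states and 4 actions.
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))
a = epsilon_greedy_action(Q, s=3, eps=0.1, rng=rng)
```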
We saw in the second lecture via the policy improvement theorem that if we take the greedy action with respect to the current values and then follow a policy π thereafter, this policy is an improvement on the policy π. A natural question then is whether an ε-greedy policy with respect to the values of an ε-greedy policy π is an improvement on policy π. This would help us address our second caveat of the generalized policy iteration algorithm, Algorithm 2. Fortunately, there is an analogue of the policy improvement theorem for the ε-greedy policy, which we state and derive below.
Theorem 5.1 (Monotonic ε-greedy Policy Improvement). Let πi be an ε-greedy policy. Then, the ε-greedy policy with respect to Qπi, denoted πi+1, is a monotonic improvement on policy πi. In other words, Vπi+1 ≥ Vπi.
Proof. We first show that Qπi(s, πi+1(s)) ≥ Vπi(s) for all states s.

Qπi(s, πi+1(s)) = ∑a∈A πi+1(a|s) Qπi(s, a)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′) · (1 − ε)/(1 − ε)
  = (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) maxa′ Qπi(s, a′) · [∑a∈A (πi(a|s) − ε/|A|)] / (1 − ε)
  ≥ (ε/|A|) ∑a∈A Qπi(s, a) + (1 − ε) ∑a∈A [(πi(a|s) − ε/|A|) / (1 − ε)] Qπi(s, a)
  = ∑a∈A πi(a|s) Qπi(s, a)
  = Vπi(s)
The first equality follows from the fact that the first action we take is with respect to policy πi+1, and we then follow policy πi thereafter. The fourth equality follows because 1 − ε = ∑a∈A (πi(a|s) − ε/|A|).
Now, from the policy improvement theorem, we have that Qπi(s, πi+1(s)) ≥ Vπi(s) implies Vπi+1(s) ≥ Vπi(s) for all states s, as desired.
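The crucial step in the derivation is the inequality, which replaces the max over actions by a weighted average with nonnegative weights. As a sanity check (not part of the original argument), the following sketch verifies numerically that for any fixed values Qπi(s, ·) and any ε-greedy policy πi, the ε-greedy policy πi+1 built from those values satisfies ∑a πi+1(a|s) Qπi(s, a) ≥ ∑a πi(a|s) Qπi(s, a):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, num_actions = 0.1, 5

for _ in range(1000):
    q = rng.normal(size=num_actions)            # Q^{pi_i}(s, .) for one fixed state s

    # pi_i: an arbitrary eps-greedy policy, so every action has prob >= eps/|A|.
    w = rng.dirichlet(np.ones(num_actions))     # how pi_i spreads its remaining mass
    pi_i = eps / num_actions + (1 - eps) * w

    # pi_{i+1}: eps-greedy with respect to q (all extra mass on the argmax).
    pi_next = np.full(num_actions, eps / num_actions)
    pi_next[np.argmax(q)] += 1 - eps

    # Check Q^{pi_i}(s, pi_{i+1}(s)) >= V^{pi_i}(s) for this state.
    assert pi_next @ q >= pi_i @ q - 1e-12
```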
Thus, the monotonic ε-greedy policy improvement result shows us that our policy does in fact improve if we act ε-greedily with respect to the current ε-greedy policy.
We introduced ε-greedy strategies above as a naive way to balance exploration of new actions with exploitation of current knowledge; however, we can further refine this balance by introducing a new class of exploration strategies that allow us to make convergence guarantees about our algorithms. This class of strategies is called Greedy in the Limit of Infinite Exploration (GLIE).
Definition 5.1 (Greedy in the Limit of Infinite Exploration (GLIE)). A policy π is greedy in the limit of infinite exploration (GLIE) if it satisfies the following two properties:
1. All state-action pairs are visited an infinite number of times, i.e., for all s ∈ S, a ∈ A,
   Ni(s, a) → ∞ as i → ∞,
   where Ni(s, a) is the number of times action a is taken at state s up to and including episode i.
2. The behavior policy converges to the policy that is greedy with respect to the learned Q-function, i.e., for all s ∈ S, a ∈ A,
   lim i→∞ πi(a|s) = arg maxa′ q(s, a′) with probability 1.
One example of a GLIE strategy is an ε-greedy policy where ε is decayed to zero with εi = 1/i, where i is the episode number. We can see that since εi > 0 for all i, we will explore with some positive probability at every time step, hence satisfying the first GLIE condition. Since εi → 0 as i → ∞, we also have that the policy is greedy in the limit, hence satisfying the second GLIE condition.
Now, as stated before, GLIE strategies can help us arrive at convergence guarantees for our model-free
control methods. In particular, we have the following result:
Theorem 5.2. GLIE Monte Carlo control converges to the optimal state-action value function. That is, Q(s, a) → q(s, a).
In other words, if the ε-greedy strategy used in Algorithm 3 is GLIE, then the Q values derived from the algorithm will converge to the optimal Q-function.
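Algorithm 3 is not reproduced here, but a minimal sketch of GLIE Monte Carlo control with εi = 1/i might look as follows. The gym-style env.reset() / env.step(a) → (s′, r, done) interface, the attribute env.num_actions, hashable states, and the every-visit incremental update are assumptions made for illustration, not details taken from the notes.

```python
import numpy as np
from collections import defaultdict

def glie_mc_control(env, num_episodes, gamma=1.0, seed=0):
    """GLIE Monte Carlo control with eps_i = 1/i and every-visit updates."""
    rng = np.random.default_rng(seed)
    Q = defaultdict(lambda: np.zeros(env.num_actions))   # Q[s] -> values over actions
    N = defaultdict(lambda: np.zeros(env.num_actions))   # visit counts

    for i in range(1, num_episodes + 1):
        eps = 1.0 / i                                     # GLIE schedule: eps_i = 1/i

        # Roll out one episode with the current eps-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if rng.random() < eps:
                a = int(rng.integers(env.num_actions))    # explore
            else:
                a = int(np.argmax(Q[s]))                  # exploit
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next

        # Every-visit incremental Monte Carlo update of Q.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[s][a] += 1
            Q[s][a] += (G - Q[s][a]) / N[s][a]            # running-mean update

    return Q
```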
Algorithm 4 SARSA
1: procedure SARSA(ε, αt)
2: Initialize Q(s, a) for all s ∈ S, a ∈ A arbitrarily, except Q(terminal, ·) = 0
3: π ← ε-greedy policy with respect to Q
4: for each episode do
5: Set s1 as the starting state
6: Choose action a1 from policy π(s1)
7: loop until episode terminates
8: Take action at and observe reward rt and next state st+1
9: Choose action at+1 from policy π(st+1)
10: Q(st, at) ← Q(st, at) + αt[rt + γQ(st+1, at+1) − Q(st, at)]
11: π ← ε-greedy with respect to Q (policy improvement)
12: t ← t + 1
13: Return Q, π
The update in line 10 uses the values (s, a, r, s′, a′), which gives the algorithm its name. SARSA is an on-policy method because the actions a and a′ used in the update equation are both derived from the policy that is being followed at the time of the update.
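For concreteness, a minimal tabular SARSA sketch in the spirit of Algorithm 4 is given below; the environment attributes env.num_states and env.num_actions and the env.step(a) → (s′, r, done) interface are assumed for illustration and are not specified in the notes.

```python
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular SARSA with an eps-greedy behavior policy derived from Q."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))

    def act(s):
        # eps-greedy action selection with respect to the current Q.
        if rng.random() < eps:
            return int(rng.integers(env.num_actions))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = act(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = act(s_next)
            # On-policy TD target: uses the action a' actually chosen by the policy.
            target = r + (0.0 if done else gamma * Q[s_next, a_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next

    return Q
```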
Just like in Monte Carlo, we can arrive at the convergence of SARSA given one extra condition on the
step-sizes as stated below:
Theorem 5.3. SARSA for finite-state and finite-action MDPs converges to the optimal action-value function, i.e., Q(s, a) → q(s, a), if the following two conditions hold:
1. The sequence of policies πt is GLIE.
2. The step-sizes αt satisfy the Robbins-Monro conditions, i.e.,
   ∑t=1…∞ αt = ∞,
   ∑t=1…∞ αt² < ∞.
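For example, the commonly used schedule αt = 1/t satisfies both conditions (a standard fact, not stated in the notes): ∑t 1/t = ∞ because the harmonic series diverges, while ∑t 1/t² = π²/6 < ∞. By contrast, a constant step-size αt = α > 0 satisfies the first condition but not the second.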
Exercise 5.1. What is the benefit to performing the policy improvement step after each update in line 11 of Algorithm 4? What would be the benefit to performing the policy improvement step less frequently?
Additionally, πb doesn't need to be the same at each step, but we do need to know the probability πb(a|s) at every step. As in Monte Carlo, we need the two policies to have the same support. That is, if πe(a|s) × Vπe(s′) > 0, then πb(a|s) > 0.
5.6 Q-learning
Now, we return to finding an off-policy method for TD-style control. In the above formulation, we again leveraged importance sampling, but in the control case we do not need to rely on this. Instead, we can maintain state-action Q estimates and bootstrap the value of the best future action. Our SARSA update took the form

Q(st, at) ← Q(st, at) + αt[rt + γQ(st+1, at+1) − Q(st, at)],

but we can instead bootstrap the Q value at the next state to get the following update:

Q(st, at) ← Q(st, at) + αt[rt + γ maxa′ Q(st+1, a′) − Q(st, at)].

This gives rise to Q-learning, which is detailed in Algorithm 5. Now that we take a maximum over the actions at the next state, this action is not necessarily the same as the one we would derive from the current policy. Therefore, Q-learning is considered an off-policy algorithm.
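A minimal tabular Q-learning sketch corresponding to this update is shown below; the assumed environment interface mirrors the SARSA sketch above and is not taken from the notes.

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Q-learning: behave eps-greedily, bootstrap with max_a' Q(s', a')."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))

    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: eps-greedy with respect to the current Q.
            if rng.random() < eps:
                a = int(rng.integers(env.num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Off-policy target: max over next actions, regardless of the action
            # the behavior policy will actually take at s_next.
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next

    return Q
```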
6 Maximization Bias
Finally, we are going to discuss a phenomenon known as maximization bias. We'll first examine maximization bias in a small example.
In an effort to answer this question, we flip each coin once. We then pick the coin that yields more money as the answer to question 1. We answer question 2 with however much that coin gave us. For example, if coin 1 landed on heads and coin 2 landed on tails, we would answer question 1 with coin 1, and question 2 with one dollar.
Let's examine the possible scenarios for the outcome of this procedure. If at least one of the coins is heads, then our answer to question 2 is one dollar. If both coins are tails, then our answer is negative one dollar. Thus, the expected value of our answer to question 2 is (3/4) × (1) + (1/4) × (−1) = 0.5. This gives us a higher estimate of the expected value of flipping the better coin than the true expected value of flipping that coin. In other words, we're systematically going to think the coins are better than they actually are.
This problem comes from the fact that we are using our estimate to both choose the better coin and estimate its value. We can alleviate this by separating these two steps. One method for doing this would be to change the procedure as follows: after choosing the better coin, flip the better coin again and use this value as your answer for question 2. The expected value of this answer is now 0, which is the same as the true expected value of flipping either coin.
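A short simulation of the coin example makes the bias visible; the +1/−1 payoffs follow the text, while the sample size and the use of NumPy are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Each coin pays +1 (heads) or -1 (tails) with equal probability; true value is 0.
flips = rng.choice([1, -1], size=(n, 2))

# Single-estimator procedure: pick the coin that looked better and report its flip.
single = flips.max(axis=1)

# Double-estimator procedure: pick the better-looking coin, then flip it again.
best = flips.argmax(axis=1)
refresh = rng.choice([1, -1], size=(n, 2))
double = refresh[np.arange(n), best]

print(single.mean())   # ~0.5  (biased upward)
print(double.mean())   # ~0.0  (unbiased)
```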
Thus, our state value estimate is at least as large as the true value of state s, so we are systematically overestimating the value of the state in the presence of finite samples.
Algorithm 6 Double Q-Learning
1: procedure Double Q-Learning(ε, α, γ)
2: Initialize Q1(s, a), Q2(s, a) for all s ∈ S, a ∈ A, set t ← 0
3: π ← ε-greedy policy with respect to Q1 + Q2
4: loop
5: Sample action at from policy π at state st
6: Take action at and observe reward rt and next state st+1
7: if (with 0.5 probability) then
8: Q1(st, at) ← Q1(st, at) + α(rt + γQ2(st+1, arg maxa′ Q1(st+1, a′)) − Q1(st, at))
9: else
10: Q2(st, at) ← Q2(st, at) + α(rt + γQ1(st+1, arg maxa′ Q2(st+1, a′)) − Q2(st, at))
11: π ← ε-greedy policy with respect to Q1 + Q2 (policy improvement)
12: t ← t + 1
13: return π, Q1 + Q2
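A sketch of the Algorithm 6 update in a tabular setting is given below, under the same assumed environment interface as the earlier sketches; one Q-table selects the argmax action while the other evaluates it.

```python
import numpy as np

def double_q_learning(env, num_steps, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular double Q-learning: Q1 and Q2 take turns selecting and evaluating."""
    rng = np.random.default_rng(seed)
    Q1 = np.zeros((env.num_states, env.num_actions))
    Q2 = np.zeros((env.num_states, env.num_actions))

    s = env.reset()
    for _ in range(num_steps):
        # eps-greedy behavior with respect to Q1 + Q2.
        if rng.random() < eps:
            a = int(rng.integers(env.num_actions))
        else:
            a = int(np.argmax(Q1[s] + Q2[s]))
        s_next, r, done = env.step(a)

        if rng.random() < 0.5:
            a_star = int(np.argmax(Q1[s_next]))                            # Q1 selects ...
            target = r + (0.0 if done else gamma * Q2[s_next, a_star])     # ... Q2 evaluates
            Q1[s, a] += alpha * (target - Q1[s, a])
        else:
            a_star = int(np.argmax(Q2[s_next]))                            # Q2 selects ...
            target = r + (0.0 if done else gamma * Q1[s_next, a_star])     # ... Q1 evaluates
            Q2[s, a] += alpha * (target - Q2[s, a])

        s = env.reset() if done else s_next

    return Q1 + Q2
```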
Double Q-learning can significantly speed up training time by eliminating suboptimal actions more quickly than normal Q-learning. Sutton and Barto [1] have a nice example of this in a toy MDP in Section 6.7.
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd ed., MIT Press, 2017. Draft. http://incompleteideas.net/book/the-book-2nd.html.