
Chapter 8

Value Function Approximation

[Figure: a map of the book's chapters, from basic concepts and the Bellman (optimality) equations, through value iteration and policy iteration, Monte Carlo methods, and temporal-difference methods, to value function approximation (this chapter), policy gradient methods, and actor-critic methods.]

Figure 8.1: Where we are in this book.

In this chapter, we continue to study temporal-difference learning algorithms. However, a different method is used to represent state/action values. So far in this book, state/action values have been represented by tables. The tabular method is straightforward to understand, but it is inefficient for handling large state or action spaces. To solve this problem, this chapter introduces the function approximation method, which has become the standard way to represent values. It is also where artificial neural networks are incorporated into reinforcement learning as function approximators. The idea of function approximation can also be extended from representing values to representing policies, as introduced in Chapter 9.


[Figure: the points (s1, v̂(s1)), . . . , (sn, v̂(sn)) fitted by the straight line v̂(s) = as + b.]

Figure 8.2: An illustration of the function approximation method. The x-axis and y-axis correspond to s and v̂(s), respectively.

8.1 Value representation: From table to function


We next use an example to demonstrate the difference between the tabular and function
approximation methods.
Suppose that there are n states s1, . . . , sn, whose state values are vπ(s1), . . . , vπ(sn). Here, π
is a given policy. Let v̂(s1), . . . , v̂(sn) denote the estimates of the true state values. If we use
the tabular method, the estimated values can be maintained in the following table. This
table can be stored in memory as an array or a vector. To retrieve or update any value,
we can directly read or rewrite the corresponding entry in the table.

State s1 s2 ··· sn
Estimated value v̂(s1 ) v̂(s2 ) ··· v̂(sn )

We next show that the values in the above table can be approximated by a function.
In particular, the n points (si, v̂(si)), i = 1, . . . , n, are shown in Figure 8.2. These points can be
fitted or approximated by a curve. The simplest curve is a straight line, which can be
described as

v̂(s, w) = as + b = [s, 1][a, b]T = φT(s)w,        (8.1)

where φ(s) = [s, 1]T and w = [a, b]T.

Here, v̂(s, w) is a function for approximating vπ (s). It is determined jointly by the state s
and the parameter vector w ∈ R2 . v̂(s, w) is sometimes written as v̂w (s). Here, φ(s) ∈ R2
is called the feature vector of s.
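To make the contrast concrete, here is a minimal Python sketch (with illustrative values; nothing in it is prescribed by the text) showing that retrieving a value under the linear model (8.1) becomes a feature computation rather than a table lookup, and that only the two entries of w need to be stored.

```python
import numpy as np

# Tabular representation: one stored entry per state s1, ..., s5.
v_table = {s: 0.0 for s in range(1, 6)}
value_tabular = v_table[3]                 # retrieve v(s3) by a direct lookup

# Linear function approximation as in (8.1): v_hat(s, w) = phi(s)^T w.
def phi(s):
    """Feature vector phi(s) = [s, 1]^T of the straight-line model."""
    return np.array([float(s), 1.0])

w = np.array([0.5, -4.0])                  # parameter vector w = [a, b]^T (arbitrary example values)
value_linear = phi(3) @ w                  # retrieve v_hat(s3, w) by computing phi(s3)^T w
```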
The first notable difference between the tabular and function approximation methods
concerns how they retrieve and update a value.

 How to retrieve a value: When the values are represented by a table, if we want to
retrieve a value, we can directly read the corresponding entry in the table. However,


when the values are represented by a function, it becomes slightly more complicated
to retrieve a value. In particular, we need to input the state index s into the function
and calculate the function value (Figure 8.3). For the example in (8.1), we first need
to calculate the feature vector φ(s) and then calculate φT (s)w. If the function is
an artificial neural network, a forward propagation from the input to the output is
needed.

[Figure: the state s and the parameter w are fed into the function to produce v̂(s, w).]

Figure 8.3: An illustration of the process for retrieving the value of s when using the function approximation method.

The function approximation method is more efficient in terms of storage due to the
way in which the state values are retrieved. Specifically, while the tabular method
needs to store n values, we now only need to store a lower dimensional parameter
vector w. Thus, the storage efficiency can be significantly improved. Such a benefit
is, however, not free. It comes with a cost: the state values may not be accurately
represented by the function. For example, a straight line is not able to accurately fit
the points in Figure 8.2. That is why this method is called approximation. From a
fundamental point of view, some information will certainly be lost when we use a low-
dimensional vector to represent a high-dimensional dataset. Therefore, the function
approximation method enhances storage efficiency by sacrificing accuracy.
 How to update a value: When the values are represented by a table, if we want
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, the way to update a value is
completely different. Specifically, we must update w to change the values indirectly.
How to update w to find optimal state values will be addressed in detail later.
Thanks to the way in which the state values are updated, the function approximation
method has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. When using the tabular method, we can update
a value if the corresponding state is visited in an episode. The values of the states
that have not been visited cannot be updated. However, when using the function
approximation method, we need to update w to update the value of a state. The
update of w also affects the values of some other states even though these states have
not been visited. Therefore, the experience sample for one state can generalize to help
estimate the values of some other states.
The above analysis is illustrated in Figure 8.4, where there are three states {s1 , s2 , s3 }.


Suppose that we have an experience sample for s3 and would like to update v̂(s3 ).
When using the tabular method, we can only update v̂(s3 ) without changing v̂(s1 ) or
v̂(s2 ), as shown in Figure 8.4(a). When using the function approximation method,
updating w not only can update v̂(s3 ) but also would change v̂(s1 ) and v̂(s2 ), as shown
in Figure 8.4(b). Therefore, the experience sample of s3 can help update the values
of its neighboring states.

(a) Tabular method: when v̂(s3) is updated, the other values remain the same.

(b) Function approximation method: when we update v̂(s3) by changing w, the values of the neighboring states are also changed.

Figure 8.4: An illustration of how to update the value of a state.

We can use more complex functions that have stronger approximation abilities than
straight lines. For example, consider a second-order polynomial:

v̂(s, w) = as2 + bs + c = [s2, s, 1][a, b, c]T = φT(s)w,        (8.2)

where φ(s) = [s2, s, 1]T and w = [a, b, c]T.

We can use even higher-order polynomial curves to fit the points. As the order of the
curve increases, the approximation accuracy can be improved, but the dimension of the
parameter vector also increases, requiring more storage and computational resources.
Note that v̂(s, w) in either (8.1) or (8.2) is linear in w (though it may be nonlinear
in s). This type of method is called linear function approximation, which is the simplest
function approximation method. To realize linear function approximation, we need to
select an appropriate feature vector φ(s). That is, we must decide, for example, whether
we should use a first-order straight line or a second-order curve to fit the points. The
selection of appropriate feature vectors is nontrivial. It requires prior knowledge of the
given task: the better we understand the task, the better the feature vectors we can select.
For instance, if we know that the points in Figure 8.2 are approximately located on a


straight line, we can use a straight line to fit the points. However, such prior knowledge
is usually unavailable in practice. If we do not have any prior knowledge, a popular solution
is to use artificial neural networks as nonlinear function approximators.
Another important problem is how to find the optimal parameter vector. If we know
{vπ (si )}ni=1 , this is a least-squares problem. The optimal parameter can be obtained by
optimizing the following objective function:
J1 = Σ_{i=1}^{n} (v̂(si, w) − vπ(si))² = Σ_{i=1}^{n} (φT(si)w − vπ(si))² = ‖Φw − vπ‖² ,

where

Φ = [φT(s1); . . . ; φT(sn)] ∈ Rn×2 ,    vπ = [vπ(s1), . . . , vπ(sn)]T ∈ Rn .

It can be verified that the optimal solution to this least-squares problem is

w∗ = (ΦT Φ)−1 ΦT vπ .

More information about least-squares problems can be found in [47, Section 3.3] and
[48, Section 5.14].
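The least-squares solution above can be computed in a few lines. The sketch below is a toy example with made-up true values vπ(si); it uses numpy's least-squares solver, which returns the same w∗ = (ΦTΦ)−1ΦTvπ without explicitly forming the inverse.

```python
import numpy as np

# Toy data: n = 5 states and assumed true state values (illustrative only).
states = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
v_pi = np.array([-4.0, -3.5, -3.2, -3.0, -2.8])

# Feature matrix Phi whose i-th row is phi^T(s_i) = [s_i, 1] for the straight line (8.1).
Phi = np.stack([states, np.ones_like(states)], axis=1)        # shape (n, 2)

# w* = (Phi^T Phi)^{-1} Phi^T v_pi, computed via a least-squares solve.
w_star, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)

v_hat = Phi @ w_star                                          # fitted values on the line
print(w_star, np.linalg.norm(v_hat - v_pi))                   # residual > 0: the line only approximates
```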
The curve-fitting example presented in this section illustrates the basic idea of value
function approximation. This idea will be formally introduced in the next section.

8.2 TD learning of state values based on function approximation
In this section, we show how to integrate the function approximation method into TD
learning to estimate the state values of a given policy. This algorithm will be extended
to learn action values and optimal policies in Section 8.3.
This section contains several closely related subsections, so it is helpful to preview their contents before diving into the details.

 The function approximation method is formulated as an optimization problem. The


objective function of this problem is introduced in Section 8.2.1. The TD learning
algorithm for optimizing this objective function is introduced in Section 8.2.2.


 To apply the TD learning algorithm, we need to select appropriate feature vectors.


Section 8.2.3 discusses this problem.
 Examples are given in Section 8.2.4 to demonstrate the TD algorithm and the impacts
of different feature vectors.
 A theoretical analysis of the TD algorithm is given in Section 8.2.5. This subsection
is mathematically intensive. Readers may read it selectively based on their interests.

8.2.1 Objective function


Let vπ (s) and v̂(s, w) be the true state value and approximated state value of s ∈ S,
respectively. The problem to be solved is to find an optimal w so that v̂(s, w) can best
approximate vπ (s) for every s. In particular, the objective function is

J(w) = E[(vπ (S) − v̂(S, w))2 ], (8.3)

where the expectation is calculated with respect to the random variable S ∈ S. While S
is a random variable, what is its probability distribution? This question is important for
understanding this objective function. There are several ways to define the probability
distribution of S.

 The first way is to use a uniform distribution. That is to treat all the states as equally
important by setting the probability of each state to 1/n. In this case, the objective
function in (8.3) becomes

J(w) = (1/n) Σ_{s∈S} (vπ(s) − v̂(s, w))² ,        (8.4)

which is the average value of the approximation errors of all the states. However, this
way does not consider the real dynamics of the Markov process under the given policy.
Since some states may be rarely visited by a policy, it may be unreasonable to treat
all the states as equally important.
 The second way, which is the focus of this chapter, is to use the stationary distribution.
The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently
long period, the probability of the agent being located at any state can be described
by this stationary distribution. Interested readers may see the details in Box 8.1.
Let {dπ(s)}s∈S denote the stationary distribution of the Markov process under policy
π. That is, the probability of the agent visiting s after a long period of time is dπ(s).
By definition, Σ_{s∈S} dπ(s) = 1. Then, the objective function in (8.3) can be rewritten


as

J(w) = Σ_{s∈S} dπ(s)(vπ(s) − v̂(s, w))² ,        (8.5)

which is a weighted average of the approximation errors. The states that have higher
probabilities of being visited are given greater weights.

It is notable that the value of dπ (s) is nontrivial to obtain because it requires knowing
the state transition probability matrix Pπ (see Box 8.1). Fortunately, we do not need to
calculate the specific value of dπ (s) to minimize this objective function as shown in the
next subsection. In addition, it was assumed that the number of states was finite when
we introduced (8.4) and (8.5). When the state space is continuous, we can replace the
summations with integrals.

Box 8.1: Stationary distribution of a Markov decision process

The key tool for analyzing stationary distribution is Pπ ∈ Rn×n , which is the probabil-
ity transition matrix under the given policy π. If the states are indexed as s1 , . . . , sn ,
then [Pπ ]ij is defined as the probability for the agent moving from si to sj . The
definition of Pπ can be found in Section 2.6.
 Interpretation of Pπk (k = 1, 2, 3, . . . ).
First of all, it is necessary to examine the interpretation of the entries in Pπk . The
probability of the agent transitioning from si to sj using exactly k steps is denoted
as

pij(k) = Pr(Stk = j | St0 = i),

where t0 and tk are the initial and kth time steps, respectively. First, by the
definition of Pπ , we have

[Pπ]ij = pij(1) ,

which means that [Pπ ]ij is the probability of transitioning from si to sj using a
single step. Second, consider Pπ2 . It can be verified that
[Pπ2]ij = [Pπ Pπ]ij = Σ_{q=1}^{n} [Pπ]iq [Pπ]qj .

Since [Pπ ]iq [Pπ ]qj is the joint probability of transitioning from si to sq and then
from sq to sj , we know that [Pπ2 ]ij is the probability of transitioning from si to sj


using exactly two steps. That is

[Pπ2]ij = pij(2) .

Similarly, we know that

[Pπk]ij = pij(k) ,

which means that [Pπk ]ij is the probability of transitioning from si to sj using
exactly k steps.
 Definition of stationary distributions.
Let d0 ∈ Rn be a vector representing the probability distribution of the states at
the initial time step. For example, if s is always selected as the starting state,
then d0 (s) = 1 and the other entries of d0 are 0. Let dk ∈ Rn be the vector
representing the probability distribution obtained after exactly k steps starting
from d0 . Then, we have
dk(si) = Σ_{j=1}^{n} d0(sj)[Pπk]ji ,    i = 1, 2, . . . .        (8.6)

This equation indicates that the probability of the agent visiting si at step k
equals the sum of the probabilities of the agent transitioning from {sj }nj=1 to si
using exactly k steps. The matrix-vector form of (8.6) is

dTk = dT0 Pπk . (8.7)

When we consider the long-term behavior of the Markov process, it holds under
certain conditions that

limk→∞ Pπk = 1n dTπ ,        (8.8)

where 1n = [1, . . . , 1]T ∈ Rn and 1n dTπ is a constant matrix with all its rows
equal to dTπ . The conditions under which (8.8) is valid will be discussed later.
Substituting (8.8) into (8.7) yields

limk→∞ dTk = dT0 limk→∞ Pπk = dT0 1n dTπ = dTπ ,        (8.9)

where the last equality is valid because dT0 1n = 1.


Equation (8.9) means that the state distribution dk converges to a constant value
dπ , which is called the limiting distribution. The limiting distribution depends


on the system model and the policy π. Interestingly, it is independent of the


initial distribution d0 . That is, regardless of which state the agent starts from,
the probability distribution of the agent after a sufficiently long period can always
be described by the limiting distribution.
The value of dπ can be calculated in the following way. Taking the limit of both
sides of dTk = dTk−1 Pπ gives limk→∞ dTk = limk→∞ dTk−1 Pπ and hence

dTπ = dTπ Pπ . (8.10)

As a result, dπ is the left eigenvector of Pπ associated with the eigenvalue 1. The
solution of (8.10) is called the stationary distribution. It holds that Σ_{s∈S} dπ(s) = 1
and dπ(s) > 0 for all s ∈ S. The reason why dπ(s) > 0 (not dπ(s) ≥ 0) will be
explained later.
 Conditions for the uniqueness of stationary distributions.
The solution dπ of (8.10) is usually called a stationary distribution, whereas the
distribution dπ in (8.9) is usually called the limiting distribution. Note that (8.9)
implies (8.10), but the converse may not be true. A general class of Markov
processes that have unique stationary (or limiting) distributions is irreducible (or
regular ) Markov processes. Some necessary definitions are given below. More
details can be found in [49, Chapter IV].

- State sj is said to be accessible from state si if there exists a finite integer k so


that [Pπk]ij > 0, which means that the agent starting from si can always reach
sj after a finite number of transitions.
- If two states si and sj are mutually accessible, then the two states are said to
communicate.
- A Markov process is called irreducible if all of its states communicate with
each other. In other words, the agent starting from an arbitrary state can
always reach any other state within a finite number of steps. Mathematically,
it indicates that, for any si and sj , there exists k ≥ 1 such that [Pπk ]ij > 0 (the
value of k may vary for different i, j).
- A Markov process is called regular if there exists k ≥ 1 such that [Pπk ]ij > 0
for all i, j. Equivalently, there exists k ≥ 1 such that Pπk > 0, where > is
elementwise. As a result, every state is reachable from any other state within
at most k steps. A regular Markov process is also irreducible, but the converse
is not true. However, if a Markov process is irreducible and there exists i such
that [Pπ]ii > 0, then it is also regular. Moreover, if Pπk > 0, then Pπk′ > 0 for
any k′ ≥ k since Pπ ≥ 0. It then follows from (8.9) that dπ(s) > 0 for every s.


 Policies that may lead to unique stationary distributions.


Once the policy is given, a Markov decision process becomes a Markov process,
whose long-term behavior is jointly determined by the given policy and the system
model. Then, an important question is what kind of policies can lead to regular
Markov processes? In general, the answer is exploratory policies such as ε-greedy
policies. That is because an exploratory policy has a positive probability of taking
any action at any state. As a result, the states can communicate with each other
when the system model allows them to do so.
 An example is given in Figure 8.5 to illustrate stationary distributions. The policy
in this example is ε-greedy with ε = 0.5. The states are indexed as s1, s2, s3, s4,
which correspond to the top-left, top-right, bottom-left, and bottom-right cells in
the grid, respectively.
We compare two methods to calculate the stationary distributions. The first
method is to solve (8.10) to get the theoretical value of dπ . The second method is
to estimate dπ numerically: we start from an arbitrary initial state and generate a
sufficiently long episode by following the given policy. Then, dπ can be estimated
by the ratio between the number of times each state is visited in the episode and
the total length of the episode. The estimation result is more accurate when the
episode is longer. We next compare the theoretical and estimated results.

[Figure: the grid world (left); the percentage of visits to each state s1–s4 versus the step index (right).]

Figure 8.5: Long-term behavior of an ε-greedy policy with ε = 0.5. The asterisks in the right figure represent the theoretical values of the elements of dπ.

- Theoretical value of dπ : It can be verified that the Markov process induced


by the policy is both irreducible and regular. That is due to the following
reasons. First, since all the states communicate, the resulting Markov process
is irreducible. Second, since every state can transition to itself, the resulting


Markov process is regular. It can be seen from Figure 8.5 that

PπT = [0.3  0.1  0.1  0  ;
       0.1  0.3  0    0.1;
       0.6  0    0.3  0.1;
       0    0.6  0.6  0.8] .

The eigenvalues of PπT can be calculated as {−0.0449, 0.3, 0.4449, 1}. The
unit-length (right) eigenvector of PπT corresponding to the eigenvalue 1 is
[0.0463, 0.1455, 0.1785, 0.9720]T . After scaling this vector so that the sum of
all its elements is equal to 1, we obtain the theoretical value of dπ as follows:

dπ = [0.0345, 0.1084, 0.1330, 0.7241]T .

The ith element of dπ corresponds to the probability of the agent visiting si


in the long run.
- Estimated value of dπ : We next verify the above theoretical value of dπ by
executing the policy for sufficiently many steps in the simulation. Specifically,
we select s1 as the starting state and run 1,000 steps by following the policy.
The proportion of the visits of each state during the process is shown in Fig-
ure 8.5. It can be seen that the proportions converge to the theoretical value
of dπ after hundreds of steps.
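Both ways of obtaining dπ in this example can be reproduced with a short numerical sketch. The matrix below is the PπT of this example; everything else (random seed, episode length) is an arbitrary choice for illustration.

```python
import numpy as np

# P_pi^T of the example: column j holds the transition probabilities out of s_j.
P_pi_T = np.array([[0.3, 0.1, 0.1, 0.0],
                   [0.1, 0.3, 0.0, 0.1],
                   [0.6, 0.0, 0.3, 0.1],
                   [0.0, 0.6, 0.6, 0.8]])

# Theoretical value: eigenvector of P_pi^T for eigenvalue 1, scaled to sum to 1, as in (8.10).
eigvals, eigvecs = np.linalg.eig(P_pi_T)
v = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_pi = v / v.sum()

# Estimated value: visit frequencies along a long episode starting from s1.
rng = np.random.default_rng(0)
P_pi = P_pi_T.T                              # row i is the next-state distribution from s_i
visits, s = np.zeros(4), 0
for _ in range(100_000):
    visits[s] += 1
    s = rng.choice(4, p=P_pi[s])
print(d_pi, visits / visits.sum())           # both close to [0.0345, 0.1084, 0.1330, 0.7241]
```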

8.2.2 Optimization algorithms


To minimize the objective function J(w) in (8.3), we can use the gradient descent algo-
rithm:

wk+1 = wk − αk ∇w J(wk ),

where

∇w J(wk) = ∇w E[(vπ(S) − v̂(S, wk))²]
         = E[∇w (vπ(S) − v̂(S, wk))²]
         = 2E[(vπ(S) − v̂(S, wk))(−∇w v̂(S, wk))]
         = −2E[(vπ(S) − v̂(S, wk))∇w v̂(S, wk)] .


Therefore, the gradient descent algorithm is

wk+1 = wk + 2αk E[(vπ (S) − v̂(S, wk ))∇w v̂(S, wk )], (8.11)

where the coefficient 2 before αk can be merged into αk without loss of generality. The
algorithm in (8.11) requires calculating the expectation. In the spirit of stochastic gra-
dient descent, we can replace the true gradient with a stochastic gradient. Then, (8.11)
becomes

wt+1 = wt + αt (vπ(st) − v̂(st, wt)) ∇w v̂(st, wt),        (8.12)

where st is a sample of S at time t.


Notably, (8.12) is not implementable because it requires the true state value vπ , which
is unknown and must be estimated. We can replace vπ (st ) with an approximation to make
the algorithm implementable. The following two methods can be used to do so.

 Monte Carlo method: Suppose that we have an episode (s0 , r1 , s1 , r2 , . . . ). Let gt be


the discounted return starting from st . Then, gt can be used as an approximation of
vπ (st ). The algorithm in (8.12) becomes

wt+1 = wt + αt (gt − v̂(st, wt)) ∇w v̂(st, wt).

This is the algorithm of Monte Carlo learning with function approximation.


 Temporal-difference method: In the spirit of TD learning, rt+1 + γv̂(st+1 , wt ) can be
used as an approximation of vπ (st ). The algorithm in (8.12) becomes

wt+1 = wt + αt [rt+1 + γv̂(st+1 , wt ) − v̂(st , wt )] ∇w v̂(st , wt ). (8.13)

This is the algorithm of TD learning with function approximation. This algorithm is


summarized in Algorithm 8.1.

Understanding the TD algorithm in (8.13) is important for studying the other algo-
rithms in this chapter. Notably, (8.13) can only learn the state values of a given policy.
It will be extended to algorithms that can learn action values in Sections 8.3.1 and 8.3.2.

8.2.3 Selection of function approximators


To apply the TD algorithm in (8.13), we need to select appropriate v̂(s, w). There are two
ways to do that. The first is to use an artificial neural network as a nonlinear function
approximator. The input of the neural network is the state, the output is v̂(s, w), and
the network parameter is w. The second is to simply use a linear function:

v̂(s, w) = φT (s)w,


Algorithm 8.1: TD learning of state values with function approximation

Initialization: A function v̂(s, w) that is differentiable in w. Initial parameter w0.
Goal: Learn the true state values of a given policy π.
For each episode {(st, rt+1, st+1)}t generated by π, do
    For each sample (st, rt+1, st+1), do
        In the general case, wt+1 = wt + αt [rt+1 + γv̂(st+1, wt) − v̂(st, wt)] ∇w v̂(st, wt)
        In the linear case, wt+1 = wt + αt [rt+1 + γφT(st+1)wt − φT(st)wt] φ(st)

where φ(s) ∈ Rm is the feature vector of s. The lengths of φ(s) and w are equal to m,
which is usually much smaller than the number of states. In the linear case, the gradient
is
∇w v̂(s, w) = φ(s).

Substituting this into (8.13) yields

wt+1 = wt + αt [rt+1 + γφT(st+1)wt − φT(st)wt] φ(st).        (8.14)

This is the algorithm of TD learning with linear function approximation. We call it


TD-Linear for short.
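The TD-Linear update (8.14) is only a few lines of code. The sketch below assumes a generic feature function and a list of experience samples; it is not tied to any particular task.

```python
import numpy as np

def td_linear(samples, phi, m, gamma=0.9, alpha=0.0005, w=None):
    """Apply the TD-Linear update (8.14) to a sequence of samples (s_t, r_{t+1}, s_{t+1})."""
    if w is None:
        w = np.zeros(m)
    for s_t, r_next, s_next in samples:
        td_error = r_next + gamma * phi(s_next) @ w - phi(s_t) @ w
        w = w + alpha * td_error * phi(s_t)       # w_{t+1} = w_t + alpha_t [ ... ] phi(s_t)
    return w

# Example usage with the straight-line features of (8.1) and made-up samples.
phi = lambda s: np.array([float(s), 1.0])
samples = [(1, -1.0, 2), (2, -1.0, 3), (3, 0.0, 3)]
w = td_linear(samples, phi, m=2)
```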
The linear case is much better understood in theory than the nonlinear case. However,
its approximation ability is limited. It is also nontrivial to select appropriate feature
vectors for complex tasks. By contrast, artificial neural networks can approximate values
as black-box universal nonlinear approximators, which are more friendly to use.
Nevertheless, it is still meaningful to study the linear case. A better understanding
of the linear case can help readers better grasp the idea of the function approximation
method. Moreover, the linear case is sufficient for solving the simple grid world tasks
considered in this book. More importantly, the linear case is still powerful in the sense
that the tabular method can be viewed as a special linear case. More information can be
found in Box 8.2.

Box 8.2: Tabular TD learning is a special case of TD-Linear

We next show that the tabular TD algorithm in (7.1) in Chapter 7 is a special case
of the TD-Linear algorithm in (8.14).
Consider the following special feature vector for any s ∈ S:

φ(s) = es ∈ Rn ,

where es is the vector with the entry corresponding to s equal to 1 and the other


entries equal to 0. In this case,

v̂(s, w) = eTs w = w(s),

where w(s) is the entry in w that corresponds to s. Substituting the above equation
into (8.14) yields

wt+1 = wt + αt [rt+1 + γwt(st+1) − wt(st)] est .

The above equation merely updates the entry wt (st ) due to the definition of est .
Motivated by this, multiplying eTst on both sides of the equation yields

wt+1(st) = wt(st) + αt [rt+1 + γwt(st+1) − wt(st)] ,

which is exactly the tabular TD algorithm in (7.1).


In summary, by selecting the feature vector as φ(s) = es , the TD-Linear algorithm
becomes the tabular TD algorithm.
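This equivalence is easy to check numerically: with one-hot features, the TD-Linear update (8.14) modifies only the entry of w that corresponds to the visited state, just like the tabular update (7.1). The state indexing and the sample below are illustrative.

```python
import numpy as np

n, gamma, alpha = 4, 0.9, 0.1
phi = lambda s: np.eye(n)[s]                 # phi(s) = e_s, the one-hot feature vector

w = np.zeros(n)                              # with one-hot features, w(s) plays the role of the table entry
s_t, r_next, s_next = 1, -1.0, 2             # one made-up sample (s_t, r_{t+1}, s_{t+1})

td_error = r_next + gamma * phi(s_next) @ w - phi(s_t) @ w
w = w + alpha * td_error * phi(s_t)          # only w[1] changes, matching tabular TD
print(w)
```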

8.2.4 Illustrative examples


We next present some examples for demonstrating how to use the TD-Linear algorithm
in (8.14) to estimate the state values of a given policy. In the meantime, we demonstrate
how to select feature vectors.

[Figure: (a) the policy; (b) the true state values, listed row by row below; (c) the same values as a 3D surface.]

    -3.8  -3.8  -3.6  -3.1  -3.2
    -3.8  -3.8  -3.8  -3.1  -2.9
    -3.6  -3.9  -3.4  -3.2  -2.9
    -3.9  -3.6  -3.4  -2.9  -3.2
    -4.5  -4.2  -3.4  -3.4  -3.5

Figure 8.6: (a) The policy to be evaluated. (b) The true state values are represented as a table. (c) The true state values are represented as a 3D surface.

The grid world example is shown in Figure 8.6. The given policy takes any action at a
state with a probability of 0.2. Our goal is to estimate the state values under this policy.
There are 25 state values in total. The true state values are shown in Figure 8.6(b). The
true state values are visualized as a three-dimensional surface in Figure 8.6(c).


We next show that we can use fewer than 25 parameters to approximate these state
values. The simulation setup is as follows. Five hundred episodes are generated by the
given policy. Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution. In addition, in each simulation trial, the parameter
vector w is randomly initialized such that each element is drawn from a standard normal
distribution with a zero mean and a standard deviation of 1. We set rforbidden = rboundary =
−1, rtarget = 1, and γ = 0.9.
To implement the TD-Linear algorithm, we need to select the feature vector φ(s) first.
There are different ways to do that as shown below.

 The first type of feature vector is based on polynomials. In the grid world example, a
state s corresponds to a 2D location. Let x and y denote the column and row indexes
of s, respectively. To avoid numerical issues, we normalize x and y so that their values
are within the interval of [−1, +1]. With a slight abuse of notation, the normalized
values are also represented by x and y. Then, the simplest feature vector is

φ(s) = [x, y]T ∈ R2 .

In this case, we have

v̂(s, w) = φT(s)w = [x, y][w1, w2]T = w1 x + w2 y.

When w is given, v̂(s, w) = w1 x + w2 y represents a 2D plane that passes through the


origin. Since the surface of the state values may not pass through the origin, we need
to introduce a bias to the 2D plane to better approximate the state values. To do
that, we consider the following 3D feature vector:

φ(s) = [1, x, y]T ∈ R3 .        (8.15)

In this case, the approximated state value is

v̂(s, w) = φT(s)w = [1, x, y][w1, w2, w3]T = w1 + w2 x + w3 y.

When w is given, v̂(s, w) corresponds to a plane that may not pass through the origin.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of the elements
does not matter.
The estimation result when we use the feature vector in (8.15) is shown in Fig-


ure 8.7(a). It can be seen that the estimated state values form a 2D plane. Although
the estimation error converges as more episodes are used, the error cannot decrease
to zero due to the limited approximation ability of a 2D plane.
To enhance the approximation ability, we can increase the dimension of the feature
vector. To that end, consider

φ(s) = [1, x, y, x2 , y 2 , xy]T ∈ R6 . (8.16)

In this case, v̂(s, w) = φT (s)w = w1 + w2 x + w3 y + w4 x2 + w5 y 2 + w6 xy, which


corresponds to a quadratic 3D surface. We can further increase the dimension of the
feature vector:

φ(s) = [1, x, y, x2 , y 2 , xy, x3 , y 3 , x2 y, xy 2 ]T ∈ R10 . (8.17)

The estimation results when we use the feature vectors in (8.16) and (8.17) are shown
in Figures 8.7(b)-(c). As can be seen, the longer the feature vector is, the more
accurately the state values can be approximated. However, in all three cases, the
estimation error cannot converge to zero because these linear approximators still have
limited approximation abilities.

[Figure: state value error (RMSE) versus episode index for TD-Linear with α = 0.0005, for the three feature vectors. Panels: (a) φ(s) ∈ R3, (b) φ(s) ∈ R6, (c) φ(s) ∈ R10.]

Figure 8.7: TD-Linear estimation results obtained with the polynomial features in (8.15), (8.16), and (8.17).

 In addition to polynomial feature vectors, many other types of features are available
such as Fourier basis and tile coding [3, Chapter 9]. First, the values of x and y of


[Figure: state value error (RMSE) versus episode index for TD-Linear with α = 0.0005, for the three Fourier feature vectors. Panels: (a) q = 1 and φ(s) ∈ R4, (b) q = 2 and φ(s) ∈ R9, (c) q = 3 and φ(s) ∈ R16.]

Figure 8.8: TD-Linear estimation results obtained with the Fourier features in (8.18).

each state are normalized to the interval of [0, 1]. The resulting feature vector is

φ(s) = [. . . , cos(π(c1 x + c2 y)), . . .]T ∈ R(q+1)² ,        (8.18)

where π denotes the mathematical constant 3.1415 . . . rather than a policy. Here,
c1 or c2 can be set as any integers in {0, 1, . . . , q}, where q is a user-specified integer.
As a result, there are (q + 1)2 possible values for the pair (c1 , c2 ) to take. Hence, the
dimension of φ(s) is (q + 1)². For example, in the case of q = 1, the feature vector is

φ(s) = [cos(π(0x + 0y)), cos(π(0x + 1y)), cos(π(1x + 0y)), cos(π(1x + 1y))]T
     = [1, cos(πy), cos(πx), cos(π(x + y))]T ∈ R4 .

The estimation results obtained when we use the Fourier features with q = 1, 2, 3 are
shown in Figure 8.8. The dimensions of the feature vectors in the three cases are
4, 9, 16, respectively. As can be seen, the higher the dimension of the feature vector
is, the more accurately the state values can be approximated.
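Feature vectors such as (8.18) are easy to generate programmatically. The sketch below builds the order-q Fourier features from the normalized coordinates (x, y) of a cell; the polynomial features (8.15)–(8.17) can be assembled the same way by listing the desired monomials.

```python
import numpy as np
from itertools import product

def fourier_features(x, y, q):
    """Fourier feature vector of order q for normalized coordinates (x, y) in [0, 1].

    The entries are cos(pi * (c1 * x + c2 * y)) for all (c1, c2) in {0, ..., q}^2,
    so the dimension is (q + 1)^2 as in (8.18).
    """
    return np.array([np.cos(np.pi * (c1 * x + c2 * y))
                     for c1, c2 in product(range(q + 1), repeat=2)])

# For q = 1 this reproduces [1, cos(pi*y), cos(pi*x), cos(pi*(x + y))]^T.
print(fourier_features(0.25, 0.5, q=1))
```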

8.2.5 Theoretical analysis


Thus far, we have finished describing the story of TD learning with function approxima-
tion. This story started from the objective function in (8.3). To optimize this objective


function, we introduced the stochastic algorithm in (8.12). Later, the true value function
in the algorithm, which was unknown, was replaced by an approximation, leading to the
TD algorithm in (8.13). Although this story is helpful for understanding the basic idea
of value function approximation, it is not mathematically rigorous. For example, the
algorithm in (8.13) actually does not minimize the objective function in (8.3).
We next present a theoretical analysis of the TD algorithm in (8.13) to reveal why
the algorithm works effectively and what mathematical problems it solves. Since gen-
eral nonlinear approximators are difficult to analyze, this part only considers the linear
case. Readers are advised to read selectively based on their interests since this part is
mathematically intensive.

Convergence analysis

To study the convergence property of (8.13), we first consider the following deterministic
algorithm:
wt+1 = wt + αt E[(rt+1 + γφT(st+1)wt − φT(st)wt) φ(st)] ,        (8.19)

where the expectation is calculated with respect to the random variables st , st+1 , rt+1 .
The distribution of st is assumed to be the stationary distribution dπ . The algorithm
in (8.19) is deterministic because the random variables st , st+1 , rt+1 all disappear after
calculating the expectation.
Why would we consider this deterministic algorithm? First, the convergence of this
deterministic algorithm is easier (though nontrivial) to analyze. Second and more im-
portantly, the convergence of this deterministic algorithm implies the convergence of the
stochastic TD algorithm in (8.13). That is because (8.13) can be viewed as a stochastic
gradient descent (SGD) implementation of (8.19). Therefore, we only need to study the
convergence property of the deterministic algorithm.
Although the expression of (8.19) may look complex at first glance, it can be greatly
simplified. To do that, define

.. ..
   
. .
 ∈ Rn×m ,  ∈ Rn×n ,
   
Φ= T D= (8.20)
φ (s) dπ (s)
..
   
...
.

where Φ is the matrix containing all the feature vectors, and D is a diagonal matrix with
the stationary distribution in its diagonal entries. The two matrices will be frequently
used.

Lemma 8.1. The expectation in (8.19) can be rewritten as

E[(rt+1 + γφT(st+1)wt − φT(st)wt) φ(st)] = b − Awt ,



where

A ≐ ΦT D(I − γPπ)Φ ∈ Rm×m ,        b ≐ ΦT Drπ ∈ Rm .        (8.21)

Here, Pπ , rπ are the two terms in the Bellman equation vπ = rπ + γPπ vπ , and I is the
identity matrix with appropriate dimensions.

The proof is given in Box 8.3. With the expression in Lemma 8.1, the deterministic
algorithm in (8.19) can be rewritten as

wt+1 = wt + αt (b − Awt ), (8.22)

which is a simple deterministic process. Its convergence is analyzed below.


First, what is the converged value of wt ? Hypothetically, if wt converges to a constant
value w∗ as t → ∞, then (8.22) implies w∗ = w∗ + α∞ (b − Aw∗ ), which suggests that
b − Aw∗ = 0 and hence
w∗ = A−1 b.

Several remarks about this converged value are given below.

 Is A invertible? The answer is yes. In fact, A is not only invertible but also positive
definite. That is, for any nonzero vector x with appropriate dimensions, xT Ax > 0.
The proof is given in Box 8.4.
 What is the interpretation of w∗ = A−1 b? It is actually the optimal solution for min-
imizing the projected Bellman error. The details will be introduced in Section 8.2.5.
 The tabular method is a special case. One interesting result is that, when the di-
mensionality of w equals n = |S| and φ(s) = [0, . . . , 1, . . . , 0]T , where the entry corre-
sponding to s is 1, we have

w∗ = A−1 b = vπ . (8.23)

This equation indicates that the parameter vector to be learned is actually the true
state value. This conclusion is consistent with the fact that the tabular TD algorithm
is a special case of the TD-Linear algorithm, as introduced in Box 8.2. The proof
of (8.23) is given below. It can be verified that Φ = I in this case and hence A =
ΦT D(I − γPπ )Φ = D(I − γPπ ) and b = ΦT Drπ = Drπ . Thus, w∗ = A−1 b =
(I − γPπ )−1 D−1 Drπ = (I − γPπ )−1 rπ = vπ .

Second, we prove that wt in (8.22) converges to w∗ = A−1 b as t → ∞. Since (8.22) is


a simple deterministic process, it can be proven in many ways. We present two proofs as
follows.


 Proof 1: Define the convergence error as δt ≐ wt − w∗ . We only need to show that δt
converges to zero. To do that, substituting wt = δt + w∗ into (8.22) gives

δt+1 = δt − αt Aδt = (I − αt A)δt .

It then follows that


δt+1 = (I − αt A) · · · (I − α0 A)δ0 .

Consider the simple case where αt = α for all t. Then, we have

‖δt+1‖2 ≤ ‖I − αA‖2^{t+1} ‖δ0‖2 .

When α > 0 is sufficiently small, we have that ‖I − αA‖2 < 1 and hence δt → 0 as
t → ∞. The reason why ‖I − αA‖2 < 1 holds is that A is positive definite and hence
xT(I − αA)x < xTx for any nonzero x.
 Proof 2: Consider g(w) ≐ b − Aw. Since w∗ is the root of g(w) = 0, the task is actually
a root-finding problem. The algorithm in (8.22) is actually a Robbins-Monro (RM)
algorithm. Although the original RM algorithm was designed for stochastic processes,
it can also be applied to deterministic cases. The convergence of RM algorithms can
shed light on the convergence of wt+1 = wt + αt (b − Awt ). That is, wt converges to
w∗ when Σt αt = ∞ and Σt αt² < ∞.

Box 8.3: Proof of Lemma 8.1


By using the law of total expectation, we have

E[rt+1 φ(st) + φ(st)(γφT(st+1) − φT(st))wt]
    = Σ_{s∈S} dπ(s) E[rt+1 φ(st) + φ(st)(γφT(st+1) − φT(st))wt | st = s]
    = Σ_{s∈S} dπ(s) E[rt+1 φ(st) | st = s] + Σ_{s∈S} dπ(s) E[φ(st)(γφT(st+1) − φT(st))wt | st = s].        (8.24)

Here, st is assumed to obey the stationary distribution dπ .


First, consider the first term in (8.24). Note that

E[rt+1 φ(st) | st = s] = φ(s) E[rt+1 | st = s] = φ(s) rπ(s),

where rπ(s) = Σa π(a|s) Σr r p(r|s, a). Then, the first term in (8.24) can be rewritten


as

Σ_{s∈S} dπ(s) E[rt+1 φ(st) | st = s] = Σ_{s∈S} dπ(s) φ(s) rπ(s) = ΦT Drπ ,        (8.25)

where rπ = [· · · , rπ (s), · · · ]T ∈ Rn .
Second, consider the second term in (8.24). Since

E[φ(st)(γφT(st+1) − φT(st))wt | st = s]
    = −E[φ(st)φT(st)wt | st = s] + E[γφ(st)φT(st+1)wt | st = s]
    = −φ(s)φT(s)wt + γφ(s) E[φT(st+1) | st = s] wt
    = −φ(s)φT(s)wt + γφ(s) Σ_{s′∈S} p(s′|s) φT(s′) wt ,

the second term in (8.24) becomes

Σ_{s∈S} dπ(s) E[φ(st)(γφT(st+1) − φT(st))wt | st = s]
    = Σ_{s∈S} dπ(s)[−φ(s)φT(s)wt + γφ(s) Σ_{s′∈S} p(s′|s) φT(s′) wt]
    = Σ_{s∈S} dπ(s)φ(s)[−φ(s) + γ Σ_{s′∈S} p(s′|s) φ(s′)]T wt
    = ΦT D(−Φ + γPπ Φ)wt
    = −ΦT D(I − γPπ)Φwt .        (8.26)

Combining (8.25) and (8.26) gives

E[(rt+1 + γφT(st+1)wt − φT(st)wt) φ(st)] = ΦT Drπ − ΦT D(I − γPπ)Φwt
    ≐ b − Awt ,        (8.27)

where b ≐ ΦT Drπ and A ≐ ΦT D(I − γPπ)Φ.

Box 8.4: Proving that A = ΦT D(I −γPπ )Φ is invertible and positive definite.

The matrix A is positive definite if xT Ax > 0 for any nonzero vector x with ap-
propriate dimensions. If A is positive (or negative) definite, it is denoted as A ≻ 0
(or A ≺ 0). Here, ≻ and ≺ should be differentiated from > and <, which indicate
elementwise comparisons. Note that A may not be symmetric. Although positive


definite matrices often refer to symmetric matrices, nonsymmetric ones can also be
positive definite.
We next prove that A ≻ 0 and hence A is invertible. The idea for proving A ≻ 0
is to show that

D(I − γPπ) ≐ M ≻ 0.        (8.28)

It is clear that M ≻ 0 implies A = ΦT M Φ ≻ 0 since Φ is a tall matrix with full column


rank (suppose that the feature vectors are selected to be linearly independent). Note
that
M = (M + MT)/2 + (M − MT)/2.

Since M − MT is skew-symmetric and hence xT(M − MT)x = 0 for any x, we know
that M ≻ 0 if and only if M + MT ≻ 0. To show M + MT ≻ 0, we apply the fact
that strictly diagonally dominant matrices are positive definite [4].
First, it holds that

(M + M T )1n > 0, (8.29)

where 1n = [1, . . . , 1]T ∈ Rn . The proof of (8.29) is given below. Since Pπ 1n = 1n ,


we have M 1n = D(I − γPπ )1n = D(1n − γ1n ) = (1 − γ)dπ . Moreover, M T 1n =
(I − γPπT )D1n = (I − γPπT )dπ = (1 − γ)dπ , where the last equality is valid because
PπT dπ = dπ . In summary, we have

(M + M T )1n = 2(1 − γ)dπ .

Since all the entries of dπ are positive (see Box 8.1), we have (M + M T )1n > 0.
Second, the elementwise form of (8.29) is
Σ_{j=1}^{n} [M + MT]ij > 0,    i = 1, . . . , n,

which can be further written as


[M + MT]ii + Σ_{j≠i} [M + MT]ij > 0.

It can be verified according to (8.28) that the diagonal entries of M are positive and
the off-diagonal entries of M are nonpositive. Therefore, the above inequality can be


rewritten as

[M + MT]ii > Σ_{j≠i} |[M + MT]ij| .

The above inequality indicates that the absolute value of the ith diagonal entry in
M + M T is greater than the sum of the absolute values of the off-diagonal entries
in the same row. Thus, M + MT is strictly diagonally dominant and the proof is
complete.

TD learning minimizes the projected Bellman error

While we have shown that the TD-Linear algorithm converges to w∗ = A−1 b, we next
show that w∗ is the optimal solution that minimizes the projected Bellman error. To do
that, we review three objective functions.

 The first objective function is

JE (w) = E[(vπ (S) − v̂(S, w))2 ],

which has been introduced in (8.3). By the definition of expectation, JE (w) can be
reexpressed in a matrix-vector form as

JE(w) = ‖v̂(w) − vπ‖D² ,

where vπ is the true state value vector and v̂(w) is the approximated one. Here, ‖·‖D²
is a weighted norm: ‖x‖D² = xT Dx = ‖D1/2 x‖2² , where D is given in (8.20).
This is the simplest objective function that we can imagine when talking about func-
tion approximation. However, it relies on the true state values, which are unknown. To obtain
an implementable algorithm, we must consider other objective functions such as the
Bellman error and projected Bellman error [50–54].
 The second objective function is the Bellman error. In particular, since vπ satisfies
the Bellman equation vπ = rπ + γPπ vπ , it is expected that the estimated value v̂(w)
should also satisfy this equation to the greatest extent possible. Thus, the Bellman
error is

JBE(w) ≐ ‖v̂(w) − (rπ + γPπ v̂(w))‖D² = ‖v̂(w) − Tπ(v̂(w))‖D² .        (8.30)

Here, Tπ (·) is the Bellman operator. In particular, for any vector x ∈ Rn , the Bellman
operator is defined as
Tπ(x) ≐ rπ + γPπ x.


Minimizing the Bellman error is a standard least-squares problem. The details of the
solution are omitted here.
 Third, it is notable that JBE (w) in (8.30) may not be minimized to zero due to the
limited approximation ability of the approximator. By contrast, an objective function
that can be minimized to zero is the projected Bellman error :

JP BE(w) = ‖v̂(w) − M Tπ(v̂(w))‖D² ,

where M ∈ Rn×n is the orthogonal projection matrix that geometrically projects any
vector onto the space of all approximations.

In fact, the TD learning algorithm in (8.13) aims to minimize the projected Bellman
error JP BE rather than JE or JBE . The reason is as follows. For the sake of simplicity,
consider the linear case where v̂(w) = Φw. Here, Φ is defined in (8.20). The range space
of Φ is the set of all possible linear approximations. Then,

M = Φ(ΦT DΦ)−1 ΦT D ∈ Rn×n

is the projection matrix that geometrically projects any vector onto the range space of Φ.
Since v̂(w) is in the range space of Φ, we can always find a value of w that can minimize
JP BE (w) to zero. It can be proven that the solution minimizing JP BE (w) is w∗ = A−1 b.
That is,

w∗ = A−1 b = arg minw JP BE(w).

The proof is given in Box 8.5.

Box 8.5: Showing that w∗ = A−1 b minimizes JP BE (w)

We next show that w∗ = A−1 b is the optimal solution that minimizes JP BE (w). Since
JP BE (w) = 0 ⇔ v̂(w) − M Tπ (v̂(w)) = 0, we only need to study the root of

v̂(w) = M Tπ (v̂(w)).

In the linear case, substituting v̂(w) = Φw and the expression of the projection matrix M given above into
the above equation gives

Φw = Φ(ΦT DΦ)−1 ΦT D(rπ + γPπ Φw). (8.31)


Since Φ has full column rank, we have Φx = Φy ⇔ x = y for any x, y. Therefore, (8.31) implies

w = (ΦT DΦ)−1 ΦT D(rπ + γPπ Φw)
⇐⇒ ΦT D(rπ + γPπ Φw) = (ΦT DΦ)w
⇐⇒ ΦT Drπ + γΦT DPπ Φw = (ΦT DΦ)w
⇐⇒ ΦT Drπ = ΦT D(I − γPπ)Φw
⇐⇒ w = (ΦT D(I − γPπ)Φ)−1 ΦT Drπ = A−1 b,

where A, b are given in (8.21). Therefore, w∗ = A−1 b is the optimal solution that
minimizes JP BE (w).
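The closed-form solution can also be verified numerically. The sketch below builds A and b from (8.21) for a small made-up Markov chain and feature matrix, solves for w∗, and checks that Φw∗ is a fixed point of the projected Bellman operator, i.e., that JPBE(w∗) = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 4, 2, 0.9

# Made-up model of a policy: transition matrix P_pi (rows sum to 1) and reward vector r_pi.
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.standard_normal(n)

# Stationary distribution d_pi (left eigenvector of P_pi for eigenvalue 1) and D = diag(d_pi).
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))]); d /= d.sum()
D = np.diag(d)

Phi = rng.standard_normal((n, m))                   # feature matrix with full column rank

A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi    # (8.21)
b = Phi.T @ D @ r_pi
w_star = np.linalg.solve(A, b)

# Projection matrix M = Phi (Phi^T D Phi)^{-1} Phi^T D and the projected Bellman equation.
M = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
v_hat = Phi @ w_star
print(np.allclose(v_hat, M @ (r_pi + gamma * P_pi @ v_hat)))   # True: v_hat(w*) = M T_pi(v_hat(w*))
```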

Since the TD algorithm aims to minimize JP BE rather than JE , it is natural to ask


how close the estimated value v̂(w) is to the true state value vπ . In the linear case, the
estimated value that minimizes the projected Bellman error is v̂(w) = Φw∗ . Its deviation
from the true state value vπ satisfies

‖Φw∗ − vπ‖D ≤ (1/(1 − γ)) minw ‖v̂(w) − vπ‖D = (1/(1 − γ)) √(minw JE(w)) .        (8.32)

The proof of this inequality is given in Box 8.6. Inequality (8.32) indicates that the
discrepancy between Φw∗ and vπ is bounded from above by the minimum value of JE (w).
However, this bound is loose, especially when γ is close to one. It is thus mainly of
theoretical value.

Box 8.6: Proof of the error bound in (8.32)

Note that

‖Φw∗ − vπ‖D = ‖Φw∗ − M vπ + M vπ − vπ‖D
    ≤ ‖Φw∗ − M vπ‖D + ‖M vπ − vπ‖D
    = ‖M Tπ(Φw∗) − M Tπ(vπ)‖D + ‖M vπ − vπ‖D ,        (8.33)

where the last equality is due to Φw∗ = M Tπ (Φw∗ ) and vπ = Tπ (vπ ). Substituting

M Tπ (Φw∗ ) − M Tπ (vπ ) = M (rπ + γPπ Φw∗ ) − M (rπ + γPπ vπ ) = γM Pπ (Φw∗ − vπ )


into (8.33) yields

‖Φw∗ − vπ‖D ≤ ‖γM Pπ(Φw∗ − vπ)‖D + ‖M vπ − vπ‖D
    ≤ γ‖M‖D ‖Pπ(Φw∗ − vπ)‖D + ‖M vπ − vπ‖D
    = γ‖Pπ(Φw∗ − vπ)‖D + ‖M vπ − vπ‖D        (because ‖M‖D = 1)
    ≤ γ‖Φw∗ − vπ‖D + ‖M vπ − vπ‖D .        (because ‖Pπ x‖D ≤ ‖x‖D for all x)

The proofs of ‖M‖D = 1 and ‖Pπ x‖D ≤ ‖x‖D are postponed to the end of this box.
Reorganizing the above inequality gives

‖Φw∗ − vπ‖D ≤ (1/(1 − γ)) ‖M vπ − vπ‖D = (1/(1 − γ)) minw ‖v̂(w) − vπ‖D ,

where the last equality is because ‖M vπ − vπ‖D is the error between vπ and its
orthogonal projection into the space of all possible approximations. Therefore, it is
the minimum value of the error between vπ and any v̂(w).
We next prove some useful facts, which have already been used in the above proof.

 Properties of matrix weighted norms. By definition, ‖x‖D² = xT Dx = ‖D1/2 x‖2².
The induced matrix norm is ‖A‖D = maxx≠0 ‖Ax‖D/‖x‖D = ‖D1/2 AD−1/2‖2 . For
matrices A, B with appropriate dimensions, we have ‖ABx‖D ≤ ‖A‖D ‖B‖D ‖x‖D .
To see that, ‖ABx‖D = ‖D1/2 ABx‖2 = ‖D1/2 AD−1/2 D1/2 BD−1/2 D1/2 x‖2 ≤
‖D1/2 AD−1/2‖2 ‖D1/2 BD−1/2‖2 ‖D1/2 x‖2 = ‖A‖D ‖B‖D ‖x‖D .
 Proof of ‖M‖D = 1. This is valid because ‖M‖D = ‖Φ(ΦT DΦ)−1 ΦT D‖D =
‖D1/2 Φ(ΦT DΦ)−1 ΦT DD−1/2‖2 = 1, where the last equality is valid because the
matrix inside the L2-norm is an orthogonal projection matrix and the L2-norm of
any orthogonal projection matrix is equal to one.
 Proof of ‖Pπ x‖D ≤ ‖x‖D for any x ∈ Rn . First,

‖Pπ x‖D² = xT PπT DPπ x = Σ_{i,j} xi [PπT DPπ]ij xj = Σ_{i,j} xi (Σ_{k} [PπT]ik [D]kk [Pπ]kj) xj .


Reorganizing the above equation gives

‖Pπ x‖D² = Σ_{k} [D]kk (Σ_{i} [Pπ]ki xi)²
    ≤ Σ_{k} [D]kk (Σ_{i} [Pπ]ki xi²)        (due to Jensen's inequality [55, 56])
    = Σ_{i} (Σ_{k} [D]kk [Pπ]ki) xi²
    = Σ_{i} [D]ii xi²        (due to dTπ Pπ = dTπ)
    = ‖x‖D² .

Least-squares TD

We next introduce an algorithm called least-squares TD (LSTD) [57]. Like the TD-Linear
algorithm, LSTD aims to minimize the projected Bellman error. However, it has some
advantages over the TD-Linear algorithm.
Recall that the optimal parameter for minimizing the projected Bellman error is
w∗ = A−1 b, where A = ΦT D(I − γPπ)Φ and b = ΦT Drπ . In fact, it follows from (8.27)

that A and b can also be written as


A = E[φ(st)(φ(st) − γφ(st+1))T] ,
b = E[rt+1 φ(st)] .

The above two equations show that A and b are expectations of quantities depending on st, st+1, and rt+1. The idea of
LSTD is simple: if we can use random samples to directly obtain the estimates of A and
b, which are denoted as Â and b̂, then the optimal parameter can be directly estimated
as w∗ ≈ Â−1 b̂.
In particular, suppose that (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) is a trajectory obtained
by following a given policy π. Let Ât and b̂t be the estimates of A and b at time t,
respectively. They are calculated as the averages of the samples:

Ât = Σ_{k=0}^{t−1} φ(sk)(φ(sk) − γφ(sk+1))T ,
b̂t = Σ_{k=0}^{t−1} rk+1 φ(sk) .        (8.34)

Then, the estimated parameter is

wt = Ât⁻¹ b̂t .


The reader may wonder if a coefficient of 1/t is missing on the right-hand side of (8.34).
In fact, it is omitted for the sake of simplicity since the value of wt remains the same
when it is omitted. Since Ât may not be invertible especially when t is small, Ât is usually
biased by a small constant matrix σI, where I is the identity matrix and σ is a small
positive number.
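A batch LSTD estimate based on (8.34) can be sketched as follows. The feature function, the trajectory, and the regularization constant σ are placeholders.

```python
import numpy as np

def lstd(states, rewards, phi, m, gamma=0.9, sigma=1e-3):
    """Estimate w* = A^{-1} b from a trajectory s_0, r_1, s_1, ..., r_T, s_T following pi.

    A_hat and b_hat are the sample sums in (8.34); sigma * I is added to A_hat
    in case it is not yet invertible.
    """
    A_hat = sigma * np.eye(m)
    b_hat = np.zeros(m)
    for k in range(len(rewards)):                       # rewards[k] = r_{k+1}
        f, f_next = phi(states[k]), phi(states[k + 1])
        A_hat += np.outer(f, f - gamma * f_next)
        b_hat += rewards[k] * f
    return np.linalg.solve(A_hat, b_hat)

# Example usage with the straight-line features of (8.1) and made-up data.
phi = lambda s: np.array([float(s), 1.0])
w = lstd(states=[1, 2, 3, 3], rewards=[-1.0, -1.0, 0.0], phi=phi, m=2)
```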
The advantage of LSTD is that it uses experience samples more efficiently and con-
verges faster than the TD method. That is because this algorithm is specifically designed
based on the knowledge of the optimal solution’s expression. The better we understand
a problem, the better algorithms we can design.
The disadvantages of LSTD are as follows. First, it can only estimate state values.
By contrast, the TD algorithm can be extended to estimate action values as shown in the
next section. Moreover, while the TD algorithm allows nonlinear approximators, LSTD
does not. That is because this algorithm is specifically designed based on the expression
of w∗ . Second, the computational cost of LSTD is higher than that of TD since LSTD
updates an m × m matrix in each update step, whereas TD updates an m-dimensional
vector. More importantly, in every step, LSTD needs to compute the inverse of Ât , whose
computational complexity is O(m3 ). The common method for resolving this problem is
to directly update the inverse of Ât rather than updating Ât . In particular, Ât+1 can be
calculated recursively as follows:
t
X T
Ât+1 = φ(sk ) φ(sk ) − γφ(sk+1 )
k=0
t−1
X T T
= φ(sk ) φ(sk ) − γφ(sk+1 ) + φ(st ) φ(st ) − γφ(st+1 )
k=0
T
= Ât + φ(st ) φ(st ) − γφ(st+1 ) .

The above expression decomposes Ât+1 into the sum of two matrices. Its inverse can be
calculated as [58]
Ât+1⁻¹ = [Ât + φ(st)(φ(st) − γφ(st+1))T]⁻¹
       = Ât⁻¹ − [Ât⁻¹ φ(st)(φ(st) − γφ(st+1))T Ât⁻¹] / [1 + (φ(st) − γφ(st+1))T Ât⁻¹ φ(st)] .

Therefore, we can directly store and update Ât⁻¹ to avoid the need to calculate the matrix
inverse. This recursive algorithm does not require a step size. However, it requires setting
the initial value of Â0⁻¹. The initial value of such a recursive algorithm can be selected as
Â0⁻¹ = σI, where σ is a positive number. A good tutorial on the recursive least-squares
approach can be found in [59].
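The recursive update of Ât⁻¹ is a rank-one (Sherman-Morrison) update and needs no matrix inversion. The sketch below is one possible implementation; the initialization follows the suggestion above (Â0⁻¹ = σI with σ a positive number), and the feature function is a placeholder.

```python
import numpy as np

def recursive_lstd_step(A_inv, b_hat, phi_t, phi_next, r_next, gamma=0.9):
    """Update A_hat^{-1} and b_hat for one new sample (s_t, r_{t+1}, s_{t+1})."""
    u = phi_t                                   # rank-one update: A_{t+1} = A_t + u v^T
    v = phi_t - gamma * phi_next
    Au = A_inv @ u
    # Sherman-Morrison: (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u)
    A_inv = A_inv - np.outer(Au, v @ A_inv) / (1.0 + v @ Au)
    b_hat = b_hat + r_next * phi_t
    return A_inv, b_hat, A_inv @ b_hat          # the last output is the current estimate w_t

m, sigma = 2, 1.0
A_inv, b_hat = sigma * np.eye(m), np.zeros(m)   # initial value A_0^{-1} = sigma * I
phi = lambda s: np.array([float(s), 1.0])
A_inv, b_hat, w = recursive_lstd_step(A_inv, b_hat, phi(1), phi(2), -1.0)
```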


8.3 TD learning of action values based on function approximation
While Section 8.2 introduced the problem of state value estimation, the present section
introduces how to estimate action values. The tabular Sarsa and tabular Q-learning
algorithms are extended to the case of value function approximation. Readers will see
that the extension is straightforward.

8.3.1 Sarsa with function approximation


The Sarsa algorithm with function approximation can be readily obtained from (8.13)
by replacing the state values with action values. In particular, suppose that qπ (s, a) is
approximated by q̂(s, a, w). Replacing v̂(s, w) in (8.13) by q̂(s, a, w) gives
wt+1 = wt + αt [rt+1 + γ q̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt).        (8.35)

The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear
functions are used, we have

q̂(s, a, w) = φT (s, a)w,

where φ(s, a) is a feature vector. In this case, ∇w q̂(s, a, w) = φ(s, a).


The value estimation step in (8.35) can be combined with a policy improvement step
to learn optimal policies. The procedure is summarized in Algorithm 8.2. It should be
noted that accurately estimating the action values of a given policy requires (8.35) to be
run sufficiently many times. However, (8.35) is executed only once before switching to the
policy improvement step. This is similar to the tabular Sarsa algorithm. Moreover, the
implementation in Algorithm 8.2 aims to solve the task of finding a good path to the target
state from a prespecified starting state. As a result, it cannot find the optimal policy
for every state. However, if sufficient experience data are available, the implementation
process can be easily adapted to find optimal policies for every state.
An illustrative example is shown in Figure 8.9. In this example, the task is to find a
good policy that can lead the agent to the target when starting from the top-left state.
Both the total reward and the length of each episode gradually converge to steady values.
In this example, the linear feature vector is selected as the Fourier function of order 5.
The expression of a Fourier feature vector is given in (8.18).


[Figure: total reward and episode length versus episode index (left), and the grid world (right).]

Figure 8.9: Sarsa with linear function approximation. Here, γ = 0.9, ε = 0.1, rboundary = rforbidden = −10, rtarget = 1, and α = 0.001.

Algorithm 8.2: Sarsa with function approximation

Initialization: Initial parameter w0. Initial policy π0. αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state s0.
For each episode, do
    Generate a0 at s0 following π0(s0)
    If st (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (rt+1, st+1, at+1) given (st, at): generate rt+1, st+1 by interacting with the environment; generate at+1 following πt(st+1).
        Update q-value:
            wt+1 = wt + αt [rt+1 + γ q̂(st+1, at+1, wt) − q̂(st, at, wt)] ∇w q̂(st, at, wt)
        Update policy:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg maxa∈A(st) q̂(st, a, wt+1)
            πt+1(a|st) = ε/|A(st)| otherwise
        st ← st+1, at ← at+1
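A condensed Python sketch of Algorithm 8.2 with linear features is given below. The environment interface (reset/step) and the state-action feature function phi_sa are placeholders for the grid world used in the example; the ε-greedy policy is stored implicitly through w rather than as a table.

```python
import numpy as np

def sarsa_fa(env, phi_sa, m, n_actions, episodes=500,
             gamma=0.9, alpha=0.001, eps=0.1, seed=0):
    """Sarsa with linear function approximation (a simplified Algorithm 8.2).

    q_hat(s, a, w) = phi_sa(s, a)^T w. env is assumed to provide
    reset() -> s and step(a) -> (r, s_next, done).
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(m)

    def eps_greedy(s):
        # With probability eps choose uniformly, otherwise choose the greedy action;
        # this yields the same action probabilities as the policy-update step of Algorithm 8.2.
        if rng.random() < eps:
            return int(rng.integers(n_actions))
        return int(np.argmax([phi_sa(s, a) @ w for a in range(n_actions)]))

    for _ in range(episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            r, s_next, done = env.step(a)
            a_next = eps_greedy(s_next)
            td_error = r + gamma * phi_sa(s_next, a_next) @ w - phi_sa(s, a) @ w
            w = w + alpha * td_error * phi_sa(s, a)      # update rule (8.35)
            s, a = s_next, a_next
    return w
```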

8.3.2 Q-learning with function approximation


Tabular Q-learning can also be extended to the case of function approximation. The
update rule is
wt+1 = wt + αt[rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)]∇w q̂(st, at, wt).    (8.36)

The above update rule is similar to (8.35) except that q̂(st+1 , at+1 , wt ) in (8.35) is replaced
with maxa∈A(st+1 ) q̂(st+1 , a, wt ).
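
Continuing the earlier sketch, a single update of (8.36) with a linear approximator differs only in the bootstrap term, which maximizes over the actions available at st+1. Again, the feature map `phi` and the finite action set `actions` are hypothetical.

```python
import numpy as np

def q_learning_linear_update(w, phi, s, a, r, s_next, actions, alpha=0.001, gamma=0.9):
    """One step of (8.36) with q_hat(s, a, w) = phi(s, a)^T w."""
    q = phi(s, a) @ w
    q_next_max = max(phi(s_next, b) @ w for b in actions)   # max over A(s_{t+1})
    td_error = r + gamma * q_next_max - q
    return w + alpha * td_error * phi(s, a)
```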
Similar to the tabular case, (8.36) can be implemented in either an on-policy or
off-policy fashion. An on-policy version is given in Algorithm 8.3. An example for
demonstrating the on-policy version is shown in Figure 8.10. In this example, the task is
to find a good policy that can lead the agent to the target state from the top-left state.


Algorithm 8.3: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter w0. Initial policy π0. αt = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal path that can lead the agent to the target state from an initial state s0.
For each episode, do
    If st (t = 0, 1, 2, . . .) is not the target state, do
        Collect the experience sample (at, rt+1, st+1) given st: generate at following πt(st); generate rt+1, st+1 by interacting with the environment.
        Update q-value:
            wt+1 = wt + αt[rt+1 + γ max_{a∈A(st+1)} q̂(st+1, a, wt) − q̂(st, at, wt)]∇w q̂(st, at, wt)
        Update policy:
            πt+1(a|st) = 1 − (ε/|A(st)|)(|A(st)| − 1) if a = arg max_{a∈A(st)} q̂(st, a, wt+1)
            πt+1(a|st) = ε/|A(st)| otherwise

As can be seen, Q-learning with linear function approximation can successfully learn an
optimal policy. Here, linear Fourier basis functions of order five are used. The off-policy
version will be demonstrated when we introduce deep Q-learning in Section 8.4.
[Figure 8.10 plots: the total reward per episode and the episode length per episode versus the episode index, together with a grid-world panel.]

Figure 8.10: Q-learning with linear function approximation. Here, γ = 0.9, ε = 0.1, r_boundary = r_forbidden = −10, r_target = 1, and α = 0.001.

One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are
represented as functions, the policy π(a|s) is still represented as a table. Thus, it still
assumes finite numbers of states and actions. In Chapter 9, we will see that the policies
can be represented as functions so that continuous state and action spaces can be handled.

8.4 Deep Q-learning


We can integrate deep neural networks into Q-learning to obtain an approach called
deep Q-learning or deep Q-network (DQN) [22, 60, 61]. Deep Q-learning is one of the


earliest and most successful deep reinforcement learning algorithms. Notably, the neural
networks do not have to be deep. For simple tasks such as our grid world examples,
shallow networks with one or two hidden layers may be sufficient.
Deep Q-learning can be viewed as an extension of the algorithm in (8.36). However,
its mathematical formulation and implementation techniques are substantially different
and deserve special attention.

8.4.1 Algorithm description


Mathematically, deep Q-learning aims to minimize the following objective function:
" 2 #
0
J =E R + γ max0 q̂(S , a, w) − q̂(S, A, w) , (8.37)
a∈A(S )

where (S, A, R, S′) are random variables that denote a state, an action, the immediate reward, and the next state, respectively. This objective function can be viewed as the squared Bellman optimality error. That is because

q(s, a) = E[Rt+1 + γ max_{a∈A(St+1)} q(St+1, a) | St = s, At = a],  for all s, a

is the Bellman optimality equation (the proof is given in Box 7.5). Therefore, R + γ max_{a∈A(S′)} q̂(S′, a, w) − q̂(S, A, w) should equal zero in the expectation sense when q̂(S, A, w) can accurately approximate the optimal action values.
To minimize the objective function in (8.37), we can use the gradient descent algorithm. To that end, we need to calculate the gradient of J with respect to w. It is noted that the parameter w appears not only in q̂(S, A, w) but also in y ≐ R + γ max_{a∈A(S′)} q̂(S′, a, w). As a result, it is nontrivial to calculate the gradient. For the sake of simplicity, it is assumed that the value of w in y is fixed (for a short period of time) so that the calculation of the gradient becomes much easier. In particular, we introduce two networks: one is a main network representing q̂(s, a, w) and the other is a target network q̂(s, a, wT). The
objective function in this case becomes
" 2 #
J =E R + γ max0 q̂(S 0 , a, wT ) − q̂(S, A, w) ,
a∈A(S )

where wT is the target network's parameter. When wT is fixed, the gradient of J is

∇w J = −E[(R + γ max_{a∈A(S′)} q̂(S′, a, wT) − q̂(S, A, w)) ∇w q̂(S, A, w)],    (8.38)

where some constant coefficients are omitted without loss of generality.


To use the gradient in (8.38) to minimize the objective function in (8.37), we need to


pay attention to the following techniques.

 The first technique is to use two networks, a main network and a target network,
as mentioned when we calculate the gradient in (8.38). The implementation details
are explained below. Let w and wT denote the parameters of the main and target
networks, respectively. They are initially set to the same value.
In every iteration, we draw a mini-batch of samples {(s, a, r, s′)} from the replay buffer (the replay buffer will be explained soon). The inputs of the main network are s and a. The output y = q̂(s, a, w) is the estimated q-value. The target value of the output is yT ≐ r + γ max_{a∈A(s′)} q̂(s′, a, wT). The main network is updated to minimize the TD error (also called the loss function) Σ(y − yT)^2 over the samples {(s, a, yT)}.
Updating w in the main network does not explicitly use the gradient in (8.38). Instead,
it relies on the existing software tools for training neural networks. As a result, we
need a mini-batch of samples to train a network instead of using a single sample to
update the main network based on (8.38). This is one notable difference between deep
and nondeep reinforcement learning algorithms.
The main network is updated in every iteration. By contrast, the target network is
set to be the same as the main network every certain number of iterations to satisfy
the assumption that wT is fixed when calculating the gradient in (8.38).
 The second technique is experience replay [22, 60, 62]. That is, after we have collected
some experience samples, we do not use these samples in the order they were collected.
Instead, we store them in a dataset called the replay buffer. In particular, let (s, a, r, s′) be an experience sample and B ≐ {(s, a, r, s′)} be the replay buffer. Every time we update the main network, we draw a mini-batch of experience samples from the replay buffer. This drawing of samples, called experience replay, should follow a uniform distribution.
Why is experience replay necessary in deep Q-learning, and why must the replay
follow a uniform distribution? The answer lies in the objective function in (8.37).
In particular, to properly define the objective function, we must specify the probability distributions of S, A, R, and S′. The distributions of R and S′ are determined by the system model once (S, A) is given. The simplest way to describe the distribution of the state-action pair (S, A) is to assume that it is uniformly distributed.
However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. It is
necessary to break the correlation between the samples in the sequence to satisfy the
assumption of uniform distribution. To do this, we can use the experience replay tech-
nique by uniformly drawing samples from the replay buffer. This is the mathematical
reason why experience replay is necessary and why experience replay must follow a
uniform distribution. A benefit of random sampling is that each experience sample may be used multiple times, which can increase the data efficiency. This is especially important when we have a limited amount of data.


Algorithm 8.4: Deep Q-learning (off-policy version)

Initialization: A main network and a target network with the same initial parameter.
Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy πb.
Store the experience samples generated by πb in a replay buffer B = {(s, a, r, s′)}
For each iteration, do
    Uniformly draw a mini-batch of samples from B
    For each sample (s, a, r, s′), calculate the target value as yT = r + γ max_{a∈A(s′)} q̂(s′, a, wT), where wT is the parameter of the target network
    Update the main network to minimize (yT − q̂(s, a, w))^2 using the mini-batch of samples
    Set wT = w every C iterations


The implementation procedure of deep Q-learning is summarized in Algorithm 8.4.


This implementation is off-policy. It can also be adapted to become on-policy if needed.
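
As an illustration of Algorithm 8.4, the following is a minimal PyTorch-style sketch of the training loop. The network size, learning rate, target-update period C, the input normalization in `x_of`, and the variable `replay_buffer` (assumed to be a pre-filled list of (s, a, r, s′) tuples collected by the behavior policy) are assumptions for the grid-world setting, not the exact implementation behind Figure 8.11.

```python
import random
import torch
import torch.nn as nn

GAMMA, BATCH, LR, C, N_ACTIONS = 0.9, 100, 1e-3, 10, 5

class QNet(nn.Module):
    """Inputs: normalized (row, col, action); output: one estimated q-value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, 100), nn.ReLU(), nn.Linear(100, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def x_of(s, a):
    # hypothetical normalization of a 5x5 grid state s = (row, col), indexed from 0
    return torch.tensor([s[0] / 4.0, s[1] / 4.0, a / (N_ACTIONS - 1)], dtype=torch.float32)

main_net, target_net = QNet(), QNet()
target_net.load_state_dict(main_net.state_dict())       # same initial parameters
optimizer = torch.optim.SGD(main_net.parameters(), lr=LR)

# replay_buffer: assumed list of (s, a, r, s_next) tuples generated by the behavior policy
for it in range(1000):
    batch = random.sample(replay_buffer, BATCH)          # uniform experience replay
    x = torch.stack([x_of(s, a) for (s, a, r, s_next) in batch])
    with torch.no_grad():                                # target network is held fixed
        y_T = torch.stack([
            r + GAMMA * torch.stack([target_net(x_of(s_next, b))
                                     for b in range(N_ACTIONS)]).max()
            for (s, a, r, s_next) in batch
        ])
    loss = ((main_net(x) - y_T) ** 2).mean()             # TD error / loss function
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if (it + 1) % C == 0:                                # periodically sync the target network
        target_net.load_state_dict(main_net.state_dict())
```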

8.4.2 Illustrative examples


An example is given in Figure 8.11 to demonstrate Algorithm 8.4. This example aims
to learn the optimal action values for every state-action pair. Once the optimal action
values are obtained, the optimal greedy policy can be obtained immediately.
A single episode is generated by the behavior policy shown in Figure 8.11(a). This
behavior policy is exploratory in the sense that it has the same probability of taking
any action at any state. The episode has only 1,000 steps as shown in Figure 8.11(b).
Although there are only 1,000 steps, almost all the state-action pairs are visited in this
episode due to the strong exploration ability of the behavior policy. The replay buffer is
a set of 1,000 experience samples. The mini-batch size is 100, meaning that we uniformly draw 100 samples from the replay buffer each time we update the network.
The main and target networks have the same structure: a neural network with one
hidden layer of 100 neurons (the numbers of layers and neurons can be tuned). The
neural network has three inputs and one output. The first two inputs are the normalized
row and column indexes of a state. The third input is the normalized action index. Here,
“normalization” means converting a value to the interval of [0,1]. The output of the
network is the estimated action value. The reason why we design the inputs as the row
and column of a state rather than a state index is that we know that a state corresponds
to a two-dimensional location in the grid. The more information about the state we use
when designing the network, the better the network can perform. Moreover, the neural network can also be designed in other ways. For example, it can have two inputs and five outputs, where the two inputs are the normalized row and column of a state and the outputs are the five estimated action values for the input state [22].

[Figure 8.11 panels: (a) the behavior policy; (b) an episode with 1,000 steps; (c) the final learned policy; (d) the TD error/loss function versus the iteration index, which converges to zero; (e) the state value error (RMSE) versus the iteration index, which also converges to zero.]

Figure 8.11: Optimal policy learning via deep Q-learning. Here, γ = 0.9, r_boundary = r_forbidden = −10, and r_target = 1. The batch size is 100.
As shown in Figure 8.11(d), the loss function, defined as the average squared TD
error of each mini-batch, converges to zero, meaning that the network can fit the training
samples well. As shown in Figure 8.11(e), the state value estimation error also converges
to zero, indicating that the estimates of the optimal action values become sufficiently
accurate. Then, the corresponding greedy policy is optimal.
This example demonstrates the high efficiency of deep Q-learning. In particular, a
short episode of 1,000 steps is sufficient for obtaining an optimal policy here. By contrast,
an episode with 100,000 steps is required by tabular Q-learning, as shown in Figure 7.4.
One reason for the high efficiency is that the function approximation method has a strong
generalization ability. Another reason is that the experience samples can be repeatedly
used.
We next deliberately challenge the deep Q-learning algorithm by considering a scenario
with fewer experience samples. Figure 8.12 shows an example of an episode with merely
100 steps. In this example, although the network can still be well-trained in the sense that the loss function converges to zero, the state value estimation error does not converge to zero. This means that the network can properly fit the given experience samples, but the experience samples are too few to accurately estimate the optimal action values.

[Figure 8.12 panels: (a) the behavior policy; (b) an episode with 100 steps; (c) the final learned policy; (d) the TD error/loss function versus the iteration index, which converges to zero; (e) the state value error (RMSE) versus the iteration index, which does not converge to zero.]

Figure 8.12: Optimal policy learning via deep Q-learning. Here, γ = 0.9, r_boundary = r_forbidden = −10, and r_target = 1. The batch size is 50.

8.5 Summary
This chapter continued the study of TD learning algorithms but switched from the tabular method to the function approximation method. The key to understanding the function approximation method is to recognize that value estimation is formulated as an optimization problem. The
simplest objective function is the squared error between the true state values and the
estimated values. There are also other objective functions such as the Bellman error
and the projected Bellman error. We have shown that the TD-Linear algorithm actually
minimizes the projected Bellman error. The Sarsa and Q-learning algorithms with value function approximation have also been introduced.
One reason why the value function approximation method is important is that it allows
artificial neural networks to be integrated with reinforcement learning. For example,
deep Q-learning is one of the most successful deep reinforcement learning algorithms.


Although neural networks have been widely used as nonlinear function approximators,
this chapter provides a comprehensive introduction to the linear function case. Fully
understanding the linear case is important for better understanding the nonlinear case.
Interested readers may refer to [63] for a thorough analysis of TD learning algorithms
with function approximation. A more theoretical discussion on deep Q-learning can be
found in [61].
An important concept named stationary distribution is introduced in this chapter.
The stationary distribution plays an important role in defining an appropriate objective
function in the value function approximation method. It also plays a key role in Chapter 9
when we use functions to approximate policies. An excellent introduction to this topic
can be found in [49, Chapter IV]. The contents of this chapter heavily rely on matrix
analysis. Some results are used without explanation. Excellent references regarding
matrix analysis and linear algebra can be found in [4, 48].

8.6 Q&A
 Q: What is the difference between the tabular and function approximation methods?
A: One important difference is how a value is updated and retrieved.
How to retrieve a value: When the values are represented by a table, if we would like
to retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, we need to input the state index s into
the function and calculate the function value. If the function is an artificial neural
network, a forward propagation process from the input to the output is needed.
How to update a value: When the values are represented by a table, if we would like
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, we must update the function
parameter to change the values indirectly.
 Q: What are the advantages of the function approximation method over the tabular
method?
A: Due to the way state values are retrieved, the function approximation method is
more efficient in storage. In particular, while the tabular method needs to store |S|
values, the function approximation method only needs to store a parameter vector
whose dimension is usually much less than |S|.
Due to the way in which state values are updated, the function approximation method
has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. With the tabular method, updating one state
value would not change the other state values. However, with the function approx-
imation method, updating the function parameter affects the values of many states.

196
8.6. Q&A S. Zhao, 2023

Therefore, the experience sample for one state can generalize to help estimate the
values of other states.
 Q: Can we unify the tabular and the function approximation methods?
A: Yes. The tabular method can be viewed as a special case of the function approximation method. The related details can be found in Box 8.2; a minimal one-hot sketch is also given at the end of this section.
 Q: What is the stationary distribution and why is it important?
A: The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently long
period, the probability of the agent visiting a state can be described by this stationary
distribution. More information can be found in Box 8.1.
The reason why this concept emerges in this chapter is that it is necessary for defining
a valid objective function. In particular, the objective function involves the probability
distribution of the states, which is usually selected as the stationary distribution. The
stationary distribution is important not only for the value approximation method but
also for the policy gradient method, which will be introduced in Chapter 9.
 Q: What are the advantages and disadvantages of the linear function approximation
method?
A: Linear function approximation is the simplest case whose theoretical properties
can be thoroughly analyzed. However, the approximation ability of this method is
limited. It is also nontrivial to select appropriate feature vectors for complex tasks.
By contrast, artificial neural networks can be used to approximate values as black-box universal nonlinear approximators, which are easier to use. Nevertheless, it
is still meaningful to study the linear case to better grasp the idea of the function
approximation method. Moreover, the linear case is powerful in the sense that the
tabular method can be viewed as a special linear case (Box 8.2).
 Q: Why does deep Q-learning require experience replay?
A: The reason lies in the objective function in (8.37). In particular, to properly define the objective function, we must specify the probability distributions of S, A, R, and S′. The distributions of R and S′ are determined by the system model once (S, A) is
given. The simplest way to describe the distribution of the state-action pair (S, A)
is to assume it to be uniformly distributed. However, the state-action samples may
not be uniformly distributed in practice since they are generated as a sequence by the
behavior policy. It is necessary to break the correlation between the samples in the
sequence to satisfy the assumption of uniform distribution. To do this, we can use
the experience replay technique by uniformly drawing samples from the replay buffer.
A benefit of experience replay is that each experience sample may be used multiple
times, which can increase the data efficiency.


 Q: Can tabular Q-learning use experience replay?


A: Although tabular Q-learning does not require experience replay, it can also use
experience replay without encountering problems. That is because Q-learning has no
requirements about how the samples are obtained due to its off-policy attribute. One
benefit of using experience replay is that the samples can be used repeatedly and
hence more efficiently.
 Q: Why does deep Q-learning require two networks?
A: The fundamental reason is to simplify the calculation of the gradient of (8.37).
Specifically, the parameter w appears not only in q̂(S, A, w) but also in R + γ max_{a∈A(S′)} q̂(S′, a, w). As a result, it is nontrivial to calculate the gradient with respect to w. On the one hand, if we fix w in R + γ max_{a∈A(S′)} q̂(S′, a, w), the gradient can be easily calculated
as shown in (8.38). This gradient suggests that two networks should be maintained.
The main network’s parameter is updated in every iteration. The target network’s
parameter is fixed within a certain period. On the other hand, the target network’s
parameter cannot be fixed forever. It should be updated every certain number of
iterations.
 Q: When an artificial neural network is used as a nonlinear function approximator,
how should we update its parameter?
A: It must be noted that we should not directly update the parameter vector by
using, for example, (8.36). Instead, we should follow the network training procedure
to update the parameter. This procedure can be realized based on neural network
training toolkits, which are currently mature and widely available.
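
Regarding the question above on unifying the two methods, here is a minimal sketch (an illustration of the idea in Box 8.2, under the assumption of one-hot features): if φ(s) is the one-hot vector of state s, then v̂(s, w) = φᵀ(s)w = w[s], so the parameter vector plays the role of the value table, and the linear TD update changes only the entry of the visited state, exactly as in the tabular case.

```python
import numpy as np

n_states, alpha, gamma = 25, 0.1, 0.9
w = np.zeros(n_states)                  # with one-hot features, w is exactly the value table

def phi(s):
    e = np.zeros(n_states)              # one-hot (canonical basis) feature vector of state s
    e[s] = 1.0
    return e

def td_linear_update(w, s, r, s_next):
    """Linear TD update; since grad_w v_hat(s, w) = phi(s) has one nonzero entry,
    only w[s] changes, recovering the tabular TD update."""
    td_error = r + gamma * (phi(s_next) @ w) - (phi(s) @ w)
    return w + alpha * td_error * phi(s)
```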
