Chapter 8
Value Function Approximation

[Chapter map: Chapter 8 (Value Function Approximation) extends the temporal-difference methods of Chapter 7 from a tabular representation of values to a function representation. It builds on the fundamental tools introduced earlier: the basic concepts of Chapter 1, the Bellman equation of Chapter 2, and the Bellman optimality equation of Chapter 3.]
8.1 Value representation: From table to function
Figure 8.2: An illustration of the function approximation method. The x-axis and y-axis correspond to s and v̂(s), respectively; the points at s_1, s_2, . . . , s_n are fitted by the straight line v̂(s) = as + b.
State              s_1        s_2       · · ·    s_n
Estimated value    v̂(s_1)    v̂(s_2)    · · ·    v̂(s_n)
We next show that the values in the above table can be approximated by a function.
In particular, {(si , v̂(si ))}ni=1 are shown as n points in Figure 8.2. These points can be
fitted or approximated by a curve. The simplest curve is a straight line, which can be
described as
" #
a
v̂(s, w) = as + b = [s, 1] = φT (s)w. (8.1)
|{z} b
φT (s) | {z }
w
Here, v̂(s, w) is a function for approximating vπ (s). It is determined jointly by the state s
and the parameter vector w ∈ R2 . v̂(s, w) is sometimes written as v̂w (s). Here, φ(s) ∈ R2
is called the feature vector of s.
The first notable difference between the tabular and function approximation methods
concerns how they retrieve and update a value.
How to retrieve a value: When the values are represented by a table, if we want to
retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, it becomes slightly more complicated
to retrieve a value. In particular, we need to input the state index s into the function
and calculate the function value (Figure 8.3). For the example in (8.1), we first need
to calculate the feature vector φ(s) and then calculate φT (s)w. If the function is
an artificial neural network, a forward propagation from the input to the output is
needed.
Figure 8.3: An illustration of the process for retrieving the value of s when using the function approximation method: the state s is fed into the function parameterized by w, which outputs v̂(s, w).
The function approximation method is more efficient in terms of storage due to the
way in which the state values are retrieved. Specifically, while the tabular method
needs to store n values, we now only need to store a lower dimensional parameter
vector w. Thus, the storage efficiency can be significantly improved. Such a benefit
is, however, not free. It comes with a cost: the state values may not be accurately
represented by the function. For example, a straight line is not able to accurately fit
the points in Figure 8.2. That is why this method is called approximation. From a
fundamental point of view, some information will certainly be lost when we use a low-
dimensional vector to represent a high-dimensional dataset. Therefore, the function
approximation method enhances storage efficiency by sacrificing accuracy.
How to update a value: When the values are represented by a table, if we want
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, the way to update a value is
completely different. Specifically, we must update w to change the values indirectly.
How to update w to find optimal state values will be addressed in detail later.
Thanks to the way in which the state values are updated, the function approximation
method has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. When using the tabular method, we can update
a value if the corresponding state is visited in an episode. The values of the states
that have not been visited cannot be updated. However, when using the function
approximation method, we need to update w to update the value of a state. The
update of w also affects the values of some other states even though these states have
not been visited. Therefore, the experience sample for one state can generalize to help
estimate the values of some other states.
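To make the contrast concrete, here is a small Python sketch (a toy illustration, not code from the book) of how a value is retrieved and updated under the two representations; the state encoding, step size, and target value are arbitrary choices for illustration.

```python
import numpy as np

n_states = 5
states = np.arange(n_states)

# Tabular representation: one stored entry per state.
v_table = np.zeros(n_states)
v_table[3] = 0.7          # update: rewrite the entry of s3 directly
value_s3 = v_table[3]     # retrieve: read the entry directly

# Linear function approximation: v_hat(s, w) = phi(s)^T w with phi(s) = [s, 1].
w = np.zeros(2)

def phi(s):
    return np.array([float(s), 1.0])

def v_hat(s, w):
    return phi(s) @ w     # retrieve: evaluate the function at s

# Update: change w so that v_hat(3, w) moves toward a target value (0.7 here).
# As a side effect, the values of ALL states change (generalization).
alpha, target = 0.1, 0.7
w += alpha * (target - v_hat(3, w)) * phi(3)

print([round(v_hat(s, w), 3) for s in states])  # neighboring values also moved
```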
The above analysis is illustrated in Figure 8.4, where there are three states {s1 , s2 , s3 }.
Suppose that we have an experience sample for s3 and would like to update v̂(s3 ).
When using the tabular method, we can only update v̂(s3 ) without changing v̂(s1 ) or
v̂(s2 ), as shown in Figure 8.4(a). When using the function approximation method,
updating w not only can update v̂(s3 ) but also would change v̂(s1 ) and v̂(s2 ), as shown
in Figure 8.4(b). Therefore, the experience sample of s3 can help update the values
of its neighboring states.
Figure 8.4: An illustration of the generalization abilities of the tabular and function approximation methods when v̂(s_3) is updated.
(a) Tabular method: when v̂(s_3) is updated, the other values remain the same.
(b) Function approximation method: when we update v̂(s_3) by changing w, the values of the neighboring states are also changed.
We can use more complex functions that have stronger approximation abilities than
straight lines. For example, consider a second-order polynomial:
v̂(s, w) = as^2 + bs + c = [s^2, s, 1] [a, b, c]^T = φ^T(s) w,   (8.2)

where φ(s) = [s^2, s, 1]^T and w = [a, b, c]^T.
We can use even higher-order polynomial curves to fit the points. As the order of the
curve increases, the approximation accuracy can be improved, but the dimension of the
parameter vector also increases, requiring more storage and computational resources.
Note that v̂(s, w) in either (8.1) or (8.2) is linear in w (though it may be nonlinear
in s). This type of method is called linear function approximation, which is the simplest
function approximation method. To realize linear function approximation, we need to
select an appropriate feature vector φ(s). That is, we must decide, for example, whether
we should use a first-order straight line or a second-order curve to fit the points. The
selection of appropriate feature vectors is nontrivial. It requires prior knowledge of the
given task: the better we understand the task, the better the feature vectors we can select.
For instance, if we know that the points in Figure 8.2 are approximately located on a
straight line, we can use a straight line to fit the points. However, such prior knowledge
is usually unknown in practice. If we do not have any prior knowledge, a popular solution
is to use artificial neural networks as nonlinear function approximations.
Another important problem is how to find the optimal parameter vector. If we know
{vπ (si )}ni=1 , this is a least-squares problem. The optimal parameter can be obtained by
optimizing the following objective function:
J_1 = \sum_{i=1}^n (v̂(s_i, w) − v_π(s_i))^2 = \sum_{i=1}^n (φ^T(s_i) w − v_π(s_i))^2 = ‖Φw − v_π‖_2^2,

where the rows of Φ are the feature vectors of the states:

Φ = [φ^T(s_1); . . . ; φ^T(s_n)] ∈ R^{n×2},   v_π = [v_π(s_1), . . . , v_π(s_n)]^T ∈ R^n.
More information about least-squares problems can be found in [47, Section 3.3] and
[48, Section 5.14].
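If the true values {v_π(s_i)} were known, the least-squares problem above could be solved directly. A minimal numpy sketch is shown below; the state values used are made-up numbers, only meant to illustrate fitting the line in (8.1).

```python
import numpy as np

# Hypothetical true state values for n = 5 states indexed 1..5 (made-up numbers).
s = np.arange(1, 6, dtype=float)
v_pi = np.array([0.5, 1.1, 1.4, 2.1, 2.4])

# Feature matrix Phi with rows phi(s)^T = [s, 1] as in (8.1).
Phi = np.column_stack([s, np.ones_like(s)])

# Solve min_w ||Phi w - v_pi||^2, which is J1 in the text.
w, *_ = np.linalg.lstsq(Phi, v_pi, rcond=None)
print("fitted line: v_hat(s) =", w[0], "* s +", w[1])
```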
The curve-fitting example presented in this section illustrates the basic idea of value
function approximation. This idea will be formally introduced in the next section.
8.2 TD learning of state values based on function approximation

The goal of value function approximation is to find a parameter vector w such that v̂(s, w) approximates v_π(s) well for every state. To that end, we can define the objective function

J(w) = E[(v_π(S) − v̂(S, w))^2],   (8.3)
where the expectation is calculated with respect to the random variable S ∈ S. While S
is a random variable, what is its probability distribution? This question is important for
understanding this objective function. There are several ways to define the probability
distribution of S.
The first way is to use a uniform distribution. That is to treat all the states as equally
important by setting the probability of each state to 1/n. In this case, the objective
function in (8.3) becomes
J(w) = (1/n) \sum_{s∈S} (v_π(s) − v̂(s, w))^2,   (8.4)
which is the average value of the approximation errors of all the states. However, this
way does not consider the real dynamics of the Markov process under the given policy.
Since some states may be rarely visited by a policy, it may be unreasonable to treat
all the states as equally important.
The second way, which is the focus of this chapter, is to use the stationary distribution.
The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently
long period, the probability of the agent being located at any state can be described
by this stationary distribution. Interested readers may see the details in Box 8.1.
Let {dπ (s)}s∈S denote the stationary distribution of the Markov process under policy
π. That is, the probability for the agent visiting s after a long period of time is dπ (s).
By definition, \sum_{s∈S} d_π(s) = 1. Then, the objective function in (8.3) can be rewritten as

J(w) = \sum_{s∈S} d_π(s) (v_π(s) − v̂(s, w))^2,   (8.5)
which is a weighted average of the approximation errors. The states that have higher
probabilities of being visited are given greater weights.
It is notable that the value of dπ (s) is nontrivial to obtain because it requires knowing
the state transition probability matrix Pπ (see Box 8.1). Fortunately, we do not need to
calculate the specific value of dπ (s) to minimize this objective function as shown in the
next subsection. In addition, it was assumed that the number of states was finite when
we introduced (8.4) and (8.5). When the state space is continuous, we can replace the
summations with integrals.
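As a small numerical illustration of the difference between (8.4) and (8.5), the sketch below evaluates both objectives for made-up vectors v_π, v̂, and d_π.

```python
import numpy as np

v_pi  = np.array([1.0, 2.0, 3.0, 4.0])          # hypothetical true values
v_hat = np.array([1.2, 1.8, 3.5, 3.9])          # hypothetical approximations
d_pi  = np.array([0.03, 0.11, 0.13, 0.73])      # hypothetical stationary distribution

J_uniform  = np.mean((v_pi - v_hat) ** 2)            # objective (8.4)
J_weighted = np.sum(d_pi * (v_pi - v_hat) ** 2)      # objective (8.5)
print(J_uniform, J_weighted)   # frequently visited states dominate (8.5)
```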
Box 8.1: The stationary distribution

The key tool for analyzing the stationary distribution is P_π ∈ R^{n×n}, which is the probability transition matrix under the given policy π. If the states are indexed as s_1, . . . , s_n,
then [Pπ ]ij is defined as the probability for the agent moving from si to sj . The
definition of Pπ can be found in Section 2.6.
Interpretation of Pπk (k = 1, 2, 3, . . . ).
First of all, it is necessary to examine the interpretation of the entries in Pπk . The
probability of the agent transitioning from si to sj using exactly k steps is denoted
as
p_{ij}^{(k)} = Pr(S_{t_k} = s_j | S_{t_0} = s_i),
where t0 and tk are the initial and kth time steps, respectively. First, by the
definition of Pπ , we have
[P_π]_{ij} = p_{ij}^{(1)},
which means that [Pπ ]ij is the probability of transitioning from si to sj using a
single step. Second, consider Pπ2 . It can be verified that
[P_π^2]_{ij} = [P_π P_π]_{ij} = \sum_{q=1}^n [P_π]_{iq} [P_π]_{qj}.
Since [P_π]_{iq}[P_π]_{qj} is the joint probability of transitioning from s_i to s_q and then from s_q to s_j, we know that [P_π^2]_{ij} is the probability of transitioning from s_i to s_j using exactly two steps. That is,

[P_π^2]_{ij} = p_{ij}^{(2)}.

Similarly, it holds for any k that

[P_π^k]_{ij} = p_{ij}^{(k)},

which means that [P_π^k]_{ij} is the probability of transitioning from s_i to s_j using exactly k steps.
Definition of stationary distributions.
Let d0 ∈ Rn be a vector representing the probability distribution of the states at
the initial time step. For example, if s is always selected as the starting state,
then d0 (s) = 1 and the other entries of d0 are 0. Let dk ∈ Rn be the vector
representing the probability distribution obtained after exactly k steps starting
from d0 . Then, we have
d_k(s_i) = \sum_{j=1}^n d_0(s_j) [P_π^k]_{ji},   i = 1, . . . , n.   (8.6)
This equation indicates that the probability of the agent visiting si at step k
equals the sum of the probabilities of the agent transitioning from {sj }nj=1 to si
using exactly k steps. The matrix-vector form of (8.6) is

d_k^T = d_0^T P_π^k.   (8.7)

When we consider the long-term behavior of the Markov process, it holds under certain conditions that

lim_{k→∞} P_π^k = 1_n d_π^T,   (8.8)

where 1_n = [1, . . . , 1]^T ∈ R^n and 1_n d_π^T is a constant matrix with all its rows equal to d_π^T. The conditions under which (8.8) is valid will be discussed later. Substituting (8.8) into (8.7) yields

lim_{k→∞} d_k^T = d_0^T 1_n d_π^T = d_π^T,

since d_0^T 1_n = 1. That is, the state distribution d_k converges to the stationary distribution d_π regardless of the initial distribution d_0.
Figure 8.5: Long-term behavior of an ε-greedy policy with ε = 0.5. The plot shows the percentage of times each of the states s_1, . . . , s_4 is visited versus the step index (up to 1,000 steps). The asterisks in the right figure represent the theoretical values of the elements of d_π.
The eigenvalues of PπT can be calculated as {−0.0449, 0.3, 0.4449, 1}. The
unit-length (right) eigenvector of PπT corresponding to the eigenvalue 1 is
[0.0463, 0.1455, 0.1785, 0.9720]T . After scaling this vector so that the sum of
all its elements is equal to 1, we obtain the theoretical value of dπ as follows:
d_π = [0.0345, 0.1084, 0.1330, 0.7241]^T.
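The computation described above can be reproduced with a few lines of numpy. The transition matrix P_pi below is a placeholder (the matrix of the example is not shown here), so the printed numbers will differ from those above.

```python
import numpy as np

# Placeholder row-stochastic transition matrix under a given policy (not the book's example).
P_pi = np.array([
    [0.60, 0.20, 0.10, 0.10],
    [0.10, 0.50, 0.20, 0.20],
    [0.10, 0.20, 0.40, 0.30],
    [0.05, 0.05, 0.10, 0.80],
])

# d_pi is the right eigenvector of P_pi^T associated with eigenvalue 1,
# scaled so that its entries sum to one.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
k = np.argmin(np.abs(eigvals - 1.0))
d_pi = np.real(eigvecs[:, k])
d_pi = d_pi / d_pi.sum()
print(d_pi)   # the long-run visit frequencies of the four states
```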
To minimize the objective function J(w), we can use the gradient-descent algorithm

w_{k+1} = w_k − α_k ∇_w J(w_k),

where α_k is the step size and the gradient is

∇_w J(w_k) = −2 E[(v_π(S) − v̂(S, w_k)) ∇_w v̂(S, w_k)].

Substituting the gradient into the update rule gives

w_{k+1} = w_k + 2α_k E[(v_π(S) − v̂(S, w_k)) ∇_w v̂(S, w_k)],   (8.11)
where the coefficient 2 before αk can be merged into αk without loss of generality. The
algorithm in (8.11) requires calculating the expectation. In the spirit of stochastic gra-
dient descent, we can replace the true gradient with a stochastic gradient. Then, (8.11)
becomes
w_{t+1} = w_t + α_t (v_π(s_t) − v̂(s_t, w_t)) ∇_w v̂(s_t, w_t),   (8.12)

where s_t is a sample of the state S. Since the true value v_π(s_t) in (8.12) is unknown, it can be replaced by the TD target r_{t+1} + γ v̂(s_{t+1}, w_t), which yields the TD learning algorithm with function approximation:

w_{t+1} = w_t + α_t [r_{t+1} + γ v̂(s_{t+1}, w_t) − v̂(s_t, w_t)] ∇_w v̂(s_t, w_t).   (8.13)
Understanding the TD algorithm in (8.13) is important for studying the other algo-
rithms in this chapter. Notably, (8.13) can only learn the state values of a given policy.
It will be extended to algorithms that can learn action values in Sections 8.3.1 and 8.3.2.
We next consider the case where linear functions are used to approximate the state values. That is,

v̂(s, w) = φ^T(s) w,
where φ(s) ∈ Rm is the feature vector of s. The lengths of φ(s) and w are equal to m,
which is usually much smaller than the number of states. In the linear case, the gradient is

∇_w v̂(s, w) = φ(s).

Substituting this gradient into (8.13) yields

w_{t+1} = w_t + α_t [r_{t+1} + γ φ^T(s_{t+1}) w_t − φ^T(s_t) w_t] φ(s_t),   (8.14)

which is the TD algorithm with linear function approximation (TD-Linear for short).
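A single update of the TD-Linear algorithm (8.14) can be written in a few lines. The sketch below only mirrors the formula; the feature vectors and reward are made-up numbers.

```python
import numpy as np

def td_linear_step(w, phi_s, phi_s_next, r, gamma=0.9, alpha=0.0005):
    """One update of (8.14): w <- w + alpha * (TD error) * phi(s)."""
    td_error = r + gamma * phi_s_next @ w - phi_s @ w
    return w + alpha * td_error * phi_s

# Toy usage with 3-dimensional features (hypothetical numbers).
w = np.zeros(3)
phi_s      = np.array([1.0, 0.2, 0.4])
phi_s_next = np.array([1.0, 0.4, 0.4])
w = td_linear_step(w, phi_s, phi_s_next, r=-1.0)
print(w)
```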
We next show that the tabular TD algorithm in (7.1) in Chapter 7 is a special case
of the TD-Linear algorithm in (8.14).
Consider the following special feature vector for any s ∈ S:
φ(s) = es ∈ Rn ,
where es is the vector with the entry corresponding to s equal to 1 and the other
entries equal to 0. In this case,

v̂(s, w) = φ^T(s) w = e_s^T w = w(s),
where w(s) is the entry in w that corresponds to s. Substituting the above equation
into (8.14) yields
w_{t+1} = w_t + α_t [r_{t+1} + γ w_t(s_{t+1}) − w_t(s_t)] e_{s_t}.
Due to the definition of e_{s_t}, the above equation only updates the entry w_t(s_t) while leaving the other entries unchanged. Multiplying both sides of the equation by e_{s_t}^T yields

w_{t+1}(s_t) = w_t(s_t) + α_t [r_{t+1} + γ w_t(s_{t+1}) − w_t(s_t)],

which is exactly the tabular TD algorithm in (7.1).
Figure 8.6: (a) The policy to be evaluated. (b) The true state values represented as a table. (c) The true state values represented as a 3D surface.
The grid world example is shown in Figure 8.6. The given policy takes any action at a
state with a probability of 0.2. Our goal is to estimate the state values under this policy.
There are 25 state values in total. The true state values are shown in Figure 8.6(b). The
true state values are visualized as a three-dimensional surface in Figure 8.6(c).
We next show that we can use fewer than 25 parameters to approximate these state
values. The simulation setup is as follows. Five hundred episodes are generated by the
given policy. Each episode has 500 steps and starts from a randomly selected state-action
pair following a uniform distribution. In addition, in each simulation trial, the parameter
vector w is randomly initialized such that each element is drawn from a standard normal
distribution with a zero mean and a standard deviation of 1. We set rforbidden = rboundary =
−1, rtarget = 1, and γ = 0.9.
To implement the TD-Linear algorithm, we need to select the feature vector φ(s) first.
There are different ways to do that as shown below.
The first type of feature vector is based on polynomials. In the grid world example, a
state s corresponds to a 2D location. Let x and y denote the column and row indexes
of s, respectively. To avoid numerical issues, we normalize x and y so that their values
are within the interval of [−1, +1]. With a slight abuse of notation, the normalized
values are also represented by x and y. Then, the simplest feature vector is

φ(s) = [x, y]^T ∈ R^2.

In this case, the approximated value v̂(s, w) = φ^T(s) w = w_1 x + w_2 y corresponds to a plane passing through the origin, which may be too restrictive. We can add a constant feature and use

φ(s) = [1, x, y]^T ∈ R^3,   (8.15)

in which case v̂(s, w) = φ^T(s) w = w_1 + w_2 x + w_3 y. When w is given, v̂(s, w) corresponds to a plane that may not pass through the origin.
Notably, φ(s) can also be defined as φ(s) = [x, y, 1]T , where the order of the elements
does not matter.
The estimation result when we use the feature vector in (8.15) is shown in Figure 8.7(a). It can be seen that the estimated state values form a 2D plane. Although
the estimation error converges as more episodes are used, the error cannot decrease
to zero due to the limited approximation ability of a 2D plane.
To enhance the approximation ability, we can increase the dimension of the feature
vector. To that end, we can consider longer polynomial feature vectors that include higher-order terms of x and y, such as those in (8.16) and (8.17).
The estimation results when we use the feature vectors in (8.16) and (8.17) are shown
in Figures 8.7(b)-(c). As can be seen, the longer the feature vector is, the more
accurately the state values can be approximated. However, in all three cases, the
estimation error cannot converge to zero because these linear approximators still have
limited approximation abilities.
Figure 8.7: TD-Linear estimation results obtained with the polynomial features in (8.15), (8.16), and (8.17). Each panel shows the state value error (RMSE) versus the episode index; the learning rate is α = 0.0005.
In addition to polynomial feature vectors, many other types of features are available, such as the Fourier basis and tile coding [3, Chapter 9]. To construct Fourier features, the values of x and y of each state are first normalized to the interval [0, 1]. The resulting feature vector is
φ(s) = [. . . , cos(π(c_1 x + c_2 y)), . . .]^T ∈ R^{(q+1)^2},   (8.18)

where π denotes the circumference ratio 3.1415 . . . (not a policy). Here, c_1 and c_2 can be set to any integers in {0, 1, . . . , q}, where q is a user-specified integer. As a result, there are (q + 1)^2 possible values for the pair (c_1, c_2) to take. Hence, the dimension of φ(s) is (q + 1)^2. For example, in the case of q = 1, the feature vector is

φ(s) = [cos(π(0x + 0y)), cos(π(0x + 1y)), cos(π(1x + 0y)), cos(π(1x + 1y))]^T = [1, cos(πy), cos(πx), cos(π(x + y))]^T ∈ R^4.
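The Fourier feature vector in (8.18) can be generated programmatically. The sketch below assumes that x and y have already been normalized to [0, 1]; the iteration order over (c_1, c_2) is a choice made here for illustration.

```python
import numpy as np
from itertools import product

def fourier_features(x, y, q):
    """Feature vector (8.18): cos(pi * (c1*x + c2*y)) for all c1, c2 in {0, ..., q}."""
    return np.array([np.cos(np.pi * (c1 * x + c2 * y))
                     for c1, c2 in product(range(q + 1), repeat=2)])

phi = fourier_features(x=0.25, y=0.5, q=1)
print(phi.shape)   # (q + 1)^2 = 4 features for q = 1
print(phi)         # [1, cos(pi*y), cos(pi*x), cos(pi*(x + y))] in this ordering
```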
The estimation results obtained when we use the Fourier features with q = 1, 2, 3 are shown in Figure 8.8. The dimensions of the feature vectors in the three cases are 4, 9, and 16, respectively. As can be seen, the higher the dimension of the feature vector is, the more accurately the state values can be approximated.

Figure 8.8: TD-Linear estimation results obtained with the Fourier features in (8.18): (a) q = 1 and φ(s) ∈ R^4; (b) q = 2 and φ(s) ∈ R^9; (c) q = 3 and φ(s) ∈ R^16. Each panel shows the state value error (RMSE) versus the episode index; the learning rate is α = 0.0005.
Recall that, to minimize the objective function, we introduced the stochastic algorithm in (8.12). Later, the true value function
in the algorithm, which was unknown, was replaced by an approximation, leading to the
TD algorithm in (8.13). Although this story is helpful for understanding the basic idea
of value function approximation, it is not mathematically rigorous. For example, the
algorithm in (8.13) actually does not minimize the objective function in (8.3).
We next present a theoretical analysis of the TD algorithm in (8.13) to reveal why
the algorithm works effectively and what mathematical problems it solves. Since gen-
eral nonlinear approximators are difficult to analyze, this part only considers the linear
case. Readers are advised to read selectively based on their interests since this part is
mathematically intensive.
Convergence analysis
To study the convergence property of (8.13), we first consider the following deterministic
algorithm:
w_{t+1} = w_t + α_t E[(r_{t+1} + γ φ^T(s_{t+1}) w_t − φ^T(s_t) w_t) φ(s_t)],   (8.19)
where the expectation is calculated with respect to the random variables st , st+1 , rt+1 .
The distribution of st is assumed to be the stationary distribution dπ . The algorithm
in (8.19) is deterministic because the random variables st , st+1 , rt+1 all disappear after
calculating the expectation.
Why would we consider this deterministic algorithm? First, the convergence of this
deterministic algorithm is easier (though nontrivial) to analyze. Second and more im-
portantly, the convergence of this deterministic algorithm implies the convergence of the
stochastic TD algorithm in (8.13). That is because (8.13) can be viewed as a stochastic
gradient descent (SGD) implementation of (8.19). Therefore, we only need to study the
convergence property of the deterministic algorithm.
Although the expression of (8.19) may look complex at first glance, it can be greatly
simplified. To do that, define
Φ = [· · · , φ(s), · · ·]^T ∈ R^{n×m},   D = diag(· · · , d_π(s), · · ·) ∈ R^{n×n},   (8.20)
where Φ is the matrix containing all the feature vectors, and D is a diagonal matrix with
the stationary distribution in its diagonal entries. The two matrices will be frequently
used.
Lemma 8.1. The expected value in (8.19) can be expressed as

E[(r_{t+1} + γ φ^T(s_{t+1}) w_t − φ^T(s_t) w_t) φ(s_t)] = b − A w_t,

where
A ≐ Φ^T D (I − γP_π) Φ ∈ R^{m×m},   b ≐ Φ^T D r_π ∈ R^m.   (8.21)
Here, Pπ , rπ are the two terms in the Bellman equation vπ = rπ + γPπ vπ , and I is the
identity matrix with appropriate dimensions.
The proof is given in Box 8.3. With the expression in Lemma 8.1, the deterministic algorithm in (8.19) can be rewritten as

w_{t+1} = w_t + α_t (b − A w_t).   (8.22)

Several remarks about this algorithm are given below.
Is A invertible? The answer is yes. In fact, A is not only invertible but also positive
definite. That is, for any nonzero vector x with appropriate dimensions, xT Ax > 0.
The proof is given in Box 8.4.
What is the interpretation of w∗ = A−1 b? It is actually the optimal solution for min-
imizing the projected Bellman error. The details will be introduced in Section 8.2.5.
The tabular method is a special case. One interesting result is that, when the di-
mensionality of w equals n = |S| and φ(s) = [0, . . . , 1, . . . , 0]T , where the entry corre-
sponding to s is 1, we have
w∗ = A−1 b = vπ . (8.23)
This equation indicates that the parameter vector to be learned is actually the true
state value. This conclusion is consistent with the fact that the tabular TD algorithm
is a special case of the TD-Linear algorithm, as introduced in Box 8.2. The proof
of (8.23) is given below. It can be verified that Φ = I in this case and hence A =
ΦT D(I − γPπ )Φ = D(I − γPπ ) and b = ΦT Drπ = Drπ . Thus, w∗ = A−1 b =
(I − γPπ )−1 D−1 Drπ = (I − γPπ )−1 rπ = vπ .
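When the model quantities are available, the fixed point w^* = A^{−1} b can be computed directly from (8.21). The sketch below uses small random placeholders for Φ, P_π, and r_π purely to illustrate the formulas; it is not the grid world example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, gamma = 6, 3, 0.9

Phi = rng.normal(size=(n, m))                    # feature matrix (rows are phi(s)^T)
P_pi = rng.random((n, n))
P_pi /= P_pi.sum(axis=1, keepdims=True)          # row-stochastic transition matrix
r_pi = rng.normal(size=n)                        # immediate rewards under the policy

# Stationary distribution of P_pi (eigenvector of P_pi^T for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(P_pi.T)
d_pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
d_pi /= d_pi.sum()
D = np.diag(d_pi)

A = Phi.T @ D @ (np.eye(n) - gamma * P_pi) @ Phi   # (8.21)
b = Phi.T @ D @ r_pi                               # (8.21)
w_star = np.linalg.solve(A, b)

v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # true state values
print(np.linalg.norm(Phi @ w_star - v_pi))               # remaining approximation error
```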
We next present two proofs of the fact that the algorithm in (8.22) converges to w^* = A^{−1} b.
Proof 1: Define the convergence error as δ_t ≐ w_t − w^*. We only need to show that δ_t converges to zero. Substituting w_t = δ_t + w^* into (8.22) and using Aw^* = b gives δ_{t+1} = (I − α_t A) δ_t. With a constant step size α_t = α, it follows that

‖δ_{t+1}‖_2 ≤ ‖I − αA‖_2^{t+1} ‖δ_0‖_2.
When α > 0 is sufficiently small, we have ‖I − αA‖_2 < 1 and hence δ_t → 0 as t → ∞. The reason why ‖I − αA‖_2 < 1 holds is that A is positive definite and hence x^T(I − αA)x < x^T x for any nonzero x when α is sufficiently small.
Proof 2: Consider g(w) ≐ b − Aw. Since w^* is the root of g(w) = 0, the task is actually a root-finding problem. The algorithm in (8.22) is actually a Robbins-Monro (RM) algorithm. Although the original RM algorithm was designed for stochastic processes, it can also be applied to deterministic cases. The convergence of RM algorithms can shed light on the convergence of w_{t+1} = w_t + α_t(b − Aw_t). That is, w_t converges to w^* when \sum_t α_t = ∞ and \sum_t α_t^2 < ∞.
Box 8.3: Proof of Lemma 8.1.

First, the first term in (8.24) can be rewritten as

\sum_{s∈S} d_π(s) E[r_{t+1} φ(s_t) | s_t = s] = \sum_{s∈S} d_π(s) φ(s) r_π(s) = Φ^T D r_π,   (8.25)

where r_π = [· · · , r_π(s), · · ·]^T ∈ R^n.
Second, consider the second term in (8.24). Since

E[φ(s_t)(γ φ^T(s_{t+1}) − φ^T(s_t)) w_t | s_t = s]
  = −E[φ(s_t) φ^T(s_t) w_t | s_t = s] + E[γ φ(s_t) φ^T(s_{t+1}) w_t | s_t = s]
  = −φ(s) φ^T(s) w_t + γ φ(s) E[φ^T(s_{t+1}) | s_t = s] w_t
  = −φ(s) φ^T(s) w_t + γ φ(s) \sum_{s'∈S} p(s'|s) φ^T(s') w_t,
summing over s with weights d_π(s) gives that the second term in (8.24) equals −Φ^T D(I − γP_π)Φ w_t. Combining the two terms yields

E[(r_{t+1} + γ φ^T(s_{t+1}) w_t − φ^T(s_t) w_t) φ(s_t)] = Φ^T D r_π − Φ^T D(I − γP_π)Φ w_t ≐ b − A w_t,   (8.27)

where b ≐ Φ^T D r_π and A ≐ Φ^T D(I − γP_π)Φ.
Box 8.4: Proving that A = ΦT D(I −γPπ )Φ is invertible and positive definite.
The matrix A is positive definite if x^T A x > 0 for any nonzero vector x with appropriate dimensions. If A is positive (or negative) definite, it is denoted as A ≻ 0 (or A ≺ 0). Here, ≻ and ≺ should be differentiated from > and <, which indicate elementwise comparisons. Note that A may not be symmetric. Although positive
definite matrices often refer to symmetric matrices, nonsymmetric ones can also be
positive definite.
We next prove that A ≻ 0 and hence A is invertible. The idea for proving A ≻ 0 is to show that

M ≐ D(I − γP_π) ≻ 0.   (8.28)
First, since all the entries of d_π are positive (see Box 8.1), it can be verified that

(M + M^T) 1_n > 0,   (8.29)

where 1_n = [1, . . . , 1]^T. Second, the elementwise form of (8.29) is

\sum_{j=1}^n [M + M^T]_{ij} > 0,   i = 1, . . . , n.
It can be verified according to (8.28) that the diagonal entries of M are positive and
the off-diagonal entries of M are nonpositive. Therefore, the above inequality can be
rewritten as

[M + M^T]_{ii} > −\sum_{j≠i} [M + M^T]_{ij} = \sum_{j≠i} |[M + M^T]_{ij}|.
The above inequality indicates that the absolute value of the ith diagonal entry in
M + M T is greater than the sum of the absolute values of the off-diagonal entries
in the same row. Thus, M + M^T is strictly diagonally dominant, and the proof is complete.
While we have shown that the TD-Linear algorithm converges to w∗ = A−1 b, we next
show that w∗ is the optimal solution that minimizes the projected Bellman error. To do
that, we review three objective functions.
The first objective function is the squared error between the true and estimated state values,

J_E(w) = E[(v_π(S) − v̂(S, w))^2],

which has been introduced in (8.3). By the definition of expectation, J_E(w) can be reexpressed in a matrix-vector form as

J_E(w) = ‖v̂(w) − v_π‖_D^2,

where v_π is the true state value vector and v̂(w) is the approximated one. Here, ‖·‖_D^2 is a weighted norm: ‖x‖_D^2 = x^T D x = ‖D^{1/2} x‖_2^2, where D is given in (8.20).
This is the simplest objective function that we can imagine when talking about func-
tion approximation. However, it relies on the true state, which is unknown. To obtain
an implementable algorithm, we must consider other objective functions such as the
Bellman error and projected Bellman error [50–54].
The second objective function is the Bellman error. In particular, since vπ satisfies
the Bellman equation vπ = rπ + γPπ vπ , it is expected that the estimated value v̂(w)
should also satisfy this equation to the greatest extent possible. Thus, the Bellman
error is
J_{BE}(w) ≐ ‖v̂(w) − (r_π + γ P_π v̂(w))‖_D^2 = ‖v̂(w) − T_π(v̂(w))‖_D^2.   (8.30)

Here, T_π(·) is the Bellman operator. In particular, for any vector x ∈ R^n, the Bellman operator is defined as

T_π(x) ≐ r_π + γ P_π x.
Minimizing the Bellman error is a standard least-squares problem. The details of the
solution are omitted here.
Third, it is notable that J_{BE}(w) in (8.30) may not be minimized to zero due to the limited approximation ability of the approximator. By contrast, an objective function that can be minimized to zero is the projected Bellman error:

J_{PBE}(w) ≐ ‖v̂(w) − M T_π(v̂(w))‖_D^2,   (8.31)

where M ∈ R^{n×n} is the orthogonal projection matrix that geometrically projects any vector onto the space of all value approximations.
In fact, the TD learning algorithm in (8.13) aims to minimize the projected Bellman
error JP BE rather than JE or JBE . The reason is as follows. For the sake of simplicity,
consider the linear case where v̂(w) = Φw. Here, Φ is defined in (8.20). The range space
of Φ is the set of all possible linear approximations. Then,

M = Φ(Φ^T D Φ)^{−1} Φ^T D

is the projection matrix that geometrically projects any vector onto the range space of Φ. Since v̂(w) is in the range space of Φ, we can always find a value of w that minimizes J_{PBE}(w) to zero. It can be proven that the minimizing solution is w^* = A^{−1} b, where A and b are given in (8.21).
We next show that w∗ = A−1 b is the optimal solution that minimizes JP BE (w). Since
JP BE (w) = 0 ⇔ v̂(w) − M Tπ (v̂(w)) = 0, we only need to study the root of
v̂(w) = M Tπ (v̂(w)).
In the linear case, substituting v̂(w) = Φw and the expression of M given above into the above equation gives
Φw = Φ(Φ^T D Φ)^{−1} Φ^T D (r_π + γ P_π Φ w).

Multiplying both sides by Φ^T D and rearranging terms yields Φ^T D(I − γP_π)Φ w = Φ^T D r_π, that is, Aw = b,
where A, b are given in (8.21). Therefore, w∗ = A−1 b is the optimal solution that
minimizes JP BE (w).
Finally, the discrepancy between Φw^* and the true state value v_π can be bounded as

‖Φw^* − v_π‖_D ≤ (1/(1 − γ)) min_w ‖v̂(w) − v_π‖_D = (1/(1 − γ)) min_w √(J_E(w)).   (8.32)
The proof of this inequality is given in Box 8.6. Inequality (8.32) indicates that the
discrepancy between Φw∗ and vπ is bounded from above by the minimum value of JE (w).
However, this bound is loose, especially when γ is close to one. It is thus mainly of
theoretical value.
Note that
‖Φw^* − v_π‖_D = ‖Φw^* − M v_π + M v_π − v_π‖_D
  ≤ ‖Φw^* − M v_π‖_D + ‖M v_π − v_π‖_D
  = ‖M T_π(Φw^*) − M T_π(v_π)‖_D + ‖M v_π − v_π‖_D,   (8.33)
where the last equality is due to Φw^* = M T_π(Φw^*) and v_π = T_π(v_π). Substituting T_π(x) = r_π + γ P_π x into the first term gives

‖M T_π(Φw^*) − M T_π(v_π)‖_D ≤ ‖M‖_D ‖T_π(Φw^*) − T_π(v_π)‖_D = γ ‖M‖_D ‖P_π(Φw^* − v_π)‖_D ≤ γ ‖Φw^* − v_π‖_D.

The proofs of ‖M‖_D = 1 and ‖P_π x‖_D ≤ ‖x‖_D are postponed to the end of the box. Substituting the above inequality into (8.33) and rearranging terms gives

‖Φw^* − v_π‖_D ≤ (1/(1 − γ)) ‖M v_π − v_π‖_D = (1/(1 − γ)) min_w ‖v̂(w) − v_π‖_D,
where the last equality is because ‖M v_π − v_π‖_D is the error between v_π and its orthogonal projection onto the space of all possible approximations. Therefore, it is the minimum value of the error between v_π and any v̂(w).
We next prove some useful facts, which have already been used in the above proof.
Properties of matrix weighted norms. By definition, ‖x‖_D = √(x^T D x) = ‖D^{1/2} x‖_2. The induced matrix norm is ‖A‖_D = max_{x≠0} ‖Ax‖_D / ‖x‖_D = ‖D^{1/2} A D^{−1/2}‖_2. For matrices A, B with appropriate dimensions, we have ‖ABx‖_D ≤ ‖A‖_D ‖B‖_D ‖x‖_D. To see that, ‖ABx‖_D = ‖D^{1/2} A B x‖_2 = ‖D^{1/2} A D^{−1/2} D^{1/2} B D^{−1/2} D^{1/2} x‖_2 ≤ ‖D^{1/2} A D^{−1/2}‖_2 ‖D^{1/2} B D^{−1/2}‖_2 ‖D^{1/2} x‖_2 = ‖A‖_D ‖B‖_D ‖x‖_D.

Proof of ‖M‖_D = 1. This is valid because ‖M‖_D = ‖Φ(Φ^T D Φ)^{−1} Φ^T D‖_D = ‖D^{1/2} Φ(Φ^T D Φ)^{−1} Φ^T D D^{−1/2}‖_2 = 1, where the last equality is valid because the matrix inside the L_2-norm is an orthogonal projection matrix and the L_2-norm of any orthogonal projection matrix is equal to one.
Proof of ‖P_π x‖_D ≤ ‖x‖_D for any x ∈ R^n. First,

‖P_π x‖_D^2 = x^T P_π^T D P_π x = \sum_{i,j} x_i [P_π^T D P_π]_{ij} x_j = \sum_{i,j} x_i (\sum_k [P_π^T]_{ik} [D]_{kk} [P_π]_{kj}) x_j = \sum_k d_π(s_k) (\sum_j [P_π]_{kj} x_j)^2.

Since each row of P_π sums to one, the convexity of the square function gives (\sum_j [P_π]_{kj} x_j)^2 ≤ \sum_j [P_π]_{kj} x_j^2. Therefore,

‖P_π x‖_D^2 ≤ \sum_k d_π(s_k) \sum_j [P_π]_{kj} x_j^2 = \sum_j (\sum_k d_π(s_k) [P_π]_{kj}) x_j^2 = \sum_j d_π(s_j) x_j^2 = ‖x‖_D^2,

where the second-to-last equality uses the stationarity d_π^T P_π = d_π^T.
Least-squares TD
We next introduce an algorithm called least-squares TD (LSTD) [57]. Like the TD-Linear
algorithm, LSTD aims to minimize the projected Bellman error. However, it has some
advantages over the TD-Linear algorithm.
Recall that the optimal parameter for minimizing the projected Bellman error is w^* = A^{−1} b, where A = Φ^T D(I − γP_π)Φ and b = Φ^T D r_π. In fact, it follows from (8.27) that

A = E[φ(s_t)(φ(s_t) − γ φ(s_{t+1}))^T],   b = E[r_{t+1} φ(s_t)],

where s_t is distributed according to d_π. The above two expressions show that A and b are expectations involving s_t, s_{t+1}, and r_{t+1}. The idea of
LSTD is simple: if we can use random samples to directly obtain the estimates of A and
b, which are denoted as Â and b̂, then the optimal parameter can be directly estimated
as w∗ ≈ Â−1 b̂.
In particular, suppose that (s0 , r1 , s1 , . . . , st , rt+1 , st+1 , . . . ) is a trajectory obtained
by following a given policy π. Let Ât and b̂t be the estimates of A and b at time t,
respectively. They are calculated as the averages of the samples:
Â_t = \sum_{k=0}^{t−1} φ(s_k)(φ(s_k) − γ φ(s_{k+1}))^T,

b̂_t = \sum_{k=0}^{t−1} r_{k+1} φ(s_k).   (8.34)

The parameter is then estimated as w_t = Â_t^{−1} b̂_t.
The reader may wonder if a coefficient of 1/t is missing on the right-hand side of (8.34).
In fact, it is omitted for the sake of simplicity since the value of wt remains the same
when it is omitted. Since Ât may not be invertible especially when t is small, Ât is usually
biased by a small constant matrix σI, where I is the identity matrix and σ is a small
positive number.
The advantage of LSTD is that it uses experience samples more efficiently and con-
verges faster than the TD method. That is because this algorithm is specifically designed
based on the knowledge of the optimal solution’s expression. The better we understand
a problem, the better algorithms we can design.
The disadvantages of LSTD are as follows. First, it can only estimate state values.
By contrast, the TD algorithm can be extended to estimate action values as shown in the
next section. Moreover, while the TD algorithm allows nonlinear approximators, LSTD
does not. That is because this algorithm is specifically designed based on the expression
of w∗ . Second, the computational cost of LSTD is higher than that of TD since LSTD
updates an m × m matrix in each update step, whereas TD updates an m-dimensional
vector. More importantly, in every step, LSTD needs to compute the inverse of Ât , whose
computational complexity is O(m3 ). The common method for resolving this problem is
to directly update the inverse of Ât rather than updating Ât . In particular, Ât+1 can be
calculated recursively as follows:
Â_{t+1} = \sum_{k=0}^{t} φ(s_k)(φ(s_k) − γ φ(s_{k+1}))^T
       = \sum_{k=0}^{t−1} φ(s_k)(φ(s_k) − γ φ(s_{k+1}))^T + φ(s_t)(φ(s_t) − γ φ(s_{t+1}))^T
       = Â_t + φ(s_t)(φ(s_t) − γ φ(s_{t+1}))^T.
The above expression decomposes Ât+1 into the sum of two matrices. Its inverse can be
calculated as [58]
Â_{t+1}^{−1} = (Â_t + φ(s_t)(φ(s_t) − γ φ(s_{t+1}))^T)^{−1}
           = Â_t^{−1} − (Â_t^{−1} φ(s_t)(φ(s_t) − γ φ(s_{t+1}))^T Â_t^{−1}) / (1 + (φ(s_t) − γ φ(s_{t+1}))^T Â_t^{−1} φ(s_t)).
Therefore, we can directly store and update Â_t^{−1} to avoid the need to calculate the matrix inverse. This recursive algorithm does not require a step size. However, it requires setting the initial value Â_0^{−1}, which can be selected as Â_0^{−1} = σI, where σ is a positive number. A good tutorial on the recursive least-squares approach can be found in [59].
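A minimal sketch of recursive LSTD is shown below, maintaining Â_t^{−1} and b̂_t as running statistics and applying the Sherman-Morrison update derived above; the feature vectors in the usage example are placeholders.

```python
import numpy as np

class RecursiveLSTD:
    """Maintains A_t^{-1} and b_t so that the estimate is w_t = A_t^{-1} b_t."""

    def __init__(self, m, gamma=0.9, sigma=0.01):
        self.gamma = gamma
        self.A_inv = sigma * np.eye(m)   # initial value A_0^{-1} = sigma * I, as in the text
        self.b = np.zeros(m)

    def update(self, phi_s, r, phi_s_next):
        u = phi_s                                  # rank-1 update: A <- A + u v^T
        v = phi_s - self.gamma * phi_s_next
        Au = self.A_inv @ u
        # Sherman-Morrison update of the inverse.
        self.A_inv -= np.outer(Au, v @ self.A_inv) / (1.0 + v @ Au)
        self.b += r * phi_s

    @property
    def w(self):
        return self.A_inv @ self.b

# Usage with placeholder 3-dimensional features.
est = RecursiveLSTD(m=3)
est.update(np.array([1.0, 0.2, 0.4]), r=-1.0, phi_s_next=np.array([1.0, 0.4, 0.4]))
print(est.w)
```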
8.3 TD learning of action values based on function approximation

The TD algorithm in (8.13) can be extended to estimate the action values of a given policy. Replacing state values with action values yields the Sarsa algorithm with function approximation:

w_{t+1} = w_t + α_t [r_{t+1} + γ q̂(s_{t+1}, a_{t+1}, w_t) − q̂(s_t, a_t, w_t)] ∇_w q̂(s_t, a_t, w_t),   (8.35)

where q̂(s, a, w) is the approximated action value of (s, a) and (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) is an experience sample.
The analysis of (8.35) is similar to that of (8.13) and is omitted here. When linear
functions are used, we have q̂(s, a, w) = φ^T(s, a) w, where φ(s, a) is the feature vector of the state-action pair (s, a), and hence ∇_w q̂(s, a, w) = φ(s, a).
Figure 8.9: Sarsa with linear function approximation. The plots show the total reward and the length of each episode versus the episode index. Here, γ = 0.9, ε = 0.1, r_boundary = r_forbidden = −10, r_target = 1, and α = 0.001.
Algorithm 8.2: Sarsa with function approximation

Initialization: Initial parameter w_0. Initial policy π_0. α_t = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal policy that can lead the agent to the target state from an initial state s_0.

For each episode, do
    Generate a_0 at s_0 following π_0(s_0)
    If s_t (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (r_{t+1}, s_{t+1}, a_{t+1}) given (s_t, a_t): generate r_{t+1}, s_{t+1} by interacting with the environment; generate a_{t+1} following π_t(s_{t+1}).
        Update q-value:
            w_{t+1} = w_t + α_t [r_{t+1} + γ q̂(s_{t+1}, a_{t+1}, w_t) − q̂(s_t, a_t, w_t)] ∇_w q̂(s_t, a_t, w_t)
        Update policy:
            π_{t+1}(a|s_t) = 1 − ε(|A(s_t)| − 1)/|A(s_t)| if a = arg max_{a∈A(s_t)} q̂(s_t, a, w_{t+1})
            π_{t+1}(a|s_t) = ε/|A(s_t)| otherwise
        s_t ← s_{t+1}, a_t ← a_{t+1}
Q-learning can also be extended to the case of function approximation. Its update rule is

w_{t+1} = w_t + α_t [r_{t+1} + γ max_{a∈A(s_{t+1})} q̂(s_{t+1}, a, w_t) − q̂(s_t, a_t, w_t)] ∇_w q̂(s_t, a_t, w_t).   (8.36)

The above update rule is similar to (8.35) except that q̂(s_{t+1}, a_{t+1}, w_t) in (8.35) is replaced with max_{a∈A(s_{t+1})} q̂(s_{t+1}, a, w_t).
Similar to the tabular case, (8.36) can be implemented in either an on-policy or
off-policy fashion. An on-policy version is given in Algorithm 8.3. An example for
demonstrating the on-policy version is shown in Figure 8.10. In this example, the task is
to find a good policy that can lead the agent to the target state from the top-left state.
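The two core steps of Algorithm 8.2, the q-value update in (8.35) with a linear approximator and the ε-greedy policy update, can be sketched as follows; the feature vectors and q-values in the usage example are placeholders rather than the ones used in the book's experiment.

```python
import numpy as np

def sarsa_fa_update(w, phi_sa, phi_sa_next, r, gamma=0.9, alpha=0.001):
    """One parameter update of (8.35) in the linear case q_hat(s, a, w) = phi(s, a)^T w."""
    td_error = r + gamma * phi_sa_next @ w - phi_sa @ w
    return w + alpha * td_error * phi_sa

def epsilon_greedy(q_values, epsilon=0.1):
    """Action probabilities of the epsilon-greedy policy with respect to given q-values."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] = 1 - epsilon * (n - 1) / n
    return probs

# Toy usage with made-up 4-dimensional state-action features and 5 actions.
w = np.zeros(4)
w = sarsa_fa_update(w, phi_sa=np.ones(4), phi_sa_next=np.full(4, 0.5), r=-1.0)
q_at_s = np.array([0.1, -0.2, 0.05, 0.0, -0.1])   # placeholder q-values at one state
print(epsilon_greedy(q_at_s))                     # probabilities sum to one
```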
Algorithm 8.3: Q-learning with function approximation (on-policy version)

Initialization: Initial parameter w_0. Initial policy π_0. α_t = α > 0 for all t. ε ∈ (0, 1).
Goal: Learn an optimal path that can lead the agent to the target state from an initial state s_0.

For each episode, do
    If s_t (t = 0, 1, 2, . . . ) is not the target state, do
        Collect the experience sample (a_t, r_{t+1}, s_{t+1}) given s_t: generate a_t following π_t(s_t); generate r_{t+1}, s_{t+1} by interacting with the environment.
        Update q-value:
            w_{t+1} = w_t + α_t [r_{t+1} + γ max_{a∈A(s_{t+1})} q̂(s_{t+1}, a, w_t) − q̂(s_t, a_t, w_t)] ∇_w q̂(s_t, a_t, w_t)
        Update policy:
            π_{t+1}(a|s_t) = 1 − ε(|A(s_t)| − 1)/|A(s_t)| if a = arg max_{a∈A(s_t)} q̂(s_t, a, w_{t+1})
            π_{t+1}(a|s_t) = ε/|A(s_t)| otherwise
As can be seen, Q-learning with linear function approximation can successfully learn an
optimal policy. Here, linear Fourier basis functions of order five are used. The off-policy
version will be demonstrated when we introduce deep Q-learning in Section 8.4.
Figure 8.10: Q-learning with linear function approximation. The plots show the total reward and the length of each episode versus the episode index. Here, γ = 0.9, ε = 0.1, r_boundary = r_forbidden = −10, r_target = 1, and α = 0.001.
One may notice in Algorithm 8.2 and Algorithm 8.3 that, although the values are
represented as functions, the policy π(a|s) is still represented as a table. Thus, it still
assumes finite numbers of states and actions. In Chapter 9, we will see that the policies
can be represented as functions so that continuous state and action spaces can be handled.
8.4 Deep Q-learning

We next introduce deep Q-learning, which is one of the earliest and most successful deep reinforcement learning algorithms. Notably, the neural
networks do not have to be deep. For simple tasks such as our grid world examples,
shallow networks with one or two hidden layers may be sufficient.
Deep Q-learning can be viewed as an extension of the algorithm in (8.36). However,
its mathematical formulation and implementation techniques are substantially different
and deserve special attention.
In particular, deep Q-learning aims to minimize the objective function

J(w) = E[(R + γ max_{a∈A(S')} q̂(S', a, w) − q̂(S, A, w))^2],   (8.37)

where (S, A, R, S') are random variables that denote a state, an action, the immediate
reward, and the next state, respectively. This objective function can be viewed as the
squared Bellman optimality error. That is because
q(s, a) = E[R_{t+1} + γ max_{a∈A(S_{t+1})} q(S_{t+1}, a) | S_t = s, A_t = a],   for all s, a,

is the Bellman optimality equation (the proof is given in Box 7.5). Therefore, R + γ max_{a∈A(S')} q̂(S', a, w) − q̂(S, A, w) should equal zero in the expectation sense when q̂(S, A, w) can accurately approximate the optimal action values.
To minimize the objective function in (8.37), we can use the gradient descent algorithm. To that end, we need to calculate the gradient of J with respect to w. It is noted that the parameter w appears not only in q̂(S, A, w) but also in y ≐ R + γ max_{a∈A(S')} q̂(S', a, w). As a result, it is nontrivial to calculate the gradient. For the sake of simplicity, it is assumed that the value of w in y is fixed (for a short period of time) so that the calculation of the gradient becomes much easier. In particular, we introduce two networks: one is a main network representing q̂(s, a, w) and the other is a target network q̂(s, a, w_T). The objective function in this case becomes

J = E[(R + γ max_{a∈A(S')} q̂(S', a, w_T) − q̂(S, A, w))^2],
where w_T is the parameter of the target network. When w_T is treated as fixed, the gradient of J with respect to w can be calculated as

∇_w J = −2 E[(R + γ max_{a∈A(S')} q̂(S', a, w_T) − q̂(S, A, w)) ∇_w q̂(S, A, w)].   (8.38)

Two techniques are important for implementing deep Q-learning: the use of two networks and experience replay.
The first technique is to use two networks, a main network and a target network,
as mentioned when we calculate the gradient in (8.38). The implementation details
are explained below. Let w and wT denote the parameters of the main and target
networks, respectively. They are initially set to the same value.
In every iteration, we draw a mini-batch of samples {(s, a, r, s0 )} from the replay buffer
(the replay buffer will be explained soon). The inputs of the main network are s and
a. The output y = q̂(s, a, w) is the estimated q-value. The target value of the output is y_T ≐ r + γ max_{a∈A(s')} q̂(s', a, w_T). The main network is updated to minimize the TD error (also called the loss function) \sum (y − y_T)^2 over the samples {(s, a, y_T)}.
Updating w in the main network does not explicitly use the gradient in (8.38). Instead,
it relies on the existing software tools for training neural networks. As a result, we
need a mini-batch of samples to train a network instead of using a single sample to
update the main network based on (8.38). This is one notable difference between deep
and nondeep reinforcement learning algorithms.
The main network is updated in every iteration. By contrast, the target network is
set to be the same as the main network every certain number of iterations to satisfy
the assumption that wT is fixed when calculating the gradient in (8.38).
The second technique is experience replay [22, 60, 62]. That is, after we have collected
some experience samples, we do not use these samples in the order they were collected.
Instead, we store them in a dataset called the replay buffer. In particular, let (s, a, r, s') be an experience sample and B ≐ {(s, a, r, s')} be the replay buffer. Every time we
update the main network, we can draw a mini-batch of experience samples from the
replay buffer. The draw of samples, or called experience replay, should follow a uniform
distribution.
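A replay buffer with uniform sampling can be implemented in a few lines. This is a generic sketch, not the specific implementation used for the experiments in this chapter.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s') tuples and draws uniformly distributed mini-batches."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples are discarded when full

    def add(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation of the collected sequence.
        return random.sample(list(self.buffer), batch_size)

buffer = ReplayBuffer()
for k in range(5):
    buffer.add(s=k, a=0, r=-1.0, s_next=k + 1)
print(buffer.sample(batch_size=3))
```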
Why is experience replay necessary in deep Q-learning, and why must the replay
follow a uniform distribution? The answer lies in the objective function in (8.37).
In particular, to well define the objective function, we must specify the probability
distributions for S, A, R, S 0 . The distributions of R and S 0 are determined by the
system model once (S, A) is given. The simplest way to describe the distribution of
the state-action pair (S, A) is to assume it to be uniformly distributed.
However, the state-action samples may not be uniformly distributed in practice since they are generated as a sample sequence according to the behavior policy. It is
necessary to break the correlation between the samples in the sequence to satisfy the
assumption of uniform distribution. To do this, we can use the experience replay tech-
nique by uniformly drawing samples from the replay buffer. This is the mathematical
reason why experience replay is necessary and why experience replay must follow a
uniform distribution. A benefit of random sampling is that each experience sample
may be used multiple times, which can increase the data efficiency. This is especially important when we have a limited amount of data.

Algorithm 8.4: Deep Q-learning (off-policy version)

Initialization: A main network and a target network with the same initial parameter.
Goal: Learn an optimal target network to approximate the optimal action values from the experience samples generated by a given behavior policy π_b.

Store the experience samples generated by π_b in a replay buffer B = {(s, a, r, s')}
For each iteration, do
    Uniformly draw a mini-batch of samples from B
    For each sample (s, a, r, s'), calculate the target value as y_T = r + γ max_{a∈A(s')} q̂(s', a, w_T), where w_T is the parameter of the target network
    Update the main network to minimize (y_T − q̂(s, a, w))^2 using the mini-batch of samples
    Set w_T = w every C iterations
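A compact PyTorch-style sketch of one training step of Algorithm 8.4 is given below. The network architecture, optimizer, learning rate, and the way state-action pairs are encoded as network inputs are illustrative assumptions, not the configuration used to produce Figure 8.11.

```python
import torch
import torch.nn as nn

# Main and target networks: input is a (state, action) feature vector, output a scalar q-value.
def make_net(in_dim=3):
    return nn.Sequential(nn.Linear(in_dim, 100), nn.ReLU(), nn.Linear(100, 1))

main_net, target_net = make_net(), make_net()
target_net.load_state_dict(main_net.state_dict())        # same initial parameters
optimizer = torch.optim.Adam(main_net.parameters(), lr=1e-3)
gamma, C = 0.9, 10

def train_step(batch, actions, iteration):
    """batch: tensors s (N,2), a (N,1), r (N,1), s_next (N,2); actions: list of action codes."""
    s, a, r, s_next = batch
    with torch.no_grad():
        # y_T = r + gamma * max_a' q_hat(s', a', w_T), computed with the target network.
        q_next = torch.stack([target_net(torch.cat([s_next, torch.full_like(a, act)], dim=1))
                              for act in actions], dim=0)
        y_T = r + gamma * q_next.max(dim=0).values
    q = main_net(torch.cat([s, a], dim=1))                # q_hat(s, a, w) from the main network
    loss = nn.functional.mse_loss(q, y_T)                 # minimize (y_T - q_hat)^2 over the batch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if iteration % C == 0:                                # sync the target network every C iterations
        target_net.load_state_dict(main_net.state_dict())
    return loss.item()

# Dummy usage with a random mini-batch (2-d states, actions encoded as 0..4):
N = 8
batch = (torch.rand(N, 2), torch.randint(0, 5, (N, 1)).float(),
         -torch.ones(N, 1), torch.rand(N, 2))
print(train_step(batch, actions=[0.0, 1.0, 2.0, 3.0, 4.0], iteration=1))
```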
Figure 8.11: Optimal policy learning via deep Q-learning. (a) The behavior policy. (b) An episode with 1,000 steps. (c) The final learned policy. (d) The loss function converges to zero. (e) The value error converges to zero. Here, γ = 0.9, r_boundary = r_forbidden = −10, and r_target = 1. The batch size is 100.
The q-value network used in this example can also be designed in other ways. For example, it can have two inputs and
five outputs, where the two inputs are the normalized row and column of a state and the
outputs are the five estimated action values for the input state [22].
As shown in Figure 8.11(d), the loss function, defined as the average squared TD
error of each mini-batch, converges to zero, meaning that the network can fit the training
samples well. As shown in Figure 8.11(e), the state value estimation error also converges
to zero, indicating that the estimates of the optimal action values become sufficiently
accurate. Then, the corresponding greedy policy is optimal.
This example demonstrates the high efficiency of deep Q-learning. In particular, a
short episode of 1,000 steps is sufficient for obtaining an optimal policy here. By contrast,
an episode with 100,000 steps is required by tabular Q-learning, as shown in Figure 7.4.
One reason for the high efficiency is that the function approximation method has a strong
generalization ability. Another reason is that the experience samples can be repeatedly
used.
We next deliberately challenge the deep Q-learning algorithm by considering a scenario
with fewer experience samples. Figure 8.12 shows an example of an episode with merely
100 steps. In this example, although the network can still be well-trained in the sense
that the loss function converges to zero, the state estimation error cannot converge to zero. That means the network can properly fit the given experience samples, but the experience samples are too few to accurately estimate the optimal action values.

Figure 8.12: Optimal policy learning via deep Q-learning. (a) The behavior policy. (b) An episode with 100 steps. (c) The final learned policy. (d) The loss function converges to zero. (e) The value error does not converge to zero. Here, γ = 0.9, r_boundary = r_forbidden = −10, and r_target = 1. The batch size is 50.
8.5 Summary
This chapter continued introducing TD learning algorithms. However, it switches from
the tabular method to the function approximation method. The key to understanding
the function approximation method is to know that it is an optimization problem. The
simplest objective function is the squared error between the true state values and the
estimated values. There are also other objective functions such as the Bellman error
and the projected Bellman error. We have shown that the TD-Linear algorithm actually
minimizes the projected Bellman error. Several algorithms based on value function approximation, such as Sarsa and Q-learning, have also been introduced.
One reason why the value function approximation method is important is that it allows
artificial neural networks to be integrated with reinforcement learning. For example,
deep Q-learning is one of the most successful deep reinforcement learning algorithms.
Although neural networks have been widely used as nonlinear function approximators,
this chapter provides a comprehensive introduction to the linear function case. Fully
understanding the linear case is important for better understanding the nonlinear case.
Interested readers may refer to [63] for a thorough analysis of TD learning algorithms
with function approximation. A more theoretical discussion on deep Q-learning can be
found in [61].
An important concept named stationary distribution is introduced in this chapter.
The stationary distribution plays an important role in defining an appropriate objective
function in the value function approximation method. It also plays a key role in Chapter 9
when we use functions to approximate policies. An excellent introduction to this topic
can be found in [49, Chapter IV]. The contents of this chapter heavily rely on matrix
analysis. Some results are used without explanation. Excellent references regarding
matrix analysis and linear algebra can be found in [4, 48].
8.6 Q&A
Q: What is the difference between the tabular and function approximation methods?
A: One important difference is how a value is updated and retrieved.
How to retrieve a value: When the values are represented by a table, if we would like
to retrieve a value, we can directly read the corresponding entry in the table. However,
when the values are represented by a function, we need to input the state index s into
the function and calculate the function value. If the function is an artificial neural
network, a forward propagation process from the input to the output is needed.
How to update a value: When the values are represented by a table, if we would like
to update one value, we can directly rewrite the corresponding entry in the table.
However, when the values are represented by a function, we must update the function
parameter to change the values indirectly.
Q: What are the advantages of the function approximation method over the tabular
method?
A: Due to the way state values are retrieved, the function approximation method is
more efficient in storage. In particular, while the tabular method needs to store |S|
values, the function approximation method only needs to store a parameter vector
whose dimension is usually much less than |S|.
Due to the way in which state values are updated, the function approximation method
has another merit: its generalization ability is stronger than that of the tabular
method. The reason is as follows. With the tabular method, updating one state
value would not change the other state values. However, with the function approx-
imation method, updating the function parameter affects the values of many states.
Therefore, the experience sample for one state can generalize to help estimate the
values of other states.
Q: Can we unify the tabular and the function approximation methods?
A: Yes. The tabular method can be viewed as a special case of the function approxi-
mation method. The related details can be found in Box 8.2.
Q: What is the stationary distribution and why is it important?
A: The stationary distribution describes the long-term behavior of a Markov decision
process. More specifically, after the agent executes a given policy for a sufficiently long
period, the probability of the agent visiting a state can be described by this stationary
distribution. More information can be found in Box 8.1.
The reason why this concept emerges in this chapter is that it is necessary for defining
a valid objective function. In particular, the objective function involves the probability
distribution of the states, which is usually selected as the stationary distribution. The
stationary distribution is important not only for the value approximation method but
also for the policy gradient method, which will be introduced in Chapter 9.
Q: What are the advantages and disadvantages of the linear function approximation
method?
A: Linear function approximation is the simplest case whose theoretical properties
can be thoroughly analyzed. However, the approximation ability of this method is
limited. It is also nontrivial to select appropriate feature vectors for complex tasks.
By contrast, artificial neural networks can be used to approximate values as black-box
universal nonlinear approximators, which are more friendly to use. Nevertheless, it
is still meaningful to study the linear case to better grasp the idea of the function
approximation method. Moreover, the linear case is powerful in the sense that the
tabular method can be viewed as a special linear case (Box 8.2).
Q: Why does deep Q-learning require experience replay?
A: The reason lies in the objective function in (8.37). In particular, to well define
the objective function, we must specify the probability distributions of S, A, R, S 0 .
The distributions of R and S 0 are determined by the system model once (S, A) is
given. The simplest way to describe the distribution of the state-action pair (S, A)
is to assume it to be uniformly distributed. However, the state-action samples may
not be uniformly distributed in practice since they are generated as a sequence by the
behavior policy. It is necessary to break the correlation between the samples in the
sequence to satisfy the assumption of uniform distribution. To do this, we can use
the experience replay technique by uniformly drawing samples from the replay buffer.
A benefit of experience replay is that each experience sample may be used multiple
times, which can increase the data efficiency.