Average-reward Model-free Reinforcement Learning: A Systematic Review and Literature Mapping
Vektor Dewanto1 , George Dunn2 , Ali Eshragh2, 4 , Marcus Gallagher1 , and Fred Roosta3, 4
1 School of Information Technology and Electrical Engineering, University of Queensland, AU
2 School of Mathematical and Physical Sciences, University of Newcastle, AU
3 School of Mathematics and Physics, University of Queensland, AU
4 International Computer Science Institute, Berkeley, CA, USA
[email protected], [email protected],
{marcusg,fred.roosta}@uq.edu.au, [email protected]
Abstract
Reinforcement learning is an important part of artificial intelligence. In this paper,
we review model-free reinforcement learning that utilizes the average reward op-
timality criterion in the infinite horizon setting. Motivated by the solo survey by
Mahadevan (1996a), we provide an updated review of work in this area and extend
it to cover policy-iteration and function approximation methods (in addition to the
value-iteration and tabular counterparts). We present a comprehensive literature
mapping. We also identify and discuss opportunities for future work.
1 Introduction
Reinforcement learning (RL) is one promising approach to the problem of making sequential deci-
sions under uncertainty. Such a problem is often formulated as a Markov decision process (MDP)
with a state set S, an action set A, a reward set R, and a decision-epoch set T . At each decision-
epoch (i.e. timestep) t ∈ T , a decision maker (henceforth, an agent) is at a state st ∈ S, and
chooses to then execute an action at ∈ A. Consequently, the agent arrives at the next state st+1 and
earns an (immediate) reward rt+1 . For t = 0, 1, . . . , tmax with tmax ≤ ∞, the agent experiences
a sequence $S_0, A_0, R_1, S_1, A_1, R_2, \ldots, S_{t_{\max}}$. Here, $S_0$ is drawn from an initial state distribution $p_{s_0}$, whereas $s_{t+1}$ and $r_{t+1}$ are governed by the environment dynamics, which is fully specified by the state-transition probability $p(s'|s,a) := \Pr\{S_{t+1}=s' \mid S_t=s, A_t=a\}$ and the reward function $r(s,a,s') = \sum_{r\in\mathcal{R}} \Pr\{r \mid s,a,s'\}\cdot r$.
The solution to the decision making problem is a mapping from every state to a probability distribu-
tion over the set of available actions in that state. This mapping is called a policy, i.e. $\pi: S \times A \mapsto [0,1]$, where $\pi(a_t|s_t) := \Pr\{A_t = a_t \mid S_t = s_t\}$. Thus, solving such a problem amounts to finding the
optimal policy, denoted by π ∗ . The basic optimality criterion asserts that a policy with the largest
value is optimal. That is, v(π ∗ ) ≥ v(π), ∀π ∈ Π, where the function v measures the value of any
policy π in the policy set Π. There are two major ways to value a policy based on the infinite reward
sequence that it generates, namely the average- and discounted-reward policy value formulations.
They induce the average- and discounted-reward optimality criteria, respectively. For an examina-
tion of their relationship, pros, and cons in RL, we refer the readers to (Dewanto and Gallagher,
2021).
Furthermore, RL can be viewed as simulation-based asynchronous approximate dynamic program-
ming (DP) (Gosavi, 2015, Ch 7). Particularly in model-free RL, the simulation is deemed expensive
because it corresponds to direct interaction between an agent and its (real) environment. Model-free
RL mitigates not only the curse of dimensionality (inherently as approximate DP methods), but also
the curse of modeling (since no model learning is required). This is in contrast to model-based RL,
where an agent interacts with the learned model of the environment (Dean et al., 2017; Jaksch et al.,
2010; Tadepalli and Ok, 1998). Model-free RL is fundamental in that it encompasses the bare es-
sentials for updating sequential decision rules through natural agent-environment interaction. The
same update machinery can generally be applied to model-based RL. In practice, we may always
desire a system that runs both model-free and model-based mechanisms, see e.g. the so-called Dyna
architecture (Silver et al., 2008; Sutton, 1990).
This work surveys the existing value- and policy-iteration based average-reward model-free RL. We
begin by reviewing relevant DP (as the root of RL), before progressing to the tabular then function
approximation settings in RL. For comparison, the solo survey on average-reward RL (Mahadevan,
1996a) embraced only the tabular value-iteration based methods. We present our review in Secs 3
and 4, which are accompanied by concise Tables 1 and 2 (in Appendix A) along with literature maps
(Figs 1, 2, 3, and 4 in Appendix B). We then discuss insights and an outlook in Sec 5. Additionally
in Appendix C, we compile environments that were used for evaluation by the existing works.
In order to limit the scope, this work focuses on model-free RL with a single non-hierarchical agent
interacting online with its environment. We do not include works that approximately optimize the
average reward by introducing a discount factor (hence, the corresponding approximation error),
e.g. Schneckenreither (2020); Karimi et al. (2019); Bartlett and Baxter (2002). We also choose not
to examine RL methods that are based on linear programming (Wang, 2017; Neu et al., 2017), and
decentralized learning automata (Chang, 2009; Wheeler and Narendra, 1986).
2 Preliminaries
We assume that the problem we aim to tackle can be well modeled as a Markov decision process
(MDP) with the following properties.
• The state set S and action set A are finite. All states are fully observable, and all actions
a ∈ A are available in every state s ∈ S. The decision-epoch set T = Z≥0 is discrete and
infinite. Thus, we have discrete-time infinite-horizon finite MDPs.
• The state-transition probability p(st+1 |st , at ) and the reward function r(st , at ) are both
stationary (time-homogeneous, fixed over time). Here, $r(s_t, a_t) = \mathbb{E}[r(s_t, a_t, S_{t+1})]$, and it
is uniformly bounded by a finite constant rmax , i.e. |r(s, a)| ≤ rmax < ∞, ∀(s, a) ∈ S ×A.
• The MDP is recurrent, and its optimal policies belong to the stationary policy set ΠS .
Furthermore, every stationary policy π ∈ ΠS of an MDP induces a Markov chain (MC), whose transition (stochastic) matrix is denoted as $\boldsymbol{P}_\pi \in [0,1]^{|S|\times|S|}$. The $[s,s']$-entry of $\boldsymbol{P}_\pi$ indicates the probability of transitioning from a current state s to the next state s' under a policy π. That is, $\boldsymbol{P}_\pi[s,s'] = \sum_{a\in A} \pi(a|s)\, p(s'|s,a)$. The t-th power of $\boldsymbol{P}_\pi$ gives $\boldsymbol{P}_\pi^t$, whose $[s_0,s]$-entry indicates the probability of being in state s in t timesteps when starting from $s_0$ and following π. That is, $\boldsymbol{P}_\pi^t[s_0,s] = \Pr\{S_t = s \mid s_0, \pi\} =: p_\pi^t(s|s_0)$. The limiting distribution of $p_\pi^t$ is given by
$$p^\star_\pi(s|s_0) = \lim_{t_{\max}\to\infty} \frac{1}{t_{\max}} \sum_{t=0}^{t_{\max}-1} p^t_\pi(s|s_0) = \lim_{t_{\max}\to\infty} p^{t_{\max}}_\pi(s|s_0), \quad \forall s \in S, \qquad (1)$$
where the first limit is proven to exist in finite MDPs, while the second limit exists whenever the finite
MDP is aperiodic (Puterman, 1994, Appendix A.4). This limiting state distribution p⋆π is equivalent
to the unique stationary (time-invariant) state distribution that satisfies (p⋆π )⊺ P π = (p⋆π )⊺ , which
may be achieved in finite timesteps. Here, p⋆π ∈ [0, 1]|S| is p⋆π (s|s0 ) stacked together for all s ∈ S.
The expected average reward (also termed the gain) value of a policy π is defined for all $s_0 \in S$ as
$$v_g(\pi, s_0) := \lim_{t_{\max}\to\infty} \frac{1}{t_{\max}}\, \mathbb{E}_{S_t,A_t}\!\left[ \sum_{t=0}^{t_{\max}-1} r(S_t,A_t) \,\Big|\, S_0 = s_0, \pi \right] \qquad (2)$$
$$= \lim_{t_{\max}\to\infty} \sum_{s\in S} \Big\{ \frac{1}{t_{\max}} \sum_{t=0}^{t_{\max}-1} p^t_\pi(s|s_0) \Big\}\, r_\pi(s) = \sum_{s\in S} p^\star_\pi(s|s_0)\, r_\pi(s) \qquad (3)$$
$$= \lim_{t_{\max}\to\infty} \mathbb{E}_{S_{t_{\max}},A_{t_{\max}}}\!\big[ r(S_{t_{\max}}, A_{t_{\max}}) \,\big|\, S_0 = s_0, \pi \big], \qquad (4)$$
where $r_\pi(s) = \sum_{a\in A} \pi(a|s)\, r(s,a)$. The limit in (2) exists when the policy π is stationary, and the
MDP is finite (Puterman, 1994, Prop 8.1.1). Whenever it exists, the equality in (3) follows due to the
existence of limit in (1) and the validity of interchanging the limit and the expectation. The equality
in (4) holds if its limit exists, for instance when π is stationary and the induced MC is aperiodic
(nonetheless, note that even if the induced MC is periodic, the limit in (4) exists for certain reward
structures, see Puterman (1994, Problem 5.2)). In matrix form, the gain can be expressed as
$$\boldsymbol{v}_g(\pi) = \lim_{t_{\max}\to\infty} \frac{1}{t_{\max}}\, \boldsymbol{v}_{t_{\max}}(\pi) = \lim_{t_{\max}\to\infty} \frac{1}{t_{\max}} \sum_{t=0}^{t_{\max}-1} \boldsymbol{P}^t_\pi\, \boldsymbol{r}_\pi = \boldsymbol{P}^\star_\pi\, \boldsymbol{r}_\pi,$$
where $\boldsymbol{r}_\pi \in \mathbb{R}^{|S|}$ is $r_\pi(s)$ stacked together for all s ∈ S. Notice that the gain involves taking the limit of the average of the expected total reward $\boldsymbol{v}_{t_{\max}}$ from t = 0 to $t_{\max}-1$ as $t_{\max} \to \infty$.
Since unichain MDPs have a single chain (i.e. a closed irreducible recurrent class), the stationary
distribution is invariant to initial states. Therefore, P ⋆π has identical rows so that the gain is constant
across all initial states. That is, vg (π) = p⋆π · r π = vg (π, s0 ), ∀s0 ∈ S, hence v g (π) = vg (π) · 1.
The gain vg (π) can be interpreted as the stationary reward because it represents the average reward
per timestep of a system in its steady-state under π.
A policy $\pi^* \in \Pi_S$ is gain-optimal (hereafter, simply called optimal) if
$$v_g(\pi^*, s_0) \ge v_g(\pi, s_0), \quad \forall \pi \in \Pi_S, \forall s_0 \in S, \quad \text{hence} \quad \pi^* \in \operatorname*{argmax}_{\pi\in\Pi_S} v_g(\pi), \qquad (5)$$
where the initial-state argument $s_0$ can be dropped since the gain is constant across initial states (as noted above).
Such an optimal policy π ∗ is also a greedy (hence, deterministic) policy in that it selects actions max-
imizing the RHS of the average-reward Bellman optimality equation (which is useful for deriving
optimal control algorithms) as follows,
$$v_b(\pi^*, s) + v_g(\pi^*) = \max_{a\in A}\Big\{ r(s,a) + \sum_{s'\in S} p(s'|s,a)\, v_b(\pi^*, s') \Big\}, \quad \forall s \in S, \qquad (6)$$
where $v_b$ denotes the bias value. It is defined for all π ∈ ΠS and all $s_0 \in S$ as
$$v_b(\pi, s_0) := \lim_{t_{\max}\to\infty} \mathbb{E}_{S_t,A_t}\!\left[ \sum_{t=0}^{t_{\max}-1} \big( r(S_t,A_t) - v_g(\pi) \big) \,\Big|\, S_0 = s_0, \pi \right] \qquad (7)$$
$$= \lim_{t_{\max}\to\infty} \sum_{t=0}^{t_{\max}-1} \sum_{s\in S} \big( p^t_\pi(s|s_0) - p^\star_\pi(s) \big)\, r_\pi(s) \qquad (8)$$
$$= \underbrace{\sum_{t=0}^{\tau-1} \sum_{s\in S} p^t_\pi(s|s_0)\, r_\pi(s)}_{\text{the expected total reward } v_\tau} - \ \tau v_g(\pi) + \underbrace{\lim_{t_{\max}\to\infty} \sum_{t=\tau}^{t_{\max}-1} \sum_{s\in S} \big( p^t_\pi(s|s_0) - p^\star_\pi(s) \big)\, r_\pi(s)}_{\text{approaches } 0 \text{ as } \tau\to\infty} \qquad (9)$$
$$= \sum_{s\in S} \Big\{ \lim_{t_{\max}\to\infty} \sum_{t=0}^{t_{\max}-1} \big( p^t_\pi(s|s_0) - p^\star_\pi(s) \big) \Big\}\, r_\pi(s) = \sum_{s\in S} d_\pi(s|s_0)\, r_\pi(s), \qquad (10)$$
where all limits are assumed to exist, and (7) is bounded because of the subtraction of $v_g(\pi)$. Whenever exchanging the limit and the expectation is valid in (10), $d_\pi(s|s_0)$ represents the $[s_0,s]$-entry of the non-stochastic deviation matrix $\boldsymbol{D}_\pi := (\boldsymbol{I} - \boldsymbol{P}_\pi + \boldsymbol{P}^\star_\pi)^{-1}(\boldsymbol{I} - \boldsymbol{P}^\star_\pi)$.
The bias vb (π, s0 ) can be interpreted in several ways. Firstly based on (7), the bias is the expected
total difference between the immediate reward r(st , at ) and the stationary reward vg (π) when a
process starts at s0 and follows π. Secondly from (8), the bias indicates the difference between the
expected total rewards of two processes under π: one starts at s0 and the other at an initial state
drawn from p⋆π . Put in another way, it is the difference of the total reward of π and the total reward
that would be earned if the reward per timestep were vg (π). Thirdly, decomposing the timesteps as
in (9) yields vτ (π, s0 ) ≈ vg (π)τ + vb (π, s0 ). This suggests that the bias serves as the intercept of
a line around which the expected total reward vτ oscillates, and eventually converges as τ increases.
Such a line has a slope equal to the gain value. For example, in an MDP with a zero-reward absorbing terminal state (whose gain is 0), the bias equals the expected total reward before the process is absorbed. Lastly, the deviation factor $(p^t_\pi(s|s_0) - p^\star_\pi(s))$ in (10) is non-zero only before the process
reaches its steady-state. Therefore, the bias indicates the transient performance. It may be regarded
as the “transient” reward, whose values are earned during the transient phase.
For any reference state $s_{\mathrm{ref}} \in S$, we can define the (bias) relative value $v_b^{\mathrm{rel}}$ as follows,
$$v_b^{\mathrm{rel}}(\pi, s) := v_b(\pi, s) - v_b(\pi, s_{\mathrm{ref}}) = \lim_{t_{\max}\to\infty} \{ v_{t_{\max}}(\pi, s) - v_{t_{\max}}(\pi, s_{\mathrm{ref}}) \}, \quad \forall \pi \in \Pi_S, \forall s \in S.$$
The right-most equality follows from (7) or (9), since the gain is the same from both s and $s_{\mathrm{ref}}$. It indicates that $v_b^{\mathrm{rel}}$ represents the asymptotic difference in the expected total reward due to starting from s instead of $s_{\mathrm{ref}}$. More importantly, the relative value $v_b^{\mathrm{rel}}$ is equal to $v_b$ up to some constant for any s ∈ S. Moreover, $v_b^{\mathrm{rel}}(\pi, s = s_{\mathrm{ref}}) = 0$. After fixing $s_{\mathrm{ref}}$, therefore, we can substitute $v_b^{\mathrm{rel}}$ for $v_b$ in (6), then uniquely determine $v_b^{\mathrm{rel}}$ for all states. Note that (6) with $v_b$ is originally an underdetermined nonlinear system with |S| equations and |S|+1 unknowns (i.e. one extra unknown, $v_g$).
In practice, we often resort to $v_b^{\mathrm{rel}}$ and abuse the notation $v_b$ to also denote the relative value whenever the context is clear. In a similar fashion, we refer to the bias as the relative/differential value.
One may also refer to the bias as relative values due to (8), average-adjusted values (due to (7)), or
potential values (similar to potential energy in physics whose values differ by a constant (Cao, 2007,
p193)).
For brevity, we often use the following notations, vgπ := vg (π) =: g π , and vg∗ := vg (π ∗ ) =: g ∗ , as
well as vbπ (s) := vb (π, s), and vb∗ (s) := vb (π ∗ , s). Here, vbπ (s) may be read as the relative state
value of s under a policy π.
3 Value-iteration schemes
Based on the Bellman optimality equation (6), we can obtain the optimal policy once we know
the optimal state value v ∗b ∈ R|S| , which includes knowing the optimal gain vg∗ . Therefore, one
approach to optimal control is to (approximately) compute v ∗b , which leads to the value-iteration
algorithm in DP. The value-iteration scheme in RL uses the same principle as its DP counterpart.
However, we approximate the optimal (state-)action value q ∗b ∈ R|S||A| , instead of v ∗b .
In this section, we begin by introducing the foundations of the value-iteration scheme, showing the
progression from DP into RL. We then cover tabular methods by presenting how to estimate the opti-
mal action values iteratively along with numerous approaches to estimating the optimal gain. Lastly,
we examine function approximation methods, where the action value function is parameterized by
a weight vector, and present different techniques of updating the weights. Table 1 (in Appendix A)
summarizes existing average reward model-free RL based on the value-iteration scheme.
3.1 Foundations
In DP, the basic value iteration (VI) algorithm iteratively solves for the optimal state value v ∗b and
outputs an ε-optimal policy; hence it is deemed exact up to the convergence tolerance ε. The basic
VI algorithm proceeds with the following steps.
Step 1: Initialize the iteration index k ← 0 and v̂bk=0 (s) ← 0, where v̂bk (s) ≈ vb∗ (s), ∀s ∈ S.
Also select a small positive convergence tolerance ε > 0.
Note that v̂bk may not correspond to any stationary policy.
Step 2: Perform updates on all iterates (termed synchronous updates) as follows,
$$\hat{v}_b^{k+1}(s) = \mathbb{B}_*[\hat{v}_b^k(s)], \quad \forall s \in S, \qquad \text{(Basic VI: synchronous updates)}$$
where the average-reward Bellman optimality operator $\mathbb{B}_*$ is defined as
$$\mathbb{B}_*[\hat{v}_b^k(s)] := \max_{a\in A}\Big[ r(s,a) + \sum_{s'\in S} p(s'|s,a)\, \hat{v}_b^k(s') \Big], \quad \forall s \in S.$$
Clearly, $\mathbb{B}_*$ is non-linear (due to the max operator), and is derived from (6) with $\hat{v}_g^* \leftarrow 0$.
Step 3: If the span seminorm $\mathrm{sp}(\hat{\boldsymbol{v}}_b^{k+1} - \hat{\boldsymbol{v}}_b^k) > \varepsilon$, then increment k and go to Step 2.
Otherwise, output a greedy policy with respect to the RHS of (6); such a policy is ε-optimal.
Here, $\mathrm{sp}(\boldsymbol{v}) := \max_{s\in S} v(s) - \min_{s'\in S} v(s')$, which measures the range of all components of a vector $\boldsymbol{v}$. Therefore, although the vector changes, its span may be constant.
The mapping in Step 2 of the VI algorithm is not contractive (because there is no discount factor). Moreover, the iterates $\hat{\boldsymbol{v}}_b^k$ can grow very large, leading to numerical instability. Therefore, White (1963) proposed subtracting out the value of an arbitrary-but-fixed reference state $s_{\mathrm{ref}}$ at every iteration. That is,
$$\hat{v}_b^{k+1}(s) = \mathbb{B}_*[\hat{v}_b^k(s)] - \hat{v}_b^k(s_{\mathrm{ref}}), \quad \forall s \in S, \qquad \text{(Relative VI: synchronous updates)}$$
which results in the so-called relative value iteration (RVI) algorithm. This update yields the same span and the same sequence of maximizing actions as that of the basic VI algorithm. Importantly, as k → ∞, the iterate $\hat{v}_b^k(s_{\mathrm{ref}})$ converges to $v_g^*$.
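For concreteness, the following is a minimal sketch of the synchronous RVI recursion above for a known tabular MDP; the transition-tensor and reward-matrix layout, the stopping rule, and the reference-state index are illustrative assumptions rather than part of White's original formulation.

```python
import numpy as np

def relative_value_iteration(P, R, s_ref=0, eps=1e-8, max_iter=10_000):
    """Synchronous relative value iteration for a known unichain MDP.

    P: transition tensor, shape (|S|, |A|, |S|), P[s, a, s'] = p(s'|s, a).
    R: expected rewards, shape (|S|, |A|), R[s, a] = r(s, a).
    Returns a greedy policy, the gain estimate, and the relative values.
    """
    n_states, _, _ = P.shape
    v = np.zeros(n_states)                       # iterate v_b^k, initialized to 0
    for _ in range(max_iter):
        # B*[v](s) = max_a { r(s,a) + sum_s' p(s'|s,a) v(s') }
        q = R + np.einsum('sap,p->sa', P, v)
        v_new = q.max(axis=1) - v[s_ref]         # subtract the reference-state value
        if np.ptp(v_new - v) < eps:              # span seminorm sp(v^{k+1} - v^k)
            v = v_new
            break
        v = v_new
    gain = v[s_ref]                              # v_b^k(s_ref) approaches v_g* as k grows
    q = R + np.einsum('sap,p->sa', P, v)
    policy = q.argmax(axis=1)                    # greedy w.r.t. the RHS of (6)
    return policy, gain, v
```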
The asynchronous version of RVI may diverge (Abounadi et al., 2001, p682; Gosavi, 2015, p232). As a remedy, Jalali and Ferguson (1990) introduced the following update,
$$\hat{v}_b^{k+1}(s) = \mathbb{B}_*[\hat{v}_b^k(s)] - \hat{v}_g^k, \quad \forall (s \neq s_{\mathrm{ref}}) \in S, \qquad \text{(Relative VI: asynchronous updates)}$$
where $\hat{v}_g^k$ is the estimate of $v_g^*$ at iteration k, and $\hat{v}_b^k(s_{\mathrm{ref}}) = 0$ for all iterations k = 0, 1, .... This asynchronous method is shown to converge to a gain-optimal policy in a regenerative process, where there exists a single recurrent state under all stationary and deterministic policies.
In RL, the reward function r(s, a) and the transition distribution p(s'|s, a) are unknown. Therefore, it is convenient to define a relative (state-)action value $\boldsymbol{q}_b(\pi)$ of a policy π as follows,
$$q_b(\pi, s, a) := \lim_{t_{\max}\to\infty} \mathbb{E}_{S_t,A_t}\!\left[ \sum_{t=0}^{t_{\max}-1} \big( r(S_t,A_t) - v_g^\pi \big) \,\Big|\, S_0 = s, A_0 = a, \pi \right]$$
$$= r(s,a) - v_g^\pi + \mathbb{E}[v_b^\pi(S_{t+1})], \quad \forall (s,a) \in S\times A, \qquad (11)$$
where for brevity, we define $q_b^\pi(s,a) := q_b(\pi,s,a)$, as well as $q_b^*(s,a) := q_b(\pi^*,s,a)$. The corresponding average-reward Bellman optimality equation is as follows (Bertsekas, 2012, p549),
$$q_b^*(s,a) + v_g^* = r(s,a) + \sum_{s'\in S} p(s'|s,a)\, \underbrace{\max_{a'\in A} q_b^*(s',a')}_{v_b^*(s')}, \quad \forall (s,a) \in S\times A, \qquad (12)$$
whose RHS is in the same form as the quantity maximized over all actions in (6). Therefore, the
gain-optimal policy π ∗ can be obtained simply by acting greedily over the optimal action value;
hence, π ∗ is deterministic. That is,
$$\pi^*(s) = \operatorname*{argmax}_{a\in A} \underbrace{\Big[ r(s,a) + \sum_{s'\in S} p(s'|s,a)\, \max_{a'\in A} q_b^*(s',a') - v_g^* \Big]}_{q_b^*(s,a)}, \quad \forall s \in S.$$
As can be observed, $\boldsymbol{q}_b^*$ combines the effect of $p(s'|s,a)$ and $\boldsymbol{v}_b^*$ without estimating them separately, at the cost of an increased number of estimated values since typically $|S|\times|A| \gg |S|$. The benefit is that action selection via $\boldsymbol{q}_b^*$ does not require knowledge of r(s, a) and p(s'|s, a). Note that $v_g^*$ is invariant to the state s and action a; hence its involvement in the above maximization has no effect.
Applying the idea of RVI to action values $\boldsymbol{q}_b^*$ yields the following iterate,
$$\hat{q}_b^{k+1}(s,a) = \mathbb{B}_*[\hat{q}_b^k(s,a)] - \hat{v}_b^k(s_{\mathrm{ref}}) \qquad \text{(Relative VI on } \boldsymbol{q}_b^*\text{: asynchronous updates)}$$
$$= \underbrace{r(s,a) + \sum_{s'\in S} p(s'|s,a)\, \max_{a'\in A} \hat{q}_b^k(s',a')}_{\mathbb{B}_*[\hat{q}_b^k(s,a)]} - \underbrace{\max_{a''\in A} \hat{q}_b^k(s_{\mathrm{ref}}, a'')}_{\text{can be interpreted as } \hat{v}_g^*}, \qquad (13)$$
where $\hat{\boldsymbol{q}}_b^k$ denotes the estimate of $\boldsymbol{q}_b^*$ at iteration k, and $\mathbb{B}_*$ is based on (12) so it operates on action values. The iterates of $\hat{v}_b^k(s) = \max_{a\in A}\hat{q}_b^k(s,a)$ are conjectured to converge to $v_b^*(s)$ for all s ∈ S by Bertsekas (2012, Sec 7.2.3).
In RL, the update (13) is carried out from sampled transitions, yielding the stochastic-approximation update
$$\hat{q}_b^*(s,a) \leftarrow \hat{q}_b^*(s,a) + \beta\Big\{ r(s,a) - \hat{v}_g^* + \max_{a'\in A}\hat{q}_b^*(s',a') - \hat{q}_b^*(s,a) \Big\}, \qquad (14)$$
where β is a positive stepsize, whereas s, a, and s' denote the current state, current action, and next state, respectively. Here, the sum over s' in (13), i.e. the expectation with respect to S', is approximated by a single sample s'. The stochastic approximation (SA) based update in (14) is the essence of $Q_b$-learning (Schwartz, 1993, Sec 5), and most of its variants. One exception is that of Prashanth and Bhatnagar (2011, Eqn 8), where there is no subtraction of $\hat{q}_b^*(s,a)$.
In order to prevent the iterate q̂b∗ from becoming very large (causing numerical instability), Singh
(1994, p702) advocated assigning qb∗ (sref , aref ) ← 0, for arbitrary-but-fixed reference state sref and
action aref . Alternatively, Bertsekas and Tsitsiklis (1996, p404) advised setting q̂b∗ (sref , ·) ← 0.
Both suggestions seem to follow the heuristics of obtaining the unique solution of the underdeter-
mined Bellman optimality non-linear system of equations in (6).
There are several ways to approximate the optimal gain vg∗ in (14), as summarized in Fig 1 (Ap-
pendix B). In particular, Abounadi et al. (2001, Sec 2.2) proposed three variants as follows.
i. $\hat{v}_g^* \leftarrow \hat{q}_b^*(s_{\mathrm{ref}}, a_{\mathrm{ref}})$ with a reference state $s_{\mathrm{ref}}$ and action $a_{\mathrm{ref}}$. Yang et al. (2016) argued that properly choosing $s_{\mathrm{ref}}$ can be difficult in that the choice of $s_{\mathrm{ref}}$ affects the learning performance, especially when the state set is large. They proposed setting $\hat{v}_g^* \leftarrow c$ for a constant c from prior knowledge. Moreover, Wan et al. (2020) showed empirically that such a reference retards learning and causes divergence. This happens when $(s_{\mathrm{ref}}, a_{\mathrm{ref}})$ is infrequently visited, e.g. being a transient state in unichain MDPs.
ii. $\hat{v}_g^* \leftarrow \max_{a'\in A} \hat{q}_b^*(s_{\mathrm{ref}}, a')$, used by Prashanth and Bhatnagar (2011, Eqn 8). This inherits the same issue regarding $s_{\mathrm{ref}}$ as before. Whenever the action set is large, the maximization over A should be estimated, yielding another layer of approximation errors.
iii. $\hat{v}_g^* \leftarrow \sum_{(s,a)\in S\times A} \hat{q}_b^*(s,a) / (|S||A|)$, used by Avrachenkov and Borkar (2020, Eqn 11). Averaging all entries of $\hat{\boldsymbol{q}}_b^*$ removes the need for $s_{\mathrm{ref}}$ and $a_{\mathrm{ref}}$. However, because $\hat{q}_b^*$ itself is an estimating function with diverse accuracy across all state-action pairs, the estimate $\hat{v}_g^*$ involves the averaged approximation error. The potential issue due to large state and action sets is also concerning.
Equation (14) with one of three proposals (i - iii) for vg∗ estimators constitutes the RVI Qb -learning.
Although it operates asynchronously, its convergence is assured by decreasing the stepsize β.
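To illustrate how the sampled update and a gain proposal fit together, here is a minimal tabular sketch of RVI Qb-learning; the environment interface (env.reset/env.step), the epsilon-greedy exploration, the constant stepsize, and the use of proposal (i) for the gain estimate are all illustrative assumptions.

```python
import numpy as np

def rvi_q_learning(env, n_states, n_actions, steps=100_000,
                   beta=0.05, epsilon=0.1, s_ref=0, a_ref=0, rng=None):
    """Tabular RVI Qb-learning sketch based on the sampled update (14).

    Assumes env.reset() -> s and env.step(a) -> (s_next, r); the optimal gain
    is estimated with proposal (i), i.e. v_g* ~ q_b*(s_ref, a_ref).
    """
    rng = rng or np.random.default_rng(0)
    q = np.zeros((n_states, n_actions))          # estimate of q_b*
    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy behaviour policy (exploration scheme is an assumption here)
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(q[s].argmax())
        s_next, r = env.step(a)
        gain_est = q[s_ref, a_ref]               # proposal (i) for the gain estimate
        td = r - gain_est + q[s_next].max() - q[s, a]
        q[s, a] += beta * td                     # stochastic-approximation update (14)
        s = s_next
    greedy_policy = q.argmax(axis=1)
    return greedy_policy, q
```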
The optimal gain vg∗ in (14) can also be estimated iteratively via SA as follows,
v̂g∗ ← v̂g∗ + βg ∆g , for some update ∆g and a positive stepsize βg . (15)
Thus, the corresponding Qb -learning becomes 2-timescale SA, involving both β and βg . There exist
several variations on ∆g as listed below.
i. In Schwartz (1993, Sec 5),
$$\Delta_g := r(s,a) + \max_{a'\in A}\hat{q}_b^*(s',a') - \underbrace{\max_{a'\in A}\hat{q}_b^*(s,a')}_{\text{to minimize the variance of updates}} - \hat{v}_g^*, \quad \text{if } a = \underbrace{\operatorname*{argmax}_{a'\in A}\hat{q}_b^*(s,a')}_{\text{a greedy action } a}. \qquad (16)$$
By updating only when greedy actions are executed, the influence of exploratory actions (which are mostly suboptimal) can be avoided.
ii. In Singh (1994, Algo 3), and Wan et al. (2020, Eqn 6),
$$\Delta_g := \underbrace{r(s,a) + \max_{a'\in A}\hat{q}_b^*(s',a') - \hat{q}_b^*(s,a)}_{v_g^*\text{ in expectation of } S' \text{ when using } q_b^* \text{ (see (12))}} - \hat{v}_g^*. \qquad \text{(To update at every action)}$$
Since the equation for $v_g^*$ (12) applies to any state-action pair, it is reasonable to update $\hat{v}_g^*$ for both greedy and exploratory actions as above. This also implies that information from non-greedy actions is not wasted. Hence, it is more sample-efficient than (16).
iii. In Singh (1994, Algo 4), and Das et al. (1999, Eqn 8),
$$\Delta_g := r(s,a) - \hat{v}_g^*, \quad \text{if } a \text{ is chosen greedily, with } \beta_g \text{ set to } 1/(n_u + 1),$$
where $n_u$ denotes the number of $\hat{v}_g^*$ updates (15) so far. This special value of $\beta_g$ makes the estimate equivalent to the sample average of the rewards received for greedy actions.
iv. In Bertsekas and Tsitsiklis (1996, p404), Abounadi et al. (2001, Eqn 2.9b), and Bertsekas (2012, p551),
$$\Delta_g := \underbrace{\max_{a'\in A}\hat{q}_b^*(s_{\mathrm{ref}}, a')}_{\hat{v}_b^*(s_{\mathrm{ref}})}, \quad \text{for an arbitrary reference state } s_{\mathrm{ref}}. \qquad (17)$$
This benefits from having sref such that vb∗ (sref ) can be interpreted as vg∗ , while also satis-
fying the underdetermined system of average-reward Bellman optimality equations.
Particularly, the optimal gain vg∗ is estimated using (15, 17). They highlighted that even if q̂b∗ (w) is
bounded, v̂g∗ may diverge.
Das et al. (1999, Eqn 9) updated the weight using the temporal difference (TD, or TD error) as follows,
$$\boldsymbol{w} \leftarrow \boldsymbol{w} + \beta\, \underbrace{\big\{ r(s,a) - \hat{v}_g^* + \max_{a'\in A}\hat{q}_b^*(s',a';\boldsymbol{w}) - \hat{q}_b^*(s,a;\boldsymbol{w}) \big\}}_{\text{relative TD in terms of } \hat{q}_b^*}\, \nabla\hat{q}_b^*(s,a;\boldsymbol{w}), \qquad (19)$$
which can be interpreted as the parameterized form of (14), and is a semi-gradient update, similar to its discounted-reward counterpart in $Q_\gamma$-learning (Sutton and Barto, 2018, Eqn 16.3). This update is adopted by Yang et al. (2016, Algo 2). The approximation for $v_g^*$ can be performed in various ways, as for the tabular settings in Sec 3.2. Note that (19) differs from (18) in the use of the TD, affecting the update factor. In contrast, Prashanth and Bhatnagar (2011, Eqn 10) leveraged (14) in order to suggest the following update,
$$\boldsymbol{w} \leftarrow \boldsymbol{w} + \beta\, \big\{ r(s,a) - \hat{v}_g^* + \max_{a'\in A}\hat{q}_b^*(s',a';\boldsymbol{w}) \big\}, \quad \text{whose } \hat{v}_g^* \text{ is updated using (15, 17).}$$
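The following is a minimal sketch of one semi-gradient step of the form (19) with a linear action-value approximator; the feature map, the two-timescale gain update with the Δg of Singh (1994, Algo 3), and the stepsizes are illustrative choices among the variants reviewed above.

```python
import numpy as np

def semi_gradient_qb_step(w, g_hat, feat, s, a, r, s_next,
                          n_actions, beta=0.01, beta_g=0.001):
    """One semi-gradient step of (19) with a linear approximator q_b*(s,a;w) = w^T f(s,a).

    feat(s, a) -> feature vector. The linear form, the feature map, and the
    two-timescale SA-based gain update are illustrative choices.
    """
    q_sa = w @ feat(s, a)
    q_next = np.array([w @ feat(s_next, b) for b in range(n_actions)])
    td = r - g_hat + q_next.max() - q_sa        # relative TD in terms of q_b*
    w = w + beta * td * feat(s, a)              # grad of the linear q_b* is f(s,a)
    g_hat = g_hat + beta_g * td                 # gain update (15) with Delta_g = TD
    return w, g_hat
```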
4 Policy-iteration schemes
Instead of iterating towards the value of optimal policies, we can iterate policies directly towards
the optimal one. At each iteration, the current policy iterate is evaluated, then its value is used
to obtain the next policy iterate by taking the greedy action based on the RHS of the Bellman
optimality equation (6). The latter step differentiates this so-called policy iteration from naive policy
enumeration that evaluates and compares policies as prescribed in (5).
Like in the previous Sec 3, we begin with the original policy iteration in DP, which is then gener-
alized in RL. Afterward, we review average-reward policy gradient methods, including actor-critic
variants, because of their prominence and proven empirical successes. The last two sections are
devoted to approximate policy evaluation, namely gain and relative value estimations. Table 2 (in
Appendix A) summarizes existing works on average-reward policy-iteration-based model-free RL.
4.1 Foundations
In DP, Howard (1960) proposed the first policy iteration algorithm to obtain gain optimal policies
for unichain MDPs. It proceeds as follows.
Step 1: Initialize the iteration index k ← 0 and set the initial policy arbitrarily, π̂ k=0 ≈ π ∗ .
Step 2: Perform (exact) policy evaluation:
Solve the following underdetermined linear system for $v_g^k$ and $\boldsymbol{v}_b^k$,
$$v_b^k(s) + v_g^k = r(s,a) + \sum_{s'\in S} p(s'|s,a)\, v_b^k(s'), \quad \forall s \in S, \ \text{with } a = \hat{\pi}^k(s), \qquad (20)$$
which is called the Bellman policy expectation equation, also known as the Poisson equation (Feinberg and Shwartz, 2002, Eqn 9.1).
Step 3: Perform (exact) policy improvement:
Compute a policy $\hat{\pi}^{k+1}$ by greedy action selection (analogous to the RHS of (6)):
$$\hat{\pi}^{k+1}(s) \leftarrow \operatorname*{argmax}_{a\in A} \underbrace{\Big[ r(s,a) + \sum_{s'\in S} p(s'|s,a)\, v_b^k(s') \Big]}_{q_b^k(s,a)\ +\ v_g^k}, \quad \forall s \in S. \qquad \text{(Synchronous updates)}$$
Step 4: If stable, i.e. $\hat{\pi}^{k+1}(s) = \hat{\pi}^k(s), \forall s \in S$, then output $\hat{\pi}^{k+1}$ (which is equivalent to $\pi^*$).
Otherwise, increment k, then go to Step 2.
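For reference, here is a compact sketch of the above policy iteration for a known unichain MDP; pinning v_b(s_ref) = 0 to resolve the underdetermined system (20) and the dense linear solve are our own illustrative choices.

```python
import numpy as np

def howard_policy_iteration(P, R, s_ref=0, max_iter=1000):
    """Policy iteration (Howard, 1960) for a known unichain MDP.

    P: (|S|, |A|, |S|) transition tensor; R: (|S|, |A|) expected rewards.
    Exact policy evaluation solves (20) with the convention v_b(s_ref) = 0.
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    for _ in range(max_iter):
        # --- exact policy evaluation: v_b(s) + v_g = r_pi(s) + sum_s' P_pi[s,s'] v_b(s')
        P_pi = P[np.arange(n_states), policy]    # (|S|, |S|)
        r_pi = R[np.arange(n_states), policy]    # (|S|,)
        A = np.zeros((n_states + 1, n_states + 1))
        A[:n_states, :n_states] = np.eye(n_states) - P_pi
        A[:n_states, -1] = 1.0                   # the +v_g term on the LHS of (20)
        A[-1, s_ref] = 1.0                       # v_b(s_ref) = 0 pins the solution
        b = np.append(r_pi, 0.0)
        x = np.linalg.solve(A, b)
        v_b, v_g = x[:n_states], x[-1]
        # --- exact policy improvement: greedy w.r.t. the RHS of (6)
        q = R + np.einsum('sap,p->sa', P, v_b)
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # stable: output the policy
            return policy, v_g, v_b
        policy = new_policy
    return policy, v_g, v_b
```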
The above exact policy iteration is generalized to the generalized policy iteration (GPI) for RL
(Sutton and Barto, 2018, Sec 4.6). Such generalization is in the sense of the details of policy eval-
uation and improvement, such as approximation (inexact evaluation and inexact improvement) and
the update granularity, ranging from per timestep (incremental, step-wise) to per number (batch) of
timesteps.
GPI relies on the policy improvement theorem (Sutton and Barto, 2018, p101). It assures that any
ǫ-greedy policy with respect to qb (π) is an improvement over any ǫ-soft policy π, i.e. any policy
whose π(a|s) ≥ ǫ/|A|, ∀(s, a) ∈ S × A. Moreover, ǫ-greediness is also beneficial for explo-
ration. There exist several variations on policy improvement that all share a similar idea to ǫ-greedy,
i.e. updating the current policy towards a greedy policy. They include Chen-Yu Wei (2019, Eqn 5),
Abbasi-Yadkori et al. (2019a, Eqn 4), Hao et al. (2020, Eqn 4.1), and approximate gradient-ascent
updates used in policy gradient methods (Sec 4.2).
In this section, we outline policy gradient methods, which have proven empirical successes in function approximation settings (Agarwal et al., 2019; Duan et al., 2016). They require explicit policy parameterization, which enables not only learning appropriate levels of exploration (either control- or parameter-based (Vemula et al., 2019; Miyamae et al., 2010, Fig 1)), but also the injection of domain knowledge (Deisenroth et al., 2013, Sec 1.3). Note that from a sensitivity-based point of view, policy gradient methods belong to perturbation analysis (Cao, 2007, p18).
In policy gradient methods, a policy π is parameterized by a parameter vector θ ∈ Θ = Rdim(θ) ,
where dim(θ) indicates the number of dimensions of θ. In order to obtain a smooth dependence on θ
(hence, smooth gradients), we restrict the policy class to a set of randomized stationary policies ΠSR .
We further assume that π(θ) is twice differentiable with bounded first and second derivatives. For
discrete actions, one may utilize a categorical distribution (a special case of Gibbs/Boltzmann distributions) as follows,
$$\pi(a|s;\theta) = \frac{\exp\big(\theta^\intercal \phi(s,a)\big)}{\sum_{a'\in A} \exp\big(\theta^\intercal \phi(s,a')\big)}, \quad \forall (s,a) \in S\times A,$$
where φ(s, a) is the feature vector of a state-action pair (s, a) for this policy parameterization (Sutton and Barto (2018, Sec 3.1), Bhatnagar et al. (2009a, Eqn 8)). Note that parametric policies
may not contain the optimal policy because there are typically fewer parameters than state-action
pairs, yielding some approximation error.
The policy improvement is based on the following optimization, with $v_g(\theta) := v_g(\pi(\theta))$: maximize $v_g(\theta)$ over $\theta \in \Theta$, approximately via the preconditioned gradient-ascent update
$$\theta \leftarrow \theta + \alpha\, C^{-1}\, \nabla v_g(\theta), \qquad (21)$$
where α is a positive step length, and $C \in \mathbb{R}^{\dim(\theta)\times\dim(\theta)}$ denotes some preconditioning positive definite matrix. Based on (3), we have
$$\nabla v_g(\theta) = \sum_{s\in S}\sum_{a\in A} r(s,a)\, \nabla\{ p^\star_\pi(s)\, \pi(a|s;\theta) \} \qquad (\nabla := \tfrac{\partial}{\partial\theta})$$
$$= \sum_{s\in S}\sum_{a\in A} p^\star_\pi(s)\, \pi(a|s;\theta)\, r(s,a)\, \{ \nabla\log\pi(a|s;\theta) + \nabla\log p^\star_\pi(s) \}$$
$$= \sum_{s\in S}\sum_{a\in A} p^\star_\pi(s)\, \pi(a|s;\theta)\, \underbrace{q_b^\pi(s,a)}_{\text{in lieu of } r(s,a)}\, \nabla\log\pi(a|s;\theta) \qquad \text{(does not involve } \nabla\log p^\star_\pi(s)\text{)}$$
$$= \sum_{s\in S}\sum_{a\in A} p^\star_\pi(s)\, \pi(a|s;\theta) \sum_{s'\in S} p(s'|s,a)\, v_b^\pi(s')\, \nabla\log\pi(a|s;\theta). \qquad (22)$$
The penultimate equation above is due to the (randomized) policy gradient theorem (Sutton et al.
(2000, Thm 1), Marbach and Tsitsiklis (2001, Sec 6), Deisenroth et al. (2013, p28)). The last equa-
tion was proven to be equivalent by Castro and Meir (2010, Appendix B).
There exist (at least) two variants of preconditioning matrices C. First is through the second
derivative C = −∇2 vg (θ), as well as its approximation, see Furmston et al. (2016, Appendix B.1),
(Morimura et al., 2008, Sec 3.3, 5.2), Kakade (2002, Eqn 6). Second, one can use a Riemannian-
metric matrix for natural gradients. It aims to make the update directions invariant to the policy
parameterization. Kakade (2002) first proposed the Fisher information matrix (FIM) as such a ma-
trix, for which an incremental approximation was suggested by Bhatnagar et al. (2009a, Eqn 26). A
generalized variant of natural gradients was introduced by Morimura et al. (2009, 2008). In addition,
Thomas (2014) derived another generalization that allows for a positive semidefinite matrix. Recall
that FIM is only guaranteed to be positive semidefinite (hence, describing a semi-Riemannian man-
ifold); whereas the natural gradient ascent assumes the function being optimized has a Riemannian
manifold domain.
In order to obtain more efficient learning with efficient computation, several works propose using
backward-view eligibility traces in policy parameter updates. The key idea is to keep track of the
eligibility of each component of θ for getting updated whenever a reinforcing event, i.e. qbπ , occurs.
Given a state-action sample (s, a), the update in (21) becomes
$$\theta \leftarrow \theta + \alpha\, C^{-1}\, q_b^\pi(s,a)\, \underbrace{\boldsymbol{e}_\theta}_{\text{the eligibility}}, \quad \text{which is carried out after} \quad \boldsymbol{e}_\theta \leftarrow \lambda_\theta\, \boldsymbol{e}_\theta + \nabla\log\pi(a|s;\theta), \qquad (23)$$
where $\boldsymbol{e}_\theta \in \mathbb{R}^{\dim(\theta)}$ denotes the accumulating eligibility vector for θ and is initialized to 0, whereas $\lambda_\theta \in (0,1)$ is the trace decay factor for θ. This is used by Iwaki and Asada (2019, Sec 4), Sutton and Barto (2018, Sec 13.6), Furmston et al. (2016, Appendix B.1), Degris et al. (2012, Sec III.B), Matsubara et al. (2010a, Sec 4), and Marbach and Tsitsiklis (2001, Sec 5).
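A minimal sketch of the trace-based actor update (23) for a Gibbs (softmax) policy is given below; the identity preconditioner C = I, the feature map phi, the stepsizes, and the caller-supplied reinforcing event are illustrative assumptions.

```python
import numpy as np

def softmax_policy(theta, phi, s, n_actions):
    """pi(a|s; theta) proportional to exp(theta^T phi(s, a)) -- Gibbs parameterization."""
    logits = np.array([theta @ phi(s, a) for a in range(n_actions)])
    logits -= logits.max()                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def actor_trace_step(theta, e_theta, phi, s, a, reinforcement,
                     n_actions, alpha=0.01, lam=0.9):
    """One backward-view actor update (23), with C = I as an illustrative choice.

    `reinforcement` stands for the reinforcing event, e.g. an estimate of
    q_b(s, a) or a relative TD.
    """
    probs = softmax_policy(theta, phi, s, n_actions)
    # grad log pi(a|s;theta) = phi(s,a) - sum_b pi(b|s) phi(s,b) for the Gibbs policy
    grad_log_pi = phi(s, a) - sum(probs[b] * phi(s, b) for b in range(n_actions))
    e_theta = lam * e_theta + grad_log_pi        # accumulate the eligibility
    theta = theta + alpha * reinforcement * e_theta
    return theta, e_theta
```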
As can be observed in (22), computing the gradient ∇vg (θ) exactly requires (exact) knowledge of
p⋆π and qbπ , which are both unknown in RL. It also requires summation over all states and actions,
which becomes an issue when state and action sets are large. As a result, we resort to the gradient
estimate, which if unbiased, leads to stochastic gradient ascent.
In order to reduce the variance of gradient estimates, there are two common techniques based on
control variates (Greensmith et al., 2004, Sec 5, 6).
First is the baseline control variate, for which the optimal baseline is the one that minimizes the variance. Choosing $v_b^\pi$ as a baseline (Bhatnagar et al., 2009a, Lemma 2) yields
$$\underbrace{q_b^\pi(s,a) - v_b^\pi(s)}_{\text{relative action advantage, } a_b^\pi(s,a)} = \mathbb{E}_{S'}\Big[ \underbrace{r(s,a) - v_g^\pi + v_b^\pi(S') - v_b^\pi(s)}_{\text{relative TD, } \delta^\pi_{v_b}(s,a,S')} \,\Big|\, s, a \Big], \quad \forall (s,a)\in S\times A, \qquad (24)$$
where S' denotes the next state given the current state s and action a. It has been shown that $a_b^\pi(s,a) = \mathbb{E}_{S'}\big[\delta^\pi_{v_b}(s,a,S')\big]$, meaning that the TD, i.e. $\delta^\pi_{v_b}(s,a,S')$, can be used as an unbiased estimate of the action advantage (Iwaki and Asada, 2019, Prop 1; Castro and Meir, 2010, Thm 5; Bhatnagar et al., 2009a, Lemma 3). Hence, we have
$$\nabla v_g(\theta) = \mathbb{E}_{S,A}\big[ a_b^\pi(S,A)\, \nabla\log\pi(A|S;\theta) \big] \approx \delta^\pi_{v_b}(s,a,s')\, \nabla\log\pi(a|s;\theta),$$
where the expectation over S, A, and S' is approximated with a single sample (s, a, s'). This yields an unbiased gradient estimate with lower variance than that using $q_b^\pi$. Note that in RL, the exact $a_b^\pi$ and $v_b^\pi$ (for calculating $\delta^\pi_{v_b}$) should also be approximated.
Second, a policy value estimator is set up and often also parameterized by w ∈ W = Rdim(w) . This
gives rise to actor-critic methods, where “actor” refers to the parameterized policy, whereas “critic”
the parameterized value estimator. The critic can take either one of these three forms,
i. relative state-value estimator v̂bπ (s; wv ) for computing the relative TD approximate δ̂vπb ,
ii. both relative state- and action-value estimators for âπb (s, a) ← q̂bπ (s, a; wq ) − v̂bπ (s; wv ),
iii. relative action-advantage estimator âπb (s, a; w a ),
for all s ∈ S, a ∈ A, and with the corresponding critic parameter vectors wv , wq , and wa . This
parametric value approximation is reviewed in Sec 4.4.
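As an illustration of how the actor and a critic of form (i) interact, below is a minimal sketch of an average-reward actor-critic that uses the relative TD (24) as the advantage estimate, tracks the gain incrementally (Sec 4.3), and updates a linear critic by semi-gradient TD (Sec 4.4); the environment interface, feature maps, and stepsizes are illustrative assumptions rather than a specific published algorithm.

```python
import numpy as np

def average_reward_actor_critic(env, phi_sa, phi_s, dim_theta, dim_w, n_actions,
                                steps=200_000, alpha=0.001, beta_v=0.01,
                                beta_g=0.001, rng=None):
    """Critic of form (i): a linear relative state-value estimator v_b(s; w) = w^T phi_s(s).

    env.reset() -> s, env.step(a) -> (s_next, r); phi_sa and phi_s are feature maps.
    All interfaces and stepsizes here are illustrative assumptions.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.zeros(dim_theta)                  # actor parameters
    w = np.zeros(dim_w)                          # linear critic parameters
    g_hat = 0.0                                  # gain estimate
    s = env.reset()
    for _ in range(steps):
        # sample a ~ pi(.|s; theta), a Gibbs policy over phi_sa
        logits = np.array([theta @ phi_sa(s, b) for b in range(n_actions)])
        logits -= logits.max()
        probs = np.exp(logits)
        probs /= probs.sum()
        a = int(rng.choice(n_actions, p=probs))
        s_next, r = env.step(a)
        # relative TD (24): an unbiased estimate of the action advantage
        delta = r - g_hat + w @ phi_s(s_next) - w @ phi_s(s)
        g_hat += beta_g * delta                  # incremental gain tracking (Sec 4.3)
        w += beta_v * delta * phi_s(s)           # semi-gradient critic step (Sec 4.4)
        # actor: stochastic gradient ascent on the gain, with the TD as the reinforcement
        grad_log_pi = phi_sa(s, a) - sum(probs[b] * phi_sa(s, b) for b in range(n_actions))
        theta += alpha * delta * grad_log_pi
        s = s_next
    return theta, w, g_hat
```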
We classify gain approximations into incremental and batch categories. Incremental methods update their current estimates using information only from one timestep, hence step-wise updates. In contrast, batch methods use information from multiple timesteps in a batch. Every update, therefore, has to wait until the batch is available. Notationally, we write $g^\pi := v_g(\pi)$. The superscript π is often dropped whenever the context is clear.
Based on (4), we can define an increasing function f(g) := g − E[r(S, A)]. Therefore, the problem of estimating g becomes finding the root of f(g), at which f(g) = 0. As can be observed, f(g) is unknown because E[r(S, A)] is unknown. Moreover, we can only obtain noisy observations of E[r(S, A)], namely r(s, a) = E[r(S, A)] + ε for some additive error ε. Thus, the noisy observation of $f(\hat{g}_t)$ at iteration t is given by $\hat{f}(\hat{g}_t) = \hat{g}_t - r(s_t, a_t)$. Recall that an increasing function satisfies df(x)/dx > 0, so that f(x) > 0 for x > x*, where x* indicates the root of f(x).
We can solve for the root of f(g) iteratively via the Robbins-Monro (RM) algorithm as follows,
$$\hat{g}_{t+1} = \hat{g}_t - \beta_t\, \hat{f}(\hat{g}_t) = (1-\beta_t)\,\hat{g}_t + \beta_t\, r(s_t, a_t), \qquad (25)$$
where $\beta_t \in (0,1)$ is a gain-approximation stepsize; note that setting $\beta_t \ge 1$ violates the RM algorithm since it makes the coefficient of $\hat{g}_t$ non-positive. This recursive procedure converges to the root of f(g) with probability 1 under several conditions, including that $\varepsilon_t$ is i.i.d. with $\mathbb{E}[\varepsilon_t]=0$, that the stepsize satisfies the standard requirement in SA, and some other technical conditions (Cao, 2007, Sec 6.1).
The stochastic approximation technique in (25) is the most commonly used. For instance, Wu et al.
(2020, Algo 1), Heess et al. (2012), Powell (2011, Sec 10.7), Castro and Meir (2010, Eqn 13),
Bhatnagar et al. (2009a, Eqn 21), Konda and Tsitsiklis (2003, Eqn 3.1), Marbach and Tsitsiklis
(2001, Sec 4), and Tsitsiklis and Roy (1999, Sec 2).
Furthermore, a learning rate of $\beta_t = 1/(t+1)$ in (25) yields
$$\hat{g}_{t+1} = \frac{1}{t+1}\sum_{\tau=0}^{t} r(s_\tau, a_\tau), \qquad (26)$$
which is the average of the noisy gain observation sequence up to t. The ergodic theorem asserts that for such a Markovian sequence, the time average converges to (4) as t approaches infinity (Gray, 2009, Ch 8). This decaying learning rate is suggested by Singh (1994, Algo 2). Additionally, a variant of $\beta_t = 1/(t+1)^\kappa$ with a positive constant κ < 1 is used by Wu et al. (2020, Sec 5.2.1) for establishing the finite-time error rate of this gain approximation under non-i.i.d. Markovian samples.
Another iterative approach is based on the Bellman expectation equation (20) for obtaining the noisy observation of g. That is,
$$\hat{g}_{t+1} = (1-\beta_t)\,\hat{g}_t + \beta_t\, \underbrace{\big\{ r(s_t,a_t) + \hat{v}^\pi_{b,t}(s_{t+1}) - \hat{v}^\pi_{b,t}(s_t) \big\}}_{g \text{ in expectation of } S_{t+1} \text{ when } \hat{v}^\pi_{b,t} = v_b^\pi} \qquad \text{(The variance is reduced, cf. (25))}$$
$$= \hat{g}_t + \beta_t\, \underbrace{\big\{ r(s_t,a_t) - \hat{g}_t + \hat{v}^\pi_{b,t}(s_{t+1}) - \hat{v}^\pi_{b,t}(s_t) \big\}}_{\hat{\delta}^\pi_{v_b,t}(s_t,a_t,s_{t+1})}. \qquad (27)$$
In comparison to (25), the preceding update is anticipated to have lower variance due to the adjustment term $\hat{v}^\pi_{b,t}(s_{t+1}) - \hat{v}^\pi_{b,t}(s_t)$. This update is used by Singh (1994, Algo 1), Degris et al. (2012, Sec 3B), and Sutton and Barto (2018, p333). An analogous update can be formulated by replacing $v_b^\pi$ with $q_b^\pi$, which is possible only when the next action $a_{t+1}$ is already available, as in the differential Sarsa algorithm (Sutton and Barto, 2018, p251).
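The two incremental gain updates above differ only in their correction term; a minimal side-by-side sketch, with the stepsize and the state-value estimates supplied by the caller, is:

```python
def gain_update_rm(g_hat, r, beta):
    """Robbins-Monro update (25): blend the latest reward observation into the estimate."""
    return (1.0 - beta) * g_hat + beta * r

def gain_update_td(g_hat, r, v_s, v_s_next, beta):
    """Variance-reduced update (27): use the relative TD as the correction term,
    given current state-value estimates v_s ~ v_b(s) and v_s_next ~ v_b(s')."""
    return g_hat + beta * (r - g_hat + v_s_next - v_s)
```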
where $t^\pi_{s_{\mathrm{ref}}}$ denotes the timestep at which $s_{\mathrm{ref}}$ is visited while following a policy π, assuming that $s_{\mathrm{ref}}$ is a recurrent state under all policies. This is used by Sutton et al. (2000, Sec 1), and Marbach and Tsitsiklis (2001, Sec 6). Another batch approximation technique is based on inverse-propensity scoring (Chen-Yu Wei, 2019, Algo 3).
Relative state-value approximation: The value approximator is parameterized as $\hat{\boldsymbol{v}}_b^\pi(\boldsymbol{w}) \approx \boldsymbol{v}_b^\pi$, where $\boldsymbol{w} \in \mathcal{W} = \mathbb{R}^{\dim(\boldsymbol{w})}$. Then, the mean squared error (MSE) objective is minimized. That is,
$$\| \boldsymbol{v}_b^\pi - \hat{\boldsymbol{v}}_b^\pi(\boldsymbol{w}) \|^2_{\mathrm{diag}(\boldsymbol{p}^\star_\pi)} := \mathbb{E}_{S\sim p^\star_\pi}\big[ \{ v_b^\pi(S) - \hat{v}_b^\pi(S; \boldsymbol{w}) \}^2 \big],$$
where $\mathrm{diag}(\boldsymbol{p}^\star_\pi)$ is an |S|-by-|S| diagonal matrix with $p^\star_\pi(s)$ on its diagonal. As a result, the gradient estimate for updating $\boldsymbol{w}$ is given by
$$-\frac{1}{2}\nabla \mathbb{E}_{S\sim p^\star_\pi}\big[\{ v_b^\pi(S) - \hat{v}_b^\pi(S;\boldsymbol{w}) \}^2\big] = \mathbb{E}_{S\sim p^\star_\pi}\big[ \{ v_b^\pi(S) - \hat{v}_b^\pi(S;\boldsymbol{w}) \}\, \nabla\hat{v}_b^\pi(S;\boldsymbol{w}) \big]$$
$$\approx \{ v_b^\pi(s) - \hat{v}_b^\pi(s;\boldsymbol{w}) \}\, \nabla\hat{v}_b^\pi(s;\boldsymbol{w}) \qquad \text{(via a single sample } s\text{)}$$
$$\approx \big\{ \underbrace{r(s,a) - \hat{v}_g^\pi + \hat{v}_b^\pi(s';\boldsymbol{w})}_{\text{TD target that approximates } v_b^\pi(s)} - \hat{v}_b^\pi(s;\boldsymbol{w}) \big\}\, \nabla\hat{v}_b^\pi(s;\boldsymbol{w}).$$
Relative action-value and action-advantage approximation: Estimation for $q_b^\pi$ can be carried out similarly as that for $v_b^\pi$ in (30), but uses the relative TD on action values, namely
$$\hat{\delta}^\pi_{q_b}(s,a,s',a') := \underbrace{\{ r(s,a) - \hat{v}_g^\pi + \hat{q}_b^\pi(s',a';\boldsymbol{w}) \}}_{\text{TD target that approximates } q_b^\pi(s,a)} - \hat{q}_b^\pi(s,a;\boldsymbol{w}), \quad \text{with } a' := a_{t+1},$$
as in Sarsa (Sutton and Barto, 2018, p251), as well as Konda and Tsitsiklis (2003, Eqn 3.1). It is also possible to follow this pattern but with different approaches to approximating the true value of $q_b^\pi(s,a)$, e.g. via episode returns in (28) (Sutton et al., 2000, Thm 2).
For approximating the action advantage parametrically through $\hat{a}_b^\pi(\boldsymbol{w})$, the technique used in (29) can also be applied. However, the true value $a_b^\pi$ is estimated by the TD on state values. That is,
$$-\frac{1}{2}\nabla \mathbb{E}_{S\sim p^\star_\pi, A\sim\pi}\big[\{ a_b^\pi(S,A) - \hat{a}_b^\pi(S,A;\boldsymbol{w}_a) \}^2\big] = \mathbb{E}\big[ \{ a_b^\pi(S,A) - \hat{a}_b^\pi(S,A;\boldsymbol{w}_a) \}\, \nabla\hat{a}_b^\pi(S,A;\boldsymbol{w}_a) \big]$$
$$\approx \big\{ a_b^\pi(s,a) - \hat{a}_b^\pi(s,a;\boldsymbol{w}_a) \big\}\, \nabla\hat{a}_b^\pi(s,a;\boldsymbol{w}_a) \qquad \text{(approximation via a single sample } (s,a)\text{)}$$
$$\approx \big\{ \hat{\delta}^\pi_{v_b}(s,a,s';\boldsymbol{w}_v) - \hat{a}_b^\pi(s,a;\boldsymbol{w}_a) \big\}\, \nabla\hat{a}_b^\pi(s,a;\boldsymbol{w}_a). \qquad \text{(approximation via } \hat{\delta}^\pi_{v_b} \approx a_b^\pi(s,a)\text{, see (24))}$$
This is used by Iwaki and Asada (2019, Eqn 17), Heess et al. (2012, Eqn 13), and Bhatnagar et al. (2009a, Eqn 30). In particular, Iwaki and Asada (2019, Eqn 27) proposed a preconditioning matrix for the gradients, namely $I - \kappa\, \boldsymbol{f}(s,a)\boldsymbol{f}^\intercal(s,a) / (1 + \kappa \|\boldsymbol{f}(s,a)\|^2)$, where $\boldsymbol{f}(s,a)$ denotes the feature vector of a state-action pair, while $\kappa \ge \beta_a$ is some scaling constant.
Furthermore, the action-value or action-advantage estimators can be parameterized linearly with the so-called θ-compatible state-action feature, denoted by $\boldsymbol{f}_\theta(s,a)$, as follows,
$$\hat{q}_b^\pi(s,a;\boldsymbol{w}) = \boldsymbol{w}^\intercal \boldsymbol{f}_\theta(s,a), \quad \text{with } \underbrace{\boldsymbol{f}_\theta(s,a) := \nabla\log\pi(a|s;\theta)}_{\theta\text{-compatible state-action feature}}, \quad \forall (s,a)\in S\times A. \qquad (31)$$
This parameterization, along with the minimization of the MSE loss, is beneficial for two reasons. First, their use with the (locally) optimal parameter $\boldsymbol{w}^* \in \mathcal{W}$ satisfies
$$\mathbb{E}_{S,A}\Big[ \underbrace{\hat{q}_b^\pi(S,A;\boldsymbol{w}=\boldsymbol{w}^*)}_{\text{in lieu of } q_b^\pi(S,A)}\, \nabla\log\pi(A|S;\theta) \Big] = \nabla v_g(\theta), \qquad \text{(exact policy gradients when } \boldsymbol{w}=\boldsymbol{w}^*\text{)}$$
as shown by Sutton et al. (2000, Thm 2). Otherwise, the gradient estimate is likely to be biased (Bhatnagar et al., 2009a, Lemma 4). Second, they make computing natural gradients equivalent to finding the optimal weight $\boldsymbol{w}^*$ for $\hat{q}_b^\pi(\boldsymbol{w})$. That is,
$$\sum_{s\in S}\sum_{a\in A} p^\star_\theta(s)\,\pi(a|s;\theta)\, \boldsymbol{f}_\theta(s,a) \big( \boldsymbol{f}_\theta^\intercal(s,a)\boldsymbol{w}^* - q_b^\pi(s,a) \big) = 0 \qquad \text{(since the MSE gradient is 0 at } \boldsymbol{w}=\boldsymbol{w}^*\text{)}$$
$$\underbrace{\Big\{ \sum_{s\in S}\sum_{a\in A} p^\star_\theta(s)\,\pi(a|s;\theta)\, \boldsymbol{f}_\theta(s,a)\boldsymbol{f}_\theta^\intercal(s,a) \Big\}}_{\boldsymbol{F}_a(\theta)}\, \boldsymbol{w}^* = \underbrace{\sum_{s\in S}\sum_{a\in A} p^\star_\theta(s)\,\pi(a|s;\theta)\, q_b^\pi(s,a)\, \boldsymbol{f}_\theta(s,a)}_{\nabla v_g(\theta)}, \qquad (32)$$
ral gradients is reduced to a regression problem of state-action value functions. It can be either
regressing
i. action value qbπ (Sutton et al., 2000, Thm 2, Kakade, 2002, Thm 1, Konda and Tsitsiklis,
2003, Eqn 3.1),
ii. action advantage aπb (Bhatnagar et al., 2009a, Eqn 30, Algo 3, 4; Heess et al., 2012, Eqn 13;
Iwaki and Asada, 2019, Eqn 17), or
iii. the immediate reward r(s, A ∼ π) as a known ground truth (Morimura et al., 2008, Thm 1),
recall that the ground truth qbπ (Item i.) and aπb (Item ii.) above are unknown. This reward
regression is used along with a θ-compatible state-action feature that is defined differently
compared to (31), namely f θ (s, a) := ∇ log p⋆π (s, a) = ∇ log p⋆π (s) + ∇ log π(a|s; θ),
where f θ is based on the stationary joint state-action distribution p⋆π (s, a). In fact, this
leads to a new Fisher matrix, yielding the so-called natural state-action gradients F s,a
(which subsumes F a in (32) as a special case when ∇ log p⋆π is set to 0).
In addition, Konda and Tsitsiklis (2003) pointed out that the dependency of $\nabla v_g(\theta)$ on $q_b^\pi$ is only through its inner products with vectors in the subspace spanned by $\{\nabla_i \log\pi(a|s;\theta)\}_{i=1}^{\dim(\theta)}$. This implies that learning the projection of $q_b^\pi$ onto the aforementioned (low-dimensional) subspace is sufficient, instead of learning $q_b^\pi$ fully. A θ-dependent state-action feature can be defined based on either the action distribution (as in (31)), or the stationary state-action distribution (as mentioned in Item iii in the previous passage).
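To illustrate the regression view, the sketch below estimates the natural gradient direction as the least-squares weight w* obtained by regressing advantage estimates on the θ-compatible features (31); the batch interface and the small ridge term are illustrative assumptions.

```python
import numpy as np

def natural_gradient_direction(grad_log_pi_batch, advantage_batch, ridge=1e-6):
    """Estimate the natural policy gradient as the least-squares solution w* of
    regressing advantage estimates on f_theta(s, a) = grad log pi(a|s; theta),
    cf. (31)-(32).

    grad_log_pi_batch: array of shape (N, dim_theta), one row per sampled (s, a).
    advantage_batch:   array of shape (N,), e.g. relative TD estimates (24).
    The ridge term is a numerical-stability assumption, not part of the theory.
    """
    F = grad_log_pi_batch
    # sample-based Fisher matrix F_a(theta) = E[f f^T] and target E[f * advantage]
    fisher = F.T @ F / len(F) + ridge * np.eye(F.shape[1])
    target = F.T @ advantage_batch / len(F)
    w_star = np.linalg.solve(fisher, target)     # natural gradient direction
    return w_star
```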
5 Discussion
In this section, we discuss several open questions as first steps towards completing the literature in
average-reward model-free RL. We begin with those of value iteration schemes (Sec 5.1), then of
policy iteration schemes focussing on policy evaluation (Sec 5.2). Lastly, we also outline several
issues that apply to both or beyond the two aforementioned schemes.
5.1 Approximation for optimal gain and optimal action-values (of optimal policies)
Two main components of Qb -learning are q̂b∗ and v̂g∗ updates. The former is typically carried out
via either (14) for tabular, or (19) for (parametric) function approximation settings. What remains
is determining how to estimate the optimal gain vg∗ , which becomes the main bottleneck for Qb -
learning.
As can be seen in Fig 1 (Appendix B), there are two classes of vg∗ approximators. First, approxi-
mators that are not SA-based generally need sref and aref specification, which is shown to affect
the performance especially in large state and action sets (Wan et al., 2020; Yang et al., 2016). On
the other hand, SA-based approximators require a dedicated stepsize βg , yielding more complicated
2-timescale Qb -learning. Furthermore, it is not yet clear whether the approximation for vg∗ should
be on-policy, i.e. updating only when a greedy action is executed, at the cost of reduced sample
efficiency. This begs the question of which approach to estimating vg∗ is “best” (in which cases).
In discounted-reward settings, Hasselt (2010) (also, Hasselt et al. (2016)) pointed out that the approximation for $\mathbb{E}\big[\max_{a\in A} q_\gamma^*(S_{t+1}, a)\big]$ poses overestimation, which may be non-uniform and not concentrated at states that are beneficial in terms of exploration. He proposed instantiating two decoupled approximators such that
$$\mathbb{E}_{S_{t+1}}\Big[ \max_{a'\in A} q_\gamma^*(S_{t+1}, a') \Big] \approx \hat{q}_\gamma^*\big(s_{t+1},\ \operatorname*{argmax}_{a'\in A} \hat{q}_\gamma^*(s_{t+1}, a'; \boldsymbol{w}_q^{(2)});\ \boldsymbol{w}_q^{(1)}\big), \qquad \text{(Double } Q_\gamma\text{-learning)}$$
where $\boldsymbol{w}_q^{(1)}$ and $\boldsymbol{w}_q^{(2)}$ denote their corresponding weights. This was shown to be successful in reducing the negative effect of overestimation. In average-reward cases, the overestimation of $q_b^*$ becomes more convoluted due to the involvement of $\hat{v}_g^*$, as shown in (12). We believe that it is important to extend the idea of double action-value approximators to $Q_b$-learning.
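Purely as a speculative illustration of that extension, a tabular sketch of how the decoupled-estimator idea might be combined with an SA-based gain estimate follows; the random choice of which table to update, the gain-tracking rule, and the stepsizes are our own assumptions, not an algorithm from the surveyed literature.

```python
import numpy as np

def double_qb_step(q1, q2, g_hat, s, a, r, s_next, beta=0.05, beta_g=0.001, rng=None):
    """One tabular step of a hypothetical 'double' Qb-learning: decoupled action
    selection and evaluation, combined with an SA-based gain estimate.

    q1, q2: two decoupled action-value tables of shape (|S|, |A|).
    This is a speculative sketch, not an established algorithm.
    """
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                           # randomly pick which table to update
        a_star = int(q1[s_next].argmax())            # selection by q1 ...
        td = r - g_hat + q2[s_next, a_star] - q1[s, a]   # ... evaluation by q2
        q1[s, a] += beta * td
    else:
        a_star = int(q2[s_next].argmax())            # selection by q2 ...
        td = r - g_hat + q1[s_next, a_star] - q2[s, a]   # ... evaluation by q1
        q2[s, a] += beta * td
    g_hat += beta_g * td                             # SA-based gain tracking (15)
    return q1, q2, g_hat
```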
To our knowledge, there is no finite-time convergence analysis for Qb -learning thus far. There
are also very few works on Qb -learning with function approximation. This is in contrast with its
discounted reward counterpart, e.g. sample complexity of Qγ -learning with UCB-exploration bonus
(Wang et al., 2020), as well as deep Qγ -learning neural-network (DQN) and its variants with non-
linear function approximation (Hessel et al., 2018).
Gain approximation $\hat{v}_g^\pi$: In order to have more flexibility in terms of learning methods, we can parameterize the gain estimator, for instance, $\hat{v}_g^\pi(s;\boldsymbol{w}_g) := \boldsymbol{f}^\intercal(s)\boldsymbol{w}_g$, by which the gain estimate is state-dependent. The learning uses the following gradients,
$$-\frac{1}{2}\nabla\mathbb{E}_{S\sim p^\star_\pi}\big[\{ v_g^\pi - \hat{v}_g^\pi(S;\boldsymbol{w}_g) \}^2\big] = \mathbb{E}_{S\sim p^\star_\pi}\big[ \{ v_g^\pi - \hat{v}_g^\pi(S;\boldsymbol{w}_g) \}\, \nabla\hat{v}_g^\pi(S;\boldsymbol{w}_g) \big]$$
$$\approx \{ v_g^\pi - \hat{v}_g^\pi(s;\boldsymbol{w}_g) \}\, \nabla\hat{v}_g^\pi(s;\boldsymbol{w}_g) \qquad \text{(by a single sample } s\text{)}$$
$$\approx \big\{ \underbrace{r(s,a) + \hat{v}_b^\pi(s') - \hat{v}_b^\pi(s)}_{\text{approximates the true } v_g^\pi \text{ based on (20)}} - \hat{v}_g^\pi(s;\boldsymbol{w}_g) \big\}\, \nabla\hat{v}_g^\pi(s;\boldsymbol{w}_g).$$
State-value approximation v̂bπ : We observe that most, if not all, relative state-value approxima-
tors are based on TD, leading to semi-gradient methods (Sec 4.4.2). In discounted reward settings,
there are several non-TD approaches. These include the true gradient methods (Sutton et al., 2009),
as well as those using the Bellman residual (i.e. the expected value of TD), such as Zhang et al.
(2019); Geist et al. (2017); see also surveys by Dann et al. (2014); Geist and Pietquin (2013).
It is interesting to investigate whether or not to use TD for learning $\hat{v}_b^\pi$. For example, we may formulate a (true) gradient TD method that optimizes the mean-squared projected Bellman error (MSPBE), namely $\|\hat{\boldsymbol{v}}_b^\pi(\boldsymbol{w}) - \mathbb{P}\big[\mathbb{B}^\pi_g[\hat{\boldsymbol{v}}_b^\pi(\boldsymbol{w})]\big]\|^2_{\mathrm{diag}(\boldsymbol{p}^\star_\pi)}$, with a projection operator $\mathbb{P}$ that projects any value approximation onto the space of representable parameterized approximators.
Action-value approximation q̂bπ : Commonly, the estimation for qbπ involves the gain estimate
v̂gπ as dictated by the Bellman expectation equation (20). It is natural then to ask: is it possible to
perform average-reward GPI without estimating the gain of a policy at each iteration? We speculate
that the answer is affirmative as follows.
Let the gain $v_g^\pi$ be the baseline for $q_b^\pi$ in a similar manner to $v_b^\pi$ in (24); cf. in discounted-reward settings, see Weaver and Tao (2001, Thm 1). Then, based on the Bellman expectation equation in action values (analogous to (12)), we have the following identity,
$$\underbrace{q_b^\pi(s,a) - (-v_g^\pi)}_{\text{the surrogate action-value } \mathrm{q}_b^\pi(s,a)} = r(s,a) + \mathbb{E}_{S_{t+1}}\big[ v_b^\pi(S_{t+1}) \big]. \qquad \text{(Note the different symbols, } q_b^\pi \text{ vs } \mathrm{q}_b^\pi\text{)}$$
This surrogate can be parameterized as $\hat{\mathrm{q}}_b^\pi(s,a;\boldsymbol{w})$. Its parameter $\boldsymbol{w}$ is updated using the following gradient estimates,
$$-\frac{1}{2}\nabla\mathbb{E}_{S,A}\big[\{ \mathrm{q}_b^\pi(S,A) - \hat{\mathrm{q}}_b^\pi(S,A;\boldsymbol{w}) \}^2\big] = \mathbb{E}_{S,A}\big[ \{ \mathrm{q}_b^\pi(S,A) - \hat{\mathrm{q}}_b^\pi(S,A;\boldsymbol{w}) \}\, \nabla\hat{\mathrm{q}}_b^\pi(S,A;\boldsymbol{w}) \big]$$
$$= \mathbb{E}_{S,A,S'}\Big[ \big\{ \underbrace{r(S,A) + v_b^\pi(S')}_{\text{the surrogate } \mathrm{q}_b^\pi(S,A)} - \hat{\mathrm{q}}_b^\pi(S,A;\boldsymbol{w}) \big\}\, \nabla\hat{\mathrm{q}}_b^\pi(S,A;\boldsymbol{w}) \Big]$$
$$\approx \big\{ r(s,a) + v_b^\pi(s') - \hat{\mathrm{q}}_b^\pi(s,a;\boldsymbol{w}) \big\}\, \nabla\hat{\mathrm{q}}_b^\pi(s,a;\boldsymbol{w}) \qquad \text{(using a single sample } (s,a,s')\text{)}$$
$$\propto \big\{ r(s,a) + \underbrace{v_b^\pi(s') + (v_g^\pi + \kappa)}_{\text{the surrogate state-value } \nu_b^\pi(s')} - \hat{\mathrm{q}}_b^\pi(s,a;\boldsymbol{w}) \big\}\, \nabla\hat{\mathrm{q}}_b^\pi(s,a;\boldsymbol{w}), \qquad \text{(note the different symbols, } v_b^\pi \text{ vs } \nu_b^\pi\text{)}$$
for some arbitrary constant κ ∈ ℝ. The key is to exploit the fact that we have one degree of freedom in the underdetermined linear system for policy evaluation in (20). Here, the surrogate state-value $\nu_b^\pi$ is equal to $v_b^\pi$ up to some constant, i.e. $\nu_b^\pi = v_b^\pi + (v_g^\pi + \kappa)$. One can estimate $\nu_b^\pi$ in a similar fashion as $\hat{v}_b^\pi$ in (29), except that now the gain approximation $\hat{v}_g^\pi$ is no longer needed. The parameterized estimator $\hat{\nu}_b^\pi(\boldsymbol{w}_\nu)$ can be updated by following the gradient estimate below,
$$-\frac{1}{2}\nabla\mathbb{E}_{S\sim p^\star_\pi}\big[\{ \nu_b^\pi(S) - \hat{\nu}_b^\pi(S;\boldsymbol{w}_\nu) \}^2\big] \approx \big\{ r(s,a) + \hat{\nu}_b^\pi(s';\boldsymbol{w}_\nu) - \hat{\nu}_b^\pi(s;\boldsymbol{w}_\nu) \big\}\, \nabla\hat{\nu}_b^\pi(s;\boldsymbol{w}_\nu), \qquad (33)$$
which is based on a single sample (s, a, s'). In the RHS of (33) above, the TD of $\nu_b^\pi$ looks similar to that of the discounted reward $v_\gamma$, but with an abused γ = 1. Therefore, its stability (whether it will grow unboundedly) warrants experiments.
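A minimal sketch of this gain-free surrogate update (33), assuming a linear parameterization of the surrogate state-value and a caller-supplied feature map, is given below; it illustrates the speculation above and is not an established algorithm.

```python
import numpy as np

def surrogate_value_step(w_nu, feat, s, r, s_next, beta=0.01):
    """One step of the gain-free surrogate update (33) for a linear estimator
    nu_b(s; w) ~= w^T feat(s); the feature map and stepsize are assumptions."""
    td = r + w_nu @ feat(s_next) - w_nu @ feat(s)   # TD of nu_b with an "abused" gamma = 1
    return w_nu + beta * td * feat(s)               # semi-gradient step on the MSE in (33)
```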
Extras: We notice that existing TD-based value approximators simply re-use the TD estimate δ̂vπb
obtained using the old gain estimate from the previous iteration, e.g. Sutton and Barto (2018, p251,
333). They do not harness the newly updated gain estimate for cheap recomputation of δ̂vπb . We
anticipate that doing so will yield more accurate evaluation of the current policy.
There are also open research questions that apply to both average- and discounted-reward policy eval-
uation, but generally require different treatment. They are presented below along with the parallel
works in discounted rewards, if any, which are put in brackets at the end of each bullet point.
• How sensitive is the estimation accuracy of v̂bπ with respect to the trace decay factor λ
in TD(λ)? Recall that v̂bπ involves v̂g , which is commonly estimated using δ̂vπb as in (27).
(cf. Sutton and Barto (2018, Fig 12.6))
• What are the advantages and challenges of natural gradients for the critic? Note that natural
gradients typically are applied only for the actor. (cf. Wu et al. (2017))
• As can be observed, the weight update in (30) resembles that of SGD with momentum.
It begs the question: what is the connection between backward-view eligibility traces
and momentum in gradient-based step-wise updates for both actor and critic parameters?
(cf. Vieillard et al. (2020); Nichols (2017); Xu et al. (2006))
• How to construct “high-performing” basis vectors for the value approximation? To what
extent does the limitation of θ-compatible critic ((31)) outweigh its benefit? Also notice
the Bellman average-reward bases (Mahadevan, 2009, Sec 11.2.4), as well as non-linear
(over-)parameterized neural networks. (cf. Wang et al. (2019))
• Among the three choices for advantage approximation (Sec 4.2), which one is most beneficial?
In the following passages, relevant works in discounted reward settings are mentioned, if any, inside
the brackets at the end of each part.
On batch settings: For on-policy online RL without experience replay buffer, we pose the follow-
ing questions. How to determine a batch size that balances the trade-off between collecting more
samples with the current policy and updating the policy more often with fewer numbers of samples?
How to apply the concept of backward-view eligibility traces in batch settings? (cf. Harb and Precup
(2017)).
On value- vs policy-iteration schemes: As can be observed in Tables 1 and 2 (in Appendix A),
there is less work on value- than policy-iteration schemes; even less when considering only function
approximation settings. Our preliminary experiments suggest that although following value iteration
is straightforward (cf. policy iteration with evaluation and improvement steps), it is more difficult to
make it “work”, especially with function approximation. One challenge is to specify the proper off-
set, e.g. vg∗ or its estimate, in RVI-like methods to bound the iterates. Moreover, Mahadevan (1996a,
Sec 3.5) highlighted that the seminal value-iteration based RL, i.e. average reward Qb -learning, is
sensitive to exploration.
We posit these questions. Is it still worth it to adopt the value iteration scheme after all? Which
scheme is more advantageous in terms of exploration in RL? How to reconcile both schemes?
(cf. O’Donoghue et al. (2016); Schulman et al. (2017); Wang et al. (2020)).
On MDP modeling: The broadest class that can be handled by existing average-reward RL is
the unichain MDP; note that most works assume the more specific class, i.e. the recurrent (ergodic)
MDP. To our knowledge, there is still no average reward model-free RL for multichain MDPs (which
is the most general class).
We also desire to apply average-reward RL to continuous state problems, for which we may benefit
from the DP theory on general states, e.g. Sennott (1999, Ch 8). There are few attempts thus far, for
instance, (Yang et al., 2019), which is limited to linear quadratic regulator with ergodic cost.
On optimality criteria: The average-reward optimality criterion is underselective for problems with transient states, for which we need (n = 0)-discount (bias) optimality, or even higher n-discount optimality. This underselectiveness motivates the weighted optimality in DP (Krass et al., 1992). In
RL, Mahadevan (1996b) developed bias-optimal Q-learning.
At the other extreme, (n = ∞)-discount optimality is the most selective criterion. According to Puterman (1994, Thm 10.1.5), it is equivalent to Blackwell optimality, which intuitively asserts that, once one looks sufficiently far into the future via Blackwell's discount factor γBw, there is no policy better than the Blackwell optimal policy. Moreover, optimizing the discounted reward does not require any knowledge about the MDP structure (i.e. the recurrent, unichain, or multichain classification). Therefore, one of the pressing questions is how to estimate such a γBw in RL.
Acknowledgments
We thank Aaron Snoswell, Nathaniel Du Preez-Wilkinson, Jordan Bishop, Russell Tsuchida, and
Matthew Aitchison for insightful discussions that helped improve this paper. Vektor is supported by
the University of Queensland Research Training Scholarship.
References
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019a). POLI-
TEX: Regret bounds for policy iteration using expert prediction. In ICML. (p8, 24, 25, and 35)
Abbasi-Yadkori, Y., Lazic, N., and Szepesvari, C. (2019b). Model-free linear quadratic control via
reduction to expert prediction. In AISTATS. (p24 and 35)
Abdulla, M. S. and Bhatnagar, S. (2007). Reinforcement learning based algorithms for average cost
markov decision processes. Discrete Event Dynamic Systems, 17(1). (p33)
Abounadi, J., Bertsekas, D., and Borkar, V. S. (2001). Learning algorithms for markov decision
processes with average cost. SIAM J. Control Optim., 40(3). (p5, 6, 26, and 30)
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). On the theory of policy gradient
methods: Optimality, approximation, and distribution shift. arXiv: 1908.00261. (p8)
Avrachenkov, K. and Borkar, V. S. (2020). Whittle index based q-learning for restless bandits with
average reward. arXiv:2004.14427. (p6, 23, 26, and 30)
Bagnell, J. A. D. and Schneider, J. (2003). Covariant policy search. In Proceeding of the Interna-
tional Joint Conference on Artifical Intelligence. (p32)
Bartlett, P. L. and Baxter, J. (2002). Estimation and approximation bounds for gradient-based rein-
forcement learning. Journal of Computer and System Sciences, 64(1). (p2)
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement
learning. In ICML. (p16)
Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 4th
edition. (p5, 6, 26, and 30)
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, 1st
edition. (p6, 7, 26, and 29)
Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic
algorithms. In NIPS. (p33)
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009a). Natural actor-critic algorithms.
Automatica, 45. (p8, 9, 10, 12, 13, 27, 28, 33, and 34)
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009b). Natural actor-critic algorithms.
Technical report. (p23)
Cao, X. (2007). Stochastic Learning and Optimization: A Sensitivity-Based Approach. International
Series on Discrete Event Dynamic Systems, v. 17. Springer US. (p4, 8, 10, 11, and 27)
Castro, D. D. and Mannor, S. (2010). Adaptive bases for reinforcement learning. In Machine
Learning and Knowledge Discovery in Databases. (p34)
Castro, D. D. and Meir, R. (2010). A convergent online single time scale actor critic algorithm.
Journal of Machine Learning Research. (p9, 10, 12, 23, 27, 28, and 34)
Castro, D. D., Volkinshtein, D., and Meir, R. (2009). Temporal difference based actor critic learning
- convergence and neural implementation. In NIPS. (p34)
Chang, H. S. (2009). Decentralized learning in finite markov chains: Revisited. IEEE Transactions
on Automatic Control, 54(7):1648–1653. (p2)
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain (2019). Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. arXiv:1910.07072. (p8, 11, 23, 24, 28, and 36)
Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey
and comparison. Journal of Machine Learning Research, 15(24):809–883. (p14)
Das, T., Gosavi, A., Mahadevan, S., and Marchalleck, N. (1999). Solving semi-markov decision
problems using average reward reinforcement learning. Manage. Sci., 45(4). (p6, 7, 23, 26,
and 29)
Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. (2017). On the sample complexity of the linear
quadratic regulator. arXiv:1710.01688. (p2)
Degris, T., Pilarski, P. M., and Sutton, R. S. (2012). Model-free reinforcement learning with contin-
uous action in practice. In 2012 American Control Conference (ACC). (p9, 11, 27, and 34)
Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. In ICML. (p25)
Deisenroth, M. P., Neumann, G., and Peters, J. (2013). A survey on policy search for robotics.
Foundations and Trends in Robotics, 2(1–2):1–142. (p8 and 9)
Devraj, A. M., Kontoyiannis, I., and Meyn, S. P. (2018). Differential temporal difference learning.
arXiv:1812.11137. (p24 and 35)
Devraj, A. M. and Meyn, S. P. (2016). Differential TD learning for value function approximation. In
2016 IEEE 55th Conference on Decision and Control (CDC), pages 6347–6354. (p35)
Dewanto, V. and Gallagher, M. (2021). Examining average and discounted reward optimality criteria
in reinforcement learning. arXiv:2107.01348. (p1)
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep rein-
forcement learning for continuous control. In ICML. (p8)
Feinberg, E. A. and Shwartz, A. (2002). Handbook of Markov Decision Processes: Methods and
Applications, volume 40. Springer US. (p7, 11, and 27)
Furmston, T. and Barber, D. (2012). A unifying perspective of parametric policy search methods for
Markov decision processes. In Advances in Neural Information Processing Systems 25. (p35)
Furmston, T., Lever, G., and Barber, D. (2016). Approximate newton methods for policy search in
Markov decision processes. Journal of Machine Learning Research, 17(227). (p9 and 35)
Geist, M. and Pietquin, O. (2013). Algorithmic survey of parametric value function approximation.
IEEE Transactions on Neural Networks and Learning Systems, 24(6):845–867. (p14)
Geist, M., Piot, B., and Pietquin, O. (2017). Is the bellman residual a bad proxy? In Advances in
Neural Information Processing Systems 30. (p14)
Gosavi, A. (2004a). A reinforcement learning algorithm based on policy iteration for average reward:
Empirical results with yield management and convergence analysis. Machine Learning, 55(1).
(p11, 27, 28, and 33)
Gosavi, A. (2004b). Reinforcement learning for long-run average cost. European Journal of Oper-
ational Research, 155(3):654 – 674. (p29)
Gosavi, A. (2015). Simulation-Based Optimization: Parametric Optimization Techniques and Rein-
forcement Learning. Springer Publishing Company, Incorporated, 2nd edition. (p1 and 5)
Gosavi, A., Bandla, N., and Das, T. K. (2002). A reinforcement learning approach to a sin-
gle leg airline revenue management problem with multiple fare classes and overbooking. IIE
Transactions, 34(9). (p24 and 29)
Gray, R. M. (2009). Probability, Random Processes, and Ergodic Properties. Springer US, 2nd
edition. (p10)
Greensmith, E., Bartlett, P. L., and Baxter, J. (2004). Variance reduction techniques for gradient
estimates in reinforcement learning. J. Mach. Learn. Res., 5. (p9)
Hao, B., Lazic, N., Abbasi-Yadkori, Y., Joulani, P., and Szepesvari, C. (2020). Provably efficient
adaptive approximate policy iteration. (p8, 23, and 36)
Harb, J. and Precup, D. (2017). Investigating recurrence and eligibility traces in deep q-networks.
CoRR, abs/1704.05495. (p16)
Hasselt, H. V. (2010). Double Q-learning. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J.,
Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23,
pages 2613–2621. Curran Associates, Inc. (p14)
Hasselt, H. v., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. (p14)
Heess, N., Silver, D., and Teh, Y. W. (2012). Actor-critic reinforcement learning with energy-based
policies. In European Workshop on Reinforcement Learning, Proceedings of Machine Learning
Research. (p10, 13, 28, and 34)
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B.,
Azar, M. G., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement
learning. In AAAI. AAAI Press. (p14)
Howard, R. A. (1960). Dynamic Programming and Markov Processes. Technology Press of the
Massachusetts Institute of Technology. (p7)
Iwaki, R. and Asada, M. (2019). Implicit incremental natural actor critic algorithm. Neural Net-
works, 109. (p9, 13, 28, and 35)
Jafarnia-Jahromi, M., Wei, C.-Y., Jain, R., and Luo, H. (2020). A model-free learning algorithm for
infinite-horizon average-reward MDPs with near-optimal regret. (p23, 24, 26, and 31)
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning.
J. Mach. Learn. Res., 11:1563–1600. (p2)
Jalali, A. and Ferguson, M. J. (1990). A distributed asynchronous algorithm for expected average
cost dynamic programming. In 29th IEEE Conference on Decision and Control, pages 1394–1395
vol.3. (p5)
Kakade, S. (2002). A natural policy gradient. In NIPS. (p9, 13, 23, 25, 28, 32, and 34)
Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased
stochastic approximation scheme. (p2)
Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM J. Control Optim., 42(4).
(p10, 12, 13, 28, and 33)
Krass, D., Filar, J. A., and Sinha, S. S. (1992). A weighted Markov decision process. Operations
Research, 40(6). (p16)
Lagoudakis, M. G. (2003). Efficient Approximate Policy Iteration Methods for Sequential Decision
Making in Reinforcement Learning. PhD thesis, Department of Computer Science, Duke Univer-
sity. (p11, 27, 28, and 33)
Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon
off-policy estimation. In Advances in Neural Information Processing Systems 31. (p24 and 25)
Mahadevan, S. (1996a). Average reward reinforcement learning: Foundations, algorithms, and
empirical results. Machine Learning. (p1, 2, 16, 23, 24, and 25)
Mahadevan, S. (1996b). Sensitive discount optimality: Unifying discounted and average reward
reinforcement learning. In ICML. (p16)
Mahadevan, S. (2009). Learning representation and control in Markov decision processes: New
frontiers. Foundations and Trends in Machine Learning, 1(4):403–565. (p15)
Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of Markov reward processes.
IEEE Transactions on Automatic Control, 46(2):191–209. (p9, 10, 11, 24, 27, 28, and 32)
Matsubara, T., Morimura, T., and Morimoto, J. (2010a). Adaptive step-size policy gradients with
average reward metric. In Proceedings of the 2nd Asian Conference on Machine Learning, JMLR
Proceedings. (p9 and 34)
Matsubara, T., Morimura, T., and Morimoto, J. (2010b). Adaptive step-size policy gradients with
average reward metric. In Proceedings of the 2nd Asian Conference on Machine Learning, ACML
2010, Tokyo, Japan, November 8-10, 2010, pages 285–298. (p23)
Miyamae, A., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Natural policy gradient methods with
parameter-based exploration for control tasks. In Advances in Neural Information Processing
Systems 23. (p8)
Morimura, T., Osogami, T., and Shirai, T. (2014). Mixing-time regularized policy gradient. In
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI’14. (p23
and 35)
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. (2008). A new natural policy gradient
by stationary distribution metric. In European Conference Machine Learning and Knowledge
Discovery in Databases ECML/PKDD, pages 82–97. Springer. (p9, 13, 23, and 34)
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. (2009). A generalized natural actor-critic
algorithm. In Advances in Neural Information Processing Systems 22. (p9, 23, and 34)
Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of loga-
rithmic stationary distributions for policy gradient reinforcement learning. Neural Computation,
22(2):342–376. (p16, 25, and 34)
Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized Markov decision
processes. CoRR, abs/1705.07798. (p2)
Nichols, B. D. (2017). A comparison of eligibility trace and momentum on sarsa in continuous
state-and action-space. In 2017 9th Computer Science and Electronic Engineering (CEEC), pages
55–59. (p15)
O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). PGQL: Combining policy gradient and Q-learning. CoRR, abs/1611.01626. (p16)
Ormoneit, D. and Glynn, P. W. (2001). Kernel-based reinforcement learning in average-cost prob-
lems: An application to optimal portfolio choice. In Advances in Neural Information Processing
Systems 13. (p7 and 23)
Ormoneit, D. and Glynn, P. W. (2002). Kernel-based reinforcement learning in average-cost prob-
lems. IEEE Transactions on Automatic Control, 47(10):1624–1636. (p7)
Ortner, R. (2007). Pseudometrics for state aggregation in average reward Markov decision processes.
In Hutter, M., Servedio, R. A., and Takimoto, E., editors, Algorithmic Learning Theory, pages
373–387, Berlin, Heidelberg. Springer Berlin Heidelberg. (p7)
Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7). (p32)
Powell, W. (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Wiley Series in Probability and Statistics. Wiley, 2nd edition. (p10, 11, and 27)
Prashanth, L. A. and Bhatnagar, S. (2011). Reinforcement learning with average cost for adaptive
control of traffic lights at intersections. In 2011 14th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1640–1645. (p6, 7, 24, 26, 30, and 33)
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., 1st edition. (p2, 3, and 16)
Qiu, S., Yang, Z., Ye, J., and Wang, Z. (2019). On the finite-time convergence of actor-critic algo-
rithm. In Optimization Foundations for Reinforcement Learning Workshop, NeurIPS 2019. (p36)
Schneckenreither, M. (2020). Average reward adjusted discounted reinforcement learning: Near-
blackwell-optimal policies for real-world applications. arXiv: 2004.00857. (p2 and 24)
Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence between policy gradients and soft
Q-learning. CoRR, abs/1704.06440. (p16)
Sennott, L. I. (1999). Stochastic Dynamic Programming and the Control of Queueing Systems.
Wiley-Interscience, New York, NY, USA. (p16)
Silver, D., Sutton, R. S., and Müller, M. (2008). Sample-based learning and search with perma-
nent and transient memories. In Proceedings of the 25th International Conference on Machine
Learning, ICML ’08. (p2)
Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff Markovian decision pro-
cesses. AAAI ’94, pages 700–705. (p6, 10, 11, 23, 26, 27, 28, 29, 30, and 32)
Sutton, R. S. (1990). Integrated architecture for learning, planning, and reacting based on approx-
imating dynamic programming. In Proceedings of the Seventh International Conference (1990)
on Machine Learning. (p2)
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition. (p7, 8,
9, 11, 12, 15, 24, 27, 28, 33, and 35)
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora,
E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function
approximation. In Proceedings of the 26th ICML. (p14)
Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for
reinforcement learning with function approximation. In NIPS. (p9, 11, 12, 13, 28, and 32)
Tadepalli, P. and Ok, D. (1998). Model-based average reward reinforcement learning. Artificial
Intelligence, 100(1). (p2)
Thomas, P. (2014). Genga: A generalization of natural gradient ascent with positive and negative
convergence results. In Proceedings of the 31st International Conference on Machine Learning,
volume 32 of Proceedings of Machine Learning Research. (p9)
Ueno, T., Kawanabe, M., Mori, T., Maeda, S.-i., and Ishii, S. (2008). A semiparametric statistical
approach to model-free policy evaluation. In Proceedings of the 25th International Conference
on Machine Learning. (p11, 28, and 33)
Vemula, A., Sun, W., and Bagnell, J. A. (2019). Contrasting exploration in parameter and action
space: A zeroth-order optimization perspective. In The 22nd AISTATS 2019, Proceedings of
Machine Learning Research. (p8)
Vieillard, N., Scherrer, B., Pietquin, O., and Geist, M. (2020). Momentum in reinforcement learn-
ing. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and
Statistics, Proceedings of Machine Learning Research. PMLR. (p15)
Wan, Y., Naik, A., and Sutton, R. S. (2020). Learning and planning in average-reward Markov
decision processes. (p6, 14, 24, 26, and 30)
Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimal-
ity and rates of convergence. (p15)
Wang, M. (2017). Primal-dual π learning: Sample complexity and sublinear run time for ergodic
Markov decision problems. CoRR, abs/1710.06100. (p2)
Wang, Y., Dong, K., Chen, X., and Wang, L. (2020). Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations.
(p14 and 16)
Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement
learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence,
UAI’01. (p15)
Wheeler, R. and Narendra, K. (1986). Decentralized learning in finite markov chains. IEEE Trans-
actions on Automatic Control, 31(6):519–526. (p2)
White, D. (1963). Dynamic programming, Markov chains, and the method of successive approxima-
tions. Journal of Mathematical Analysis and Applications, 6(3):373 – 376. (p5)
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. (2017). Scalable trust-region method
for deep reinforcement learning using kronecker-factored approximation. In Advances in Neural
Information Processing Systems 30. (p15)
Wu, Y., Zhang, W., Xu, P., and Gu, Q. (2020). A finite time analysis of two time-scale actor critic
methods. arXiv:2005.01350. (p10, 12, 27, and 36)
Xu, J., Liang, F., and Yu, W. (2006). Learning with eligibility traces in adaptive critic designs. In
2006 IEEE International Conference on Vehicular Electronics and Safety, pages 309–313. (p15)
Yang, S., Gao, Y., An, B., Wang, H., and Chen, X. (2016). Efficient average reward reinforcement
learning using constant shifting values. In Proceedings of the Thirtieth AAAI Conference on
Artificial Intelligence, AAAI’16. AAAI Press. (p6, 7, 14, 24, 25, 26, and 30)
Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A
case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Process-
ing Systems 32. (p16)
Yu, H. and Bertsekas, D. P. (2009). Convergence results for some temporal difference methods
based on least squares. IEEE Transactions on Automatic Control, 54(7):1515–1531. (p11, 27,
28, and 33)
Zhang, S., Boehmer, W., and Whiteson, S. (2019). Deep residual reinforcement learning. CoRR,
abs/1905.01072. (p14)
APPENDIX
These are environments in which an MDP with n states is constructed with randomized transition
matrices and rewards. In order to provide context and meaning, a background story is often provided
for these systems, such as in the case of DeepSea (Hao et al., 2020), where the environment is
interpreted as a diver searching for treasure.
One special subclass is the Generic Average Reward Non-stationary Environment Testbed
(GARNET), which was proposed by Bhatnagar et al. (2009b). It is parameterized as
GARNET(|S|, |A|, x, σ, τ), where |S| is the number of states, |A| is the number of actions,
τ determines how non-stationary the problem is, and x is the branching factor, which determines how
many next states are available for each state-action pair. The set of next states is created by a random
selection from the state set without replacement. The probabilities of going to each available next state
are uniformly distributed, and the reward received for each transition is normally distributed with
standard deviation σ.
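For concreteness, the following minimal Python sketch generates a GARNET-style MDP along the lines described above. It is only an illustrative reading of the parameterization: the function name, the uniform draws normalized into transition probabilities, and the zero-mean rewards are our assumptions rather than the exact construction of Bhatnagar et al. (2009b), and the non-stationarity parameter τ is ignored.

import numpy as np

def make_garnet(num_states, num_actions, branching, reward_std, seed=None):
    # A sketch of GARNET(|S|, |A|, x, sigma): random transitions with branching
    # factor x and Gaussian rewards with standard deviation sigma.
    rng = np.random.default_rng(seed)
    P = np.zeros((num_states, num_actions, num_states))  # transition probabilities
    R = np.zeros((num_states, num_actions, num_states))  # rewards per transition
    for s in range(num_states):
        for a in range(num_actions):
            # x next states chosen from the state set without replacement
            next_states = rng.choice(num_states, size=branching, replace=False)
            probs = rng.uniform(size=branching)           # assumed: uniform draws,
            P[s, a, next_states] = probs / probs.sum()    # normalized to sum to 1
            R[s, a, next_states] = rng.normal(0.0, reward_std, size=branching)
    return P, R

P, R = make_garnet(num_states=30, num_actions=4, branching=2, reward_std=0.1, seed=0)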
n-cycle (loop, circle) MDPs:
PI: None.
VI: Schwartz (1993); Mahadevan (1996a); Yang et al. (2016); Wan et al. (2020).
This group of problems has n cycles (circles, loops), with the state set defined by the number of cycles
and the number of states in each cycle. The state transitions are typically deterministic. At the state
in which the loops intersect, the agent must decide which loop to take. For each state inside the
loops, the only available actions are to stay or to move to the next state. The reward function is set out
such that each loop leads to a different reward, with some rewards being more delayed than others.
Mahadevan (1996a) contains a 2-cycle problem: a robot receives a reward of +5 if it chooses the
cycle going from home to the printer, which has 5 states, and +20 if it chooses the other cycle,
which leads to the mail room and contains 10 states.
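As an illustration, here is a minimal Python sketch of the 2-cycle robot problem just described. The timing of the reward (earned upon returning home) and the omission of the "stay" action are our simplifying assumptions.

def two_cycle_step(state, action, printer_len=5, mail_len=10):
    # state = (cycle, position); cycle is None when the robot is at home.
    # At home, action 0 enters the 5-state printer cycle (+5 on completion),
    # while action 1 enters the 10-state mail-room cycle (+20 on completion).
    cycle, pos = state
    if cycle is None:
        return ((0 if action == 0 else 1), 1), 0.0
    length = printer_len if cycle == 0 else mail_len
    if pos + 1 < length:
        return (cycle, pos + 1), 0.0       # keep moving along the chosen cycle
    reward = 5.0 if cycle == 0 else 20.0   # reward on returning home (assumed)
    return (None, 0), reward

state, total_reward = (None, 0), 0.0
for _ in range(30):                        # always pick the mail-room cycle
    state, r = two_cycle_step(state, 1)
    total_reward += r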
n-chain:
PI: Chen-Yu Wei (2019).
VI: Yang et al. (2016); Jafarnia-Jahromi et al. (2020).
These are environments involving n states arranged in a linear, chain-like formation. The only two
available actions are to move forwards or backwards. Note that the term “chain” here does not denote
a recurrent class; all n-chain environments have one recurrent class.
Strens (2000) presents a more specific variant, which is used in the OpenAI Gym. In it, moving
forward yields no reward, whereas moving backwards returns the agent to the beginning and yields
a small reward. Reaching the end of the chain yields a large reward. The state transitions are
not deterministic: each action has a small probability of resulting in the opposite transition.
RiverSwim is another n-chain style problem, consisting of 6 states in a chain-like formation that rep-
resent positions across a river with a current flowing from right to left. Each state has two possible
actions: swim left with the current, or swim right against the current. Swimming with the current is al-
ways successful (deterministic), but swimming against it can fail with some probability. The reward
function is such that the optimal policy is to swim to the rightmost state and remain there.
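A minimal Python sketch of such RiverSwim dynamics follows. The particular failure probabilities and reward magnitudes are illustrative assumptions (published variants differ in their exact numbers); only the qualitative structure matches the description above.

import numpy as np

def riverswim_step(state, action, num_states=6, rng=None):
    # action 0: swim left with the current (always succeeds);
    # action 1: swim right against the current (may fail).
    rng = rng if rng is not None else np.random.default_rng()
    if action == 0:
        next_state = max(state - 1, 0)
        reward = 0.005 if next_state == 0 else 0.0      # small reward at the left bank
    else:
        u = rng.random()
        if u < 0.6:                                     # assumed success probability
            next_state = min(state + 1, num_states - 1)
        elif u < 0.9:                                   # pushed back: stay in place
            next_state = state
        else:                                           # swept one state to the left
            next_state = max(state - 1, 0)
        reward = 1.0 if next_state == num_states - 1 else 0.0  # large reward at the right
    return next_state, reward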
C.1.2 Queuing
Another commonly used group of environments in RL involves the optimization of queuing systems.
Common examples of queuing environments in the literature include call bandwidth allocation
(Marbach and Tsitsiklis, 2001) and airline seat allocation (Gosavi et al., 2002).
C.1.3 Continuous States and Actions
Many environments with applications to real-world systems, such as robotics, contain continuous
states and actions. The most common approach to these environments in RL is to discretize the
continuous variables; however, there are RL methods that can be applied even if the MDP remains
continuous. Below we describe commonly used continuous environments from the literature.
Swing-up pendulum:
PI: Morimura et al. (2010); Degris et al. (2012); Liu et al. (2018).
VI: None.
The task is to swing a simulated pendulum with the goal of getting it into, and keeping it in, the vertical position.
The state of the system is described by the current angle and angular velocity of the pendulum. The
action is the torque applied at the base of the pendulum, which is restricted to be within a practical
limit. The reward is the cosine of the pendulum's angle, which is proportional to the height of
the pendulum's tip.
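A minimal Python sketch of one step of such a swing-up pendulum follows. Only the reward (the cosine of the angle) comes from the description above; the dynamics constants, the Euler integration, and the torque limit value are illustrative assumptions.

import numpy as np

def pendulum_step(theta, theta_dot, torque, dt=0.05, g=9.8, m=1.0, length=1.0, max_torque=2.0):
    # theta is the angle measured from the upright position.
    torque = float(np.clip(torque, -max_torque, max_torque))   # practical torque limit
    theta_ddot = (g / length) * np.sin(theta) + torque / (m * length ** 2)
    theta_dot = theta_dot + dt * theta_ddot
    theta = theta + dt * theta_dot
    reward = np.cos(theta)          # proportional to the height of the pendulum's tip
    return theta, theta_dot, reward

theta, theta_dot = np.pi, 0.0       # start hanging straight down
theta, theta_dot, r = pendulum_step(theta, theta_dot, torque=2.0)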
Grid-navigation:
PI: None.
VI: Mahadevan (1996a).
Tetris:
PI: Kakade (2002).
VI: Yang et al. (2016).
Atari Pacman:
PI: Abbasi-Yadkori et al. (2019a).
VI: None.
Non-SA (1-timescale Qb-learning), using reward r(s, a) only: Singh (1994, Algo 4); Das et al. (1999, Eqn 8).
Basic SA, unspecified step-sizes: Tsitsiklis and Roy (1999, Sec 2); Marbach and Tsitsiklis (2001, Sec 4); Bhatnagar et al. (2009a, Eqn 21); Castro and Meir (2010, Eqn 13).
Basic SA, specified step-sizes: Singh (1994, Algo 2); Wu et al. (2020, Sec 5.2.1).
Incremental, TD-based: Singh (1994, Algo 1); Degris et al. (2012, Sec 3B); Sutton and Barto (2018, p251, 333).
Using fixed references: Lagoudakis (2003, Appendix A); Cao (2007, Eqn 6.23, p311).
Constant: Tsitsiklis and Roy (1999, Sec 5).
Figure 2: Taxonomy for gain approximation v̂gπ, used in policy-iteration based average-reward
model-free RL. Here, SA stands for stochastic approximation, whereas TD for temporal difference.
For details, see Sec 4.3.
Tabular, incremental: Singh (1994, Algo 1, 2).
Function approximator, gradient based: Tsitsiklis and Roy (1999, Thm 1); Castro and Meir (2010, Eqn 13); Sutton and Barto (2018, p333).
Function approximator, least-square based: Ueno et al. (2008); Yu and Bertsekas (2009).
Figure 3: Taxonomy for state-value approximation v̂bπ, used in policy-iteration based average-reward
model-free RL. For details, see Sec 4.4.
Tabular, incremental: Sutton et al. (2000); Marbach and Tsitsiklis (2001); Gosavi (2004a).
Tabular, batch: Sutton et al. (2000, Sec 1); Marbach and Tsitsiklis (2001, Sec 6); Chen-Yu Wei (2019, Algo 3).
Least-square based: Lagoudakis (2003, Appendix A).
Figure 4: Taxonomy for action-value related approximation, i.e. action values q̂bπ and action advantages âbπ, used in policy-iteration based average-reward model-free RL. Here, TD stands for temporal
difference. For details, see Sec 4.4.
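To make the incremental, TD-based branches of Figures 2 and 3 concrete, here is a minimal Python sketch of coupled gain and differential state-value estimation with linear function approximation, in the spirit of Sutton and Barto (2018, p333); the step sizes and the one-hot features are our own illustrative choices.

import numpy as np

def differential_td_update(w, v_g, phi, phi_next, reward, beta_w=0.01, beta_g=0.01):
    # v̂_b(s) = w · φ(s) approximates the differential (bias) state value,
    # while the scalar v_g tracks the gain (average reward).
    delta = reward - v_g + w @ phi_next - w @ phi   # average-reward TD error
    v_g = v_g + beta_g * delta                      # gain estimate update
    w = w + beta_w * delta * phi                    # state-value (critic) update
    return w, v_g, delta

w, v_g = np.zeros(4), 0.0
phi, phi_next = np.eye(4)[0], np.eye(4)[1]          # illustrative one-hot features
w, v_g, _ = differential_td_update(w, v_g, phi, phi_next, reward=1.0)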
Table 1: Existing works on value-iteration based average-reward model-free RL. The label T, F,
or TF (following the work number) indicates whether the corresponding work deals with tabular,
function approximation, or both settings, respectively
for some positive stepsize β. Suggest setting q̂b∗ (sref , ·) ← 0 for an arbitrary-but-fixed
reference state sref . The optimal gain vg∗ is estimated using (15, 17).
Experiment None
Work 4 (F) Das et al. (1999); Gosavi et al. (2002); Gosavi (2004b)
Contribution Qb -learning on semi-MDPs, named Semi-Markov Average Reward Technique
(SMART). Use a feed-forward neural network to represent q̂b∗ (w) with the following
update (Eqn 9),
w ← w + β{r(s, a) − v̂g∗ + max_{a′∈A} q̂b∗(s′, a′; w) − q̂b∗(s, a; w)}∇q̂b∗(s, a; w).
Work 5 (T) Abounadi et al. (2001); Bertsekas (2012): Tabular RVI and SSP Qb -learning
Contribution First, asymptotic convergence for RVI Qb -learning based on ODE with the usual di-
minishing stepsize for SA (Sec 2.2, 3). Stepsize-free gain approximation, namely
v̂g∗ = f (q̂b∗ ), where f : R|S|×|A| 7→ R is Lipschitz, f (1sa ) = 1 and f (x + c1sa ) =
f (x)+c, ∀c ∈ R, 1sa denotes a constant vector of all 1’s in R|S|×|A| (Assumption 2.2).
For example,
f(q̂b∗) := q̂b∗(sref, aref),   f(q̂b∗) := max_{a} q̂b∗(sref, a),   f(q̂b∗) := (1/(|S||A|)) Σ_{s,a} q̂b∗(s, a).
Second, SSP Qb -learning based on an observation that the gain of any stationary pol-
icy is simply the ratio of the expected total reward to the expected time between two successive
visits to the reference state (Sec 2.3, Eqn 2.9a); also based on the contract-
ing value iteration (Bertsekas (2012): Sec 7.2.3). Asymptotic convergence based on
ODE and the usual diminishing stepsize for SA (Sec 4). The update for q̂b∗ involves
maxa′ ∈A q̂b∗ (s′ , a′ ), where q̂b∗ (s′ , a′ ) = 0 if s′ = sref (Bertsekas (2012): Eqn 7.15).
The optimal gain is approximated (in Eqn 2.9b) as
v̂g∗ ← Pκ[v̂g∗ + βg max_{a∈A} q̂b∗(sref, a)],
where Pκ denotes the projection on to an interval [−κ, κ] for some constant κ such
that vg∗ ∈ (−κ, κ). This v̂g∗ should be updated at a slower rate than q̂b∗ .
Experiment None
Work 6 (F) Prashanth and Bhatnagar (2011)
Contribution Qb -learning with function approximation (Eqn 10),
w ← w + β{r(s, a) − v̂g∗ + max_{a′∈A} q̂b∗(s′, a′; w)}, with v̂g∗ = max_{a′′∈A} q̂b∗(sref, a′′; w).
Experiment On traffic-light control. The parameter w does not converge; it oscillates. The proposal
resulted in worse performance compared to an actor-critic method.
Work 7 (TF) Yang et al. (2016): Constant shifting values (CSVs)
Contribution Use a constant as v̂g∗ , which is inferred from prior knowledge, because RVI Qb -
learning is sensitive to the choice of sref when the state set is large. Argue that un-
bounded q̂b∗ (when v̂g∗ < vg∗ ) are acceptable as long as the policy converges. Hence,
derive a terminating condition (as soon as the policy is deemed stable), and prove that
such a convergence could be towards the optimal policy if CSVs are properly chosen.
Experiment On 4-circle MDP, RiverSwim, Tetris. Outperform SMART, RVI Qb -learning, R-
learning in both tabular and linear function approximation (with some notion of eli-
gibility traces).
Work 8 (T) Avrachenkov and Borkar (2020)
Contribution Tabular RVI Qb -learning for the Whittle index policy (Eqn 11). Approximate vg∗ using
the average of all entries of q̂b∗ (same as Abounadi et al. (2001): Sec 2.2). Analyse the
asymptotic convergence.
Experiment Multi-armed restless bandits: 4-state problem with circulant dynamics, 5-state prob-
lem with restart. The proposal is shown to converge to the exact vg∗ .
Work 9 (T) Wan et al. (2020)
Contribution Tabular Qb -learning without any reference state-action pair. Show empirically that
such a reference retards learning and causes divergence when (sref, aref) is infrequently
visited (e.g. transient states in unichain MDPs). Prove its asymptotic convergence un-
der unichain MDPs, whose key is using TD for v̂g∗ estimates (same as Singh (1994):
Algo 3).
Experiment On access-control queuing. The proposal is empirically shown to be competitive in
terms of learning curves in comparison to RVI Qb -learning.
Work 10 (T) Jafarnia-Jahromi et al. (2020):
Exploration Enhanced Q-learning (EE-QL)
Contribution A regret bound of Õ(√t̂max) with tabular non-parametric policies in weakly communicating MDPs (more general than the typically-assumed recurrent MDPs). The key
is to use a single scalar estimate v̂g∗ for all states, instead of maintaining the estimate
for each state. This concentrating approximation v̂g∗ is assumed to be available. For
updating q̂b∗, use β = 1/√k for the sake of analysis, instead of the common β = 1/k.
Experiment On random MDPs and RiverSwim (weakly communicating variant). Outperform
MDP-OOMD and POLITEX, as well as model-based benchmarks UCRL2 and PSRL.
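As a concrete illustration of the relative-value update that recurs throughout Table 1, here is a minimal Python sketch of tabular RVI Qb-learning. The constant step size and the particular gain estimate f(q̂) = q̂(sref, aref) are illustrative choices; see Works 5 and 8 above for the analyzed variants and alternative choices of f.

import numpy as np

def rvi_q_update(Q, s, a, r, s_next, s_ref=0, a_ref=0, beta=0.1):
    # Gain estimate via a reference state-action pair; alternatives include
    # max_a Q[s_ref, a] or the mean of all entries of Q.
    v_g_hat = Q[s_ref, a_ref]
    delta = r - v_g_hat + Q[s_next].max() - Q[s, a]   # relative TD error
    Q[s, a] += beta * delta
    return Q

Q = np.zeros((3, 2))                                  # toy 3-state, 2-action table
Q = rvi_q_update(Q, s=0, a=1, r=1.0, s_next=2)        # one update on hypothetical data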
Table 2: Existing works on policy-iteration based average-reward model-free RL. The label E, I,
or EI (following the work number) indicates whether the corresponding work is (mainly) about
on-policy policy evaluation, policy improvement, or both, respectively.
Work 6 (EI) Konda and Tsitsiklis (2003): A class of actor-critic algorithms
Contribution Interpret randomized policy gradients as an inner product of 2 real-valued functions
on S × A, i.e.
∂vg(θ)/∂θi = ⟨qb(θ), ∇i log π(θ)⟩θ = Σ_{s∈S} Σ_{a∈A} p⋆θ(s, a) qbθ(s, a) ∇i log π(s, a; θ),
where ∇i log π(θ) denotes the i-th unit vector component of ∇ log π(θ) for i =
1, 2, . . . , dθ . Thus, in order to compute ∇vg (θ), it suffices to learn the projection
of qb (θ) onto the span of the unit vectors {∇i log π(θ); 1 ≤ i ≤ dθ } in R|S||A| . This
span should be contained in the span of the basis vectors of the linearly parameterized
critic, e.g. setting ∇i log π(θ) as such basis vectors. Using TD(λ) with backward-view
eligibility traces. The convergence analysis is based on the martingale approach.
Experiment None
Work 7 (E) Lagoudakis (2003): Least-Squares TD of Q for average reward (LSTDQ-AR)
Contribution Batch policy evaluation as part of least-squares policy iteration. Remark that because
there is no exponential drop due to a discount factor, the value function approxima-
tion (for average rewards) is more amenable to fitting with linear architectures (Ap-
pendix A).
Experiment None
Work 8 (E) Gosavi (2004a): QP-Learning
Contribution Tabular GPI-based methods with implicit policy representation (taking a greedy action
with respect to relative action values). Use TD for learning q̂bπ . Decay the learning rate,
as the inverse of the number of updates (timesteps). Convergence analysis to optimal
solution via ODE methods (Sec 6). This method is similar to Sarsa (Sutton and Barto
(2018): p251), but with batch gain approximation; note that the action value approxi-
mation has its own disjoint batch and is updated incrementally per timestep.
Experiment Airline yield management: 6 distinct cases, 10 runs (1 million timesteps each). Im-
provements over the commonly used heuristic (expected marginal seat revenue): 3.2%
to 16%; whereas over Qγ=0.99 -learning: 2.3% to 9.3%.
Work 9 (E) Ueno et al. (2008): gLSTD and LSTDc
Contribution LSTD-based batch policy evaluation (on state values) via semiparametric statistical
inference. Using the so-called estimating-function methods, leading to gLSTD and
LSTDc. Analyze asymptotic variance of linear function approximation.
Experiment On a 4-state, 2-action MDP. The gLSTD and LSTDc are shown to have lower variance.
Work 10 (EI) Bhatnagar et al. (2009a); Prashanth and Bhatnagar (2011): Natural actor critic
Contribution An iterative procedure to estimate the inverse of the Fisher information matrix F −1 . It
is based on weighted average in SA and Sherman-Morrison formula (Eqn 26, Algo 2).
This F −1 is also used as a preconditioning matrix for the critic’s gradients (Eqn 35,
Algo 4). Use a θ-compatible parameterized action advantage approximator (critic),
whose parameter w is utilized as natural gradient estimates (Eqn 30, Algo 3). Based
on ODE, prove the asymptotic convergence to a small neighborhood of the set of local
maxima of vg∗ . Related to (Bhatnagar et al., 2008; Abdulla and Bhatnagar, 2007).
Experiment On Generic Average Reward Non-stationary Environment Testbed (GARNET).
Algo 3 with parametric action advantage approximator yields reliably good
performance in both small and large GARNETs, and outperforms that of
Konda and Tsitsiklis (2003).
Work 11 (E) Yu and Bertsekas (2009)
Contribution Analyze convergence and the rate of convergence of TD-based least squares batch
policy evaluation for vbπ , specifically LSPE(λ) algorithm for any λ ∈ (0, 1) and any
constant stepsize β ∈ (0, 1]. The non-expansive mapping is turned to contraction
through the choice of basis functions and a constant stepsize.
Experiment On 2- and 100-state randomly constructed MDPs. LSPE(λ) is competitive with
LSTD(λ).
Work 12 (I) Morimura et al. (2009, 2008): generalized Natural Actor-Critic (gNAC)
Contribution A variant of natural actor critic that uses the generalized natural gradient (gNG), i.e.
F_{sa}(θ, κ) := κF_s(θ) + F_a(θ), with some constant κ ∈ [0, 1].
This linear interpolation controlled by κ can be interpreted as a continuous interpo-
lation with respect to the k-timesteps state-action joint distribution. In particular,
κ = 0 corresponds to ∞-timesteps (after converging to the stationary state distribu-
tion), whereas κ = 1 corresponds to 1-timestep. Provide an efficient implementation
for the gNG learning based on the estimating function theory, equipped with an auxiliary
function for variance reduction (Lemma 1, Thm 1). The policy parameter is
updated by a gNG estimate, which is a solution of the estimating function.
Experiment On randomly constructed MDPs with 2 actions and 2, 5, 10, up to 100 states; using the
technique of Morimura et al. (2010) for ∇ps (θ). The generalized variant with κ = 1
converges to the same point as that of κ = 0.25, but with a slower rate of learning.
Both outperform the baseline NAC algorithm, which is equivalent to using κ = 0.
Work 13 (EI) Castro and Meir (2010); Castro and Mannor (2010); Castro et al. (2009)
Contribution ODE-based convergence analysis of 1-timescale actor-critic methods to a neighbor-
hood of a local maximum of vg∗ (instead of to a local maximum itself as in 2-timescale
variants). This single timescale is motivated by a biological context, i.e. unclear justi-
fication for 2 timescales operating within the same anatomical structure.
Experiment On GARNET, standard gradients, and linearly parameterized TD(λ) critic with eligi-
bility trace. During the initial phase, 1-timescale algorithm converges faster than that
of Bhatnagar et al. (2009a), whereas the long term behavior is problem-dependent.
Work 14 (I) Matsubara et al. (2010a): Adaptive stepsizes
Contribution An adaptive stepsize for both standard and natural policy gradient methods. It is
achieved by setting the distance between θ and (θ + ∆θ) to some user-specified con-
stant, where the effect of a change ∆θ on vg is measured by a Riemannian metric.
Thus, α(θ) = 1/{∇⊺ vg (θ)F −1 (θ)∇vg (θ)}. Estimate F (θ) as the mean of the expo-
nential recency-weighted average, i.e. F̂ ← F̂ + λ(∇vg (θ)∇⊺ vg (θ) − F̂ ).
Experiment On 3- and 20-state MDPs, with eligibility trace on policy gradients. The proposed
adaptive natural gradients lead to faster convergence than that of Kakade (2002).
Work 15 (I) Heess et al. (2012): Energy-based policy parameterization
Contribution Energy-based policy parameterization with latent variables (Eqn 3), which enables
complex non-linear relationship between actions and states. Incremental approxima-
tion techniques for the partition function and for two expectations used in computing
compatible features (Eqn 19).
Experiment Octopus arm (continuous states, discrete actions), compatible action advantage ap-
proximator. The proposal (which builds upon Bhatnagar et al. (2009a)) outperforms
the non-linear version of the Sarsa algorithm (Eqn 5).
Work 16 (EI) Degris et al. (2012)
Contribution Empirical study of actor-critic methods with backward-view eligibility traces for both
actor and critic. Parameterize continuous action policies using a normal distribution,
whose variance is used to scale the gradient.
Experiment On swing-up pendulum (continuous states and actions). Use standard and natural
gradients, and linearly parameterized state-value approximator with tile coding for
state feature extractions. The use of variance-scaled gradients and eligibility traces
significantly improves performance, but the use of natural gradients does not.
Work 17 (I) Morimura et al. (2014)
Contribution Regularization for policy gradient updates, i.e. −κ∇h(θ), where κ denotes some scal-
ing factor, and h the hitting time, which controls the magnitude of the mixing time. A
TD-based approximation for the gradient of the hitting time, ∇h(θ).
Experiment On a 2-state MDP. The regularized policy gradient leads to faster learning, compared
to non-regularized ones (both standard and natural gradients). It also reduces the effect
of policy initialization on the learning performance.
Work 18 (I) Furmston et al. (2016); Furmston and Barber (2012)
Contribution Estimate the Hessian of vg (θ) via Gauss-Newton methods (Algo 2 in Appendix B.1).
Use eligibility traces for both the gradients and the Hessian approximates.
Experiment None
Work 19 (EI) Sutton and Barto (2018)
Contribution Present 2 algorithms: a) semi-gradient Sarsa with q̂bπ(w) (p251), and b) actor-critic
with per-step updates and backward-view eligibility traces for both actor and critic
v̂bπ (w) (p333).
Experiment Access-control queueing with tabular action value approximators (p252).
Work 20 (E) Devraj et al. (2018); Devraj and Meyn (2016): grad-LSTD(λ)
Contribution Grad-LSTD(λ) for w∗ = argmin_w ‖v̂bπ(w) − vbπ‖_{pπs,1}, which is a quadratic program
(Eqn 30, 37), instead of w∗ = argmin_w ‖v̂bπ(w) − vbπ‖²_{pπs} in the standard LSTD(1).
Analysis of convergence and error rate for linear parameterization. Claim that grad-
LSTD(λ) is applicable for models that do not have regeneration.
Experiment On a single-server queue with controllable service rate. With respect to the Bellman
error, grad-LSTD(λ) appears to converge faster than LSTD(λ). Moreover, the variance
of grad-LSTD’s parameters is smaller compared to those of LSTD.
Work 21 (EI) Abbasi-Yadkori et al. (2019a,b): POLITEX (Policy Iteration with Expert Advice)
Contribution Reduction of controls to expert predictions: in each state, there exists an expert al-
gorithm and the policy’s value losses are fed to the learning algorithm. Each policy
is represented as a Boltzmann distribution over the sum of action-value function estimates of all
previous policies. That is, π_{i+1}(a|s) ∝ exp(−κ Σ_{j=1}^{i} q̂b^{j}(s, a)), where κ is a positive
learning rate and i indexes the policy iteration. Achieve a regret bound of Õ(tmax^{3/4}),
which scales only in the number of features.
Experiment On 4- and 8-server queueing with linear action value approximation (Bertsekas’s
LSPE): POLITEX achieves similar performance to LSPI, and slightly outperforms
RLSVI. On Atari Pacman using non-linear TD-based action-value approximation (3
random seeds): POLITEX obtains higher scores than DQN, arguably due to more
stable learning.
Work 22 (E) Iwaki and Asada (2019): Implicit incremental natural actor critic (I2NAC)
Contribution A preconditioning matrix for the gradient used in âπb (w) updates, namely:
I − κ(f(s, a)f⊺(s, a))/(1 + κ‖f(s, a)‖²), for some scaling κ ≥ β and some pos-
sible use of eligibility traces on f (Eqn 27). It is derived from the so-called implicit
TD method with linear function approximation. Provide asymptotic convergence anal-
ysis and show that the proposal is less sensitive to the learning rate and state-action
features.
Experiment None (the experiment is all in discounted reward settings)
Work 23 (EI) Qiu et al. (2019)
Contribution A non-asymptotic convergence analysis of actor-critic algorithm with linearly-
parameterized TD-based state-value approximation. The keys are to interpret the actor-
critic algorithm as a bilevel optimization problem (Prop 3.2), namely
max_{θ∈Θ} vg(π(θ)), subject to argmin_{w∈W, vg∈R} ℓ(w, vg; θ), for some loss function ℓ,
and to decouple the actor (upper-level) and critic (lower-level) updates; instead of
typical ODE-based techniques. Here, “decoupling” implies that at every iteration, the
critic starts from scratch to estimate the value of the actor's policy. The actor converges
sublinearly in expectation to a stationary point, although its updates are with biased
gradient estimates (due to critic approximation). The analysis assumes that the actor
receives i.i.d. samples.
Experiment None
Work 24 (EI) Chen-Yu Wei (2019): Optimistic online mirror descent (MDP-OOMD)
Contribution A regret bound of Õ(√tmax) with non-parametric tabular randomized policies in
weakly communicating MDPs. The key idea is to maintain an instance of an adversar-
ial multi-armed bandit at each state to learn the best action. Assume that tmax is large
enough so that mixing and hitting times are both smaller than tmax /4.
Experiment On a random MDP and JumpRiverSwim (both have 6 states and 2 actions). MDP-
OOMD outperforms standard Q-learning, but is on par with POLITEX.
Work 25 (EI) Wu et al. (2020)
Contribution A non-asymptotic analysis of 2-timescale actor-critic methods under non-i.i.d. Markovian
samples and with a linearly-parameterized critic v̂bπ(w). Show a convergence guarantee
to an ǫ-approximate first-order stationary point of the non-concave policy value function,
i.e. ‖∇vg(θ)‖₂² ≤ ǫ, in at most Õ(ǫ^{−2.5}) samples when one sample is used per iteration
(Cor 4.10). Establish that for gain approximation with βt = 1/(1 + t)^{κ1}, we have
Σ_{t=τ}^{t̂max} E[(v̂gθ − vgθ)²] = O(t̂max^{κ1}) + O(t̂max^{1−κ1} log t̂max) + O(t̂max^{1−2(κ2−κ1)}),
for some positive constants 0 < κ1 < κ2 < 1 that indicate the relationship between
actor’s and critic’s update rates (Sec 5.2.1).
Experiment None
Work 26 (EI) Hao et al. (2020): Adaptive approximate policy iteration (AAPI)
Contribution A learning scheme that operates in batches (phases) and achieves an Õ(tmax^{2/3}) regret
bound. It uses adaptive, data- and state-dependent learning rates, as well as side-
information (Eqn 4.1), i.e. a vector computable based on past information and being
predictive of the next loss (here, action values). The policy improvement is based
on the adaptive optimistic follow-the-regularized-leader (AO-FTRL), which can be
interpreted as regularizing each policy by the KL-divergence to the previous policy.
Experiment On randomly-constructed MDPs with 5, 10, 20 states, and 5, 10, 20 actions; DeepSea
environment with grid-based states and 2 actions; CartPole with continuous states
and 2 actions. AAPI outperforms POLITEX and RLSVI. Remark that smaller phase
lengths perform better, whereas adaptive per-state learning rate is not effective in Cart-
Pole.
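Finally, to make the policy-iteration based updates surveyed in Table 2 concrete, here is a minimal Python sketch of a one-step average-reward actor-critic with a tabular softmax policy and a tabular critic. The step sizes, the tabular representations, and the use of the TD error as the advantage estimate are our own simplifications, in the spirit of, e.g., Sutton and Barto (2018, p333).

import numpy as np

def softmax_policy(theta, s):
    prefs = theta[s] - theta[s].max()                 # numerically stable softmax
    p = np.exp(prefs)
    return p / p.sum()

def actor_critic_update(theta, v, v_g, s, a, r, s_next,
                        alpha=0.05, beta_v=0.05, beta_g=0.01):
    delta = r - v_g + v[s_next] - v[s]                # average-reward TD error
    v_g = v_g + beta_g * delta                        # gain (average reward) estimate
    v[s] = v[s] + beta_v * delta                      # critic update
    grad_log_pi = -softmax_policy(theta, s)           # ∇_θ[s] log π(a|s) for softmax
    grad_log_pi[a] += 1.0
    theta[s] = theta[s] + alpha * delta * grad_log_pi # actor (policy gradient) update
    return theta, v, v_g

theta, v, v_g = np.zeros((3, 2)), np.zeros(3), 0.0    # toy 3-state, 2-action problem
theta, v, v_g = actor_critic_update(theta, v, v_g, s=0, a=1, r=1.0, s_next=2)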