
Average-reward model-free reinforcement learning:

a systematic review and literature mapping

Vektor Dewanto¹, George Dunn², Ali Eshragh²,⁴, Marcus Gallagher¹, and Fred Roosta³,⁴

arXiv:2010.08920v2 [cs.LG] 3 Aug 2021

¹ School of Information Technology and Electrical Engineering, University of Queensland, AU
² School of Mathematical and Physical Sciences, University of Newcastle, AU
³ School of Mathematics and Physics, University of Queensland, AU
⁴ International Computer Science Institute, Berkeley, CA, USA

[email protected], [email protected],
{marcusg,fred.roosta}@uq.edu.au, [email protected]

Abstract
Reinforcement learning is an important part of artificial intelligence. In this paper,
we review model-free reinforcement learning that utilizes the average reward op-
timality criterion in the infinite horizon setting. Motivated by the solo survey by
Mahadevan (1996a), we provide an updated review of work in this area and extend
it to cover policy-iteration and function approximation methods (in addition to the
value-iteration and tabular counterparts). We present a comprehensive literature
mapping. We also identify and discuss opportunities for future work.

1 Introduction
Reinforcement learning (RL) is one promising approach to the problem of making sequential deci-
sions under uncertainty. Such a problem is often formulated as a Markov decision process (MDP)
with a state set S, an action set A, a reward set R, and a decision-epoch set T . At each decision-
epoch (i.e. timestep) t ∈ T , a decision maker (henceforth, an agent) is at a state st ∈ S, and
chooses to then execute an action at ∈ A. Consequently, the agent arrives at the next state st+1 and
earns an (immediate) reward rt+1 . For t = 0, 1, . . . , tmax with tmax ≤ ∞, the agent experiences
a sequence of S0 , A0 , R1 , S1 , A1 , R2 , . . . , Stmax . Here, S0 is drawn from an initial state distribu-
tion ps0 , whereas st+1 and rt+1 are governed by the environment dynamics, which is fully specified
by the state-transition probability p(s′|s, a) := Pr{St+1 = s′ | St = s, At = a}, and the reward
function r(s, a, s′) = ∑_{r∈R} Pr{r|s, a, s′} · r.
The solution to the decision making problem is a mapping from every state to a probability distribu-
tion over the set of available actions in that state. This mapping is called a policy, i.e. π : S × A 7→
[0, 1], where π(at |st ) := Pr{A = at |St = st }. Thus, solving such a problem amounts to finding the
optimal policy, denoted by π ∗ . The basic optimality criterion asserts that a policy with the largest
value is optimal. That is, v(π ∗ ) ≥ v(π), ∀π ∈ Π, where the function v measures the value of any
policy π in the policy set Π. There are two major ways to value a policy based on the infinite reward
sequence that it generates, namely the average- and discounted-reward policy value formulations.
They induce the average- and discounted-reward optimality criteria, respectively. For an examina-
tion of their relationship, pros, and cons in RL, we refer the readers to (Dewanto and Gallagher,
2021).
Furthermore, RL can be viewed as simulation-based asynchronous approximate dynamic program-
ming (DP) (Gosavi, 2015, Ch 7). Particularly in model-free RL, the simulation is deemed expensive
because it corresponds to direct interaction between an agent and its (real) environment. Model-free
RL mitigates not only the curse of dimensionality (inherently as approximate DP methods), but also the curse of modeling (since no model learning is required). This is in contrast to model-based RL,
where an agent interacts with the learned model of the environment (Dean et al., 2017; Jaksch et al.,
2010; Tadepalli and Ok, 1998). Model-free RL is fundamental in that it encompasses the bare es-
sentials for updating sequential decision rules through natural agent-environment interaction. The
same update machinery can generally be applied to model-based RL. In practice, we may always
desire a system that runs both model-free and model-based mechanisms, see e.g. the so-called Dyna
architecture (Silver et al., 2008; Sutton, 1990).
This work surveys the existing value- and policy-iteration based average-reward model-free RL. We
begin by reviewing relevant DP (as the root of RL), before progressing to the tabular then function
approximation settings in RL. For comparison, the solo survey on average-reward RL (Mahadevan,
1996a) embraced only the tabular value-iteration based methods. We present our review in Secs 3
and 4, which are accompanied by concise Tables 1 and 2 (in Appendix A) along with literature maps
(Figs 1, 2, 3, and 4 in Appendix B). We then discuss the insight and outlook in Sec 5. Additionally
in Appendix C, we compile environments that were used for evaluation by the existing works.
In order to limit the scope, this work focuses on model-free RL with a single non-hierarchical agent
interacting online with its environment. We do not include works that approximately optimize the
average reward by introducing a discount factor (hence, the corresponding approximation error),
e.g. Schneckenreither (2020); Karimi et al. (2019); Bartlett and Baxter (2002). We also choose not
to examine RL methods that are based on linear programming (Wang, 2017; Neu et al., 2017), and
decentralized learning automata (Chang, 2009; Wheeler and Narendra, 1986).

2 Preliminaries
We assume that the problem we aim to tackle can be well modeled as a Markov decision process
(MDP) with the following properties.
• The state set S and action set A are finite. All states are fully observable, and all actions
a ∈ A are available in every state s ∈ S. The decision-epoch set T = Z≥0 is discrete and
infinite. Thus, we have discrete-time infinite-horizon finite MDPs.
• The state-transition probability p(st+1 |st , at ) and the reward function r(st , at ) are both
stationary (time homogeneous, i.e. fixed over time). Here, r(st, at) = E[r(st, at, St+1)], and it
is uniformly bounded by a finite constant rmax , i.e. |r(s, a)| ≤ rmax < ∞, ∀(s, a) ∈ S ×A.
• The MDP is recurrent, and its optimal policies belong to the stationary policy set ΠS .
Furthermore, every stationary policy π ∈ ΠS of an MDP induces a Markov chain (MC), whose transition (stochastic) matrix is denoted as P_π ∈ [0, 1]^{|S|×|S|}. The [s, s′]-entry of P_π indicates the probability of transitioning from a current state s to the next state s′ under a policy π. That is, P_π[s, s′] = ∑_{a∈A} π(a|s) p(s′|s, a). The t-th power of P_π gives P_π^t, whose [s0, s]-entry indicates the probability of being in state s in t timesteps when starting from s0 and following π. That is, P_π^t[s0, s] = Pr{St = s | s0, π} =: p_π^t(s|s0). The limiting distribution of p_π^t is given by

    p⋆_π(s|s0) = lim_{tmax→∞} (1/tmax) ∑_{t=0}^{tmax−1} p_π^t(s|s0) = lim_{tmax→∞} p_π^{tmax}(s|s0),  ∀s ∈ S,    (1)

where the first limit is proven to exist in finite MDPs, while the second limit exists whenever the finite
MDP is aperiodic (Puterman, 1994, Appendix A.4). This limiting state distribution p⋆π is equivalent
to the unique stationary (time-invariant) state distribution that satisfies (p⋆π )⊺ P π = (p⋆π )⊺ , which
may be achieved in finite timesteps. Here, p⋆π ∈ [0, 1]|S| is p⋆π (s|s0 ) stacked together for all s ∈ S.
The expected average reward (also termed the gain) value of a policy π is defined for all s0 ∈ S as
"t −1 #
max
1 X
vg (π, s0 ) := lim ESt ,At r(St , At ) S0 = s0 , π (2)
tmax →∞ tmax
t=0
Xn tmax
X−1 o
1 X
= lim ptπ (s|s0 ) rπ (s) = p⋆π (s|s0 )rπ (s) (3)
tmax →∞ tmax
s∈S t=0 s∈S
h i
= lim EStmax ,Atmax r(Stmax , Atmax ) S0 = s0 , π , (4)
tmax →∞

2
P
where rπ (s) = a∈A π(a|s) r(s, a). The limit in (2) exists when the policy π is stationary, and the
MDP is finite (Puterman, 1994, Prop 8.1.1). Whenever it exists, the equality in (3) follows due to the
existence of limit in (1) and the validity of interchanging the limit and the expectation. The equality
in (4) holds if its limit exists, for instance when π is stationary and the induced MC is aperiodic
(nonetheless, note that even if the induced MC is periodic, the limit in (4) exists for certain reward
structures, see Puterman (1994, Problem 5.2)). In matrix forms, the gain can be expressed as
    v_g(π) = lim_{tmax→∞} (1/tmax) v_{tmax}(π) = lim_{tmax→∞} (1/tmax) ∑_{t=0}^{tmax−1} P_π^t r_π = P⋆_π r_π,

where r_π ∈ R^{|S|} is r_π(s) stacked together for all s ∈ S. Notice that the gain involves taking the
limit of the average of the expected total reward v tmax from t = 0 to tmax − 1 for tmax → ∞.
Since unichain MDPs have a single chain (i.e. a closed irreducible recurrent class), the stationary
distribution is invariant to initial states. Therefore, P ⋆π has identical rows so that the gain is constant
across all initial states. That is, vg (π) = p⋆π · r π = vg (π, s0 ), ∀s0 ∈ S, hence v g (π) = vg (π) · 1.
The gain vg (π) can be interpreted as the stationary reward because it represents the average reward
per timestep of a system in its steady-state under π.
A policy π ∗ ∈ ΠS is gain-optimal (hereafter, simply called optimal) if
    v_g(π∗, s0) ≥ v_g(π, s0),  ∀π ∈ ΠS, ∀s0 ∈ S (the initial state s0 can be dropped since the gain is constant),  hence  π∗ ∈ argmax_{π∈ΠS} v_g(π).    (5)

Such an optimal policy π ∗ is also a greedy (hence, deterministic) policy in that it selects actions max-
imizing the RHS of the average-reward Bellman optimality equation (which is useful for deriving
optimal control algorithms) as follows,
    v_b(π∗, s) + v_g(π∗) = max_{a∈A} { r(s, a) + ∑_{s′∈S} p(s′|s, a) v_b(π∗, s′) },  ∀s ∈ S,    (6)

where vb denotes the bias value. It is defined for all π ∈ ΠS and all s0 ∈ S as
"t −1 #
max
X  
vb (π, s0 ) := lim ESt ,At r(St , At ) − vg (π) S0 = s0 , π (7)
tmax →∞
t=0
tmax
X−1 X  
= lim ptπ (s|s0 ) − p⋆π (s) rπ (s) (8)
tmax →∞
t=0 s∈S
τX
−1 X tmax
X−1 X  
= ptπ (s|s0 )rπ (s) − τ vg (π) + lim ptπ (s|s0 ) − p⋆π (s) rπ (s) (9)
tmax →∞
t=0 s∈S t=τ s∈S
| {z } | {z }
the expected total reward vτ approaches 0 as τ → ∞

Xn tmax
X−1  o X
= lim ptπ (s|s0 ) − p⋆π (s) rπ (s) = dπ (s|s0 )rπ (s), (10)
tmax →∞
s∈S t=0 s∈S

where all limits are assumed to exist, and (7) is bounded because of the subtraction of vg (π). When-
ever exchanging the limit and the expectation is valid in (10), dπ (s|s0 ) represents the [s0 , s]-entry
of the non-stochastic deviation matrix Dπ := (I − P π + P ⋆π )−1 (I − P ⋆π ).
The bias vb (π, s0 ) can be interpreted in several ways. Firstly based on (7), the bias is the expected
total difference between the immediate reward r(st , at ) and the stationary reward vg (π) when a
process starts at s0 and follows π. Secondly from (8), the bias indicates the difference between the
expected total rewards of two processes under π: one starts at s0 and the other at an initial state
drawn from p⋆π . Put in another way, it is the difference of the total reward of π and the total reward
that would be earned if the reward per timestep were vg (π). Thirdly, decomposing the timesteps as
in (9) yields vτ (π, s0 ) ≈ vg (π)τ + vb (π, s0 ). This suggests that the bias serves as the intercept of
a line around which the expected total reward vτ oscillates, and eventually converges as τ increases.
Such a line has a slope of the gain value. For example in an MDP with zero reward absorbing
terminal state (whose gain is 0), the bias equals the expected total reward before the process is
absorbed. Lastly, the deviation factor (ptπ (s|s0 ) − p⋆π (s)) in (10) is non-zero only before the process

3
reaches its steady-state. Therefore, the bias indicates the transient performance. It may be regarded
as the “transient” reward, whose values are earned during the transient phase.
For any reference state sref ∈ S, we can define the (bias) relative value vbrel as follows,

    vbrel(π, s) := v_b(π, s) − v_b(π, sref) = lim_{tmax→∞} { v_{tmax}(π, s) − v_{tmax}(π, sref) },  ∀π ∈ ΠS, ∀s ∈ S.

The right-most equality follows from (7) or (9), since the gain is the same from both s and sref. It indicates that vbrel represents the asymptotic difference in the expected total reward due to starting from s instead of sref. More importantly, the relative value vbrel is equal to vb up to some constant for any s ∈ S. Moreover, vbrel(π, s = sref) = 0. After fixing sref, therefore, we can substitute vbrel for vb in (6), then uniquely determine vbrel for all states. Note that (6) with vb is originally an underdetermined nonlinear system with |S| equations and |S| + 1 unknowns (i.e. one extra unknown, vg). In practice, we often resort to vbrel and abuse the notation vb to also denote the relative value whenever the context is clear. In a similar fashion, we also refer to the bias as the relative/differential value.
One may also refer to the bias as relative values due to (8), average-adjusted values (due to (7)), or
potential values (similar to potential energy in physics whose values differ by a constant (Cao, 2007,
p193)).
For brevity, we often use the following notations, vgπ := vg (π) =: g π , and vg∗ := vg (π ∗ ) =: g ∗ , as
well as vbπ (s) := vb (π, s), and vb∗ (s) := vb (π ∗ , s). Here, vbπ (s) may be read as the relative state
value of s under a policy π.

3 Value-iteration schemes
Based on the Bellman optimality equation (6), we can obtain the optimal policy once we know
the optimal state value v ∗b ∈ R|S| , which includes knowing the optimal gain vg∗ . Therefore, one
approach to optimal control is to (approximately) compute v ∗b , which leads to the value-iteration
algorithm in DP. The value-iteration scheme in RL uses the same principle as its DP counterpart.
However, we approximate the optimal (state-)action value q ∗b ∈ R|S||A| , instead of v ∗b .
In this section, we begin by introducing the foundations of the value-iteration scheme, showing the
progression from DP into RL. We then cover tabular methods by presenting how to estimate the opti-
mal action values iteratively along with numerous approaches to estimating the optimal gain. Lastly,
we examine function approximation methods, where the action value function is parameterized by
a weight vector, and present different techniques of updating the weights. Table 1 (in Appendix A)
summarizes existing average reward model-free RL based on the value-iteration scheme.

3.1 Foundations
In DP, the basic value iteration (VI) algorithm iteratively solves for the optimal state value v ∗b and
outputs an ε-optimal policy; hence it is deemed exact up to the convergence tolerance ε. The basic
VI algorithm proceeds with the following steps.
Step 1: Initialize the iteration index k ← 0 and v̂bk=0 (s) ← 0, where v̂bk (s) ≈ vb∗ (s), ∀s ∈ S.
Also select a small positive convergence tolerance ε > 0.
Note that v̂bk may not correspond to any stationary policy.
Step 2: Perform updates on all iterates (termed synchronous updates) as follows,

    v̂_b^{k+1}(s) = B∗[v̂_b^k(s)],  ∀s ∈ S,    (Basic VI: synchronous updates)

where the average-reward Bellman optimality operator B∗ is defined as

    B∗[v̂_b^k(s)] := max_{a∈A} [ r(s, a) + ∑_{s′∈S} p(s′|s, a) v̂_b^k(s′) ],  ∀s ∈ S.

Clearly, B∗ is non-linear (due to the max operator), and is derived from (6) with v̂g∗ ← 0.
Step 3: If the span seminorm sp(v̂_b^{k+1} − v̂_b^k) > ε, then increment k and go to Step 2.
Otherwise, output a greedy policy with respect to the RHS of (6); such a policy is ε-optimal.
Here, sp(v) := max_{s∈S} v(s) − min_{s′∈S} v(s′), which measures the range of all components
of a vector v. Therefore, although the vector changes, its span may be constant.

The mapping in Step 2 of the VI algorithm is not contractive (because there is no discount factor).
Moreover, the iterates v̂ kb can grow very large, leading to numerical instability. Therefore, White
(1963) proposed subtracting out the value of an arbitrary-but-fixed reference state sref at every
iteration. That is,
 
    v̂_b^{k+1}(s) = B∗[v̂_b^k(s)] − v̂_b^k(sref),  ∀s ∈ S,    (Relative VI: synchronous updates)
which results in the so-called relative value iteration (RVI) algorithm. This update yields the same
span and the same sequence of maximizing actions as that of the basic VI algorithm. Importantly,
as k → ∞, the iterate v̂bk (sref ) converges to vg∗ .
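To make Steps 1-3 and the RVI modification concrete, the following is a minimal NumPy sketch of synchronous RVI for a finite MDP with known p(s′|s, a) and r(s, a); the array shapes, the choice sref = 0, and the stopping tolerance are illustrative assumptions rather than part of the cited algorithms.

```python
import numpy as np

def relative_value_iteration(P, r, eps=1e-8, max_iters=100_000):
    """Synchronous relative value iteration (RVI) for a finite MDP with known dynamics.

    P: transition probabilities, shape (|S|, |A|, |S|)
    r: expected rewards, shape (|S|, |A|)
    Returns a greedy policy, the final iterate v (relative values), and v[s_ref],
    which converges to the optimal gain v_g*.
    """
    n_states, n_actions, _ = P.shape
    s_ref = 0                        # arbitrary-but-fixed reference state
    v = np.zeros(n_states)           # iterate v_b^k
    for _ in range(max_iters):
        # Bellman optimality backup B*[v^k](s) = max_a { r(s,a) + sum_s' p(s'|s,a) v^k(s') }
        backup = (r + P @ v).max(axis=1)
        v_new = backup - v[s_ref]    # subtract the reference-state value (White, 1963)
        diff = v_new - v
        if diff.max() - diff.min() < eps:   # span seminorm stopping rule
            v = v_new
            break
        v = v_new
    greedy_policy = (r + P @ v).argmax(axis=1)
    return greedy_policy, v, v[s_ref]
```

Under the recurrence assumptions of Sec 2, the returned v[s_ref] mirrors the convergence of v̂_b^k(sref) to v_g∗ described above.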
The asynchronous version of RVI may diverge (Abounadi et al., 2001, p682, Gosavi, 2015, p232).
As a remedy, Jalali and Ferguson (1990) introduced the following update,
 
    v̂_b^{k+1}(s) = B∗[v̂_b^k(s)] − v̂_g^k,  ∀(s ≠ sref) ∈ S,    (Relative VI: asynchronous updates)
where v̂gk is the estimate of vg∗ at iteration k, and v̂bk (sref ) = 0, for all iterations k = 0, 1, . . .. It is
shown that this asynchronous method converges to produce a gain optimal policy in a regenerative
process, where there exists a single recurrent state under all stationary and deterministic policies.
In RL, the reward function r(s, a) and the transition distribution p(s′ |s, a) are unknown. Therefore,
it is convenient to define a relative (state-)action value q_b(π) of a policy π as follows,

    q_b(π, s, a) := lim_{tmax→∞} E_{St,At}[ ∑_{t=0}^{tmax−1} ( r(St, At) − v_g^π ) | S0 = s, A0 = a, π ]

                = r(s, a) − v_g^π + E[v_b^π(St+1)],  ∀(s, a) ∈ S × A,    (11)

where for brevity, we define q_b^π(s, a) := q_b(π, s, a), as well as q_b^∗(s, a) := q_b(π∗, s, a). The corresponding average-reward Bellman optimality equation is as follows (Bertsekas, 2012, p549),

    q_b^∗(s, a) + v_g^∗ = r(s, a) + ∑_{s′∈S} p(s′|s, a) max_{a′∈A} q_b^∗(s′, a′),  ∀(s, a) ∈ S × A,    (12)

where max_{a′∈A} q_b^∗(s′, a′) plays the role of v_b^∗(s′). The RHS of (12) is in the same form as the quantity maximized over all actions in (6). Therefore, the gain-optimal policy π∗ can be obtained simply by acting greedily over the optimal action value; hence, π∗ is deterministic. That is,

    π∗(s) = argmax_{a∈A} [ r(s, a) + ∑_{s′∈S} p(s′|s, a) max_{a′∈A} q_b^∗(s′, a′) − v_g^∗ ],  ∀s ∈ S,

where the bracketed quantity equals q_b^∗(s, a).

As can be observed, q_b^∗ combines the effects of p(s′|s, a) and v_b^∗ without estimating them separately, at the cost of an increased number of estimated values since typically |S| × |A| ≫ |S|. The benefit
is that action selection via q ∗b does not require the knowledge of r(s, a) and p(s′ |s, a). Note that vg∗
is invariant to state s and action a; hence its involvement in the above maximization has no effect.
Applying the idea of RVI to action values q_b^∗ yields the following iterate,

    q̂_b^{k+1}(s, a) = B∗[q̂_b^k(s, a)] − v̂_b^k(sref)    (Relative VI on q_b^∗: asynchronous updates)

                    = r(s, a) + ∑_{s′∈S} p(s′|s, a) max_{a′∈A} q̂_b^k(s′, a′) − max_{a′′∈A} q̂_b^k(sref, a′′),    (13)

where the first two terms on the RHS constitute B∗[q̂_b^k(s, a)], and the last term can be interpreted as v̂_g^∗. Here, q̂_b^k denotes the estimate of q_b^∗ at iteration k, and B∗ is based on (12) so it operates on action values. The iterates v̂_b^k(s) = max_{a∈A} q̂_b^k(s, a) are conjectured to converge to v_b^∗(s) for all s ∈ S by Bertsekas (2012, Sec 7.2.3).

3.2 Tabular methods


In model-free RL, the iteration for estimating q_b^∗ in (13) is carried out asynchronously as follows,

    q̂_b^∗(s, a) ← q̂_b^∗(s, a) + β [ r(s, a) + max_{a′∈A} q̂_b^∗(s′, a′) − v̂_g^∗ − q̂_b^∗(s, a) ],    (14)
where β is a positive stepsize, whereas s, a, and s′ denote the current state, current action, and
next state, respectively. Here, the sum over s′ in (13), i.e. the expectation with respect to S ′ , is
approximated by a single sample s′ . The stochastic approximation (SA) based update in (14) is the
essence of Qb-learning (Schwartz, 1993, Sec 5), and most of its variants. One exception is that of
Prashanth and Bhatnagar (2011, Eqn 8), where there is no subtraction of q̂b∗ (s, a).
In order to prevent the iterate q̂b∗ from becoming very large (causing numerical instability), Singh
(1994, p702) advocated assigning qb∗ (sref , aref ) ← 0, for arbitrary-but-fixed reference state sref and
action aref . Alternatively, Bertsekas and Tsitsiklis (1996, p404) advised setting q̂b∗ (sref , ·) ← 0.
Both suggestions seem to follow the heuristics of obtaining the unique solution of the underdeter-
mined Bellman optimality non-linear system of equations in (6).
There are several ways to approximate the optimal gain vg∗ in (14), as summarized in Fig 1 (Ap-
pendix B). In particular, Abounadi et al. (2001, Sec 2.2) proposed three variants as follows.
i. v̂_g^∗ ← q̂_b^∗(sref, aref) with a reference state sref and action aref. Yang et al. (2016) argued that properly choosing sref can be difficult in that the choice of sref affects the learning performance, especially when the state set is large. They proposed setting v̂_g^∗ ← c for a constant c from prior knowledge. Moreover, Wan et al. (2020) showed empirically that such a reference retards learning and causes divergence. This happens when (sref, aref) is infrequently visited, e.g. sref being a transient state in unichain MDPs.
ii. v̂_g^∗ ← max_{a′∈A} q̂_b^∗(sref, a′), used by Prashanth and Bhatnagar (2011, Eqn 8). This inherits the same issue regarding sref as before. Whenever the action set is large, the maximization over A should be estimated, yielding another layer of approximation errors.
iii. v̂_g^∗ ← ∑_{(s,a)∈S×A} q̂_b^∗(s, a) / (|S||A|), used by Avrachenkov and Borkar (2020, Eqn 11). Averaging all entries of q̂_b^∗ removes the need for sref and aref. However, because q̂_b^∗ itself is an estimating function with varying accuracy across state-action pairs, the estimate v̂_g^∗ inherits the averaged approximation error. The potential issue due to large state and action sets is also concerning.
Equation (14) with one of three proposals (i - iii) for vg∗ estimators constitutes the RVI Qb -learning.
Although it operates asynchronously, its convergence is assured by decreasing the stepsize β.
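As a concrete instance of update (14), below is a minimal sketch of tabular RVI Qb-learning that uses proposal iii above (averaging all entries of q̂_b^∗) as the gain estimate; the gym-like environment interface, ε-greedy exploration, and the constant stepsize are illustrative assumptions rather than prescriptions from the cited works.

```python
import numpy as np

def rvi_q_learning(env, n_states, n_actions, n_steps=100_000,
                   beta=0.05, epsilon=0.1, seed=0):
    """Tabular RVI Qb-learning, Eqn (14), with the optimal gain estimated as the
    average of all action-value entries (proposal iii above).

    `env` is assumed to expose a gym-like interface: env.reset() -> s and
    env.step(a) -> (s_next, r, ...); this interface is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))   # estimate of q_b^*
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy behaviour for exploration
        a = rng.integers(n_actions) if rng.random() < epsilon else q[s].argmax()
        s_next, r = env.step(a)[:2]
        gain_est = q.mean()               # v_g^* estimate (Avrachenkov and Borkar, 2020)
        td_error = r + q[s_next].max() - gain_est - q[s, a]
        q[s, a] += beta * td_error        # asynchronous update (14)
        s = s_next
    greedy_policy = q.argmax(axis=1)
    return q, greedy_policy, q.mean()
```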
The optimal gain vg∗ in (14) can also be estimated iteratively via SA as follows,
v̂g∗ ← v̂g∗ + βg ∆g , for some update ∆g and a positive stepsize βg . (15)
Thus, the corresponding Qb -learning becomes 2-timescale SA, involving both β and βg . There exist
several variations on ∆g as listed below.
i. In Schwartz (1993, Sec 5),

    ∆g := r(s, a) + max_{a′∈A} q̂_b^∗(s′, a′) − max_{a′∈A} q̂_b^∗(s, a′) − v̂_g^∗,  if a = argmax_{a′∈A} q̂_b^∗(s, a′),    (16)

where subtracting max_{a′∈A} q̂_b^∗(s, a′) aims to minimize the variance of updates, and the condition restricts updates to greedy actions a. By updating only when greedy actions are executed, the influence of exploratory actions (which are mostly suboptimal) can be avoided.
ii. In Singh (1994, Algo 3), and Wan et al. (2020, Eqn 6),

    ∆g := r(s, a) + max_{a′∈A} q̂_b^∗(s′, a′) − q̂_b^∗(s, a) − v̂_g^∗,    (to update at every action)

where the first three terms equal v_g^∗ in expectation of S′ when using q_b^∗ (see (12)). Since the equation for v_g^∗ (12) applies to any state-action pair, it is reasonable to update v̂_g^∗ for both greedy and exploratory actions as above. This also implies that information from non-greedy actions is not wasted. Hence, it is more sample-efficient than that of (16).
iii. In Singh (1994, Algo 4), and Das et al. (1999, Eqn 8),

    ∆g := r(s, a) − v̂_g^∗,  if a is chosen greedily, with βg set to 1/(nu + 1),

where nu denotes the number of v̂_g^∗ updates so far (15). This special value of βg makes the estimation equivalent to the sample average of the rewards received for greedy actions.
iv. In Bertsekas and Tsitsiklis (1996, p404), Abounadi et al. (2001, Eqn 2.9b), and Bertsekas (2012, p551),

    ∆g := max_{a′∈A} q̂_b^∗(sref, a′)  (= v̂_b^∗(sref)),  for an arbitrary reference state sref.    (17)
This benefits from having sref such that vb∗ (sref ) can be interpreted as vg∗ , while also satis-
fying the underdetermined system of average-reward Bellman optimality equations.
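For reference, the SA-based gain update (15) with variants i, ii, and iv of ∆g can be written as small helper functions operating on a tabular q̂_b^∗ array; this is a sketch with assumed argument names, not code taken from the cited works.

```python
def delta_g_schwartz(q, s, a, r, s_next, gain_est):
    """Variant i (Schwartz, 1993): returns Delta_g, or None when a is not greedy."""
    if a != q[s].argmax():
        return None                       # update only on greedy actions
    return r + q[s_next].max() - q[s].max() - gain_est

def delta_g_singh_wan(q, s, a, r, s_next, gain_est):
    """Variant ii (Singh, 1994; Wan et al., 2020): update at every action."""
    return r + q[s_next].max() - q[s, a] - gain_est

def delta_g_reference(q, s_ref):
    """Variant iv (Abounadi et al., 2001): use the reference-state value."""
    return q[s_ref].max()

def update_gain(gain_est, delta_g, beta_g=0.01):
    """SA update (15): ĝ ← ĝ + β_g ∆g, skipped when no update is prescribed."""
    return gain_est if delta_g is None else gain_est + beta_g * delta_g
```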

3.3 Methods with function approximation


We focus on parametric techniques for Qb -learning with function approximation. In such cases, the
action value is parameterized by a weight (parameter) vector w ∈ W = Rdim(w) , where dim(w) is
the number of dimensions of w. That is, q̂b∗ (s, a; w) ≈ qb∗ (s, a), ∀(s, a) ∈ S × A. Note that there
also exist non-parametric techniques, for instance, kernel-based methods (Ormoneit and Glynn,
2002, 2001) and those based on state aggregation (Ortner, 2007).
Bertsekas and Tsitsiklis (1996, p404) proposed the following weight update,
    w ← w + β { ( r(s, a) − v̂_g^∗ + max_{a′∈A} q̂_b^∗(s′, a′; w) ) ∇q̂_b^∗(s, a; w) − w }.    (18)

Particularly, the optimal gain vg∗ is estimated using (15, 17). They highlighted that even if q̂b∗ (w) is
bounded, v̂g∗ may diverge.
Das et al. (1999, Eqn 9) updated the weight using temporal difference (TD, or TD error) as follows,
    w ← w + β { r(s, a) − v̂_g^∗ + max_{a′∈A} q̂_b^∗(s′, a′; w) − q̂_b^∗(s, a; w) } ∇q̂_b^∗(s, a; w),    (19)

where the braced factor is the relative TD in terms of q̂_b^∗. This update can be interpreted as the parameterized form of (14), and is a semi-gradient update, similar to
its discounted-reward counterpart in Qγ -learning (Sutton and Barto, 2018, Eqn 16.3). This update
is adopted by Yang et al. (2016, Algo 2). The approximation for vg∗ can be performed in various
ways, as for tabular settings in Sec 3.2. Note that (19) differs from (18) in the use of TD, affecting
the update factor. In contrast, Prashanth and Bhatnagar (2011, Eqn 10) leveraged (14) in order to
suggest the following update,
    w ← w + β { r(s, a) − v̂_g^∗ + max_{a′∈A} q̂_b^∗(s′, a′; w) },  whose v̂_g^∗ is updated using (15, 17).
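The following sketch instantiates the semi-gradient update (19) with a linear approximator q̂_b^∗(s, a; w) = w⊺ f(s, a), pairing it with the SA-based gain update (15) using the TD of variant ii (Sec 3.2) rather than (15, 17); the environment and featurize interfaces, as well as the step sizes, are illustrative assumptions.

```python
import numpy as np

def semi_gradient_qb_learning(env, featurize, dim_w, n_actions, n_steps=200_000,
                              beta=0.01, beta_g=0.001, epsilon=0.1, seed=0):
    """Semi-gradient Qb-learning with linear function approximation, cf. Eqn (19).

    q̂_b^*(s, a; w) = w^T featurize(s, a); `featurize(s, a) -> np.ndarray of shape
    (dim_w,)` and the gym-like environment interface are assumed interfaces.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim_w)
    gain_est = 0.0

    def q_hat(s, a):
        return w @ featurize(s, a)

    s = env.reset()
    for _ in range(n_steps):
        if rng.random() < epsilon:
            a = rng.integers(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: q_hat(s, b))
        s_next, r = env.step(a)[:2]
        q_next_max = max(q_hat(s_next, b) for b in range(n_actions))
        td = r - gain_est + q_next_max - q_hat(s, a)   # relative TD in terms of q̂_b^*
        w += beta * td * featurize(s, a)               # semi-gradient update (19), linear case
        gain_est += beta_g * td                        # gain update (15), variant ii style
        s = s_next
    return w, gain_est
```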

4 Policy-iteration schemes
Instead of iterating towards the value of optimal policies, we can iterate policies directly towards
the optimal one. At each iteration, the current policy iterate is evaluated, then its value is used
to obtain the next policy iterate by taking the greedy action based on the RHS of the Bellman
optimality equation (6). The latter step differentiates this so-called policy iteration from naive policy
enumeration that evaluates and compares policies as prescribed in (5).
Like in the previous Sec 3, we begin with the original policy iteration in DP, which is then gener-
alized in RL. Afterward, we review average-reward policy gradient methods, including actor-critic
variants, because of their prominence and proven empirical successes. The last two sections are
devoted to approximate policy evaluation, namely gain and relative value estimations. Table 2 (in
Appendix A) summarizes existing works on average-reward policy-iteration-based model-free RL.

4.1 Foundations
In DP, Howard (1960) proposed the first policy iteration algorithm to obtain gain optimal policies
for unichain MDPs. It proceeds as follows.
Step 1: Initialize the iteration index k ← 0 and set the initial policy arbitrarily, π̂ k=0 ≈ π ∗ .
Step 2: Perform (exact) policy evaluation:
Solve the following underdetermined linear system for vgk and v kb ,
    v_b^k(s) + v_g^k = r(s, a) + ∑_{s′∈S} p(s′|s, a) v_b^k(s′),  ∀s ∈ S, with a = π̂^k(s),    (20)

which is called the Bellman policy expectation equation, also the Poisson equation
(Feinberg and Shwartz, 2002, Eqn 9.1).

Step 3: Perform (exact) policy improvement:
Compute a policy π̂ k+1 by greedy action selection (analogous to the RHS of (6)):

    π̂^{k+1}(s) ← argmax_{a∈A} { r(s, a) + ∑_{s′∈S} p(s′|s, a) v_b^k(s′) },  ∀s ∈ S,    (Synchronous updates)

where the maximized quantity equals q_b^k(s, a) + v_g^k.

Step 4: If stable, i.e. π̂ k+1 (s) = π̂ k (s), ∀s ∈ S, then output π̂ k+1 (which is equivalent to π ∗ ).
Otherwise, increment k, then go to Step 2.
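A minimal NumPy sketch of Steps 1-4 for a unichain MDP with known dynamics is given below; pinning v_b(sref) = 0 to resolve the underdetermined system (20), the choice sref = 0, and the array shapes are illustrative choices.

```python
import numpy as np

def policy_iteration(P, r, max_iters=1_000):
    """Howard-style policy iteration for a finite unichain MDP with known dynamics.

    P: transitions, shape (|S|, |A|, |S|); r: expected rewards, shape (|S|, |A|).
    Policy evaluation solves (20) with the extra constraint v_b(s_ref) = 0.
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)       # arbitrary initial policy
    for _ in range(max_iters):
        # --- Step 2: exact policy evaluation via the Poisson equation (20) ---
        P_pi = P[np.arange(n_states), policy]    # (|S|, |S|)
        r_pi = r[np.arange(n_states), policy]    # (|S|,)
        # Unknowns x = (v_b(0), ..., v_b(|S|-1), v_g): (I - P_pi) v_b + v_g 1 = r_pi
        A = np.zeros((n_states + 1, n_states + 1))
        A[:n_states, :n_states] = np.eye(n_states) - P_pi
        A[:n_states, -1] = 1.0
        A[-1, 0] = 1.0                           # constraint v_b(s_ref) = 0
        b = np.concatenate([r_pi, [0.0]])
        x = np.linalg.solve(A, b)
        v_b, gain = x[:n_states], x[-1]
        # --- Step 3: exact policy improvement (greedy w.r.t. the RHS of (6)) ---
        new_policy = (r + P @ v_b).argmax(axis=1)
        # --- Step 4: stop when the policy is stable ---
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, gain, v_b
```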
The above exact policy iteration is generalized to the generalized policy iteration (GPI) for RL
(Sutton and Barto, 2018, Sec 4.6). Such generalization is in the sense of the details of policy eval-
uation and improvement, such as approximation (inexact evaluation and inexact improvement) and
the update granularity, ranging from per timestep (incremental, step-wise) to per number (batch) of
timesteps.
GPI relies on the policy improvement theorem (Sutton and Barto, 2018, p101). It assures that any ε-greedy policy with respect to qb(π) is an improvement over any ε-soft policy π, i.e. any policy with π(a|s) ≥ ε/|A|, ∀(s, a) ∈ S × A. Moreover, ε-greediness is also beneficial for exploration. There exist several variations on policy improvement that all share a similar idea to ε-greedy,
i.e. updating the current policy towards a greedy policy. They include Chen-Yu Wei (2019, Eqn 5),
Abbasi-Yadkori et al. (2019a, Eqn 4), Hao et al. (2020, Eqn 4.1), and approximate gradient-ascent
updates used in policy gradient methods (Sec 4.2).

4.2 Average-reward policy gradient methods

In this section, we outline policy gradient methods, which have proven empirical successes in func-
tion approximation settings (Agarwal et al., 2019; Duan et al., 2016). They require explicit policy
parameterization, which enables not only learning appropriate levels of exploration (either control-
or parameter-based (Vemula et al., 2019, Miyamae et al., 2010, Fig 1)), but also injection of domain
knowledge (Deisenroth et al., 2013, Sec 1.3). Note that from a sensitivity-based point of view, policy
gradient methods belong to perturbation analysis (Cao, 2007, p18).
In policy gradient methods, a policy π is parameterized by a parameter vector θ ∈ Θ = Rdim(θ) ,
where dim(θ) indicates the number of dimensions of θ. In order to obtain a smooth dependence on θ
(hence, smooth gradients), we restrict the policy class to a set of randomized stationary policies ΠSR .
We further assume that π(θ) is twice differentiable with bounded first and second derivatives. For
discrete actions, one may utilize a categorical distribution (a special case of Gibbs/Boltzmann distri-
butions) as follows,

    π(a|s; θ) := exp(θ⊺ φ(s, a)) / ∑_{a′∈A} exp(θ⊺ φ(s, a′)),  ∀(s, a) ∈ S × A,    (soft-max in action preferences)

where φ(s, a) is the feature vector of a state-action pair (s, a) for this policy parameterization
(Sutton and Barto (2018, Sec 3.1), Bhatnagar et al. (2009a, Eqn 8)). Note that parametric policies
may not contain the optimal policy because there are typically fewer parameters than state-action
pairs, yielding some approximation error.
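For reference, a small sketch of the soft-max policy above and its score function ∇ log π(a|s; θ) = φ(s, a) − ∑_{a′} π(a′|s; θ) φ(s, a′) is given below; the per-state feature-matrix layout is an assumption.

```python
import numpy as np

def softmax_policy_probs(theta, phi_s):
    """Action probabilities of the soft-max policy in action preferences.

    theta: parameter vector, shape (d,)
    phi_s: feature matrix for one state, shape (|A|, d), row a = phi(s, a)
    """
    prefs = phi_s @ theta
    prefs -= prefs.max()                      # numerical stabilization
    expd = np.exp(prefs)
    return expd / expd.sum()

def softmax_log_policy_grad(theta, phi_s, a):
    """Score function: ∇_theta log π(a|s; θ) = φ(s,a) − E_{A∼π}[φ(s,A)]."""
    probs = softmax_policy_probs(theta, phi_s)
    return phi_s[a] - probs @ phi_s
```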
The policy improvement is based on the following optimization with vg (θ) := vg (π(θ)),

    θ∗ := argmax_{θ∈Θ} v_g(θ),  with iterative update:  θ ← θ + α C^{−1} ∇v_g(θ),    (21)
where α is a positive step length, and C ∈ Rdim(θ)×dim(θ) denotes some preconditioning positive
definite matrix. Based on (3), we have
    ∇v_g(θ) = ∑_{s∈S} ∑_{a∈A} r(s, a) ∇{ p⋆_π(s) π(a|s; θ) }    (∇ := ∂/∂θ)

            = ∑_{s∈S} ∑_{a∈A} p⋆_π(s) π(a|s; θ) r(s, a) { ∇ log π(a|s; θ) + ∇ log p⋆_π(s) }

            = ∑_{s∈S} ∑_{a∈A} p⋆_π(s) π(a|s; θ) q_b^π(s, a) ∇ log π(a|s; θ)    (q_b^π in lieu of r(s, a); does not involve ∇ log p⋆_π(s))

            = ∑_{s∈S} ∑_{a∈A} p⋆_π(s) π(a|s; θ) ∑_{s′∈S} p(s′|s, a) v_b^π(s′) ∇ log π(a|s; θ).    (22)

The penultimate equation above is due to the (randomized) policy gradient theorem (Sutton et al.
(2000, Thm 1), Marbach and Tsitsiklis (2001, Sec 6), Deisenroth et al. (2013, p28)). The last equa-
tion was proven to be equivalent by Castro and Meir (2010, Appendix B).
There exist (at least) two variants of preconditioning matrices C. First is through the second
derivative C = −∇2 vg (θ), as well as its approximation, see Furmston et al. (2016, Appendix B.1),
(Morimura et al., 2008, Sec 3.3, 5.2), Kakade (2002, Eqn 6). Second, one can use a Riemannian-
metric matrix for natural gradients. It aims to make the update directions invariant to the policy
parameterization. Kakade (2002) first proposed the Fisher information matrix (FIM) as such a ma-
trix, for which an incremental approximation was suggested by Bhatnagar et al. (2009a, Eqn 26). A
generalized variant of natural gradients was introduced by Morimura et al. (2009, 2008). In addition,
Thomas (2014) derived another generalization that allows for a positive semidefinite matrix. Recall
that FIM is only guaranteed to be positive semidefinite (hence, describing a semi-Riemannian man-
ifold); whereas the natural gradient ascent assumes the function being optimized has a Riemannian
manifold domain.
In order to obtain more efficient learning with efficient computation, several works propose using
backward-view eligibility traces in policy parameter updates. The key idea is to keep track of the
eligibility of each component of θ for getting updated whenever a reinforcing event, i.e. qbπ , occurs.
Given a state-action sample (s, a), the update in (21) becomes
    θ ← θ + α C^{−1} q_b^π(s, a) e_θ,  which is carried out after the eligibility update  e_θ ← λ_θ e_θ + ∇ log π(a|s; θ),    (23)

where e_θ ∈ R^{dim(θ)} denotes the accumulating eligibility vector for θ and is initialized to 0, whereas λ_θ ∈ (0, 1) is the trace decay factor for θ. This is used by Iwaki and Asada (2019,
Sec 4), Sutton and Barto (2018, Sec 13.6), Furmston et al. (2016, Appendix B.1), Degris et al. (2012,
Sec III.B), Matsubara et al. (2010a, Sec 4), and Marbach and Tsitsiklis (2001, Sec 5).
As can be observed in (22), computing the gradient ∇vg (θ) exactly requires (exact) knowledge of
p⋆π and qbπ , which are both unknown in RL. It also requires summation over all states and actions,
which becomes an issue when state and action sets are large. As a result, we resort to the gradient
estimate, which if unbiased, leads to stochastic gradient ascent.
In order to reduce the variance of gradient estimates, there are two common techniques based on
control variates (Greensmith et al., 2004, Sec 5, 6).
First is the baseline control variate, for which the optimal baseline is the one that minimizes the
variance. Choosing vbπ as a baseline (Bhatnagar et al., 2009a, Lemma 2) yields
    q_b^π(s, a) − v_b^π(s) = E_{S′}[ r(s, a) − v_g^π + v_b^π(S′) − v_b^π(s) | s, a ],  ∀(s, a) ∈ S × A,    (24)

where S′ denotes the next state given the current state s and action a; the LHS is the relative action advantage a_b^π(s, a), and the quantity inside the expectation is the relative TD δ_{v_b}^π(s, a, S′). It has been shown that a_b^π(s, a) = E_{S′}[ δ_{v_b}^π(s, a, S′) ], meaning that the TD, i.e. δ_{v_b}^π(s, a, S′), can be used as an unbiased
estimate for the action advantage (Iwaki and Asada, 2019, Prop 1, Castro and Meir, 2010, Thm 5,
Bhatnagar et al., 2009a, Lemma 3). Hence, we have
 
    ∇v_g(θ) = E_{S,A}[ a_b^π(S, A) ∇ log π(A|S; θ) ] ≈ δ_{v_b}^π(s, a, s′) ∇ log π(a|s; θ),
where the expectation of S, A, and S ′ is approximated with a single sample (s, a, s′ ). This yields
an unbiased gradient estimate with lower variance than that using qbπ . Note that in RL, the exact aπb
and vbπ (for calculating δvπb ) should also be approximated.
Second, a policy value estimator is set up and often also parameterized by w ∈ W = Rdim(w) . This
gives rise to actor-critic methods, where “actor” refers to the parameterized policy, whereas “critic”
the parameterized value estimator. The critic can take either one of these three forms,
i. relative state-value estimator v̂bπ (s; wv ) for computing the relative TD approximate δ̂vπb ,
ii. both relative state- and action-value estimators for âπb (s, a) ← q̂bπ (s, a; wq ) − v̂bπ (s; wv ),
iii. relative action-advantage estimator âπb (s, a; w a ),
for all s ∈ S, a ∈ A, and with the corresponding critic parameter vectors wv , wq , and wa . This
parametric value approximation is reviewed in Sec 4.4.
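The pieces above (a soft-max actor, the relative TD as the advantage estimate, and a critic of form i) can be combined into a one-step average-reward actor-critic; the following is a hedged sketch under assumed environment and feature interfaces, not an implementation of any particular cited algorithm.

```python
import numpy as np

def average_reward_actor_critic(env, phi_state, phi_sa, dim_w, dim_theta, n_actions,
                                n_steps=200_000, alpha=0.001, beta=0.01, beta_g=0.01,
                                seed=0):
    """One-step average-reward actor-critic sketch (critic form i above).

    Actor: soft-max policy with features phi_sa(s) -> (|A|, dim_theta).
    Critic: linear relative state value v̂_b(s; w) = w^T phi_state(s).
    The relative TD δ is used both as the advantage estimate for the actor (24)
    and for the SA gain update (27). Interfaces and step sizes are assumptions.
    """
    rng = np.random.default_rng(seed)
    theta, w, gain_est = np.zeros(dim_theta), np.zeros(dim_w), 0.0

    def policy_probs(s):
        prefs = phi_sa(s) @ theta
        prefs -= prefs.max()
        p = np.exp(prefs)
        return p / p.sum()

    s = env.reset()
    for _ in range(n_steps):
        probs = policy_probs(s)
        a = rng.choice(n_actions, p=probs)
        s_next, r = env.step(a)[:2]
        # relative TD: δ = r − ĝ + v̂_b(s') − v̂_b(s)
        delta = r - gain_est + w @ phi_state(s_next) - w @ phi_state(s)
        gain_est += beta_g * delta                    # gain update, cf. (27)
        w += beta * delta * phi_state(s)              # semi-gradient critic update, cf. (29)
        score = phi_sa(s)[a] - probs @ phi_sa(s)      # ∇ log π(a|s; θ) for the soft-max policy
        theta += alpha * delta * score                # actor update with C = I, cf. (21)
        s = s_next
    return theta, w, gain_est
```

Note that the actor step size α is set smaller than the critic's β, in line with the 2-timescale requirement discussed in Sec 4.4.2.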

4.3 Gain Approximation

We classify gain approximations into incremental and batch categories. Incremental methods update
their current estimates using information only from one timestep, hence step-wise updates. In con-
trast, batch methods use information from multiple timesteps in a batch. Every update, therefore,
has to wait until the batch is available. Notationally, we write g π := vg (π). The superscript π is
often dropped whenever the context is clear.

4.3.1 Incremental methods

Based on (4), we can define an increasing function f (g) := g − E[r(S, A)]. Therefore, the problem
of estimating g becomes finding the root of f(g), at which f(g) = 0. As can be observed, f(g) is unknown because E[r(S, A)] is unknown. Moreover, we can only obtain noisy observations of E[r(S, A)], namely r(s, a) = E[r(S, A)] + ε for some zero-mean additive noise ε. Thus, the noisy observation of f(ĝt) at iteration t is given by f̂(ĝt) = ĝt − r(st, at). Recall that an increasing function satisfies df(x)/dx > 0 for x > x∗, where x∗ indicates the root of f(x).
We can solve for the root of f (g) iteratively via the Robbins-Monro (RM) algorithm as follows,

    ĝ_{t+1} = ĝ_t − β_t f̂(ĝ_t) = ĝ_t − β_t { ĝ_t − r(s_t, a_t) }    (for an increasing function f)

            = ĝ_t + β_t { r(s_t, a_t) − ĝ_t } = (1 − β_t) ĝ_t + β_t r(s_t, a_t),    (25)

where βt ∈ (0, 1) is a gain approximation step size; note that setting βt ≥ 1 violates the RM algo-
rithm since it makes the coefficient of ĝt non-positive. This recursive procedure converges to the root
of f(g) with probability 1 under several conditions, including that εt is i.i.d. with E[εt] = 0, that the step size
satisfies the standard requirement in SA, and some other technical conditions (Cao, 2007, Sec 6.1).
The stochastic approximation technique in (25) is the most commonly used. For instance, Wu et al.
(2020, Algo 1), Heess et al. (2012), Powell (2011, Sec 10.7), Castro and Meir (2010, Eqn 13),
Bhatnagar et al. (2009a, Eqn 21), Konda and Tsitsiklis (2003, Eqn 3.1), Marbach and Tsitsiklis
(2001, Sec 4), and Tsitsiklis and Roy (1999, Sec 2).
Furthermore, a learning rate of βt = 1/(t + 1) in (25) yields

    ĝ_{t+1} = ( t ĝ_t + r(s_t, a_t) ) / (t + 1) = ( r(s_t, a_t) + r(s_{t−1}, a_{t−1}) + . . . + r(s_0, a_0) ) / (t + 1),    (26)
which is the average of a noisy gain observation sequence up to t. The ergodic theorem asserts that
for such a Markovian sequence, the time average converges to (4) as t approaches infinity (Gray,
2009, Ch 8). This decaying learning rate is suggested by Singh (1994, Algo 2). Additionally, a
variant of βt = 1/(t + 1)κ with a positive constant κ < 1 is used by Wu et al. (2020, Sec 5.2.1) for
establishing the finite-time error rate of this gain approximation under non-i.i.d Markovian samples.

Another iterative approach is based on the Bellman expectation equation (20) for obtaining the noisy observation of g. That is,

    ĝ_{t+1} = (1 − β_t) ĝ_t + β_t { r(s_t, a_t) + v̂_{b,t}^π(s_{t+1}) − v̂_{b,t}^π(s_t) }    (the braced term equals g in expectation of S_{t+1} when v̂_{b,t}^π = v_b^π; the variance is reduced, cf. (25))

            = ĝ_t + β_t { r(s_t, a_t) − ĝ_t + v̂_{b,t}^π(s_{t+1}) − v̂_{b,t}^π(s_t) },    (27)

where the braced term in (27) is the relative TD δ̂_{v_b,t}^π(s_t, a_t, s_{t+1}). In comparison to (25), the preceding update is anticipated to have lower variance due to the adjustment terms v̂_{b,t}^π(s_{t+1}) − v̂_{b,t}^π(s_t). This update is used by Singh (1994, Algo 1), Degris et al.
(2012, Sec 3B), and Sutton and Barto (2018, p333). An analogous update can be formulated by
replacing vbπ with qbπ , which is possible only when the next action at+1 is already available, as in
the differential Sarsa algorithm (Sutton and Barto, 2018, p251).
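The two incremental gain estimators, (25) and (27), reduce to one-line updates; a minimal sketch:

```python
def gain_update_sa(gain_est, reward, beta):
    """Robbins-Monro update (25): ĝ ← (1 − β) ĝ + β r."""
    return gain_est + beta * (reward - gain_est)

def gain_update_td(gain_est, reward, v_s, v_s_next, beta):
    """TD-based update (27): ĝ ← ĝ + β (r − ĝ + v̂_b(s') − v̂_b(s))."""
    return gain_est + beta * (reward - gain_est + v_s_next - v_s)
```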

4.3.2 Batch methods


In batch settings, there are at least two ways to approximate the gain of a policy, denoted as g.
First, ĝ is set to the total rewards divided by the number of timesteps in a batch, as in
(26), which also shows its incremental equivalence. This is used in (Powell, 2011, p339;
Yu and Bertsekas, 2009, Sec 2B; Gosavi, 2004a, Sec 4, Feinberg and Shwartz, 2002, p319). The
work of Marbach and Tsitsiklis (2001, Eqn 16) also falls in this category, but ĝ is computed as a
weighted average of all past rewards from all batches (regenerative cycles), mixing on- and off-
policy samples. Nonetheless, it is shown to yield lower estimation variance.
Second, based on a similar justification as in RVI, we can specify a reference state and action pair.
For instance, Cao (2007, Eqn 6.23, p311) proposed ĝ ← v̂bπ (sref ), whereas Lagoudakis (2003,
Appendix A) advocated ĝ ← q̂bπ (sref , aref ).

4.4 Relative state- and action-value approximation

4.4.1 Tabular methods


Singh (1994, Algo 1, 2) proposed the following incremental TD-based update,

    v̂_b^π(s) ← v̂_b^π(s) + β { r(s, a) − v̂_g^π + v̂_b^π(s′) − v̂_b^π(s) },  for a sample (s, a, s′),

where the braced term is the relative TD δ_{v_b}^π(s, a, s′). A similar update for q̂_b^π can be obtained by substituting v̂_b^π with q̂_b^π along with a sample for the next
action, as in Sarsa-like algorithms.
In a batch fashion, the relative action value can be approximated as follows,

    q_b^π(s_t, a_t) ≈ ∑_{τ=t}^{t^π_{sref}} ( r(s_τ, a_τ) − v̂_g^π ),  using a sample set {(s_τ, a_τ)}_{τ=t}^{t^π_{sref}},    (28)

where the summation constitutes an episode return, and t^π_{sref} denotes the timestep at which sref is visited while following a policy π, assuming
that sref is a recurrent state under all policies. This is used by Sutton et al. (2000, Sec 1), and
Marbach and Tsitsiklis (2001, Sec 6). Another batch approximation technique is based on the
inverse-propensity scoring (Chen-Yu Wei, 2019, Algo 3).

4.4.2 Methods with function approximation


Here, we review gradient-based approximate policy evaluation that relies on the gradient of an error
function to update its parameter. Note that there exist approximation techniques based on least
squares, e.g. LSPE(λ) (Yu and Bertsekas, 2009), gLSTD and LSTDc (Ueno et al., 2008), and LSTD-
Q (Lagoudakis, 2003, Appendix A).

Relative state-value approximation: The value approximator is parameterized as v̂ πb (w) ≈ v πb ,
where w ∈ W = Rdim(w) . Then, the mean squared error (MSE) objective is minimized. That is,
 
    ‖v_b^π − v̂_b^π(w)‖²_{diag(p⋆_π)} := E_{S∼p⋆_π}[ { v_b^π(S) − v̂_b^π(S; w) }² ],

where diag(p⋆_π) is an |S|-by-|S| diagonal matrix with p⋆_π(s) on its diagonal. As a result, the gradient estimate for updating w is given by

    −(1/2) ∇E_{S∼p⋆_π}[ { v_b^π(S) − v̂_b^π(S; w) }² ] = E_{S∼p⋆_π}[ { v_b^π(S) − v̂_b^π(S; w) } ∇v̂_b^π(S; w) ]

                                                     ≈ { v_b^π(s) − v̂_b^π(s; w) } ∇v̂_b^π(s; w)    (via a single sample s)

                                                     ≈ [ { r(s, a) − v̂_g^π + v̂_b^π(s′; w) } − v̂_b^π(s; w) ] ∇v̂_b^π(s; w)    (the braced term is the TD target that approximates v_b^π(s))

                                                     = δ̂_{v_b}^π(s, a, s′; w) ∇v̂_b^π(s; w).    (29)


The key difference to supervised learning lies in the fact that the true value vbπ (s)
is approximated by
a bootstrapping TD target involving v̂bπ (s′ ; w), which is biased due to its dependency on w. Such a
dependency, however, is not captured (hence, ignored) in the gradient in (29) because the TD target
plays the role of true values. As a result, this way of learning w belongs to semi-gradient methods.
With the same motivation as in policy gradient updates (23), one may use an (accumulating) eligi-
bility trace vector ev with reinforcing event δvπb . This leads to a backward view of TD(λ), whose
weight update is given by
    w_{t+1} = w_t + β_t δ̂_{v_b}^π(s_t, a_t, s_{t+1}) e_{v,t},  with  e_{v,t} = λ_v e_{v,t−1} + ∇v̂_b^π(s_t; w_t),    (30)

for a trace decay factor λ_v ∈ (0, 1) and e_{−1} = 0, such that e_{v,t} = ∑_{τ=0}^{t} λ_v^{t−τ} ∇v̂_b^π(s_τ; w_τ). For
linear function approximation, Tsitsiklis and Roy (1999, Thm 1) provide an asymptotic convergence
proof (with probability 1 with incremental gain estimation, as well as with fixed gain estimates)
and a bound on the resulting approximation error. This TD(λ) learning for v̂bπ is also used by
Sutton and Barto (2018, Eqn 11.4, p333), and Castro and Meir (2010, Eqn 13).
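A minimal sketch of this average-reward TD(λ) critic with linear features, combined with the incremental gain update (27), is given below; the policy and feature interfaces, the step sizes, and the trace decay factor are illustrative assumptions.

```python
import numpy as np

def linear_td_lambda_vb(env, policy, phi, dim_w, n_steps=100_000,
                        beta=0.01, beta_g=0.01, lambda_v=0.8, seed=0):
    """Average-reward TD(λ) for the relative state value, Eqn (30), with linear
    features v̂_b(s; w) = w^T phi(s) and the incremental gain estimate (27).

    `policy(s, rng)` returning an action and `phi(s)` returning a feature vector
    are assumed interfaces.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim_w)          # critic weights
    e = np.zeros(dim_w)          # accumulating eligibility trace, e_{-1} = 0
    gain_est = 0.0

    s = env.reset()
    for _ in range(n_steps):
        a = policy(s, rng)
        s_next, r = env.step(a)[:2]
        # relative TD: δ = r − ĝ + v̂_b(s') − v̂_b(s)
        delta = r - gain_est + w @ phi(s_next) - w @ phi(s)
        e = lambda_v * e + phi(s)        # trace update; ∇v̂_b(s; w) = phi(s) for linear v̂_b
        w += beta * delta * e            # weight update (30)
        gain_est += beta_g * delta       # gain update (27)
        s = s_next
    return w, gain_est
```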
As can be observed, there are 2 dedicated step lengths, i.e. the actor’s αt (23) and the critic’s βt
(30). Together, they form 2-timescale SA-type approximation, where both actor and critic are up-
dated at each iteration but with different step lengths, αt ≠ βt. Intuitively, the policy (actor) should
be updated on a slower timescale so that the critic has enough updates to catch up with the ever-
changing policy. This implies that αt should be smaller than βt . More precisely, we require that
limt→∞ βt /αt = ∞ for assuring the asymptotic convergence (Wu et al., 2020; Bhatnagar et al.,
2009a; Konda and Tsitsiklis, 2003). Nevertheless, Castro and Meir (2010) argued that from a bio-
logical standpoint, 2-timescale operation within the same anatomical structure is not well justified.
They analyzed the convergence of 1-timescale actor-critic methods to the neighborhood of a local
maximum of vg∗ .

Relative action-value and action-advantage approximation: Estimation for qbπ can be carried
out similarly as that for vbπ in (30), but uses the relative TD on action values, namely
    δ̂_{q_b}^π(s, a, s′, a′) := { r(s, a) − v̂_g^π + q̂_b^π(s′, a′; w) } − q̂_b^π(s, a; w),  with a′ := a_{t+1},

where the braced term is the TD target that approximates q_b^π(s, a), as in Sarsa (Sutton and Barto, 2018, p251), as well as Konda and Tsitsiklis (2003, Eqn 3.1). It is
also possible to follow this pattern but with different approaches to approximating the true value of
qbπ (s, a), e.g. via episode returns in (28) (Sutton et al., 2000, Thm 2).
For approximating the action advantage parametrically through âπb (w), the technique used in (29)
can also be applied. However, the true value aπb is estimated by TD on state values. That is,
    −(1/2) ∇E_{S∼p⋆_π, A∼π}[ { a_b^π(S, A) − â_b^π(S, A; w_a) }² ] = E[ { a_b^π(S, A) − â_b^π(S, A; w_a) } ∇â_b^π(S, A; w_a) ]

                                                                 ≈ { a_b^π(s, a) − â_b^π(s, a; w_a) } ∇â_b^π(s, a; w_a)    (approximation via a single sample (s, a))

                                                                 ≈ { δ̂_{v_b}^π(s, a, s′; w_v) − â_b^π(s, a; w_a) } ∇â_b^π(s, a; w_a).    (approximation via δ̂_{v_b}^π ≈ a_b^π(s, a), see (24))
This is used by Iwaki and Asada (2019, Eqn 17), Heess et al. (2012, Eqn 13), and Bhatnagar et al.
(2009a, Eqn 30). In particular, Iwaki and Asada (2019, Eqn 27) proposed a preconditioning matrix
for the gradients, namely I − κ( f(s, a) f⊺(s, a) ) / (1 + κ‖f(s, a)‖²), where f(s, a) denotes the feature vector of a state-action pair, while κ ≥ β_a is some scaling constant.
Furthermore, the action-value or action-advantage estimators can be parameterized linearly with the so-called θ-compatible state-action feature, denoted by f_θ(s, a), as follows,

    q̂_b^π(s, a; w) = w⊺ f_θ(s, a),  with the θ-compatible state-action feature  f_θ(s, a) := ∇ log π(a|s; θ),  ∀(s, a) ∈ S × A.    (31)

This parameterization, along with the minimization of the MSE loss, is beneficial for two reasons. First, their use with the (locally) optimal parameter w∗ ∈ W satisfies

    E_{S,A}[ q̂_b^π(S, A; w = w∗) ∇π(A|S; θ) ] = ∇v_g(θ),    (exact policy gradients when w = w∗, with q̂_b^π in lieu of q_b^π)

as shown by Sutton et al. (2000, Thm 2). Otherwise, the gradient estimate is likely to be biased (Bhatnagar et al., 2009a, Lemma 4). Second, they make computing natural gradients equivalent to finding the optimal weight w∗ for q̂_b^π(w). That is,

    ∑_{s∈S} ∑_{a∈A} p⋆_θ(s) π(a|s; θ) f_θ(s, a) ( f_θ⊺(s, a) w∗ − q_b^π(s, a) ) = 0    (since the MSE is minimized at w = w∗)

    { ∑_{s∈S} ∑_{a∈A} p⋆_θ(s) π(a|s; θ) f_θ(s, a) f_θ⊺(s, a) } w∗ = ∑_{s∈S} ∑_{a∈A} p⋆_θ(s) π(a|s; θ) q_b^π(s, a) f_θ(s, a),

where the braced matrix on the LHS is F_a(θ) and the RHS equals ∇v_g(θ), so that

    w∗ = F_a^{−1}(θ) ∇v_g(θ),  which is the natural gradient,    (32)

where F a (θ) denotes the Fisher (information) matrix based on the action distribution (hence, the
subscript a), as introduced by Kakade (2002, Thm 1). This implies that the estimation for natu-
ral gradients is reduced to a regression problem of state-action value functions. It can be either
regressing
i. action value qbπ (Sutton et al., 2000, Thm 2, Kakade, 2002, Thm 1, Konda and Tsitsiklis,
2003, Eqn 3.1),
ii. action advantage aπb (Bhatnagar et al., 2009a, Eqn 30, Algo 3, 4; Heess et al., 2012, Eqn 13;
Iwaki and Asada, 2019, Eqn 17), or
iii. the immediate reward r(s, A ∼ π) as a known ground truth (Morimura et al., 2008, Thm 1),
recall that the ground truth qbπ (Item i.) and aπb (Item ii.) above are unknown. This reward
regression is used along with a θ-compatible state-action feature that is defined differently
compared to (31), namely f θ (s, a) := ∇ log p⋆π (s, a) = ∇ log p⋆π (s) + ∇ log π(a|s; θ),
where f θ is based on the stationary joint state-action distribution p⋆π (s, a). In fact, this
leads to a new Fisher matrix, yielding the so-called natural state-action gradients F s,a
(which subsumes F a in (32) as a special case when ∇ log p⋆π is set to 0).
In addition, Konda and Tsitsiklis (2003) pointed out that the dependency of ∇vg (θ) to qbπ is only
dim(θ)
through its inner products with vectors in the subspace spanned by {∇i log π(a|s; θ)}i=1 . This
π
implies that learning the projection of qb onto the aforementioned (low-dimensional) subspace is
sufficient, instead of learning qbπ fully. A θ-dependent state-action feature can be defined based on
either the action distribution (as in (31)), or the stationary state-action distribution (as mentioned in
Item iii. in the previous passage).
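As an illustration of (32), the natural gradient can be estimated by a least-squares regression of sampled action-value estimates onto the θ-compatible features; the sketch below assumes a soft-max policy with a per-state feature matrix and leaves the choice of return estimates (e.g. (28)) open.

```python
import numpy as np

def natural_gradient_via_compatible_features(samples, theta, phi_sa, returns):
    """Estimate the natural gradient as in (32) by regressing sampled relative
    returns onto the θ-compatible features f_θ(s, a) = ∇ log π(a|s; θ).

    samples: list of (s, a) pairs collected under π(θ)
    returns: array of corresponding estimates of q_b^π(s, a); how these are
             obtained (e.g. episode returns as in (28)) is left open here.
    phi_sa(s) -> feature matrix of shape (|A|, dim(θ)) is an assumed interface.
    """
    def log_policy_grad(s, a):
        prefs = phi_sa(s) @ theta
        prefs -= prefs.max()
        probs = np.exp(prefs)
        probs /= probs.sum()
        return phi_sa(s)[a] - probs @ phi_sa(s)   # f_θ(s, a) for the soft-max policy

    F = np.stack([log_policy_grad(s, a) for s, a in samples])   # (N, dim(θ))
    # Least-squares solution of F w ≈ returns; w* approximates F_a(θ)^{-1} ∇v_g(θ).
    w_star, *_ = np.linalg.lstsq(F, np.asarray(returns), rcond=None)
    return w_star
```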

5 Discussion
In this section, we discuss several open questions as first steps towards completing the literature in
average-reward model-free RL. We begin with those of value iteration schemes (Sec 5.1), then of
policy iteration schemes focussing on policy evaluation (Sec 5.2). Lastly, we also outline several
issues that apply to both or beyond the two aforementioned schemes.

5.1 Approximation for optimal gain and optimal action-values (of optimal policies)

Two main components of Qb -learning are q̂b∗ and v̂g∗ updates. The former is typically carried out
via either (14) for tabular, or (19) for (parametric) function approximation settings. What remains
is determining how to estimate the optimal gain vg∗ , which becomes the main bottleneck for Qb -
learning.
As can be seen in Fig 1 (Appendix B), there are two classes of vg∗ approximators. First, approxi-
mators that are not SA-based generally need sref and aref specification, which is shown to affect
the performance especially in large state and action sets (Wan et al., 2020; Yang et al., 2016). On
the other hand, SA-based approximators require a dedicated stepsize βg , yielding more complicated
2-timescale Qb -learning. Furthermore, it is not yet clear whether the approximation for vg∗ should
be on-policy, i.e. updating only when a greedy action is executed, at the cost of reduced sample
efficiency. This begs the question of which approach to estimating vg∗ is “best” (in which cases).
In discounted reward settings, Hasselt (2010) (also, Hasselt et al. (2016)) pointed out that the approximation for E[ max_{a∈A} q_γ^∗(S_{t+1}, a) ] poses overestimation, which may be non-uniform and not concentrated at states that are beneficial in terms of exploration. He proposed instantiating two decoupled approximators such that

    E_{S_{t+1}}[ max_{a′∈A} q_γ^∗(S_{t+1}, a′) ] ≈ q̂_γ^∗( s_{t+1}, argmax_{a′∈A} q̂_γ^∗(s_{t+1}, a′; w_q^{(2)}); w_q^{(1)} ),    (Double Qγ-learning)

where w_q^{(1)} and w_q^{(2)} denote their corresponding weights. This was shown to be successful in
reducing the negative effect of overestimation. In average reward cases, the overestimation of qb∗
becomes more convoluted due to the involvement of v̂g∗ , as shown in (12). We believe that it is
important to extend the idea of double action-value approximators to Qb -learning.
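A speculative sketch of such an extension, adapting Hasselt's decoupled estimators to the tabular update (14), might look as follows; this is an illustrative reading of the suggestion above, not an algorithm from the surveyed literature.

```python
def double_qb_learning_step(q1, q2, s, a, r, s_next, gain_est, beta, beta_g, rng):
    """One hypothetical double-estimator step for tabular Qb-learning; q1 and q2
    are NumPy arrays of shape (|S|, |A|). Speculative sketch, not a cited method.
    """
    if rng.random() < 0.5:
        learner, selector = q1, q2
    else:
        learner, selector = q2, q1
    # action selected by the updated estimator, evaluated by the other one
    a_star = learner[s_next].argmax()
    target = r + selector[s_next, a_star] - gain_est
    td = target - learner[s, a]
    learner[s, a] += beta * td
    gain_est += beta_g * td          # SA-based gain update, cf. (15) with variant ii
    return gain_est
```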
To our knowledge, there is no finite-time convergence analysis for Qb -learning thus far. There
are also very few works on Qb -learning with function approximation. This is in contrast with its
discounted reward counterpart, e.g. sample complexity of Qγ -learning with UCB-exploration bonus
(Wang et al., 2020), as well as deep Qγ -learning neural-network (DQN) and its variants with non-
linear function approximation (Hessel et al., 2018).

5.2 Approximation for gain, state-, and action-values of any policy

Gain approximation v̂gπ : In order to have more flexibility in terms of learning methods, we can
parameterize the gain estimator, for instance, v̂gπ (s; wg ) := f ⊺ (s)w g , by which the gain estimate is
state-dependent. The learning uses the following gradients,
    −(1/2) ∇E_{S∼p⋆_π}[ { v_g^π − v̂_g^π(S; w_g) }² ] = E_{S∼p⋆_π}[ { v_g^π − v̂_g^π(S; w_g) } ∇v̂_g^π(S; w_g) ]

                                                   ≈ { v_g^π(s) − v̂_g^π(s; w_g) } ∇v̂_g^π(s; w_g)    (by a single sample s)

                                                   ≈ { r(s, a) + v̂_b^π(s′) − v̂_b^π(s) − v̂_g^π(s; w_g) } ∇v̂_g^π(s; w_g)    (the first three terms approximate the true v_g^π based on (20))

                                                   = δ̂_{v_b}^π(s, a, s′; w_g) ∇v̂_g^π(s; w_g).


Whether the above parameterized approximator for the gain v_g^π is beneficial in practice requires further investigation.

State-value approximation v̂bπ : We observe that most, if not all, relative state-value approxima-
tors are based on TD, leading to semi-gradient methods (Sec 4.4.2). In discounted reward settings,
there are several non-TD approaches. These include the true gradient methods (Sutton et al., 2009),
as well as those using the Bellman residual (i.e. the expected value of TD), such as Zhang et al.
(2019); Geist et al. (2017); see also surveys by Dann et al. (2014); Geist and Pietquin (2013).
It is interesting to investigate whether or not to use TD for learning v̂_b^π. For example, we may formulate a (true) gradient TD method that optimizes the mean-squared projected Bellman error (MSPBE), namely ‖v̂_b^π(w) − P[B_g^π[v̂_b^π(w)]]‖²_{diag(p⋆_π)}, with a projection operator P that projects any value approximation onto the space of representable parameterized approximators.

Action-value approximation q̂bπ : Commonly, the estimation for qbπ involves the gain estimate
v̂gπ as dictated by the Bellman expectation equation (20). It is natural then to ask: is it possible to
perform average-reward GPI without estimating the gain of a policy at each iteration? We speculate
that the answer is affirmative as follows.
Let the gain v_g^π be the baseline for q_b^π in a similar manner to v_b^π in (24); cf. in discounted reward settings, see Weaver and Tao (2001, Thm 1). Then, based on the Bellman expectation equation in action values (analogous to (12)), we have the following identity,

    q_b^π(s, a) − (−v_g^π) = r(s, a) + E_{St+1}[ v_b^π(St+1) ] =: q̄_b^π(s, a),    (note the different symbols: q_b^π vs the surrogate action-value q̄_b^π)

This surrogate can be parameterized as q̄_b^π(s, a; w). Its parameter w is updated using the following gradient estimates,

    −(1/2) ∇E_{S,A}[ { q̄_b^π(S, A) − q̄_b^π(S, A; w) }² ] = E_{S,A}[ { q̄_b^π(S, A) − q̄_b^π(S, A; w) } ∇q̄_b^π(S, A; w) ]

                                                        = E_{S,A,S′}[ { r(S, A) + v_b^π(S′) − q̄_b^π(S, A; w) } ∇q̄_b^π(S, A; w) ]    (the first two terms form the surrogate q̄_b^π(S, A))

                                                        ≈ { r(s, a) + v_b^π(s′) − q̄_b^π(s, a; w) } ∇q̄_b^π(s, a; w)    (using a single sample (s, a, s′))

                                                        ∝ { r(s, a) + v_b^π(s′) + (v_g^π + κ) − q̄_b^π(s, a; w) } ∇q̄_b^π(s, a; w),    (where v_b^π(s′) + (v_g^π + κ) is the surrogate state-value ν_b^π(s′); note the different symbols, v_b^π vs ν_b^π)
(Note different symbols, vbπ vs νbπ )
for some arbitrary constant κ ∈ R. The key is to exploit the fact that we have one degree of
freedom in the underdetermined linear systems for policy evaluation in (20). Here, the surrogate
state-value νbπ is equal to vbπ up to some constant, i.e. νbπ = vbπ + (vgπ + κ). One can estimate νbπ in
a similar fashion as v̂bπ in (29), except that now the gain approximation v̂gπ is no longer needed. The
parameterized estimator ν̂_b^π(w_ν) can be updated by following the gradient estimate below,
    −(1/2) ∇E_{S∼p⋆_π}[ { ν_b^π(S) − ν̂_b^π(S; w_ν) }² ] ≈ { r(s, a) + ν̂_b^π(s′; w_ν) − ν̂_b^π(s; w_ν) } ∇ν̂_b^π(s; w_ν),    (33)
which is based on a single sample (s, a, s′ ). In the RHS of (33) above, the TD of νbπ looks similar
to that of the discounted-reward v_γ, but with γ abusively set to 1. Therefore, its stability (whether it will grow
unboundedly) warrants experiments.

Extras: We notice that existing TD-based value approximators simply re-use the TD estimate δ̂vπb
obtained using the old gain estimate from the previous iteration, e.g. Sutton and Barto (2018, p251,
333). They do not harness the newly updated gain estimate for cheap recomputation of δ̂vπb . We
anticipate that doing so will yield more accurate evaluation of the current policy.
There are also open research questions that apply to both average- and discounted-reward policy eval-
uation, but generally require different treatment. They are presented below along with the parallel
works in discounted rewards, if any, which are put in brackets at the end of each bullet point.
• How sensitive is the estimation accuracy of v̂bπ with respect to the trace decay factor λ
in TD(λ)? Recall that v̂bπ involves v̂g , which is commonly estimated using δ̂vπb as in (27).
(cf. Sutton and Barto (2018, Fig 12.6))
• What are the advantages and challenges of natural gradients for the critic? Note that natural
gradients typically are applied only for the actor. (cf. Wu et al. (2017))
• As can be observed, the weight update in (30) resembles that of SGD with momentum.
It begs the question: what is the connection between backward-view eligibility traces
and momentum in gradient-based step-wise updates for both actor and critic parameters?
(cf. Vieillard et al. (2020); Nichols (2017); Xu et al. (2006))
• How to construct “high-performing” basis vectors for the value approximation? To what extent does the limitation of the θ-compatible critic (31) outweigh its benefit? Also notice
the Bellman average-reward bases (Mahadevan, 2009, Sec 11.2.4), as well as non-linear
(over-)parameterized neural networks. (cf. Wang et al. (2019))

• Among 3 choices for advantage approximation (Sec 4.2), which one is most beneficial?

5.3 Further open research questions

In the following passages, relevant works in discounted reward settings are mentioned, if any, inside
the brackets at the end of each part.

On batch settings: For on-policy online RL without an experience replay buffer, we pose the following questions. How to determine a batch size that balances the trade-off between collecting more samples with the current policy and updating the policy more often with fewer samples?
How to apply the concept of backward-view eligibility traces in batch settings? (cf. Harb and Precup
(2017)).

On value- vs policy-iteration schemes: As can be observed in Tables 1 and 2 (in Appendix A),
there is less work on value- than policy-iteration schemes; even less when considering only function
approximation settings. Our preliminary experiments suggest that although following value iteration
is straightforward (cf. policy iteration with evaluation and improvement steps), it is more difficult to
make it “work”, especially with function approximation. One challenge is to specify the proper offset, e.g. vg∗ or its estimate, in RVI-like methods to bound the iterates. Moreover, Mahadevan (1996a,
Sec 3.5) highlighted that the seminal value-iteration based RL, i.e. average reward Qb -learning, is
sensitive to exploration.
We posit the following questions. Is it still worthwhile to adopt the value-iteration scheme at all? Which scheme is more advantageous in terms of exploration in RL? How can both schemes be reconciled? (cf. O’Donoghue et al. (2016); Schulman et al. (2017); Wang et al. (2020)).
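For reference, a minimal tabular sketch of an RVI-style Qb-learning update is given below, using the reference-state offset f(Q) = max_a Q(sref, a) summarized in Table 1 (cf. Abounadi et al. (2001)); the known-simulator interface (arrays P and R) and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def rvi_q_learning(P, R, n_steps=50_000, beta=0.05, eps=0.1, s_ref=0, seed=0):
    """Tabular RVI-style Q-learning on a known-simulator MDP.

    P[s, a] is a next-state distribution, R[s, a] an expected reward.
    The offset f(Q) = max_a Q(s_ref, a) keeps the iterates bounded --
    exactly the step that becomes delicate with function approximation.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next = rng.choice(n_states, p=P[s, a])
        offset = Q[s_ref].max()                       # reference-based gain proxy
        Q[s, a] += beta * (R[s, a] - offset + Q[s_next].max() - Q[s, a])
        s = s_next
    return Q, Q[s_ref].max()                          # Q estimate and gain estimate

# Toy 4-state, 2-action MDP with random dynamics (for illustration only).
rng = np.random.default_rng(1)
P = rng.random((4, 2, 4)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((4, 2))
Q, gain_est = rvi_q_learning(P, R)
```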

On the distributional (cf. expected-value) perspective: There exist few works taking distributional views. Morimura et al. (2010) proposed estimating ∇ log pθ(s) for obtaining the gradient estimate ∇̂vg(θ) in (22), removing the need to estimate the action value qbθ. One important question is how to scale up the distribution (density) estimation for large RL problems. (cf. Bellemare et al. (2017)).

On MDP modeling: The broadest class that can be handled by existing average-reward RL is
the unichain MDP; note that most works assume the more specific class, i.e. the recurrent (ergodic)
MDP. To our knowledge, there is still no average reward model-free RL for multichain MDPs (which
is the most general class).
We also desire to apply average-reward RL to continuous state problems, for which we may benefit
from the DP theory on general state spaces, e.g. Sennott (1999, Ch 8). There are few attempts thus far, for instance, Yang et al. (2019), which is limited to the linear quadratic regulator with ergodic cost.

On optimality criteria: The average reward optimality is underselective for problems with tran-
sient states, for which we need (n = 0)-discount (bias) optimality, or even higher n-discount op-
timality. This underselectiveness motivates the weighted optimality in DP (Krass et al., 1992). In RL, Mahadevan (1996b) developed bias-optimal Q-learning.
At the other extreme, (n = ∞)-discount optimality is the most selective criterion. According to Puterman (1994, Thm 10.1.5), it is equivalent to Blackwell optimality, which intuitively says that once we look sufficiently far into the future, via Blackwell's discount factor γBw, no policy is better than the Blackwell-optimal policy. Moreover, optimizing the discounted reward does not require any knowledge about the MDP structure (i.e. the recurrent, unichain, or multichain classification). Therefore, one of the pressing questions is how to estimate such a γBw in RL.

Acknowledgments

We thank Aaron Snoswell, Nathaniel Du Preez-Wilkinson, Jordan Bishop, Russell Tsuchida, and
Matthew Aitchison for insightful discussions that helped improve this paper. Vektor is supported by
the University of Queensland Research Training Scholarship.

References
Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C., and Weisz, G. (2019a). POLI-
TEX: Regret bounds for policy iteration using expert prediction. In ICML. (p8, 24, 25, and 35)
Abbasi-Yadkori, Y., Lazic, N., and Szepesvari, C. (2019b). Model-free linear quadratic control via
reduction to expert prediction. In AISTATS. (p24 and 35)
Abdulla, M. S. and Bhatnagar, S. (2007). Reinforcement learning based algorithms for average cost
markov decision processes. Discrete Event Dynamic Systems, 17(1). (p33)
Abounadi, J., Bertsekas, D., and Borkar, V. S. (2001). Learning algorithms for markov decision
processes with average cost. SIAM J. Control Optim., 40(3). (p5, 6, 26, and 30)
Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). On the theory of policy gradient
methods: Optimality, approximation, and distribution shift. arXiv: 1908.00261. (p8)
Avrachenkov, K. and Borkar, V. S. (2020). Whittle index based q-learning for restless bandits with
average reward. arXiv:2004.14427. (p6, 23, 26, and 30)
Bagnell, J. A. D. and Schneider, J. (2003). Covariant policy search. In Proceeding of the Interna-
tional Joint Conference on Artifical Intelligence. (p32)
Bartlett, P. L. and Baxter, J. (2002). Estimation and approximation bounds for gradient-based rein-
forcement learning. Journal of Computer and System Sciences, 64(1). (p2)
Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement
learning. In ICML. (p16)
Bertsekas, D. P. (2012). Dynamic Programming and Optimal Control, Vol. II. Athena Scientific, 4th
edition. (p5, 6, 26, and 30)
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, 1st
edition. (p6, 7, 26, and 29)
Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic
algorithms. In NIPS. (p33)
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009a). Natural actor-critic algorithms.
Automatica, 45. (p8, 9, 10, 12, 13, 27, 28, 33, and 34)
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009b). Natural actor-critic algorithms.
Technical report. (p23)
Cao, X. (2007). Stochastic Learning and Optimization: A Sensitivity-Based Approach. International
Series on Discrete Event Dynamic Systems, v. 17. Springer US. (p4, 8, 10, 11, and 27)
Castro, D. D. and Mannor, S. (2010). Adaptive bases for reinforcement learning. In Machine
Learning and Knowledge Discovery in Databases. (p34)
Castro, D. D. and Meir, R. (2010). A convergent online single time scale actor critic algorithm.
Journal of Machine Learning Research. (p9, 10, 12, 23, 27, 28, and 34)
Castro, D. D., Volkinshtein, D., and Meir, R. (2009). Temporal difference based actor critic learning
- convergence and neural implementation. In NIPS. (p34)
Chang, H. S. (2009). Decentralized learning in finite markov chains: Revisited. IEEE Transactions
on Automatic Control, 54(7):1648–1653. (p2)
Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiawatha Sharma, and Rahul Jain (2019). Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. arXiv:1910.07072. (p8, 11, 23, 24, 28, and 36)
Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey
and comparison. Journal of Machine Learning Research, 15(24):809–883. (p14)

Das, T., Gosavi, A., Mahadevan, S., and Marchalleck, N. (1999). Solving semi-markov decision
problems using average reward reinforcement learning. Manage. Sci., 45(4). (p6, 7, 23, 26,
and 29)
Dean, S., Mania, H., Matni, N., Recht, B., and Tu, S. (2017). On the sample complexity of the linear
quadratic regulator. arXiv:1710.01688. (p2)
Degris, T., Pilarski, P. M., and Sutton, R. S. (2012). Model-free reinforcement learning with contin-
uous action in practice. In 2012 American Control Conference (ACC). (p9, 11, 27, and 34)
Degris, T., White, M., and Sutton, R. S. (2012). Off-policy actor-critic. In ICML. (p25)
Deisenroth, M. P., Neumann, G., and Peters, J. (2013). A survey on policy search for robotics.
Foundations and Trends in Robotics, 2(1–2):1–142. (p8 and 9)
Devraj, A. M., Kontoyiannis, I., and Meyn, S. P. (2018). Differential temporal difference learning.
arXiv:1812.11137. (p24 and 35)
Devraj, A. M. and Meyn, S. P. (2016). Differential td learning for value function approximation. In
2016 IEEE 55th Conference on Decision and Control (CDC), pages 6347–6354. (p35)
Dewanto, V. and Gallagher, M. (2021). Examining average and discounted reward optimality criteria
in reinforcement learning. arXiv:2107.01348. (p1)
Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep rein-
forcement learning for continuous control. In ICML. (p8)
Feinberg, E. A. and Shwartz, A. (2002). Handbook of Markov Decision Processes: Methods and
Applications, volume 40. Springer US. (p7, 11, and 27)
Furmston, T. and Barber, D. (2012). A unifying perspective of parametric policy search methods for
markov decision processes. In Advances in Neural Information Processing Systems 25. (p35)
Furmston, T., Lever, G., and Barber, D. (2016). Approximate newton methods for policy search in
markov decision processes. Journal of Machine Learning Research, 17(227). (p9 and 35)
Geist, M. and Pietquin, O. (2013). Algorithmic survey of parametric value function approximation.
IEEE Transactions on Neural Networks and Learning Systems, 24(6):845–867. (p14)
Geist, M., Piot, B., and Pietquin, O. (2017). Is the bellman residual a bad proxy? In Advances in
Neural Information Processing Systems 30. (p14)
Gosavi, A. (2004a). A reinforcement learning algorithm based on policy iteration for average reward:
Empirical results with yield management and convergence analysis. Machine Learning, 55(1).
(p11, 27, 28, and 33)
Gosavi, A. (2004b). Reinforcement learning for long-run average cost. European Journal of Oper-
ational Research, 155(3):654 – 674. (p29)
Gosavi, A. (2015). Simulation-Based Optimization: Parametric Optimization Techniques and Rein-
forcement Learning. Springer Publishing Company, Incorporated, 2nd edition. (p1 and 5)
Gosavi, A., Bandla, N., and Das, T. K. (2002). A reinforcement learning approach to a single leg airline revenue management problem with multiple fare classes and overbooking. IIE Transactions, 34(9). (p24 and 29)
Gray, R. M. (2009). Probability, Random Processes, and Ergodic Properties. Springer US, 2nd
edition. (p10)
Greensmith, E., Bartlett, P. L., and Baxter, J. (2004). Variance reduction techniques for gradient
estimates in reinforcement learning. J. Mach. Learn. Res., 5. (p9)
Hao, B., Lazic, N., Abbasi-Yadkori, Y., Joulani, P., and Szepesvari, C. (2020). Provably efficient
adaptive approximate policy iteration. (p8, 23, and 36)

Harb, J. and Precup, D. (2017). Investigating recurrence and eligibility traces in deep q-networks.
CoRR, abs/1704.05495. (p16)
Hasselt, H. V. (2010). Double q-learning. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J.,
Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23,
pages 2613–2621. Curran Associates, Inc. (p14)
Hasselt, H. v., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double q-learning.
In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. (p14)
Heess, N., Silver, D., and Teh, Y. W. (2012). Actor-critic reinforcement learning with energy-based
policies. In European Workshop on Reinforcement Learning, Proceedings of Machine Learning
Research. (p10, 13, 28, and 34)
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B.,
Azar, M. G., and Silver, D. (2018). Rainbow: Combining improvements in deep reinforcement
learning. In AAAI. AAAI Press. (p14)
Howard, R. A. (1960). Dynamic Programming and Markov Processes. Technology Press of the
Massachusetts Institute of Technology. (p7)
Iwaki, R. and Asada, M. (2019). Implicit incremental natural actor critic algorithm. Neural Net-
works, 109. (p9, 13, 28, and 35)
Jafarnia-Jahromi, M., Wei, C.-Y., Jain, R., and Luo, H. (2020). A model-free learning algorithm for
infinite-horizon average-reward mdps with near-optimal regret. (p23, 24, 26, and 31)
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning.
J. Mach. Learn. Res., 11:1563–1600. (p2)
Jalali, A. and Ferguson, M. J. (1990). A distributed asynchronous algorithm for expected average
cost dynamic programming. In 29th IEEE Conference on Decision and Control, pages 1394–1395
vol.3. (p5)
Kakade, S. (2002). A natural policy gradient. In NIPS. (p9, 13, 23, 25, 28, 32, and 34)
Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased
stochastic approximation scheme. (p2)
Konda, V. R. and Tsitsiklis, J. N. (2003). On actor-critic algorithms. SIAM J. Control Optim., 42(4).
(p10, 12, 13, 28, and 33)
Krass, D., Filar, J. A., and Sinha, S. S. (1992). A weighted markov decision process. Operations
Research, 40(6). (p16)
Lagoudakis, M. G. (2003). Efficient Approximate Policy Iteration Methods for Sequential Decision
Making in Reinforcement Learning. PhD thesis, Department of Computer Science, Duke Univer-
sity. (p11, 27, 28, and 33)
Liu, Q., Li, L., Tang, Z., and Zhou, D. (2018). Breaking the curse of horizon: Infinite-horizon
off-policy estimation. In Advances in Neural Information Processing Systems 31. (p24 and 25)
Mahadevan, S. (1996a). Average reward reinforcement learning: Foundations, algorithms, and
empirical results. Machine Learning. (p1, 2, 16, 23, 24, and 25)
Mahadevan, S. (1996b). Sensitive discount optimality: Unifying discounted and average reward
reinforcement learning. In ICML. (p16)
Mahadevan, S. (2009). Learning representation and control in markov decision processes: New
frontiers. Foundations and Trends in Machine Learning, 1(4):403–565. (p15)
Marbach, P. and Tsitsiklis, J. N. (2001). Simulation-based optimization of markov reward processes.
IEEE Transactions on Automatic Control, 46(2):191–209. (p9, 10, 11, 24, 27, 28, and 32)

Matsubara, T., Morimura, T., and Morimoto, J. (2010a). Adaptive step-size policy gradients with
average reward metric. In Proceedings of the 2nd Asian Conference on Machine Learning, JMLR
Proceedings. (p9 and 34)
Matsubara, T., Morimura, T., and Morimoto, J. (2010b). Adaptive step-size policy gradients with
average reward metric. In Proceedings of the 2nd Asian Conference on Machine Learning, ACML
2010, Tokyo, Japan, November 8-10, 2010, pages 285–298. (p23)
Miyamae, A., Nagata, Y., Ono, I., and Kobayashi, S. (2010). Natural policy gradient methods with
parameter-based exploration for control tasks. In Advances in Neural Information Processing
Systems 23. (p8)
Morimura, T., Osogami, T., and Shirai, T. (2014). Mixing-time regularized policy gradient. In
Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI’14. (p23
and 35)
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. (2008). A new natural policy gradient
by stationary distribution metric. In European Conference Machine Learning and Knowledge
Discovery in Databases ECML/PKDD, pages 82–97. Springer. (p9, 13, 23, and 34)
Morimura, T., Uchibe, E., Yoshimoto, J., and Doya, K. (2009). A generalized natural actor-critic
algorithm. In Advances in Neural Information Processing Systems 22. (p9, 23, and 34)
Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of loga-
rithmic stationary distributions for policy gradient reinforcement learning. Neural Computation,
22(2):342–376. (p16, 25, and 34)
Neu, G., Jonsson, A., and Gómez, V. (2017). A unified view of entropy-regularized markov decision
processes. CoRR, abs/1705.07798. (p2)
Nichols, B. D. (2017). A comparison of eligibility trace and momentum on sarsa in continuous
state-and action-space. In 2017 9th Computer Science and Electronic Engineering (CEEC), pages
55–59. (p15)
O’Donoghue, B., Munos, R., Kavukcuoglu, K., and Mnih, V. (2016). PGQL: combining policy
gradient and q-learning. CoRR, abs/1611.01626. (p16)
Ormoneit, D. and Glynn, P. W. (2001). Kernel-based reinforcement learning in average-cost prob-
lems: An application to optimal portfolio choice. In Advances in Neural Information Processing
Systems 13. (p7 and 23)
Ormoneit, D. and Glynn, P. W. (2002). Kernel-based reinforcement learning in average-cost prob-
lems. IEEE Transactions on Automatic Control, 47(10):1624–1636. (p7)
Ortner, R. (2007). Pseudometrics for state aggregation in average reward markov decision processes.
In Hutter, M., Servedio, R. A., and Takimoto, E., editors, Algorithmic Learning Theory, pages
373–387, Berlin, Heidelberg. Springer Berlin Heidelberg. (p7)
Peters, J. and Schaal, S. (2008). Natural actor-critic. Neurocomputing, 71(7). (p32)
Powell, W. (2011). Approximate Dynamic Programming: Solving the Curses of Dimensionality.
Wiley Series in Probability and Statistics. Wiley, 2nd edition. (p10, 11, and 27)
Prashanth, L. A. and Bhatnagar, S. (2011). Reinforcement learning with average cost for adaptive
control of traffic lights at intersections. In 2011 14th International IEEE Conference on Intelligent
Transportation Systems (ITSC), pages 1640–1645. (p6, 7, 24, 26, 30, and 33)
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
John Wiley & Sons, Inc., 1st edition. (p2, 3, and 16)
Qiu, S., Yang, Z., Ye, J., and Wang, Z. (2019). On the finite-time convergence of actor-critic algo-
rithm. In Optimization Foundations for Reinforcement Learning Workshop, NeurIPS 2019. (p36)
Schneckenreither, M. (2020). Average reward adjusted discounted reinforcement learning: Near-
blackwell-optimal policies for real-world applications. arXiv: 2004.00857. (p2 and 24)

Schulman, J., Abbeel, P., and Chen, X. (2017). Equivalence between policy gradients and soft
q-learning. CoRR, abs/1704.06440. (p16)

Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In ICML, pages 298–305. (p6, 23, 24, 26, and 29)

Sennott, L. I. (1999). Stochastic Dynamic Programming and the Control of Queueing Systems.
Wiley-Interscience, New York, NY, USA. (p16)

Silver, D., Sutton, R. S., and Müller, M. (2008). Sample-based learning and search with perma-
nent and transient memories. In Proceedings of the 25th International Conference on Machine
Learning, ICML ’08. (p2)

Singh, S. P. (1994). Reinforcement learning algorithms for average-payoff markovian decision pro-
cesses. AAAI ’94, pages 700–705. (p6, 10, 11, 23, 26, 27, 28, 29, 30, and 32)

Strens, M. J. A. (2000). A bayesian framework for reinforcement learning. In ICML. (p24)

Sutton, R. S. (1990). Integrated architecture for learning, planning, and reacting based on approx-
imating dynamic programming. In Proceedings of the Seventh International Conference (1990)
on Machine Learning. (p2)

Sutton, R. S. and Barto, A. G. (2018). Introduction to Reinforcement Learning. MIT Press. (p7, 8,
9, 11, 12, 15, 24, 27, 28, 33, and 35)

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora,
E. (2009). Fast gradient-descent methods for temporal-difference learning with linear function
approximation. In Proceedings of the 26th ICML. (p14)

Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (2000). Policy gradient methods for
reinforcement learning with function approximation. In NIPS. (p9, 11, 12, 13, 28, and 32)

Tadepalli, P. and Ok, D. (1998). Model-based average reward reinforcement learning. Artificial
Intelligence, 100(1). (p2)

Thomas, P. (2014). Genga: A generalization of natural gradient ascent with positive and negative
convergence results. In Proceedings of the 31st International Conference on Machine Learning,
volume 32 of Proceedings of Machine Learning Research. (p9)

Tsitsiklis, J. N. and Roy, B. V. (1999). Average cost temporal-difference learning. Automatica, 35(11). (p10, 12, 27, 28, and 32)

Ueno, T., Kawanabe, M., Mori, T., Maeda, S.-i., and Ishii, S. (2008). A semiparametric statistical
approach to model-free policy evaluation. In Proceedings of the 25th International Conference
on Machine Learning. (p11, 28, and 33)

Vemula, A., Sun, W., and Bagnell, J. A. (2019). Contrasting exploration in parameter and action
space: A zeroth-order optimization perspective. In The 22nd AISTATS 2019, Proceedings of
Machine Learning Research. (p8)

Vieillard, N., Scherrer, B., Pietquin, O., and Geist, M. (2020). Momentum in reinforcement learn-
ing. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and
Statistics, Proceedings of Machine Learning Research. PMLR. (p15)

Wan, Y., Naik, A., and Sutton, R. S. (2020). Learning and planning in average-reward markov
decision processes. (p6, 14, 24, 26, and 30)

Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural policy gradient methods: Global optimal-
ity and rates of convergence. (p15)

Wang, M. (2017). Primal-dual π learning: Sample complexity and sublinear run time for ergodic
markov decision problems. CoRR, abs/1710.06100. (p2)

Wang, Y., Dong, K., Chen, X., and Wang, L. (2020). Q-learning with {ucb} exploration is sample
efficient for infinite-horizon {mdp}. In International Conference on Learning Representations.
(p14 and 16)
Weaver, L. and Tao, N. (2001). The optimal reward baseline for gradient-based reinforcement
learning. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence,
UAI’01. (p15)
Wheeler, R. and Narendra, K. (1986). Decentralized learning in finite markov chains. IEEE Trans-
actions on Automatic Control, 31(6):519–526. (p2)
White, D. (1963). Dynamic programming, markov chains, and the method of successive approxima-
tions. Journal of Mathematical Analysis and Applications, 6(3):373 – 376. (p5)
Wu, Y., Mansimov, E., Grosse, R. B., Liao, S., and Ba, J. (2017). Scalable trust-region method
for deep reinforcement learning using kronecker-factored approximation. In Advances in Neural
Information Processing Systems 30. (p15)
Wu, Y., Zhang, W., Xu, P., and Gu, Q. (2020). A finite time analysis of two time-scale actor critic
methods. arXiv:2005.01350. (p10, 12, 27, and 36)
Xu, J., Liang, F., and Yu, W. (2006). Learning with eligibility traces in adaptive critic designs. In
2006 IEEE International Conference on Vehicular Electronics and Safety, pages 309–313. (p15)
Yang, S., Gao, Y., An, B., Wang, H., and Chen, X. (2016). Efficient average reward reinforcement
learning using constant shifting values. In Proceedings of the Thirtieth AAAI Conference on
Artificial Intelligence, AAAI’16. AAAI Press. (p6, 7, 14, 24, 25, 26, and 30)
Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A
case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Process-
ing Systems 32. (p16)
Yu, H. and Bertsekas, D. P. (2009). Convergence results for some temporal difference methods
based on least squares. IEEE Transactions on Automatic Control, 54(7):1515–1531. (p11, 27,
28, and 33)
Zhang, S., Boehmer, W., and Whiteson, S. (2019). Deep residual reinforcement learning. CoRR,
abs/1905.01072. (p14)

APPENDIX

A Tables of existing works


Tables 1 and 2 summarize the major contributions and experiments of the existing works on average-
reward model-free RL that are based on value- and policy-iteration schemes, respectively.

B Taxonomy for approximation techniques


We present taxonomies of approximation techniques. They are for optimal gain vg∗ in value-iteration
schemes (Fig 1), as well as for those in policy-iteration schemes, namely gain vgπ (Fig 2), state values
vbπ (Fig 3), and action-related values qbπ and aπb (Fig 4).

C Existing benchmarking environments


Different average-reward RL methods are often evaluated with different sets of environments, mak-
ing comparisons difficult. Therefore, it is imperative to recapitulate those environments in order to
calibrate the progress, as well as for a sanity check (before targeting more complex environments).
We compile a variety of environments that are used in existing works. They are classified into two
categories, namely continuing (non-episodic) and episodic environments.

C.1 Continuing environments


Continuing environments are those with no terminating goal state and hence continue on infinitely.
These environments induce infinite horizon MDPs and are typically geared towards everlasting real-
world domains found in areas such as business, stock market decision making, manufacturing, power
management, traffic control, server queueing operation and communication networks.
There are various continuing environments used in the average-reward RL literature; the more popular ones are detailed below. Other examples include Preventive Maintenance in Product Inventory
(Das et al., 1999), Optimal Portfolio Choice (Ormoneit and Glynn, 2001), Obstacle avoidance task
(Mahadevan, 1996a), and Multi-Armed Restless Bandits (Avrachenkov and Borkar, 2020).

C.1.1 Symbolic MDPs


In order to initially test the performance of algorithms in RL, simple environments are required to
use as a test bed. These environments often serve no practical purpose and are developed purely for
testing. Below we describe three such environments: n-state, n-cycle, and n-chain MDPs.

n-state (randomly-constructed) MDPs:


PI: Singh (1994); Kakade (2002); Morimura et al. (2008, 2009); Bhatnagar et al. (2009b);
Castro and Meir (2010); Matsubara et al. (2010b); Morimura et al. (2014); Chen-Yu Wei (2019);
Hao et al. (2020).
VI: Schwartz (1993); Singh (1994); Jafarnia-Jahromi et al. (2020).

These are environments in which an MDP with n states is constructed with randomized transition
matrices and rewards. In order to provide context and meaning, a background story is often pro-
vided for these systems such as in the case of DeepSea (Hao et al., 2020) where the environment is
interpreted as a diver searching for treasure.
One special subclass is the Generic Average Reward Non-stationary Environment Testbed
(GARNET), which was proposed by Bhatnagar et al. (2009b). It is parameterized as
GARNET(|S|, |A|, x, σ, τ ), where |S| is the number of states and |A| is the number of actions,
τ determines how non-stationary the problem is, and x is a branching factor that determines how many next states are available for each state-action pair. The set of next states is created by a random selection from the state set without replacement. Probabilities of going to each available next state are uniformly distributed, and the reward received for each transition is normally distributed with
standard deviation σ.
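A minimal sketch of constructing such a GARNET-style MDP is given below; it follows the description above under our own reading (uniform random transition probabilities that are then normalized, zero-mean rewards), and it omits the non-stationarity parameter τ.

```python
import numpy as np

def make_garnet(n_states, n_actions, branching, reward_std, seed=0):
    """Construct a (stationary) GARNET-style MDP as described above.

    For each (s, a), `branching` next states are drawn without replacement,
    transition probabilities over them are random and normalized, and each
    transition's reward is normal with std `reward_std` (mean assumed 0).
    The non-stationarity parameter tau is omitted in this sketch.
    """
    rng = np.random.default_rng(seed)
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            next_states = rng.choice(n_states, size=branching, replace=False)
            probs = rng.random(branching)
            P[s, a, next_states] = probs / probs.sum()
            R[s, a, next_states] = rng.normal(0.0, reward_std, size=branching)
    return P, R

P, R = make_garnet(n_states=30, n_actions=4, branching=2, reward_std=0.1)
```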

n-cycle (loop, circle) MDPs:
PI: None.
VI: Schwartz (1993); Mahadevan (1996a); Yang et al. (2016); Wan et al. (2020).

This group of problems has n cycles (circles, loops), with the state set defined by the number of cycles
and the number of states in each cycle. The state transitions are typically deterministic. At the state
in which the loops intersect, the agent must decide which loop to take. For each state inside the
loops, the only available actions are to stay or move to the next state. The reward function is set out
such that each loop leads to a different reward, with some rewards being more delayed than others.
Mahadevan (1996a) contains a 2-cycle problem. A robot receives a reward of +5 if it chooses the
cycle going from home to the printer which has 5 states. It receives +20 if it chooses the other cycle
which leads to the mail room and contains 10 states.

n-chain:
PI: Chen-Yu Wei (2019).
VI: Yang et al. (2016); Jafarnia-Jahromi et al. (2020).
These are environments involving n states arranged in a linear chain-like formation. The only two
available actions are to move forwards or backwards. Note that the term “chain” here does not mean
a recurrent class. All n-chain environments have one recurrent class.
Strens (2000) presents a more specific type, which is used in the OpenAI gym. In this variant, moving forward yields no reward, whereas going backwards returns the agent to the beginning and yields a small reward. Reaching the end of the chain gives a large reward. The state transitions are not deterministic, with each action having a small probability of resulting in the opposite transition.
RiverSwim is another n-chain style problem consisting of 6 states in a chain-like formation to rep-
resent positions across a river with a current flowing from right to left. Each state has two possible
actions; swim left with the current or right against the current. Swimming with the current is al-
ways successful (deterministic) but swimming against it can fail with some probability. The reward
function is such that the optimal policy is swimming to the rightmost state and remaining there.
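A minimal sketch of RiverSwim-style dynamics is given below; the specific failure probabilities and reward magnitudes are illustrative assumptions, since the description above does not fix them.

```python
import numpy as np

def riverswim_step(s, a, n_states=6, p_fail=0.35, p_stay=0.6, rng=None):
    """One step of a RiverSwim-style 6-state chain (illustrative numbers).

    a = 0: swim left with the current    -> always succeeds; small reward at state 0.
    a = 1: swim right against the current -> may fail; large reward at the far right.
    """
    rng = rng or np.random.default_rng()
    if a == 0:                                    # with the current (deterministic)
        s_next = max(s - 1, 0)
    else:                                         # against the current (stochastic)
        u = rng.random()
        if u < p_fail:
            s_next = max(s - 1, 0)
        elif u < p_fail + p_stay and 0 < s < n_states - 1:
            s_next = s
        else:
            s_next = min(s + 1, n_states - 1)
    if s == 0 and a == 0:
        r = 0.005                                 # small reward for staying left
    elif s == n_states - 1 and a == 1:
        r = 1.0                                   # large reward at the rightmost state
    else:
        r = 0.0
    return s_next, r
```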

C.1.2 Queuing
Another commonly used group of environments in RL involves the optimization of queuing systems. Common examples of queuing environments within the literature include call bandwidth allocation (Marbach and Tsitsiklis, 2001) and seating on airplanes (Gosavi et al., 2002).

Server-Access Control Queue:


PI: Sutton and Barto (2018): p252, Devraj et al. (2018); Abbasi-Yadkori et al. (2019a,b)
VI: Wan et al. (2020); Schneckenreither (2020).
The problem of controlling the access of queuing customers to a limited number of servers. Each
customer belongs to a priority group and the state of the system (MDP) is described by this priority
as well as the number of free servers. At each time-step, the servers become available with some
probability and the agent can accept or reject service to the first customer in each line. Accepting a
customer gives a reward proportional to the priority of that customer when their service is complete.
Rejecting a customer always results in a reward of 0. The goal of this problem is to maximize the
reward received by making decisions based on customer priority and the number of available servers.
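A minimal sketch of one step of this queuing problem is given below; the number of servers, the priority set, and the probability of a busy server freeing up are illustrative assumptions.

```python
import numpy as np

def access_control_step(free_servers, priority, accept, n_servers=10,
                        p_free=0.06, priorities=(1, 2, 4, 8), rng=None):
    """One step of the server-access control queue (illustrative parameters).

    State  : (number of free servers, priority of the customer at the head).
    Action : accept (True) or reject (False) the head customer.
    Reward : the customer's priority if accepted and a server is free, else 0.
    Busy servers each become free again with probability `p_free`.
    """
    rng = rng or np.random.default_rng()
    reward = 0.0
    if accept and free_servers > 0:
        reward = float(priority)
        free_servers -= 1
    # Each busy server frees up independently with probability p_free.
    busy = n_servers - free_servers
    free_servers += rng.binomial(busy, p_free)
    # The next customer's priority is drawn uniformly from the priority set.
    next_priority = int(rng.choice(priorities))
    return (free_servers, next_priority), reward
```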

Traffic Signal Control:


PI: Prashanth and Bhatnagar (2011); Liu et al. (2018)
VI: Prashanth and Bhatnagar (2011)
Maximizing the flow of traffic across a network of junctions through finding the optimal traffic signal
configuration. The controller periodically receives information about congestion which gives us a
state consisting of a vector containing information of queue length and elapsed time since each lane
turned red. Actions available are the sign configurations of signals which can be turned to green
simultaneously. The cost (negative reward) is the sum of all queue lengths and elapsed times across
the whole network. Queue length is used to reduce congestion and the elapsed time ensures fairness
for all lanes.

C.1.3 Continuous States and Actions
Many environments with applications to real world systems such as robotics can contain continuous
states and actions. The most common approach to these environments in RL is to discretize the
continuous variables; however, there are RL methods that can be used even if the MDP remains
continuous. Below we describe commonly used continuous environments from within the literature.

Linear Quadratic Regulators:


PI: Kakade (2002).
VI: None.
The Linear Quadratic Regulator (LQR) problem is fundamental in RL due to the simple structure
providing a useful tool for assessing different methods. The system dynamics are st+1 = Ast + Bat + ǫt, where ǫt is uniformly distributed random noise that is i.i.d. for each t ≥ 0. The cost (negative reward) function for the system is c(s, a) = s⊺Qs + a⊺Ra. Here, A, B, Q and R are all
matrices with proper dimensions. This problem is viewed as an MDP with state and action spaces
of S = Rd and A = Rk respectively. The optimal policy of LQR is a linear function of the state.
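A minimal sketch of the LQR dynamics and cost described above is given below; the particular matrices A, B, Q, R and the noise scale are illustrative assumptions.

```python
import numpy as np

def lqr_step(s, a, A, B, Q, R, noise_scale=0.1, rng=None):
    """One step of the LQR problem above: s' = A s + B a + eps, cost = s'Qs + a'Ra."""
    rng = rng or np.random.default_rng()
    eps = rng.uniform(-noise_scale, noise_scale, size=s.shape)   # i.i.d. uniform noise
    s_next = A @ s + B @ a + eps
    cost = s @ Q @ s + a @ R @ a
    return s_next, cost

# Illustrative 2-dimensional system; the optimal policy is linear in the state, a = -K s.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = 0.01 * np.eye(1)
s_next, c = lqr_step(np.array([1.0, 0.0]), np.array([0.5]), A, B, Q, R)
```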

Swing-up pendulum:
PI: Morimura et al. (2010); Degris et al. (2012); Liu et al. (2018).
VI: None.
Swinging a simulated pendulum with the goal of getting it into and keeping it in the vertical position.
The state of the system is described by the current angle and angular velocity of the pendulum. The
action is the torque applied at the base of the pendulum which is restricted to be within a practical
limit. The reward is the cosine of the pendulum's angle, which is proportional to the height of the pendulum's tip.
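A minimal sketch of one step of the swing-up pendulum is given below; the physical constants, the Euler integration, and the torque/speed limits are illustrative assumptions.

```python
import numpy as np

def pendulum_step(theta, theta_dot, torque, dt=0.05, g=9.8, m=1.0, l=1.0,
                  max_torque=2.0, max_speed=8.0):
    """One Euler step of a swing-up pendulum (assumed constants and dynamics).

    State  : angle theta (0 = upright) and angular velocity theta_dot.
    Action : torque at the base, clipped to a practical limit.
    Reward : cos(theta), proportional to the height of the pendulum tip.
    """
    torque = np.clip(torque, -max_torque, max_torque)
    theta_ddot = (3 * g / (2 * l)) * np.sin(theta) + (3 / (m * l ** 2)) * torque
    theta_dot = np.clip(theta_dot + theta_ddot * dt, -max_speed, max_speed)
    theta = theta + theta_dot * dt
    reward = np.cos(theta)
    return theta, theta_dot, reward
```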

C.2 Episodic environments


An episodic problem has at least one terminal state (note that multiple types of terminals can be
modeled as one terminal state when the model is concerned only with r(s, a), instead of r(s, a, s′)).
Once an agent enters the terminal state, the agent-environment interaction terminates. An episode
refers to a sequence of states, actions, and rewards from t = 0 until entering the terminal state. It is
typically assumed that the termination eventually occurs in finite time.

Grid-navigation:
PI: None.
VI: Mahadevan (1996a).

Tetris:
PI: Kakade (2002).
VI: Yang et al. (2016).

Atari Pacman:
PI: Abbasi-Yadkori et al. (2019a).
VI: None.

• Update only when the action is greedy:
  – Using reward r(s, a) only: Singh (1994, Algo 4), Das et al. (1999, Eqn 8)
  – Using reward r(s, a) with adjustment: Schwartz (1993, Sec 5)
• Update at every action:
  – SA (2-timescale Qb-learning), using reward r(s, a) with adjustment: Singh (1994, Algo 3), Wan et al. (2020, Eqn 6)
  – non-SA (1-timescale Qb-learning):
    ∗ Using a state reference sref: Bertsekas and Tsitsiklis (1996, p404), Abounadi et al. (2001, Eqn 2.9b), Bertsekas (2012, p551)
    ∗ Using state sref and action aref references: Abounadi et al. (2001, Sec 2.2), Prashanth and Bhatnagar (2011, Eqn 8, 10)
    ∗ Not using any references: Abounadi et al. (2001, Sec 2.2), Yang et al. (2016), Avrachenkov and Borkar (2020, Eqn 11), Jafarnia-Jahromi et al. (2020, Sec 3)

Figure 1: Taxonomy for optimal-gain approximation v̂g∗, used in value-iteration based average-reward model-free RL. Here, SA stands for stochastic approximation. For details, see Sec 3.2.

• Incremental:
  – Basic SA:
    ∗ Unspecified step-sizes: Tsitsiklis and Roy (1999, Sec 2), Marbach and Tsitsiklis (2001, Sec 4), Bhatnagar et al. (2009a, Eqn 21), Castro and Meir (2010, Eqn 13)
    ∗ Specified step-sizes: Singh (1994, Algo 2), Wu et al. (2020, Sec 5.2.1)
  – TD-based: Singh (1994, Algo 1), Degris et al. (2012, Sec 3B), Sutton and Barto (2018, p251, 333)
• Batch:
  – Using fixed references: Lagoudakis (2003, Appendix A), Cao (2007, Eqn 6.23, p311)
  – Sample averages: Marbach and Tsitsiklis (2001, Eqn 16), Feinberg and Shwartz (2002, p319), Gosavi (2004a, Sec 4), Yu and Bertsekas (2009, Sec 2B), Powell (2011, p339)
• Constant: Tsitsiklis and Roy (1999, Sec 5)

Figure 2: Taxonomy for gain approximation v̂gπ, used in policy-iteration based average-reward model-free RL. Here, SA stands for stochastic approximation, whereas TD for temporal difference. For details, see Sec 4.3.
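To make two branches of the taxonomy in Figure 2 concrete, below is a minimal sketch of an incremental TD-based gain update (cf. Singh (1994), Algo 1) and a running sample-average update (cf. Singh (1994), Algo 2); the function names are ours.

```python
def gain_td_update(v_g, r, v_b, s, s_next, beta_g):
    """Incremental, TD-based gain update (cf. Singh (1994), Algo 1):
    v_g <- v_g + beta_g * (r + v_b[s_next] - v_b[s] - v_g)."""
    return v_g + beta_g * (r + v_b[s_next] - v_b[s] - v_g)

def gain_sample_average(v_g, r, t):
    """Running sample average of rewards (cf. Singh (1994), Algo 2):
    v_g <- (t * v_g + r) / (t + 1)."""
    return (t * v_g + r) / (t + 1)
```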

• Tabular:
  – Incremental: Singh (1994, Algo 1, 2)
• Function approximator:
  – Gradient based: Tsitsiklis and Roy (1999, Thm 1), Castro and Meir (2010, Eqn 13), Sutton and Barto (2018, p333)
  – Least-square based: Ueno et al. (2008), Yu and Bertsekas (2009)

Figure 3: Taxonomy for state-value approximation v̂bπ, used in policy-iteration based average-reward model-free RL. For details, see Sec 4.4.

• Tabular:
  – Incremental: Sutton et al. (2000), Marbach and Tsitsiklis (2001), Gosavi (2004a)
  – Batch: Sutton et al. (2000, Sec 1), Marbach and Tsitsiklis (2001, Sec 6), Chen-Yu Wei (2019, Algo 3)
• Function approximator:
  – Gradient based:
    ∗ Parameterized action values q̂bπ(w) (using θ-compatible fθ(s, a), except where stated otherwise): Sutton et al. (2000, Thm 2), Kakade (2002, Thm 1), Konda and Tsitsiklis (2003, Eqn 3.1), Sutton and Barto (2018, p251) (unspecified f(s, a))
    ∗ Parameterized action advantages âπb(w) (using θ-compatible fθ(s, a)): Bhatnagar et al. (2009a, Eqn 30, Algo 3, 4), Heess et al. (2012, Eqn 13), Iwaki and Asada (2019, Eqn 17)
    ∗ TD approximates action advantages, i.e. δ̂vπb(wv) ≈ aπb: Bhatnagar et al. (2009a, Algo 1, 2), Sutton and Barto (2018, p333)
  – Least-square based: Lagoudakis (2003, Appendix A)

Figure 4: Taxonomy for action-value related approximation, i.e. action values q̂bπ and action advantages âπb, used in policy-iteration based average-reward model-free RL. Here, TD stands for temporal difference. For details, see Sec 4.4.

Table 1: Existing works on value-iteration based average-reward model-free RL. The label T, F,
or TF (following the work number) indicates whether the corresponding work deals with tabular,
function approximation, or both settings, respectively

Work 1 (T) Schwartz (1993): R-learning (Qb -learning)


Contribution Derive Qb -learning from its discounted reward counterpart, i.e. Qγ -learning. Update
v̂g∗ only when the executed action is greedy, in order to avoid skewing the approx-
imation due to exploratory actions. The v̂g∗ update involves maxa′ ∈A q̂b∗ (s′ , a′ ) −
maxa′ ∈A q̂b∗ (s, a′ ), which is an adjustment factor for minimizing the variance.
Experiment On a 50-state MDP and a 2-cycle MDP. Qb -learning outperforms Qγ -learning in terms
of the converged average reward v̂g∗ and the empirical rate of learning.
Work 2 (T) Singh (1994)
Contribution Suggest setting q̂b∗ (sref , aref ) ← 0 for an arbitrary reference state-action pair, in order
to prevent q̂b∗ from becoming large. Two approximation techniques for vg∗ . First,
modified version of Schwartz (1993) based on the Bellman optimality equation, so
that updates occur at every action, hence increasing sample efficiency. Second, update
v̂g∗ as the sample average of rewards received for greedy actions.
Experiment On 20,100-state and 5,10-action randomly-constructed MDPs. Both proposed algo-
rithms are shown to converge to vg∗ . No comparison to Schwartz (1993).
Work 3 (F) Bertsekas and Tsitsiklis (1996)
Contribution Qb -learning with function approximation (p404), whose parameter w is updated via
 
w ← (1 − β)w + β {r(s, a) + max_{a′∈A} q̂b∗(s′, a′; w) − v̂g∗} ∇q̂b∗(s, a; w),

for some positive stepsize β. Suggest setting q̂b∗ (sref , ·) ← 0 for an arbitrary-but-fixed
reference state sref . The optimal gain vg∗ is estimated using (15, 17).
Experiment None
Work 4 (F) Das et al. (1999); Gosavi et al. (2002); Gosavi (2004b)
Contribution Qb -learning on semi-MDPs, named Semi-Markov Average Reward Technique
(SMART). Use a feed-forward neural network to represent q̂b∗ (w) with the following
update (Eqn 9),
w ← w + β{r(s, a) − v̂g∗ + max_{a′∈A} q̂b∗(s′, a′; w) − q̂b∗(s, a; w)}∇q̂b∗(s, a; w).

Decay β (and ǫ-greedy exploration) slowly to 0 according to the Darken-Chang-Moody procedure (with 2 hyperparameters). Extended to λ-SMART using the
forward-view TD(λ).
Experiment Preventative maintenance for 1- and 5-product inventory (up to 1011 states). SMART
yields the final v̂g∗ whose mean lies within 4% of the exact vg∗ (Table 2). It outperforms
2 heuristic baselines (Table 5), namely operational-readiness and age-replacement.
Work 5 (T) Abounadi et al. (2001); Bertsekas (2012): Tabular RVI and SSP Qb -learning
Contribution First, asymptotic convergence for RVI Qb -learning based on ODE with the usual di-
minishing stepsize for SA (Sec 2.2, 3). Stepsize-free gain approximation, namely
v̂g∗ = f (q̂b∗ ), where f : R|S|×|A| 7→ R is Lipschitz, f (1sa ) = 1 and f (x + c1sa ) =
f (x)+c, ∀c ∈ R, 1sa denotes a constant vector of all 1’s in R|S|×|A| (Assumption 2.2).
For example,
f(q̂b∗) := q̂b∗(sref, aref),   f(q̂b∗) := max_a q̂b∗(sref, a),   f(q̂b∗) := (1/(|S||A|)) Σ_{s,a} q̂b∗(s, a).

Second, SSP Qb -learning based on an observation that the gain of any stationary pol-
icy is simply the ratio of expected total reward and expected time between 2 suc-
cessive visits to the reference state (Sec 2.3, Eqn 2.9a); also based on the contract-
ing value iteration (Bertsekas (2012): Sec 7.2.3). Asymptotic convergence based on
ODE and the usual diminishing stepsize for SA (Sec 4). The update for q̂b∗ involves
maxa′ ∈A q̂b∗ (s′ , a′ ), where q̂b∗ (s′ , a′ ) = 0 if s′ = sref (Bertsekas (2012): Eqn 7.15).
The optimal gain is approximated (in Eqn 2.9b) as
v̂g∗ ← Pκ[v̂g∗ + βg max_{a∈A} q̂b∗(sref, a)],

where Pκ denotes the projection on to an interval [−κ, κ] for some constant κ such
that vg∗ ∈ (−κ, κ). This v̂g∗ should be updated at a slower rate than q̂b∗ .
Experiment None
Work 6 (F) Prashanth and Bhatnagar (2011)
Contribution Qb -learning with function approximation (Eqn 10),
w ← w + β{r(s, a) − v̂g∗ + max_{a′∈A} q̂b∗(s′, a′; w)}, with v̂g∗ = max_{a′′∈A} q̂b∗(sref, a′′; w).

Experiment On traffic-light control. The parameter w does not converge, it oscillates. The pro-
posal resulted in worse performance, compared to an actor-critic method.
Work 7 (TF) Yang et al. (2016): Constant shifting values (CSVs)
Contribution Use a constant as v̂g∗ , which is inferred from prior knowledge, because RVI Qb -
learning is sensitive to the choice of sref when the state set is large. Argue that un-
bounded q̂b∗ (when v̂g∗ < vg∗ ) are acceptable as long as the policy converges. Hence,
derive a terminating condition (as soon as the policy is deemed stable), and prove that
such a convergence could be towards the optimal policy if CSVs are properly chosen.
Experiment On 4-circle MDP, RiverSwim, Tetris. Outperform SMART, RVI Qb -learning, R-
learning in both tabular and linear function approximation (with some notion of eli-
gibility traces).
Work 8 (T) Avrachenkov and Borkar (2020)
Contribution Tabular RVI Qb -learning for the Whittle index policy (Eqn 11). Approximate vg∗ using
the average of all entries of q̂ ∗b (same as Abounadi et al. (2001): Sec 2.2). Analyse the
asymptotic convergence.
Experiment Multi-armed restless bandits: 4-state problem with circulant dynamics, 5-state prob-
lem with restart. The proposal is shown to converge to the exact vg∗ .
Work 9 (T) Wan et al. (2020)
Contribution Tabular Qb -learning without any reference state-action pair. Show empirically that
such reference retards learning and causes divergence when (sref , aref ) is infrequently
visited (e.g. transient states in unichain MDPs). Prove its asymptotic convergence un-
der unichain MDPs, whose key is using TD for v̂g∗ estimates (same as Singh (1994):
Algo 3).
Experiment On access-control queuing. The proposal is empirically shown to be competitive in
terms of learning curves in comparison to RVI Qb -learning.
Work 10 (T) Jafarnia-Jahromi et al. (2020): Exploration Enhanced Q-learning (EE-QL)
Contribution A regret bound of Õ(√t̂max) with tabular non-parametric policies in weakly commu-
nicating MDPs (more general than the typically-assumed recurrent MDPs). The key
is to use a single scalar estimate v̂g∗ for all states, instead of maintaining the estimate
for each state. This concentrating approximation v̂g∗ is assumed to be available. For updating q̂b∗, use β = 1/√k for the sake of analysis, instead of the common β = 1/k.
Experiment On random MDPs and RiverSwim (weakly communicating variant). Outperform
MDP-OOMD and POLITEX, as well as model-based benchmarks UCRL2 and PSRL.

Table 2: Existing works on policy-iteration based average-reward model-free RL. The label E, I,
or EI (following the work number) indicates whether the corresponding work is (mainly) about
on-policy policy evaluation, policy improvement, or both, respectively.

Work 1 (E) Singh (1994)


Contribution Tabular policy evaluation: v̂bπ (s) ← {1 − βv }v̂bπ (s) + βv {r(s, a) + v̂bπ (s′ ) − v̂gπ },
with either Algo 1: v̂gπ ← (1 − βg )v̂gπ + βg {r(s, a) + v̂bπ (s′ ) − v̂bπ (st )}, based on SA
and Bellman expectation equation, or Algo 2: v̂gπ ← {(t × v̂gπ ) + r(s, a)}/(t + 1) as
the sample average of the rewards received for greedy actions. It is noted that βv has
dependency on both timesteps and current states, and βg on timesteps, but provides no
further explanation.
Experiment On 20-, 100-state and 5-, 10-action random MDPs. Algo 2 is better than Algo 1 with
respect to the absolute total errors relative to the exact vgπ (over 10 trials).
Work 2 (E) Tsitsiklis and Roy (1999)
Contribution On-policy TD(λ) with v̂bπ (w) as a linearly independent combination of fixed basis
functions. A proof of asymptotic convergence with probability 1, based on ODE and
SA (Thm 1). A bound on approximation errors and its dependence on a mixing factor
(Thm 3). The aforementioned convergence and error analysis are for adaptive gain
estimates; their counterpart for fixed gain estimates is also established in Thm 4 with
an additional mixing factor (for this case λ far from 1 is preferable).
Experiment None
Work 3 (I) Sutton et al. (2000): (Randomized) Policy Gradient Theorem
Contribution Policy gradient with function approximation for a parameterized policy π(θ), namely ∇vg(θ) = Σ_{s∈S} p^π(s) Σ_{a∈A} q̂b^π(s, a; w) ∇π(s, a; θ), with q̂b^θ(s, a; w) = w⊺∇ log π(a|s; θ), which is said to be compatible with the policy parameterization.

Thm 3 assures asymptotic convergence to a locally optimal policy, assuming bounded rewards, bounded ∇vg(θ), and step sizes satisfying standard SA requirements.
Experiment None
Work 4 (I) Marbach and Tsitsiklis (2001)
Contribution The same randomized policy gradient (Sec 6) as Sutton et al. (2000), but for a regener-
ative process with one recurrent state sref under every deterministic policy. Gradient
updates are either every step (incremental) or every regenerative cycle (batch). The
former uses eligibility traces (Sec 5) with λθ < 1, speeding up convergence by an or-
der of magnitude, while introducing a negligible bias. Using tabular qbθ . The estimate
v̂gπ is computed as a weighted average of all past rewards from all regenerative cycles;
resulting in lower variance (Eqn 16).
Experiment Call admission control, whose exact vg∗ = 0.8868. Exact gradient computation yields
v̂g∗ = 0.8808 (100 updates), whereas those with every step updates yield v̂g∗ = 0.8789
for λθ = 1 in 8 million updates, and v̂g∗ = 0.8785 for λθ = 0.99 in 1 million updates.
Work 5 (I) Kakade (2002): Natural Policy Gradient
Contribution Propose one possible Riemannian-metric matrix, namely F(θ) := ES∼p⋆θ[F(S; θ)], with F(s; θ) := EA[∇ log π(A|s; θ) ∇⊺ log π(A|s; θ)], which is the Fisher information matrix of the distribution π(A|s; θ), ∀s ∈ S. Recall that there exists a probability manifold (surface) corresponding to each state s, where π(A|s; θ) is a point on that manifold with coordinates θ. This F(s; θ) defines the same distance between 2 points (on the manifold of s) regardless of the policy parameterization (the choice of coordinates). Prove that F(θ)w∗ = ∇vg(θ), i.e. w∗ is the natural gradient, where w∗ minimizes the MSE and q̂b^θ(w) is θ-compatible (Thm 1). For derivation of F(θ) via
the trajectory distribution manifold, see Bagnell and Schneider (2003): Thm 1, and
Peters and Schaal (2008): Appendix A.
Experiment On 2-state MDP, 1-dimensional linear quadratic regulator, and scaled-down Tetris.
Natural gradients lead to a faster rate of learning, compared to standard gradients.
Work 6 (EI) Konda and Tsitsiklis (2003): A class of actor-critic algorithms
Contribution Interpret randomized policy gradients as an inner product of 2 real-valued functions
on S × A, i.e.
∂vg(θ)/∂θi = ⟨qb(θ), ∇i log π(θ)⟩θ = Σ_{s∈S} Σ_{a∈A} p⋆θ(s, a) qb^θ(s, a) ∇i log π(s, a; θ),

where ∇i log π(θ) denotes the i-th unit vector component of ∇ log π(θ) for i =
1, 2, . . . , dθ . Thus, in order to compute ∇vg (θ), it suffices to learn the projection
of qb (θ) onto the span of the unit vectors {∇i log π(θ); 1 ≤ i ≤ dθ } in R|S||A| . This
span should be contained in the span of the basis vectors of the linearly parameterized
critic, e.g. setting ∇i log π(θ) as such basis vectors. Using TD(λ) with backward-view
eligibility traces. The convergence analysis is based on the martingale approach.
Experiment None
Work 7 (E) Lagoudakis (2003): Least-Squares TD of Q for average reward (LSTDQ-AR)
Contribution Batch policy evaluation as part of least-squares policy iteration. Remark that because
there is no exponential drop due to a discount factor, the value function approxima-
tion (for average rewards) is more amenable to fitting with linear architectures (Ap-
pendix A).
Experiment None
Work 8 (E) Gosavi (2004a): QP-Learning
Contribution Tabular GPI-based methods with implicit policy representation (taking a greedy action
with respect to relative action values). Use TD for learning q̂bπ . Decay the learning rate,
as the inverse of the number of updates (timesteps). Convergence analysis to optimal
solution via ODE methods (Sec 6). This method is similar to Sarsa (Sutton and Barto
(2018): p251), but with batch gain approximation; note that the action value approxi-
mation has its own disjoint batch and is updated incrementally per timestep.
Experiment Airline yield management: 6 distinct cases, 10 runs (1 million timestep each). Im-
provements over the commonly used heuristic (expected marginal seat revenue): 3.2%
to 16%; whereas over Qγ=0.99 -learning: 2.3% to 9.3%.
Work 9 (E) Ueno et al. (2008): gLSTD and LSTDc
Contribution LSTD-based batch policy evaluation (on state values) via semiparametric statistical
inference. Using the so-called estimating-function methods, leading to gLSTD and
LSTDc. Analyze asymptotic variance of linear function approximation.
Experiment On a 4-state, 2-action MDP. The gLSTD and LSTDc are shown to have lower variance.
Work 10 (EI) Bhatnagar et al. (2009a); Prashanth and Bhatnagar (2011): Natural actor critic
Contribution An iterative procedure to estimate the inverse of the Fisher information matrix F −1 . It
is based on weighted average in SA and Sherman-Morrison formula (Eqn 26, Algo 2).
This F −1 is also used as a preconditioning matrix for the critic’s gradients (Eqn 35,
Algo 4). Use a θ-compatible parameterized action advantage approximator (critic),
whose parameter w is utilized as natural gradient estimates (Eqn 30, Algo 3). Based
on ODE, prove the asymptotic convergence to a small neighborhood of the set of local
maxima of vg∗ . Related to (Bhatnagar et al., 2008; Abdulla and Bhatnagar, 2007).
Experiment On Generic Average Reward Non-stationary Environment Testbed (GARNET).
Algo 3 with parametric action advantage approximator yields reliably good
performance in both small and large GARNETs, and outperforms that of
Konda and Tsitsiklis (2003).
Work 11 (E) Yu and Bertsekas (2009)
Contribution Analyze convergence and the rate of convergence of TD-based least squares batch
policy evaluation for vbπ , specifically LSPE(λ) algorithm for any λ ∈ (0, 1) and any
constant stepsize β ∈ (0, 1]. The non-expansive mapping is turned to contraction
through the choice of basis functions and a constant stepsize.
Experiment On 2- and 100-state randomly constructed MDPs. LSPE(λ) is competitive with
LSTD(λ).
Work 12 (I) Morimura et al. (2009, 2008): generalized Natural Actor-Critic (gNAC)
Contribution A variant of natural actor critic that uses the generalized natural gradient (gNG), i.e.
F sa (θ, κ) := κF s (θ) + F a (θ) with some constant κ ∈ [0, 1].
This linear interpolation controlled by κ can be interpreted as a continuous interpo-
lation with respect to the k-timesteps state-action joint distribution. In particular,
κ = 0 corresponds to ∞-timesteps (after converging to the stationary state distribu-
tion), whereas κ = 1 corresponds to 1-timestep. Provide an efficient implementation
for the gNG learning based on the estimating function theory, equipped with an aux-
iliary function for variance reduction (Lemma 1, Thm 1). The policy parameter is
updated by a gNG estimate, which is a solution of the estimating function.
Experiment On randomly constructed MDPs with 2 actions and 2, 5, 10, up to 100 states; using the
technique of Morimura et al. (2010) for ∇ps (θ). The generalized variant with κ = 1
converges to the same point as that of κ = 0.25, but with a slower rate of learning.
Both outperform the baseline NAC algorithm, which is equivalent to using κ = 0.
Work 13 (EI) Castro and Meir (2010); Castro and Mannor (2010); Castro et al. (2009)
Contribution ODE-based convergence analysis of 1-timescale actor-critic methods to a neighbor-
hood of a local maximum of vg∗ (instead of to a local maximum itself as in 2-timescale
variants). This single timescale is motivated by a biological context, i.e. unclear justi-
fication for 2 timescales operating within the same anatomical structure.
Experiment On GARNET, standard gradients, and linearly parameterized TD(λ) critic with eligi-
bility trace. During the initial phase, 1-timescale algorithm converges faster than that
of Bhatnagar et al. (2009a), whereas the long term behavior is problem-dependent.
Work 14 (I) Matsubara et al. (2010a): Adaptive stepsizes
Contribution An adaptive stepsize for both standard and natural policy gradient methods. It is
achieved by setting the distance between θ and (θ + ∆θ) to some user-specified con-
stant, where the effect of a change ∆θ on vg is measured by a Riemannian metric.
Thus, α(θ) = 1/{∇⊺ vg (θ)F −1 (θ)∇vg (θ)}. Estimate F (θ) as the mean of the expo-
nential recency-weighted average, i.e. F̂ ← F̂ + λ(∇vg (θ)∇⊺ vg (θ) − F̂ ).
Experiment On 3- and 20-state MDPs, with eligibility trace on policy gradients. The proposed
adaptive natural gradients lead to faster convergence than that of Kakade (2002).
Work 15 (I) Heess et al. (2012): Energy-based policy parameterization
Contribution Energy-based policy parameterization with latent variables (Eqn 3), which enables
complex non-linear relationship between actions and states. Incremental approxima-
tion techniques for the partition function and for two expectations used in computing
compatible features (Eqn 19).
Experiment Octopus arm (continuous states, discrete actions), compatible action advantage ap-
proximator. The proposal (which builds upon Bhatnagar et al. (2009a)) outperforms
the non-linear version of the Sarsa algorithm (Eqn 5).
Work 16 (EI) Degris et al. (2012)
Contribution Empirical study of actor-critic methods with backward-view eligibility traces for both
actor and critic. Parameterize continuous action policies using a normal distribution,
whose variance is used to scale the gradient.
Experiment On swing-up pendulum (continuous states and actions). Use standard and natural
gradients, and linearly parameterized state-value approximator with tile coding for
state feature extractions. The use of variance-scaled gradients and eligibility traces
significantly improves performance, but the use of natural gradients does not.
Work 17 (I) Morimura et al. (2014)
Contribution Regularization for policy gradient updates, i.e. −κ∇h(θ), where κ denote some scal-
ing factor, and h the hitting time, which controls the magnitude of the mixing time. A
TD-based approximation for the gradient of the hitting time, ∇h(θ).
Experiment On a 2-state MDP. The regularized policy gradient leads to faster learning, compared
to non-regularized ones (both standard and natural gradients). It also reduces the effect
of policy initialization on the learning performance.
Work 18 (I) Furmston et al. (2016); Furmston and Barber (2012)
Contribution Estimate the Hessian of vg (θ) via Gauss-Newton methods (Algo 2 in Appendix B.1).
Use eligibility traces for both the gradients and the Hessian approximates.
Experiment None
Work 19 (EI) Sutton and Barto (2018)
Contribution Present 2 algorithms: a) semi-gradient Sarsa with q̂bπ (w) (p251), and b) actor-Critic
with per-step updates and backward-view eligibility traces for both actor and critic
v̂bπ (w) (p333).
Experiment Access-control queueing with tabular action value approximators (p252).
Work 20 (E) Devraj et al. (2018); Devraj and Meyn (2016): grad-LSTD(λ)
Contribution Grad-LSTD(λ) for w∗ = argmin_w ‖v̂bπ(w) − vbπ‖_{pπs,1}, which is a quadratic program (Eqn 30, 37), instead of w∗ = argmin_w ‖v̂bπ(w) − vbπ‖²_{pπs} in the standard LSTD(1).
Analysis of convergence and error rate for linear parameterization. Claim that grad-
LSTD(λ) is applicable for models that do not have regeneration.
Experiment On a single-server queue with controllable service rate. With respect to the Bellman
error, grad-LSTD(λ) appears to converge faster than LSTD(λ). Moreover, the variance
of grad-LSTD’s parameters is smaller compared to those of LSTD.
Work 21 (EI) Abbasi-Yadkori et al. (2019a,b): POLITEX (Policy Iteration with Expert Advice)
Contribution Reduction of controls to expert predictions: in each state, there exists an expert al-
gorithm and the policy’s value losses are fed to that learning algorithm. Each policy
is represented as a Boltzmann distribution over the sum of action-value function esti-
mates of all previous policies, that is, π^{i+1}(a|s) ∝ exp(−κ Σ_{j=1}^{i} q̂_b^j(s, a)), where κ
is a positive learning rate and i indexes the policy iteration (a sketch of this update follows
this entry). Achieve a regret bound of Õ(t_max^{3/4}), which scales only in the number of features.
Experiment On 4- and 8-server queueing with linear action value approximation (Bertsekas’s
LSPE): POLITEX achieves similar performance to LSPI, and slightly outperforms
RLSVI. On Atari Pacman using non-linear TD-based action-value approximation (3
random seeds): POLITEX obtains higher scores than DQN, arguably due to more
stable learning.
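The Boltzmann policy above can be sketched as follows for a single state, where q_sum is the (user-maintained) vector, over actions, of summed action-value estimates of all previous policies; the max-subtraction is a standard numerical-stability trick and not part of the paper's description.

import numpy as np

def politex_policy(q_sum, kappa=0.1):
    # q_sum: vector over actions, sum of q-hat estimates of all previous policies.
    logits = -kappa * q_sum          # negative sign: estimates are treated as losses
    logits = logits - logits.max()   # numerical stability (assumption)
    p = np.exp(logits)
    return p / p.sum()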
Work 22 (E) Iwaki and Asada (2019): Implicit incremental natural actor critic (I2NAC)
Contribution A preconditioning matrix for the gradient used in â_b^π(w) updates, namely
I − κ(f(s, a)f⊺(s, a))/(1 + κ∥f(s, a)∥²), for some scaling κ ≥ β and some pos-
sible use of eligibility traces on f (Eqn 27). It is derived from the so-called implicit
TD method with linear function approximation. Provide asymptotic convergence anal-
ysis and show that the proposal is less sensitive to the learning rate and state-action
features.
Experiment None (the experiments are all in the discounted-reward setting)
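A minimal Python sketch of the preconditioning matrix above (eligibility traces omitted); f is the state-action feature vector and kappa the scaling constant from this entry, and the usage shown in the comment is an assumption.

import numpy as np

def i2nac_preconditioner(f, kappa):
    # P = I - kappa * f f^T / (1 + kappa * ||f||^2); the advantage parameters
    # would then be updated with the preconditioned gradient, e.g. w <- w + alpha * P @ g.
    d = f.size
    return np.eye(d) - kappa * np.outer(f, f) / (1.0 + kappa * np.dot(f, f))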
Work 23 (EI) Qiu et al. (2019)
Contribution A non-asymptotic convergence analysis of an actor-critic algorithm with linearly-
parameterized TD-based state-value approximation. The keys are to interpret the actor-
critic algorithm as a bilevel optimization problem (Prop 3.2), namely
max_{θ∈Θ} vg(π(θ)), subject to argmin_{w∈W, vg∈ℝ} ℓ(w, vg; θ), for some loss function ℓ,
and to decouple the actor (upper-level) and critic (lower-level) updates, instead of using
typical ODE-based techniques. Here, “decoupling” implies that at every iteration, the
critic starts from scratch to estimate the value of the actor’s policy. The actor converges
sublinearly in expectation to a stationary point, although its updates use biased
gradient estimates (due to critic approximation). The analysis assumes that the actor
receives i.i.d. samples.
Experiment None
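The decoupling above can be pictured with the following schematic Python loop, in which all callables are user-supplied placeholders; this sketches the bilevel interpretation, not the analyzed algorithm itself.

def decoupled_actor_critic(theta, sample_batch, fit_critic, policy_grad,
                           n_iters=100, alpha=0.01):
    for _ in range(n_iters):
        batch = sample_batch(theta)            # data collected under the current policy
        critic = fit_critic(batch)             # lower level: critic re-estimated from scratch
        g = policy_grad(theta, batch, critic)  # (biased) policy-gradient estimate
        theta = theta + alpha * g              # upper level: one actor step
    return theta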
Work 24 (EI) Chen-Yu Wei (2019): Optimistic online mirror descent (MDP-OOMD)
Contribution A regret bound of Õ(√t_max) with non-parametric tabular randomized policies in
weakly communicating MDPs. The key idea is to maintain an instance of an adversar-
ial multi-armed bandit at each state to learn the best action. Assume that tmax is large
enough so that mixing and hitting times are both smaller than tmax /4.
Experiment On a random MDP and JumpRiverSwim (both have 6 states and 2 actions). MDP-
OOMD outperforms standard Q-learning, but is on par with POLITEX.
Work 25 (EI) Wu et al. (2020)
Contribution A non-asymptotic analysis of 2-timescale actor-critic methods under non-i.i.d. Marko-
vian samples and with a linearly-parameterized critic v̂_b^π(w). Show a convergence
guarantee to an ǫ-approximate 1st-order stationary point of the non-concave policy
value function, i.e. ∥∇vg(θ)∥₂² ≤ ǫ, in at most Õ(ǫ^{−2.5}) samples when one sample is
used per iteration (Cor 4.10). Establish that for gain approximation with β_t = 1/(1 + t)^{κ1}
(see the step-size sketch after this entry),
we have
Σ_{t=τ}^{t̂_max} E[(v̂_g^θ − v_g^θ)²] = O(t̂_max^{κ1}) + O(t̂_max^{1−κ1} log t̂_max) + O(t̂_max^{1−2(κ2−κ1)}),

for some positive constants 0 < κ1 < κ2 < 1 that indicate the relationship between
actor’s and critic’s update rates (Sec 5.2.1).
Experiment None
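To illustrate the role of 0 < κ1 < κ2 < 1, here is a minimal Python sketch of the two step-size schedules, under the reading that the gain/critic estimates use exponent κ1 and the actor uses κ2; the particular exponent values are placeholders, not the paper's choices.

def two_timescale_stepsizes(t, kappa1=0.6, kappa2=0.9):
    # The analysis only requires 0 < kappa1 < kappa2 < 1.
    beta_t = 1.0 / (1.0 + t) ** kappa1   # critic / gain-estimate step size (faster timescale)
    alpha_t = 1.0 / (1.0 + t) ** kappa2  # actor step size (slower timescale)
    return alpha_t, beta_t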
Work 26 (EI) Hao et al. (2020): Adaptive approximate policy iteration (AAPI)
Contribution A learning scheme that operates in batches (phases) and achieves an Õ(t_max^{2/3}) regret
bound. It uses adaptive, data- and state-dependent learning rates, as well as side-
information (Eqn 4.1), i.e. a vector computable based on past information and being
predictive of the next loss (here, action values). The policy improvement is based
on the adaptive optimistic follow-the-regularized-leader (AO-FTRL), which can be
interpreted as regularizing each policy by the KL-divergence to the previous policy.
Experiment On randomly-constructed MDPs with 5, 10, 20 states, and 5, 10, 20 actions; DeepSea
environment with grid-based states and 2 actions; CartPole with continuous states
and 2 actions. AAPI outperforms POLITEX and RLSVI. The authors remark that smaller phase
lengths perform better, whereas the adaptive per-state learning rate is not effective in CartPole.
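As a generic illustration of the KL-regularized, optimistic policy improvement described in this entry (a Python sketch of the idea only, not AAPI's exact AO-FTRL update, and with a fixed rather than adaptive learning rate), the new per-state policy re-weights the previous one by the estimated action-value losses plus the side-information prediction of the next loss.

import numpy as np

def optimistic_kl_policy_improvement(pi_prev, q_hat, side_info, eta=0.1):
    # pi_prev, q_hat, side_info: vectors over actions for one state.
    logits = np.log(pi_prev) - eta * (q_hat + side_info)
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()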
