
Machine Learning, 22, 123-158 (1996)

© 1996 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Reinforcement Learning
with Replacing Eligibility Traces
SATINDER P. SINGH singh@psyche.mit.edu
Dept. of Brain and Cognitive Sciences
Massachusetts Institute of Technology, Cambridge, Mass. 02139

RICHARD S. SUTTON [email protected]


Dept. of Computer Science
University of Massachusetts, Amherst, Mass. 01003

Editor: Leslie Pack Kaelbling

Abstract. The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle
delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it
theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds
of trace assign credit to prior events according to how recently they occurred, but only the conventional trace
gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the
offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods
converge under repeated presentations of the training set to the same predictions as two well known Monte
Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that
the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace
TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related
to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the
long run. Computational results confirm these analyses and show that they are applicable more generally. In
particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity
on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using
a feature-based function approximator.

Keywords: reinforcement learning, temporal difference learning, eligibility trace, Monte Carlo method, Markov
chain, CMAC

1. Eligibility Traces

Two fundamental mechanisms have been used in reinforcement learning to handle delayed
reward. One is temporal-difference (TD) learning, as in the TD(λ) algorithm (Sutton,
1988) and in Q-learning (Watkins, 1989). TD learning in effect constructs an internal
reward signal that is less delayed than the original, external one. However, TD methods
can eliminate the delay completely only on fully Markov problems, which are rare in
practice. In most problems some delay always remains between an action and its effective
reward, and on all problems some delay is always present during the time before TD
learning is complete. Thus, there is a general need for a second mechanism to handle
whatever delay is not eliminated by TD learning.
The second mechanism that has been widely used for handling delay is the eligibility
trace.¹ Introduced by Klopf (1972), eligibility traces have been used in a variety of rein-

forcement learning systems (e.g., Barto, Sutton & Anderson, 1983; Lin, 1992; Tesauro,
1992; Peng & Williams, 1994). Systematic empirical studies of eligibility traces in con-
junction with TD methods were made by Sutton (1984), and theoretical results have
been obtained by several authors (e.g., Dayan, 1992; Jaakkola, Jordan & Singh, 1994;
Tsitsiklis, 1994; Dayan & Sejnowski, 1994; Sutton & Singh, 1994).
The idea behind all eligibility traces is very simple. Each time a state is visited it
initiates a short-term memory process, a trace, which then decays gradually over time.
This trace marks the state as eligible for learning. If an unexpectedly good or bad event
occurs while the trace is non-zero, then the state is assigned credit accordingly. In a
conventional accumulating trace, the trace builds up each time the state is entered. In a
replacing trace, on the other hand, each time the state is visited the trace is reset to 1
regardless of the presence of a prior trace. The new trace replaces the old. See Figure 1.
Sutton (1984) describes the conventional trace as implementing the credit assignment
heuristics of recency (more credit to more recent events) and frequency (more credit
to events that have occurred more times. The new replacing trace can be seen simply
as discarding the frequency heuristic while retaining the recency heuristic. As we show
later, this simple change can have a significant effect on performance.
Typically, eligibility traces decay exponentially according to the product of a decay
parameter, λ, 0 ≤ λ ≤ 1, and a discount-rate parameter, γ, 0 ≤ γ ≤ 1. The conventional
accumulating trace is defined by:²

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s) & \text{if } s \ne s_t; \\ \gamma\lambda\, e_t(s) + 1 & \text{if } s = s_t, \end{cases}$$

where $e_t(s)$ represents the trace for state s at time t, and $s_t$ is the actual state at time t.
The corresponding replacing trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s) & \text{if } s \ne s_t; \\ 1 & \text{if } s = s_t. \end{cases}$$

In a control problem, each state-action pair has a separate trace. When a state is visited
and an action taken, the state's trace for that action is reset to 1 while the traces for the
other actions are reset to zero (see Section 5).
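As a concrete illustration of the two update rules, here is a minimal sketch in Python (not from the paper; the array layout and function name are illustrative):

```python
import numpy as np

def update_traces(e, s_t, gamma, lam, kind="replace"):
    """Decay all traces by gamma*lambda, then mark the current state s_t.

    kind="accumulate": conventional trace, e[s_t] is incremented by 1.
    kind="replace":    replacing trace, e[s_t] is reset to 1.
    """
    e *= gamma * lam            # decay every trace
    if kind == "accumulate":
        e[s_t] += 1.0           # build up on revisits
    else:
        e[s_t] = 1.0            # replace: forget the old trace
    return e

# Example: state 3 visited twice in a row with gamma*lambda = 0.9
e = np.zeros(5)
for _ in range(2):
    e = update_traces(e, 3, 0.9, 1.0, kind="accumulate")
print(e[3])   # 1.9 for the accumulating trace (it would be 1.0 with kind="replace")
```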

Figure 1. Accumulating and replacing eligibility traces. (The figure plots, above a row of tick marks showing the times at which a state is visited, the conventional accumulating trace and the replacing trace for that state.)



For problems with a large state space it may be extremely unlikely for the exact
same state ever to recur, and thus one might think replacing traces would be irrelevant.
However, large problems require some sort of generalization between states, and thus
some form of function approximator. Even if the same states never recur, states with
the same features will. In Section 5 we show that replacing traces do indeed make a
significant difference on problems with a large state space when the traces are done on
a feature-by-feature basis rather than on a state-by-state basis.
The rest of this paper is structured as follows. In the next section we review the TD(λ)
prediction algorithm and prove that its variations using accumulating and replacing traces
are closely related to two Monte Carlo algorithms. In Section 3 we present our main
results on the relative efficiency of the two Monte Carlo algorithms. Sections 4 and 5
are empirical and return to the general case.

2. TD(λ) and Monte Carlo Prediction Methods

The prediction problem we consider is a classical one in reinforcement learning and


optimal control. A Markov chain emits on each of its transitions a reward, r_{t+1} ∈ ℝ,
according to a probability distribution dependent only on the pre-transition state, st, and
the post-transition state, st+l. For each state, we seek to predict the expected total
(cumulative) reward emitted starting from that state until the chain reaches a terminal
state. This is called the value of the state, and the function mapping states s to their
values V(s) is called the value function. In this paper, we assume no discounting (γ = 1)
and that the Markov chain always reaches a terminal state. Without loss of generality
we assume that there is a single terminal state, T, with value V(T) = 0. A single trip
from starting state to terminal state is called a trial.

2.1. TD(λ) Algorithms

The TD(λ) family of algorithms combine TD learning with eligibility traces to estimate
the value function. The discrete-state form of the TD(λ) algorithm is defined by

$$\Delta V_t(s) = \alpha_t(s)\, \big[ r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) \big]\, e_{t+1}(s), \qquad (1)$$

where $V_t(s)$ is the estimate at time t of V(s), $\alpha_t(s)$ is a positive step-size parameter,
$e_{t+1}(s)$ is the eligibility trace for state s, and $\Delta V_t(s)$ is the increment in the estimate
of V(s) determined at time t.³ The value at the terminal state is of course defined as
$V_t(T) = 0$, for all t. In online TD(λ), the estimates are incremented on every time step:
$V_{t+1}(s) = V_t(s) + \Delta V_t(s)$. In offline TD(λ), on the other hand, the increments $\Delta V_t(s)$
are set aside until the terminal state is reached. In this case the estimates $V_t(s)$ are
constant while the chain is undergoing state transitions, all changes being deferred until
the end of the trial.
There is also a third case in which updates are deferred until after an entire set of trials
have been presented. Usually this is done with a small fixed step size, at(s) = c~, and

with the training set (the set of trials) presented over and over again until convergence of
the value estimates. Although this "repeated presentations" training paradigm is rarely
used in practice, it can reveal telling theoretical properties of the algorithms. For example,
Sutton (1988) showed that TD(0) (TD(λ) with λ = 0) converges under these conditions
to a maximum likelihood estimate, arguably the best possible solution to this prediction
problem (see Section 2.3). In this paper, for convenience, we refer to the repeated
presentations training paradigm simply as batch updating. Later in this section we show
that the batch versions of conventional and replace-trace TD(1) methods are equivalent
to two Monte Carlo prediction methods.
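The distinction between online and offline updating can be made concrete with a short sketch. The following is illustrative Python, not the authors' implementation; the environment interface `step(s)`, returning a reward and next state, is an assumed placeholder. It processes a single trial of TD(λ) with accumulating traces, either applying each increment immediately or deferring all increments to the end of the trial.

```python
import numpy as np

def td_lambda_trial(V, step, s0, terminal, alpha, lam, gamma=1.0, online=True):
    """Run one trial of TD(lambda) with accumulating traces.

    online=True : V is updated on every time step.
    online=False: increments are accumulated in dV and applied at trial end.
    V[terminal] is assumed to be 0 and is never given a trace.
    """
    e = np.zeros_like(V)
    dV = np.zeros_like(V)
    s = s0
    while s != terminal:
        r, s_next = step(s)                       # sample one transition
        delta = r + gamma * V[s_next] - V[s]      # TD error
        e *= gamma * lam
        e[s] += 1.0                               # accumulating trace
        if online:
            V += alpha * delta * e                # apply immediately
        else:
            dV += alpha * delta * e               # defer until the end of the trial
        s = s_next
    if not online:
        V += dV
    return V
```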

2.2. Monte Carlo Algorithms

The total reward following a particular visit to a state is called the return for that visit.
The value of a state is thus the expected return. This suggests that one might estimate a
state's value simply by averaging all the returns that follow it. This is what is classically
done in Monte Carlo (MC) prediction methods (Rubinstein, 1981; Curtiss, 1954; Wasow,
1952; Barto & Duff, 1994). We distinguish two specific algorithms:

Every-visit MC: Estimate the value of a state as the average of the returns that have
followed all visits to the state.

First-visit MC: Estimate the value of a state as the average of the returns that have
followed the first visits to the state, where a first visit is the first time during a trial that
the state is visited.

Note that both algorithms form their estimates based entirely on actual, complete re-
turns. This is in contrast to TD(λ), whose updates (1) are based in part on existing
estimates. However, this is only in part, and, as λ → 1, TD(λ) methods come to more
and more closely approximate MC methods (Sections 2.4 and 2.5). In particular, the
conventional, accumulate-trace version of TD(λ) comes to approximate every-visit MC,
whereas replace-trace TD(λ) comes to approximate first-visit MC. One of the main points
of this paper is that we can better understand the difference between replace and accumu-
late versions of TD(λ) by understanding the difference between these two MC methods.
This naturally brings up the question that we focus on in Section 3: what are the relative
merits of first-visit and every-visit MC methods?

2.3. A Simple Example

To help develop intuitions, first consider the very simple Markov chain shown in Fig-
ure 2a. On each step, the chain either stays in S with probability p, or goes on to
terminate in T with probability 1 - p . Suppose we wish to estimate the expected number
of steps before termination when starting in S. To put this in the form of estimating a
value function, we say that a reward of +1 is emitted on every step, in which case V(S)

is equal to the expected number of steps before termination. Suppose that the only data
that has been observed is a single trial generated by the Markov chain, and that that trial
lasted 4 steps, 3 passing from S to S, and one passing from S to T, as shown in Figure
2b. What do the two MC methods conclude from this one trial?
We assume that the methods do not know the structure of the chain. All they know
is the one experience shown in Figure 2b. The first-visit MC method in effect sees a
single traversal from the first time S was visited to T. That traversal lasted 4 steps, so
its estimate of V(S) is 4. Every-visit MC, on the other hand, in effect sees 4 separate
traversals from S to T, one with 4 steps, one with 3 steps, one with 2 steps, and one
with 1 step. Averaging over these four effective trials, every-visit MC estimates V(S)
as (1 + 2 + 3 + 4)/4 = 2.5. The replace and accumulate versions of TD(1) may or may not
form exactly these estimates, depending on their α sequence, but they will move their
estimates in these directions. In particular, if the corresponding offline TD(1) method
starts the trial with these estimates, then it will leave them unchanged after experiencing
the trial. The batch version of the two TD(1) algorithms will compute exactly these
estimates.
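These two estimates can be reproduced with a few lines of code. The sketch below is illustrative Python (not from the paper); the single trial of Figure 2b is encoded as a state sequence with a reward of +1 per step.

```python
def first_visit_estimate(states, rewards, s):
    """Average of the returns following the first visit to s in each trial."""
    returns = []
    for traj_s, traj_r in zip(states, rewards):
        if s in traj_s:
            first = traj_s.index(s)
            returns.append(sum(traj_r[first:]))
    return sum(returns) / len(returns)

def every_visit_estimate(states, rewards, s):
    """Average of the returns following every visit to s, pooled over trials."""
    returns = []
    for traj_s, traj_r in zip(states, rewards):
        for t, st in enumerate(traj_s):
            if st == s:
                returns.append(sum(traj_r[t:]))
    return sum(returns) / len(returns)

# The single trial of Figure 2b: S -> S -> S -> S -> T, reward +1 per step.
states  = [["S", "S", "S", "S", "T"]]
rewards = [[1, 1, 1, 1]]      # rewards[t] is emitted on the transition out of states[t]
print(first_visit_estimate(states, rewards, "S"))   # 4.0
print(every_visit_estimate(states, rewards, "S"))   # (4+3+2+1)/4 = 2.5
```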
Which estimate is better, 4 or 2.5? Intuitively, the first answer appears better. The
only trial observed took 4 steps, so 4 seems like the best estimate of its expected value.
In any event, the answer 2.5 seems too low. In a sense, the whole point of this paper
is to present theoretical and empirical analyses in support of this intuition. We show
below that in fact the answer 4 is the only unbiased answer, and that 2.5, the answer of
every-visit MC and of conventional TD(1), is biased in a statistical sense.
It is instructive to compare these two estimates of the value function with the estimate
that is optimal in the maximum likelihood sense. Given some data, in this case a set
of observed trials, we can construct the maximum-likelihood model of the underlying
Markov process. In general, this is the model whose probability of generating the ob-
served data is the highest. Consider our simple example. After the one trial has been
observed, the maximum-likelihood estimate of the S-to-S transition probability is 3/4, the
fraction of the actual transitions that went that way, and the maximum-likelihood estimate
of the S-to-T transition probability is 1/4. No other transitions have been observed, so
they are estimated as having probability 0. Thus, the maximum-likelihood model of the
Markov chain is as shown in Figure 2c.

Figure 2. A simple example of a Markov prediction problem. The objective is to predict the number of steps until termination. (Panels: (a) the true process; (b) the observed trial, S → S → S → S → T; (c) the maximum-likelihood model.)

We define the ML estimate of the value function as the value function that would be
exactly correct if the maximum-likelihood model of the Markov process were exactly
correct. That is, it is the estimate equal to the correct answer if the estimate of the
Markov chain was not really an estimate, but was known with certainty. 4 Note that the
ML estimate makes full use of all the observed data.
Let us compute the ML estimate for our simple example. If the maximum-likelihood
model of the chain, as shown in Figure 2c, were exactly correct, what then would be the
expected number of time steps before termination? For each possible number of steps,
k, we can compute the probability of its occurring, and then the expected number, as
$$V_{ML}(S) = \sum_{k=1}^{\infty} \Pr(k)\, k = \sum_{k=1}^{\infty} (0.75)^{k-1}(0.25)\, k = 4.$$

Thus, in this simple example the ML estimate is the same as the first-visit MC estimate.
In general, these two are not exactly the same, but they are closely related. We establish
the relationship in the general case in Section 3.2.
Computing the ML estimate is in general very computationally complex. If the number
of states is n, then the maximum-likelihood model of the Markov chain requires O(n²)
memory, and computing the ML estimate from it requires roughly O(n³) computational
operations.⁵ The TD methods by contrast all use memory and computation per step that
is only O(n). It is in part because of these computational considerations that learning
solutions are of interest while the ML estimate remains an ideal generally unreachable in
practice. However, we can still ask how closely the various learning methods approximate
this ideal.
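To make the cost concrete, here is a sketch (illustrative Python, not from the paper) of how the ML estimate can be computed once the maximum-likelihood model has been built from transition counts: form the transition probabilities and expected one-step rewards among nonterminal states, then solve the linear system V = r + PV, which is the roughly O(n³) step referred to above.

```python
import numpy as np

def ml_value_estimate(counts, reward_sums, terminal):
    """Certainty-equivalence (ML) value estimate for an undiscounted absorbing chain.

    counts[i, j]      : number of observed i -> j transitions
    reward_sums[i, j] : total reward observed on i -> j transitions
    terminal          : index of the terminal state (its value is fixed at 0)
    """
    n = counts.shape[0]
    nonterm = [i for i in range(n) if i != terminal]
    visits = counts.sum(axis=1)
    P = np.zeros((n, n))
    R = np.zeros((n, n))
    for i in nonterm:
        if visits[i] > 0:
            P[i] = counts[i] / visits[i]                          # ML transition probabilities
            R[i] = np.divide(reward_sums[i], counts[i],
                             out=np.zeros(n), where=counts[i] > 0)  # average observed rewards
    r = (P * R).sum(axis=1)                                       # expected one-step reward
    A = np.eye(len(nonterm)) - P[np.ix_(nonterm, nonterm)]
    V = np.zeros(n)
    V[nonterm] = np.linalg.solve(A, r[nonterm])                   # solve V = r + P V
    return V

# The one observed trial of Figure 2b: 3 S->S transitions and 1 S->T, reward +1 each.
counts      = np.array([[3.0, 1.0], [0.0, 0.0]])    # state 0 = S, state 1 = T
reward_sums = np.array([[3.0, 1.0], [0.0, 0.0]])
print(ml_value_estimate(counts, reward_sums, terminal=1)[0])   # 4.0
```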

2.4. Equivalence of Batch TD(1) and MC Methods

In this subsection we establish that the replace and accumulate forms of batch TD(1) are
equivalent, respectively, to first-visit and every-visit MC. The next subsection proves a
similar equivalence for the offline TD(1) algorithms.
The equivalence of the accumulate-trace version of batch TD(1) to every-visit MC
follows immediately from prior results. Batch TD(1) is a gradient-descent procedure
known to converge to the estimate with minimum mean squared error on the training set
(Sutton, 1988; Dayan, 1992; Barnard, 1993). In the case of discrete states, the minimum
MSE estimate for a state is the sample average of the returns from every visit to that
state in the training set, and thus it is the same as the estimate computed by every-visit
MC.
Showing the equivalence of replace-trace batch TD(1) and first-visit MC requires a
little more work.

THEOREM 1: For any training set of N trials and any fixed α_t(s) = α < 1/N, batch
replace TD(1) produces the same estimates as first-visit MC.
Proof: In considering updates to the estimates for any state s we need only consider
trials in which s occurs. On trials in which s does not occur, the estimates of both
methods are obviously unchanged. We index the trials in which state s occurs from 1 to
N(s). Let t_s(n) be the time at which state s is first visited in trial n, and let t_T(n) be
the time at which the terminal state is reached. Let V_i^R(s) represent the replace TD(1)
estimate of the value of state s after i passes through the training set, for i ≥ 1:

$$
\begin{aligned}
V_{i+1}^R(s) &= V_i^R(s) + \alpha \sum_{n=1}^{N(s)} \sum_{t=t_s(n)}^{t_T(n)-1} \big[ r_{t+1} + V_i^R(s_{t+1}) - V_i^R(s_t) \big] \\
&= V_i^R(s) + \alpha \sum_{n=1}^{N(s)} \Big[ -V_i^R(s_{t_s(n)}) + \sum_{t=t_s(n)}^{t_T(n)-1} r_{t+1} \Big] \\
&= V_i^R(s) + \alpha \sum_{n=1}^{N(s)} \big[ R(t_s(n)) - V_i^R(s) \big] \\
&= \big(1 - N(s)\alpha\big)\, V_i^R(s) + \alpha \sum_{n=1}^{N(s)} R(t_s(n)),
\end{aligned}
$$

where R(t) is the return following time t to the end of the trial. This in turn implies that

$$V_i^R(s) = \big(1 - N(s)\alpha\big)^i V_0^R(s) + \alpha \sum_{n=1}^{N(s)} R(t_s(n)) \Big[ 1 + \big(1 - N(s)\alpha\big) + \cdots + \big(1 - N(s)\alpha\big)^{i-1} \Big].$$

Therefore,

$$
\begin{aligned}
V_\infty^R(s) &= \lim_{i \to \infty} \big(1 - N(s)\alpha\big)^i V_0^R(s) + \alpha \sum_{n=1}^{N(s)} R(t_s(n)) \sum_{j=0}^{\infty} \big(1 - N(s)\alpha\big)^j \\
&= \alpha \sum_{n=1}^{N(s)} R(t_s(n))\, \frac{1}{1 - \big(1 - N(s)\alpha\big)} \qquad \text{(because } N(s)\alpha < 1\text{)} \\
&= \frac{\sum_{n=1}^{N(s)} R(t_s(n))}{N(s)},
\end{aligned}
$$

which is the first-visit MC estimate. •



2.5. Equivalence of Offline TD(1) and MC Methods by Choice of α

In this subsection we establish that the replace and accumulate forms of offline TD(1)
can also be made equivalent to the corresponding MC methods by suitable choice of the
step-size sequence c~t(s).

THEOREM 2: Offline replace TD(1) is equivalent to first-visit MC under the step-size
schedule

$$\alpha_t(s) = \frac{1}{\text{number of first visits to } s \text{ up through time } t}.$$

Proof: As before, in considering updates to the estimates of V(s), we need only consider
trials in which s occurs. The cumulative increment in the estimate of V(s) as a result
of the i-th trial in which s occurs is

$$\sum_{t=t_s(i)}^{t_T(i)-1} \Delta V_t(s) = \sum_{t=t_s(i)}^{t_T(i)-1} \alpha_t(s) \big[ r_{t+1} + V_{i-1}^R(s_{t+1}) - V_{i-1}^R(s_t) \big] = \frac{1}{i}\big( R(t_s(i)) - V_{i-1}^R(s) \big).$$

Therefore, the update for offline replace TD(1), after a complete trial, is

$$V_i^R(s) = V_{i-1}^R(s) + \frac{1}{i}\big( R(t_s(i)) - V_{i-1}^R(s) \big),$$

which is just the iterative recursive equation for incrementally computing the average of
the first-visit returns, {R(t_s(1)), R(t_s(2)), R(t_s(3)), ...}. •

THEOREM 3: Offline accumulate TD(1) is equivalent to every-visit MC under the step-size
schedule

$$\alpha_t(s) = \frac{1}{\text{number of visits to } s \text{ up through the entire trial containing time } t}.$$

Proof: Once again we consider only trials in which state s occurs. For this proof we
need to use the time index of every visit to state s, complicating notation somewhat. Let
t_s(i; k) be the time index of the k-th visit to state s in trial i. Also, let K_s(i) be the
total number of visits to state s in trial i. The essential idea behind the proof is to again
show that the offline TD(1) equation is an iterative recursive averaging equation, only
this time of the returns from every visit to state s.
Let α_i(s) be the step-size parameter used in processing trial i. The cumulative incre-
ment in the estimate of V(s) as a result of trial i is

$$
\begin{aligned}
\sum_{t=t_s(i;1)}^{t_T(i)-1} \Delta V_t(s) &= \alpha_i(s) \left[ \sum_{t=t_s(i;1)}^{t_s(i;2)-1} \Delta_i(s_t) + 2 \sum_{t=t_s(i;2)}^{t_s(i;3)-1} \Delta_i(s_t) + \cdots + K_s(i) \sum_{t=t_s(i;K_s(i))}^{t_T(i)-1} \Delta_i(s_t) \right] \\
&= \alpha_i(s) \left[ \sum_{j=1}^{K_s(i)} R(t_s(i;j)) - K_s(i)\, V_{i-1}^A(s) \right],
\end{aligned}
$$

where $\Delta_i(s_t) = r_{t+1} + V_{i-1}^A(s_{t+1}) - V_{i-1}^A(s_t)$, and $V_i^A(s)$ is the accumulate-trace estimate
at trial i. Therefore,

$$V_i^A(s) = V_{i-1}^A(s) + \alpha_i(s) \left[ \sum_{j=1}^{K_s(i)} R(t_s(i;j)) - K_s(i)\, V_{i-1}^A(s) \right].$$

Because $\alpha_i(s) = 1 / \sum_{j=1}^{i} K_s(j)$, this will compute the sample average of all the actual
returns from every visit to state s up to and including trial i. •
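The "iterative recursive averaging equation" appearing in both proofs is simply the standard incremental-mean update. A minimal sketch follows (illustrative Python; the return values are made up for the example). For Theorem 2 the count i is the number of first visits processed so far, while for Theorem 3 it would be the total number of visits.

```python
def running_average_update(v, target, count):
    """Incremental mean: after this update, v is the average of `count` targets."""
    return v + (1.0 / count) * (target - v)

# Theorem 2: offline replace TD(1) averages the first-visit returns R(t_s(1)), R(t_s(2)), ...
v, returns = 0.0, [4.0, 6.0, 5.0]        # hypothetical first-visit returns
for i, R in enumerate(returns, start=1):
    v = running_average_update(v, R, i)
print(v)   # 5.0, the sample mean of the three returns
```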

3. Analytic Comparison of Monte Carlo Methods

In the previous section we established close relationships of replace and accumulate


TD(1) to first-visit and every-visit MC methods respectively. By better understanding
the difference between the MC methods, then, we might hope to better understand the
difference between the TD methods. Accordingly, in this section we evaluate analytically
the quality of the solutions found by the two MC methods. In brief, we explore the
asymptotic correctness of all methods, the bias of the MC methods, the variance and
mean-squared error of the MC methods, and the relationship of the MC methods to the
maximum-likelihood estimate. The results of this section are summarized in Table 1.

3.1. Asymptotic Convergence

In this subsection we briefly establish the asymptotic correctness of the TD methods.


The asymptotic convergence of accumulate TD(λ) for general λ is well known (Dayan,
1992; Jaakkola, Jordan & Singh, 1994; Tsitsiklis, 1994; Peng, 1993). The main results
appear to carry over to the replace-trace case with minimal modifications. In particular:

THEOREM 4: Offline (online) replace TD(λ) converges to the desired value function
w.p.1 under the conditions for w.p.1 convergence of offline (online) conventional TD(λ)
stated by Jaakkola, Jordan and Singh (1994).
Proof: Jaakkola, Jordan and Singh (1994) proved that online and offline TD(λ) converge
w.p.1 to the correct predictions, under natural conditions, as the number of trials goes
to infinity (or as the number of time steps goes to infinity in the online, non-absorbing
case, with γ < 1). Their proof is based on showing that the offline TD(λ) estimator is a

contraction mapping in expected value. They show that it is a weighted sum of n-step
corrected truncated returns,

$$V_t^{(n)}(s_t) = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n}),$$

that, for all n ≥ 1, are better estimates (in expected value) of V(s_t) than is V_t(s_t). The
eligibility trace collects successive n-step estimators, and its magnitude determines their
weighting. The TD(λ) estimator is

$$\sum_{k=0}^{T-1} \big[ r_{t+k+1} + \gamma V_t(s_{t+k+1}) - V_t(s_{t+k}) \big]\, e_{t+k+1}(s_t) + V_t(s_t)
= (1-\lambda) \sum_{n=1}^{\tau} \lambda^{n-1} V_t^{(n)}(s_t) + \left[ \sum_{n=\tau+1}^{\infty} (1-\lambda)\lambda^{n-1} \right] V_t^{(\tau)}(s_t),$$

where, for the accumulating trace, τ is the number of time steps until termination,
whereas, for the replacing trace, τ is the number of time steps until the next revisit to
state s_t. Although the weighted sum is slightly different in the replace case, it is still a
contraction mapping in expected value and meets all the conditions of Jaakkola et al.'s
proofs of convergence for online and offline updating. •
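The n-step corrected truncated returns and their λ-weighted mixture can be computed directly from a recorded trajectory. The following sketch is illustrative Python under the indexing assumptions stated in the comments; it is not part of the proof.

```python
def n_step_return(rewards, values, n, gamma=1.0):
    """V^(n): n rewards from time t, then bootstrap from the estimate n steps ahead.

    rewards[k] is r_{t+k+1}; values[k] is the current estimate V(s_{t+k}),
    with the terminal state's value taken to be 0 (so no bootstrap past termination).
    """
    steps = min(n, len(rewards))                     # truncate at termination
    g = sum(gamma**k * rewards[k] for k in range(steps))
    if steps < len(values):
        g += gamma**steps * values[steps]
    return g

def lambda_return(rewards, values, lam, gamma=1.0):
    """(1 - lambda)-weighted mixture of the n-step returns (forward view of TD(lambda))."""
    T = len(rewards)                                 # steps remaining until termination
    g = sum((1 - lam) * lam**(n - 1) * n_step_return(rewards, values, n, gamma)
            for n in range(1, T))
    return g + lam**(T - 1) * n_step_return(rewards, values, T, gamma)

# Two steps to termination with rewards 0 then 1, and estimates V(s_t)=0.5, V(s_{t+1})=0.8:
print(lambda_return([0.0, 1.0], [0.5, 0.8], lam=0.5))   # 0.5*(0+0.8) + 0.5*(0+1) = 0.9
```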

3.2. Relationship to the ML Estimate

In the simple example in Figure 2, the first-visit MC estimate is the same as the ML
estimate. However, this is true in general only for the starting state, assuming all trials
start in the same state. One way of thinking about this is to consider for any state s just
the subset of the training trials that include s. For each of these trials, discard the early
part of the trial before s was visited for the first time. Consider the remaining "tails" of
the trials as a new training set. This reduced training set is really all the MC methods
ever see in forming their estimates for s. We refer to the ML estimate of V(s) based
on this reduced training set as the reduced-ML estimate. In this subsection we show that
the reduced ML estimate is equivalent in general to the first-visit MC estimate.

THEOREM 5: For any undiscounted absorbing Markov chain, the estimates computed
by first-visit MC are the reduced-ML estimates, for all states and after all trials.
Proof: The first-visit MC estimate is the average of the returns from first visits to state
s. Because the maximum-likelihood model is built from the partial experience rooted
in state s, the sum over all t of the probability of making a particular transition at time
step t according to the maximum-likelihood model is equal to the ratio of the number of
times that transition was actually made to the number of trials. Therefore, the reduced-
ML estimate for state s is equal to the first-visit MC estimate. See Appendix A.1 for a
complete proof. •

Theorem 5 shows the equivalence of the first-visit MC and reduced-ML estimates.


Every-visit MC in general produces an estimate different from the reduced-ML estimate.

3.3. Reduction to a Two-State Abstracted Markov Chain

In this subsection we introduce a conceptual reduction of arbitrary undiscounted absorbing


Markov chains to a two-state abstracted Markov chain that we then use in the rest of
this paper's analyses of MC methods. The reduction is based on focusing on each state
individually. Assume for the moment that we are interested in the value only of one
state, s. We assume that all training trials start in state s. We can do this without loss
of generality because the change in the value of a state after a trial is unaffected by
anything that happens before the first visit to the state on that trial.
For any Markov chain, a trial produces two sequences, a random sequence of states,
{s}, beginning with s and ending in T, and an associated random sequence of rewards,
{r}. Partition sequence {s} into contiguous subsequences that begin in s and end just
before the next revisit to s. The subsequence starting with the i-th revisit to s is denoted
{s}_i. The last such subsequence is special in that it ends in the terminal state and
is denoted {s}_T. The corresponding reward sequences are similarly denoted {r}_i and
{r}_T. Because of the Markov property, {s}_i is independent of {s}_j, for all i ≠ j, and
similarly {r}_i is independent of {r}_j. This is useful because it means that the precise
sequence of states that actually occurs between visits to s does not play a role in the
first-visit MC or the every-visit MC estimates for V(s). Similarly, the precise sequence
of rewards, {r}_i, does not matter, as only the sum of the rewards in between visits to s
is used in the MC methods.

Figure 3. Abstracted Markov chain. At the top is a typical sequence of states comprising a training trial. The sequence can be divided into contiguous subsequences at the visits to start state s. For our analyses, the precise sequence of states and rewards in between revisits to s does not matter. Therefore, in considering the value of s, arbitrary undiscounted Markov chains can be abstracted to the two-state chain shown in the lower part of the figure.

Therefore, for the purpose of analysis, arbitrary undiscounted Markov chains can be
reduced to the two-state abstract chain shown in the lower part of Figure 3. The as-
sociated probabilities and rewards require careful elaboration. Let P_T and P_s denote
the probabilities of terminating and looping respectively in the abstracted chain. Let r_s
and r_T represent the random rewards associated with an s → s transition and an s → T
transition in Figure 3. We use the quantities R_s = E{r_s}, Var(r_s) = E{(r_s − R_s)²},
R_T = E{r_T}, and Var(r_T) = E{(r_T − R_T)²} in the following analysis. Precise
definitions of these quantities are given in Appendix A.2.

first-visit MC:
Let {x} stand for the paired random sequence ({s}, {r}). The first-visit MC estimate
for V(s) after one trial, {x}, is

$$V^F(s) = f(\{x\}) = r_{s_1} + r_{s_2} + r_{s_3} + \cdots + r_{s_k} + r_T,$$

where k is the random number of revisits to state s, $r_{s_i}$ is the sum of the individual
rewards in the sequence {r}_i, and r_T is the random total reward received after the last
visit to state s. For all i, $E\{r_{s_i}\} = R_s$. The first-visit MC estimate of V(s) after n
trials, {x}^1, {x}^2, ..., {x}^n, is

$$V_n^F(s) = f(\{x\}^1, \{x\}^2, \ldots, \{x\}^n) = \frac{\sum_{i=1}^{n} f(\{x\}^i)}{n}. \qquad (2)$$

In words, $V_n^F(s)$ is simply the average of the estimates from the n sample trajectories,
{x}^1, {x}^2, ..., {x}^n, all of which are independent of each other because of the Markov
property.

every-visit MC:
The every-visit MC estimate for one trial, {x}, is

$$V^E(s) = t(\{x\}) = \frac{r_{s_1} + 2 r_{s_2} + \cdots + k\, r_{s_k} + (k+1)\, r_T}{k+1},$$

where k is the random number of revisits to state s in the sequence {x}. Every visit to
state s effectively starts another trial. Therefore, the rewards that occur in between the
i-th and (i+1)-st visits to state s are included i times in the estimate.
The every-visit MC estimate after n trials, {x}^1, {x}^2, ..., {x}^n, is

$$V_n^E(s) = t(\{x\}^1, \{x\}^2, \ldots, \{x\}^n) = \frac{\sum_{i=1}^{n} (k_i + 1)\, t(\{x\}^i)}{\sum_{i=1}^{n} (k_i + 1)}, \qquad (3)$$

where $k_i$ is the number of revisits to s in the i-th trial {x}^i. Unlike the first-visit MC
estimator, the every-visit MC estimator for n trials is not simply the average of the
estimates for individual trials, making its analysis more complex.
We derive the bias (Bias) and variance (Var) of first-visit MC and every-visit MC as
a function of the number of trials, n. The mean squared error (MSE) is Bias² + Var.

3.4. Bias Results

First consider the true value of state s in Figure 3. From Bellman's equation (Bellman,
1957):

$$V(s) = P_s\big(R_s + V(s)\big) + P_T\big(R_T + V(T)\big),$$

or

$$(1 - P_s)\,V(s) = P_s R_s + P_T R_T,$$

and therefore

$$V(s) = \frac{P_s}{P_T} R_s + R_T.$$

THEOREM 6: First-visit MC is unbiased, i.e., $Bias_n^F(s) = V(s) - E\{V_n^F(s)\} = 0$ for
all n > 0.

Proof: The first-visit MC estimate is unbiased because the total reward on a sample path
from the start state s to the terminal state T is by definition an unbiased estimate of the
expected total reward across all such paths. Therefore, the average of the estimates from
n independent sample paths is also unbiased. See Appendix A.3 for a detailed proof.

THEOREM 7: Every-visit MC is biased and, after n trials, its bias is

$$Bias_n^E(s) = V(s) - E\{V_n^E(s)\} = \frac{2}{n+1}\, Bias_1^E(s) = \frac{2}{n+1} \left[ \frac{P_s}{2 P_T} \right] R_s.$$

Proof: See Appendix A.4.

One way of understanding the bias in the every-visit MC estimate is to note that this
method averages many returns for each trial. Returns from the same trial share many of
the same rewards and are thus not independent. The bias becomes smaller as more trials
are observed because the returns from different trials are independent. Another way of
understanding the bias is to note that the every-visit MC estimate (3) is the ratio of two
random variables. In general, the expected value of such a ratio is not the ratio of the
expected values of the numerator and denominator.

COROLLARY 7a: Every-visit MC is unbiased in the limit as n → ∞.

3.5. Variance and MSE Results

THEOREM 8: The variance of first-visit MC is

$$Var_n^F(s) = \frac{1}{n} \left[ Var(r_T) + \frac{P_s}{P_T} Var(r_s) + \frac{P_s}{P_T^2} R_s^2 \right].$$

Proof: See Appendix A.5.

Because the first-visit MC estimate is the sample average of estimates derived from
independent trials, the variance goes down as 1/n. The first two terms in the variance are
due to the variance of the rewards, and the third term is the variance due to the random
number of revisits to state s in each trial.

COROLLARY 8a: The MSE of first-visit MC is

$$MSE_n^F(s) = \big(Bias_n^F(s)\big)^2 + Var_n^F(s) = \frac{1}{n} \left[ Var(r_T) + \frac{P_s}{P_T} Var(r_s) + \frac{P_s}{P_T^2} R_s^2 \right].$$



THEOREM 9: The variance of every-visit MC after one trial is bounded by

$$\left[ \frac{P_s}{3 P_T} - \frac{1}{6} \right] Var(r_s) + Var(r_T) + \frac{P_s}{4 P_T^2} R_s^2 \;\le\; Var_1^E(s)$$

and

$$Var_1^E(s) \;\le\; \frac{P_s}{3 P_T}\, Var(r_s) + Var(r_T) + \frac{P_s}{4 P_T^2} R_s^2.$$

Proof: See Appendix A.6.

We were able to obtain only these upper and lower bounds on the variance of every-visit
MC. For a single trial, every-visit MC produces an estimate that is closer to zero than
the estimate produced by first-visit MC; therefore $Var_1^E(s) \le Var_1^F(s)$. This effect was seen
in the simple example of Figure 2, in which the every-visit MC estimator underestimated
the expected number of revisits.
Of course, a low variance is not of itself a virtue. For example, an estimator that
returns a constant independent of the data has zero variance, but is not a good estimator.
Of greater importance is to be low in mean squared error (MSE):

COROLLARY 9a: After one trial, $MSE_1^E(s) \le MSE_1^F(s)$, because $\big(Bias_1^E(s)\big)^2 +
Var_1^E(s) \le Var_1^F(s) = MSE_1^F(s)$.
Thus, after one trial, every-visit MC is always as good or better than first-visit MC in
terms of both variance and MSE. Eventually, however, this relative advantage always
reverses itself:

THEOREM 10: There exists an N < ∞, such that for all n > N, $Var_n^E(s) > Var_n^F(s)$.
Proof: The basic idea of the proof is that the O(1/n) component of $Var_n^E$ is larger than
that of $Var_n^F$. The other, higher-order components of $Var_n^E$ fall off much more rapidly than
the O(1/n) component, and can be ignored for large enough n. See Appendix A.7 for a
complete proof. •

COROLLARY 10a: There exists an N < ∞, such that, for all n > N,

$$MSE_n^E(s) = \big(Bias_n^E(s)\big)^2 + Var_n^E(s) \;>\; MSE_n^F(s) = Var_n^F(s).$$

Figure 4 shows an empirical example of this crossover of MSE. These data are for the
two MC methods applied to an instance of the example task of Figure 2a. In this case
crossover occurred at trial N = 5. In general, crossover can occur as early as the first
trial. For example, if the only non-zero reward in a problem is on termination, then
R_s = 0 and Var(r_s) = 0, which in turn implies that $Bias_n^E = 0$, for all n, and that
$Var_1^E(s) = Var_1^F(s)$, so that $MSE_1^E(s) = MSE_1^F(s)$.

Figure 4. Empirical demonstration of crossover of MSE on the example task shown in Figure 2a. The S-to-S transition probability was p = 0.6. These data are averages over 10,000 runs. (The plot shows average root MSE against number of trials for the two MC methods.)
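The crossover can be reproduced with a short simulation of the chain of Figure 2a. The sketch below is illustrative Python; p = 0.6 and the 10,000 runs follow the figure, while the seed, the number of trials, and the vectorized implementation are assumptions of this sketch.

```python
import numpy as np

def mse_crossover(p=0.6, runs=10000, max_trials=20, seed=0):
    """Monte Carlo comparison of first-visit and every-visit MC on the chain of Figure 2a.

    Each trial stays in S with probability p (reward +1 per step) and then terminates,
    so the trial length L is the first-visit return and L(L+1)/2 is the sum of the
    every-visit returns from that trial. True value: V(S) = 1/(1-p).
    """
    rng = np.random.default_rng(seed)
    true_v = 1.0 / (1.0 - p)
    sq_err_first = np.zeros(max_trials)
    sq_err_every = np.zeros(max_trials)
    for _ in range(runs):
        lengths = rng.geometric(1.0 - p, size=max_trials)      # trial lengths L >= 1
        cum_len = np.cumsum(lengths)
        first = cum_len / np.arange(1, max_trials + 1)         # mean of the L's
        every = np.cumsum(lengths * (lengths + 1) / 2.0) / cum_len   # pooled returns
        sq_err_first += (first - true_v) ** 2
        sq_err_every += (every - true_v) ** 2
    return np.sqrt(sq_err_first / runs), np.sqrt(sq_err_every / runs)

rmse_first, rmse_every = mse_crossover()
# Early on the every-visit curve lies below the first-visit curve; after a few
# trials the curves cross, as in Figure 4.
print(rmse_first[:6])
print(rmse_every[:6])
```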

3.6. Summary

Table 1 summarizes the results of this section comparing first-visit and every-visit MC
methods. Some of the results are unambiguously in favor of the first-visit method over
the every-visit method: only the first-visit estimate is unbiased and related to the ML es-
timate. On the other hand, the MSE results can be viewed as mixed. Initially, every-visit
MC is of better MSE, but later it is always overtaken by first-visit MC. The implications
of this are unclear. To some it might suggest that we should seek a combination of the
two estimators that is always of lowest MSE. However, that might be a mistake. We
suspect that the first-visit estimate is always the more useful one, even when it is worse
in terms of MSE. Our other theoretical results are consistent with this view, but it remains
a speculation and a topic for future research.

Table 1. Summary of Statistical Results

Algorithm        Convergent   Unbiased   Short MSE   Long MSE   Reduced-ML
First-Visit MC   Yes          Yes        Higher      Lower      Yes
Every-Visit MC   Yes          No         Lower       Higher     No

4. Random-Walk Experiment

In this section we present an empirical comparison of replacing and accumulating eligi-


bility traces. Whereas our theoretical results are limited to the case of λ = 1 and either
offline or batch updating, in this experiment we used online updating and general λ. We
used the random-walk process shown in Figure 5. The rewards were zero everywhere
except upon entering the terminal states. The reward upon transition into State 21 was
+1 and upon transition into State 1 was -1. The discount factor was γ = 1. The initial
value estimates were 0 for all states. We implemented online TD(λ) with both kinds of
traces for ten different values of λ: 0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95, 0.975, 0.99, and
1.0.
The step-size parameter was held constant, α_t(s) = α, for all t and s. For each value of λ,
we used α values between 0 and 1.0 in increments of 0.01. Each (λ, α) pair was treated
as a separate algorithm, each of which we ran for 10 trials. The performance measure
for a trial was the root mean squared error (RMSE) between the correct predictions and
the predictions made at the end of the trial from states that had been visited at least once
in that or a previous trial. These errors were then averaged over the 10 trials, and then
over 1000 separate runs to obtain the performance measure for each algorithm plotted
in Figures 6 and 7. The random number generator was seeded such that all algorithms
experienced exactly the same trials.
Figure 6 shows the performance of each method as a function of α and λ. For each
value of λ, both kinds of TD method performed best at an intermediate value of α, as is
typically the case for such learning algorithms. The larger the λ value, the smaller the α
value that yielded best performance, presumably because the eligibility trace multiplies
the step-size parameter in the update equation.
The critical results are the differences between replace and accumulate TD methods.
Replace TD was much more robust to the choice of the step-size parameter than accumu-
late TD. Indeed, for λ ≥ 0.9, accumulate TD(λ) became unstable for α > 0.6. At large
λ, accumulate TD built up very large eligibility traces for states that were revisited many
times before termination. This caused very large changes in the value estimates and led
to instability. Figure 7 summarizes the data by plotting, for each λ, only the performance
at the best α for that λ. For every λ, the best performance of replace TD was better than
or equal to the best performance of accumulate TD. We conclude that, at least for the
problem studied here, replace TD(λ) is faster and more robust than accumulate TD(λ).
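A compact version of this experiment can be written in a few dozen lines. The sketch below is illustrative Python rather than the authors' code: it runs online TD(λ) with either kind of trace on the 21-state random walk for a single (λ, α) setting, whereas the experiment above sweeps a grid of both parameters.

```python
import numpy as np

def run_random_walk(lam, alpha, trace="replace", n_trials=10, seed=0):
    """Online TD(lambda) on the 21-state random walk; returns RMSE averaged over trials."""
    rng = np.random.default_rng(seed)
    true_v = np.array([0.0] + [(i - 11) / 10.0 for i in range(1, 22)])  # indexed by state 1..21
    V = np.zeros(22)                          # V[1] = V[21] = 0 (terminal values)
    visited = np.zeros(22, dtype=bool)
    errors = []
    for _ in range(n_trials):
        e = np.zeros(22)
        s = 11
        while s not in (1, 21):
            visited[s] = True
            s_next = s + rng.choice([-1, 1])
            r = 1.0 if s_next == 21 else (-1.0 if s_next == 1 else 0.0)
            delta = r + V[s_next] - V[s]      # gamma = 1
            e *= lam
            if trace == "replace":
                e[s] = 1.0
            else:
                e[s] += 1.0
            V += alpha * delta * e
            s = s_next
        nonterm = visited & (np.arange(22) > 1) & (np.arange(22) < 21)
        errors.append(np.sqrt(np.mean((V[nonterm] - true_v[nonterm]) ** 2)))
    return np.mean(errors)

# One (lambda, alpha) setting for each trace type; the experiment sweeps many such pairs.
print(run_random_walk(0.9, 0.3, trace="replace"))
print(run_random_walk(0.9, 0.3, trace="accumulate"))
```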

Figure 5. The random-walk process. Starting in State 11, steps are taken left or right with equal probability until either State 1 or State 21 is entered, terminating the trial and generating a final non-zero reward.
Figure 6. Performance of replace and accumulate TD(λ) on the random-walk task, for various values of λ and α. The performance measure was the RMSE per state per trial over the first 10 trials. These data are averages over 1000 runs. (Left panel: accumulate traces; right panel: replace traces; average RMSE is plotted against α for each λ.)

Figure 7. Best performances of accumulate and replace TD(λ) on the random-walk task. (Average RMSE at the best α is plotted against λ for each method.)



5. Mountain-Car Experiment

In this section we describe an experimental comparison of replacing and accumulating


traces when used as part of a reinforcement learning system to solve a control problem.
In this case, the methods learned to predict the value not of a state, but of a state-action
pair, and the approximate value function was implemented as a set of CMAC neural
networks, one for each action.
The control problem we used was Moore's (1991) mountain car task. A car drives
along a mountain track as shown in Figure 8. The objective is to drive past the top of
the mountain on the righthand side. However, gravity is stronger than the engine, and
even at full thrust the car cannot accelerate up the steep slope. The only way to solve the
problem is to first accelerate backwards, away from the goal, and then apply full thrust
forwards, building up enough speed to carry over the steep slope even while slowing
down the whole way. Thus, one must initially move away from the goal in order to
reach it in the long run. This is a simple example of a task where things must get worse
before they can get better. Many control methodologies have great difficulties with tasks
of this kind unless explicitly aided by a human designer.
The reward in this problem is - 1 for all time steps until the car has passed to the
right of the mountain top. Passing the top ends the trial and ends this punishment. The
reinforcement learning agent seeks to maximize its total reward prior to the termination
of the trial. To do so, it must drive to the goal in minimum time. At each time step the
learning agent chooses one of three actions: full thrust forward, full thrust reverse, or
no thrust. This action, together with the effect of gravity (dependent on the steepness of
the slope), determines the next velocity and position of the car. The complete physics
of the mountain-car task are given in Appendix B.
The reinforcement learning algorithm we applied to this task was the Sarsa algorithm
studied by Rummery and Niranjan (1994) and others. The objective in this algorithm is to
learn to estimate the action-value function Q~r (s, a) for the current policy 7r. The action-

Figure 8. The Mountain-Car task. The force of gravity is stronger than the motor.

1. Initially: w_a(f) := -20, e_a(f) := 0, for all a ∈ Actions, for all f ∈ CMAC-tiles.

2. Start of Trial: s := random-state();
   F := features(s);
   a := greedy-policy(F).

3. Eligibility Traces: e_b(f) := λ e_b(f), for all b, for all f;
   3a. Accumulate algorithm: e_a(f) := e_a(f) + 1, for all f ∈ F.
   3b. Replace algorithm: e_a(f) := 1, e_b(f) := 0, for all f ∈ F, for all b ≠ a.

4. Environment Step:
   Take action a; observe resultant reward, r, and next state s'.

5. Choose Next Action:
   F' := features(s'), unless s' is the terminal state, then F' := ∅;
   a' := greedy-policy(F').

6. Learn: w_b(f) := w_b(f) + α [r + Σ_{f∈F'} w_{a'}(f) − Σ_{f∈F} w_a(f)] e_b(f), for all b, for all f.

7. Loop: a := a'; s := s'; F := F'; if s' is the terminal state, go to 2; else go to 3.

Figure 9. The Sarsa algorithm used on the Mountain-Car task. The function greedy-policy(F) computes Σ_{f∈F} w_a(f) for each action a and returns the action for which the sum is largest, resolving any ties randomly. The function features(s) returns the set of CMAC tiles corresponding to the state s. Programming optimizations can reduce the expense per iteration to a small multiple (dependent on λ) of the number of features, m, present on a typical time step. Here m is 5.

value Q~(s, a) gives, for any state, s, and action, a, the expected return for starting from
state s, taking action a, and thereafter following policy 7r. In the case of the mountain-car
task the return is simply the sum of the future reward, i.e., the negative of the number of
time steps until the goal is reached. Most of the details of the Sarsa algorithm we used are
given in Figure 9. The name "Sarsa" comes from the quintuple of actual events involved
in the update: (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). This algorithm is closely related to Q-learning
(Watkins, 1989) and to various simplified forms of the bucket brigade (Holland, 1986;
Wilson, to appear). It is also identical to the TD(λ) algorithm applied to state-action
pairs rather than to states.⁶
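For concreteness, the steps of Figure 9 can be written directly in code. The sketch below is illustrative Python, not the authors' implementation; the environment interface `env_step(s, a)` and the feature function `features(s)` are assumed placeholders, and ties in the greedy policy are broken by index rather than randomly.

```python
import numpy as np

def sarsa_lambda_trial(w, env_step, features, start_state, n_actions,
                       alpha, lam, trace="replace"):
    """One trial of Sarsa(lambda) with a linear action-value function over binary features.

    w[a] is the weight vector for action a; the value of (s, a) is the sum of w[a][f]
    over the active features f of s. Follows the numbered steps of Figure 9.
    """
    e = np.zeros_like(w)                                   # one trace per (action, feature)
    s = start_state
    F = features(s)
    a = int(np.argmax([w[b][F].sum() for b in range(n_actions)]))   # greedy action
    while True:
        e *= lam                                           # step 3: decay all traces
        if trace == "replace":
            e[:, F] = 0.0
            e[a, F] = 1.0                                  # step 3b
        else:
            e[a, F] += 1.0                                 # step 3a
        r, s_next, terminal = env_step(s, a)               # step 4
        if terminal:
            target = r                                     # no features at the terminal state
            a_next, F_next = None, None
        else:
            F_next = features(s_next)
            a_next = int(np.argmax([w[b][F_next].sum() for b in range(n_actions)]))
            target = r + w[a_next][F_next].sum()
        w += alpha * (target - w[a][F].sum()) * e          # step 6
        if terminal:
            return w
        s, a, F = s_next, a_next, F_next                   # step 7
```

As in step 1 of the figure, w would be initialized to -20 everywhere so that, with 5 active features, every state-action pair starts with the optimistic value -100 mentioned below.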
The mountain-car task has a continuous two-dimensional state space with an infinite
number of states. To apply reinforcement learning requires some form of function ap-
proximator. We used a set of three CMACs (Albus, 1981; Miller, Glanz, & Kraft, 1990),
one for each action. These are simple function approximators using repeated overlapping
tilings of the state space to produce a feature representation for a final linear mapping.
In this case we divided the two state variables, the position and velocity of the car, each
into eight evenly spaced intervals, thereby partitioning the state space into 64 regions,
or boxes. A ninth row and column were added so that the tiling could be offset by a
random fraction of an interval without leaving any states uncovered. We repeated this
five times, each with a different, randomly selected offset. For example, Figure 10 shows
two tilings superimposed on the 2D state space. The result was a total of 9 x 9 x 5 = 405
boxes. The state at any particular time was represented by the five boxes, one per tiling,
within which the state resided. We think of the state representation as a feature vector
with 405 features, exactly 5 of which are present (non-zero) at any point in time. The
approximate action-value function is linear in this feature representation. Note that this
representation of the state causes the problem to no longer be Markov: many different
nearby states produce exactly the same feature representation.

Figure 10. Two 9 x 9 CMAC tilings offset and overlaid over the continuous, two-dimensional state space of the Mountain-Car task. Any state is in exactly one tile/box/feature of each tiling. The experiments used 5 tilings, each offset by a random fraction of a tile width. (The axes are car position and car velocity.)
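The feature function itself is straightforward. The sketch below is illustrative Python: the 9 x 9 x 5 layout and random offsets follow the text, while the position and velocity ranges used here are the commonly used ones for this task and are an assumption of this sketch, as are the particular random offsets.

```python
import numpy as np

N_TILINGS, N_ROWS = 5, 9                       # 9 x 9 grid per tiling, 5 offset tilings
POS_RANGE = (-1.2, 0.5)                        # assumed mountain-car position range
VEL_RANGE = (-0.07, 0.07)                      # assumed mountain-car velocity range

rng = np.random.default_rng(0)
# Each tiling is shifted by a random fraction of a tile width in each dimension.
offsets = rng.uniform(0.0, 1.0, size=(N_TILINGS, 2))

def features(state):
    """Indices (into a length 9*9*5 = 405 vector) of the 5 active tiles for `state`."""
    pos, vel = state
    pos_width = (POS_RANGE[1] - POS_RANGE[0]) / 8.0    # 8 intervals per dimension
    vel_width = (VEL_RANGE[1] - VEL_RANGE[0]) / 8.0
    active = []
    for k in range(N_TILINGS):
        # Shift by this tiling's offset, then find the (row, column) of the containing tile.
        i = int((pos - POS_RANGE[0] + offsets[k, 0] * pos_width) / pos_width)
        j = int((vel - VEL_RANGE[0] + offsets[k, 1] * vel_width) / vel_width)
        i, j = min(i, N_ROWS - 1), min(j, N_ROWS - 1)  # the 9th row/column catches overflow
        active.append(k * N_ROWS * N_ROWS + i * N_ROWS + j)
    return np.array(active)

print(features((-0.5, 0.0)))    # five tile indices, one per tiling
```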

The eligibility traces were implemented on a feature-by-feature basis. Corresponding


to each feature were three traces, one per action. The features are treated in essence
like states. For replace algorithms, whenever a feature occurs, its traces are reset to 1
(for the action selected) or 0 (for all the other actions). This is not the only possibility,
of course. Another would be to allow the traces for each state-action pair to continue
until that pair occurred again. This would be more in keeping with the idea of replacing
traces as a mechanism, but the approach we chose seems like the appropriate way to
generalize the idea of first-visit MC to the control case: after a state has been revisited,
it no longer matters what action was taken on the previous visit. A comparison of these
two possibilities (and perhaps others) would make a good extension to this work.
The greedy policy was used to select actions. The initial weights were set to produce a
uniform, optimistic initial estimate of value (-100) across the state space. 7 See Figure 9
for further details.
We applied replace and accumulate Sarsa algorithms to this task, each with a range of
values for λ and α. Each algorithm was run for 20 trials, where a trial was one passage
from a randomly selected starting state to the goal. All algorithms used the same sets
of random starting states. The performance measure for the run was the average trial
length over the 20 trials. This measure was then averaged over 30 runs to produce the
results shown in Figures 11 and 12. Figure 11 shows the detailed results for each value
of λ and α, whereas Figure 12 is a summary showing only the best performance of each
algorithm at each λ value.
Several interesting results are evident from this data. First, the replace-trace method
performed better than the accumulate-trace method at all λ values. The accumulate
method performed particularly poorly relative to the replace method at high values of λ.
For both methods, performance appeared to be best at an intermediate λ value. These

Figure 11. Results on the Mountain-Car task for each value of λ and α. Each data point is the average duration of the first 20 trials of a run, averaged over 30 runs. The standard errors are omitted to simplify the graph; they ranged from about 10 to about 50. (Left panel: replace traces; right panel: accumulate traces; steps per trial, averaged over the first 20 trials and 30 runs, is plotted against α for each λ.)

Figure 12. Summary of results on the Mountain-Car task. For each value of λ we show its performance at its best value of α. The error bars indicate one standard error.

results are all consistent with those presented for the random-walk task in the previous
section. On the mountain-car task, accumulating traces at best improved only slightly
over no traces (λ = 0) and at worst dramatically degraded performance. Replacing traces,
on the other hand, significantly improved performance at all except the very longest trace
lengths (λ > .99). Traces that do not decay (λ = 1) resulted in significantly worse
performance than all other λ values tried, including no traces at all (λ = 0).
Much more empirical experience is needed with trace mechanisms before a definitive
conclusion can be drawn about their relative effectiveness, particularly when function
approximators are used. However, these experiments do provide significant evidence for
two key points: 1) that replace-trace methods can perform much better than conventional,
accumulate-trace methods, particularly at long trace lengths, and 2) that although long
traces may help substantially, best performance is obtained when the traces are not
infinite, that is, when intermediate predictions are used as targets rather than actual
sample returns.

6. Conclusions

We have presented a variety of analytical and empirical evidence supporting the idea
that replacing eligibility traces permit more efficient use of experience in reinforcement
learning and long-term prediction.
Our analytical results concerned a special case closely related to that used in classical
studies of Monte Carlo methods. We showed that methods using conventional traces are
biased, whereas replace-trace methods are unbiased. While the conclusions of our mean-
squared-error analysis are mixed, the maximum likelihood analysis is clearly in favor
of replacing traces. As a whole, these analytic results strongly support the conclusion

that replace-trace methods make better inferences from limited data than conventional
accumulate-trace methods.
On the other hand, these analytic results concern only a special case quite different
from those encountered in practice. It would be desirable to extend our analyses to
the case of λ < 1 and to permit other step-size schedules. Analysis of cases involving
function approximators and violations of the Markov assumption would also be useful
further steps.
Our empirical results treated a much more realistic case, including in some cases all of
the extensions listed above. These results showed consistent, significant, and sometimes
large advantages of replace-trace methods over accumulate-trace methods, and of trace
methods generally over trace-less methods. The mountain-car experiment showed that the
replace-trace idea can be successfully used in conjunction with a feature-based function
approximator. Although it is not yet clear how to extend the replace-trace idea to other
kinds of function approximators, such as back-propagation networks or nearest-neighbor
methods, Sutton and Whitehead (1993) and others have argued that feature-based function
approximators are actually preferable for online reinforcement learning.
Our empirical results showed a sharp drop in performance as the trace parameter
λ approached 1, corresponding to very long traces. This drop was much less severe
with replacing traces but was still clearly present. This bears on the long-standing
question of the relative merits of TD(1) methods versus true temporal-difference (λ <
1) methods. It might appear that replacing traces make TD(1) methods more capable
competitors; the replace TD(1) method is unbiased in the special case, and more efficient
than conventional TD(1) in both theory and practice. However, this is at the cost of losing
some of the theoretical advantages of conventional TD(1). In particular, conventional
TD(1) converges in many cases to a minimal mean-squared-error solution when function
approximators are used (Dayan, 1992) and has been shown to be useful in non-Markov
problems (Jaakkola, Singh & Jordan, 1995). The replace version of TD(1) does not share
these theoretical guarantees. Like λ < 1 methods, it appears to achieve greater efficiency
in part by relying on the Markov property. In practice, however, the relative merits of
different λ = 1 methods may not be of great significance. All of our empirical results
suggest far better performance is obtained with λ < 1, even when function approximators
are used that create an apparently non-Markov task.
Replacing traces are a simple modification of existing discrete-state or feature-based
reinforcement learning algorithms. In cases in which a good state representation can be
obtained they appear to offer significant improvements in learning speed and reliability.

Acknowledgments
We thank Peter Dayan for pointing out and helping to correct errors in the proofs.
His patient and thorough reading of the paper and his participation in our attempts to
complete the proofs were invaluable. Of course, any remaining errors are our own. We
thank Tommi Jaakkola for providing the central idea behind the proof of Theorem 10, an
anonymous reviewer for pointing out an error in our original proof of Theorem 10, and
Lisa White for pointing out another error. Satinder P. Singh was supported by grants to

Michael I. Jordan (Brain and Cognitive Sciences, MIT) from ATR Human Information
Processing Research and from Siemens Corporation.

Appendix A
Proofs of Analytical Results

A.1. Proof of Theorem 5: First-Visit MC is Reduced-ML

In considering the estimate V(s), we can assume that all trials start in s, because both
first-visit MC and reduced-ML methods ignore transitions prior to the first visit to s. Let
$n_i$ be the number of times state i has been visited, and let $n_{ij}$ be the number of times
transition i → j has been encountered. Let $R_{jk}$ be the average of the rewards seen on
the j → k transitions.
Then $V_N^F(s)$, the first-visit MC estimate after N trials with start state s, is

$$V_N^F(s) = \frac{1}{N} \sum_{j \in S,\, k \in S} n_{jk} R_{jk}.$$

This is identical to (2) because $\sum_{j \in S, k \in S} n_{jk} R_{jk}$ is the total summed reward seen during
the N trials. Because $N = n_s - \sum_{i \in S} n_{is}$, we can rewrite this as

$$V_N^F(s) = \frac{1}{n_s - \sum_{i \in S} n_{is}} \sum_{j,k} n_{jk} R_{jk} = \sum_{j,k} u_{jk} R_{jk}. \qquad (A.1)$$

The maximum-likelihood model of the Markov process after N trials has transition
probabilities $P(ij) = n_{ij}/n_i$ and expected rewards $R(ij) = R_{ij}$. Let $V_N^{ML}(s)$ denote the
reduced-ML estimate after N trials. By definition $V_N^{ML}(s) = E_N\{r_1 + r_2 + r_3 + r_4 + \cdots\}$,
where $E_N$ is the expectation operator for the maximum-likelihood model after N trials,
and $r_l$ is the payoff at step l. Therefore

$$V_N^{ML}(s) = \sum_{j,k} R_{jk} \big[ Prob_1(j \to k) + Prob_2(j \to k) + Prob_3(j \to k) + \cdots \big] = \sum_{j,k} R_{jk} U_{jk}, \qquad (A.2)$$

where $Prob_i(j \to k)$ is the probability of a j-to-k transition at the i-th step according to
the maximum-likelihood model. We now show that for all j, k, $U_{jk}$ of (A.2) is equal to
$u_{jk}$ of (A.1).
Consider two special cases of $j$ in (A.2):

Case 1, $j = s$:

$$U_{sk} = P(sk) + P(ss)P(sk) + \sum_m P(sm)P(ms)P(sk) + \cdots
= \frac{n_{sk}}{n_s}\big[1 + p^1(ss) + p^2(ss) + p^3(ss) + \cdots\big]
= \frac{n_{sk}}{n_s}\,(1 + N_{ss}), \tag{A.3}$$

where $p^n(ij)$ is the probability of going from state $i$ to state $j$ in exactly $n$ steps, and
$N_{ss}$ is the expected number of revisits to state $s$ as per our current maximum-likelihood
model.

Case 2, $j \neq s$:

$$U_{jk} = P(sj)P(jk) + \sum_m P(sm)P(mj)P(jk) + \cdots
= \frac{n_{jk}}{n_j}\big[p^1(sj) + p^2(sj) + p^3(sj) + \cdots\big]
= \frac{n_{jk}}{n_j}\,N_{sj}, \tag{A.4}$$

where $N_{sj}$ is the expected number of visits to state $j$.

For all $j$, the $N_{sj}$ satisfy the recursions

$$N_{sj} = P(sj) + \sum_m N_{sm}\,P(mj) = \frac{n_{sj}}{n_s} + \sum_m N_{sm}\,\frac{n_{mj}}{n_m}. \tag{A.5}$$

We now show that $N_{sj} = \frac{n_j}{n_s - \sum_i n_{is}}$ for $j \neq s$, and $N_{ss} = \frac{n_s}{n_s - \sum_i n_{is}} - 1$, by showing
that these quantities satisfy the recursions (A.5). For $j \neq s$,

$$N_{sj} = \frac{n_{sj}}{n_s} + \sum_{m \neq s}\Big(\frac{n_m}{n_s - \sum_i n_{is}}\Big)\frac{n_{mj}}{n_m} + \Big(\frac{n_s}{n_s - \sum_i n_{is}} - 1\Big)\frac{n_{sj}}{n_s}
= \frac{\sum_m n_{mj}}{n_s - \sum_i n_{is}} = \frac{n_j}{n_s - \sum_i n_{is}},$$

and, again from (A.5),

$$N_{ss} = \frac{n_{ss}}{n_s} + \sum_{m \neq s}\Big(\frac{n_m}{n_s - \sum_i n_{is}}\Big)\frac{n_{ms}}{n_m} + \Big(\frac{n_s}{n_s - \sum_i n_{is}} - 1\Big)\frac{n_{ss}}{n_s}
= \frac{\sum_m n_{ms}}{n_s - \sum_i n_{is}} = \frac{n_s}{n_s - \sum_i n_{is}} - 1.$$

Plugging the above values of $N_{ss}$ and $N_{sj}$ into (A.3) and (A.4), we obtain $U_{jk} = \frac{n_{jk}}{n_s - \sum_i n_{is}} = u_{jk}$. •
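The identity proved above can also be checked numerically. The following sketch is our own illustration (the example chain and all numbers are arbitrary, not from the paper); it computes the first-visit MC estimate and the reduced-ML (certainty-equivalence) estimate from the same set of trials and prints two matching values:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small absorbing chain: states 0 (the start state s), 1, and terminal state 2.
P = np.array([[0.3, 0.5, 0.2],      # P[i, j]: transition probabilities
              [0.4, 0.1, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([[1.0, 0.0, 2.0],      # R[i, j]: deterministic reward on i -> j
              [0.5, 1.5, 0.0],
              [0.0, 0.0, 0.0]])
START, TERM = 0, 2

def run_trial():
    s, transitions = START, []
    while s != TERM:
        s2 = rng.choice(3, p=P[s])
        transitions.append((s, s2, R[s, s2]))
        s = s2
    return transitions

trials = [run_trial() for _ in range(5000)]

# First-visit MC: total summed reward over the N trials divided by N
# (all trials start in s, so every transition follows the first visit to s).
v_first = sum(r for tr in trials for (_, _, r) in tr) / len(trials)

# Reduced-ML: build the maximum-likelihood model from the same data and
# solve v = r_bar + P_hat v exactly for the non-terminal states.
# (With this many trials every non-terminal state is visited.)
n = np.zeros((3, 3)); rsum = np.zeros((3, 3))
for tr in trials:
    for (i, j, r) in tr:
        n[i, j] += 1
        rsum[i, j] += r
P_hat = n[:2] / n[:2].sum(axis=1, keepdims=True)
R_bar = np.divide(rsum[:2], n[:2], out=np.zeros((2, 3)), where=n[:2] > 0)
r_exp = (P_hat * R_bar).sum(axis=1)              # expected one-step reward
v_ml = np.linalg.solve(np.eye(2) - P_hat[:, :2], r_exp)[START]

print(v_first, v_ml)    # identical up to floating-point rounding
```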

A.2. Facts Used in Proofs of Theorems 6-10

The proofs for Theorems 6-10 assume the abstract chain of Figure 3 with just two states,
$s$ and $T$. The quantities $R_s = E\{r_s\}$, $\mathrm{Var}(r_s) = E\{(r_s - R_s)^2\}$, $R_T = E\{r_T\}$,
and $\mathrm{Var}(r_T) = E\{(r_T - R_T)^2\}$ are of interest for the analysis and require careful
elaboration. Let $S_s$ be the set of all state sequences that can occur between visits to
state $s$ (including state $s$ at the head), and let $S_T$ be the set of all state sequences that
can occur on the final run from $s$ to $T$ (including state $s$). The termination probability is
$P_T = \sum_{\{s\}_T \in S_T} P(\{s\}_T)$, where the probability of a sequence of states is the product
of the probabilities of the individual state transitions. By definition, $P_s = 1 - P_T$. The
reward probabilities are defined as follows: $\mathrm{Prob}\{r_s = q\} = \sum_{\{s\} \in S_s} P(\{s\})\,P(r_{\{s\}} = q \mid \{s\})$,
and $\mathrm{Prob}\{r_T = q\} = \sum_{\{s\}_T \in S_T} P(\{s\}_T)\,P(r_T = q \mid \{s\}_T)$. Therefore, $R_s = \sum_q q\,\mathrm{Prob}\{r_s = q\}$ and $R_T = \sum_q q\,\mathrm{Prob}\{r_T = q\}$. Similarly, $\mathrm{Var}(r_s) = \sum_q \mathrm{Prob}\{r_s = q\}(q - R_s)^2$, and $\mathrm{Var}(r_T) = \sum_q \mathrm{Prob}\{r_T = q\}(q - R_T)^2$.

If the rewards in the original Markov chain are deterministic functions of the state
transitions, then there will be a single $r_i$ associated with each $\{s\}_i$. If the rewards
in the original problem are stochastic, however, then there is a set of possible random
$r_i$'s associated with each $\{s\}_i$. Also note that even if all the individual rewards in the
original Markov chain are deterministic, $\mathrm{Var}(r_s)$ and $\mathrm{Var}(r_T)$ can still be greater than
zero, because $r_s$ and $r_T$ will be stochastic because of the many different paths from $s$ to
$s$ and from $s$ to $T$.

The following fact is used throughout:

$$E_{\{x\}}[f(\{x\})] = \sum_{\{x\}} P(\{x\})\, f(\{x\}) = \sum_k P(k)\, E_{\{r\}}\{f(\{x\}) \mid k\}, \tag{A.6}$$

where $k$ is the number of revisits to state $s$. We also use the facts that, if $r < 1$, then

$$\sum_{i=0}^{\infty} i\,r^i = \frac{r}{(1-r)^2} \quad\text{and}\quad \sum_{i=0}^{\infty} i^2\,r^i = \frac{r(1+r)}{(1-r)^3}.$$

A.3. Proof of Theorem 6: First-Visit MC is Unbiased

First we show that first-visit MC is unbiased for one trial. From (A.6),

$$E\{V_1^F(s)\} = E_{\{x\}}[f(\{x\})] = \sum_k P(k)\, E_{\{r\}}\{f(\{x\}) \mid k\}
= \sum_k P_T P_s^k\,(k R_s + R_T)
= \frac{P_s}{P_T} R_s + R_T
= V(s).$$

Because the estimate after $n$ trials, $V_n^F(s)$, is the sample average of $n$ independent
estimates, each of which is unbiased, the $n$-trial estimate itself is unbiased. •

A.4. Proof of Theorem 7: Every-Visit MC is Biased

For a single trial, the bias of the every-visit MC algorithm is given by

$$E\{V_1^E(s)\} = E_{\{x\}}[t(\{x\})] = \sum_k P(k)\, E_{\{r\}}\{t(\{x\}) \mid k\}
= \sum_k P_T P_s^k\, \frac{R_s + 2R_s + \cdots + kR_s + (k+1)R_T}{k+1}
= \frac{P_s}{2P_T} R_s + R_T.$$

Therefore, $\mathrm{Bias}_1^E = V(s) - E\{V_1^E(s)\} = \frac{P_s}{2P_T} R_s$.

Computing the bias after $n$ trials is a bit more complex, because of the combinatorics
of getting $k$ revisits to state $s$ in $n$ trials, denoted $B(n;k)$. Equivalently, one can think
of $B(n;k)$ as the number of different ways of factoring $k$ into $n$ non-negative integer
additive factors with the order considered important. Therefore, $B(n;k) = \binom{k+n-1}{n-1}$.
Further, let $B(n; k_1, k_2, \ldots, k_n \mid k)$ be the number of different ways one can get $k_1$ to
$k_n$ as the factors with the order ignored. Note that $\sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k) = B(n;k)$. We use superscripts to distinguish the rewards from different trials, e.g., $r^2_{s_j}$
refers to the random total reward received between the $j^{th}$ and $(j+1)^{st}$ visits to start
state $s$ in the second trial. Then

$$E\{V_n^E(s)\} = E_{\{x\}}\left\{ \frac{\sum_{i=1}^n t_{num}(\{x\}^i)}{\sum_{i=1}^n (k_i + 1)} \right\}
= \sum_{k_1, k_2, \ldots, k_n} P(k_1, k_2, \ldots, k_n)\; E_{\{r\}}\left\{ \frac{\sum_{i=1}^n \big( \sum_{j=1}^{k_i} j\, r^i_{s_j} + (k_i + 1)\, r^i_T \big)}{\sum_{i=1}^n (k_i + 1)} \,\Big|\, k_1, k_2, \ldots, k_n \right\}.$$

Because each particular sequence of revisit counts $(k_1, \ldots, k_n)$ has probability $P_T^n P_s^k$,
where $k = \sum_{i=1}^n k_i$, taking the expectation over the rewards gives

$$E\{V_n^E(s)\} = \sum_k P_T^n P_s^k \sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k) \left[ \frac{\sum_{i=1}^n \frac{k_i(k_i+1)}{2}}{k + n}\, R_s + R_T \right]$$
$$= \sum_k P_T^n P_s^k \left( \sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k)\; \frac{\sum_{i=1}^n (k_i)^2 + k}{2(k+n)} \right) R_s + R_T.$$

This, together with

$$\sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k)\; \frac{\sum_{i=1}^n (k_i)^2 + k}{2(k+n)} = \frac{k}{n+1}\, B(n;k), \tag{A.7}$$

which we show below, and the fact that

$$\sum_{k=0}^{\infty} k\, B(n;k)\, P_T^n P_s^k = \frac{n P_s}{P_T},$$

leads to the conclusion that

$$E\{V_n^E(s)\} = R_T + \frac{n}{n+1}\,\frac{P_s}{P_T}\, R_s.$$

Therefore $\mathrm{Bias}_n^E(s) = V(s) - E\{V_n^E(s)\} = \frac{1}{n+1}\,\frac{P_s}{P_T}\, R_s$. This also proves that the
every-visit MC algorithm is unbiased in the limit as $n \to \infty$.

Proof of (A.7):
Define $T(n;k)$ as the sum over all factorizations of the squared factors of $k$:

$$T(n;k) = \sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k) \sum_{i=1}^n (k_i)^2.$$

We know the following facts from first principles:

Fact 1: $\sum_{j=0}^{k} B(n; k-j) = B(n+1; k)$;

Fact 2: $\sum_{j=0}^{k} j\, B(n; k-j) = \frac{k}{n+1}\, B(n+1; k)$;

Fact 3: $\sum_{j=0}^{k} j^2\, B(n; k-j) = \frac{T(n+1; k)}{n+1}$;

and also that, by definition,

Fact 4: $T(n+1; k) = \sum_{j=0}^{k} \big[ T(n; k-j) + j^2\, B(n; k-j) \big]$.

Facts 3 and 4 imply that

$$T(n+1; k) = \frac{n+1}{n} \sum_{j=0}^{k} T(n; k-j). \tag{A.8}$$

Using Facts 1-3 we can show by substitution that $T(n;k) = \frac{k(2k+n-1)}{n+1} B(n;k)$ is a
solution of the recursion (A.8), hence proving that

$$\sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k) \sum_{i=1}^n (k_i)^2 = \frac{k(2k+n-1)}{n+1} B(n;k)$$
$$\Longrightarrow\quad \sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k) \left[ \sum_{i=1}^n (k_i)^2 + k \right] = \frac{2k(k+n)}{n+1} B(n;k)$$
$$\Longrightarrow\quad \sum_{\text{factors of } k} B(n; k_1, k_2, \ldots, k_n \mid k)\; \frac{\sum_{i=1}^n (k_i)^2 + k}{2(k+n)} = \frac{k}{n+1} B(n;k). \;•$$
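The bias results above are easy to check by simulation on the two-state abstract chain. The following sketch is our own illustration (the particular values of P_s, R_s, R_T, and n are arbitrary); it estimates the mean of the first-visit and every-visit MC estimators and compares the latter's bias with the expression P_s R_s / ((n+1) P_T):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-state abstract chain: from s, revisit s with probability P_s (reward R_s)
# or terminate with probability P_T (reward R_T); rewards here are deterministic.
P_s, R_s, R_T = 0.6, 1.0, 2.0
P_T = 1.0 - P_s
V_true = (P_s / P_T) * R_s + R_T
n = 4                                     # trials per estimate

def revisits():
    return rng.geometric(P_T) - 1         # k, distributed as P_T * P_s**k

first, every = [], []
for _ in range(100000):
    ks = [revisits() for _ in range(n)]
    # First-visit MC: average over trials of the return from the first visit.
    first.append(np.mean([k * R_s + R_T for k in ks]))
    # Every-visit MC: sum of the returns from every visit to s, divided by the
    # total number of visits; a trial with k revisits contributes
    # k(k+1)/2 * R_s + (k+1) * R_T to the numerator and k+1 to the denominator.
    num = sum(k * (k + 1) / 2 * R_s + (k + 1) * R_T for k in ks)
    every.append(num / sum(k + 1 for k in ks))

print("true value       :", V_true)
print("first-visit mean :", np.mean(first))                  # unbiased (Theorem 6)
print("every-visit mean :", np.mean(every))                  # biased low (Theorem 7)
print("predicted bias   :", P_s * R_s / ((n + 1) * P_T))
print("observed bias    :", V_true - np.mean(every))
```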

A.5. Proof of Theorem 8: Variance of First-Visit MC

We first compute the variance of first-visit MC after one trial $\{x\}$:

$$\mathrm{Var}_1^F(s) = E_{\{x\}}\{(f(\{x\}) - V(s))^2\} = \sum_k P(k)\, E_{\{r\}}\{(f(\{x\}) - V(s))^2 \mid k\}$$
$$= \sum_k P_T P_s^k\, E_{\{r\}}\left\{ \Big( \sum_{i=1}^k r_{s_i} + r_T - V(s) \Big)^2 \,\Big|\, k \right\}$$
$$= \sum_k P_T P_s^k \left[ k\,\mathrm{Var}(r_s) + \mathrm{Var}(r_T) + R_s^2 \Big(k - \frac{P_s}{P_T}\Big)^2 \right]$$
$$= \frac{P_s}{P_T^2}\, R_s^2 + \left[ \frac{P_s}{P_T}\,\mathrm{Var}(r_s) + \mathrm{Var}(r_T) \right],$$

where the third line follows from expanding the square and taking the expectation of the reward terms conditional on $k$. The first term is the variance due to the variance in the number of revisits to state $s$, and
the second term is the variance due to the random rewards. The first-visit MC estimate
after $n$ trials is the sample average of the $n$ independent trials; therefore, $\mathrm{Var}_n^F(s) = \frac{\mathrm{Var}_1^F(s)}{n}$. •

A.6. Proof of Theorem 9: Variance of Every-Visit MC After 1 Trial

$$\mathrm{Var}_1^E(s) = E_{\{x\}}\{(t(\{x\}) - E\{t(\{x\})\})^2\}
= \sum_{\{x\}} P(\{x\}) \left[ \frac{r_{s_1} + 2r_{s_2} + \cdots + k r_{s_k} + (k+1) r_T}{k+1} - \Big( \frac{P_s}{2P_T} R_s + R_T \Big) \right]^2.$$

Expanding the square and taking the expectation of the reward terms, conditional on the number of revisits $k$, gives

$$\mathrm{Var}_1^E(s) = \sum_k P_T P_s^k \left[ \frac{k(2k+1)}{6(k+1)}\,\mathrm{Var}(r_s) + \mathrm{Var}(r_T) + \frac{R_s^2}{4}\Big(k - \frac{P_s}{P_T}\Big)^2 \right]$$
$$= \mathrm{Var}(r_T) + \frac{R_s^2}{4}\,\frac{P_s}{P_T^2} + \mathrm{Var}(r_s) \sum_k P_T P_s^k\, \frac{k(2k+1)}{6(k+1)}.$$

Note that $\frac{k}{3} - \frac{1}{6} \le \frac{k(2k+1)}{6(k+1)} = \frac{k}{3} - \frac{k}{6(k+1)} \le \frac{k}{3}$. Therefore,

$$\mathrm{Var}(r_T) + \frac{R_s^2}{4}\,\frac{P_s}{P_T^2} + \mathrm{Var}(r_s)\Big( \frac{P_s}{3P_T} - \frac{1}{6} \Big) \le \mathrm{Var}_1^E(s),$$

and

$$\mathrm{Var}_1^E(s) \le \mathrm{Var}(r_T) + \frac{R_s^2}{4}\,\frac{P_s}{P_T^2} + \mathrm{Var}(r_s)\,\frac{P_s}{3P_T}. \;•$$
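The same kind of simulation can be used to check the single-trial variance results of the last two sections. The sketch below is our own illustration (arbitrary chain parameters, deterministic rewards so that Var(r_s) = Var(r_T) = 0); it compares the empirical variances of the two estimators after one trial with the formula of Theorem 8 and the bounds of Theorem 9:

```python
import numpy as np

rng = np.random.default_rng(2)

P_s, R_s, R_T = 0.6, 1.0, 2.0
P_T = 1.0 - P_s
var_rs = var_rT = 0.0                    # deterministic rewards for simplicity

ks = rng.geometric(P_T, size=500000) - 1                      # revisit counts, one per trial
f = ks * R_s + R_T                                            # first-visit estimates
e = (ks * (ks + 1) / 2 * R_s + (ks + 1) * R_T) / (ks + 1)     # every-visit estimates

var_f_theory = P_s / P_T**2 * R_s**2 + P_s / P_T * var_rs + var_rT
lower = var_rT + R_s**2 / 4 * P_s / P_T**2 + var_rs * (P_s / (3 * P_T) - 1 / 6)
upper = var_rT + R_s**2 / 4 * P_s / P_T**2 + var_rs * P_s / (3 * P_T)

print("Var_1^F simulated:", f.var(), " theory:", var_f_theory)
print("Var_1^E simulated:", e.var(), " bounds:", lower, upper)
```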

A.7. Proof of Theorem 10: First-Visit MC is Eventually Lower in Variance and MSE

The central idea for this proof was provided to us by Jaakkola (personal communication).
The every-visit MC estimate,

$$V_n^E(s) = \frac{\sum_{i=1}^n t_{num}(\{x\}^i)}{\sum_{i=1}^n (k_i + 1)},$$

can be rewritten as

$$V_n^E(s) = \frac{\frac{1}{n}\sum_{i=1}^n \frac{t_{num}(\{x\}^i)}{E\{k_i+1\}}}{\frac{1}{n}\sum_{i=1}^n \frac{k_i+1}{E\{k_i+1\}}},$$

because, for all $i$, $E\{k_i + 1\} = \frac{1}{P_T}$. It is also easy to show that, for all $i$, $E\{P_T\, t_{num}(\{x\}^i)\} = V(s)$ and $E\{P_T (k_i + 1)\} = 1$.

Consider the sequence of functions

$$f_n(\delta) = \frac{V(s) + \delta \bar T_n}{1 + \delta \bar K_n},$$

where $\bar T_n = \frac{1}{\sqrt n}\sum_{i=1}^n \big(P_T\, t_{num}(\{x\}^i) - V(s)\big)$ and $\bar K_n = \frac{1}{\sqrt n}\sum_{i=1}^n \big(P_T (k_i + 1) - 1\big)$.
Note that, for all $n$, $E\{\bar T_n\} = 0$, $E\{\bar K_n\} = 0$, and $f_n\big(\tfrac{1}{\sqrt n}\big) = V_n^E(s)$. Therefore, $\mathrm{Var}_n^E(s) = E\big\{\big(f_n(\tfrac{1}{\sqrt n})\big)^2\big\} - \big(E\{V_n^E(s)\}\big)^2$. Using Taylor's expansion,

$$f_n^2\Big(\tfrac{1}{\sqrt n}\Big) = f_n^2(0) + \frac{1}{\sqrt n}\,\frac{\partial}{\partial \delta} f_n^2(\delta)\Big|_{\delta=0} + \frac{1}{2n}\,\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\Big|_{\delta=0} + \frac{1}{6 n^{3/2}}\,\frac{\partial^3}{\partial \delta^3} f_n^2(\delta)\Big|_{\delta=0} + \cdots$$

Therefore,

$$\mathrm{Var}_n^E(s) = E\left\{ f_n^2(0) + \frac{1}{\sqrt n}\frac{\partial}{\partial \delta} f_n^2(\delta)\Big|_{\delta=0} + \frac{1}{2n}\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\Big|_{\delta=0} \right\} + E\left\{ \frac{1}{6 n^{3/2}}\frac{\partial^3}{\partial \delta^3} f_n^2(\delta)\Big|_{\delta=0} + \cdots \right\} - \big(E\{V_n^E(s)\}\big)^2. \tag{A.9}$$

We prove below that the second expectation in (A.9) is $O\big(\tfrac{1}{n^{3/2}}\big)$, by showing that, for
all $i > 2$, $E\big\{\frac{\partial^i}{\partial \delta^i} f_n^2(\delta)\big|_{\delta=0}\big\}$ is $O(1)$. Recall that $\mathrm{Var}_n^F(s)$ decreases as $\frac{1}{n}$; the goal here is to show that, for some large $N$, $\mathrm{Var}_n^E(s) > \mathrm{Var}_n^F(s)$ for all $n > N$.

We use the following facts without proof (except Fact 12, which is proved subsequently):

Fact 5: $E\{f_n^2(0)\} = V^2(s)$;

Fact 6: $E\big\{\frac{\partial}{\partial \delta} f_n^2(\delta)\big|_{\delta=0}\big\} = 0$, because $\frac{\partial}{\partial \delta} f_n^2(\delta)\big|_{\delta=0} = 2V(s)\big(\bar T_n - V(s)\bar K_n\big)$;

Fact 7: $\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\big|_{\delta=0} = 2\bar T_n^2 - 8V(s)\bar T_n \bar K_n + 6V^2(s)\bar K_n^2$;

Fact 8: $E\{\bar K_n^2\} = P_s$;

Fact 9: $E\{\bar K_n \bar T_n\} = \frac{2P_s + P_s^2}{P_T} R_s + R_T(P_s + 1) - V(s)$;

Fact 10: $E\{\bar T_n^2\} = \frac{P_s + P_s^2}{P_T}\mathrm{Var}(r_s) + (P_s + 1)\mathrm{Var}(r_T) - V^2(s) + (P_s + 1)R_T^2 + \frac{4P_s + 2P_s^2}{P_T} R_s R_T + \frac{P_s + 4P_s^2 + P_s^3}{P_T^2} R_s^2$;

Fact 11: $V^2(s) - \big(E\{V_n^E(s)\}\big)^2 = \frac{(2n+1)P_s^2}{(n+1)^2 P_T^2} R_s^2 + \frac{2P_s}{(n+1)P_T} R_s R_T$;

Fact 12: $E\big\{\frac{1}{2n}\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\big|_{\delta=0}\big\} = \frac{P_s + P_s^2}{n P_T}\mathrm{Var}(r_s) + \frac{P_s + 1}{n}\mathrm{Var}(r_T) + \frac{P_s - P_s^2}{n P_T^2} R_s^2 - \frac{2P_s}{n P_T} R_s R_T$.

Therefore, the $O\big(\tfrac{1}{n}\big)$ behavior of the variance of the every-visit MC estimate is

$$\mathrm{Var}_n^E(s) \approx \frac{P_s + P_s^2}{n P_T}\mathrm{Var}(r_s) + \frac{P_s + 1}{n}\mathrm{Var}(r_T) + \frac{(n^2 - n - 1)\,P_s^2}{n(n+1)^2 P_T^2} R_s^2 + \frac{P_s}{n P_T^2} R_s^2$$

(the $R_s R_T$ terms from Facts 11 and 12 cancel to this order). Finally, comparing with

$$\mathrm{Var}_n^F(s) = \frac{1}{n}\mathrm{Var}(r_T) + \frac{P_s}{n P_T}\mathrm{Var}(r_s) + \frac{P_s}{n P_T^2} R_s^2$$

proves Theorem 10. •

Ignoring higher-order terms:

Note that $f_n^2(\delta)$ is of the form $\Big(\frac{V(s) + \delta \bar T_n}{1 + \delta \bar K_n}\Big)^2$. Therefore, for all $i > 0$, the denominator
of $\frac{\partial^i}{\partial \delta^i} f_n^2(\delta)$ is of the form $(1 + \delta \bar K_n)^j$ for some $j > 0$. On evaluating at $\delta = 0$, the
denominator will always be 1, leaving in the numerator terms of the form $c\,\bar T_n^m \bar K_n^z$,
where $c$ is some constant and $m, z \ge 0$. For example, Fact 6 shows that $\frac{\partial}{\partial \delta} f_n^2(\delta)\big|_{\delta=0} = 2V(s)(\bar T_n - V(s)\bar K_n)$, and Fact 7 shows that $\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\big|_{\delta=0} = 2\bar T_n^2 - 8V(s)\bar T_n\bar K_n + 6V^2(s)\bar K_n^2$. We show that $E\{\bar T_n^m \bar K_n^z\}$ is $O(1)$.

Both $\bar T_n$ and $\bar K_n$ are sums of $n$ independent mean-zero random variables, scaled by $\frac{1}{\sqrt n}$. $\bar T_n^m \bar K_n^z$
contains terms that are products of random variables from the $n$ trials. On taking expectation, any term that contains a random variable from some trial exactly once drops
out. Therefore all terms that remain contain variables from no more than $\lfloor \frac{m+z}{2} \rfloor$
trials. This implies that the number of terms that survive is $O\big(n^{\lfloor \frac{m+z}{2} \rfloor}\big)$ (the constant
is a function of $i$, but is independent of $n$). Therefore,

$$E\{\bar T_n^m \bar K_n^z\} = \frac{1}{n^{\frac{m+z}{2}}}\; O\big(n^{\lfloor \frac{m+z}{2} \rfloor}\big)\, c_1, \quad \text{which is } O(1),$$

where $c_1$ is some constant. This implies that the expected values of the terms in the
Taylor expansion corresponding to $i > 2$ are $O\big(\tfrac{1}{n^{3/2}}\big)$.

Proof of Fact 12:

$$E\Big\{\frac{1}{2n}\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\Big|_{\delta=0}\Big\} = \frac{E\{2\bar T_n^2 - 8V(s)\bar T_n\bar K_n + 6V^2(s)\bar K_n^2\}}{2n}. \tag{A.10}$$

From Fact 10:

$$E\{2\bar T_n^2\} = \frac{2(P_s + P_s^2)}{P_T}\mathrm{Var}(r_s) + 2(P_s + 1)\mathrm{Var}(r_T) - 2V^2(s) + 2(1 + P_s)\Big(\frac{P_s^2}{P_T^2}R_s^2 + \frac{2P_s}{P_T}R_s R_T + R_T^2\Big) + \frac{4P_s}{P_T}R_s R_T + \frac{2P_s}{P_T^2}R_s^2 + \frac{6P_s^2}{P_T^2}R_s^2$$
$$= \frac{2(P_s + P_s^2)}{P_T}\mathrm{Var}(r_s) + 2(P_s + 1)\mathrm{Var}(r_T) + 2P_s V^2(s) + \frac{4P_s}{P_T}R_s R_T + \frac{2P_s}{P_T^2}R_s^2 + \frac{6P_s^2}{P_T^2}R_s^2.$$

Similarly, from Fact 9:

$$E\{-8V(s)\bar T_n\bar K_n\} = -8V(s)\Big(\frac{P_s}{P_T}R_s + (1 + P_s)\Big(\frac{P_s}{P_T}R_s + R_T\Big) - V(s)\Big) = -8P_s V^2(s) - \frac{8P_s^2}{P_T^2}R_s^2 - \frac{8P_s}{P_T}R_s R_T,$$

and from Fact 8:

$$E\{6V^2(s)\bar K_n^2\} = 6P_s V^2(s).$$

Therefore, from (A.10), we see that

$$E\Big\{\frac{1}{2n}\frac{\partial^2}{\partial \delta^2} f_n^2(\delta)\Big|_{\delta=0}\Big\} = \frac{P_s + P_s^2}{n P_T}\mathrm{Var}(r_s) + \frac{P_s + 1}{n}\mathrm{Var}(r_T) + \frac{P_s - P_s^2}{n P_T^2}R_s^2 - \frac{2P_s}{n P_T}R_s R_T. \;•$$

Appendix B
Details of the Mountain-Car Task

The mountain-car task (Figure 8) has two continuous state variables, the position of the
car, $p_t$, and the velocity of the car, $v_t$. At the start of each trial, the initial state is chosen
randomly, uniformly from the allowed ranges: $-1.2 < p < 0.5$, $-0.07 < v < 0.07$. The
mountain geography is described by $\mathrm{altitude} = \sin(3p)$. The action, $a_t$, takes on values
in $\{+1, 0, -1\}$, corresponding to forward thrust, no thrust, and reverse thrust. The state
evolution was according to the following simplified physics:

$$v_{t+1} = \mathrm{bound}\,[v_t + 0.001\, a_t + g \cos(3 p_t)]$$

and

$$p_{t+1} = \mathrm{bound}\,[p_t + v_{t+1}],$$

where $g = -0.0025$ is the force of gravity and the bound operation clips each variable
within its allowed range. If $p_{t+1}$ is clipped in this way, then $v_{t+1}$ is also reset to zero.
Reward is $-1$ on all time steps. The trial terminates with the first position value that
exceeds $p_{t+1} > 0.5$.
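For concreteness, a minimal sketch of these dynamics in Python is given below (our own rendering, not code from the paper; the handling of the right boundary follows the reading that the trial ends as soon as the unclipped position exceeds 0.5):

```python
import math

P_MIN, P_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07
G = -0.0025                          # force of gravity

def step(p, v, a):
    """One time step of the mountain-car dynamics; a is -1, 0, or +1."""
    v = v + 0.001 * a + G * math.cos(3 * p)
    v = min(max(v, V_MIN), V_MAX)    # bound the velocity
    p = p + v
    done = p > P_MAX                 # first position exceeding 0.5 ends the trial
    if p < P_MIN:                    # clipped at the left bound:
        p, v = P_MIN, 0.0            # the velocity is also reset to zero
    return p, v, -1.0, done          # reward is -1 on every time step
```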

Notes

1. Arguably, yet a third mechanism for managing delayed reward is to change representations or world models
(e.g., Dayan, 1993; Sutton, 1995).
2. In some previous work (e.g., Sutton & Barto, 1987, 1990) the traces were normalized by a factor of 1 - γλ,
which is equivalent to replacing the "1" in these equations by 1 - γλ. In this paper, as in most previous
work, we absorb this linear normalization into the step-size parameter, α, in equation (1).
3. The time index here is assumed to continue increasing across trials. For example, if one trial reaches a
terminal state at time τ, then the next trial begins at time τ + 1.
4. For this reason, this estimate is sometimes also referred to as the certainty equivalent estimate (e.g., Kumar
and Varaiya, 1986).
5. In theory it is possible to get this down to O(n^2.376) operations (Baase, 1988), but, even if practical, this
is still far too complex for many applications.
6. Although this algorithm is indeed identical to TD(λ), the theoretical results for TD(λ) on stationary pre-
diction problems (e.g., Sutton, 1988; Dayan, 1992) do not apply here because the policy is continually
changing, creating a nonstationary prediction problem.
7. This is a very simple way of assuring initial exploration of the state space. Because most values are better
than they should be, the learning system is initially disappointed no matter what it does, which causes
it to try a variety of things even though its policy at any one time is deterministic. This approach was
sufficient for this task, but of course we do not advocate it in general as a solution to the problem of
assuring exploration.

References

Albus, J. S., (1981). Brain, Behavior, and Robotics, chapter 6, pages 139-179. Byte Books.
Baase, S., (1988). Computer Algorithms: Introduction to design and analysis. Reading, MA: Addison-Wesley.
Barnard, E., (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man,
and Cybernetics, 23(2), 357-365.
Barto, A. G. & Duff, M , (1994). Monte Carlo matrix inversion and reinforcement learning. In Advances in
Neural Information Processing Systems 6, pages 687-694, San Mateo, CA. Morgan Kaufmann.
Barto, A. G., Sutton, R. S., & Anderson, C. W., (1983). Neuronlike elements that can solve difficult learning
control problems. IEEE Trans. on Systems, Man, and Cybernetics, 13, 835-846.
Bellman, R. E., (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Curtiss, J. H., (1954). A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo
method for computing one component of the solution of a set of linear algebraic equations. In Meyer, H. A.
(Ed.), Symposium on Monte Carlo Methods, pages 191-233, New York: Wiley.
Dayan, P., (1992). The convergence of TD(A) for general A. Machine Learning, 8(3/4), 341-362.
Dayan, P., (1993). Improving generalization for temporal difference learning: The successor representation.
Neural Computation, 5(4), 613-624.
Dayan, P. & Sejnowski, T., (1994). TD(A) converges with probability 1. Machine Learning, 14, 295-301.

Holland, J. H., (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to
parallel rule-based systems, Volume 2 of Machine Learning: An Artificial Intelligence Approach, chapter 20.
Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. P., (1994). On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6(6), 1185-1201.
Jaakkola, T., Singh, S. P., & Jordan, M. I., (1995). Reinforcement learning algorithm for partially observable
Markov decision problems. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann.
Klopf, A. H., (1972). Brain function and adaptive systems--A heterostatic theory. Technical Report AFCRL-
72-0164, Air Force Cambridge Research Laboratories, Bedford, MA.
Kumar, P. R. & Varaiya, P. P., (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control.
Englewood Cliffs, N.J.: Prentice Hall.
Lin, L. J., (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching.
Machine Learning, 8(3/4), 293-321.
Miller, W. T., Glanz, F. H., & Kraft, L. G., (1990). CMAC: An associative neural network alternative to
backpropagation. Proc. of the IEEE, 78, 1561-1567.
Moore, A. W., (1991). Variable resolution dynamic programming: Efficiently learning action maps in multi-
variate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop,
pages 333-337, San Mateo, CA. Morgan Kaufmann.
Peng, J., (1993). Dynamic Programming-based Learning for Control. PhD thesis, Northeastern University.
Peng, J. & Williams, R. J., (1994). Incremental multi-step Q-learning. In Machine Learning: Proceedings of
the Eleventh International Conference, pages 226-232. Morgan Kaufmann.
Rubinstein, R., (1981). Simulation and the Monte Carlo method. New York: John Wiley & Sons.
Rummery, G. A. & Niranjan, M., (1994). On-line Q-learning using connectionist systems. Technical Report
CUED/F-INFENG/TR 166, Cambridge University Engineering Dept.
Sutton, R. S., (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of
Massachusetts, Amherst, MA.
Sutton, R. S., (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning:
Proceedings of the Twelth International Conference, pages 531-539. Morgan Kaufmann.
Sutton, R. S. & Barto, A. G., (1987). A temporal-difference model of classical conditioning. In Proceedings
of the Ninth Annual Conference of the Cognitive Science Society, pages 355-378, Hillsdale, NJ: Erlbaum.
Sutton, R. S. & Barto, A. G., (1990). Time-derivative models of Pavlovian conditioning. In Gabriel, M.
& Moore, J. W. (Eds.), Learning and Computational Neuroscience, pages 497-537. Cambridge, MA: MIT
Press.
Sutton, R. S. & Singh, S. P., (1994). On step-size and bias in temporal-difference learning. In Eighth Yale
Workshop on Adaptive and Learning Systems, pages 91-96, New Haven, CT.
Sutton, R. S. & Whitehead, S. D., (1993). Online learning with random representations. In Machine Learning:
Proceedings of the Tenth Int. Conference, pages 314-321. Morgan Kaufmann.
Tesauro, G. J., (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277.
Tsitsiklis, J., (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3),
185-202.
Wasow, W. R., (1952). A note on the inversion of matrices by random walks. Math. Tables Other Aids
Comput., 6, 78-81.
Watkins, C. J. C. H., (1989). Learning from Delayed Rewards. PhD thesis, Cambridge Univ., Cambridge,
England.
Wilson, S. W., (to appear). Classifier fitness based on accuracy. Evolutionary Computation.

Received November 7, 1994


Accepted March 10, 1995
Final Manuscript October 4, 1995
