Reinforcement Learning
with Replacing Eligibility Traces
SATINDER P. SINGH singh@psyche.mit.edu
Dept. of Brain and Cognitive Sciences
Massachusetts Institute of Technology, Cambridge, Mass. 02139
Abstract. The eligibility trace is one of the basic mechanisms used in reinforcement learning to handle
delayed reward. In this paper we introduce a new kind of eligibility trace, the replacing trace, analyze it
theoretically, and show that it results in faster, more reliable learning than the conventional trace. Both kinds
of trace assign credit to prior events according to how recently they occurred, but only the conventional trace
gives greater credit to repeated events. Our analysis is for conventional and replace-trace versions of the
offline TD(1) algorithm applied to undiscounted absorbing Markov chains. First, we show that these methods
converge under repeated presentations of the training set to the same predictions as two well known Monte
Carlo methods. We then analyze the relative efficiency of the two Monte Carlo methods. We show that
the method corresponding to conventional TD is biased, whereas the method corresponding to replace-trace
TD is unbiased. In addition, we show that the method corresponding to replacing traces is closely related
to the maximum likelihood solution for these tasks, and that its mean squared error is always lower in the
long run. Computational results confirm these analyses and show that they are applicable more generally. In
particular, we show that replacing traces significantly improve performance and reduce parameter sensitivity
on the "Mountain-Car" task, a full reinforcement-learning problem with a continuous state space, when using
a feature-based function approximator.
Keywords: reinforcement learning, temporal difference learning, eligibility trace, Monte Carlo method, Markov
chain, CMAC
1. Eligibility Traces
Two fundamental mechanisms have been used in reinforcement learning to handle delayed reward. One is temporal-difference (TD) learning, as in the TD(λ) algorithm (Sutton, 1988) and in Q-learning (Watkins, 1989). TD learning in effect constructs an internal reward signal that is less delayed than the original, external one. However, TD methods can eliminate the delay completely only on fully Markov problems, which are rare in practice. In most problems some delay always remains between an action and its effective reward, and on all problems some delay is always present during the time before TD learning is complete. Thus, there is a general need for a second mechanism to handle whatever delay is not eliminated by TD learning.

The second mechanism that has been widely used for handling delay is the eligibility trace.¹ Introduced by Klopf (1972), eligibility traces have been used in a variety of reinforcement learning systems (e.g., Barto, Sutton & Anderson, 1983; Lin, 1992; Tesauro,
1992; Peng & Williams, 1994). Systematic empirical studies of eligibility traces in con-
junction with TD methods were made by Sutton (1984), and theoretical results have
been obtained by several authors (e.g., Dayan, 1992; Jaakkola, Jordan & Singh, 1994;
Tsitsiklis, 1994; Dayan & Sejnowski, 1994; Sutton & Singh, 1994).
The idea behind all eligibility traces is very simple. Each time a state is visited it
initiates a short-term memory process, a trace, which then decays gradually over time.
This trace marks the state as eligible for learning. If an unexpectedly good or bad event
occurs while the trace is non-zero, then the state is assigned credit accordingly. In a
conventional accumulating trace, the trace builds up each time the state is entered. In a
replacing trace, on the other hand, each time the state is visited the trace is reset to 1
regardless of the presence of a prior trace. The new trace replaces the old. See Figure 1.
Sutton (1984) describes the conventional trace as implementing the credit assignment
heuristics of recency (more credit to more recent events) and frequency (more credit
to events that have occurred more times). The new replacing trace can be seen simply
as discarding the frequency heuristic while retaining the recency heuristic. As we show
later, this simple change can have a significant effect on performance.
Typically, eligibility traces decay exponentially according to the product of a decay parameter, λ, 0 ≤ λ ≤ 1, and a discount-rate parameter, γ, 0 ≤ γ ≤ 1. The conventional accumulating trace is defined by:²

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s) & \text{if } s \ne s_t; \\ \gamma\lambda\, e_t(s) + 1 & \text{if } s = s_t, \end{cases}$$

where $e_t(s)$ represents the trace for state s at time t, and $s_t$ is the actual state at time t. The corresponding replacing trace is defined by:

$$e_{t+1}(s) = \begin{cases} \gamma\lambda\, e_t(s) & \text{if } s \ne s_t; \\ 1 & \text{if } s = s_t. \end{cases}$$
In a control problem, each state-action pair has a separate trace. When a state is visited
and an action taken, the state's trace for that action is reset to 1 while the traces for the
other actions are reset to zero (see Section 5).
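For concreteness, the two trace updates can be written in a few lines of code. The following is a minimal Python sketch under our own naming (a dictionary of traces keyed by state); it is illustrative, not code from the paper.

```python
def update_traces(e, s_t, lam, gamma, replacing=True):
    """One time step of trace maintenance: decay every trace by gamma*lambda,
    then mark the current state s_t."""
    for s in list(e):
        e[s] *= gamma * lam
    if replacing:
        e[s_t] = 1.0                        # replacing trace: reset to 1
    else:
        e[s_t] = e.get(s_t, 0.0) + 1.0      # accumulating trace: build up
    return e
```

In the control case described above, the dictionary would be keyed by state-action pairs, with the traces for the other actions at s_t set to zero when the replacing variant is used.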
Figure 1. The conventional accumulating trace and the replacing trace for a state that is visited repeatedly (the accumulating trace builds up with each visit; the replacing trace is reset to 1 on each visit).
For problems with a large state space it may be extremely unlikely for the exact
same state ever to recur, and thus one might think replacing traces would be irrelevant.
However, large problems require some sort of generalization between states, and thus
some form of function approximator. Even if the same states never recur, states with
the same features will. In Section 5 we show that replacing traces do indeed make a
significant difference on problems with a large state space when the traces are done on
a feature-by-feature basis rather than on a state-by-state basis.
The rest of this paper is structured as follows. In the next section we review the TD(λ)
prediction algorithm and prove that its variations using accumulating and replacing traces
are closely related to two Monte Carlo algorithms. In Section 3 we present our main
results on the relative efficiency of the two Monte Carlo algorithms. Sections 4 and 5
are empirical and return to the general case.
The TD(λ) family of algorithms combine TD learning with eligibility traces to estimate the value function. The discrete-state form of the TD(λ) algorithm is defined by

$$\Delta V_t(s) = \alpha_t(s)\,\big[r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)\big]\, e_{t+1}(s), \qquad (1)$$

where $V_t(s)$ is the estimate at time t of V(s), $\alpha_t(s)$ is a positive step-size parameter, $e_{t+1}(s)$ is the eligibility trace for state s, and $\Delta V_t(s)$ is the increment in the estimate of V(s) determined at time t.³ The value at the terminal state is of course defined as $V_t(T) = 0$, for all t. In online TD(λ), the estimates are incremented on every time step: $V_{t+1}(s) = V_t(s) + \Delta V_t(s)$. In offline TD(λ), on the other hand, the increments $\Delta V_t(s)$ are set aside until the terminal state is reached. In this case the estimates $V_t(s)$ are constant while the chain is undergoing state transitions, all changes being deferred until the end of the trial.
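As a concrete illustration, here is a minimal tabular sketch of online TD(λ) in Python using update (1). The trial encoding and all names are ours, not the paper's.

```python
def online_td_lambda(trials, alpha, lam, gamma=1.0, replacing=True):
    """Tabular online TD(lambda). Each trial is a sequence
    (s_0, r_1, s_1, r_2, ..., r_T, 'T') of alternating states and rewards,
    ending in the terminal state 'T' (whose value is implicitly 0)."""
    V = {}                                   # value estimates
    for trial in trials:
        e = {}                               # eligibility traces, reset each trial
        for i in range(0, len(trial) - 2, 2):
            s, r, s_next = trial[i], trial[i + 1], trial[i + 2]
            delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
            for x in list(e):                # decay all traces
                e[x] *= gamma * lam
            if replacing:
                e[s] = 1.0
            else:
                e[s] = e.get(s, 0.0) + 1.0
            for x, ex in e.items():          # online: apply increments immediately
                V[x] = V.get(x, 0.0) + alpha * delta * ex
    return V
```

For the offline variant, the increments α·δ·e would instead be summed in a separate table and added to V only when the terminal state is reached; the batch variant additionally repeats the whole training set until the estimates stop changing.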
There is also a third case in which updates are deferred until after an entire set of trials
have been presented. Usually this is done with a small fixed step size, at(s) = c~, and
with the training set (the set of trials) presented over and over again until convergence of
the value estimates. Although this "repeated presentations" training paradigm is rarely
used in practice, it can reveal telling theoretical properties of the algorithms. For example,
Sutton (1988) showed that TD(0) (TD(λ) with λ = 0) converges under these conditions
to a maximum likelihood estimate, arguably the best possible solution to this prediction
problem (see Section 2.3). In this paper, for convenience, we refer to the repeated
presentations training paradigm simply as batch updating. Later in this section we show
that the batch versions of conventional and replace-trace TD(1) methods are equivalent
to two Monte Carlo prediction methods.
The total reward following a particular visit to a state is called the return for that visit.
The value of a state is thus the expected return. This suggests that one might estimate a
state's value simply by averaging all the returns that follow it. This is what is classically
done in Monte Carlo (MC) prediction methods (Rubinstein, 1981; Curtiss, 1954; Wasow,
1952; Barto & Duff, 1994). We distinguish two specific algorithms:
Every-visit MC: Estimate the value of a state as the average of the returns that have
followed all visits to the state.
First-visit MC: Estimate the value of a state as the average of the returns that have
followed the first visits to the state, where a first visit is the first time during a trial that
the state is visited.
Note that both algorithms form their estimates based entirely on actual, complete re-
turns. This is in contrast to TD(λ), whose updates (1) are based in part on existing estimates. However, this is only in part, and, as λ → 1, TD(λ) methods come to more and more closely approximate MC methods (Sections 2.4 and 2.5). In particular, the conventional, accumulate-trace version of TD(λ) comes to approximate every-visit MC, whereas replace-trace TD(λ) comes to approximate first-visit MC. One of the main points of this paper is that we can better understand the difference between replace and accumulate versions of TD(λ) by understanding the difference between these two MC methods.
This naturally brings up the question that we focus on in Section 3: what are the relative
merits of first-visit and every-visit MC methods?
To help develop intuitions, first consider the very simple Markov chain shown in Fig-
ure 2a. On each step, the chain either stays in S with probability p, or goes on to
terminate in T with probability 1 - p . Suppose we wish to estimate the expected number
of steps before termination when starting in S. To put this in the form of estimating a
value function, we say that a reward of +1 is emitted on every step, in which case V(S)
is equal to the expected number of steps before termination. Suppose that the only data
that has been observed is a single trial generated by the Markov chain, and that that trial
lasted 4 steps, 3 passing from S to S, and one passing from S to T, as shown in Figure
2b. What do the two MC methods conclude from this one trial?
We assume that the methods do not know the structure of the chain. All they know
is the one experience shown in Figure 2b. The first-visit MC method in effect sees a
single traversal from the first time S was visited to T. That traversal lasted 4 steps, so
its estimate of V(S) is 4. Every-visit MC, on the other hand, in effect sees 4 separate
traversals from S to T, one with 4 steps, one with 3 steps, one with 2 steps, and one
with 1 step. Averaging over these four effective trials, every-visit MC estimates V(S)
as (1 + 2 + 3 + 4)/4 = 2.5. The replace and accumulate versions of TD(1) may or may not form exactly these estimates, depending on their α sequence, but they will move their estimates in these directions. In particular, if the corresponding offline TD(1) method starts the trial with these estimates, then it will leave them unchanged after experiencing the trial. The batch versions of the two TD(1) algorithms will compute exactly these estimates.
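The two estimates can be verified with a few lines of code. The sketch below (our own code and naming, not the paper's) implements the two Monte Carlo estimators defined above and applies them to the single trial of Figure 2b.

```python
def mc_estimates(trials, state):
    """Return (first_visit, every_visit) MC estimates of V(state) from a list of
    trials, each given as a state sequence and a parallel per-step reward sequence."""
    first_returns, every_returns = [], []
    for states, rewards in trials:
        # undiscounted return following time t = sum of rewards from t onward
        returns = [sum(rewards[t:]) for t in range(len(rewards))]
        visits = [t for t, s in enumerate(states) if s == state]
        if visits:
            first_returns.append(returns[visits[0]])
            every_returns.extend(returns[t] for t in visits)
    return (sum(first_returns) / len(first_returns),
            sum(every_returns) / len(every_returns))

# The single trial of Figure 2b: S -> S -> S -> S -> T, reward +1 per step.
states  = ['S', 'S', 'S', 'S']
rewards = [1, 1, 1, 1]
print(mc_estimates([(states, rewards)], 'S'))   # -> (4.0, 2.5)
```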
Which estimate is better, 4 or 2.5? Intuitively, the first answer appears better. The
only trial observed took 4 steps, so 4 seems like the best estimate of its expected value.
In any event, the answer 2.5 seems too low. In a sense, the whole point of this paper
is to present theoretical and empirical analyses in support of this intuition. We show
below that in fact the answer 4 is the only unbiased answer, and that 2.5, the answer of
every-visit MC and of conventional TD(1), is biased in a statistical sense.
It is instructive to compare these two estimates of the value function with the estimate
that is optimal in the maximum likelihood sense. Given some data, in this case a set
of observed trials, we can construct the maximum-likelihood model of the underlying
Markov process. In general, this is the model whose probability of generating the ob-
served data is the highest. Consider our simple example. After the one trial has been
observed, the maximum-likelihood estimate of the S-to-S transition probability is 3/4, the fraction of the actual transitions that went that way, and the maximum-likelihood estimate of the S-to-T transition probability is 1/4. No other transitions have been observed, so they are estimated as having probability 0. Thus, the maximum-likelihood model of the
Markov chain is as shown in Figure 2c.
Figure 2. A simple example of a Markov prediction problem: (a) the chain, (b) the observed trial S → S → S → S → T, and (c) the maximum-likelihood model. The objective is to predict the number of steps until termination.
We define the ML estimate of the value function as the value function that would be
exactly correct if the maximum-likelihood model of the Markov process were exactly
correct. That is, it is the estimate equal to the correct answer if the estimate of the
Markov chain was not really an estimate, but was known with certainty. 4 Note that the
ML estimate makes full use of all the observed data.
Let us compute the ML estimate for our simple example. If the maximum-likelihood
model of the chain, as shown in Figure 2c, were exactly correct, what then would be the
expected number of time steps before termination? For each possible number of steps,
k, we can compute the probability of its occurring, and then the expected number, as
$$V_{ML}(S) = \sum_{k=1}^{\infty} \Pr(k)\, k = \sum_{k=1}^{\infty} (0.75)^{k-1}\, (0.25)\, k = 4.$$
Thus, in this simple example the ML estimate is the same as the first-visit MC estimate.
In general, these two are not exactly the same, but they are closely related. We establish
the relationship in the general case in Section 3.2.
Computing the ML estimate is in general very computationally complex. If the number of states is n, then the maximum-likelihood model of the Markov chain requires O(n²) memory, and computing the ML estimate from it requires roughly O(n³) computational operations.⁵ The TD methods by contrast all use memory and computation per step that is only O(n). It is in part because of these computational considerations that learning solutions are of interest while the ML estimate remains an ideal generally unreachable in practice. However, we can still ask how closely the various learning methods approximate this ideal.
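As an illustration of the computation just described, the following sketch (assuming NumPy is available; all names are ours) forms the maximum-likelihood model from transition counts and average rewards and then solves the resulting linear system, the O(n³) step referred to above.

```python
import numpy as np

def ml_value_estimate(counts, avg_rewards):
    """Certainty-equivalence value estimate for an absorbing chain.
    counts[i][j]      : number of observed i -> j transitions (last column = terminal)
    avg_rewards[i][j] : average reward observed on i -> j transitions
    Solves (I - P) V = r_bar for the non-terminal states."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]                       # number of non-terminal states
    totals = counts.sum(axis=1, keepdims=True)
    P_full = counts / totals                  # ML transition probabilities, shape (n, n+1)
    r_bar = (P_full * np.asarray(avg_rewards, dtype=float)).sum(axis=1)
    P = P_full[:, :n]                         # transitions among non-terminal states
    return np.linalg.solve(np.eye(n) - P, r_bar)

# Figure 2c: one non-terminal state S; 3 of 4 transitions loop back, reward +1 each step.
print(ml_value_estimate([[3, 1]], [[1.0, 1.0]]))   # -> [4.]
```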
In this subsection we establish that the replace and accumulate forms of batch TD(1) are
equivalent, respectively, to first-visit and every-visit MC. The next subsection proves a
similar equivalence for the offline TD(1) algorithms.
The equivalence of the accumulate-trace version of batch TD(1) to every-visit MC
follows immediately from prior results. Batch TD(1) is a gradient-descent procedure
known to converge to the estimate with minimum mean squared error on the training set
(Sutton, 1988; Dayan, 1992; Barnard, 1993). In the case of discrete states, the minimum
MSE estimate for a state is the sample average of the returns from every visit to that
state in the training set, and thus it is the same as the estimate computed by every-visit
MC.
Showing the equivalence of replace-trace batch TD(1) and first-visit MC requires a
little more work.
THEOREM 1: For any training set of N trials and any fixed step size α_t(s) = α such that N(s)α < 1, batch replace TD(1) produces the same estimates as first-visit MC.
Proof: In considering updates to the estimates for any state s we need only consider trials in which s occurs. On trials in which s does not occur, the estimates of both methods are obviously unchanged. We index the trials in which state s occurs from 1 to N(s). Let $t_s(n)$ be the time at which state s is first visited in trial n, and let $t_T(n)$ be the time at which the terminal state is reached. Let $V_i^R(s)$ represent the replace TD(1) estimate of the value of state s after i passes through the training set, for i ≥ 1:

$$V_{i+1}^R(s) = V_i^R(s) + \alpha \sum_{n=1}^{N(s)} \sum_{t=t_s(n)}^{t_T(n)-1} \big[ r_{t+1} + V_i^R(s_{t+1}) - V_i^R(s_t) \big]
= (1 - N(s)\alpha)\, V_i^R(s) + \alpha \sum_{n=1}^{N(s)} R(t_s(n)),$$

where R(t) is the return following time t to the end of the trial. This in turn implies that

$$V_i^R(s) = (1 - N(s)\alpha)^i V_0^R(s) + \alpha \sum_{n=1}^{N(s)} R(t_s(n)) \left[ 1 + (1 - N(s)\alpha) + \cdots + (1 - N(s)\alpha)^{i-1} \right].$$

Therefore,

$$V_\infty^R(s) = \lim_{i\to\infty} V_i^R(s) = \alpha \sum_{n=1}^{N(s)} R(t_s(n)) \sum_{j=0}^{\infty} (1 - N(s)\alpha)^j
= \alpha \sum_{n=1}^{N(s)} R(t_s(n)) \, \frac{1}{1 - (1 - N(s)\alpha)} \quad \text{(because } N(s)\alpha < 1\text{)}
= \frac{\sum_{n=1}^{N(s)} R(t_s(n))}{N(s)},$$

which is the first-visit MC estimate. ∎
In this subsection we establish that the replace and accumulate forms of offline TD(1) can also be made equivalent to the corresponding MC methods by suitable choice of the step-size sequence α_t(s).

Therefore, the update for offline replace TD(1), after a complete trial, is

$$V_i^R(s) = V_{i-1}^R(s) + \alpha_i(s)\big[ R(t_s(i)) - V_{i-1}^R(s) \big],$$

which, with α_i(s) = 1/i, is just the iterative recursive equation for incrementally computing the average of the first-visit returns, {R(t_s(1)), R(t_s(2)), R(t_s(3)), ...}. ∎
Proof: Once again we consider only trials in which state s occurs. For this proof we need to use the time index of every visit to state s, complicating notation somewhat. Let $t_s(i; k)$ be the time index of the k-th visit to state s in trial i. Also, let $K_s(i)$ be the total number of visits to state s in trial i. The essential idea behind the proof is to again show that the offline TD(1) equation is an iterative recursive averaging equation, only this time of the returns from every visit to state s.

Let α_i(s) be the step-size parameter used in processing trial i. With an accumulating trace the trace for s equals j between the j-th and (j+1)-st visits, so the cumulative increment in the estimate of V(s) as a result of trial i is

$$\alpha_i(s)\left[ \sum_{t=t_s(i;1)}^{t_s(i;2)-1} \delta_t + 2\sum_{t=t_s(i;2)}^{t_s(i;3)-1} \delta_t + \cdots + K_s(i)\sum_{t=t_s(i;K_s(i))}^{t_T(i)-1} \delta_t \right]
= \alpha_i(s)\left[ \sum_{j=1}^{K_s(i)} R(t_s(i;j)) - K_s(i)\, V_{i-1}^A(s) \right],$$

where $\delta_t = r_{t+1} + V_{i-1}^A(s_{t+1}) - V_{i-1}^A(s_t)$, and therefore

$$V_i^A(s) = V_{i-1}^A(s) + \alpha_i(s)\left[ \sum_{j=1}^{K_s(i)} R(t_s(i;j)) - K_s(i)\, V_{i-1}^A(s) \right].$$

Because $\alpha_i(s) = 1/\sum_{j=1}^{i} K_s(j)$, this will compute the sample average of all the actual returns from every visit to state s up to and including trial i. ∎
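The two offline equivalences can be checked numerically. The sketch below (our own code and naming) runs offline TD(1) with undecayed traces and the step-size schedules used in the proofs, α_i(s) = 1/i for replacing traces and α_i(s) = 1/(cumulative number of visits to s) for accumulating traces, and reproduces the first-visit and every-visit MC estimates on the trial of Figure 2b.

```python
def offline_td1(trials, replacing=True):
    """Offline TD(1) (gamma = 1, traces not decayed) with the step-size schedules
    from the equivalence proofs above."""
    V, trial_count, visit_count = {}, {}, {}
    for states, rewards in trials:
        e, delta = {}, {}
        for t, (s, r) in enumerate(zip(states, rewards)):
            s_next = states[t + 1] if t + 1 < len(states) else None
            td_err = r + V.get(s_next, 0.0) - V.get(s, 0.0)
            e[s] = 1.0 if replacing else e.get(s, 0.0) + 1.0
            for x, ex in e.items():
                delta[x] = delta.get(x, 0.0) + td_err * ex
        for s in set(states):                         # bookkeeping for the step sizes
            trial_count[s] = trial_count.get(s, 0) + 1
            visit_count[s] = visit_count.get(s, 0) + states.count(s)
        for x, d in delta.items():                    # apply at the end of the trial
            alpha = 1.0 / (trial_count[x] if replacing else visit_count[x])
            V[x] = V.get(x, 0.0) + alpha * d
    return V

trial = (['S', 'S', 'S', 'S'], [1, 1, 1, 1])
print(offline_td1([trial], replacing=True))    # {'S': 4.0}  (first-visit MC)
print(offline_td1([trial], replacing=False))   # {'S': 2.5}  (every-visit MC)
```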
THEOREM 4: Offline (online) replace TD(λ) converges to the desired value function w.p.1 under the conditions for w.p.1 convergence of offline (online) conventional TD(λ) stated by Jaakkola, Jordan and Singh (1994).

Proof: Jaakkola, Jordan and Singh (1994) proved that online and offline TD(λ) converge w.p.1 to the correct predictions, under natural conditions, as the number of trials goes to infinity (or as the number of time steps goes to infinity in the online, non-absorbing case, with γ < 1). Their proof is based on showing that the offline TD(λ) estimator is a contraction mapping in expected value. They show that it is a weighted sum of n-step corrected truncated returns,

$$(1-\lambda)\sum_{n=1}^{\tau}\lambda^{n-1} V_t^{(n)}(s_t) \;+\; \left[\sum_{n=\tau+1}^{\infty}(1-\lambda)\lambda^{n-1}\right] V_t^{(\tau)}(s_t),$$

where, for the accumulating trace, τ is the number of time steps until termination, whereas, for the replacing trace, τ is the number of time steps until the next revisit to state s_t. Although the weighted sum is slightly different in the replace case, it is still a contraction mapping in expected value and meets all the conditions of Jaakkola et al.'s proofs of convergence for online and offline updating. ∎
In the simple example in Figure 2, the first-visit MC estimate is the same as the ML
estimate. However, this is true in general only for the starting state, assuming all trials
start in the same state. One way of thinking about this is to consider for any state s just
the subset of the training trials that include s. For each of these trials, discard the early
part of the trial before s was visited for the first time. Consider the remaining "tails" of
the trials as a new training set. This reduced training set is really all the MC methods
ever see in forming their estimates for s. We refer to the ML estimate of V(s) based
on this reduced training set as the reduced-ML estimate. In this subsection we show that
the reduced ML estimate is equivalent in general to the first-visit MC estimate.
THEOREM 5: For any undiscounted absorbing Markov chain, the estimates computed
by first-visit MC are the reduced-ML estimates, for all states and after all trials.
Proof: The first-visit MC estimate is the average of the returns from first visits to state
s. Because the maximum-likelihood model is built from the partial experience rooted
in state s, the sum over all t of the probability of making a particular transition at time
step t according to the maximum-likelihood model is equal to the ratio of the number of
times that transition was actually made to the number of trials. Therefore, the reduced-
ML estimate for state s is equal to the first-visit MC estimate. See Appendix A. 1 for a
complete proof. •
Figure 3. Abstracted Markov chain. At the top is a typical sequence of states comprising a training trial. The sequence can be divided into contiguous subsequences at the visits to start state s. For our analyses, the precise sequence of states and rewards in between revisits to s does not matter. Therefore, in considering the value of s, arbitrary undiscounted Markov chains can be abstracted to the two-state chain shown in the lower part of the figure.
Therefore, for the purpose of analysis, arbitrary undiscounted Markov chains can be reduced to the two-state abstract chain shown in the lower part of Figure 3. The associated probabilities and rewards require careful elaboration. Let $P_T$ and $P_s$ denote the probabilities of terminating and looping respectively in the abstracted chain. Let $r_s$ and $r_T$ represent the random rewards associated with an s → s transition and an s → T transition in Figure 3. We use the quantities $R_s = E\{r_s\}$, $\mathrm{Var}(r_s) = E\{(r_s - R_s)^2\}$, $R_T = E\{r_T\}$, and $\mathrm{Var}(r_T) = E\{(r_T - R_T)^2\}$ in the following analysis. Precise definitions of these quantities are given in Appendix A.2.
first-visit MC:
Let {x} stand for the paired random sequence ({s}, {r}). The first-visit MC estimate for V(s) after one trial, {x}, is

$$f(\{x\}) = \sum_{i=1}^{k} r_{s_i} + r_T,$$

where k is the random number of revisits to state s, $r_{s_i}$ is the sum of the individual rewards in the sequence $\{r\}_i$, and $r_T$ is the random total reward received after the last visit to state s. For all i, $E\{r_{s_i}\} = R_s$. The first-visit MC estimate of V(s) after n trials, $\{x\}^1, \{x\}^2, \ldots, \{x\}^n$, is

$$V_n^F(s) = f(\{x\}^1, \{x\}^2, \ldots, \{x\}^n) = \frac{\sum_{i=1}^{n} f(\{x\}^i)}{n}. \qquad (2)$$

In words, $V_n^F(s)$ is simply the average of the estimates from the n sample trajectories, $\{x\}^1, \{x\}^2, \ldots, \{x\}^n$, all of which are independent of each other because of the Markov property.
every-visit MC:
The every-visit MC estimate for one trial, {x}, is

$$t(\{x\}) = \frac{\sum_{i=1}^{k} i\, r_{s_i} + (k+1)\, r_T}{k+1},$$

where k is the random number of revisits to state s in the sequence {x}. Every visit to state s effectively starts another trial. Therefore, the rewards that occur in between the i-th and (i+1)-st visits to state s are included i times in the estimate.

The every-visit MC estimate after n trials, $\{x\}^1, \{x\}^2, \ldots, \{x\}^n$, is

$$V_n^E(s) = \frac{\sum_{i=1}^{n}\left[\sum_{j=1}^{k_i} j\, r^i_{s_j} + (k_i+1)\, r^i_T\right]}{\sum_{i=1}^{n}(k_i + 1)}, \qquad (3)$$

where $k_i$ is the number of revisits to s in the i-th trial $\{x\}^i$. Unlike the first-visit MC estimator, the every-visit MC estimator for n trials is not simply the average of the estimates for individual trials, making its analysis more complex.
We derive the bias (Bias) and variance (Var) of first-visit MC and every-visit MC as a function of the number of trials, n. The mean squared error (MSE) is Bias² + Var. First consider the true value of state s in Figure 3. From Bellman's equation (Bellman, 1957):

$$V(s) = P_s\big(R_s + V(s)\big) + P_T R_T,$$

or

$$P_T\, V(s) = P_s R_s + P_T R_T,$$

and therefore

$$V(s) = \frac{P_s}{P_T} R_s + R_T.$$
Proof: The first-visit MC estimate is unbiased because the total reward on a sample path
from the start state s to the terminal state T is by definition an unbiased estimate of the
expected total reward across all such paths. Therefore, the average of the estimates from
n independent sample paths is also unbiased. See Appendix A.3 for a detailed proof.
The every-visit MC estimate, in contrast, is biased:

$$\mathrm{Bias}^E_n(s) = V(s) - E\{V_n^E(s)\} = \frac{1}{n+1}\,\mathrm{Bias}^E_1(s) = \frac{1}{n+1}\left[\frac{P_s}{2 P_T}\right] R_s.$$
One way of understanding the bias in the every-visit MC estimate is to note that this
method averages many returns for each trial. Returns from the same trial share many of
the same rewards and are thus not independent. The bias becomes smaller as more trials
are observed because the returns from different trials are independent. Another way of
understanding the bias is to note that the every-visit MC estimate (3) is the ratio of two
random variables. In general, the expected value of such a ratio is not the ratio of the
expected values of the numerator and denominator.
The variance of the first-visit MC estimate after n trials is

$$\mathrm{Var}^F_n(s) = \frac{1}{n}\left[\frac{P_s}{P_T}\mathrm{Var}(r_s) + \mathrm{Var}(r_T) + \frac{P_s}{P_T^2}R_s^2\right].$$

Because the first-visit MC estimate is the sample average of estimates derived from independent trials, the variance goes down as 1/n. The first two terms in the variance are
due to the variance of the rewards, and the third term is the variance due to the random
number of revisits to state s in each trial.
We were able to obtain only upper and lower bounds on the variance of every-visit MC. For a single trial, every-visit MC produces an estimate that is closer to zero than the estimate produced by first-visit MC; therefore $\mathrm{Var}^E_1(s) \le \mathrm{Var}^F_1(s)$. This effect was seen
in the simple example of Figure 2, in which the every-visit MC estimator underestimated
the expected number of revisits.
Of course, a low variance is not of itself a virtue. For example, an estimator that
returns a constant independent of the data has zero variance, but is not a good estimator.
Of greater importance is to be low in mean squared error (MSE):
THEOREM 10: There exists an N < ∞, such that for all n > N, $\mathrm{Var}^E_n(s) > \mathrm{Var}^F_n(s)$.

Proof: The basic idea of the proof is that the O(1/n) component of $\mathrm{Var}^E_n$ is larger than that of $\mathrm{Var}^F_n$. The other components of $\mathrm{Var}^E_n$ fall off much more rapidly than the O(1/n) component, and can be ignored for large enough n. See Appendix A.7 for a complete proof. ∎
COROLLARY 10a: There exists an N < ∞, such that, for all n > N, $\mathrm{MSE}^E_n(s) > \mathrm{MSE}^F_n(s)$.
Figure 4 shows an empirical example of this crossover of MSE. These data are for the
two MC methods applied to an instance of the example task of Figure 2a. In this case
crossover occurred at trial N = 5. In general, crossover can occur as early as the first trial. For example, if the only non-zero reward in a problem is on termination, then $R_s = 0$ and $\mathrm{Var}(r_s) = 0$, which in turn implies that $\mathrm{Bias}^E_n = 0$ for all n, and that $\mathrm{Var}^E_n(s) = \mathrm{Var}^F_n(s)$, so that $\mathrm{MSE}^E_n(s) = \mathrm{MSE}^F_n(s)$.
Figure 4. Empirical demonstration of crossover of MSE on the example task shown in Figure 2a (axes: average root MSE vs. trials). The S-to-S transition probability was p = 0.6. These data are averages over 10,000 runs.
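A sketch of the simulation behind Figure 4, under our reading of the setup: the chain of Figure 2a with p = 0.6 and a +1 reward per step, so the true value is 1/(1 − p) = 2.5. The function returns MSE curves (take square roots for the root MSE plotted in the figure); all names are ours.

```python
import random

def simulate_trial(p):
    """One trial of the Figure 2a chain: count the steps (each step rewards +1)."""
    steps = 1
    while random.random() < p:
        steps += 1
    return steps

def mse_curves(p=0.6, n_trials=20, n_runs=10000):
    """Mean squared error of the two MC estimators of V(S) as a function of the
    number of trials, averaged over independent runs."""
    true_v = 1.0 / (1.0 - p)                      # expected number of steps
    sq_err_first = [0.0] * n_trials
    sq_err_every = [0.0] * n_trials
    for _ in range(n_runs):
        first_returns, every_returns = [], []
        for n in range(n_trials):
            steps = simulate_trial(p)
            first_returns.append(steps)           # return from the first visit
            every_returns.extend(range(steps, 0, -1))  # returns from every visit
            v_first = sum(first_returns) / len(first_returns)
            v_every = sum(every_returns) / len(every_returns)
            sq_err_first[n] += (v_first - true_v) ** 2
            sq_err_every[n] += (v_every - true_v) ** 2
    return ([e / n_runs for e in sq_err_first],
            [e / n_runs for e in sq_err_every])
```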
3.6. Summary
Table 1 summarizes the results of this section comparing first-visit and every-visit MC
methods. Some of the results are unambiguously in favor of the first-visit method over
the every-visit method: only the first-visit estimate is unbiased and related to the ML es-
timate. On the other hand, the MSE results can be viewed as mixed. Initially, every-visit
MC has lower MSE, but later it is always overtaken by first-visit MC. The implications
of this are unclear. To some it might suggest that we should seek a combination of the
two estimators that is always of lowest MSE. However, that might be a mistake. We
suspect that the first-visit estimate is always the more useful one, even when it is worse
in terms of MSE. Our other theoretical results are consistent with this view, but it remains
a speculation and a topic for future research.
4. Random-Walk Experiment
Figure 5. The random-walk process. Starting in State 11, steps are taken left or right with equal probability
until either State 1 or State 21 is entered, terminating the trial and generating a final non-zero reward.
Figure 6. Performance of replace and accumulate TD(λ) on the random-walk task, for various values of λ and α. The performance measure was the RMSE per state per trial over the first 10 trials. These data are averages over 1000 runs.
Figure 7. Summary of results on the random-walk task: average RMSE at the best value of α for each λ, for the replace and accumulate versions of TD(λ).
5. Mountain-Car Experiment
Figure 8. The Mountain-Car task. The force of gravity is stronger than the motor.
2. Start of Trial:    s := random-state();
                       F := features(s);
                       a := greedy-policy(F).

4. Environment Step:   Take action a; observe resultant reward, r, and next state s'.

5. Choose Next Action: F' := features(s'), unless s' is the terminal state, then F' := 0;
                       a' := greedy-policy(F').

Figure 9. The Sarsa algorithm used on the Mountain-Car task. The function greedy-policy(F) computes Σ_{f∈F} w_a(f) for each action a and returns the action for which the sum is largest, resolving any ties randomly. The function features(s) returns the set of CMAC tiles corresponding to the state s. Programming optimizations can reduce the expense per iteration to a small multiple (dependent on λ) of the number of features, m, present on a typical time step. Here m is 5.
value Q^π(s, a) gives, for any state, s, and action, a, the expected return for starting from state s, taking action a, and thereafter following policy π. In the case of the mountain-car task the return is simply the sum of the future reward, i.e., the negative of the number of time steps until the goal is reached. Most of the details of the Sarsa algorithm we used are given in Figure 9. The name "Sarsa" comes from the quintuple of actual events involved in the update: (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}). This algorithm is closely related to Q-learning (Watkins, 1989) and to various simplified forms of the bucket brigade (Holland, 1986; Wilson, to appear). It is also identical to the TD(λ) algorithm applied to state-action pairs rather than to states.⁶
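The per-step update can be sketched as follows in Python for a linear, binary-feature representation. The handling of the traces for the non-selected actions follows the description in Section 1 (reset to zero); the function and parameter names are ours, not the paper's.

```python
import random
from collections import defaultdict

def greedy_action(w, features, actions):
    """Pick the action with the largest sum of weights over the active features."""
    values = {a: sum(w[(a, f)] for f in features) for a in actions}
    best = max(values.values())
    return random.choice([a for a, v in values.items() if v == best])

def sarsa_step(w, e, feats, a, r, feats_next, a_next, actions,
               alpha=0.1, lam=0.9, gamma=1.0):
    """One Sarsa(lambda) update with replacing traces over binary features.
    w and e are defaultdict(float) keyed by (action, feature)."""
    q_sa = sum(w[(a, f)] for f in feats)
    q_next = sum(w[(a_next, f)] for f in feats_next)   # empty feats_next => 0 (terminal)
    delta = r + gamma * q_next - q_sa
    for key in list(e):                                 # decay all traces
        e[key] *= gamma * lam
    for f in feats:                                     # replace traces for the taken action,
        e[(a, f)] = 1.0                                 # zero them for the other actions
        for b in actions:
            if b != a:
                e[(b, f)] = 0.0
    for key, trace in e.items():                        # update all eligible weights
        w[key] += alpha * delta * trace
    return delta

# Example containers:
# w, e = defaultdict(float), defaultdict(float)
```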
The mountain-car task has a continuous two-dimensional state space with an infinite
number of states. To apply reinforcement learning requires some form of function ap-
proximator. We used a set of three CMACs (Albus, 1981; Miller, Glanz, & Kraft, 1990),
one for each action. These are simple function approximators that use repeated overlapping
tilings of the state space to produce a feature representation for a final linear mapping.
In this case we divided the two state variables, the position and velocity of the car, each
into eight evenly spaced intervals, thereby partitioning the state space into 64 regions,
or boxes. A ninth row and column were added so that the tiling could be offset by a
random fraction of an interval without leaving any states uncovered. We repeated this
five times, each with a different, randomly selected offset. For example, Figure 10 shows
two tilings superimposed on the 2D state space. The result was a total of 9 x 9 x 5 = 405
boxes. The state at any particular time was represented by the five boxes, one per tiling,
within which the state resided. We think of the state representation as a feature vector
with 405 features, exactly 5 of which are present (non-zero) at any point in time. The
approximate action-value function is linear in this feature representation. Note that this
representation of the state causes the problem to no longer be Markov: many different
nearby states produce exactly the same feature representation.
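Here is a sketch of the tiling computation just described (8 intervals per dimension, a ninth row and column, five randomly offset tilings). The offset and indexing conventions are illustrative assumptions of ours; the paper does not specify them beyond the description above.

```python
import random

def make_tilings(n_tilings=5, seed=0):
    """Random offsets, as fractions of a tile width, for each tiling and dimension."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n_tilings)]

def active_features(pos, vel, tilings, n_bins=8,
                    pos_range=(-1.2, 0.5), vel_range=(-0.07, 0.07)):
    """Return one active feature index per tiling (9 x 9 boxes each, 405 in total)."""
    feats = []
    pos_width = (pos_range[1] - pos_range[0]) / n_bins
    vel_width = (vel_range[1] - vel_range[0]) / n_bins
    for k, (off_p, off_v) in enumerate(tilings):
        # shifting by a random fraction of a tile; the extra 9th row and column
        # guarantee that the shifted grid still covers the whole state space
        i = int((pos - pos_range[0]) / pos_width + off_p)
        j = int((vel - vel_range[0]) / vel_width + off_v)
        feats.append(k * 81 + i * 9 + j)     # 81 = 9 * 9 boxes per tiling
    return feats
```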
Figure 10. Two 9 × 9 CMAC tilings offset and overlaid over the continuous, two-dimensional state space of the Mountain-Car task (axes: car position and car velocity). Any state is in exactly one tile/box/feature of each tiling. The experiments used 5 tilings, each offset by a random fraction of a tile width.
Figure 11. Results on the Mountain-Car task for each value of λ and α (vertical axis: steps per trial averaged over the first 20 trials and 30 runs). Each data point is the average duration of the first 20 trials of a run, averaged over 30 runs. The standard errors are omitted to simplify the graph; they ranged from about 10 to about 50.
Figure 12. Summary of results on the Mountain-Car task. For each value of λ we show its performance at its best value of α. The error bars indicate one standard error.
These results are all consistent with those presented for the random-walk task in the previous section. On the mountain-car task, accumulating traces at best improved only slightly over no traces (λ = 0) and at worst dramatically degraded performance. Replacing traces, on the other hand, significantly improved performance at all except the very longest trace lengths (λ > .99). Traces that do not decay (λ = 1) resulted in significantly worse performance than all other λ values tried, including no traces at all (λ = 0).
Much more empirical experience is needed with trace mechanisms before a definitive
conclusion can be drawn about their relative effectiveness, particularly when function
approximators are used. However, these experiments do provide significant evidence for
two key points: 1) that replace-trace methods can perform much better than conventional,
accumulate-trace methods, particularly at long trace lengths, and 2) that although long
traces may help substantially, best performance is obtained when the traces are not
infinite, that is, when intermediate predictions are used as targets rather than actual
sample returns.
6. Conclusions
We have presented a variety of analytical and empirical evidence supporting the idea
that replacing eligibility traces permit more efficient use of experience in reinforcement
learning and long-term prediction.
Our analytical results concerned a special case closely related to that used in classical
studies of Monte Carlo methods. We showed that methods using conventional traces are
biased, whereas replace-trace methods are unbiased. While the conclusions of our mean-
squared-error analysis are mixed, the maximum likelihood analysis is clearly in favor
of replacing traces. As a whole, these analytic results strongly support the conclusion
that replace-trace methods make better inferences from limited data than conventional
accumulate-trace methods.
On the other hand, these analytic results concern only a special case quite different
from those encountered in practice. It would be desirable to extend our analyses to
the case of λ < 1 and to permit other step-size schedules. Analysis of cases involving
function approximators and violations of the Markov assumption would also be useful
further steps.
Our empirical results treated a much more realistic case, including in some cases all of
the extensions listed above. These results showed consistent, significant, and sometimes
large advantages of replace-trace methods over accumulate-trace methods, and of trace
methods generally over trace-less methods. The mountain-car experiment showed that the
replace-trace idea can be successfully used in conjunction with a feature-based function
approximator. Although it is not yet clear how to extend the replace-trace idea to other
kinds of function approximators, such as back-propagation networks or nearest-neighbor
methods, Sutton and Whitehead (1993) and others have argued that feature-based function
approximators are actually preferable for online reinforcement learning.
Our empirical results showed a sharp drop in performance as the trace parameter λ approached 1, corresponding to very long traces. This drop was much less severe with replacing traces but was still clearly present. This bears on the long-standing question of the relative merits of TD(1) methods versus true temporal-difference (λ < 1) methods. It might appear that replacing traces make TD(1) methods more capable
competitors; the replace TD(1) method is unbiased in the special case, and more efficient
than conventional TD(1) in both theory and practice. However, this is at the cost of losing
some of the theoretical advantages of conventional TD(1). In particular, conventional
TD(1) converges in many cases to a minimal mean-squared-error solution when function
approximators are used (Dayan, 1992) and has been shown to be useful in non-Markov
problems (Jaakkola, Singh & Jordan, 1995). The replace version of TD(1) does not share
these theoretical guarantees. Like λ < 1 methods, it appears to achieve greater efficiency in part by relying on the Markov property. In practice, however, the relative merits of different λ = 1 methods may not be of great significance. All of our empirical results suggest far better performance is obtained with λ < 1, even when function approximators
are used that create an apparently non-Markov task.
Replacing traces are a simple modification of existing discrete-state or feature-based
reinforcement learning algorithms. In cases in which a good state representation can be
obtained they appear to offer significant improvements in learning speed and reliability.
Acknowledgments
We thank Peter Dayan for pointing out and helping to correct errors in the proofs.
His patient and thorough reading of the paper and his participation in our attempts to
complete the proofs were invaluable. Of course, any remaining errors are our own. We
thank Tommi Jaakkola for providing the central idea behind the proof of Theorem 10, an
anonymous reviewer for pointing out an error in our original proof of Theorem 10, and
Lisa White for pointing out another error. Satinder P. Singh was supported by grants to Michael I. Jordan (Brain and Cognitive Sciences, MIT) from ATR Human Information Processing Research and from Siemens Corporation.
Appendix A
Proofs of Analytical Results
In considering the estimate V_t(s), we can assume that all trials start in s, because both first-visit MC and reduced-ML methods ignore transitions prior to the first visit to s. Let $n_i$ be the number of times state i has been visited, and let $n_{ij}$ be the number of times transition i → j has been encountered. Let $\bar R_{jk}$ be the average of the rewards seen on the j → k transitions.

Then $V_N^F(s)$, the first-visit MC estimate after N trials with start state s, is

$$V_N^F(s) = \frac{1}{N} \sum_{j \in S,\, k \in S} n_{jk}\, \bar R_{jk}.$$

This is identical to (2) because $\sum_{j,k} n_{jk} \bar R_{jk}$ is the total summed reward seen during the N trials. Because $N = n_s - \sum_i n_{is}$, we can rewrite this as

$$V_N^F(s) = \sum_{j,k} u_{jk}\, \bar R_{jk}, \qquad \text{where } u_{jk} = \frac{n_{jk}}{n_s - \sum_i n_{is}}. \qquad (A.1)$$
The maximum-likelihood model of the Markov process after N trials has transition probabilities $\hat P(ij) = n_{ij}/n_i$ and expected rewards $\hat R(ij) = \bar R_{ij}$. Let $V_N^{ML}(s)$ denote the reduced-ML estimate after N trials. By definition $V_N^{ML}(s) = E_N\{r_1 + r_2 + r_3 + r_4 + \cdots\}$, where $E_N$ is the expectation operator for the maximum-likelihood model after N trials, and $r_i$ is the payoff at step i. Therefore

$$V_N^{ML}(s) = \sum_i \sum_{j,k} \mathrm{Prob}_i(j \to k)\, \bar R_{jk} = \sum_{j,k} \bar R_{jk}\, U_{jk}, \qquad (A.2)$$

where $\mathrm{Prob}_i(j \to k)$ is the probability of a j-to-k transition at the i-th step according to the maximum-likelihood model, and $U_{jk} = \sum_i \mathrm{Prob}_i(j \to k)$. We now show that for all j, k, the $U_{jk}$ of (A.2) is equal to the $u_{jk}$ of (A.1).
Consider two special cases of j in (A.2).

Case 1, j = s:

$$U_{sk} = \frac{n_{sk}}{n_s}\left[1 + p^1(ss) + p^2(ss) + \cdots\right] = \frac{n_{sk}}{n_s}\, N_{ss}, \qquad (A.3)$$

where $p^i(sj)$ denotes the probability of going from s to j in i steps under the maximum-likelihood model, and $N_{ss}$ is the expected number of visits to state s.

Case 2, j ≠ s:

$$U_{jk} = \frac{n_{jk}}{n_j}\left[p^1(sj) + p^2(sj) + p^3(sj) + \cdots\right] = \frac{n_{jk}}{n_j}\, N_{sj}, \qquad (A.4)$$

where $N_{sj}$ is the expected number of visits to state j.

For all j, the $N_{sj}$ satisfy the recursions

$$N_{sj} = \sum_m N_{sm}\frac{n_{mj}}{n_m} \;\;(j \ne s), \qquad N_{ss} = 1 + \sum_m N_{sm}\frac{n_{ms}}{n_m},$$

which are solved by

$$N_{sj} = \frac{n_j}{n_s - \sum_i n_{is}}, \qquad \text{and in particular } N_{ss} = \frac{n_s}{n_s - \sum_i n_{is}}.$$

Plugging the above values of $N_{ss}$ and $N_{sj}$ into (A.3) and (A.4), we obtain

$$U_{jk} = \frac{n_{jk}}{n_s - \sum_i n_{is}} = u_{jk}. \qquad \blacksquare$$
The proofs for Theorems 6-10 assume the abstract chain of Figure 3 with just two states, s and T. The quantities $R_s = E\{r_s\}$, $\mathrm{Var}(r_s) = E\{(r_s - R_s)^2\}$, $R_T = E\{r_T\}$, and $\mathrm{Var}(r_T) = E\{(r_T - R_T)^2\}$ are of interest for the analysis and require careful elaboration. Let $S_s$ be the set of all state sequences that can occur between visits to state s (including state s at the head), and let $S_T$ be the set of all state sequences that can occur on the final run from s to T (including state s). The termination probability is $P_T = \sum_{\{s\}_T \in S_T} P(\{s\}_T)$, where the probability of a sequence of states is the product of the probabilities of the individual state transitions. By definition $P_s = 1 - P_T$. The reward probabilities are defined as follows: $\mathrm{Prob}\{r_s = q\} = \sum_{\{s\} \in S_s} P(\{s\})\, P(r_{\{s\}} = q \mid \{s\})$, and $\mathrm{Prob}\{r_T = q\} = \sum_{\{s\}_T \in S_T} P(\{s\}_T)\, P(r_T = q \mid \{s\}_T)$. Therefore, $R_s = \sum_q q\, \mathrm{Prob}\{r_s = q\}$ and $R_T = \sum_q q\, \mathrm{Prob}\{r_T = q\}$. Similarly, $\mathrm{Var}(r_s) = \sum_q \mathrm{Prob}\{r_s = q\}(q - R_s)^2$, and $\mathrm{Var}(r_T) = \sum_q \mathrm{Prob}\{r_T = q\}(q - R_T)^2$.

If the rewards in the original Markov chain are deterministic functions of the state transitions, then there will be a single $r_i$ associated with each $\{s\}_i$. If the rewards in the original problem are stochastic, however, then there is a set of possible random $r_i$'s associated with each $\{s\}_i$. Also note that even if all the individual rewards in the original Markov chain are deterministic, $\mathrm{Var}(r_s)$ and $\mathrm{Var}(r_T)$ can still be greater than zero, because $r_s$ and $r_T$ will be stochastic because of the many different paths from s to s and from s to T.
The following fact is used throughout:

$$E[f(\{x\})] = \sum_{\{x\}} P(\{x\})\, f(\{x\}) = \sum_k P(k)\, E_{\{r\}}\{f(\{x\}) \mid k\}, \qquad (A.6)$$

where k is the number of revisits to state s. We also use the facts that, if r < 1, then

$$\sum_{i=0}^{\infty} i\, r^i = \frac{r}{(1-r)^2} \quad \text{and} \quad \sum_{i=0}^{\infty} i^2 r^i = \frac{r(1+r)}{(1-r)^3}.$$
First we show that first-visit MC is unbiased for one trial. From (A.6),

$$E\{V_1^F(s)\} = E_{\{x\}}[f(\{x\})] = \sum_k P(k)\, E_{\{r\}}\{f(\{x\}) \mid k\}
= \sum_k P_T P_s^k\, (k R_s + R_T)
= \frac{P_s}{P_T} R_s + R_T
= V(s).$$

Because the estimate after n trials, $V_n^F(s)$, is the sample average of n independent estimates each of which is unbiased, the n-trial estimate itself is unbiased. ∎
For every-visit MC after one trial,

$$E\{V_1^E(s)\} = E_{\{x\}}[t(\{x\})] = \sum_k P(k)\, E_{\{r\}}\{t(\{x\}) \mid k\}
= \sum_k P_T P_s^k\, \frac{R_s + 2R_s + \cdots + k R_s + (k+1) R_T}{k+1}
= \frac{P_s}{2 P_T} R_s + R_T.$$
To compute the expected value of the every-visit MC estimate after n trials, write

$$E\{V_n^E(s)\} = E_{\{x\}}\left\{ \frac{\sum_{i=1}^{n} t_{num}(\{x\}^i)}{\sum_{i=1}^{n} (k_i + 1)} \right\},$$

where $t_{num}(\{x\}^i)$ is the numerator of the single-trial estimate (3) for trial i. The expectation is evaluated by conditioning first on the individual revisit counts $k_1, k_2, \ldots, k_n$ and then on their total $k = \sum_i k_i$, using a combinatorial identity (A.7) for the conditional moments of the $k_i$ given k. The result is

$$E\{V_n^E(s)\} = R_T + \frac{2n+1}{2n+2}\cdot\frac{P_s}{P_T} R_s,$$

so that, using $V(s) = (P_s/P_T)R_s + R_T$, the bias is $\mathrm{Bias}^E_n(s) = V(s) - E\{V_n^E(s)\} = \frac{1}{n+1}\frac{P_s}{2P_T}R_s$, as claimed in Section 3. ∎
A similar calculation gives the variance of the first-visit MC estimate. For a single trial, from (A.6),

$$\mathrm{Var}^F_1(s) = E\{(f(\{x\}))^2\} - V(s)^2 = \frac{P_s}{P_T^2} R_s^2 + \left[\frac{P_s}{P_T}\mathrm{Var}(r_s) + \mathrm{Var}(r_T)\right].$$

The first term is the variance due to the variance in the number of revisits to state s, and the second term is the variance due to the random rewards. The first-visit MC estimate after n trials is the sample average of the n independent trials; therefore,

$$\mathrm{Var}^F_n(s) = \frac{\mathrm{Var}^F_1(s)}{n}. \qquad \blacksquare$$
A lengthier calculation of the same kind, expanding $E\{(t_{num}(\{x\}))^2\}$ and collecting the cross terms between visits, yields the upper and lower bounds on $\mathrm{Var}^E_n(s)$ referred to in Section 3.
The central idea for this proof was provided to us by Jaakkola (personal communication). The every-visit MC estimate,

$$V_n^E(s) = \frac{\sum_{i=1}^{n} t_{num}(\{x\}^i)}{\sum_{i=1}^{n} (k_i + 1)},$$

can be rewritten as

$$V_n^E(s) = \frac{\frac{1}{n}\sum_{i=1}^{n} P_T\, t_{num}(\{x\}^i)}{\frac{1}{n}\sum_{i=1}^{n} P_T (k_i + 1)},$$

because, for all i, $E\{k_i + 1\} = 1/P_T$. It is also easy to show that, for all i, $E\{P_T\, t_{num}(\{x\}^i)\} = V(s)$ and $E\{P_T(k_i + 1)\} = 1$.

Consider the sequence of functions

$$f_n(\delta) = \frac{V(s) + \delta\, \bar T_n}{1 + \delta\, \bar K_n},$$

where $\bar T_n = \frac{1}{\sqrt n}\sum_{i=1}^{n}\big(P_T\, t_{num}(\{x\}^i) - V(s)\big)$ and $\bar K_n = \frac{1}{\sqrt n}\sum_{i=1}^{n}\big(P_T(k_i + 1) - 1\big)$. Note that, for all n, $E\{\bar T_n\} = 0$, $E\{\bar K_n\} = 0$, and $f_n(1/\sqrt n) = V_n^E(s)$. Therefore, $\mathrm{Var}^E_n(s) = E\{(f_n(1/\sqrt n))^2\} - (E\{V_n^E(s)\})^2$. Using Taylor's expansion,
expanding $(f_n(\delta))^2$ around δ = 0 and evaluating at $\delta = 1/\sqrt n$ expresses $E\{(V_n^E(s))^2\}$ in powers of $1/\sqrt n$. The terms of the expansion beyond the second order have expectations that fall off faster than 1/n and can be ignored for large n. The second-order term is governed by

$$\frac{\partial^2}{\partial \delta^2}\big(f_n(\delta)\big)^2\Big|_{\delta=0} = 2\bar T_n^2 - 8V(s)\,\bar T_n \bar K_n + 6V^2(s)\,\bar K_n^2,$$

together with the moments $E\{\bar K_n^2\} = P_s$ and

$$E\{\bar K_n \bar T_n\} = \frac{2P_s + P_s^2}{P_T} R_s + R_T(P_s + 1) - V(s).$$

Collecting these expectations (and the corresponding expression for $E\{\bar T_n^2\}$) shows that the O(1/n) component of $\mathrm{Var}^E_n(s)$ is strictly larger than $\mathrm{Var}^F_n(s) = \mathrm{Var}^F_1(s)/n$ for large enough n, which establishes Theorem 10. ∎
Appendix B
Details of the Mountain-Car Task
The mountain-car task (Figure 8) has two continuous state variables, the position of the car, $p_t$, and the velocity of the car, $v_t$. At the start of each trial, the initial state is chosen randomly, uniformly from the allowed ranges: −1.2 ≤ p ≤ 0.5, −0.07 ≤ v ≤ 0.07. The mountain geography is described by altitude = sin(3p). The action, $a_t$, takes on values in {+1, 0, −1} corresponding to forward thrust, no thrust, and reverse thrust. The state evolution was according to the following simplified physics:

$$v_{t+1} = \mathrm{bound}\big[\, v_t + 0.001\, a_t + g \cos(3 p_t) \,\big]$$

and

$$p_{t+1} = \mathrm{bound}\big[\, p_t + v_{t+1} \,\big],$$

where g = −0.0025 is the force of gravity and the bound operation clips each variable within its allowed range. If $p_{t+1}$ is clipped in this way, then $v_{t+1}$ is also reset to zero. Reward is −1 on all time steps. The trial terminates with the first position value that satisfies $p_{t+1} \ge 0.5$.
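A direct Python transcription of the simplified physics above (helper names are ours); treat it as a sketch of the dynamics rather than a definitive reference implementation.

```python
import math

P_MIN, P_MAX = -1.2, 0.5
V_MIN, V_MAX = -0.07, 0.07

def bound(x, lo, hi):
    """Clip a state variable to its allowed range."""
    return max(lo, min(hi, x))

def mountain_car_step(p, v, a):
    """One step of the simplified physics; a is +1, 0, or -1.
    Returns (p_next, v_next, reward, terminal)."""
    g = -0.0025                                        # force of gravity
    v_next = bound(v + 0.001 * a + g * math.cos(3 * p), V_MIN, V_MAX)
    p_raw = p + v_next
    p_next = bound(p_raw, P_MIN, P_MAX)
    if p_next != p_raw:                                # position was clipped:
        v_next = 0.0                                   # velocity is reset to zero
    terminal = p_next >= P_MAX                         # goal reached
    return p_next, v_next, -1.0, terminal
```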
Notes
1. Arguably, yet a third mechanism for managing delayed reward is to change representations or world models
(e.g., Dayan, 1993; Sutton, 1995).
2. In some previous work (e.g., Sutton & Barto, 1987, 1990) the traces were normalized by a factor of 1 − γλ, which is equivalent to replacing the "1" in these equations by 1 − γλ. In this paper, as in most previous work, we absorb this linear normalization into the step-size parameter, α, in equation (1).
3. The time index here is assumed to continue increasing across trials. For example, if one trial reaches a terminal state at time τ, then the next trial begins at time τ + 1.
4. For this reason, this estimate is sometimes also referred to as the certainty equivalent estimate (e.g., Kumar
and Varaiya, 1986).
5. In theory it is possible to get this down to O(n^{2.376}) operations (Baase, 1988), but, even if practical, this is still far too complex for many applications.
6. Although this algorithm is indeed identical to TD(λ), the theoretical results for TD(λ) on stationary prediction problems (e.g., Sutton, 1988; Dayan, 1992) do not apply here because the policy is continually changing, creating a nonstationary prediction problem.
7. This is a very simple way of assuring initial exploration of the state space. Because most values are better
than they should be, the learning system is initially disappointed no matter what it does, which causes
it to try a variety of things even though its policy at any one time is deterministic. This approach was
sufficient for this task, but of course we do not advocate it in general as a solution to the problem of
assuring exploration.
References
Albus, J. S., (1981). Brain, Behavior, and Robotics, chapter 6, pages 139-179. Byte Books.
Baase, S., (1988). Computer Algorithms: Introduction to design and analysis. Reading, MA: Addison-Wesley.
Barnard, E., (1993). Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man,
and Cybernetics, 23(2), 357-365.
Barto, A. G. & Duff, M., (1994). Monte Carlo matrix inversion and reinforcement learning. In Advances in
Neural Information Processing Systems 6, pages 687-694, San Mateo, CA. Morgan Kaufmann.
Barto, A. G., Sutton, R. S., & Anderson, C. W., (1983). Neuronlike elements that can solve difficult learning
control problems. IEEE Trans. on Systems, Man, and Cybernetics, 13, 835-846.
Bellman, R. E., (1957). Dynamic Programming. Princeton, NJ: Princeton University Press.
Curtiss, J. H., (1954). A theoretical comparison of the efficiencies of two classical methods and a Monte Carlo
method for computing one component of the solution of a set of linear algebraic equations. In Meyer, H. A.
(Ed.), Symposium on Monte Carlo Methods, pages 191-233, New York: Wiley.
Dayan, P., (1992). The convergence of TD(λ) for general λ. Machine Learning, 8(3/4), 341-362.
Dayan, P., (1993). Improving generalization for temporal difference learning: The successor representation.
Neural Computation, 5(4), 613-624.
Dayan, P. & Sejnowski, T., (1994). TD(λ) converges with probability 1. Machine Learning, 14, 295-301.
Holland, J. H., (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to
parallel rule-based systems, Volume 2 of Machine Learning: An Artificial Intelligence Approach, chapter 20.
Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. P., (1994). On the convergence of stochastic iterative dynamic
programming algorithms. Neural Computation, 6(6), 1185-1201.
Jaakkola, T., Singh, S. P., & Jordan, M. I., (1995). Reinforcement learning algorithm for partially observable
Markov decision problems. In Advances in Neural Information Processing Systems 7. Morgan Kaufmann.
Klopf, A. H., (1972). Brain function and adaptive systems--A heterostatic theory. Technical Report AFCRL-
72-0164, Air Force Cambridge Research Laboratories, Bedford, MA.
Kumar, P. R. & Varaiya, P. P., (1986). Stochastic Systems: Estimation, Identification, and Adaptive Control.
Englewood Cliffs, N.J.: Prentice Hall.
Lin, L. J., (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching.
Machine Learning, 8(3/4), 293-321.
Miller, W. T., Glanz, F. H., & Kraft, L. G., (1990). CMAC: An associative neural network alternative to
backpropagation. Proc. of the IEEE, 78, 1561-1567.
Moore, A. W., (1991). Variable resolution dynamic programming: Efficiently learning action maps in multi-
variate real-valued state-spaces. In Machine Learning: Proceedings of the Eighth International Workshop,
pages 333-337, San Mateo, CA. Morgan Kaufmann.
Peng, J., (1993). Dynamic Programming-based Learning for Control. PhD thesis, Northeastern University.
Peng, J. & Williams, R. J., (1994). Incremental multi-step Q-learning. In Machine Learning: Proceedings of
the Eleventh International Conference, pages 226-232. Morgan Kaufmann.
Rubinstein, R., (1981). Simulation and the Monte Carlo method. New York: John Wiley & Sons.
Rummery, G. A. & Niranjan, M., (1994). On-line Q-learning using connectionist systems. Technical Report
CUED/F-INFENG/TR 166, Cambridge University Engineering Dept.
Sutton, R. S., (1984). Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of
Massachusetts, Amherst, MA.
Sutton, R. S., (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R. S., (1995). TD models: Modeling the world at a mixture of time scales. In Machine Learning:
Proceedings of the Twelth International Conference, pages 531-539. Morgan Kaufmann.
Sutton, R. S. & Barto, A. G., (1987). A temporal-difference model of classical conditioning. In Proceedings
of the Ninth Annual Conference of the Cognitive Science Society, pages 355-378, Hillsdale, NJ: Erlbaum.
Sutton, R. S. & Barto, A. G., (1990). Time-derivative models of Pavlovian conditioning. In Gabriel, M.
& Moore, J. W. (Eds.), Learning and Computational Neuroscience, pages 497-537. Cambridge, MA: MIT
Press.
Sutton, R. S. & Singh, S. P., (1994). On step-size and bias in temporal-difference learning. In Eighth Yale
Workshop on Adaptive and Learning Systems, pages 91-96, New Haven, CT.
Sutton, R. S. & Whitehead, S. D., (1993). Online learning with random representations. In Machine Learning:
Proceedings of the Tenth Int. Conference, pages 314-321. Morgan Kaufmann.
Tesauro, G. J., (1992). Practical issues in temporal difference learning. Machine Learning, 8(3/4), 257-277.
Tsitsiklis, J., (1994). Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3),
185-202.
Wasow, W. R., (1952). A note on the inversion of matrices by random walks. Math. Tables Other Aids
Comput., 6, 78-81.
Watkins, C. J. C. H., (1989). Learning from Delayed Rewards. PhD thesis, Cambridge Univ., Cambridge,
England.
Wilson, S. W., (to appear). Classifier fitness based on accuracy. Evolutionary Computation.