Learning To Trade Via Direct Reinforcement

Abstract—We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-Learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.

Index Terms—Differential Sharpe ratio, direct reinforcement (DR), downside deviation, policy gradient, Q-learning, recurrent reinforcement learning, risk, TD-learning, trading, value function.

Manuscript received March 6, 2001; revised March 29, 2001. This work was supported by Nonlinear Prediction Systems and by DARPA under Contract DAAH01-96-C-R026 and AASERT Grant DAAH04-95-1-0485. The authors are with the Computational Finance Program, Oregon Graduate Institute of Science and Technology, Beaverton, OR 97006 USA, and also with Nonlinear Prediction Systems, Beaverton, OR 97006 USA. Publisher Item Identifier S 1045-9227(01)05010-X.

I. INTRODUCTION

THE investor's or trader's ultimate goal is to optimize some relevant measure of trading system performance, such as profit, economic utility, or risk-adjusted return. In this paper, we describe direct reinforcement (DR) methods to optimize investment performance criteria. Investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL). The need to build forecasting models is eliminated, and better trading performance is obtained. This methodology can be applied to optimizing systems designed to trade a single security, allocate assets or manage a portfolio.

Investment performance depends upon sequences of interdependent decisions, and is thus path-dependent. Optimal trading or portfolio rebalancing decisions require taking into account the current system state, which includes both market conditions and the currently held positions. Market frictions, the real-world costs of trading,¹ make arbitrarily frequent trades or large changes in portfolio composition prohibitively expensive. Thus, optimal decisions about establishing new positions must consider current positions held.

In [1] and [2], we proposed the RRL algorithm for DR. RRL is an adaptive policy search algorithm that can learn an investment strategy on-line. We demonstrated in those papers that direct reinforcement provides a more elegant and effective means for training trading systems and portfolio managers when market frictions are considered than do more standard supervised approaches.

In this paper, we contrast our DR (or "policy search") approach with commonly used value function based approaches. We use the term DR to refer to algorithms that do not have to learn a value function in order to derive a policy. DR methods date back to the pioneering work by Farley and Clark [3], [4], but have received little attention from the reinforcement learning community during the past two decades. Notable exceptions are Williams' REINFORCE algorithm [5], [6] and Baxter and Bartlett's recent work [7].²

Methods such as dynamic programming [8], TD-Learning [9] or Q-Learning [10], [11] have been the focus of most of the modern research. These methods attempt to learn a value function or the closely related Q-function. Such value function methods are natural for problems like checkers or backgammon where immediate feedback on performance is not readily available at each point in time. Actor-critic methods [12], [13] have also received substantial attention. These algorithms are intermediate between DR and value function methods, in that the "critic" learns a value function which is then used to update the parameters of the "actor."³

Though much theoretical progress has been made in recent years in the area of value function learning, there have been relatively few widely-cited, successful applications of the techniques. Notable examples include TD-gammon [19], [20], an elevator scheduler [21] and a space-shuttle payload scheduler [22]. Due to the inherently delayed feedback, these applications all use the TD-Learning or Q-Learning value function RL methods.

For many financial decision making problems, however, results accrue gradually over time, and one can immediately measure short-term performance. This enables use of a DR approach

¹Market frictions include taxes and a variety of transaction costs, such as commissions, bid/ask spreads, price slippage and market impact.
²Baxter and Bartlett have independently proposed the term DR for policy gradient algorithms in a Markov decision process framework. We use the term in the same spirit, but perhaps more generally, to refer to any reinforcement learning algorithm that does not require learning a value function.
³For reviews and in-depth presentations of value function and actor-critic methods, see [16]–[18].
to provide immediate feedback to optimize the strategy. One class of performance criteria frequently used in the financial community are measures of risk-adjusted investment returns. RRL can be used to learn trading strategies that balance the accumulation of return with the avoidance of risk. We describe commonly used measures of risk, and review how differential forms of the Sharpe ratio and downside deviation ratio can be formulated to enable efficient online learning with RRL.

We present empirical results for discovering tradeable structure in the U.S. Dollar/British Pound foreign exchange market via DR. In addition, we compare performance for an RRL-Trader and Q-Trader that learn switching strategies between the S&P 500 Stock Index and Treasury Bills. For both traders, the results demonstrate the presence of predictable structure in US stock prices over a 25-year test period. However, we find that the RRL-Trader performs substantially better than the Q-Trader. Relative to Q-Learning, we observe that RRL enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. The S&P 500 and foreign exchange results were previously presented in [2], [23], and [24].

We discuss the relative merits of DR and value function learning, and provide arguments and examples for why value function based methods may result in unnatural problem representations. Our results suggest that DR offers a powerful alternative to reinforcement algorithms that learn a value function, for problem domains where immediate estimates of incremental performance can be obtained.

To conclude the introduction, we would like to note that computational finance offers many interesting and challenging potential applications of reinforcement learning methods. While our work emphasizes DR, most applications in finance to date have been based upon dynamic programming type methods. Elton and Gruber [28] provide an early survey of dynamic programming applications in finance. The problems of optimum consumption and portfolio choice in continuous time have been formulated by Merton [25]–[27] from the standpoints of dynamic programming and stochastic control. The extensive body of work on intertemporal (multi-period) portfolio management and asset pricing is reviewed by Breeden [29]. Duffie [30], [31] describes stochastic control and dynamic programming methods in finance in depth. Dynamic programming provides the basis of the Cox et al. [32] and other widely used binomial option pricing methods. See also the strategic asset allocation work of Brennan et al. [33]. Due to the curse of dimensionality, approximate dynamic programming is often required to solve practical problems, as in the work by Longstaff and Schwartz [34] on pricing American options.

During the past six years, there have been several applications that make use of value function reinforcement learning methods. Van Roy [35] uses a TD(λ) approach for valuing options and performing portfolio optimization. Neuneier [36] uses a Q-Learning approach to make asset allocation decisions, and Neuneier and Mihatsch [37] incorporate a notion of risk sensitivity into the construction of the Q-function. Derivatives pricing applications have been studied by Tsitsiklis and Van Roy [38], [39]. Moody and Saffell compare DR to Q-Learning for asset allocation in [23], and explore the minimization of downside risk using DR in [24].

II. TRADING SYSTEMS AND PERFORMANCE CRITERIA

A. Structure of Trading Systems

In this paper, we consider agents that trade fixed position sizes in a single security. The methods described here can be generalized to more sophisticated agents that trade varying quantities of a security, allocate assets continuously or manage multiple asset portfolios. See [2] for a discussion of multiple asset portfolios.

Here, our traders are assumed to take only long, neutral, or short positions, $F_t \in \{1, 0, -1\}$, of constant magnitude. A long position is initiated by purchasing some quantity of a security, while a short position is established by selling the security.⁴ The price series being traded is denoted $z_t$. The position $F_t$ is established or maintained at the end of each time interval $t$, and is reassessed at the end of period $t+1$. A trade is thus possible at the end of each time period, although nonzero trading costs will discourage excessive trading. A trading system return $R_t$ is realized at the end of the time interval $(t-1, t]$ and includes the profit or loss resulting from the position $F_{t-1}$ held during that interval and any transaction cost incurred at time $t$ due to a difference in the positions $F_{t-1}$ and $F_t$.

In order to properly incorporate the effects of transactions costs, market impact and taxes in a trader's decision making, the trader must have internal state information and must therefore be recurrent. A single asset trading system that takes into account transactions costs and market impact has the following decision function:

$$F_t = F(\theta_t; F_{t-1}, I_t) \quad \text{with} \quad I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\} \tag{1}$$

where $\theta_t$ denotes the (learned) system parameters at time $t$ and $I_t$ denotes the information set at time $t$, which includes present and past values of the price series and an arbitrary number of other external variables denoted $y_t$. A simple example is a {long, short} trader with autoregressive inputs

$$F_t = \operatorname{sign}\left(u F_{t-1} + v_0 r_t + v_1 r_{t-1} + \cdots + v_m r_{t-m} + w\right) \tag{2}$$

where $r_t$ are the price returns of $z_t$ (defined below) and the system parameters are the weights $\theta = \{u, v_0, \ldots, v_m, w\}$. A trader of this form is used in the simulations described in Section IV-A.

The above formulation describes a discrete-action, deterministic trader, but can be easily generalized. One simple generalization is to use continuously valued $F_t \in [-1, 1]$, for example, by replacing $\operatorname{sign}(\cdot)$ with $\tanh(\cdot)$. When discrete values $F_t \in \{1, 0, -1\}$ are imposed, however, the decision function is not differentiable. Nonetheless, gradient based optimization methods for $\theta$ may be developed by considering differentiable prethresholded outputs or, for example, by replacing $\operatorname{sign}(\cdot)$ with $\tanh(\cdot)$ during learning and discretizing the outputs when trading.

⁴For stocks, a short sale involves borrowing shares and then selling the borrowed shares to a third party. A profit is made when the shorted shares are repurchased at a later time at a lower price. Short sales of many securities, including stocks, bonds, futures, options, and foreign exchange contracts, are commonplace.
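As a concrete illustration of the decision function (1) and the {long, short} trader (2), the following minimal Python sketch implements the recurrence. It is an illustration only, not the implementation used for the experiments reported here, and the argument names are ours.

```python
import numpy as np

def trader_decision(theta, F_prev, returns_window):
    """{Long, short} decision of form (2): F_t = sign(u*F_{t-1} + sum_i v_i*r_{t-i} + w).

    theta: dict with scalar 'u', weight vector 'v' of length m+1, and bias 'w';
    returns_window: [r_t, r_{t-1}, ..., r_{t-m}], the recent price returns.
    """
    pre = theta["u"] * F_prev + float(np.dot(theta["v"], returns_window)) + theta["w"]
    # sign(.) yields the discrete action in {1, 0, -1}; replacing it with tanh(pre)
    # gives the differentiable, continuously valued generalization described above.
    return np.sign(pre), pre
```

Because F_prev is fed back into the next decision, the trader is recurrent, so transaction costs incurred by changing positions can influence learning.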
Moreover, the models can be extended to a stochastic framework by including a noise variable $\epsilon_t$ in the decision function

$$F_t = F(\theta_t; F_{t-1}, I_t; \epsilon_t) \tag{3}$$

The random variable $\epsilon_t$ induces a joint probability density for the discrete actions, model parameters and model inputs

$$p(F_t; \theta_t, F_{t-1}, I_t) \tag{4}$$

The noise level (measured by $\sigma_\epsilon$ or more generally the scale of $p(\epsilon)$) can be varied to control the "exploration vs. exploitation" behavior of the trader. Also, differentiability of the probability distribution of actions enables the straightforward application of gradient based learning methods.

B. Profit and Wealth for Trading Systems

Trading systems can be optimized by maximizing performance functions, $U(\cdot)$, such as profit, wealth, utility functions of wealth or performance ratios like the Sharpe ratio. The simplest and most natural performance function for a risk-insensitive trader is profit.

Additive profits are appropriate to consider if each trade is for a fixed number of shares or contracts of security $z_t$. This is often the case, for example, when trading small stock or futures accounts or when trading standard US$ FX contracts in dollar-denominated foreign currencies. We define $r_t = z_t - z_{t-1}$ and $r^f_t = z^f_t - z^f_{t-1}$ as the price returns of a risky (traded) asset and a risk-free asset (like T-Bills), respectively, and denote the transactions cost rate as $\delta$. The additive profit accumulated over $T$ time periods with trading position size $\mu > 0$ is then defined in terms of the trading returns, $R_t$, as

$$P_T = \sum_{t=1}^{T} R_t \quad \text{where} \quad R_t = \mu \left\{ r^f_t + F_{t-1}\left(r_t - r^f_t\right) - \delta \left| F_t - F_{t-1} \right| \right\} \tag{5}$$

with $P_0 = 0$ and typically $F_T = F_0 = 0$. When the risk-free rate of interest is ignored ($r^f_t = 0$), a simplified expression is obtained

$$R_t = \mu \left\{ F_{t-1} r_t - \delta \left| F_t - F_{t-1} \right| \right\} \tag{6}$$

The wealth of the trader is defined as $W_T = W_0 + P_T$.

Multiplicative profits are appropriate when a fixed fraction of accumulated wealth is invested in each long or short trade. Here, $r_t = z_t / z_{t-1} - 1$ and $r^f_t = z^f_t / z^f_{t-1} - 1$. If no short sales are allowed and the leverage factor is set fixed at one, the wealth at time $T$ is

$$W_T = W_0 \prod_{t=1}^{T} (1 + R_t) \quad \text{where} \quad (1 + R_t) = \left\{ 1 + (1 - F_{t-1}) r^f_t + F_{t-1} r_t \right\} \left\{ 1 - \delta \left| F_t - F_{t-1} \right| \right\} \tag{7}$$

When the risk-free rate of interest is ignored ($r^f_t = 0$), a second simplified expression is obtained

$$(1 + R_t) = \left\{ 1 + F_{t-1} r_t \right\} \left\{ 1 - \delta \left| F_t - F_{t-1} \right| \right\} \tag{8}$$

Relaxing the constant magnitude assumption is more realistic for asset allocations and portfolios, and enables better risk control. Related expressions for portfolios are presented in [2].
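As a worked illustration of the return definitions, the sketch below computes the additive and multiplicative trading returns in the zero risk-free-rate forms (6) and (8) as written above; the variable names (mu for position size, delta for the transaction cost rate) are ours.

```python
def additive_return(F_prev, F_curr, r_t, delta, mu=1.0):
    """R_t = mu * (F_{t-1} * r_t - delta * |F_t - F_{t-1}|), the r^f = 0 case of (5)/(6)."""
    return mu * (F_prev * r_t - delta * abs(F_curr - F_prev))

def multiplicative_return(F_prev, F_curr, r_t, delta):
    """(1 + R_t) = (1 + F_{t-1} * r_t)(1 - delta * |F_t - F_{t-1}|), the r^f = 0 case of (7)/(8)."""
    return (1.0 + F_prev * r_t) * (1.0 - delta * abs(F_curr - F_prev)) - 1.0

def wealth(returns, W0=1.0, multiplicative=True):
    """Additive profits sum onto W0; multiplicative profits compound, cf. W_T = W_0 + P_T or W_0 * prod(1 + R_t)."""
    W = W0
    for R in returns:
        W = W * (1.0 + R) if multiplicative else W + R
    return W
```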
C. Performance Criteria

In general, the performance criteria that we consider are functions of profit or wealth $U(W_T)$ after a sequence of $T$ time steps, or more generally of the whole time sequence of trades

$$U(W_1, W_2, \ldots, W_T) \tag{9}$$

The simple form $U(W_T)$ includes standard economic utility functions. The second case is the general form for path-dependent performance functions, which include intertemporal utility functions and performance ratios like the Sharpe ratio and Sterling ratio. In either case, the performance criterion at time $T$ can be expressed as a function of the sequence of trading returns

$$U(R_1, R_2, \ldots, R_T) \tag{10}$$

For brevity, we denote this general form by $U_T$.

For optimizing our traders, we will be interested in the marginal increase in performance due to the return $R_t$ at each time step

$$\Delta U_t = U_t - U_{t-1} \tag{11}$$

Note that $U_t$ depends upon the current trading return $R_t$, but that $U_{t-1}$ does not. Our strategy will be to derive differential performance criteria that capture the marginal "utility" of the trading return $R_t$ at each period.⁵

D. The Differential Sharpe Ratio

Rather than maximizing profits, most modern fund managers attempt to maximize risk-adjusted return, as suggested by modern portfolio theory. The Sharpe ratio is the most widely-used measure of risk-adjusted return [40]. Denoting as before the trading system returns for period $t$ (including transactions costs) as $R_t$, the Sharpe ratio is defined to be

$$S_T = \frac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)} \tag{12}$$

where the average and standard deviation are estimated for periods $t = 1, \ldots, T$. Note that for ease of exposition and analysis, we have suppressed inclusion of portfolio returns due to the risk free rate on capital $r^f_t$. Substituting excess returns $\tilde{R}_t = R_t - r^f_t$ for $R_t$ in the equation above produces the standard definition. With this caveat in mind, we use (12) for discussion purposes without loss of mathematical generality.⁶

Proper on-line learning requires that we compute the influence on the Sharpe ratio (marginal utility $D_t$) of the trading return at time $t$. To accomplish this, we have derived a new objective function called the differential Sharpe ratio for on-line optimization of trading system performance [1], [2]. It is obtained by considering exponential moving averages of the returns and standard deviation of returns in (12), and expanding to first order in the adaptation rate $\eta$

$$S_t \approx S_{t-1} + \eta \left. \frac{dS_t}{d\eta} \right|_{\eta=0} + O(\eta^2) \tag{13}$$

Note that a zero adaptation rate corresponds to an infinite time average. Expanding about $\eta = 0$ amounts to "turning on" the adaptation.

Since only the first order term in this expansion depends upon the return $R_t$ at time $t$, we define the differential Sharpe ratio as

$$D_t \equiv \frac{dS_t}{d\eta} = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}} \tag{14}$$

where the quantities $A_t$ and $B_t$ are exponential moving estimates of the first and second moments of $R_t$

$$A_t = A_{t-1} + \eta\,\Delta A_t = A_{t-1} + \eta\left(R_t - A_{t-1}\right), \qquad B_t = B_{t-1} + \eta\,\Delta B_t = B_{t-1} + \eta\left(R_t^2 - B_{t-1}\right) \tag{15}$$

Treating $A_{t-1}$ and $B_{t-1}$ as numerical constants, note that $\eta$ in the update equations controls the magnitude of the influence of the return $R_t$ on the Sharpe ratio $S_t$. Hence, the differential Sharpe ratio represents the influence of the trading return $R_t$ realized at time $t$ on $S_t$. It is the marginal utility for the Sharpe ratio criterion.

The influences of risk and return on the differential Sharpe ratio are readily apparent. The current return $R_t$ enters expression (14) only in the numerator through $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$. The first term in the numerator is positive if $R_t$ exceeds the moving average of past returns (increased reward), while the second term is negative if $R_t^2$ exceeds the moving average of past squared returns (increased risk).

The differential Sharpe ratio is used in the RRL algorithm (see Section III-A) as the current contribution to the performance function $U_t$. Since $S_{t-1}$ in (13) does not depend on $R_t$, when optimizing the trading system using (14) the relevant derivatives have the simple form

$$\frac{dD_t}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{\left(B_{t-1} - A_{t-1}^2\right)^{3/2}} \tag{16}$$

The differential Sharpe ratio has several attractive properties:
• Facilitates recursive updating: The incremental nature of the calculations of $A_t$ and $B_t$ make updating the exponential moving Sharpe ratio straightforward. It is not necessary to recompute the average and standard deviation of returns for the entire trading history in order to update the Sharpe ratio for the most recent time period.
• Enables efficient on-line optimization: $D_t$ and $dD_t/dR_t$ can be cheaply calculated using the previously computed moving averages $A_{t-1}$ and $B_{t-1}$ and the current return $R_t$. This enables efficient stochastic optimization.
• Weights recent returns more: Based on the exponential moving average Sharpe ratio, recent returns receive stronger weightings in $D_t$ than do older returns.
• Provides interpretability: The differential Sharpe ratio isolates the contribution of the current return $R_t$ to the exponential moving average Sharpe ratio. The simple form of $D_t$ makes clear how risk and reward affect the Sharpe ratio.

One difficulty with the Sharpe ratio, however, is that the use of variance or $R_t^2$ as a risk measure does not distinguish between upside and downside "risk." Assuming that $A_{t-1} > 0$, the largest possible improvement in $D_t$ occurs when

$$R_t = \frac{B_{t-1}}{A_{t-1}} \tag{17}$$

Thus, the Sharpe ratio actually penalizes gains larger than $B_{t-1}/A_{t-1}$, which is counter-intuitive relative to most investors' notions of risk and reward.

⁵Strictly speaking, many of the performance criteria commonly used in the financial industry are not true utility functions, so we use the term "utility" in a more colloquial sense.
⁶For systems that trade futures and forwards, $R_t$ should be used in place of $\tilde{R}_t$, because the risk free rate is already accounted for in the relation between forward prices and spot prices.
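A minimal sketch of the differential Sharpe ratio updates (14)–(16) follows. It assumes the exponential moving moments are initialized to zero and reports a zero derivative until the variance estimate becomes positive; it is illustrative only, not the code used in the experiments.

```python
class DifferentialSharpe:
    """Differential Sharpe ratio of (14)-(16), with exponential moving moments A and B."""

    def __init__(self, eta=0.01, A0=0.0, B0=0.0):
        self.eta = eta
        self.A = A0   # moving estimate of E[R_t]
        self.B = B0   # moving estimate of E[R_t^2]

    def step(self, R):
        dA = R - self.A
        dB = R * R - self.B
        var = self.B - self.A ** 2            # moving estimate of the return variance
        if var > 0.0:
            denom = var ** 1.5
            # D_t = (B_{t-1} dA_t - 0.5 A_{t-1} dB_t) / (B_{t-1} - A_{t-1}^2)^{3/2}, cf. (14)
            D = (self.B * dA - 0.5 * self.A * dB) / denom
            # dD_t/dR_t = (B_{t-1} - A_{t-1} R_t) / (B_{t-1} - A_{t-1}^2)^{3/2}, cf. (16)
            dD_dR = (self.B - self.A * R) / denom
        else:
            D, dD_dR = 0.0, 0.0               # undefined until the variance estimate is positive
        # Exponential moving updates of the first and second moments, cf. (15).
        self.A += self.eta * dA
        self.B += self.eta * dB
        return D, dD_dR
```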
E. Downside Risk

Symmetric measures of risk such as variance are more and more being viewed as inadequate measures due to the asymmetric preferences of most investors to price changes. Few investors consider large positive returns to be "risky," though both large positive as well as negative returns are penalized using a symmetric measure of risk such as the variance. To most investors, the term "risk" refers intuitively to returns in a portfolio that decrease its profitability.

Markowitz, the father of modern portfolio theory, understood this. Even though most of his work focussed on the mean-variance framework for portfolio optimization, he proposed the semivariance as a means for dealing with downside returns [41]. After a long hiatus lasting three decades, there is now a vigorous industry in the financial community in modeling and minimizing downside risk. Criteria of interest include the downside deviation (DD), the second lower partial moment (SLPM) and the $n$th lower partial moment [42]–[46].

One measure of risk-adjusted performance widely used in the professional fund management community (especially for hedge funds) is the Sterling ratio, commonly defined as

$$\text{Sterling Ratio} = \frac{\text{Annualized Average Return}}{\text{Maximum Draw-Down}} \tag{18}$$

Here, the maximum draw-down (from peak to trough) in account equity or net asset value is defined relative to some standard reference period, for example one to three years. Minimizing drawdowns is somewhat cumbersome, so we focus on the DD as a measure of downside risk in this paper.⁷

The DD is defined to be the square root of the average of the square of the negative returns

$$\text{DD}_T = \left( \frac{1}{T} \sum_{t=1}^{T} \min\{R_t, 0\}^2 \right)^{1/2} \tag{19}$$

Using the DD as a measure of risk, we can now define a utility function similar to the Sharpe ratio, which we will call the downside deviation ratio (DDR)

$$\text{DDR}_T = \frac{\text{Average}(R_t)}{\text{DD}_T} \tag{20}$$

The DDR rewards the presence of large average positive returns and penalizes risky returns, where "risky" now refers to downside returns.

In order to facilitate the use of our recurrent reinforcement learning algorithm (Section III), we need to compute the influence of the return $R_t$ at time $t$ on the DDR. In a similar manner to the development of the differential Sharpe ratio in [2], we define exponential moving averages of returns and of the squared DD

$$A_t = A_{t-1} + \eta\left(R_t - A_{t-1}\right), \qquad \text{DD}_t^2 = \text{DD}_{t-1}^2 + \eta\left(\min\{R_t, 0\}^2 - \text{DD}_{t-1}^2\right) \tag{21}$$

and define the DDR in terms of these moving averages. We obtain our performance function by considering a first-order expansion in the adaptation rate $\eta$ of the DDR

$$\text{DDR}_t \approx \text{DDR}_{t-1} + \eta \left. \frac{d\,\text{DDR}_t}{d\eta} \right|_{\eta=0} + O(\eta^2) \tag{22}$$

We define the first-order term $d\,\text{DDR}_t / d\eta$ to be the differential downside deviation ratio. It has the form

$$D_t \equiv \frac{d\,\text{DDR}_t}{d\eta} = \begin{cases} \dfrac{R_t - \tfrac{1}{2} A_{t-1}}{\text{DD}_{t-1}} & R_t > 0 \\[2ex] \dfrac{\text{DD}_{t-1}^2 \left( R_t - \tfrac{1}{2} A_{t-1} \right) - \tfrac{1}{2} A_{t-1} R_t^2}{\text{DD}_{t-1}^3} & R_t \le 0 \end{cases} \tag{23, 24}$$

From (24) it is obvious that when $R_t > 0$, the utility increases as $R_t$ increases, with no penalty for large positive returns such as exists when using variance as the risk measure. See [24] for detailed experimental results on the use of the DDR to build RRL trading systems.
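The corresponding sketch for the differential downside deviation ratio implements the moving averages (21) and the piecewise form (23), (24) as written above. The small positive floor on the initial downside deviation is an implementation convenience of ours, not part of the definition.

```python
class DifferentialDDR:
    """Differential downside deviation ratio, cf. (21)-(24); an illustrative sketch."""

    def __init__(self, eta=0.01, A0=0.0, DD0=1e-8):
        self.eta = eta
        self.A = A0            # exponential moving average of returns
        self.DD2 = DD0 ** 2    # exponential moving average of squared downside returns

    def step(self, R):
        DD = self.DD2 ** 0.5
        if R > 0.0:
            # Positive returns add reward without adding downside risk, cf. (23)/(24).
            D = (R - 0.5 * self.A) / DD
        else:
            D = (self.DD2 * (R - 0.5 * self.A) - 0.5 * self.A * R * R) / DD ** 3
        # Exponential moving updates of (21).
        self.A += self.eta * (R - self.A)
        self.DD2 += self.eta * (min(R, 0.0) ** 2 - self.DD2)
        return D
```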
III. LEARNING TO TRADE

Reinforcement learning adjusts the parameters of a system to maximize the expected payoff or reward generated by its actions. This is accomplished through trial and error exploration of the environment and space of strategies. In contrast to supervised learning, the system is not presented with examples of desired actions. Rather, it receives a reinforcement signal from its environment (a reward) that provides information on whether its actions are good or bad.

In [1], [2], we compared supervised learning to our DR approach. The supervised methods discussed included trading based upon forecasts of market prices and training a trader using labeled data. In both supervised frameworks, difficulties are encountered when transaction costs are included. While supervised learning methods can be effective for solving the structural credit assignment problem, they do not typically address the temporal credit assignment problem.

Structural credit assignment refers to the problem of assigning credit to the individual parameters of a system. If the reward produced also depends on a series of actions of the system, then the temporal credit assignment problem is encountered, i.e., assigning credit to the individual actions taken over time [48]. Reinforcement learning algorithms offer advantages over supervised methods by attempting to solve both problems simultaneously.

Reinforcement learning algorithms can be classified as either DR (sometimes called "policy search"), value function, or actor-critic methods. The choice of the best method depends upon the nature of the problem domain. We will discuss this issue in greater detail in Section V. In this section, we present the recurrent reinforcement learning algorithm for DR and review value function based methods, specifically Q-learning [10] and a refinement of Q-learning called advantage updating [49]. In Section IV-C, we compare the RRL and value function methods for systems that learn to allocate assets between the S&P 500 stock index and T-Bills.

A. Recurrent Reinforcement Learning

In this section, we describe the recurrent reinforcement learning algorithm for DR. This algorithm was originally presented in [1] and [2].

Given a trading system model $F_t(\theta)$, the goal is to adjust the parameters $\theta$ in order to maximize $U_T$. For traders of form (1) and trading returns of form (6) or (8), the gradient of $U_T$ with respect to the parameters $\theta$ of the system after a sequence of $T$ periods is

$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta} \right\} \tag{25}$$

The system can be optimized in batch mode by repeatedly computing the value of $U_T$ on forward passes through the data and adjusting the trading system parameters by using gradient ascent (with learning rate $\rho$)

$$\Delta\theta = \rho \, \frac{dU_T(\theta)}{d\theta} \tag{26}$$

Because of the inherent recurrence (the dependence of $R_t$ on $F_{t-1}$), computing these gradients in an efficient manner requires an approach similar to backpropagation through time (BPTT) [50], [51]. The temporal dependencies in a sequence of decisions are accounted for through a recursive update equation for the parameter gradients

$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta} \tag{27}$$

The above expressions (25) and (27) assume differentiability of $F_t$. For the long/short traders with thresholds described in Section II-A, the reinforcement signal can be backpropagated through the prethresholded outputs in a manner similar to the Adaline learning rule [52]. Equations (25)–(27) constitute the batch RRL algorithm.

There are two ways in which the batch algorithm described above can be extended into a stochastic framework. First, exploration of the strategy space can be induced by incorporating a noise variable $\epsilon_t$, as in the stochastic trader formulation of (3). The tradeoff between exploration of the strategy space and exploitation of a learned policy can be controlled by the magnitude of the noise variance $\sigma_\epsilon$. The noise magnitude can be annealed over time during simulation, in order to arrive at a good strategy.

Second, a simple on-line stochastic optimization can be obtained by considering only the term in (25) that depends on the most recently realized return $R_t$ during a forward pass through the data

$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t} \frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \right\} \tag{28}$$

The parameters are then updated on-line using

$$\Delta\theta_t = \rho \, \frac{dU_t(\theta_t)}{d\theta_t} \tag{29}$$

with the on-line analog of the recursion (27)

$$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{\partial F_t}{\partial F_{t-1}} \frac{dF_{t-1}}{d\theta_{t-1}} \tag{30}$$

We use on-line algorithms of this recurrent reinforcement learning type in the simulations presented in Section IV. Note that we find that use of a noise variable $\epsilon_t$ provides little advantage for the real financial applications that we consider, since the data series contain significant intrinsic noise. Hence, we find that a simple "greedy" update is adequate.⁸

The above description of the RRL algorithm is for traders that optimize immediate estimates of performance $U_t$ for specific actions $F_t$ taken. This presentation can be thought of as a special case of a more general Markov decision process (MDP) and policy gradient formulation. One straightforward extension of our formulation can be obtained for traders that maximize discounted future rewards. We have experimented with this approach, but found little advantage for the problems we consider. A second extension to the formulation is to consider a stochastic trader (3) and an expected reward framework, for which the probability distribution of actions is differentiable. This latter approach makes use of the joint density of (4). While the expected reward framework is appealing from a theoretical perspective, (28)–(30) presented above provide the practical basis for simulations.

Although we have focussed our discussion on traders of a single risky asset with scalar $F_t$, the algorithms described in this section can be trivially generalized to the vector case for portfolios. Optimization of portfolios is described in [1], [2].

⁸Tesauro finds a similar result for TD-Gammon [19], [20]. A "greedy" update works well, because the dice rolls in the game provided enough uncertainty to induce extensive strategy exploration.
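Combining the recursive gradient (27) with the on-line updates (28)–(30) and a differential performance measure from Section II gives a complete on-line RRL step. The sketch below is for the tanh-prethresholded trader of (2); the bookkeeping (argument names, the greedy no-noise update, the subgradient used for the cost term) is ours and is meant as an illustration, not the exact implementation used in Section IV.

```python
import numpy as np

def rrl_online_step(theta, grad_F_prev, F_prev, x_t, r_t, delta, perf, rho=0.01):
    """One on-line RRL update, cf. (27)-(30).

    theta: weight vector [u, v_0..v_m, w]; x_t: input vector [F_{t-1}, r_t..r_{t-m}, 1];
    grad_F_prev: dF_{t-1}/dtheta carried over from the previous step; perf: object whose
    step(R) method returns (D_t, dD_t/dR_t), e.g. the DifferentialSharpe sketch above.
    """
    pre = float(np.dot(theta, x_t))
    F_t = np.tanh(pre)                        # differentiable (prethresholded) action
    # Recursive gradient, eq. (27): dF_t/dtheta = dF/dtheta + (dF/dF_{t-1}) dF_{t-1}/dtheta.
    dF_dpre = 1.0 - F_t ** 2
    grad_F = dF_dpre * (x_t + theta[0] * grad_F_prev)   # theta[0] is u, the weight on F_{t-1}
    # Trading return with transaction costs, r^f = 0 case of (6).
    R_t = F_prev * r_t - delta * abs(F_t - F_prev)
    D_t, dD_dR = perf.step(R_t)
    # dR_t/dF_t and dR_t/dF_{t-1}, using the sign subgradient of the |.| cost term.
    s = np.sign(F_t - F_prev)
    dR_dF = -delta * s
    dR_dFprev = r_t + delta * s
    # On-line gradient ascent, eqs. (28)-(29).
    theta = theta + rho * dD_dR * (dR_dF * grad_F + dR_dFprev * grad_F_prev)
    return theta, grad_F, F_t, R_t
```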
B. Value Functions and Q-Learning

Besides explicitly training a trader to take actions, we can also implicitly learn correct actions through the technique of value iteration. Value iteration uses a value function to evaluate and improve policies (see [16] for a tutorial introduction and [18] for a full overview of these algorithms). The value function, $V^\pi(x)$, is an estimate of discounted future rewards that will be received from starting in state $x$, and by following the policy $\pi$ thereafter. The value function satisfies Bellman's equation

$$V^\pi(x) = E_\pi\left\{ r_{t+1} + \gamma\, V^\pi(x_{t+1}) \mid x_t = x \right\} \tag{33}$$
The value function can be improved through value iteration (34), (35), and the closely related Q-function, $Q^\pi(x, a)$, estimates the discounted future rewards received by taking action $a$ in state $x$ and following the policy $\pi$ thereafter. Similarly to (35), the Q-function can be learned using a value iteration approach (38).

Advantage updating [49] instead learns an advantage function that measures the relative advantage of choosing a particular action while in state $x$ versus choosing the best possible action for that state, together with a value function that measures the expected discounted future rewards as described previously. Advantage updating has the following relationship with Q-learning (41).
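For reference alongside (38), the following is the standard tabular Q-learning value-iteration update of [10], [11]. It is a generic sketch; the precise forms of (34)–(41), including the advantage-updating refinement of [49], are not reproduced here and should not be read off this code.

```python
def q_learning_update(Q, x, a, reward, x_next, actions, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning step: Q(x,a) += alpha * [r + gamma * max_b Q(x',b) - Q(x,a)]."""
    best_next = max(Q.get((x_next, b), 0.0) for b in actions)
    td_target = reward + gamma * best_next
    Q[(x, a)] = Q.get((x, a), 0.0) + alpha * (td_target - Q.get((x, a), 0.0))
    return Q

def greedy_action(Q, x, actions):
    """Policy implied by the Q-function: choose the action with the largest estimated value."""
    return max(actions, key=lambda a: Q.get((x, a), 0.0))
```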
IV. EMPIRICAL RESULTS

The first experiments use artificial price series generated from a random walk with an autoregressive trend process

$$p_t = p_{t-1} + \beta_{t-1} + \kappa\,\epsilon_t \tag{42}$$

$$\beta_t = \alpha\,\beta_{t-1} + \nu_t \tag{43}$$

where $\alpha$ and $\kappa$ are constants, and $\epsilon_t$ and $\nu_t$ are normal random deviates with zero mean and unit variance. We define the artificial price series as

$$z_t = \exp\left(\frac{p_t}{R}\right) \tag{44}$$

where $R$ is the range of $p_t$.

Fig. 2. Artificial prices (top panel), trading signals (second panel), cumulative sums of profits (third panel) and the moving average Sharpe ratio with η = 0.01 (bottom panel). The system performs poorly while learning from scratch during the first 2000 time periods, but its performance remains good thereafter.

Fig. 3. An expanded view of the last thousand time periods of Fig. 2. The exponential moving Sharpe ratio has a forgetting time scale of 1/η = 100 periods. A smaller η would smooth the fluctuations out.

Fig. 4. Histograms of the price changes (top), trading profits per time period (middle) and Sharpe ratios (bottom) for the simulation shown in Fig. 2. The left column is for the first 5000 time periods, and the right column is for the last 5000 time periods. The transient effects during the first 2000 time periods for the real-time recurrent learning are evident in the lower left graph.
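A sketch of an artificial price generator consistent with the description above — a random walk whose drift follows a first-order autoregression, exponentiated and range-normalized to give the traded series. The specific constants and the normalization should be treated as assumptions for illustration.

```python
import numpy as np

def make_artificial_prices(T=10000, alpha=0.9, kappa=3.0, seed=0):
    """Artificial price series: a random walk p_t whose drift beta_t is AR(1), cf. (42)-(44).

    Assumed reading: p_t = p_{t-1} + beta_{t-1} + kappa*eps_t, beta_t = alpha*beta_{t-1} + nu_t,
    with eps_t, nu_t ~ N(0, 1); the traded series z_t = exp(p_t / R), R the range of p.
    """
    rng = np.random.default_rng(seed)
    p = np.zeros(T)
    beta = 0.0
    for t in range(1, T):
        p[t] = p[t - 1] + beta + kappa * rng.standard_normal()
        beta = alpha * beta + rng.standard_normal()
    R = p.max() - p.min()          # range normalization keeps the exponentiated prices bounded
    return np.exp(p / R)
```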
The traders are first optimized on the training data set for 100 epochs and adapted on-line throughout the whole test data set. Each trial has different realizations of the artificial price process and different randomly chosen initial trader parameter values. We vary the transaction cost over 0.2%, 0.5%, and 1%, and observe the trading frequency, cumulative profit and Sharpe ratio over the test data set. As shown, in all 100 experiments, positive Sharpe ratios are obtained. As expected, trading frequency is reduced as transaction costs increase.

… through the interbank FX market in order to verify real time transactable prices and profitability.¹⁰

¹⁰The data is part of the Olsen & Associates HFDF96 dataset, obtainable by contacting www.olsen.ch.
Fig. 6. {Long, short, neutral} trading system of the U.S. Dollar/British Pound that uses the bid/ask spread as transaction costs. The data consists of half-hourly quotes for the five-day-per-week, 24-hour interbank FX market. The time period shown is the first eight months of 1996. The trader is optimized via recurrent reinforcement learning to maximize the differential downside deviation ratio. The first 2000 data points (approximately two months) are used for training and validation. The trading system achieves an annualized 15% return with an annualized Sharpe ratio of 2.3 over the approximately six-month out-of-sample test period. On average, the system makes a trade once every five hours.

Fig. 7. Time series that influence the return attainable by the S&P 500/T-Bill asset allocation system. The top panel shows the S&P 500 series with and without dividends reinvested. The bottom panel shows the annualized monthly Treasury Bill and S&P 500 dividend yields.

C. S&P 500/T-Bill Asset Allocation

In this section we compare the use of recurrent reinforcement learning to the advantage updating formulation of the Q-learning algorithm for building a trading system. These comparative results were presented previously at NIPS*98 [23]. The long/short trading systems trade the S&P 500 stock index, in effect allocating assets between the S&P 500 and three-month Treasury Bills. When the traders are long the S&P 500, no T-Bill interest is earned, but when the traders are short stocks (using standard 2:1 leverage), they earn twice the T-Bill rate. We use the advantage updating refinement instead of the standard Q-Learning algorithm, because we found it to yield better trading results. See Section III-B.2 for a description of the representational advantages of the approach.

The S&P 500 target series is the total return index computed monthly by reinvesting dividends. The S&P 500 indexes with …
Fig. 9. Test results for ensembles of simulations using the S&P 500 stock index and three-month Treasury Bill data over the 1970–1994 time period. Shown are the equity curves associated with the systems and the buy and hold strategy, as well as the trading signals produced by the systems. The solid curves correspond to the RRL-Trader system performance, dashed curves to the Q-Trader system and the dashed and dotted curves indicate the buy and hold performance. Both systems significantly outperform the buy and hold strategy. In both cases, the traders avoid the dramatic losses that the buy and hold strategy incurred during 1974 and 1987.

Fig. 10. Sensitivity traces for three of the inputs to the RRL-Trader trading system averaged over the ensemble of traders. The nonstationary relationships typical among economic variables are evident from the time-varying sensitivities.

3) Model Insight Through Sensitivity Analysis: A sensitivity analysis of the RRL-Trader systems was performed in an attempt to determine on which economic factors the traders are basing their decisions. Fig. 10 shows the absolute normalized sensitivities for three of the more salient input series as a function of time, averaged over the 30 members of the RRL-Trader committee. The sensitivity $S_i$ of input $i$ is defined as

$$S_i = \left| \frac{dF}{dx_i} \right| \Big/ \max_j \left| \frac{dF}{dx_j} \right| \tag{45}$$

where $F$ denotes the unthresholded trading output of the policy function and $x_i$ denotes input $i$.

The time-varying sensitivities in Fig. 10 emphasize the nonstationarity of economic relationships. For example, the yield curve slope (which measures inflation expectations) is found to be a very important factor in the 1970s, while trends in long term interest rates (measured by the six-month difference in the AAA bond yield) become more important in the 1980s, and trends in short term interest rates (measured by the six-month difference in the treasury bill yield) dominate in the early 1990s.
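A sketch of the sensitivity measure (45) as written above, using central finite differences in place of analytic derivatives; policy_pre stands for the unthresholded trading output and is an assumed interface, not part of the original system.

```python
import numpy as np

def input_sensitivities(policy_pre, x, eps=1e-4):
    """Absolute normalized sensitivities |dF/dx_i| / max_j |dF/dx_j|, cf. (45).

    policy_pre(x) returns the unthresholded trading output for input vector x (numpy array);
    derivatives are approximated here by central finite differences.
    """
    grads = np.zeros(len(x))
    for i in range(len(x)):
        x_hi, x_lo = x.copy(), x.copy()
        x_hi[i] += eps
        x_lo[i] -= eps
        grads[i] = abs(policy_pre(x_hi) - policy_pre(x_lo)) / (2.0 * eps)
    return grads / grads.max() if grads.max() > 0 else grads
```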
V. LEARN THE POLICY OR LEARN THE VALUE?

As mentioned in Section III, reinforcement learning algorithms can be classified as either DR (sometimes called "policy search"), value function methods, or actor-critic methods. The choice of the best method depends upon the nature of the problem domain.

A. Immediate versus Future Rewards

Reinforcement signals received from the environment can be immediate or delayed. In some problems, such as checkers [55], [56], backgammon [19], [20], navigating a maze [57], or maneuvering around obstacles [58], reinforcement from the environment occurs only at the end of the game or task. The final rewards received are {success, failure} or {win, lose}. For such tasks, the temporal credit assignment problem is extreme. There is usually no a priori assessment of performance available during the course of each game or trial. Hence, one is forced to learn a value function of the system state at each time. This is accomplished by doing many runs on a trial and error basis, and discounting the ultimate reward received back in time. This discounting approach is the basis of dynamic programming [8], TD-Learning [9] and Q-Learning [10], [11]. For these value function methods, the action taken at each time is that which offers the largest increase in expected value. Thus, the policy is not represented directly. An intermediate class of reinforcement algorithms are actor-critic methods [12]. While the actor module provides a direct representation of the policy for these methods, it relies on the critic module for feedback. The role of the critic is to learn the value function.

In contrast, direct reinforcement methods represent the policy directly, and make use of immediate feedback to adjust the policy. This approach is appealing when it is possible to specify an instantaneous measure of performance, because the need to learn a value function is bypassed.

In trading, asset allocation and portfolio management problems, for example, overall performance accrues gradually over time. For these financial decision making problems, an immediate measure of incremental performance is available at each time step. Although total performance usually involves integrating or averaging over time, it is nonetheless possible to adaptively update the strategy based upon the investment return received at each time step.

Other domains that offer the possibility of immediate feedback include a wide range of control applications. The standard formulation for optimal control problems involves time integrals of an instantaneous performance measure. Examples of common loss functions include average squared deviation from a desired trajectory or average squared jerk.¹⁵

¹⁵"Jerk" is the rate of change of acceleration.
… maintaining its positions for extended periods. The frequent switches in position by the Q-Trader suggest that it is more sensitive to noise in the inputs. Hence, the strategy it has learned is brittle.

Regarding interpretability, we find the value function representation to be obscure. While the change in the policy as implemented by the RRL algorithm is directly related to changes in the inputs, for the value function the effect on policy is not so clear. While the RRL-Trader has an almost linear policy representation (a net with just a single tanh unit), the Q-Trader's policy is the argmax of a two layer network for which the action is an input. The brittle behavior of the Q-Trader is probably due to the complexity of the learned Q-function with respect to the inputs and actions. The problem representation for the Q-Trader thus reduces explanatory value.

The sensitivity analysis presented for the RRL-Trader strategy in Section IV-C.3 was easy to formulate and implement. It enables us to identify the most important explanatory variables, and to observe how their relative saliency varies slowly over time. For the Q-Trader, however, a similar analysis is not straightforward. The possible actions are represented as inputs to the Q-function network, with the chosen action being determined by the argmax. While we can imagine proxies for a sensitivity analysis in a simple two action {long, short} framework, it is not clear how to perform a sensitivity analysis for actions versus inputs in general for a Q-learning framework. This reduces the explanatory value of a Q-Trader.
Since the {long, short} Q-Trader is implemented using a neural network function approximator, Bellman's curse of dimensionality has a relatively small impact on the results of the experiments presented here. The input dimensionality of the Q-Trader is increased by only one, and there are only two actions to consider. However, in the case of a portfolio management or multi-sector asset allocation system, the dimensionality problem becomes severe.¹⁸ Portfolio management requires a continuous weight for each of the $m$ assets included in the portfolio. This increases the input dimension for the Q-Trader by $m$ relative to the RRL-Trader. Then, in order to facilitate the discovery of argmax actions, we can only consider discrete action sets. The number of discrete actions that must be considered is exponential in $m$. As another issue, we must also consider the possible loss of utility that results due to the finite resolution of action choices.

In terms of efficiency, the advantage updating representation used for the Q-Trader required two networks each with 30 tanh units. In order to reduce run time, the simulation code was written in C. Still, each run required approximately 25 hours to complete using a Pentium Pro 200 running the Linux operating system. The RRL networks used a single tanh unit, and were implemented as uncompiled Matlab code. Even given this unoptimized coding, the RRL simulations were 150 times faster, taking only 10 min.

¹⁸We have encountered this obstacle in preliminary, unpublished experiments.

VI. CONCLUSION

In this paper, we have demonstrated how to train trading systems via DR. We have described the RRL algorithm, and used it to optimize financial performance criteria such as the differential Sharpe ratio and differential downside deviation ratio. We have also provided empirical results that demonstrate the presence of predictability as discovered by RRL in intradaily U.S. Dollar/British Pound exchange rates and in the monthly S&P 500 Stock Index for the 25-year test period 1970 through 1994.

In previous work [1], [2], we showed that trading systems trained via RRL significantly outperform systems trained using supervised methods. In this paper, we have compared the DR approach using RRL to the Q-learning value function method. We find that an RRL-Trader achieves better performance than a Q-Trader for the S&P 500/T-Bill asset allocation problem. We observe that relative to Q-learning, RRL enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency.

We have also discussed the relative merits of DR and value function learning, and provided arguments and examples for why value function-based methods may result in unnatural problem representations. For problem domains where immediate estimates of incremental performance can be obtained, our results suggest that DR offers a powerful alternative.

ACKNOWLEDGMENT

The authors wish to thank L. Wu and Y. Liao for their contributions to our early work on DR, and A. Atiya, G. Tesauro, and the reviewers for their helpful comments on this manuscript.

REFERENCES

[1] J. Moody and L. Wu, "Optimization of trading systems and portfolios," in Decision Technologies for Financial Engineering, Y. Abu-Mostafa, A. N. Refenes, and A. S. Weigend, Eds. London: World Scientific, 1997, pp. 23–35.
[2] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," J. Forecasting, vol. 17, pp. 441–470, 1998.
[3] W. A. Clark and B. G. Farley, "Simulation of self-organizing systems by digital computer," IRE Trans. Inform. Theory, vol. 4, pp. 76–84, 1954.
[4] ——, "Generalization of pattern recognition in a self-organizing system," in Proc. 1955 Western Joint Comput. Conf., 1955, pp. 86–91.
[5] R. J. Williams, "Toward a theory of reinforcement-learning connectionist systems," College Comput. Sci., Northeastern Univ., Boston, MA, Tech. Rep. NU-CCS-88-3, 1988.
[6] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[7] J. Baxter and P. L. Bartlett, "Direct gradient-based reinforcement learning: I. Gradient estimation algorithms," Comput. Sci. Lab., Australian Nat. Univ., Tech. Rep., 1999.
[8] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[9] R. S. Sutton, "Learning to predict by the method of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9–44, 1988.
[10] C. J. C. H. Watkins, "Learning with Delayed Rewards," Ph.D. thesis, Psychol. Dept., Cambridge Univ., 1989.
[11] C. J. Watkins and P. Dayan, "Technical note: Q-Learning," Machine Learning, vol. 8, no. 3, pp. 279–292, 1992.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 835–846, Sept. 1983.
[13] A. G. Barto, Handbook of Intelligent Control. New York: Van Nostrand Reinhold, 1992, ch. 12.
[14] Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds. New York: Van Nostrand Reinhold, 1992, pp. 65–90.
[15] Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds. New York: Van Nostrand Reinhold, 1992, pp. 493–526.
[16] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artificial Intell. Res., vol. 4, 1996.
[17] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1997.
[19] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215–219, 1994.
[20] ——, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58–68, 1995.
[21] R. H. Crites and A. G. Barto, "Improving elevator performance using reinforcement learning," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., 1996, vol. 8, pp. 1017–1023.
[22] W. Zhang and T. G. Dietterich, "High-performance job-shop scheduling with a time-delay TD(λ) network," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., 1996, vol. 8, pp. 1024–1030.
[23] J. Moody and M. Saffell, "Reinforcement learning for trading," in Advances in Neural Information Processing Systems, S. A. Solla, M. S. Kearns, and D. A. Cohn, Eds. MIT Press, 1999, vol. 11, pp. 917–923.
[24] ——, "Minimizing downside risk via stochastic dynamic programming," in Computational Finance 1999, A. W. Lo, Y. S. Abu-Mostafa, B. LeBaron, and A. S. Weigend, Eds. Cambridge, MA: MIT Press, 2000, pp. 403–415.
[25] R. C. Merton, "Lifetime portfolio selection under uncertainty: The continuous-time case," Rev. Economics Statist., vol. 51, pp. 247–257, Aug. 1969.
[26] ——, "Optimum consumption and portfolio rules in a continuous-time model," J. Economic Theory, vol. 3, pp. 373–413, Dec. 1971.
[27] R. C. Merton, Continuous-Time Finance. Oxford, U.K.: Blackwell, 1990.
[28] E. J. Elton and M. J. Gruber, "Dynamic programming applications in finance," J. Finance, vol. 26, no. 2, 1971.
[29] D. T. Breeden, "Intertemporal portfolio theory and asset pricing," in Finance, J. Eatwell, M. Milgate, and P. Newman, Eds. New York: Macmillan, 1987, pp. 180–193.
[30] D. Duffie, Security Markets: Stochastic Models. New York: Academic, 1988.
[31] ——, Dynamic Asset Pricing Theory, 2nd ed. Princeton, NJ: Princeton Univ. Press, 1996.
[32] J. C. Cox, S. A. Ross, and M. Rubinstein, "Option pricing: A simplified approach," J. Financial Economics, vol. 7, pp. 229–263, Oct. 1979.
[33] M. J. Brennan, E. S. Schwartz, and R. Lagnado, "Strategic asset allocation," J. Economic Dynamics Contr., vol. 21, pp. 1377–1403, 1997.
[34] F. Longstaff and E. Schwartz, "Valuing American options by simulation: A simple least squares approach," Rev. Financial Studies, 2001, to be published.
[35] B. Van Roy, "Temporal-difference learning and applications in finance," in Computational Finance 1999, Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo, and A. S. Weigend, Eds. Cambridge, MA: MIT Press, 2001, pp. 447–461.
[36] R. Neuneier, "Optimal asset allocation using adaptive dynamic programming," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996, vol. 8, pp. 952–958.
[37] R. Neuneier and O. Mihatsch, "Risk sensitive reinforcement learning," in Advances in Neural Information Processing Systems, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 1031–1037.
[38] J. N. Tsitsiklis and B. Van Roy, "Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives," IEEE Trans. Automat. Contr., vol. 44, pp. 1840–1851, Oct. 1999.
[39] ——, "Regression methods for pricing complex American-style options," IEEE Trans. Neural Networks, vol. 12, no. 4, pp. 694–703, July 2001.
[40] W. F. Sharpe, "Mutual fund performance," J. Business, pp. 119–138, Jan. 1966.
[41] H. M. Markowitz, Portfolio Selection: Efficient Diversification of Investments. New York: Wiley, 1959.
[42] F. A. Sortino and R. van der Meer, "Downside risk—capturing what's at stake in investment situations," J. Portfolio Management, vol. 17, pp. 27–31, 1991.
[43] D. Nawrocki, "Optimal algorithms and lower partial moment: Ex post results," Appl. Economics, vol. 23, pp. 465–470, 1991.
[44] ——, "The characteristics of portfolios selected by n-degree lower partial moment," Int. Rev. Financial Anal., vol. 1, pp. 195–209, 1992.
[45] F. A. Sortino and H. J. Forsey, "On the use and misuse of downside risk," J. Portfolio Management, vol. 22, pp. 35–42, 1996.
[46] D. Nawrocki, "A brief history of downside risk measures," J. Investing, pp. 9–26, Fall 1999.
[47] H. White, private communication, 1996.
[48] R. S. Sutton, "Temporal credit assignment in reinforcement learning," Ph.D. dissertation, Univ. Massachusetts, Amherst, 1984.
[49] L. C. Baird, "Advantage updating," Wright Laboratory, Wright-Patterson Air Force Base, OH, Tech. Rep. WL-TR-93-1146, 1993.
[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Exploration in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, ch. 8, pp. 310–362.
[51] P. J. Werbos, "Back-propagation through time: What it does and how to do it," Proc. IEEE, vol. 78, no. 10, pp. 1550–1560, Oct. 1990.
[52] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in IRE WESCON Convention Record, 1960, pp. 96–104.
[53] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, pp. 270–280, 1989.
[54] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4–27, 1990.
[55] A. L. Samuel, "Some studies in machine learning using the game of checkers," IBM J. Res. Development, vol. 3, pp. 211–229, 1959.
[56] ——, "Some studies in machine learning using the game of checkers. II—Recent progress," IBM J. Res. Development, vol. 11, pp. 601–617, 1967.
[57] J. Peng and R. J. Williams, "Efficient learning and planning within the Dyna framework," Adaptive Behavior, vol. 1, no. 4, pp. 437–454, 1993.
[58] A. W. Moore and C. G. Atkeson, "Prioritized sweeping: Reinforcement learning with less data and less real time," Machine Learning, vol. 13, pp. 103–130, 1993.
[59] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," in Proc. IEEE Conf. Decision Contr., 1998.
[60] ——, "Simulation-based optimization of Markov reward processes," IEEE Trans. Automat. Contr., vol. 46, pp. 191–209, Feb. 2001.
[61] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, T. K. Leen, S. A. Solla, and K.-R. Muller, Eds. Cambridge, MA: MIT Press, 2000, vol. 12, pp. 1057–1063.
[62] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, S. A. Solla, T. K. Leen, and K.-R. Muller, Eds. Cambridge, MA: MIT Press, 2000, vol. 12, pp. 1008–1014.
[63] G. Z. Grudic and L. H. Ungar, "Localizing policy gradient estimates to action transitions," in Proc. 17th Int. Conf. Machine Learning, 2000.
[64] L. Baird and A. Moore, "Gradient descent for general reinforcement learning," in Advances in Neural Information Processing Systems, S. A. Solla, M. S. Kearns, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 968–974.
[65] T. X. Brown, "Policy vs. value function learning with variable discount factors," in Proc. NIPS 2000 Workshop Reinforcement Learning: Learn the Policy or Learn the Value Function?, Dec. 2000.

John Moody received the B.A. degree in physics from the University of Chicago, Chicago, IL, in 1979 and the M.A. and Ph.D. degrees in theoretical physics from Princeton University, Princeton, NJ, in 1981 and 1984, respectively.
He is the Director of the Computational Finance Program and a Professor of Computer Science and Electrical Engineering at Oregon Graduate Institute of Science and Technology, Beaverton. He is also the Founder and President of Nonlinear Prediction Systems, a company specializing in the development of forecasting and trading systems. His research interests include computational finance, time series analysis, and machine learning.
Dr. Moody recently served as Program Co-Chair for Computational Finance 2000 in London, is a past General Chair and Program Chair of the Neural Information Processing Systems (NIPS) Conference, and is a member of the editorial board of Quantitative Finance.

Matthew Saffell received the B.Sc. degree in computer science and engineering with a minor in mathematics from LeTourneau University, Longview, TX, in 1992, and the M.Sc. degree in computer science and engineering from the University of Tennessee, in 1994. He is pursuing the Ph.D. degree in the Computer Science and Engineering Department at the Oregon Graduate Institute, Beaverton.
He is a Consulting Scientist at Nonlinear Prediction Systems.