
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 12, NO. 4, JULY 2001

Learning to Trade via Direct Reinforcement


John Moody and Matthew Saffell

Abstract—We present methods for optimizing portfolios, asset allocations, and trading systems based on direct reinforcement (DR). In this approach, investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL) for discovering investment policies. The need to build forecasting models is eliminated, and better trading performance is obtained. The direct reinforcement approach differs from dynamic programming and reinforcement algorithms such as TD-learning and Q-learning, which attempt to estimate a value function for the control problem. We find that the RRL direct reinforcement framework enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. We demonstrate how direct reinforcement can be used to optimize risk-adjusted investment returns (including the differential Sharpe ratio), while accounting for the effects of transaction costs. In extensive simulation work using real financial data, we find that our approach based on RRL produces better trading strategies than systems utilizing Q-Learning (a value function method). Real-world applications include an intra-daily currency trader and a monthly asset allocation system for the S&P 500 Stock Index and T-Bills.

Index Terms—Differential Sharpe ratio, direct reinforcement (DR), downside deviation, policy gradient, Q-learning, recurrent reinforcement learning, TD-learning, trading, risk, value function.

Manuscript received March 6, 2001; revised March 29, 2001. This work was supported by Nonlinear Prediction Systems and by DARPA under Contract DAAH01-96-C-R026 and AASERT Grant DAAH04-95-1-0485. The authors are with the Computational Finance Program, Oregon Graduate Institute of Science and Technology, Beaverton, OR 97006 USA, and also with Nonlinear Prediction Systems, Beaverton, OR 97006 USA. Publisher Item Identifier S 1045-9227(01)05010-X.

I. INTRODUCTION

THE investor's or trader's ultimate goal is to optimize some relevant measure of trading system performance, such as profit, economic utility, or risk-adjusted return. In this paper, we describe direct reinforcement (DR) methods to optimize investment performance criteria. Investment decision making is viewed as a stochastic control problem, and strategies are discovered directly. We present an adaptive algorithm called recurrent reinforcement learning (RRL). The need to build forecasting models is eliminated, and better trading performance is obtained. This methodology can be applied to optimizing systems designed to trade a single security, allocate assets or manage a portfolio.

Investment performance depends upon sequences of interdependent decisions, and is thus path-dependent. Optimal trading or portfolio rebalancing decisions require taking into account the current system state, which includes both market conditions and the currently held positions. Market frictions, the real-world costs of trading,¹ make arbitrarily frequent trades or large changes in portfolio composition prohibitively expensive. Thus, optimal decisions about establishing new positions must consider current positions held.

¹ Market frictions include taxes and a variety of transaction costs, such as commissions, bid/ask spreads, price slippage and market impact.

In [1] and [2], we proposed the RRL algorithm for DR. RRL is an adaptive policy search algorithm that can learn an investment strategy on-line. We demonstrated in those papers that Direct Reinforcement provides a more elegant and effective means for training trading systems and portfolio managers when market frictions are considered than do more standard supervised approaches.

In this paper, we contrast our DR (or "policy search") approach with commonly used value function based approaches. We use the term DR to refer to algorithms that do not have to learn a value function in order to derive a policy. DR methods date back to the pioneering work by Farley and Clark [3], [4], but have received little attention from the reinforcement learning community during the past two decades. Notable exceptions are Williams' REINFORCE algorithm [5], [6] and Baxter and Bartlett's recent work [7].²

² Baxter and Bartlett have independently proposed the term DR for policy gradient algorithms in a Markov decision process framework. We use the term in the same spirit, but perhaps more generally, to refer to any reinforcement learning algorithm that does not require learning a value function.

Methods such as dynamic programming [8], TD-Learning [9] or Q-Learning [10], [11] have been the focus of most of the modern research. These methods attempt to learn a value function or the closely related Q-function. Such value function methods are natural for problems like checkers or backgammon where immediate feedback on performance is not readily available at each point in time. Actor-critic methods [12], [13] have also received substantial attention. These algorithms are intermediate between DR and value function methods, in that the "critic" learns a value function which is then used to update the parameters of the "actor."³

³ For reviews and in-depth presentations of value function and actor-critic methods, see [16]–[18].

Though much theoretical progress has been made in recent years in the area of value function learning, there have been relatively few widely-cited, successful applications of the techniques. Notable examples include TD-gammon [19], [20], an elevator scheduler [21] and a space-shuttle payload scheduler [22]. Due to the inherently delayed feedback, these applications all use the TD-Learning or Q-Learning value function RL methods.

For many financial decision making problems, however, results accrue gradually over time, and one can immediately measure short-term performance. This enables use of a DR approach to provide immediate feedback to optimize the strategy.


One class of performance criteria frequently used in the financial community are measures of risk-adjusted investment returns. RRL can be used to learn trading strategies that balance the accumulation of return with the avoidance of risk. We describe commonly used measures of risk, and review how differential forms of the Sharpe ratio and downside deviation ratio can be formulated to enable efficient online learning with RRL.

We present empirical results for discovering tradeable structure in the U.S. Dollar/British Pound foreign exchange market via DR. In addition, we compare performance for an RRL-Trader and Q-Trader that learn switching strategies between the S&P 500 Stock Index and Treasury Bills. For both traders, the results demonstrate the presence of predictable structure in US stock prices over a 25-year test period. However, we find that the RRL-Trader performs substantially better than the Q-Trader. Relative to Q-Learning, we observe that RRL enables a simpler problem representation, avoids Bellman's curse of dimensionality and offers compelling advantages in efficiency. The S&P 500 and foreign exchange results were previously presented in [2], [23], and [24].

We discuss the relative merits of DR and value function learning, and provide arguments and examples for why value function based methods may result in unnatural problem representations. Our results suggest that DR offers a powerful alternative to reinforcement algorithms that learn a value function, for problem domains where immediate estimates of incremental performance can be obtained.

To conclude the introduction, we would like to note that computational finance offers many interesting and challenging potential applications of reinforcement learning methods. While our work emphasizes DR, most applications in finance to date have been based upon dynamic programming type methods. Elton and Gruber [28] provide an early survey of dynamic programming applications in finance. The problems of optimum consumption and portfolio choice in continuous time have been formulated by Merton [25]–[27] from the standpoints of dynamic programming and stochastic control. The extensive body of work on intertemporal (multi-period) portfolio management and asset pricing is reviewed by Breeden [29]. Duffie [30], [31] describes stochastic control and dynamic programming methods in finance in depth. Dynamic programming provides the basis of the Cox et al. [32] and other widely used binomial option pricing methods. See also the strategic asset allocation work of Brennan et al. [33]. Due to the curse of dimensionality, approximate dynamic programming is often required to solve practical problems, as in the work by Longstaff and Schwartz [34] on pricing American options.

During the past six years, there have been several applications that make use of value function reinforcement learning methods. Van Roy [35] uses a TD(λ) approach for valuing options and performing portfolio optimization. Neuneier [36] uses a Q-Learning approach to make asset allocation decisions, and Neuneier and Mihatsch [37] incorporate a notion of risk sensitivity into the construction of the Q-Function. Derivatives pricing applications have been studied by Tsitsiklis and Van Roy [38], [39]. Moody and Saffell compare DR to Q-Learning for asset allocation in [23], and explore the minimization of downside risk using DR in [24].

II. TRADING SYSTEMS AND PERFORMANCE CRITERIA

A. Structure of Trading Systems

In this paper, we consider agents that trade fixed position sizes in a single security. The methods described here can be generalized to more sophisticated agents that trade varying quantities of a security, allocate assets continuously or manage multiple asset portfolios. See [2] for a discussion of multiple asset portfolios.

Here, our traders are assumed to take only long, neutral, or short positions, F_t ∈ {−1, 0, 1}, of constant magnitude. A long position is initiated by purchasing some quantity of a security, while a short position is established by selling the security.⁴ The price series being traded is denoted z_t. The position F_t is established or maintained at the end of each time interval t, and is reassessed at the end of period t + 1. A trade is thus possible at the end of each time period, although nonzero trading costs will discourage excessive trading. A trading system return R_t is realized at the end of the time interval (t − 1, t] and includes the profit or loss resulting from the position F_{t−1} held during that interval and any transaction cost incurred at time t due to a difference in the positions F_{t−1} and F_t.

⁴ For stocks, a short sale involves borrowing shares and then selling the borrowed shares to a third party. A profit is made when the shorted shares are repurchased at a later time at a lower price. Short sales of many securities, including stocks, bonds, futures, options, and foreign exchange contracts, are commonplace.

In order to properly incorporate the effects of transactions costs, market impact and taxes in a trader's decision making, the trader must have internal state information and must therefore be recurrent. A single asset trading system that takes into account transactions costs and market impact has the following decision function:

    F_t = F(θ_t; F_{t−1}, I_t)   with   I_t = {z_t, z_{t−1}, z_{t−2}, …; y_t, y_{t−1}, y_{t−2}, …}   (1)

where θ_t denotes the (learned) system parameters at time t and I_t denotes the information set at time t, which includes present and past values of the price series z_t and an arbitrary number of other external variables denoted y_t. A simple example is a {long, short} trader with autoregressive inputs

    F_t = sign(u F_{t−1} + v_0 r_t + v_1 r_{t−1} + … + v_m r_{t−m} + w)   (2)

where r_t are the price returns of z_t (defined below) and the system parameters are the weights {u, v_i, w}. A trader of this form is used in the simulations described in Section IV-A.

The above formulation describes a discrete-action, deterministic trader, but can be easily generalized. One simple generalization is to use continuously valued F_t ∈ [−1, 1], for example, by replacing sign with tanh. When discrete values F_t ∈ {−1, 0, 1} are imposed, however, the decision function is not differentiable. Nonetheless, gradient based optimization methods for θ may be developed by considering differentiable prethresholded outputs or, for example, by replacing sign with tanh during learning and discretizing the outputs when trading.

Moreover, the models can be extended to a stochastic framework by including a noise variable ε_t in the decision function

    F_t = F(θ_t; F_{t−1}, I_t, ε_t)   with   ε_t ∼ p(ε)   (3)

The random variable ε_t induces a joint probability density for the discrete actions, model parameters and model inputs:

    p(F_t; θ_t, F_{t−1}, I_t)   (4)

The noise level (measured by σ_ε or more generally the scale of p(ε)) can be varied to control the "exploration vs. exploitation" behavior of the trader. Also, differentiability of the probability distribution of actions enables the straightforward application of gradient based learning methods.

B. Profit and Wealth for Trading Systems

Trading systems can be optimized by maximizing performance functions, U(), such as profit, wealth, utility functions of wealth or performance ratios like the Sharpe ratio. The simplest and most natural performance function for a risk-insensitive trader is profit.

Additive profits are appropriate to consider if each trade is for a fixed number of shares or contracts of security z_t. This is often the case, for example, when trading small stock or futures accounts or when trading standard US$ FX contracts in dollar-denominated foreign currencies. We define r_t = z_t − z_{t−1} and r_t^f = z_t^f − z_{t−1}^f as the price returns of a risky (traded) asset and a risk-free asset (like T-Bills), respectively, and denote the transactions cost rate as δ. The additive profit accumulated over T time periods with trading position size μ > 0 is then defined in terms of the trading returns, R_t, as:

    P_T = Σ_{t=1}^{T} R_t   where   R_t = μ { r_t^f + F_{t−1} (r_t − r_t^f) − δ |F_t − F_{t−1}| }   (5)

with P_0 = 0 and typically F_T = F_0 = 0. When the risk-free rate of interest is ignored (r_t^f = 0), a simplified expression is obtained

    R_t = μ { F_{t−1} r_t − δ |F_t − F_{t−1}| }   (6)

The wealth of the trader is defined as W_T = W_0 + P_T.

Multiplicative profits are appropriate when a fixed fraction of accumulated wealth ν > 0 is invested in each long or short trade. Here, r_t = (z_t / z_{t−1} − 1) and r_t^f = (z_t^f / z_{t−1}^f − 1). If no short sales are allowed and the leverage factor is set fixed at ν = 1, the wealth at time T is

    W_T = W_0 Π_{t=1}^{T} (1 + R_t)   where   (1 + R_t) = (1 + r_t^f + F_{t−1} (r_t − r_t^f)) (1 − δ |F_t − F_{t−1}|)   (7)

When the risk-free rate of interest is ignored (r_t^f = 0), a second simplified expression is obtained

    (1 + R_t) = (1 + F_{t−1} r_t) (1 − δ |F_t − F_{t−1}|)   (8)

Relaxing the constant magnitude assumption is more realistic for asset allocations and portfolios, and enables better risk control. Related expressions for portfolios are presented in [2].
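The trading returns in (5)–(8) can be computed directly from the price and position series. The following sketch (illustrative; the default cost rate is arbitrary) implements the simplified additive form (6) and multiplicative form (8) with the risk-free return set to zero.

    def trading_returns(prices, positions, delta=0.005, mu=1.0, additive=True):
        # Trading returns R_t for a single-asset trader, per the simplified forms (6) and (8).
        # prices:    price series z_t (length T+1)
        # positions: F_t in {-1, 0, 1} (length T+1, with F_0 included)
        # delta:     transaction cost rate; mu: fixed trade size (additive case only)
        R = []
        for t in range(1, len(prices)):
            cost = delta * abs(positions[t] - positions[t - 1])
            if additive:      # eq. (6): R_t = mu * (F_{t-1} * (z_t - z_{t-1}) - cost)
                R.append(mu * (positions[t - 1] * (prices[t] - prices[t - 1]) - cost))
            else:             # eq. (8): 1 + R_t = (1 + F_{t-1} * r_t) * (1 - cost)
                r_t = prices[t] / prices[t - 1] - 1.0
                R.append((1.0 + positions[t - 1] * r_t) * (1.0 - cost) - 1.0)
        return R

    # Additive profit: P_T = sum(R).  Multiplicative wealth: W_T = W_0 * prod(1 + R_t for R_t in R).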
C. Performance Criteria

In general, the performance criteria that we consider are functions of profit or wealth U(W_T) after a sequence of T time steps, or more generally of the whole time sequence of trades

    U_T = U(W_1, W_2, …, W_T; W_0)   (9)

The simple form U(W_T) includes standard economic utility functions. The second case is the general form for path-dependent performance functions, which include inter-temporal utility functions and performance ratios like the Sharpe ratio and Sterling ratio. In either case, the performance criterion at time T can be expressed as a function of the sequence of trading returns

    U_T = U(R_1, R_2, …, R_T; W_0)   (10)

For brevity, we denote this general form by U_T.

For optimizing our traders, we will be interested in the marginal increase in performance due to return R_t at each time step

    D_t ≡ ΔU_t = U_t − U_{t−1}   (11)

Note that U_t depends upon the current trading return R_t, but that U_{t−1} does not. Our strategy will be to derive differential performance criteria D_t that capture the marginal "utility" of the trading return R_t at each period.⁵

⁵ Strictly speaking, many of the performance criteria commonly used in the financial industry are not true utility functions, so we use the term "utility" in a more colloquial sense.

D. The Differential Sharpe Ratio

Rather than maximizing profits, most modern fund managers attempt to maximize risk-adjusted return, as suggested by modern portfolio theory. The Sharpe ratio is the most widely-used measure of risk-adjusted return [40]. Denoting as before the trading system returns for period t (including transactions costs) as R_t, the Sharpe ratio is defined to be

    S_T = Average(R_t) / Standard Deviation(R_t)   (12)

where the average and standard deviation are estimated for periods t = {1, …, T}. Note that for ease of exposition and analysis, we have suppressed inclusion of portfolio returns due to the risk free rate on capital R^f. Substituting excess returns R̃_t = R_t − R_t^f for R_t in the equation above produces the standard definition.

With this caveat in mind, we use (12) for discussion purposes without loss of mathematical generality.⁶

⁶ For systems that trade futures and forwards, R should be used in place of R̃, because the risk free rate is already accounted for in the relation between forward prices and spot prices.

Proper on-line learning requires that we compute the influence on the Sharpe ratio (marginal utility D_t) of the trading return R_t at time t. To accomplish this, we have derived a new objective function called the differential Sharpe ratio for on-line optimization of trading system performance [1], [2]. It is obtained by considering exponential moving averages of the returns and standard deviation of returns in (12), and expanding to first order in the adaptation rate η

    S_t ≈ S_{t−1} + η (dS_t / dη)|_{η=0} + O(η²)   (13)

Note that a zero adaptation rate corresponds to an infinite time average. Expanding about η = 0 amounts to "turning on" the adaptation.

Since only the first order term in this expansion depends upon the return R_t at time t, we define the differential Sharpe ratio as

    D_t ≡ dS_t / dη = (B_{t−1} ΔA_t − (1/2) A_{t−1} ΔB_t) / (B_{t−1} − A_{t−1}²)^{3/2}   (14)

where the quantities A_t and B_t are exponential moving estimates of the first and second moments of R_t

    A_t = A_{t−1} + η ΔA_t = A_{t−1} + η (R_t − A_{t−1})
    B_t = B_{t−1} + η ΔB_t = B_{t−1} + η (R_t² − B_{t−1})   (15)

Treating A_{t−1} and B_{t−1} as numerical constants, note that η in the update equations controls the magnitude of the influence of the return R_t on the Sharpe ratio S_t. Hence, the differential Sharpe ratio represents the influence of the trading return R_t realized at time t on S_t. It is the marginal utility for the Sharpe ratio criterion.

The influences of risk and return on the differential Sharpe ratio are readily apparent. The current return R_t enters expression (14) only in the numerator through ΔA_t = R_t − A_{t−1} and ΔB_t = R_t² − B_{t−1}. The first term in the numerator is positive if R_t exceeds the moving average of past returns A_{t−1} (increased reward), while the second term is negative if R_t² exceeds the moving average of past squared returns B_{t−1} (increased risk).

The differential Sharpe ratio is used in the RRL algorithm (see (31) in Section III) as the current contribution to the performance function U_t. Since S_{t−1} in (13) does not depend on R_t, we have dS_t/dR_t = η dD_t/dR_t. When optimizing the trading system using (14), the relevant derivatives have the simple form:

    dD_t / dR_t = (B_{t−1} − A_{t−1} R_t) / (B_{t−1} − A_{t−1}²)^{3/2}   (16)
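A minimal sketch of one recursive update of the differential Sharpe ratio, following (14)–(16); it assumes A and B have been initialized from a short warm-up period so that the denominator B − A² is positive, an initialization choice that is not specified here.

    def differential_sharpe_step(R_t, A_prev, B_prev, eta=0.01):
        # One update of the differential Sharpe ratio, eqs. (14)-(16).
        dA = R_t - A_prev                                  # Delta A_t
        dB = R_t ** 2 - B_prev                             # Delta B_t
        denom = (B_prev - A_prev ** 2) ** 1.5
        D_t = (B_prev * dA - 0.5 * A_prev * dB) / denom    # differential Sharpe ratio, eq. (14)
        dD_dR = (B_prev - A_prev * R_t) / denom            # derivative used by RRL, eq. (16)
        A_t = A_prev + eta * dA                            # moving moment updates, eq. (15)
        B_t = B_prev + eta * dB
        return D_t, dD_dR, A_t, B_t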
The differential Sharpe ratio has several attractive properties:

• Facilitates recursive updating: The incremental nature of the calculations of A_t and B_t make updating the exponential moving Sharpe ratio straightforward. It is not necessary to recompute the average and standard deviation of returns for the entire trading history in order to update the Sharpe ratio for the most recent time period.

• Enables efficient on-line optimization: D_t and dD_t/dR_t can be cheaply calculated using the previously computed moving averages A_{t−1} and B_{t−1} and the current return R_t. This enables efficient stochastic optimization.

• Weights recent returns more: Based on the exponential moving average Sharpe ratio, recent returns receive stronger weightings in D_t than do older returns.

• Provides interpretability: The differential Sharpe ratio isolates the contribution of the current return R_t to the exponential moving average Sharpe ratio. The simple form of D_t makes clear how risk and reward affect the Sharpe ratio.

One difficulty with the Sharpe ratio, however, is that the use of variance or R_t² as a risk measure does not distinguish between upside and downside "risk." Assuming that A_{t−1} > 0, the largest possible improvement in S_t occurs when

    R_t = B_{t−1} / A_{t−1}   (17)

Thus, the Sharpe ratio actually penalizes gains larger than B_{t−1}/A_{t−1}, which is counter-intuitive relative to most investors' notions of risk and reward.

E. Downside Risk

Symmetric measures of risk such as variance are more and more being viewed as inadequate measures due to the asymmetric preferences of most investors to price changes. Few investors consider large positive returns to be "risky," though both large positive as well as negative returns are penalized using a symmetric measure of risk such as the variance. To most investors, the term "risk" refers intuitively to returns in a portfolio that decrease its profitability.

Markowitz, the father of modern portfolio theory, understood this. Even though most of his work focussed on the mean-variance framework for portfolio optimization, he proposed the semivariance as a means for dealing with downside returns [41]. After a long hiatus lasting three decades, there is now a vigorous industry in the financial community in modeling and minimizing downside risk. Criteria of interest include the downside deviation (DD), the second lower partial moment (SLPM) and the nth lower partial moment [42]–[46].

One measure of risk-adjusted performance widely used in the professional fund management community (especially for hedge funds) is the Sterling ratio, commonly defined as

    Sterling Ratio = Annualized Average Return / Maximum Drawn-Down   (18)

Here, the maximum draw-down (from peak to trough) in account equity or net asset value is defined relative to some standard reference period, for example one to three years.

Minimizing drawdowns is somewhat cumbersome, so we focus on the DD as a measure of downside risk in this paper.⁷

⁷ White has found that the DD tracks the Sterling ratio effectively [47].

The DD is defined to be the square root of the average of the square of the negative returns

    DD_T = { (1/T) Σ_{t=1}^{T} min(R_t, 0)² }^{1/2}   (19)

Using the DD as a measure of risk, we can now define a utility function similar to the Sharpe ratio, which we will call the downside deviation ratio (DDR)

    DDR_T = Average(R_t) / DD_T   (20)

The DDR rewards the presence of large average positive returns and penalizes risky returns, where "risky" now refers to downside returns.

In order to facilitate the use of our recurrent reinforcement learning algorithm (Section III), we need to compute the influence of the return R_t at time t on the DDR. In a similar manner to the development of the differential Sharpe ratio in [2], we define exponential moving averages of returns and of the squared DD

    A_t = A_{t−1} + η (R_t − A_{t−1})
    DD_t² = DD_{t−1}² + η (min(R_t, 0)² − DD_{t−1}²)   (21)

and define the DDR in terms of these moving averages. We obtain our performance function by considering a first-order expansion in the adaptation rate η of the DDR

    DDR_t ≈ DDR_{t−1} + η (dDDR_t / dη)|_{η=0} + O(η²)   (22)

We define the first-order term dDDR_t/dη to be the differential downside deviation ratio. It has the form

    D_t ≡ (dDDR_t / dη)|_{η=0}   (23)

        = (R_t − (1/2) A_{t−1}) / DD_{t−1}                                      for R_t > 0
        = (DD_{t−1}² (R_t − (1/2) A_{t−1}) − (1/2) A_{t−1} R_t²) / DD_{t−1}³    for R_t ≤ 0   (24)

From (24) it is obvious that when R_t > 0, the utility increases as R_t increases, with no penalty for large positive returns such as exists when using variance as the risk measure. See [24] for detailed experimental results on the use of the DDR to build RRL trading systems.
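The analogous on-line computation for the downside deviation ratio, following (21)–(24), can be sketched as below; it assumes A and DD² have been initialized from a short warm-up period so that DD > 0.

    def differential_ddr_step(R_t, A_prev, DD2_prev, eta=0.01):
        # One update of the differential downside deviation ratio, eqs. (21)-(24).
        DD_prev = DD2_prev ** 0.5
        if R_t > 0:     # upside returns carry no risk penalty
            D_t = (R_t - 0.5 * A_prev) / DD_prev
        else:           # downside returns, second case of eq. (24)
            D_t = (DD2_prev * (R_t - 0.5 * A_prev) - 0.5 * A_prev * R_t ** 2) / DD_prev ** 3
        A_t = A_prev + eta * (R_t - A_prev)                          # moving averages, eq. (21)
        DD2_t = DD2_prev + eta * (min(R_t, 0.0) ** 2 - DD2_prev)
        return D_t, A_t, DD2_t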

III. LEARNING TO TRADE

Reinforcement learning adjusts the parameters of a system to maximize the expected payoff or reward that is generated due to the actions of the system. This is accomplished through trial and error exploration of the environment and space of strategies. In contrast to supervised learning, the system is not presented with examples of desired actions. Rather, it receives a reinforcement signal from its environment (a reward) that provides information on whether its actions are good or bad.

In [1], [2], we compared supervised learning to our DR approach. The supervised methods discussed included trading based upon forecasts of market prices and training a trader using labeled data. In both supervised frameworks, difficulties are encountered when transaction costs are included. While supervised learning methods can be effective for solving the structural credit assignment problem, they do not typically address the temporal credit assignment problem.

Structural credit assignment refers to the problem of assigning credit to the individual parameters of a system. If the reward produced also depends on a series of actions of the system, then the temporal credit assignment problem is encountered, i.e., assigning credit to the individual actions taken over time [48]. Reinforcement learning algorithms offer advantages over supervised methods by attempting to solve both problems simultaneously.

Reinforcement learning algorithms can be classified as either DR (sometimes called "policy search"), value function, or actor-critic methods. The choice of the best method depends upon the nature of the problem domain. We will discuss this issue in greater detail in Section V. In this section, we present the recurrent reinforcement learning algorithm for DR and review value function based methods, specifically Q-learning [10] and a refinement of Q-learning called advantage updating [49]. In Section IV-C, we compare the RRL and value function methods for systems that learn to allocate assets between the S&P 500 stock index and T-Bills.

A. Recurrent Reinforcement Learning

In this section, we describe the recurrent reinforcement learning algorithm for DR. This algorithm was originally presented in [1] and [2].

Given a trading system model F_t(θ), the goal is to adjust the parameters θ in order to maximize U_T. For traders of form (1) and trading returns of form (6) or (8), the gradient of U_T with respect to the parameters θ of the system after a sequence of T periods is

    dU_T(θ)/dθ = Σ_{t=1}^{T} (dU_T/dR_t) { (dR_t/dF_t) (dF_t/dθ) + (dR_t/dF_{t−1}) (dF_{t−1}/dθ) }   (25)

The system can be optimized in batch mode by repeatedly computing the value of U_T on forward passes through the data and adjusting the trading system parameters by using gradient ascent (with learning rate ρ)

    Δθ = ρ dU_T(θ)/dθ   (26)

or some other optimization method. Note that due to the inherent recurrence, the quantities dF_t/dθ are total derivatives that depend upon the entire sequence of previous time periods. To correctly compute and optimize these total derivatives in an efficient manner requires an approach similar to backpropagation through time (BPTT) [50], [51].

The temporal dependencies in a sequence of decisions are accounted for through a recursive update equation for the parameter gradients

    dF_t/dθ = ∂F_t/∂θ + (∂F_t/∂F_{t−1}) (dF_{t−1}/dθ)   (27)

The above expressions (25) and (27) assume differentiability of F_t. For the long/short traders with thresholds described in Section II-A, the reinforcement signal can be backpropagated through the prethresholded outputs in a manner similar to the Adaline learning rule [52]. Equations (25)–(27) constitute the batch RRL algorithm.

There are two ways in which the batch algorithm described above can be extended into a stochastic framework. First, exploration of the strategy space can be induced by incorporating a noise variable ε, as in the stochastic trader formulation of (3). The tradeoff between exploration of the strategy space and exploitation of a learned policy can be controlled by the magnitude of the noise variance σ_ε. The noise magnitude can be annealed over time during simulation, in order to arrive at a good strategy.

Second, a simple on-line stochastic optimization can be obtained by considering only the term in (25) that depends on the most recently realized return R_t during a forward pass through the data

    dU_t(θ)/dθ ≈ (dU_t/dR_t) { (dR_t/dF_t) (dF_t/dθ) + (dR_t/dF_{t−1}) (dF_{t−1}/dθ) }   (28)

The parameters are then updated on-line using

    Δθ_t = ρ dU_t(θ_t)/dθ_t   (29)

Such an algorithm performs a stochastic optimization, since the system parameters θ_t are varied during each forward pass through the training data. The stochastic, on-line analog of (27) is

    dF_t/dθ_t ≈ ∂F_t/∂θ_t + (∂F_t/∂F_{t−1}) (dF_{t−1}/dθ_{t−1})   (30)

Equations (28)–(30) constitute the stochastic (or adaptive) RRL algorithm. It is a reinforcement algorithm closely related to recurrent supervised algorithms such as real time recurrent learning (RTRL) [53] and dynamic backpropagation [54]. See also the discussion of backpropagating utility in Werbos [14].

For differential performance criteria described in (11) of Section II-C (such as the differential Sharpe ratio (14) and differential downside deviation ratio (24)), the stochastic update equations (28) and (29) become

    Δθ_t = ρ (dD_t/dR_t) { (dR_t/dF_t) (dF_t/dθ_t) + (dR_t/dF_{t−1}) (dF_{t−1}/dθ_{t−1}) }   (31)

We use on-line algorithms of this recurrent reinforcement learning type in the simulations presented in Section IV. Note that we find that use of a noise variable ε provides little advantage for the real financial applications that we consider, since the data series contain significant intrinsic noise. Hence, we find that a simple "greedy" update is adequate.⁸

⁸ Tesauro finds a similar result for TD-Gammon [19], [20]. A "greedy" update works well, because the dice rolls in the game provided enough uncertainty to induce extensive strategy exploration.

The above description of the RRL algorithm is for traders that optimize immediate estimates of performance D_t for specific actions taken. This presentation can be thought of as a special case of a more general Markov decision process (MDP) and policy gradient formulation. One straightforward extension of our formulation can be obtained for traders that maximize discounted future rewards. We have experimented with this approach, but found little advantage for the problems we consider. A second extension to the formulation is to consider a stochastic trader (3) and an expected reward framework, for which the probability distribution of actions is differentiable. This latter approach makes use of the joint density of (4). While the expected reward framework is appealing from a theoretical perspective, (28)–(30) presented above provide the practical basis for simulations.

Although we have focussed our discussion on traders of a single risky asset with scalar F_t, the algorithms described in this section can be trivially generalized to the vector case for portfolios. Optimization of portfolios is described in [1], [2].
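Putting the pieces together, a compact sketch of one adaptive RRL step in the spirit of (28)–(31), for the tanh trader of (2) trading the additive returns (6) with the differential Sharpe ratio as D_t. The continuous (undiscretized) position during learning and the particular rates used are illustrative choices, not the implementation used in the experiments below.

    import numpy as np

    def rrl_online_step(theta, dF_prev, F_prev, recent_returns, r_t, A, B,
                        eta=0.01, rho=0.1, delta=0.005, mu=1.0):
        # One adaptive RRL step in the spirit of eqs. (28)-(31).
        # theta = [u, v_0, ..., v_m, w];  dF_prev = dF_{t-1}/dtheta from the previous call.
        x = np.concatenate(([F_prev], recent_returns, [1.0]))   # recurrent input, returns, bias
        F_t = np.tanh(np.dot(theta, x))                          # continuous position in [-1, 1]

        # Trading return (6) and its derivatives w.r.t. the current and previous positions.
        dpos = F_t - F_prev
        R_t = mu * (F_prev * r_t - delta * abs(dpos))
        dR_dF = -mu * delta * np.sign(dpos)
        dR_dFprev = mu * r_t + mu * delta * np.sign(dpos)

        # Differential Sharpe ratio derivative dD_t/dR_t, eq. (16).
        dD_dR = (B - A * R_t) / (B - A ** 2) ** 1.5

        # Recurrent gradient, eq. (30); theta[0] is the recurrent weight u.
        dF_dtheta = (1.0 - F_t ** 2) * (x + theta[0] * dF_prev)

        # Parameter update, eq. (31), followed by the moving-moment updates, eq. (15).
        theta = theta + rho * dD_dR * (dR_dF * dF_dtheta + dR_dFprev * dF_prev)
        A = A + eta * (R_t - A)
        B = B + eta * (R_t ** 2 - B)
        return theta, dF_dtheta, F_t, A, B, R_t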
B. Value Functions and Q-Learning

Besides explicitly training a trader to take actions, we can also implicitly learn correct actions through the technique of value iteration. Value iteration uses a value function to evaluate and improve policies (see [16] for a tutorial introduction and [18] for a full overview of these algorithms). The value function, V^π(x), is an estimate of discounted future rewards that will be received from starting in state x, and by following the policy π thereafter. The value function satisfies Bellman's equation

    V^π(x) = Σ_a π(x, a) Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V^π(x') ]   (32)

where
    π(x, a)     is the probability of taking action a in state x;
    P^a_{xx'}   is the probability of transitioning from state x to state x' when taking action a;
    R^a_{xx'}   is the immediate reward [differential utility, as in (11)] from taking action a and transitioning from state x to state x';
    γ           is the discount factor that weighs the importance of future rewards versus immediate rewards.

A policy is an optimal policy if its value function is greater than or equal to the value functions of all other policies for a given set of states.

The optimal value function is defined as

    V*(x) = max_π V^π(x)   (33)

and satisfies Bellman's optimality equation

    V*(x) = max_a Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V*(x') ]   (34)

The value iteration update

    V(x) ← max_a Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V(x') ]   (35)

is guaranteed to converge to the optimal value function under certain general conditions. The optimal policy can be determined from the optimal value function through

    π*(x) = arg max_a Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ V*(x') ]   (36)

1) Q-Learning: The technique named Q-Learning [10] uses a value function which estimates future rewards based on both the current state and the current action taken. We can write the Q-function version of Bellman's optimality equation as

    Q*(x, a) = Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ max_{a'} Q*(x', a') ]   (37)

Similarly to (35), the Q-function can be learned using a value iteration approach

    Q(x, a) ← Σ_{x'} P^a_{xx'} [ R^a_{xx'} + γ max_{a'} Q(x', a') ]   (38)

This iteration has been shown [10] to converge to the optimal Q-function, Q*, given certain constraints. The advantage of using the Q-function is that there is no need to know the system model P^a_{xx'} as in (36) in order to choose the best action. One calculates the best action as

    a* = arg max_a Q*(x, a)   (39)

The update rule for training a function approximator is then based on the gradient of the error

    (1/2) [ R^a_{xx'} + γ max_{a'} Q(x', a') − Q(x, a; θ) ]²   (40)
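For contrast with RRL, a bare-bones sketch of tabular Q-learning in the spirit of (38)–(39). A sampled, ε-greedy update toward r + γ max_{a'} Q(x', a') stands in for the full expectation over transition probabilities, which is the usual practical form; env_step is a hypothetical environment interface, and this is not the Q-Trader implementation used in Section IV-C.

    import random
    from collections import defaultdict

    def q_learning(env_step, states, actions, episodes=100, steps=100,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        # Sampled Q-learning in the spirit of the value-iteration update (38).
        # env_step(x, a) -> (reward, next_state) is a hypothetical environment interface.
        Q = defaultdict(float)
        for _ in range(episodes):
            x = random.choice(states)
            for _ in range(steps):
                if random.random() < epsilon:              # epsilon-greedy exploration
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(x, act)])
                r, x_next = env_step(x, a)
                target = r + gamma * max(Q[(x_next, act)] for act in actions)
                Q[(x, a)] += alpha * (target - Q[(x, a)])  # move Q(x, a) toward the target
                x = x_next
        # Greedy policy, eq. (39): a*(x) = argmax_a Q(x, a).
        return Q, {x: max(actions, key=lambda act: Q[(x, act)]) for x in states}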
(see Fig. 1). For this problem, the RRL trader attempts to avoid
2) Advantage Updating: A refinement of the Q-learning al- downside risk by maximizing the downside deviation ratio.
gorithm is provided by advantage updating [49]. Advantage up- Finally, we compare the performance of traders based on RRL
dating was developed specifically to deal with continuous-time and Q-learning for a second real-world problem, trading the
reinforcement learning problems, though it is applicable to the monthly S&P 500 stock index. Over the 25-year test period, we
discrete-time case as well. It is designed to deal with the situa- find that the RRL-Trader outperforms the Q-Trader, and that
tion where the relative advantages of individual actions within both outperform a buy and hold strategy. Further discussion of
a state are small compared to the relative advantages of being in the Q-Trader versus RRL-Trader performance is presented in
different states. Also, advantage updating has been shown to be Section V-D.
able to learn at a much faster rate than Q-learning in the pres-
ence of noise. A. Trader Simulation
Advantage updating learns two separate functions: the advan- In this section we demonstrate the use of the RRL algorithm
tage function , and the value function . The advan- to optimize trading behavior using the differential Sharpe ratio
tage function measures the relative change in value of choosing (14) in the presence of transaction costs. More extensive results

There, we find that maximizing the differential Sharpe ratio yields more consistent results than maximizing profits, and that both methods outperform trading systems based on forecasts.

The RRL-Traders studied here take {long, short} positions F_t ∈ {−1, 1} and have recurrent state similar to that described in Section II-A. To enable controlled experiments, the data used in this section are artificial price series that are designed to have tradeable structure. These experiments demonstrate that 1) RRL is an effective means of learning trading strategies and 2) trading frequency is reduced as expected as transaction costs increase.

Fig. 2. Artificial prices (top panel), trading signals (second panel), cumulative sums of profits (third panel) and the moving average Sharpe ratio with η = 0.01 (bottom panel). The system performs poorly while learning from scratch during the first 2000 time periods, but its performance remains good thereafter.

Fig. 3. An expanded view of the last thousand time periods of Fig. 2. The exponential moving Sharpe ratio has a forgetting time scale of 1/η = 100 periods. A smaller η would smooth the fluctuations out.

1) Data: We generate log price series as random walks with autoregressive trend processes. The two parameter model is thus

    p_t = p_{t−1} + β_{t−1} + k ε_t   (42)
    β_t = α β_{t−1} + ν_t   (43)

where α and k are constants, and ε_t and ν_t are normal random deviates with zero mean and unit variance. We define the artificial price series as

    z_t = exp(p_t / R)   (44)

where R is a scale defined as the range of p_t, max(p_t) − min(p_t), over a simulation with 10 000 samples.⁹

⁹ This is slightly more than the number of hours in a year (8760), so the series could be thought of as representing hourly prices in a 24-hour artificial market. Alternatively, a series of this length could represent slightly less than five years of hourly data in a market that trades about 40 hours per week.

For the results we present here, we set the parameters of the price process to α = 0.9 and k = 3. The artificial price series are trending on short time scales and have a high level of noise. A realization of the artificial price series is shown in the top panel of Fig. 2.
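A sketch of the artificial price process (42)–(44); the random seed and the single-pass generation shown here are illustrative choices.

    import numpy as np

    def artificial_prices(T=10000, alpha=0.9, k=3.0, seed=0):
        # Log-price random walk with an autoregressive trend, eqs. (42)-(44).
        rng = np.random.default_rng(seed)
        p = np.zeros(T)
        beta = 0.0
        for t in range(1, T):
            eps, nu = rng.standard_normal(2)
            p[t] = p[t - 1] + beta + k * eps               # eq. (42), with beta holding beta_{t-1}
            beta = alpha * beta + nu                       # eq. (43)
        R = p.max() - p.min()                              # scale: the range of p_t
        return np.exp(p / R)                               # eq. (44)

    z = artificial_prices()
    r = z[1:] - z[:-1]                                     # price returns used as trader inputs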
Fig. 4. Histograms of the price changes (top), trading profits per time period (middle) and Sharpe ratios (bottom) for the simulation shown in Fig. 2. The left column is for the first 5000 time periods, and the right column is for the last 5000 time periods. The transient effects during the first 2000 time periods for the real-time recurrent learning are evident in the lower left graph.

2) Simulated Trading Results: Figs. 2–4 show results for a single simulation for an artificial market as described above. For these experiments, the RRL-Traders are single threshold units with an autoregressive input representation. The inputs at time t are constructed using the previous eight returns.

The RRL-Traders are initialized randomly at the beginning, and adapted using real-time recurrent learning to optimize the differential Sharpe ratio (14). The transaction costs are fixed at a half percent during the whole real-time learning and trading process. Transient effects of the initial learning period can be seen in the first 2000 time steps of Fig. 2 and in the distribution of differential Sharpe ratios in the lower left panel of Fig. 4.

Fig. 5 shows box plots summarizing test performances for ensembles of 100 experiments. In these simulations, the 10 000 data samples are partitioned into an initial training set consisting of the first 1000 samples and a subsequent test data set containing the last 9000 samples.

The RRL-Traders are first optimized on the training data set for 100 epochs and adapted on-line throughout the whole test data set. Each trial has different realizations of the artificial price process and different randomly chosen initial trader parameter values. We vary the transaction cost from 0.2%, 0.5%, to 1%, and observe the trading frequency, cumulative profit and Sharpe ratio over the test data set. As shown, in all 100 experiments, positive Sharpe ratios are obtained. As expected, trading frequency is reduced as transaction costs increase.

B. U.S. Dollar/British Pound Foreign Exchange Trading System

A {long, short, neutral} trading system is trained on half-hourly U.S. Dollar/British Pound foreign exchange (FX) rate data. The experiments described in this section were first reported in [24]. The dataset used here consists of the first 8 months of quotes from the 24-hour, five-days-a-week foreign exchange market during 1996.¹⁰ Both bid and ask prices are in the dataset, and the trading system is required to incur the transaction costs of trading through the bid/ask prices. The trader is trained via the RRL algorithm to maximize the differential downside deviation ratio (24), a measure of risk-adjusted return.

¹⁰ The data is part of the Olsen & Associates HFDF96 dataset, obtainable by contacting www.olsen.ch.

The top panel in Fig. 6 shows the U.S. Dollar/British Pound price series for the eight-month period. The trading system is initially trained on the first 2000 data points, and then produces trading signals for the next two-week period (480 data points). The training window is then shifted forward to include the data just tested on, the system is retrained, and its trading signals are recorded for the next two-week out-of-sample time period. This process for generating out-of-sample trading signals continues for the rest of the data set.
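The walk-forward protocol just described can be written schematically as follows; train_rrl and trade are hypothetical stand-ins for the RRL optimization and signal-generation steps, not functions from the paper.

    def rolling_out_of_sample(prices, train_len=2000, test_len=480, train_rrl=None, trade=None):
        # Walk-forward protocol: train on a window, record signals for the next
        # two weeks of half-hourly data (480 points), then shift the window and retrain.
        signals, start, theta = [], 0, None
        while start + train_len + test_len <= len(prices):
            train_window = prices[start:start + train_len]
            test_window = prices[start + train_len:start + train_len + test_len]
            theta = train_rrl(train_window, init=theta)    # retrain, warm-started from the last fit
            signals.extend(trade(theta, test_window))      # out-of-sample trading signals
            start += test_len
        return signals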
The second panel in Fig. 6 shows the out-of-sample trading signal produced by the trading system, and the third panel displays the equity curve achieved by the trader. The bottom panel shows a moving average calculation of the Sharpe ratio over the trading period with a time constant of 0.01. The trading system achieves an annualized 15% return with an annualized Sharpe ratio of 2.3 over the approximately six-month long test period. On average, the system makes a trade once every five hours.

These FX simulations demonstrate the ability of the RRL algorithm to discover structure in a real-world financial price series. However, one must be cautious when extrapolating from simulated performance to what can be achieved in actual real-time trading. One problem is that the data set consists of indicative quotes which are not necessarily representative of the price at which the system would have actually been able to transact. A related possibility is that the system is discovering market microstructure effects that are not actually tradeable in real-time. Also, the simulation assumes that the pound is tradeable 24 hours a day during the five-day trading week. Certainly a real-time trading system will suffer additional penalties when trying to trade during off-peak, low liquidity trading times. An accurate test of the trading system would require live trading with a foreign exchange broker or directly through the interbank FX market in order to verify real time transactable prices and profitability.

Fig. 5. Boxplots of trading frequency, cumulative sums of profits and Sharpe ratios versus transaction costs. The results are obtained over 100 trials with various realizations of artificial data and initial system parameters. Increased transaction costs reduce trading frequency, profits and Sharpe ratio, as expected. The trading frequency is the percentage of the number of time periods during which trades occur. All figures are computed on the last 9000 points in the data set.

Fig. 6. {Long, short, neutral} trading system of the U.S. Dollar/British Pound that uses the bid/ask spread as transaction costs. The data consists of half-hourly quotes for the five-day-per-week, 24-hour interbank FX market. The time period shown is the first eight months of 1996. The trader is optimized via recurrent reinforcement learning to maximize the differential downside deviation ratio. The first 2000 data points (approximately two months) are used for training and validation. The trading system achieves an annualized 15% return with an annualized Sharpe ratio of 2.3 over the approximately six-month out-of-sample test period. On average, the system makes a trade once every five hours.

Fig. 7. Time series that influence the return attainable by the S&P 500/TBill asset allocation system. The top panel shows the S&P 500 series with and without
dividends reinvested. The bottom panel shows the annualized monthly Treasury Bill and S&P 500 dividend yields.

C. S&P 500/T-Bill Asset Allocation

In this section we compare the use of recurrent reinforcement learning to the advantage updating formulation of the Q-learning algorithm for building a trading system. These comparative results were presented previously at NIPS*98 [23]. The long/short trading systems trade the S&P 500 stock index, in effect allocating assets between the S&P 500 and three-month Treasury Bills. When the traders are long the S&P 500, no T-Bill interest is earned, but when the traders are short stocks (using standard 2:1 leverage), they earn twice the T-Bill rate. We use the advantage updating refinement instead of the standard Q-Learning algorithm, because we found it to yield better trading results. See Section III-B.2 for a description of the representational advantages of the approach.

The S&P 500 target series is the total return index computed monthly by reinvesting dividends.
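Under the stated convention, the monthly strategy return before transaction costs can be sketched as below. The treatment of the short side (twice the T-Bill rate minus the stock return) is one reading of the 2:1 leverage description above, not a formula given in the paper.

    def monthly_strategy_return(position, prev_position, r_sp500, r_tbill, delta=0.005):
        # Monthly return of the long/short S&P 500 / T-Bill allocation system.
        # position = +1: long stocks, earn the S&P 500 total return, no T-Bill interest.
        # position = -1: short stocks at 2:1 leverage, earn twice the T-Bill rate
        #                minus the S&P 500 total return (an assumed reading of the convention).
        gross = r_sp500 if position > 0 else 2.0 * r_tbill - r_sp500
        return gross - delta * abs(position - prev_position)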

The S&P 500 indexes with and without dividends reinvested are shown in Fig. 7 along with the three-month Treasury Bill and S&P 500 dividend yields. The 84 monthly input series used in the trading systems include both financial and macroeconomic data. All data are obtained from Citibase,¹¹ and the macroeconomic series are lagged by one month to reflect reporting delays.

¹¹ Citibase historical data is obtainable from www.fame.com.

A total of 45 years of monthly data are used, from January 1950 through December 1994. The first 20 years of data are used only for the initial training of the system. The test period is the 25 year period from January 1970 through December 1994. The experimental results for the 25-year test period are true ex ante simulated trading results.

1) Simulation Details: For each year during 1970 through 1994, the system is trained on a moving window of the previous 20 years of data. For 1970, the system is initialized with random parameters. For the 24 subsequent years, the previously learned parameters are used to initialize the training. In this way, the system is able to adapt to changing market and economic conditions. Within the moving training window, the RRL-Trader systems use the first ten years for stochastic optimization of system parameters, and the subsequent ten years for validating early stopping of training. The RRL-Trader networks use a single tanh unit, and are regularized using quadratic weight decay during training with a regularization parameter of 0.01.

The Q-Trader systems use a bootstrap sample of the 20 year training window for training, and the final ten years of the training window are used for validating early stopping of training. For the results reported, the networks are two-layer feedforward networks with 30 tanh units in the hidden layer. The networks are trained initially with the discounting factor γ set to zero. Then γ is set to 0.75. We find decreasing performance when the value of γ is adjusted to higher values.

To investigate the bias/variance tradeoff for the Q-Traders, we tried networks of size 10, 20, 30, and 40 hidden units. The 30 unit traders performed significantly better out of sample than traders with smaller or larger networks. The 20 unit traders were significantly better than the ten unit traders, suggesting that the smaller networks could not represent the Q function adequately (high model bias). The degradation in performance observed for the 40 unit nets suggests possible overfitting (increased model variance).

2) S&P Experimental Results: Fig. 8 shows box plots summarizing the test performance for the full 25 year test period of the trading systems with various realizations of the initial system parameters over 30 trials for the RRL-Trader system, and ten trials for the Q-Trader system.¹² The transaction cost is set at 0.5%. Profits are reinvested during trading, and multiplicative profits are used when calculating the wealth. The notches in the box plots indicate robust estimates of the 95% confidence intervals on the hypothesis that the median is equal to the performance of the buy and hold strategy. The horizontal lines show the performance of the RRL-Trader voting, Q-Trader voting and buy and hold strategies for the same test period. The total profits of the buy and hold strategy, the Q-Trader voting strategy and the RRL-Trader voting strategy are 1348%, 3359%, and 5860%, respectively. The corresponding annualized monthly Sharpe ratios are 0.34, 0.63, and 0.83, respectively.¹³ Remarkably, the superior results for the RRL-Trader are based on networks with a single thresholded tanh unit, while those for the Q-Trader required networks with 30 hidden units.¹⁴

¹² Ten trials were done for the Q-Trader system due to the amount of computation required in training the systems.

¹³ The Sharpe ratios calculated here are for the returns in excess of the three-month treasury bill rate.

¹⁴ As discussed in Section IV-C.1, care was taken to avoid both underfitting and overfitting in the Q-Trader case, and smaller nets performed substantially worse.

Fig. 8. Test results for ensembles of simulations using the S&P 500 stock index and 3-month Treasury Bill data over the 1970-1994 time period. The boxplots show the performance for the ensembles of RRL-Trader and Q-Trader trading systems. The horizontal lines indicate the performance of the systems and the buy and hold strategy. The solid curves correspond to the RRL-Trader system performance, dashed curves to the Q-Trader system and the dashed and dotted curves indicate the buy and hold performance. Both systems significantly outperform the buy and hold strategy.

Fig. 9 shows results for following the strategy of taking positions based on a majority vote of the ensembles of trading systems compared with the buy and hold strategy. We can see that the trading systems go short the S&P 500 during critical periods, such as the oil price shock of 1974, the tight money periods of the early 1980s, the market correction of 1984 and the 1987 crash. This ability to take advantage of high treasury bill rates or to avoid periods of substantial stock market loss is the major factor in the long term success of these trading models. One exception is that the RRL-Trader trading system remains long during the 1991 stock market correction associated with the Persian Gulf war, a political event, though the Q-Trader system is fortunately short during the correction. On the whole though, the Q-Trader system trades much more frequently than the RRL-Trader system, and in the end does not perform as well on this data set.

From these results we find that both trading systems outperform the buy and hold strategy, as measured by both accumulated wealth and Sharpe ratio. These differences are statistically significant and support the proposition that there is predictability in the U.S. stock and treasury bill markets during the 25-year period 1970 through 1994. A more detailed presentation of the RRL-Trader results appears in [2]. Further discussion of the Q-Trader versus RRL-Trader performance is presented in Section V-D.

Fig. 9. Test results for ensembles of simulations using the S&P 500 stock index and three-month Treasury Bill data over the 1970-1994 time period. Shown are the equity curves associated with the systems and the buy and hold strategy, as well as the trading signals produced by the systems. The solid curves correspond to the RRL-Trader system performance, dashed curves to the Q-Trader system and the dashed and dotted curves indicate the buy and hold performance. Both systems significantly outperform the buy and hold strategy. In both cases, the traders avoid the dramatic losses that the buy and hold strategy incurred during 1974 and 1987.

Fig. 10. Sensitivity traces for three of the inputs to the RRL-Trader trading system averaged over the ensemble of traders. The nonstationary relationships typical among economic variables are evident from the time-varying sensitivities.

3) Model Insight Through Sensitivity Analysis: A sensitivity analysis of the RRL-Trader systems was performed in an attempt to determine on which economic factors the traders are basing their decisions. Fig. 10 shows the absolute normalized sensitivities for three of the more salient input series as a function of time, averaged over the 30 members of the RRL-Trader committee. The sensitivity of input i is defined as:

    s_i = |dF / dx_i| / max_j |dF / dx_j|   (45)

where F denotes the unthresholded trading output of the policy function and x_i denotes input i.
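The normalized sensitivity (45) can be estimated numerically; the central-difference step size below is an illustrative choice, and policy_prethreshold stands for the unthresholded trading output F.

    import numpy as np

    def input_sensitivities(policy_prethreshold, x, h=1e-4):
        # Absolute normalized sensitivities s_i = |dF/dx_i| / max_j |dF/dx_j|, as in (45).
        # policy_prethreshold(x) returns the unthresholded trading output F for input vector x.
        grads = np.zeros(len(x))
        for i in range(len(x)):
            x_plus, x_minus = np.array(x, dtype=float), np.array(x, dtype=float)
            x_plus[i] += h
            x_minus[i] -= h
            grads[i] = (policy_prethreshold(x_plus) - policy_prethreshold(x_minus)) / (2.0 * h)
        s = np.abs(grads)
        return s / s.max()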
The time-varying sensitivities in Fig. 10 emphasize the nonstationarity of economic relationships. For example, the yield curve slope (which measures inflation expectations) is found to be a very important factor in the 1970s, while trends in long term interest rates (measured by the six-month difference in the AAA bond yield) become more important in the 1980s, and trends in short term interest rates (measured by the six-month difference in the treasury bill yield) dominate in the early 1990s.

V. LEARN THE POLICY OR LEARN THE VALUE?

As mentioned in Section III, reinforcement learning algorithms can be classified as either DR (sometimes called "policy search"), value function methods, or actor-critic methods. The choice of the best method depends upon the nature of the problem domain.

A. Immediate versus Future Rewards

Reinforcement signals received from the environment can be immediate or delayed. In some problems, such as checkers [55], [56], backgammon [19], [20], navigating a maze [57], or maneuvering around obstacles [58], reinforcement from the environment occurs only at the end of the game or task. The final rewards received are {success, failure} or {win, lose}. For such tasks, the temporal credit assignment problem is extreme. There is usually no a priori assessment of performance available during the course of each game or trial. Hence, one is forced to learn a value function of the system state at each time. This is accomplished by doing many runs on a trial and error basis, and discounting the ultimate reward received back in time. This discounting approach is the basis of dynamic programming [8], TD-Learning [9] and Q-Learning [10], [11].

For these value function methods, the action taken at each time is that which offers the largest increase in expected value. Thus, the policy is not represented directly. An intermediate class of reinforcement algorithms are actor-critic methods [12]. While the actor module provides a direct representation of the policy for these methods, it relies on the critic module for feedback. The role of the critic is to learn the value function.

In contrast, direct reinforcement methods represent the policy directly, and make use of immediate feedback to adjust the policy. This approach is appealing when it is possible to specify an instantaneous measure of performance, because the need to learn a value function is bypassed.

In trading, asset allocation and portfolio management problems, for example, overall performance accrues gradually over time. For these financial decision making problems, an immediate measure of incremental performance is available at each time step. Although total performance usually involves integrating or averaging over time, it is nonetheless possible to adaptively update the strategy based upon the investment return received at each time step.

Other domains that offer the possibility of immediate feedback include a wide range of control applications. The standard formulation for optimal control problems involves time integrals of an instantaneous performance measure. Examples of common loss functions include average squared deviation from a desired trajectory or average squared jerk.¹⁵

¹⁵ "Jerk" is the rate of change of acceleration.

A related approach that represents and improves policies explicitly is the policy gradient approach. Policy gradient methods use the gradient of the expected average or discounted reward with respect to the parameters of the policy function to improve the policy. The expected rewards are typically estimated by learning a value function, or by using single sample paths of the Markov reward process. There have been several recent, independent proofs of the convergence of policy gradient methods. Marbach and Tsitsiklis [59], [60] and Baxter and Bartlett [7]16 show convergence to locally optimal policies by using simulation-based methodologies to approximate expected rewards. Sutton et al. [61] and Konda and Tsitsiklis [62] obtain similar results when estimating expected rewards from a value function implemented using a function approximator. An application to robot navigation is provided by Grudic and Ungar [63]. Note that some of the so-called "policy gradient" methods are not DR methods, because they require the estimation of a value function. Rather, these methods are more properly classified as actor-critic methods.

16 Baxter and Bartlett have independently coined the term DR to describe policy gradient methods in an MDP framework based on simulating sample paths and maximizing average rewards. Our intended usage of the term is in the same spirit, but perhaps more general, referring to all algorithms that do not need to learn a value function in order to derive a policy.
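To make the single-sample-path idea concrete, the following is a minimal REINFORCE-style sketch in the spirit of [5], [6] for a one-step problem; it is not the RRL algorithm of this paper, and the environment, names, and hyperparameters are hypothetical:

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)                        # parameters of a softmax policy over two actions

    def policy_probs(theta):
        z = theta - theta.max()                # numerically stable softmax
        p = np.exp(z)
        return p / p.sum()

    def sample_reward(action):
        # Hypothetical one-step environment: action 1 pays more on average.
        return rng.normal(1.0 if action == 1 else 0.2, 0.1)

    learning_rate, baseline = 0.05, 0.0
    for episode in range(2000):
        p = policy_probs(theta)
        a = rng.choice(2, p=p)                 # sample an action from the current policy
        r = sample_reward(a)
        grad_log_pi = -p                       # d log pi(a)/d theta for a softmax policy ...
        grad_log_pi[a] += 1.0                  # ... equals e_a - p
        theta += learning_rate * (r - baseline) * grad_log_pi   # REINFORCE update
        baseline += 0.01 * (r - baseline)      # running-average baseline reduces variance

    print(policy_probs(theta))                 # the probabilities shift toward action 1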
B. Policies versus Values

Much attention in the reinforcement learning community has been given recently to the question of learning policies versus learning value functions. Over the past 20 years or so, the value function approach has dominated the field. The approach has worked well in many applications, and a number of convergence theorems exist that prove that the approach will work under certain conditions.

However, the value function approach suffers from several limitations. The original formulation of Q-learning is in the context of discrete state and action spaces. As such, in many practical situations it suffers from the "curse of dimensionality." When Q-learning is extended to function approximators, it has been shown in many cases that there are simple Markov decision processes for which the algorithms fail to converge [64]. Also, the policies derived from a Q-learning approach tend to be brittle; that is, small changes in the value function can produce large changes in the policy. For finance in particular, the presence of large amounts of noise and nonstationarity in the datasets can cause severe problems for a value function approach.17

17 Brown [65] provides a nice example that demonstrates the brittleness of Q-learners in noisy environments.

We find our RRL algorithm to be a simpler and more efficient approach. Since the policy is represented directly, a much simpler functional form is often adequate to solve the problem. A significant advantage of the RRL approach is the ability to produce real-valued actions (e.g., portfolio weights) naturally, without resorting to the discretization necessary in the Q-learning case. Constraints on actions are also much easier to represent given the policy representation. Other advantages are that the RRL algorithm is more robust to the large amounts of noise that exist in financial data, and is able to quickly adapt to nonstationary market conditions.
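As an illustration of the real-valued-action point (a sketch only; the layer shape, names, and numbers are assumptions, not the architecture used in the experiments reported here), a directly parameterized policy can emit continuous portfolio weights in a single step, with no discretization of the action space:

    import numpy as np

    def continuous_portfolio_weights(theta, features):
        # One score per asset, mapped through a softmax so the weights are
        # nonnegative and sum to one: a continuous action, no action grid needed.
        scores = theta @ features
        z = scores - scores.max()
        w = np.exp(z)
        return w / w.sum()

    theta = np.array([[0.5, -0.2],             # 3 assets x 2 input features (hypothetical)
                      [0.1,  0.3],
                      [-0.4, 0.2]])
    print(continuous_portfolio_weights(theta, np.array([1.0, 0.5])))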
Fig. 11. A representation of the value function to be learned by the Q-learning algorithm for the example given in the text (Section V). The function represents the Q-value, Q(r, a), which is the value from taking action "a" in state "r." The figure on the left shows the value function for the case of discrete, binary returns. The Q-function has the form of the XOR problem, while the optimal policy is simply a = r. The figure on the right shows the value function when returns are real-valued (note the change in axes). The Q-function now becomes arbitrarily hard to represent accurately using a single function approximator of tanh units, while the optimal policy is still very simple, a = sign(r).

C. An Example

We present an example of how an increase in complexity occurs when a policy is represented implicitly through the use of a value function. We start with the most simple trading problem: a trader that makes decisions to buy and sell a single asset, where there are no transaction costs or trading frictions. The asset returns are from a binomial process in {-1, +1}. To make matters even more simple, we will assume that the next period's return is known in advance. Given these conditions, the optimal policy does not require knowledge of future rewards, so the Q-learning discount parameter will be set to zero. We will measure the complexity of the solution by counting the number of tanh units that are required to implement a solution using a single function approximator.

It is obvious that the policy function is trivial. The optimal policy is to take the action a = r. In terms of model structure, a single tanh unit would suffice. On the other hand, if we decide to learn the value function before taking actions, we find in this case that we have to learn the XOR function. As shown in Fig. 11, the value function is +1 when the proposed action has the same sign as the return and -1 otherwise. Because of the binomial return process, we can solve this problem using only two tanh units. Due to the value function representation of the problem, the complexity of the solution has doubled.

This doubling of model complexity is by comparison minor if we make the problem a little more realistic by allowing returns to be drawn from a continuous real-valued distribution. The complexity of the policy function has not increased: a = sign(r). However, the value function's increase in complexity is potentially enormous. Since returns are now real-valued, if we wish to approximate the value function to an arbitrarily small precision, we must use an arbitrarily large model.
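The complexity argument can be checked numerically. The sketch below verifies that a single tanh unit implements the optimal policy in the binary-return case, while reproducing the XOR-shaped Q-values takes two tanh units; the particular weights are one hand-picked construction (an assumption for illustration), not values from the paper:

    import numpy as np

    returns = np.array([-1, -1, 1, 1])        # binary returns r in {-1, +1}
    actions = np.array([-1, 1, -1, 1])        # candidate actions a in {-1, +1}
    target_q = returns * actions              # Q(r, a): +1 if signs agree, -1 otherwise (XOR-shaped)

    # Policy side: a single tanh unit already realizes the optimal policy a = r.
    policy = np.sign(np.tanh(5.0 * returns))
    print(policy)                              # equals sign(r), i.e., the returns themselves

    # Value side: one possible two-tanh-unit construction that reproduces the Q-values.
    def q_two_units(r, a, beta=10.0):
        h1 = np.tanh(beta * (r + a - 1.0))
        h2 = np.tanh(beta * (-r - a - 1.0))
        return 1.0 + h1 + h2

    print(np.round(q_two_units(returns, actions)))   # matches target_q
    # No single tanh of an affine function w1*r + w2*a + b can match target_q, because
    # the four (r, a) sign patterns are not linearly separable (the XOR argument).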
D. Discussion of the S&P 500/T-Bill Results

For the S&P 500/T-Bill asset allocation problem described in Section IV-C, we find that RRL offers advantages over Q-Learning in performance, interpretability, and computational efficiency. Over the 25-year test period, the RRL-Trader produced significantly higher profits (5860% versus 3359%) and Sharpe ratios (0.83 versus 0.63) than did the Q-Trader. The RRL-Trader learns a stable and robust trading strategy,
maintaining its positions for extended periods. The frequent switches in position by the Q-Trader suggest that it is more sensitive to noise in the inputs. Hence, the strategy it has learned is brittle.

Regarding interpretability, we find the value function representation to be obscure. While the change in the policy as implemented by the RRL algorithm is directly related to changes in the inputs, for the value function the effect on policy is not so clear. While the RRL-Trader has an almost linear policy representation (a net with just a single tanh unit), the Q-Trader's policy is the argmax of a two-layer network for which the action is an input. The brittle behavior of the Q-Trader is probably due to the complexity of the learned Q-function with respect to the inputs and actions. The problem representation for the Q-Trader thus has reduced explanatory value.

The sensitivity analysis presented for the RRL-Trader strategy in Section IV-C.3 was easy to formulate and implement. It enables us to identify the most important explanatory variables, and to observe how their relative saliency varies slowly over time. For the Q-Trader, however, a similar analysis is not straightforward. The possible actions are represented as inputs to the Q-function network, with the chosen action being determined by the argmax. While we can imagine proxies for a sensitivity analysis in a simple two-action {long, short} framework, it is not clear how to perform a sensitivity analysis for actions versus inputs in general for a Q-learning framework. This reduces the explanatory value of a Q-Trader.
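For concreteness, the implicit form of the Q-Trader's policy can be sketched as follows; the tiny two-layer Q-network and its random weights are hypothetical stand-ins, not the network used in the experiments:

    import numpy as np

    def q_network(state, action, W1, b1, w2):
        # Hypothetical two-layer Q-network: the candidate action enters as an input.
        x = np.append(state, action)
        hidden = np.tanh(W1 @ x + b1)
        return float(w2 @ hidden)

    def q_trader_policy(state, W1, b1, w2, candidate_actions=(-1.0, 1.0)):
        # The policy is only implicit: evaluate Q for each candidate action, take the argmax.
        q_values = [q_network(state, a, W1, b1, w2) for a in candidate_actions]
        return candidate_actions[int(np.argmax(q_values))]

    rng = np.random.default_rng(1)
    W1, b1, w2 = rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4)
    print(q_trader_policy(np.array([0.3, -0.2]), W1, b1, w2))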
Since the {long, short} Q-Trader is implemented using a neural-network function approximator, Bellman's curse of dimensionality has a relatively small impact on the results of the experiments presented here. The input dimensionality of the Q-Trader is increased by only one, and there are only two actions to consider. However, in the case of a portfolio management or multi-sector asset allocation system, the dimensionality problem becomes severe.18 Portfolio management requires a continuous weight for each of the assets included in the portfolio. This increases the input dimension for the Q-Trader by the number of assets relative to the RRL-Trader. Then, in order to facilitate the discovery of value-maximizing actions, we can only consider discrete action sets. The number of discrete actions that must be considered is exponential in the number of assets. As another issue, we must also consider the possible loss of utility that results due to the finite resolution of action choices.

18 We have encountered this obstacle in preliminary, unpublished experiments.
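A rough count makes the scaling concrete (the grid size below is an illustrative assumption, not a setting from the experiments): with a fixed number of discrete weight levels per asset, the joint actions a Q-Trader must rank grow exponentially with the number of assets, while a direct policy needs only one continuous output per asset:

    # Illustrative count of joint discrete actions for a portfolio Q-Trader,
    # versus the number of continuous outputs a direct policy needs.
    def discrete_action_count(num_assets, levels_per_asset):
        return levels_per_asset ** num_assets

    for m in (1, 5, 10, 20):
        print(m, "assets:", discrete_action_count(m, levels_per_asset=11),
              "discrete joint actions vs", m, "continuous weights")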
In terms of efficiency, the advantage updating representation used for the Q-Trader required two networks, each with 30 tanh units. In order to reduce run time, the simulation code was written in C. Still, each run required approximately 25 hours to complete using a Pentium Pro 200 running the Linux operating system. The RRL networks used a single tanh unit, and were implemented as uncompiled Matlab code. Even given this unoptimized coding, the RRL simulations were 150 times faster, taking only 10 min.

VI. CONCLUSION

In this paper, we have demonstrated how to train trading systems via DR. We have described the RRL algorithm, and used it to optimize financial performance criteria such as the differential Sharpe ratio and differential downside deviation ratio. We have also provided empirical results that demonstrate the presence of predictability as discovered by RRL in intradaily U.S. Dollar/British Pound exchange rates and in the monthly S&P 500 Stock Index for the 25-year test period 1970 through 1994.

In previous work [1], [2], we showed that trading systems trained via RRL significantly outperform systems trained using supervised methods. In this paper, we have compared the DR approach using RRL to the Q-learning value function method. We find that an RRL-Trader achieves better performance than a Q-Trader for the S&P 500/T-Bill asset allocation problem. We observe that relative to Q-learning, RRL enables a simpler problem representation, avoids Bellman's curse of dimensionality, and offers compelling advantages in efficiency.

We have also discussed the relative merits of DR and value function learning, and provided arguments and examples for why value function-based methods may result in unnatural problem representations. For problem domains where immediate estimates of incremental performance can be obtained, our results suggest that DR offers a powerful alternative.

ACKNOWLEDGMENT

The authors wish to thank L. Wu and Y. Liao for their contributions to our early work on DR, and A. Atiya, G. Tesauro, and the reviewers for their helpful comments on this manuscript.

REFERENCES

[1] J. Moody and L. Wu, "Optimization of trading systems and portfolios," in Decision Technologies for Financial Engineering, Y. Abu-Mostafa, A. N. Refenes, and A. S. Weigend, Eds. London: World Scientific, 1997, pp. 23-35.
[2] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," J. Forecasting, vol. 17, pp. 441-470, 1998.
[3] W. A. Clark and B. G. Farley, "Simulation of self-organizing systems by digital computer," IRE Trans. Inform. Theory, vol. 4, pp. 76-84, 1954.
[4] W. A. Clark and B. G. Farley, "Generalization of pattern recognition in a self-organizing system," in Proc. 1955 Western Joint Comput. Conf., 1955, pp. 86-91.
[5] R. J. Williams, "Toward a theory of reinforcement-learning connectionist systems," College Comput. Sci., Northeastern Univ., Boston, MA, Tech. Rep. NU-CCS-88-3, 1988.
[6] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229-256, 1992.
[7] J. Baxter and P. L. Bartlett, "Direct gradient-based reinforcement learning: I. Gradient estimation algorithms," Comput. Sci. Lab., Australian Nat. Univ., Tech. Rep., 1999.
[8] R. E. Bellman, Dynamic Programming. Princeton, NJ: Princeton Univ. Press, 1957.
[9] R. S. Sutton, "Learning to predict by the method of temporal differences," Machine Learning, vol. 3, no. 1, pp. 9-44, 1988.
[10] C. J. C. H. Watkins, "Learning from Delayed Rewards," Ph.D. thesis, Cambridge Univ., Psychol. Dept., 1989.
[11] C. J. Watkins and P. Dayan, "Technical note: Q-Learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
[12] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. 13, no. 5, pp. 835-846, Sept. 1983.
[13] A. G. Barto, Handbook of Intelligent Control. New York: Van Nostrand Reinhold, 1992, ch. 12.
[14] Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds., Van Nostrand Reinhold, New York, 1992, pp. 65-90.
[15] Handbook of Intelligent Control, D. A. White and D. A. Sofge, Eds., Van Nostrand Reinhold, New York, 1992, pp. 493-526.
[16] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," J. Artificial Intell. Res., vol. 4, 1996.
[17] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1997.
[19] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Comput., vol. 6, no. 2, pp. 215-219, 1994.
[20] G. Tesauro, "Temporal difference learning and TD-Gammon," Commun. ACM, vol. 38, no. 3, pp. 58-68, 1995.
[21] R. H. Crites and A. G. Barto, "Improving elevator performance using reinforcement learning," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., 1996, vol. 8, pp. 1017-1023.
[22] W. Zhang and T. G. Dietterich, "High-performance job-shop scheduling with a time-delay TD(lambda) network," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., 1996, vol. 8, pp. 1024-1030.
[23] J. Moody and M. Saffell, "Reinforcement learning for trading," in Advances in Neural Information Processing Systems, S. A. Solla, M. S. Kearns, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 917-923.
[24] J. Moody and M. Saffell, "Minimizing downside risk via stochastic dynamic programming," in Computational Finance 1999, A. W. Lo, Y. S. Abu-Mostafa, B. LeBaron, and A. S. Weigend, Eds. Cambridge, MA: MIT Press, 2000, pp. 403-415.
[25] R. C. Merton, "Lifetime portfolio selection under uncertainty: The continuous-time case," Rev. Economics Statist., vol. 51, pp. 247-257, Aug. 1969.
[26] R. C. Merton, "Optimum consumption and portfolio rules in a continuous-time model," J. Economic Theory, vol. 3, pp. 373-413, Dec. 1971.
[27] R. C. Merton, Continuous-Time Finance. Oxford, U.K.: Blackwell, 1990.
[28] E. J. Elton and M. J. Gruber, "Dynamic programming applications in finance," J. Finance, vol. 26, no. 2, 1971.
[29] D. T. Breeden, "Intertemporal portfolio theory and asset pricing," in Finance, J. Eatwell, M. Milgate, and P. Newman, Eds. New York: Macmillan, 1987, pp. 180-193.
[30] D. Duffie, Security Markets: Stochastic Models. New York: Academic, 1988.
[31] D. Duffie, Dynamic Asset Pricing Theory, 2nd ed. Princeton, NJ: Princeton Univ. Press, 1996.
[32] J. C. Cox, S. A. Ross, and M. Rubinstein, "Option pricing: A simplified approach," J. Financial Economics, vol. 7, pp. 229-263, Oct. 1979.
[33] M. J. Brennan, E. S. Schwartz, and R. Lagnado, "Strategic asset allocation," J. Economic Dynamics Contr., vol. 21, pp. 1377-1403, 1997.
[34] F. Longstaff and E. Schwartz, "Valuing American options by simulation: A simple least squares approach," Rev. Financial Studies, 2001, to be published.
[35] B. Van Roy, "Temporal-difference learning and applications in finance," in Computational Finance 1999, Y. S. Abu-Mostafa, B. LeBaron, A. W. Lo, and A. S. Weigend, Eds. Cambridge, MA: MIT Press, 2001, pp. 447-461.
[36] R. Neuneier, "Optimal asset allocation using adaptive dynamic programming," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds. Cambridge, MA: MIT Press, 1996, vol. 8, pp. 952-958.
[37] R. Neuneier and O. Mihatsch, "Risk sensitive reinforcement learning," in Advances in Neural Information Processing Systems, M. S. Kearns, S. A. Solla, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 1031-1037.
[38] J. N. Tsitsiklis and B. Van Roy, "Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives," IEEE Trans. Automat. Contr., vol. 44, pp. 1840-1851, Oct. 1999.
[39] J. N. Tsitsiklis and B. Van Roy, "Regression methods for pricing complex American-style options," IEEE Trans. Neural Networks, vol. 12, no. 4, pp. 694-703, July 2001.
[40] W. F. Sharpe, "Mutual fund performance," J. Business, pp. 119-138, Jan. 1966.
[41] H. M. Markowitz, Portfolio Selection: Efficient Diversification of Investments. New York: Wiley, 1959.
[42] F. A. Sortino and R. van der Meer, "Downside risk—capturing what's at stake in investment situations," J. Portfolio Management, vol. 17, pp. 27-31, 1991.
[43] D. Nawrocki, "Optimal algorithms and lower partial moment: Ex post results," Appl. Economics, vol. 23, pp. 465-470, 1991.
[44] D. Nawrocki, "The characteristics of portfolios selected by n-degree lower partial moment," Int. Rev. Financial Anal., vol. 1, pp. 195-209, 1992.
[45] F. A. Sortino and H. J. Forsey, "On the use and misuse of downside risk," J. Portfolio Management, vol. 22, pp. 35-42, 1996.
[46] D. Nawrocki, "A brief history of downside risk measures," J. Investing, pp. 9-26, Fall 1999.
[47] H. White, private communication, 1996.
[48] R. S. Sutton, "Temporal credit assignment in reinforcement learning," Ph.D. dissertation, Univ. Massachusetts, Amherst, 1984.
[49] L. C. Baird, "Advantage updating," Wright Laboratory, Wright-Patterson Air Force Base, OH, Tech. Rep. WL-TR-93-1146, 1993.
[50] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Exploration in the Microstructure of Cognition, D. E. Rumelhart and J. L. McClelland, Eds. Cambridge, MA: MIT Press, 1986, ch. 8, pp. 310-362.
[51] P. J. Werbos, "Back-propagation through time: What it does and how to do it," Proc. IEEE, vol. 78, no. 10, pp. 1550-1560, Oct. 1990.
[52] B. Widrow and M. E. Hoff, "Adaptive switching circuits," in IRE WESCON Convention Record, 1960, pp. 96-104.
[53] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Comput., vol. 1, pp. 270-280, 1989.
[54] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, 1990.
[55] A. L. Samuel, "Some studies in machine learning using the game of checkers," IBM J. Res. Development, vol. 3, pp. 211-229, 1959.
[56] A. L. Samuel, "Some studies in machine learning using the game of checkers. II—Recent progress," IBM J. Res. Development, vol. 11, pp. 601-617, 1967.
[57] J. Peng and R. J. Williams, "Efficient learning and planning within the Dyna framework," Adaptive Behavior, vol. 1, no. 4, pp. 437-454, 1993.
[58] A. W. Moore and C. G. Atkeson, "Prioritized sweeping: Reinforcement learning with less data and less real time," Machine Learning, vol. 13, pp. 103-130, 1993.
[59] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," in Proc. IEEE Conf. Decision Contr., 1998.
[60] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," IEEE Trans. Automat. Contr., vol. 46, pp. 191-209, Feb. 2001.
[61] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," in Advances in Neural Information Processing Systems, T. K. Leen, S. A. Solla, and K.-R. Muller, Eds. Cambridge, MA: MIT Press, 2000, vol. 12, pp. 1057-1063.
[62] V. R. Konda and J. N. Tsitsiklis, "Actor-critic algorithms," in Advances in Neural Information Processing Systems, S. A. Solla, T. K. Leen, and K.-R. Muller, Eds. Cambridge, MA: MIT Press, 2000, vol. 12, pp. 1008-1014.
[63] G. Z. Grudic and L. H. Ungar, "Localizing policy gradient estimates to action transitions," in Proc. 17th Int. Conf. Machine Learning, 2000.
[64] L. Baird and A. Moore, "Gradient descent for general reinforcement learning," in Advances in Neural Information Processing Systems, S. A. Solla, M. S. Kearns, and D. A. Cohn, Eds. Cambridge, MA: MIT Press, 1999, vol. 11, pp. 968-974.
[65] T. X. Brown, "Policy vs. value function learning with variable discount factors," in Proc. NIPS 2000 Workshop Reinforcement Learning: Learn the Policy or Learn the Value Function?, Dec. 2000.

John Moody received the B.A. degree in physics from the University of Chicago, Chicago, IL, in 1979 and the M.A. and Ph.D. degrees in theoretical physics from Princeton University, Princeton, NJ, in 1981 and 1984, respectively.
He is the Director of the Computational Finance Program and a Professor of Computer Science and Electrical Engineering at Oregon Graduate Institute of Science and Technology, Beaverton. He is also the Founder and President of Nonlinear Prediction Systems, a company specializing in the development of forecasting and trading systems. His research interests include computational finance, time series analysis, and machine learning.
Dr. Moody recently served as Program Co-Chair for Computational Finance 2000 in London, is a past General Chair and Program Chair of the Neural Information Processing Systems (NIPS) Conference, and is a member of the editorial board of Quantitative Finance.

Matthew Saffell received the B.Sc. degree in computer science and engineering with a minor in mathematics from LeTourneau University, Longview, TX, in 1992, and the M.Sc. degree in computer science and engineering from the University of Tennessee in 1994. He is pursuing the Ph.D. degree in the Computer Science and Engineering Department at the Oregon Graduate Institute, Beaverton.
He is a Consulting Scientist at Nonlinear Prediction Systems.