
Forecasting Series-based Stock Price Data using Direct Reinforcement Learning

Hailin Li, Cihan H. Dagli, and David Enke
Department of Engineering Management
University of Missouri-Rolla
Rolla, MO USA 65409-0370
E-mail: {h18p5, dagli, enke}@umr.edu

Abstract - A significant amount of work has been done in the area of price series forecasting using soft computing techniques, most of which are based upon supervised learning. Unfortunately, there has been evidence that such models suffer from fundamental drawbacks. Given that the short-term performance of the financial forecasting architecture can be immediately measured, it is possible to integrate reinforcement learning into such applications. In this paper, we present a novel hybrid view for a financial series and a critic adaptation stock price forecasting architecture using direct reinforcement. A new utility function called the policies-matching ratio is also proposed. The need for the common tweaking work of supervised learning is reduced, and empirical results using real financial data illustrate the effectiveness of such a learning framework.

I. INTRODUCTION

Forecasting series-based stock price data via soft computing techniques has in fact already taken shape in the past decade. Although many academics and practitioners have tended to regard such applications with a high degree of skepticism, there has been accumulating evidence that the markets are not fully efficient and that Artificial Intelligence-based models can outperform the benchmark models (e.g. the Random Walk model). We anticipate that more Internet financial service providers will incorporate AI techniques in their services to address current industry trends, such as cheap real-time information, financial market and institutional deregulation, and global capitalization.

Till now, most related research has been based on traditional supervised learning techniques. Usually, artificial neural network (ANN) models are adopted as the core engine in the forecasting architecture. It is believed that non-linear relationships exist between financial variables and stock returns. Often, hybrid architectures are proposed in order to obtain better predictions than simple ANN models currently provide. Such developments include the synthesis of genetic algorithms and ANNs [1], neuro-fuzzy architectures [2], the combination of qualitative and quantitative data [3], and ARIMA-based ANNs [4]. Unfortunately, much of the published literature is still somewhat academic. The results are case sensitive and hard to apply. In essence, all efforts under the supervised learning framework will be subject to the fundamental limitation in terms of why historical patterns of the financial series should be expected to repeat. The strength of such models in interpolation cannot promise the exploration ability that obviously is crucial for this application.

Real-time financial performance depends upon sequences of interdependent decisions and is thus path-dependent. In other words, a series of actions must be taken in sequence, and the quality of these actions usually cannot be determined until the end of the sequence. This is a much harder problem than supervised learning algorithms often face. In this sense, many financial applications fall into the reinforcement learning domain. Classical dynamic programming methods have already been applied to asset allocation [5], portfolio optimization [6], and derivatives pricing applications [7]. Recently, Moody et al. [8] proposed a recurrent reinforcement learning method to learn trading policies. In terms of forecasting financial series data, it is difficult to integrate reinforcement learning techniques directly. Few published research studies cover this issue. In [9] we proposed a new hybrid view for financial series. That architecture offers the basis for further analysis using reinforcement learning. The Q-Learning approach is also adopted to forecast future price trends purely from historical stock price data. The empirical results reported in that paper show the effectiveness of this new thought. Opposite to the claims of the Random Walk Theory, historical financial series may provide indications to predict future trends.

However, the Neuro-Q forecasting architecture in [9] is likely to suffer from several limitations due to the nature of value function reinforcement learning. For a discrete-time dynamic system, which suits most financial series prediction environments, "Bellman's curse of dimensionality" is unavoidable. Moreover, as pointed out by Brown [10], the policies produced by Q-learning tend to be brittle because of the noisy financial data. The recurrent reinforcement learning (RRL) algorithm presented by [8] is a type of direct policy learning that eliminates the intermediate value function estimation procedure. Inspired by its strength in problem representation and computational efficiency, this paper proposes a novel critic adaptation forecasting architecture to predict series-based stock prices. RRL is utilized as the learning algorithm for the critic network. We demonstrate how learning can be implemented under the proposed schema. Experiment results using real financial data are used to illustrate the effectiveness of such a learning framework.
The remainder of this paper is structured as follows. Section 2 describes the related adaptive critic design concepts and the novel hybrid view (model) for financial series prediction. Section 3 presents the implementation of the proposed critic adaptation stock price forecasting architecture. In Section 4, empirical results are provided and discussed, followed by conclusions in Section 5.

II. Financial Series Learning using a Critic and Hybrid View

The concern of reinforcement learning is decision-making in uncertain environments. The essential idea behind reinforcement learning is a simple penalty-reward strategy. Compared with supervised learning, reinforcement learning techniques are a form of match-based learning (rather than error-based learning), due to the fact that the correct target output value for each input set or pattern is lacking and typically only delayed reward signals are available. Classical dynamic programming (DP) is the well-known approach used to handle the reinforcement learning problem in both the deterministic and stochastic cases, but its exhaustive search strategy is computationally expensive for most real problems. Furthermore, the backward search direction of DP precludes its utilization in real-time (or on-line) policy generation. To address such issues, adaptive critic design frameworks and various heuristic programming methods are gaining popularity in the recent literature [11, 12].

In general, critic methods integrate backpropagation (a way to obtain necessary derivatives) with reinforcement learning through a critic network. The critic network, rather than the "actor" or control network, learns to approximate a strategic utility function (i.e. the J function in DP's Bellman equation) and uses the actor's output as part of its inputs, directly or indirectly. Thus, the needed learning signal is the critic's output J, or the one-step utility U, which vary according to the adopted techniques, instead of a desired system output. Specifically, learning for this research will be implemented based on the RRL algorithm, under which there is no backpropagation path to the action network, but which uses the signal of the actor to estimate a utility function. For such a critic adaptation model, the cost-to-go function is expressed as follows:

    J(t) = \sum_{k=0}^{\infty} \gamma^k U(t+k),    (1)

where 0 < \gamma < 1 is a discount factor for finite-horizon problems. The target value for J(t) can be \gamma J(t+1) + U(t), allowing the critic to be trained forward in time, which is the key for real-time application.

In the financial area, high market volatility often creates difficulties for existing theories to provide clear explanations of price behaviour. The changing of the financial series results from the dynamics of the complex system. Supervised learning-based models can generalize the non-linear relationship between stock returns and various micro/macro financial factors when the market is "smooth and calm". Unfortunately, they lack the exploration ability to catch unexpected price movements in time. Essentially, all volatility is the direct consequence of the investment behaviours of people, no matter whether the decisions of investors are made based upon technical or fundamental analysis. Obviously, it is highly difficult (if not impossible) for the statistically significant structure of the financial time series found by supervised learning to represent such evolutionary behaviour.

To account for this observation, we propose a financial series hybrid view in which the observed price p_t is the result of the combination of inherent market inertia (i.e. "normal" market patterns that can be generalized from historical data) and the actions taken by investors at the immediately previous market "state" (resulting in the "unexpected" market response). The price series generalized from "normal" market patterns, rp_t, can be viewed as the unobserved underlying "rational price" process. The most recent investment actions (buy/sell/hold) of investors change the output from the market inertia. The final observed noisy financial series is the summation of the two separated processes:

    p_t = rp_t + a_t + \varepsilon_t,    (2)

where a_t implicitly shows the extra investment policies other than those that follow the historical pattern, while \varepsilon_t is zero-mean random noise.
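To make the hybrid view concrete, the following minimal Python sketch simulates a price series under the decomposition of Equation (2). The inertia series, the shape of the action component, and the noise level are hypothetical stand-ins chosen only for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 250  # trading days (illustrative)

# Hypothetical "rational price" process rp_t: a slow trend standing in for market inertia.
t = np.arange(T)
rp = 1000.0 + 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 60)

# Hypothetical extra-investment component a_t: sparse "unexpected" buy/sell shocks
# that decay over a few days, mimicking reactions to recent market states.
a = np.zeros(T)
for shock_day in rng.choice(T, size=8, replace=False):
    a[shock_day:] += rng.normal(0, 15) * np.exp(-0.2 * np.arange(T - shock_day))

eps = rng.normal(0, 2.0, size=T)  # zero-mean noise eps_t

p = rp + a + eps  # observed price, as in Equation (2)
```

Under this view, a supervised model is asked to recover only rp_t, while the direct reinforcement learner tracks a_t; neither component is observable separately in real data.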
III. Critic Adaptation Stock Price Forecasting Architecture

The proposed hybrid view for financial time series prediction provides the basis for further technical analysis using the reinforcement learning philosophy. It is natural to conclude that price prediction can be done by solving two types of non-linear mapping problems, i.e. the next-step price derived from the market inertia and the next-step price resulting from "irrational" investment behaviours at the given time step. The latter price implicitly shows the very recent extra investment policies other than those following historical patterns.

Supervised learning methods are an extremely useful tool to catch the system patterns that keep repeating. Therefore, they will be used to obtain the characteristic of the market inertia. Given that the short-term performance of the forecasting architecture can be immediately measured, a type of direct reinforcement learning approach is utilized in order to directly explore "irrational" policies. It is noted that the forecasting of future prices is based entirely upon historical price values, so that the selection and pre-processing of input variables is unnecessary. Let the output price series from the forecasting architecture be denoted as \hat{p}_t. The architecture will update its parameters at the end of each time interval t in order to forecast p_{t+1}.
The ANN model is constructed to generalize the unobserved underlying "rational" price series rp_t from the market inertia. The result will be \hat{rp}_t. The recurrent reinforcement learning algorithm is used for the adaptation of the critic network, so that the additional price series \hat{a}_t that approximates the unexpected investment behaviours over time can be obtained. Thus, the final forecasting result is as follows:

    \hat{p}_{t+1} = \hat{rp}_{t+1} + \hat{a}_{t+1}.    (3)

A standard multi-layer feedforward perceptron model trained by various backpropagation algorithms can be used to get a fairly accurate representation of the underlying function F through which rp_t can be mapped from the lagged observed price series (p_{t-1}, p_{t-2}, ..., p_{t-k}).
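A minimal Python sketch of this inertia model, using a small feedforward network on lagged prices, might look as follows. The layer sizes and the training routine are illustrative assumptions; in particular, scikit-learn does not provide the Levenberg-Marquardt training used later in the paper, so a standard solver stands in for it:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def make_lagged_pairs(prices, k):
    """Build input-target pairs {(p_{t-1}, ..., p_{t-k}), p_t} from a 1-D price array."""
    X = np.array([prices[t - k:t][::-1] for t in range(k, len(prices))])
    y = np.asarray(prices)[k:]
    return X, y

def fit_inertia_model(prices, k=3, n_hidden=7):
    """Fit an MLP approximating F: (p_{t-1}, ..., p_{t-k}) -> rp_t."""
    X, y = make_lagged_pairs(prices, k)
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), solver="lbfgs",
                       max_iter=2000, random_state=0)
    net.fit(X, y)
    return net

# One-step-ahead rational-price estimate from the last k observed closes:
# rp_hat_next = net.predict(np.asarray(prices[-k:])[::-1].reshape(1, -1))[0]
```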
The decision function for \hat{a} is defined using the following format:

    \hat{a}_t = A(\hat{a}_{t-1}, \Theta_t, R_t),    (4)

where \Theta_t denotes the adjustable architecture parameters at time t and R_t is the observable of the system,

    R_t = { p_{t-1}, p_{t-2}, ..., p_{t-k}; \hat{rp}_{t-1}, \hat{rp}_{t-2}, ..., \hat{rp}_{t-k}; z_{t-1}, z_{t-2}, ... },

where z_t denotes any other external variables needed. The specific form of A at the starting point for a real problem can be set up by regression analysis based on a small group of training data; e.g. a simple autoregressive form may be valid, such as

    \hat{a}_t = u \hat{a}_{t-1} + v_0 p_{t-1} + v_1 p_{t-2} + ... + w_0 \hat{rp}_{t-1} + w_1 \hat{rp}_{t-2} + ... + x.    (5)

Here, \Theta_t are the adjustable weights {u; v_0, v_1, ...; w_0, w_1, ...; x}. Such a format is used in the simulations described in the next section.
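As a sketch, the autoregressive decision function of Equation (5) and its recurrent evaluation might look as follows in Python; restricting to two price lags and two inertia lags, and the particular starting weights, are simplifying assumptions made for readability:

```python
import numpy as np

def decision_fn(a_prev, theta, p_lags, rp_lags):
    """Autoregressive form of Equation (5):
    a_t = u*a_{t-1} + sum_i v_i*p_{t-1-i} + sum_i w_i*rp_hat_{t-1-i} + x."""
    return (theta["u"] * a_prev
            + theta["v"] @ p_lags
            + theta["w"] @ rp_lags
            + theta["x"])

# Example parameter container Theta_t (arbitrary starting guesses; the paper
# initializes these by regression analysis on a small training set).
theta = {"u": 0.5,
         "v": np.array([0.1, 0.05]),    # weights on p_{t-1}, p_{t-2}
         "w": np.array([-0.1, -0.05]),  # weights on rp_hat_{t-1}, rp_hat_{t-2}
         "x": 0.0}

# Recurrent rollout over a window, mirroring a_t's dependence on a_{t-1}:
# a = 0.0
# for t in range(2, len(p)):
#     p_lags = np.array([p[t - 1], p[t - 2]])
#     rp_lags = np.array([rp_hat[t - 1], rp_hat[t - 2]])
#     a = decision_fn(a, theta, p_lags, rp_lags)
```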
Financial series data are highly time-variant. Ultimately, minimizing the Mean Square Error is not necessarily the most important objective, because the goal of investors is to maximize the profit gain, or to catch the market trend. The cost-to-go function needs to be designed properly in order to reflect such a trade-off. To accomplish this, we propose a new one-step utility function U(\cdot), named the policies-matching ratio, as the reinforcement signal for the critic network, defined in terms of the residual

    e_t = p_{t-1} - \hat{rp}_{t-1}.    (7)

The ratio is constructed by considering the ability of the series \hat{a}_t to both minimize the prediction error (represented by the first, Gaussian function term) and follow the market trend (demonstrated by the latter, hyperbolic tangent sigmoid term). Note that the weights 0 \le \mu_1 \le 1 and 0 \le \mu_2 \le 1 illustrate the degree of preference toward the two optimization objectives mentioned above. Their values depend upon the application emphasis.

The performance criterion at time T can now be expressed as a function of the sequence of utilities,

    J(U_T, U_{T-1}, ..., U_2, U_1).    (8)

Normally, the following format for the cost-to-go function will be used:

    J_T = \sum_{k=1}^{T} \gamma^k U_k.    (9)

For a decision function A(\Theta_t), the learning process needs to be implemented so that \Theta can be updated to maximize J_T. The RRL algorithm [8] is adopted for direct reinforcement. The gradient ascent is

    \Delta\Theta = \alpha \frac{dJ_T}{d\Theta},    (10)

where 0 \le \alpha \le 1 is the learning rate, with a small value being preferred.

By considering only the most recent utility U_t, an on-line stochastic optimization can be obtained. More specifically, by adopting the proposed policies-matching ratio, a simpler form follows:

    \Delta\Theta_t = \alpha \frac{dU_t}{d\Theta_t} = \alpha \frac{dU_t}{d\hat{a}_t}\frac{d\hat{a}_t}{d\Theta_t} + \alpha \frac{dU_t}{d\hat{a}_{t-1}}\frac{d\hat{a}_{t-1}}{d\Theta_{t-1}}.    (13)

Equations (13)-(15) constitute the critic network adaptive learning algorithm.
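The exact expression of the policies-matching ratio (Equation (6)) is not legible in this copy, so the following Python sketch only illustrates the structure stated in the text: a Gaussian term rewarding small prediction error, a hyperbolic tangent term rewarding agreement with the market trend, and preference weights \mu_1 and \mu_2. The width sigma and the trend argument are hypothetical choices:

```python
import numpy as np

def policies_matching_ratio(p_true, p_hat, p_prev, p_hat_prev,
                            mu1=0.9, mu2=0.1, sigma=1.0):
    """One-step utility U_t with the structure described in the text:
    mu1 * Gaussian(prediction error) + mu2 * tanh(trend agreement).
    The paper's exact functional form (Eq. 6) is not recoverable here."""
    err = p_true - p_hat                    # prediction error term
    gaussian = np.exp(-(err / sigma) ** 2)  # near 1 when the error is small
    # Trend agreement: positive when the predicted and actual moves share a sign.
    trend = np.tanh((p_true - p_prev) * (p_hat - p_hat_prev))
    return mu1 * gaussian + mu2 * trend
```

With mu1 = 0.9 the signal mostly rewards error reduction (the S&P 500 setting in Section IV), while mu1 = 0.1, mu2 = 0.9 shifts the reward toward trend matching (the Nasdaq setting).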

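In the same hypothetical setting, the on-line update of Equations (9)-(13) can be sketched compactly. Treating the parameters as a flat vector and estimating dU_t/d\Theta by finite differences is an assumption made for brevity; the paper differentiates analytically through the recurrent decision function:

```python
import numpy as np

def online_rrl_step(theta_vec, utility_fn, alpha=0.01, h=1e-5):
    """One on-line gradient-ascent step Delta(Theta_t) = alpha * dU_t/dTheta_t.
    utility_fn maps a parameter vector to the current one-step utility U_t."""
    grad = np.zeros_like(theta_vec)
    base = utility_fn(theta_vec)
    for i in range(len(theta_vec)):        # finite-difference gradient estimate
        bumped = theta_vec.copy()
        bumped[i] += h
        grad[i] = (utility_fn(bumped) - base) / h
    return theta_vec + alpha * grad        # ascend the utility

def cost_to_go(utilities, gamma=0.95):
    """Discounted criterion J_T = sum_{k=1..T} gamma^k * U_k of Equation (9),
    useful for monitoring convergence across learning episodes."""
    return sum(gamma ** k * u for k, u in enumerate(utilities, start=1))
```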
Figure 1 presents the architecture for the whole forecast learning process. Basically, the complete adaptive learning process for the proposed architecture will consist of two groups of parameter updating, i.e., \Theta for the critic network and the weights W for the ANN model. Obviously, the most recent "unexpected" investment policies keep changing over time. Moreover, the unobserved market inertia is also likely to change frequently. To account for this assumption, a rolling-training process will be applied to both the supervised and the reinforcement learning at the end of each time interval t.

For the ANN model trained by the backpropagation algorithm, a perfect fit of the training set can easily be obtained if sufficient lagged values are included and if the number of neurons in the hidden layer is also large. However, the optimal training balance is hard to get, and ANNs are well known for their tendency to over-fit. To overcome the over-training phenomenon common in supervised learning, a cross-validation procedure will be included.

Fig. 1. Critic Adaptation Forecasting Architecture for Stock Price

Note that here the exploration ability of the architecture to search unknown investment strategies (which is crucial) is promised not only by the adopted on-line stochastic optimization algorithm, but also by the characteristics of financial series: intrinsically noisy and uncertain. In light of the above reasons, a separate noise variable will not be incorporated in the reinforcement learning.

IV. Empirical Results

This section presents an overview of experimental work applying the proposed critic adaptation forecasting architecture to predict future stock prices. Two daily stock index series (S&P 500 and NAS/NMS Composite) are studied. The goal is to forecast the corresponding index's next-day closing price based simply upon its historical daily closing prices. A 5-year test period from 1998 to 2003 was selected and used for both cases.

The standard three-layer feedforward perceptron trained by the Levenberg-Marquardt algorithm is adopted to generalize the individual market inertia for each index. As mentioned in Section 3, ANN models are prone to over-fitting. In order to construct relatively accurate ANN models without adding too much extra work, each case experimented with different values of k (the number of lagged values) ranging from 3 to 10 with an increment of 1, along with n (the hidden neuron number) ranging from 2 to 8 with an increment of 2. Such a combination resulted in 32 candidate models. The method for choosing k and n employed standard cross-validation verification. A small group of training data (20 daily closing prices for the S&P 500 starting from 11-Feb-98 and 20 daily closing prices for the Nasdaq starting from 1-Jun-98) was available at the beginning of each experiment. In order to let the ANN model reflect the market inertia's changes as timely as possible, rolling-training and data resampling were implemented, i.e. 10 input-target pairs {(p_{t-1}, p_{t-2}, ..., p_{t-k}), (p_t)} are sequentially picked up each time as the training data for the construction of the network models. As such, the 10 data pairs were divided into a training set (7 pairs) and a validation set (3 pairs). After the ANN model was set up and the prediction for the next day was completed, the new actual value of the stock closing price is added to the training data once it becomes available. The oldest input-target pair is eliminated, and the training process is restarted. Throughout the experiment, most of the time the selected structure for the S&P 500 had 7 hidden layer units and 3 lagged values, while the structure for the Nasdaq had 2 hidden layer units and 3 lagged values.
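The rolling selection loop described above can be sketched as follows, reusing the make_lagged_pairs helper from the earlier snippet. The 7/3 split and the k and n grids follow the text, while the validation metric (MSE) and the scikit-learn solver are assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def select_and_fit(window_prices, k_grid=range(3, 11), n_grid=range(2, 9, 2)):
    """Cross-validated choice of lag depth k and hidden size n over the
    32 candidate models (8 values of k x 4 values of n).
    window_prices should hold enough closes for at least 10 lagged pairs."""
    best_model, best_mse = None, np.inf
    for k in k_grid:
        X, y = make_lagged_pairs(window_prices, k)   # helper defined earlier
        X_tr, y_tr = X[:7], y[:7]                    # 7 training pairs
        X_va, y_va = X[7:10], y[7:10]                # 3 validation pairs
        for n in n_grid:
            net = MLPRegressor(hidden_layer_sizes=(n,), solver="lbfgs",
                               max_iter=2000, random_state=0).fit(X_tr, y_tr)
            mse = np.mean((net.predict(X_va) - y_va) ** 2)
            if mse < best_mse:
                best_model, best_mse = (k, n, net), mse
    return best_model

# Each day: append the newly observed close, drop the oldest history point,
# re-run select_and_fit, and predict rp_hat for the next day.
```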
Meanwhile, the adaptation of the parameter \Theta of the decision function A keeps repeating at the end of each day, based on the new reinforcement signal. The learning episodes will be increased gradually until approximate policy convergence is reached.

Figure 2 shows the results for the S&P 500 within a certain time window. For the policies-matching ratio, \mu_1 = 0.9 and
\mu_2 = 0.1 are used for this experiment, so the preference in this case is to minimize the Mean Square Error. Other settings include \alpha = 0.01, \gamma = 0.95, and a number of episodes equal to 500.

Fig. 2. Time Series for S&P 500 Index Simulation within Certain Timeframe

Assume investors are more interested in the market trend in the case of the NAS/NMS Composite (\mu_1 = 0.1 and \mu_2 = 0.9 are used to address this assumption). Figure 3 presents the related simulation results. Here, \alpha = 0.005, \gamma = 0.75, and the number of episodes for obtaining a stable policy is 450.

Fig. 3. Time Series for Nasdaq Index Simulation within Certain Timeframe

The extra investment decisions, other than those just following the market inertia, evolve over time. The same evolution occurs for their implicit representation \Theta. Figure 4 uses v_0 as an example to demonstrate such evolution. Figure 4 also shows the difference between an unstable learning policy (400 learning episodes) and the convergent policy learning (500 learning episodes).

Fig. 4. Example of \Theta's Evolution within Certain Timeframe

More precise comparison results are reported in Table I, in which five performance measures are used for a total of 1,268 predicted daily prices. These measures include the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the direction accuracy indicators DA1 and DA2 (the percentage of correct forecasts for an up and a down market, respectively). Note that the test goal for the S&P 500 is RMSE reduction, and for the Nasdaq it is market profitability. The definitions for DA1 and DA2 are built on the indicator function I(x), where I(x) = 1 if x > 0 and I(x) = 0 if x \le 0.
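The printed definitions of DA1 and DA2 are not recoverable from this copy, but a common reading consistent with the text (separate hit rates for up moves and down moves, counted with I(x)) can be sketched as follows; treat it as an assumed reconstruction rather than the paper's exact formula:

```python
import numpy as np

def direction_accuracy(p, p_hat):
    """Assumed DA1/DA2: fraction of up days (DA1) and down days (DA2) whose
    direction the forecast gets right, using I(x) = 1 if x > 0 else 0."""
    actual = np.diff(p)            # realized day-over-day moves
    pred = p_hat[1:] - p[:-1]      # forecast move relative to previous close
    up = actual > 0
    down = ~up
    da1 = np.mean(pred[up] > 0) if up.any() else np.nan      # up-market hits
    da2 = np.mean(pred[down] <= 0) if down.any() else np.nan  # down-market hits
    return da1, da2
```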


TABLE I. Comparison of Performance for Different Forecasting Architectures (only the following entries are legible in this copy)

    1815.1 / 2109.9
    DA1: 50.75% / 51.76%    44.78% / 61.22%
    DA2: 41.79% / 53.3%     41.79% / 79.76%

The most important observation from Table I is that in either case the proposed hybrid learning model can correctly forecast the market direction over 50 percent of the time. Such timing ability should result in market profitability.

REFERENCES

[1] P.G. McCluskey, "Feedforward and recurrent neural networks and genetic programs for stock market and time series forecasting," Technical Report CS-93-96, Brown University, 1993.
[2] "... integration of fuzzy neural networks and fuzzy Delphi," Applied Artificial Intelligence, pp. 501-520, 1998.
[3] Marco Costantino, Russell J. Collingham and Richard G. Morgan, "Qualitative Information in Finance: Natural Language Processing and Information Extraction," University of Durham, UK, 1996.
[4] Jung-Hua Wang and Jia-Yann Leu, "Stock market trend prediction using ARIMA-based neural networks," IEEE International Conference on Neural Networks, 1996, vol. 4, pp. 2160-2165.
[5] M.J. Brennan, E.S. Schwartz and R. Lagnado, "Strategic asset allocation," J. Economic Dynamics and Control, vol. 21, pp. 1377-1403, 1997.
[6] R.C. Merton, Continuous-Time Finance. Oxford, U.K.: Blackwell, 1990.
[7] J.C. Cox, S.A. Ross and M. Rubinstein, "Option pricing: A simplified approach," J. Financial Economics, vol. 7, pp. 229-263, Oct. 1979.
[8] J. Moody, L. Wu, Y. Liao and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," J. Forecasting, vol. 17, pp. 441-470, 1998.
[9] Hailin Li and Cihan H. Dagli, "Synthesis of reinforcement learning and artificial neural networks applied to forecast real time financial series," 24th ASEM Annual Conference, St. Louis, USA, 2003, pp. 493-499.
[10] T.X. Brown, "Policy vs. value function learning with variable discount factors," Proc. NIPS 2000 Workshop Reinforcement Learning: Learn the Policy or Learn the Value Function?, Dec. 2000.
[11] W.T. Miller, R.S. Sutton and P.J. Werbos, Neural Networks for Control, Cambridge, MA: MIT Press, 1990.
[12] D. Prokhorov, R. Santiago and D. Wunsch, "Adaptive critic designs: A case study for neurocontrol," Neural Networks, vol. 8, pp. 1367-1372, 1995.
