A multiple-kernel support vector regression approach for stock market price forecasting
Keywords: Stock market forecasting; Support vector regression; Multiple-kernel learning; SMO; Gradient projection

Abstract: Support vector regression has been applied to stock market forecasting problems. However, it is usually necessary to tune the hyperparameters of the kernel functions manually. Multiple-kernel learning was developed to deal with this problem, by which the kernel matrix weights and Lagrange multipliers can be derived simultaneously through semidefinite programming. However, the amount of time and space required is very demanding. We develop a two-stage multiple-kernel learning algorithm by incorporating sequential minimal optimization and the gradient projection method. By this algorithm, advantages from different hyperparameter settings can be combined and overall system performance can be improved. Besides, the user need not specify the hyperparameter settings in advance, and trial-and-error for determining appropriate hyperparameter settings can then be avoided. Experimental results, obtained by running on datasets taken from the Taiwan Capitalization Weighted Stock Index, show that our method performs better than other methods.

© 2010 Elsevier Ltd. All rights reserved.
This work was supported by the National Science Council under grant NSC 95-2221-E-110-055-MY2.
Corresponding author: S.-J. Lee (e-mail: [email protected]).

1. Introduction

Accurate forecasting of stock prices is an appealing yet difficult activity in the modern business world. Many factors influence the behavior of the stock market, both economic and non-economic. Therefore, stock market forecasting is regarded as one of the most challenging topics in business. In the past, methods based on statistics were proposed for tackling this problem, such as the autoregressive (AR) model (Champernowne, 1948), the autoregressive moving average (ARMA) model (Box & Jenkins, 1994), and the autoregressive integrated moving average (ARIMA) model (Box & Jenkins, 1994). These are linear models which are, more often than not, inadequate for stock market forecasting, since stock time series are inherently noisy and non-stationary. Recently, nonlinear approaches have been proposed, such as autoregressive conditional heteroskedasticity (ARCH) (Engle, 1982), generalized autoregressive conditional heteroskedasticity (GARCH) (Bollerslev, 1986), artificial neural networks (ANN) (Hansen & Nelson, 1997; Kim & Han, 2008; Kwon & Moon, 2007; Qi & Zhang, 2008; Zhang & Zhou, 2004), fuzzy neural networks (FNN) (Chang & Liu, 2008; Oh, Pedrycz, & Park, 2006; Zarandi, Rezaee, Turksen, & Neshat, 2009), and support vector regression (SVR) (Cao & Tay, 2001, 2003; Fernando, Julio, & Javier, 2003; Gestel et al., 2001; Pai & Lin, 2005; Tay & Cao, 2001; Valeriy & Supriya, 2006; Yang, Chan, & King, 2002).

ANN has been widely used for modeling stock market time series due to its universal approximation property (Kecman, 2001). Previous researchers indicated that ANN, which implements the empirical risk minimization principle, outperforms traditional statistical models (Hansen & Nelson, 1997). However, ANN suffers from local minimum traps and from the difficulty of determining the hidden layer size and the learning rate. On the contrary, SVR, proposed by Vapnik and his co-workers, has a global optimum and exhibits better prediction accuracy due to its implementation of the structural risk minimization principle, which considers both the training error and the capacity of the regression model (Cristianini & Shawe-Taylor, 2000; Vapnik, 1995). However, the practitioner has to determine in advance the type of kernel function and the associated kernel hyperparameters for SVR. Unsuitably chosen kernel functions or hyperparameter settings may lead to significantly poor performance (Chapelle, Vapnik, Bousquet, & Mukherjee, 2002; Duan, Keerthi, & Poo, 2003; Kwok, 2000). Most researchers use trial-and-error to choose proper values for the hyperparameters, which obviously takes a lot of effort. In addition, using a single kernel may not be sufficient to solve a complex problem satisfactorily, especially for stock market forecasting problems. Several researchers have adopted multiple kernels to deal with these problems (Bach, Lanckriet, & Jordan, 2004; Bennett, Momma, & Embrechts, 2002; Crammer, Keshet, & Singer, 2003; Gönen et al., 2008; Lanckriet, Cristianini, Bartlett, Ghaoui, & Jordan, 2004; Ong, Smola, & Williamson, 2005; Rakotomamonjy, Bach, Canu, &
Grandvalet, 2007, 2008; Sonnenburg, Ratsch, Schäfer, & Schölkopf, 2006; Szafranski, Grandvalet, & Rakotomamonjy, 2008; Tsang & Kwok, 2006; Wang, Chen, & Sun, 2008).

The simplest way to combine multiple kernels is by averaging them. But each kernel having the same weight may not be appropriate for the decision process, and therefore the main issue concerning multiple-kernel combination is to determine optimal weights for the participating kernels. Lanckriet et al. (2004) used a linear combination of matrices to combine multiple kernels. They transformed the optimization problem into a semidefinite programming (SDP) problem, which, being convex, has a global optimum. However, the amount of time and space required by this method is demanding. Other multiple-kernel learning algorithms include Bach et al. (2004), Sonnenburg et al. (2006), Rakotomamonjy et al. (2007), Rakotomamonjy, Bach, Canu, and Grandvalet (2008), Szafranski et al. (2008), and Gönen et al. (2008). These approaches deal with large-scale problems by iteratively using the sequential minimal optimization (SMO) algorithm (Platt, 1999) to update Lagrange multipliers and kernel weights in turn, i.e., Lagrange multipliers are updated with fixed kernel weights and kernel weights are updated with fixed Lagrange multipliers, alternately. Although these methods are faster than SDP, they are likely to suffer from local minimum traps. Multiple-kernel learning based on hyperkernels has also been studied (Ong et al., 2005; Tsang & Kwok, 2006). Tsang and Kwok (2006) reformulated the problem as a second-order cone programming (SOCP) form. Crammer et al. (2003) and Bennett et al. (2002) used boosting methods to combine heterogeneous kernel matrices.

We propose a regression model, which integrates multiple-kernel learning and SVR, to deal with the stock price forecasting problem. A two-stage multiple-kernel learning algorithm is developed to optimally combine multiple-kernel matrices for SVR. This learning algorithm applies SMO (Platt, 1999) and the gradient projection method (Bertsekas, 1999) iteratively to obtain Lagrange multipliers and optimal kernel weights. By this algorithm, advantages from different hyperparameter settings can be combined and overall system performance can be improved. Besides, the user need not specify the hyperparameter settings in advance, and trial-and-error for determining appropriate hyperparameter settings can then be avoided.

2. Support vector regression

In ε-SVR, training patterns whose images fall within the ε-insensitive band of the regression function are not penalized. Instead, the images which lie outside the ε-insensitive band are penalized, and slack variables are introduced to account for these images. For convenience, in the sequel, the term SVR is used to stand for ε-SVR. The objective function and constraints for SVR are

$$\begin{aligned}
\min_{\mathbf{w},b}\quad & \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle + C\sum_{i=1}^{l}(\xi_i+\hat{\xi}_i),\\
\text{s.t.}\quad & (\langle\mathbf{w},\phi(\mathbf{x}_i)\rangle+b)-y_i \le \varepsilon+\xi_i,\\
& y_i-(\langle\mathbf{w},\phi(\mathbf{x}_i)\rangle+b) \le \varepsilon+\hat{\xi}_i,\\
& \xi_i,\hat{\xi}_i \ge 0,\quad i=1,\ldots,l,
\end{aligned}\tag{1}$$

where l is the number of training patterns, C is a parameter which gives a tradeoff between model complexity and training error, and $\xi_i$ and $\hat{\xi}_i$ are slack variables for exceeding the target value by more than ε and for being below the target value by more than ε, respectively. Note that $\phi:\mathcal{X}\to\mathcal{F}$ is a possibly nonlinear mapping function from the input space to a feature space $\mathcal{F}$. Also, $\langle\cdot,\cdot\rangle$ indicates the inner product of the involved arguments. The regression hyperplane to be derived is

$$f(\mathbf{x})=\langle\mathbf{w},\phi(\mathbf{x})\rangle+b,\tag{2}$$

where w and b are the weight vector and the offset, respectively.

To solve Eq. (1), one can introduce the Lagrangian, take partial derivatives with respect to the primal variables and set the resulting derivatives to zero, and turn the Lagrangian into the following Wolfe dual form:

$$\begin{aligned}
\max_{\boldsymbol{\alpha},\hat{\boldsymbol{\alpha}}}\quad & \sum_{i=1}^{l}y_i(\hat{\alpha}_i-\alpha_i)-\varepsilon\sum_{i=1}^{l}(\hat{\alpha}_i+\alpha_i)-\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)K(\mathbf{x}_i,\mathbf{x}_j),\\
\text{s.t.}\quad & \sum_{i=1}^{l}(\hat{\alpha}_i-\alpha_i)=0,\\
& C \ge \alpha_i,\hat{\alpha}_i \ge 0,\quad i=1,\ldots,l,
\end{aligned}\tag{3}$$

where $K(\mathbf{x}_i,\mathbf{x}_j)=\langle\phi(\mathbf{x}_i),\phi(\mathbf{x}_j)\rangle$ is the kernel function.
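A single-kernel SVR of this form is straightforward to try out in practice. The snippet below is a minimal illustration rather than the authors' implementation: it assumes scikit-learn, whose SVR is backed by an SMO-type (libsvm) solver of the dual in Eq. (3). The values C = 1 and ε = 0.001 mirror the settings used later in Experiment I, while the toy data and the choice γ = 0.1 for the RBF kernel are arbitrary.

```python
import numpy as np
from sklearn.svm import SVR

# Toy regression data standing in for the stock features of Section 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# A single RBF kernel: C, epsilon and gamma must all be chosen in advance.
model = SVR(kernel="rbf", C=1.0, epsilon=0.001, gamma=0.1)
model.fit(X[:150], y[:150])
y_pred = model.predict(X[150:])
```

Choosing γ by hand is exactly the trial-and-error step that the multiple-kernel formulation of the next section is designed to avoid.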
3. Multiple-kernel support vector regression

In the multiple-kernel setting, the optimization problem becomes

$$\begin{aligned}
\min_{\mathbf{w},b,\boldsymbol{\mu}}\quad & \frac{1}{2}\langle\mathbf{w},\mathbf{w}\rangle + C\sum_{i=1}^{l}(\xi_i+\hat{\xi}_i),\\
\text{s.t.}\quad & (\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b)-y_i \le \varepsilon+\xi_i,\\
& y_i-(\langle\mathbf{w},\Phi(\mathbf{x}_i)\rangle+b) \le \varepsilon+\hat{\xi}_i,\\
& \xi_i,\hat{\xi}_i \ge 0,\quad i=1,\ldots,l,\\
& \mu_s \ge 0,\quad s=1,\ldots,M,\\
& \sum_{s=1}^{M}\mu_s=1,
\end{aligned}\tag{8}$$

where Φ is the vector of function mappings of Eq. (7).

By introducing the Lagrangian, as usual, Eq. (8) can be converted to the following Wolfe dual form:

$$\begin{aligned}
\min_{\boldsymbol{\mu}}\ \max_{\boldsymbol{\alpha},\hat{\boldsymbol{\alpha}}}\quad & \sum_{i=1}^{l}y_i(\hat{\alpha}_i-\alpha_i)-\varepsilon\sum_{i=1}^{l}(\hat{\alpha}_i+\alpha_i)-\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)\widetilde{K}(\mathbf{x}_i,\mathbf{x}_j),\\
\text{s.t.}\quad & \sum_{i=1}^{l}(\hat{\alpha}_i-\alpha_i)=0,\\
& C \ge \alpha_i,\hat{\alpha}_i \ge 0,\quad i=1,\ldots,l,\\
& \mu_s \ge 0,\quad s=1,\ldots,M,\\
& \sum_{s=1}^{M}\mu_s=1,
\end{aligned}\tag{9}$$

where

$$\widetilde{K}(\mathbf{x}_i,\mathbf{x}_j)=\langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle
=\mu_1\langle\phi_1(\mathbf{x}_i),\phi_1(\mathbf{x}_j)\rangle+\mu_2\langle\phi_2(\mathbf{x}_i),\phi_2(\mathbf{x}_j)\rangle+\cdots+\mu_M\langle\phi_M(\mathbf{x}_i),\phi_M(\mathbf{x}_j)\rangle
=\mu_1 K_1(\mathbf{x}_i,\mathbf{x}_j)+\mu_2 K_2(\mathbf{x}_i,\mathbf{x}_j)+\cdots+\mu_M K_M(\mathbf{x}_i,\mathbf{x}_j)
=\sum_{s=1}^{M}\mu_s K_s(\mathbf{x}_i,\mathbf{x}_j)\tag{10}$$

is a weighted sum of M kernel functions $K_1, K_2, \ldots, K_M$, corresponding to mapping functions $\phi_1, \phi_2, \ldots, \phi_M$, respectively.

Now, if we keep the kernel weights μ fixed, the inner maximization of Eq. (9) becomes a standard SVR dual in the combined kernel $\widetilde{K}$; this is the first stage of the proposed two-stage learning algorithm. This equation is, obviously, identical in form to Eq. (3) and can be solved by SMO (Platt, 1999). In the second stage, the Lagrange multipliers are kept fixed, and the weight vector μ is updated by the gradient projection method (Bertsekas, 1999). Since SMO is a standard algorithm for solving the Wolfe dual form, we won't describe it here; a detailed description of SMO can be found in Platt (1999). In the following, we describe how gradient projection is applied to obtain the optimal μ in the second stage.

Since the Lagrange multipliers are considered as known in the second stage, Eq. (9) can be rewritten as follows:

$$\min_{\boldsymbol{\mu}}\ J(\boldsymbol{\mu}),\qquad \text{s.t.}\ \mu_s \ge 0,\ s=1,\ldots,M,\qquad \sum_{s=1}^{M}\mu_s=1,\tag{13}$$

where

$$J(\boldsymbol{\mu})=\sum_{i=1}^{l}y_i(\hat{\alpha}_i-\alpha_i)-\varepsilon\sum_{i=1}^{l}(\hat{\alpha}_i+\alpha_i)-\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)\sum_{s=1}^{M}\mu_s K_s(\mathbf{x}_i,\mathbf{x}_j).\tag{14}$$

Note that J(μ) only depends on μ. By gradient projection (Bertsekas, 1999), we have

$$\boldsymbol{\mu}^{k+1}=\boldsymbol{\mu}^{k}+\eta^{k}\bigl(\hat{\boldsymbol{\mu}}^{k}-\boldsymbol{\mu}^{k}\bigr),\tag{15}$$

where $\boldsymbol{\mu}^{k}$ is the weight vector of the kth iteration, $0<\eta^{k}\le 1$ is the step size, and $\hat{\boldsymbol{\mu}}^{k}$ is defined as

$$\hat{\boldsymbol{\mu}}^{k}=\begin{cases}\mathbf{z}, & \text{if } \mathbf{z} \text{ belongs to the feasible region},\\ \mathbf{z}^{\perp}, & \text{otherwise},\end{cases}\tag{16}$$

$$\mathbf{z}=\boldsymbol{\mu}^{k}-s^{k}\,\nabla J(\boldsymbol{\mu}^{k}),\tag{17}$$

where $s^{k}$ is a positive scalar, and $\mathbf{z}^{\perp}$ denotes the projection of z onto the feasible region. The feasible region contains all the vectors
$\mathbf{v}=[v_1, v_2, \ldots, v_M]$ such that $v_s \ge 0$, $1 \le s \le M$, and $\sum_{s=1}^{M}v_s=1$. $\nabla J(\boldsymbol{\mu}^{k})$ is the following gradient:

$$\bigl[\nabla J(\boldsymbol{\mu}^{k})\bigr]_s=\frac{\partial J}{\partial\mu_s}\bigg|_{\mu_s=\mu_s^{k}}=-\frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\hat{\alpha}_i-\alpha_i)(\hat{\alpha}_j-\alpha_j)K_s(\mathbf{x}_i,\mathbf{x}_j)\tag{18}$$

for s = 1, …, M.

The projection $\mathbf{z}^{\perp}$ of z onto the feasible region can be obtained by solving the following constrained problem:

$$\min_{\mathbf{z}^{\perp}}\ \|\mathbf{z}-\mathbf{z}^{\perp}\|^{2},\qquad \text{s.t. } \mathbf{z}^{\perp}\text{ lies in the feasible region.}$$

[Figure (flowchart of the learning algorithm). Second stage: compute $\nabla J(\boldsymbol{\mu}^{k})$ by Eq. (18) and z by Eq. (17); if z does not belong to the feasible region, compute $[\mathbf{z}]^{\perp}$ by Eq. (20); obtain $\hat{\boldsymbol{\mu}}^{k}$ by Eq. (16); choose the step size by the Armijo rule, Eq. (21); update $\boldsymbol{\mu}^{k+1}$ by Eq. (15); set k = k + 1 and repeat until the stopping criterion is met, then stop and output α, $\hat{\boldsymbol{\alpha}}$, and μ.]
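The second stage is easy to prototype once the base kernel matrices are available. The sketch below is an illustration under stated assumptions, not the authors' implementation: scikit-learn's precomputed-kernel SVR (an SMO-type libsvm solver) plays the role of the first stage, the gradient comes from Eq. (18), and, because Eq. (20) and the Armijo rule of Eq. (21) are not part of this excerpt, a standard sorting-based Euclidean projection onto the simplex and a fixed step size are substituted. All function and variable names (two_stage_mksvr, project_to_simplex, and so on) are ours.

```python
import numpy as np
from sklearn.svm import SVR


def project_to_simplex(z):
    """Euclidean projection of z onto {v : v >= 0, sum(v) = 1}.

    Stand-in for the paper's Eq. (20); the sorting-based rule below is a
    standard way to realize the same feasible region.
    """
    u = np.sort(z)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(z) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(z - theta, 0.0)


def two_stage_mksvr(kernels, y, C=1.0, epsilon=0.001, n_iter=20,
                    step=1.0, eta=1.0):
    """Alternate SMO (stage 1) and gradient projection (stage 2).

    kernels: list of M precomputed l-by-l Gram matrices K_s.
    """
    M, l = len(kernels), len(y)
    mu = np.full(M, 1.0 / M)                      # uniform initial weights
    for _ in range(n_iter):
        # Stage 1: fix mu, solve the SVR dual for the combined kernel of Eq. (10).
        K_combined = sum(m * K for m, K in zip(mu, kernels))
        svr = SVR(kernel="precomputed", C=C, epsilon=epsilon)
        svr.fit(K_combined, y)
        beta = np.zeros(l)                        # beta_i = alpha_hat_i - alpha_i
        beta[svr.support_] = svr.dual_coef_.ravel()
        # Stage 2: fix the multipliers, update mu.
        # Eq. (18): [grad J]_s = -0.5 * beta^T K_s beta (the sign convention of
        # dual_coef_ does not matter in this quadratic form).
        grad = np.array([-0.5 * beta @ K @ beta for K in kernels])
        z = mu - step * grad                      # Eq. (17)
        feasible = (z >= 0.0).all() and np.isclose(z.sum(), 1.0)
        mu_hat = z if feasible else project_to_simplex(z)   # Eq. (16)
        mu = mu + eta * (mu_hat - mu)             # Eq. (15)
    return mu, svr
```

A caller builds the list of Gram matrices once, passes it in together with the training targets, and afterwards predicts with the returned SVR on the test-versus-training kernels weighted by the learned μ.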
4. Experimental results

To test the forecasting performance of our proposed method, we have conducted three experiments on datasets taken from the Taiwan Capitalization Weighted Stock Index (TAIEX). We also compare the performance of our proposed method with that of other methods, i.e., single-kernel support vector regression (SKSVR) (Tay & Cao, 2001), the autoregressive integrated moving average (ARIMA) model (Box & Jenkins, 1994), and the TSK-type fuzzy neural network (FNN) (Chang & Liu, 2008). For convenience, we abbreviate our multiple-kernel support vector regression method as MKSVR.

4.1. Experiment I

Table 1
The datasets for Experiment I.

Datasets | Training          | Validating        | Testing
DS-I     | 2002/10 – 2004/09 | 2004/10 – 2004/12 | 2005/01 – 2005/03
DS-II    | 2003/01 – 2004/12 | 2005/01 – 2005/03 | 2005/04 – 2005/06
DS-III   | 2003/04 – 2005/03 | 2005/04 – 2005/06 | 2005/07 – 2005/09
DS-IV    | 2003/07 – 2005/06 | 2005/07 – 2005/09 | 2005/10 – 2005/12

The time periods of the four datasets used in this experiment are listed in Table 1. The input features are derived from the daily closing prices $p_t$. The difference between the closing price and its n-day exponential moving average (EMA) is

$$\widehat{\mathrm{EMA}}_n(t)=p_t-\mathrm{EMA}_n(t),\tag{24}$$

and a lagged relative difference in percentage of price (RDP) is defined as

$$\mathrm{RDP}_n(t)=\frac{p_t-p_{t-n}}{p_{t-n}}\times 100.\tag{25}$$

Then the input variables are defined as $x_{t,1}=\widehat{\mathrm{EMA}}_{15}(t-5)$, $x_{t,2}=\mathrm{RDP}_5(t-5)$, $x_{t,3}=\mathrm{RDP}_{10}(t-5)$, $x_{t,4}=\mathrm{RDP}_{15}(t-5)$, and $x_{t,5}=\mathrm{RDP}_{20}(t-5)$. The root mean squared error (RMSE) is adopted for performance comparison, and is defined as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(y_t-\hat{y}_t)^2},\tag{26}$$

where $y_t$ and $\hat{y}_t$ are the desired output and the predicted output, respectively.
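The feature construction of Eqs. (24)–(26) is compact enough to state in code. The sketch below assumes NumPy and pandas and is illustrative only: the exact definition of EMA_n used by the authors is not part of this excerpt, so pandas' exponentially weighted mean with span = n is used as a stand-in, and the target variable of Experiment I (also not in the excerpt) is left to the caller.

```python
import numpy as np
import pandas as pd


def make_features(prices: pd.Series) -> pd.DataFrame:
    """Inputs x_{t,1}, ..., x_{t,5} built from Eqs. (24)-(25)."""
    ema15 = prices.ewm(span=15).mean()            # stand-in for EMA_15(t)
    ema_hat15 = prices - ema15                    # Eq. (24)
    rdp = {n: 100.0 * (prices - prices.shift(n)) / prices.shift(n)  # Eq. (25)
           for n in (5, 10, 15, 20)}
    return pd.DataFrame({
        "x1": ema_hat15.shift(5),                 # EMA-hat_15(t-5)
        "x2": rdp[5].shift(5),                    # RDP_5(t-5)
        "x3": rdp[10].shift(5),
        "x4": rdp[15].shift(5),
        "x5": rdp[20].shift(5),
    })


def rmse(y_true, y_pred):
    """Eq. (26)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```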
For SKSVR, there are three parameters that have to be determined in advance when using the RBF kernel, i.e., C, ε, and γ. We examine the forecasting performance of SKSVR with C = 1 and ε = 0.001. Besides, we try 37 different settings of the hyperparameter γ: from 0.01 to 0.09 with a stepping factor of 0.01, from 0.1 to 0.9 with a stepping factor of 0.1, from 1 to 9 with a stepping factor of 1, and from 10 to 100 with a stepping factor of 10.

[Fig. 3. Forecasting performance of SKSVR with different hyperparameters in Experiment I.]

From Fig. 3, for DS-I and DS-II, the best performance occurs when 0.1 ≤ γ ≤ 0.5; for DS-III, the best performance occurs when 50 ≤ γ ≤ 100; for DS-IV, the best performance occurs when 0.01 ≤ γ ≤ 0.05. The best RMSE values obtained by SKSVR are listed in Table 2.

Table 2
Performance comparison between best SKSVR and MKSVR in Experiment I (RMSE).

Methods | DS-I  | DS-II | DS-III | DS-IV
SKSVR   | 0.170 | 0.179 | 0.188  | 0.234
MKSVR   | 0.161 | 0.174 | 0.179  | 0.219

For multiple-kernel learning, a kernel combining all 37 different RBF kernels is considered, i.e., γ ∈ {0.01, 0.02, …, 0.09, 0.1, 0.2, …, 0.9, 1, 2, …, 9, 10, 20, …, 100}. Therefore, the combined kernel is a weighted sum of M = 37 RBF kernel matrices, with the weights learned by the two-stage algorithm. The RMSE values obtained by MKSVR are also listed in Table 2; MKSVR outperforms the best SKSVR on every dataset without any trial-and-error on γ.
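Building this 37-kernel dictionary is a one-liner per kernel. The sketch below assumes scikit-learn's rbf_kernel; the helper name build_kernel_dictionary is ours, and the resulting list is exactly what a routine such as the two-stage sketch of Section 3 would consume.

```python
from sklearn.metrics.pairwise import rbf_kernel

# The 37 gamma values used for the base RBF kernels in the experiments.
GAMMAS = ([0.01 * i for i in range(1, 10)]        # 0.01, ..., 0.09
          + [0.1 * i for i in range(1, 10)]       # 0.1, ..., 0.9
          + [float(i) for i in range(1, 10)]      # 1, ..., 9
          + [10.0 * i for i in range(1, 11)])     # 10, ..., 100
assert len(GAMMAS) == 37


def build_kernel_dictionary(X):
    """One RBF Gram matrix per gamma; their weighted sum realizes Eq. (10)."""
    return [rbf_kernel(X, gamma=g) for g in GAMMAS]
```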
4.2. Experiment II

In this experiment, we compare the performance of MKSVR with that of ARIMA (Box & Jenkins, 1994). The daily stock closing prices of TAIEX for the period of January 2004 to December 2005
are used. The one-season moving-window testing approach used in Pai and Lin (2005) for generating the training/validating/testing data is adopted, and four datasets, DS-V to DS-VIII, are obtained. For instance, DS-V contains the daily stock closing prices from January 2004 to September 2004 selected as the training dataset, the daily stock closing prices from October 2004 to December 2004 selected as the validating dataset, and the daily stock closing prices from January 2005 to March 2005 selected as the testing dataset. The corresponding time periods for DS-V to DS-VIII are listed in Table 3.

Table 3
The datasets for Experiment II and Experiment III.

Datasets | Training          | Validating        | Testing
DS-V     | 2004/01 – 2004/09 | 2004/10 – 2004/12 | 2005/01 – 2005/03
DS-VI    | 2004/04 – 2004/12 | 2005/01 – 2005/03 | 2005/04 – 2005/06
DS-VII   | 2004/07 – 2005/03 | 2005/04 – 2005/06 | 2005/07 – 2005/09
DS-VIII  | 2004/10 – 2005/06 | 2005/07 – 2005/09 | 2005/10 – 2005/12

Given the original daily stock closing prices $p=\{p_1, p_2, \ldots, p_t, \ldots\}$, we follow Box and Jenkins (1994) to derive training patterns $(\mathbf{x}_t, y_t)$ for this experiment. Firstly, the natural logarithmic transformation is applied to the original daily stock closing prices, resulting in another time series $y'=\{y'_1, y'_2, \ldots, y'_t, \ldots\}$ where $y'_t=\ln(p_t)$. The output sequence is $y=\{y_1, y_2, \ldots, y_t, \ldots\}$ where $y_t$ is defined by

$$y_t=y'_t-y'_{t-1}.\tag{27}$$

The notation ARIMA (m, o, n) is used. Each input vector consists of (m + n) components, i.e., $\mathbf{x}_t=[x_{t,1}\ x_{t,2}\ \ldots\ x_{t,m+n}]$. The values of the components depend on the model used. For instance, for ARIMA (2, 1, 3) we have $x_{t,1}=y_{t-1}$, $x_{t,2}=y_{t-2}$, $x_{t,3}=\epsilon_{t-1}$, $x_{t,4}=\epsilon_{t-2}$, and $x_{t,5}=\epsilon_{t-3}$, where $\epsilon_{t-1}$, $\epsilon_{t-2}$, and $\epsilon_{t-3}$ are forecast errors. The RMSE is defined as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(p_t-\hat{p}_t)^2}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\bigl(\exp(y'_t)-\exp(\hat{y}'_t)\bigr)^2}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\bigl(\exp(y'_t)-\exp(\hat{y}_t+y'_{t-1})\bigr)^2},\tag{28}$$

where $\hat{y}_t$ is the predicted output obtained from the predictor.
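The log-difference preprocessing and the price-scale error of Eq. (28) can be sketched in a few lines of NumPy. This is an illustration only; the indexing convention (each $\hat{y}_t$ is produced with $y'_{t-1}$ known) follows the third form of Eq. (28), and the function names are ours.

```python
import numpy as np


def log_diff_targets(prices):
    """y'_t = ln(p_t) and y_t = y'_t - y'_{t-1}, as in Eq. (27)."""
    y_log = np.log(np.asarray(prices, dtype=float))
    return y_log, np.diff(y_log)


def rmse_on_price_scale(prices, y_hat):
    """Eq. (28): undo the log-difference transform, then compare prices.

    y_hat must align with prices[1:]; each entry is a predicted difference
    y_t made with the previous log price y'_{t-1} known, so the predicted
    price is exp(y_hat + y'_{t-1}).
    """
    y_log = np.log(np.asarray(prices, dtype=float))
    p_true = np.asarray(prices, dtype=float)[1:]
    p_hat = np.exp(np.asarray(y_hat, dtype=float) + y_log[:-1])
    return float(np.sqrt(np.mean((p_true - p_hat) ** 2)))
```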
To compare ARIMA and MKSVR, we consider 25 models, which are ARIMA (m, 1, n) with m ∈ {1, 2, 3, 4, 5} and n ∈ {1, 2, 3, 4, 5}. The forecasting performance obtained by ARIMA on the four datasets is shown in Fig. 5. Interestingly, little variation occurs among different parameter settings with ARIMA. We run SKSVR and MKSVR on these four datasets, with the same settings as in Experiment I. The forecasting performance obtained by SKSVR is shown in Fig. 6. From this figure, we can see that SKSVR requires different γ settings for different datasets to obtain good performance.

[Fig. 6. Forecasting performance of SKSVR with different hyperparameters in Experiment II.]

The best RMSE values obtained by ARIMA and SKSVR are listed in Table 4. The RMSE values obtained by MKSVR for these datasets are also listed in Table 4. Obviously, MKSVR can do as well as, or even better than, the best ARIMA and SKSVR for each dataset. However, we don't need to worry about the hyperparameter settings in MKSVR. Fig. 7 shows the forecasting results for datasets DS-V to DS-VIII by MKSVR.

Table 4
Performance comparison among best ARIMA, best SKSVR, and MKSVR in Experiment II (RMSE).

Methods | DS-V   | DS-VI  | DS-VII | DS-VIII
ARIMA   | 45.421 | 48.400 | 45.674 | 56.957
SKSVR   | 45.686 | 48.667 | 46.401 | 55.294
MKSVR   | 45.634 | 47.297 | 44.142 | 54.882

[Fig. 7. Forecasting results of MKSVR on DS-V to DS-VIII: desired output vs. MKSVR prediction over each testing period.]
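For reference, the ARIMA (m, 1, n) grid of this experiment can be reproduced with statsmodels. The sketch below is not the authors' code: it fits each order on the training portion of the log prices, forecasts the held-out block in one shot (the paper's exact forecasting protocol is not spelled out in this excerpt), and scores on the price scale as in Eq. (28).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA


def best_arima(log_prices, n_test):
    """Search ARIMA(m, 1, n), m, n in {1, ..., 5}, by held-out price RMSE."""
    train, test = log_prices[:-n_test], log_prices[-n_test:]
    best_order, best_rmse = None, np.inf
    for m in range(1, 6):
        for n in range(1, 6):
            try:
                fit = ARIMA(train, order=(m, 1, n)).fit()
                pred = fit.forecast(steps=n_test)
                rmse = float(np.sqrt(np.mean((np.exp(test) - np.exp(pred)) ** 2)))
            except Exception:
                continue                      # some orders may fail to converge
            if rmse < best_rmse:
                best_order, best_rmse = (m, 1, n), rmse
    return best_order, best_rmse
```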
4.3. Experiment III
In this experiment, we compare the performance of MKSVR with that of FNN (Chang & Liu, 2008). We use the same datasets as in Experiment II, as listed in Table 3. Given the original daily stock closing prices $p=\{p_1, p_2, \ldots, p_t, \ldots\}$, we follow Chang and Liu (2008) to derive training patterns $(\mathbf{x}_t, y_t)$ for this experiment. Let $y'_t$ be $p_t$, i.e., $y'_t=p_t$. Two technical indices, SMA and BIAS, are used in generating the input vector $\mathbf{x}_t$. SMA, short for simple moving average, is used to emphasize the direction of a trend and to smooth out price and volume fluctuations. The n-day SMA of the tth day is defined as follows:

$$\mathrm{SMA}_n(t)=\frac{\sum_{i=t-5}^{t}p_i}{n}.\tag{29}$$

BIAS is used to observe the difference between the closing price and the moving average line. The n-day BIAS of the tth day is defined as follows:

$$\mathrm{BIAS}_n(t)=\frac{p_t-\mathrm{SMA}_n(t)}{\mathrm{SMA}_n(t)}\times 100.\tag{30}$$
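Both indices reduce to a rolling mean. A minimal pandas sketch (with n = 6, the value used in this experiment) is given below; the helper name sma_bias is ours.

```python
import pandas as pd


def sma_bias(prices: pd.Series, n: int = 6):
    """SMA_n and BIAS_n of Eqs. (29)-(30)."""
    sma = prices.rolling(window=n).mean()      # mean of the last n closing prices
    bias = 100.0 * (prices - sma) / sma
    return sma, bias


# Inputs of Experiment III: x'_{t,1} = SMA_6(t-1), x'_{t,2} = BIAS_6(t-1).
# sma6, bias6 = sma_bias(prices)
# x1, x2 = sma6.shift(1), bias6.shift(1)
```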
Let $x'_{t,1}=\mathrm{SMA}_6(t-1)$ and $x'_{t,2}=\mathrm{BIAS}_6(t-1)$. Now the underlying dataset is partitioned into K clusters by k-means (Hartigan & Wong, 1979), a popular clustering algorithm. Then the output variable $y_t$ is

$$y_t=\frac{y'_t-\bar{y}'_j}{\sigma_{y'_j}},\tag{31}$$

where $y'_t$ belongs to the jth cluster, and $\bar{y}'_j$ and $\sigma_{y'_j}$ are the mean and standard deviation in the $y'$ direction of the jth cluster. The input vector $\mathbf{x}_t=[x_{t,1}\ x_{t,2}]$ is obtained by

$$x_{t,i}=\frac{x'_{t,i}-\bar{x}'_{j,i}}{\sigma_{x'_{j,i}}}\tag{32}$$

for i = 1, 2, where $[x'_{t,1}\ x'_{t,2}]$ belongs to the jth cluster, and $\bar{x}'_{j,i}$ and $\sigma_{x'_{j,i}}$ are the mean and standard deviation, respectively, in the ith direction of the jth cluster. The RMSE is defined as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(p_t-\hat{p}_t)^2}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}(y'_t-\hat{y}'_t)^2}=\sqrt{\frac{1}{T}\sum_{t=1}^{T}\bigl(y'_t-(\hat{y}_t\,\sigma_{y'_j}+\bar{y}'_j)\bigr)^2},\tag{33}$$

where $\hat{y}_t$ is the predicted output and j is the index of its corresponding cluster.
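The cluster-wise normalization of Eqs. (31)–(33) can be sketched with scikit-learn's KMeans. This is an illustration under stated assumptions: the excerpt does not give the number of clusters K or the exact space that is clustered, so the joint vectors $[x'_{t,1}, x'_{t,2}, y'_t]$ are clustered with K = 3 here, and the function names are ours.

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_normalize(x_prime, y_prime, n_clusters=3, random_state=0):
    """Per-cluster z-scoring of inputs (Eq. (32)) and outputs (Eq. (31)).

    x_prime: array of shape (T, 2) holding [SMA_6(t-1), BIAS_6(t-1)].
    y_prime: array of shape (T,) holding p_t (recall y'_t = p_t here).
    """
    data = np.column_stack([x_prime, y_prime])
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(data)
    x_mean = np.array([x_prime[labels == j].mean(axis=0) for j in range(n_clusters)])
    x_std = np.array([x_prime[labels == j].std(axis=0) for j in range(n_clusters)])
    y_mean = np.array([y_prime[labels == j].mean() for j in range(n_clusters)])
    y_std = np.array([y_prime[labels == j].std() for j in range(n_clusters)])
    x = (x_prime - x_mean[labels]) / x_std[labels]          # Eq. (32)
    y = (y_prime - y_mean[labels]) / y_std[labels]          # Eq. (31)
    return x, y, labels, (y_mean, y_std)


def rmse_price_scale(y_prime, y_hat, labels, y_stats):
    """Eq. (33): un-normalize the prediction before measuring the error."""
    y_mean, y_std = y_stats
    p_hat = y_hat * y_std[labels] + y_mean[labels]
    return float(np.sqrt(np.mean((y_prime - p_hat) ** 2)))
```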
For FNN, standard three-layer networks are adopted. There are 2 nodes in the input layer and 1 node in the output layer. To examine the effect of different architectures on the performance, we set the number of hidden nodes from 2 to 15 with a stepping factor of 1. A hybrid learning algorithm incorporating particle swarm optimization (PSO) and recursive least squares (RLS) is used for refining the antecedent parameters and the consequent parameters, respectively. The forecasting performance obtained by FNN with different numbers of hidden nodes is depicted in Fig. 8. From this figure, we can see that FNN requires different numbers of hidden nodes for different datasets to obtain good performance.

[Fig. 8. Forecasting performance of FNN with different numbers of hidden nodes in Experiment III.]

We run SKSVR and MKSVR on these four datasets, with the same settings as in Experiment I. The forecasting performance obtained by SKSVR is shown in Fig. 9. Again, we can see that SKSVR requires different γ settings for different datasets to obtain good performance.

[Fig. 9. Forecasting performance of SKSVR with different hyperparameters in Experiment III.]

The best RMSE values obtained by FNN and SKSVR are listed in Table 5. The RMSE values obtained by MKSVR for the four datasets are also listed in Table 5. Obviously, MKSVR works better than the best FNN and SKSVR for each dataset, and we don't need to do trial-and-error with MKSVR. Fig. 10 shows the forecasting results for datasets DS-V to DS-VIII by MKSVR.

Table 5
Performance comparison among best FNN, best SKSVR, and MKSVR in Experiment III (RMSE).

Methods | DS-V   | DS-VI  | DS-VII | DS-VIII
FNN     | 59.260 | 64.232 | 50.395 | 61.774
SKSVR   | 45.543 | 47.434 | 46.669 | 57.625
MKSVR   | 45.531 | 47.398 | 45.907 | 57.301

[Fig. 10. Forecasting results of MKSVR on DS-V to DS-VIII: desired output vs. MKSVR prediction over each testing period.]
5. Conclusion

We have proposed a multiple-kernel support vector regression approach for stock market price forecasting. A two-stage multiple-kernel learning algorithm is developed to optimally combine multiple-kernel matrices for support vector regression. The learning algorithm applies sequential minimal optimization and gradient projection iteratively to obtain Lagrange multipliers and optimal kernel weights. By this algorithm, advantages from different hyperparameter settings can be combined and overall system performance can be improved. Besides, the user need not specify the hyperparameter settings in advance, and trial-and-error for determining appropriate hyperparameter settings can then be avoided. Experimental results, obtained by running on datasets taken from the Taiwan Capitalization Weighted Stock Index, have shown that our method performs better than other methods.

References

Bach, F. R., Lanckriet, G. R. G., & Jordan, M. I. (2004). Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the 21st international conference on machine learning (pp. 6–13).
Bennett, K. P., Momma, M., & Embrechts, M. J. (2002). MARK: A boosting algorithm for heterogeneous kernel models. In Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 24–31).
Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Massachusetts, USA: Athena Scientific.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroscedasticity. Journal of Econometrics, 31(3), 307–327.
Box, G. E. P., & Jenkins, G. M. (1994). Time series analysis: Forecasting and control (3rd ed.). Englewood Cliffs: Prentice Hall.
Cao, L., & Tay, F. E. H. (2001). Financial forecasting using support vector machines. Neural Computing & Applications, 10(2), 184–192.
Cao, L., & Tay, F. E. H. (2003). Support vector machine with adaptive parameters in financial time series forecasting. IEEE Transactions on Neural Networks, 14(6), 1506–1518.
Champernowne, D. G. (1948). Sampling theory applied to autoregressive schemes. Journal of the Royal Statistical Society: Series B, 10, 204–231.
Chang, P.-C., & Liu, C.-H. (2008). A TSK type fuzzy rule based system for stock price prediction. Expert Systems with Applications, 34(1), 135–144.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46(1–3), 131–159.
Crammer, K., Keshet, J., & Singer, Y. (2003). Kernel design using boosting. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems (Vol. 15, pp. 537–544). Cambridge, MA, USA: MIT Press.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge, UK: Cambridge University Press.
Duan, K., Keerthi, S., & Poo, A. N. (2003). Evaluation of simple performance measures for tuning SVM hyperparameters. Neurocomputing, 51, 41–59.
Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica, 50(4), 987–1008.
Fernando, P.-C., Julio, A. A.-R., & Javier, G. (2003). Estimating GARCH models using support vector machines. Quantitative Finance, 3(3), 163–172.
Gestel, T. V., Suykens, J. A. K., Baestaens, D. E., Lambrechts, A., Lanckriet, G., Vandaele, B., et al. (2001). Financial time series prediction using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, 12(4), 809–821.
Gönen, M., & Alpaydin, E. (2008). Localized multiple kernel learning. In Proceedings of the 25th international conference on machine learning (pp. 352–359).
Hansen, J. V., & Nelson, R. D. (1997). Neural networks and traditional time series methods: A synergistic combination in state economic forecasts. IEEE Transactions on Neural Networks, 8(4), 863–873.
Hartigan, J. A., & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100–108.
Kecman, V. (2001). Learning and soft computing: Support vector machines, neural networks, and fuzzy logic models. Cambridge, MA, USA: MIT Press.
Kim, K. J., & Han, I. (2008). Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19(2), 125–132.
Kwok, J. T.-Y. (2000). The evidence framework applied to support vector machines. IEEE Transactions on Neural Networks, 11(5), 1162–1173.
Kwon, Y.-K., & Moon, B.-R. (2007). A hybrid neurogenetic approach for stock forecasting. IEEE Transactions on Neural Networks, 18(3), 851–864.
Lanckriet, G. R. G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. I. (2004). Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5, 27–72.
Oh, S.-K., Pedrycz, W., & Park, H.-S. (2006). Genetically optimized fuzzy polynomial neural networks. IEEE Transactions on Fuzzy Systems, 14(1), 125–144.
Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043–1071.
Pai, P.-F., & Lin, C.-S. (2005). A hybrid ARIMA and support vector machines model in stock price forecasting. Omega: The International Journal of Management Science, 33(6), 497–505.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (Vol. 11, pp. 185–208). Cambridge, MA, USA: MIT Press.
Qi, M., & Zhang, G. P. (2008). Trend time-series modeling and forecasting with neural networks. IEEE Transactions on Neural Networks, 19(5), 808–816.
Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2007). More efficiency in multiple kernel learning. In Proceedings of the 24th international conference on machine learning (pp. 775–782).
Rakotomamonjy, A., Bach, F. R., Canu, S., & Grandvalet, Y. (2008). SimpleMKL. Journal of Machine Learning Research, 9, 2491–2521.
Sonnenburg, S., Ratsch, G., Schäfer, C., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.
Szafranski, M., Grandvalet, Y., & Rakotomamonjy, A. (2008). Composite kernel learning. In Proceedings of the 25th international conference on machine learning (pp. 1040–1047).
Taiwan Stock Exchange Corporation. <http://www.twse.com.tw/>.
Tay, F. E. H., & Cao, L. (2001). Application of support vector machines in financial time series forecasting. Omega: The International Journal of Management Science, 29(4), 309–317.
Tsang, I. W.-H., & Kwok, J. T.-Y. (2006). Efficient hyperkernel learning using second-order cone programming. IEEE Transactions on Neural Networks, 17(1), 48–58.
Valeriy, G., & Supriya, B. (2006). Support vector machine as an efficient framework for stock market volatility forecasting. Computational Management Science, 3(2), 147–160.
Vapnik, V. (1995). The nature of statistical learning theory. New York, USA: Springer-Verlag.
Wang, Z., Chen, S., & Sun, T. (2008). MultiK-MHKS: A novel multiple kernel learning algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 348–353.
Yang, H., Chan, L., & King, I. (2002). Support vector machine regression for volatile stock market prediction. In Proceedings of the third international conference on intelligent data engineering and automated learning (pp. 391–396).
Zarandi, M. H. F., Rezaee, B., Turksen, I. B., & Neshat, E. (2009). A type-2 fuzzy rule-based expert system model for stock price analysis. Expert Systems with Applications, 36(1), 139–154.
Zhang, D., & Zhou, L. (2004). Discovering golden nuggets: Data mining in financial application. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 34(4), 513–522.