Statistical Arbitrage With ML 1721555596
ScienceDirect
Available online at www.sciencedirect.com
www.elsevier.com/locate/procedia
Procedia Computer Science 202 (2022) 194–202
Abstract
In this paper, machine learning is used to investigate statistical arbitrage in the Chinese stock market. We use HS300 index
constituent stocks to construct pairs trades. The daily and monthly momentum of these stocks is used as a new set of input factors
to forecast stock price movements. We develop a trading approach and find that random forest (RF) outperforms the deep neural
network (DNN), XGBoost, support vector machine (SVM) and LSTM from January 2013 to August 2017.
© 2022 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0)
Peer-review under responsibility of the scientific committee of the International Conference on Identification, Information and
Knowledge in the Internet of Things, 2021
Keywords: Momentum; Random forest; Statistical arbitrage; Machine learning;
1. Introduction
Predicting stock price changes is a very difficult task. Statistical arbitrage is based on the principle of mean reversion:
if the price process of an asset is a stationary time series, then after a short-term deviation the price returns to its mean
in the following period under the effect of the equilibrium mechanism. Statistical arbitrage looks for a pair of stocks in the
market whose prices move closely together, so that when the price of one stock rises, the price of the other tends to rise at
the same time. Over a given period, the prices of the two stocks may nevertheless diverge to some extent because of random
factors specific to each company, and the strategy trades on the expectation that this divergence is temporary.
The literature on statistical arbitrage mostly uses traditional mathematical modeling methods, such as time series analysis
and stochastic control, to find pairs and to solve for the optimal trading signal. Machine learning algorithms are expected
to extract information from data more effectively. Huck (2009) developed a statistical arbitrage strategy that combines
neural networks with multi-criteria decision-making. His method consists of three steps: prediction, ranking and trading.
Huck (2010) further improved this method through multi-step-ahead predictions. Takeuchi and Lee (2013) developed an enhanced
momentum strategy for the CRSP stock market from 1965 to 2009. Moritz and Zimmermann (2014) tested a statistical arbitrage
strategy based on random forest on CRSP stock market data from 1968 to 2012 and found an average monthly risk-adjusted excess
return of 2%. When the feature set includes 86 company characteristics, the return increases to 2.28% per month. Krauss
et al. (2017) applied deep neural networks, gradient-boosted trees and random forests to the S&P 500 index from
1992 to 2015. Using return-based features, they found that a combination of these methods can
produce a return of 0.45% per day (before transaction costs). Fischer and Krauss (2018) used the LSTM network for the
same prediction task. Huck (2019) showed that these technical indicators can generate trading signals
for portfolios with a significant reversal effect and a short holding period (one to five days).
The momentum effect means that assets that have performed well in the past often continue to perform well
in the future. It was documented by Jegadeesh and Titman (1993, 2001). Rouwenhorst (1998)
conducted an empirical study on the stocks of 2000 listed companies in twelve European countries and found a
significant momentum effect in the European market. Schiereck et al. (1999) concluded that the momentum
strategy in the German market performs better over the medium and long term. Chui et al. (2000) show that momentum strategies
can outperform the market in Asian stock markets. Barroso and Santa-Clara (2015) show that momentum strategies can achieve
an excess return of more than 10% in several major global stock markets.
Therefore, this paper selects HS300 constituent stocks to investigate statistical arbitrage using machine learning
algorithms. We take momentum factors as input features to explore whether machine learning can make full
use of the momentum of stock prices to predict the rise and fall of stocks. The key task of the
machine learning methods employed is to accurately predict whether a stock outperforms the HS300 index as a
benchmark.
The remainder of this paper is organized as follows. Section 2 briefly covers the data sample, software packages,
and our methodology, i.e., the generation of training and trading sets, the construction of input sequences, the model
architecture and training as well as the forecasting and trading steps. Section 3 presents the results and discusses our
most relevant findings. Finally, Section 4 concludes.
We take 84 stocks from the HS300 index, which covers the Shanghai and Shenzhen stock exchanges. The data are obtained from
Wind and span a 9-year period, from January 2, 2011 to December 31, 2019. This paper uses the sliding window method
to generate the training set and trading set.
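
The sliding-window split can be sketched as below. This is only an illustration: the function name and the window lengths in the example are hypothetical, since the window sizes are not given at this point in the text.

```python
from typing import Iterator, Tuple


def sliding_windows(n_days: int, train_len: int, trade_len: int) -> Iterator[Tuple[range, range]]:
    """Yield (training, trading) index ranges that roll forward by one trading window."""
    if train_len < 1 or trade_len < 1:
        raise ValueError("window lengths must be positive")
    start = 0
    while start + train_len + trade_len <= n_days:
        train = range(start, start + train_len)
        trade = range(start + train_len, start + train_len + trade_len)
        yield train, trade
        start += trade_len  # trading windows do not overlap


# Hypothetical sizes: 10 days of data, 4-day training window, 2-day trading window.
windows = list(sliding_windows(n_days=10, train_len=4, trade_len=2))
for train, trade in windows:
    print(list(train), list(trade))
```

Each trading window immediately follows its training window, so the model never sees data from the period on which it is evaluated.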
The construction of input features mainly follows the processing method of Takeuchi and Lee (2013). First, the daily
returns of each stock over the past 21 days are extracted, followed by its monthly returns over 12 monthly horizons.
Next, the daily returns of the past 21 days are used to calculate the daily cumulative returns, and the monthly returns
are used to calculate the 12 monthly cumulative returns. After all daily and monthly momentum factors have been calculated,
the quantile of each momentum factor across all 84 stocks is computed. The quantile of each momentum factor is used as
an input feature. In this way, a total of 33 feature variables are constructed.
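
A minimal sketch of this feature construction follows, assuming the return definition of Eq. (1) and a simple rank-based quantile. The function names, the 21-day step between monthly horizons and the tie-breaking rule are assumptions made for illustration.

```python
from typing import Dict, List

DAILY_LOOKBACKS = list(range(1, 22))             # m = 1, ..., 21
MONTHLY_LOOKBACKS = list(range(42, 274, 21))     # m = 42, ..., 252, 273 (assumed step of 21)
LOOKBACKS = DAILY_LOOKBACKS + MONTHLY_LOOKBACKS  # 21 + 12 = 33 momentum factors


def cumulative_return(prices: List[float], t: int, m: int) -> float:
    """Return over the m days ending at index t: P_t / P_{t-m} - 1."""
    return prices[t] / prices[t - m] - 1.0


def cross_sectional_quantile(values: Dict[str, float]) -> Dict[str, float]:
    """Map each stock's factor value to its rank-based quantile in (0, 1].

    Ties are broken by insertion order, which is an arbitrary choice.
    """
    ordered = sorted(values, key=values.get)
    n = len(ordered)
    return {stock: (rank + 1) / n for rank, stock in enumerate(ordered)}


# Hypothetical one-factor example with four stocks.
factor = {"A": 0.05, "B": -0.02, "C": 0.01, "D": 0.10}
quantiles = cross_sectional_quantile(factor)
print(quantiles)
```

Using quantiles rather than raw returns puts all 33 factors on the same scale and makes the features comparable across dates.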
Let $P^{s} = (P_{t}^{s})_{t \in T}$ represent the price of stock $s$, where $s \in \{1, \dots, n\}$. Then the return is defined as
$$R_{t,m}^{s} = \frac{P_{t}^{s}}{P_{t-m}^{s}} - 1. \tag{1}$$
For the daily returns, the period ranges over $m \in \{1, 2, 3, \dots, 19, 20, 21\}$; for the monthly returns, it ranges over
$m \in \{42, \dots, 252, 273\}$.
196 Maojun Zhang et al. / Procedia Computer Science 202 (2022) 194–202
A binary variable
$$Y_{t+1}^{s} \in \{0, 1\} \tag{2}$$
is constructed for each stock $s$ to represent its rise or fall. If the return of stock $s$ exceeds the median return of
all stocks, the binary variable is $Y_{t+1}^{s} = 1$ (category 1); otherwise $Y_{t+1}^{s} = 0$ (category 0).
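
The labeling rule can be written in a few lines. The sketch below uses hypothetical tickers and returns; the function name is an assumption.

```python
from statistics import median
from typing import Dict


def median_labels(returns: Dict[str, float]) -> Dict[str, int]:
    """Label 1 if a stock's return exceeds the cross-sectional median, else 0."""
    med = median(returns.values())
    return {stock: int(r > med) for stock, r in returns.items()}


# Hypothetical next-day returns for four stocks.
next_day_returns = {"A": 0.012, "B": -0.004, "C": 0.003, "D": -0.010}
labels = median_labels(next_day_returns)
print(labels)
```

Because the threshold is the cross-sectional median, the two classes are roughly balanced on every trading day.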
In order to evaluate the real return level of the statistical arbitrage strategy in a sound way, we use
several classical performance indicators, including annualized return, return volatility, Sharpe ratio and Sortino
ratio. For each trading day $t+1$, the probability, predicted at $t$, that the return of stock $s$ exceeds the median
return of all stocks is $p_{t+1|t}^{s}$. Ranking the stocks by this probability places the undervalued stocks at the top
and the overvalued stocks at the bottom. We therefore buy the stocks with the highest probability of rising and sell
the stocks with the highest probability of falling.
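
The ranking and trading step can be sketched as follows. The number of stocks `k` on each side and the equal weighting are assumptions for illustration, and the tickers, probabilities and returns are made up.

```python
from typing import Dict, List, Tuple


def long_short_portfolio(probabilities: Dict[str, float], k: int) -> Tuple[List[str], List[str]]:
    """Return (longs, shorts): the k highest and the k lowest predicted probabilities."""
    if k < 1 or 2 * k > len(probabilities):
        raise ValueError("k must be at least 1 and at most half the number of stocks")
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked[:k], ranked[-k:]


def portfolio_return(longs: List[str], shorts: List[str], realized: Dict[str, float]) -> float:
    """Equal-weighted return of the long leg minus that of the short leg."""
    long_ret = sum(realized[s] for s in longs) / len(longs)
    short_ret = sum(realized[s] for s in shorts) / len(shorts)
    return long_ret - short_ret


# Hypothetical predicted probabilities and realized next-day returns.
probs = {"A": 0.71, "B": 0.35, "C": 0.55, "D": 0.48, "E": 0.62}
realized = {"A": 0.02, "B": -0.03, "C": 0.00, "D": -0.01, "E": 0.01}
longs, shorts = long_short_portfolio(probs, k=2)
daily_return = portfolio_return(longs, shorts, realized)
print(longs, shorts, daily_return)
```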
2.2. Methodology
According to the empirical research of Moritz and Zimmermann (2014), Krauss et al. (2017), Fischer and
Krauss (2018), Huck (2019) and others, the random forest algorithm performs well in financial time series
prediction. Random forest is not a single machine learning algorithm but an ensemble algorithm
based on decision trees: each of its estimators is a decision tree. The classification performance of the individual
decision trees determines the classification and prediction quality of the random forest. As a decision tree grows,
features are selected according to the principle of minimum impurity.
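
As an illustration of impurity-based splitting, the sketch below computes the Gini index, which Section 3 reports as the splitting criterion, and the weighted impurity of a candidate split. The function names are assumptions; a tree would pick the split with the lowest weighted impurity.

```python
from collections import Counter
from typing import Sequence


def gini_impurity(labels: Sequence[int]) -> float:
    """Gini index of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left: Sequence[int], right: Sequence[int]) -> float:
    """Size-weighted Gini index of the two child nodes of a candidate split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# A split that separates the classes perfectly has impurity 0,
# while one that leaves both children mixed keeps the parent's impurity.
perfect = split_impurity([0, 0], [1, 1])
useless = split_impurity([0, 1], [0, 1])
print(perfect, useless)
```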
SVM is grounded in statistical learning theory and has a rigorous theoretical basis: it rests on VC-dimension theory and the
structural risk minimization principle. By introducing a kernel function, the algorithm implicitly maps the data into a
high-dimensional space while avoiding complex computation there, which effectively overcomes the curse of dimensionality.
Although SVM theory and algorithms have developed considerably, some issues remain, such as training speed, the choice of
kernel function and storage requirements. Because of its advantages, SVM has been applied with good results to pattern
recognition, probability density function estimation, time series prediction, regression estimation and other fields.
Gradient boosting is one of the most powerful techniques for building prediction models. It is a representative
boosting algorithm among ensemble methods. An ensemble method builds multiple weak learners on the
data and aggregates their results to obtain better regression or classification
performance than a single model. A weak learner is a model that performs at least as well as random guessing,
that is, any model with a prediction accuracy of no less than 50%. There are several ways to combine
weak learners. The bagging method builds multiple independent weak learners in parallel.
Boosting methods instead build weak learners one by one and
accumulate them over many iterations. The best-known boosting algorithms include
AdaBoost and the gradient-boosted decision tree (GBDT). XGBoost is developed from GBDT. Traditional GBDT
uses only first-order derivative information of the loss function during optimization, whereas XGBoost uses a
second-order Taylor expansion of the loss function.
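
The second-order expansion can be written out as follows. The notation is an assumption introduced here for illustration: $l$ is the loss, $\hat{y}_i^{(t-1)}$ the prediction after $t-1$ trees, $f_t$ the tree added in round $t$, and $\Omega$ a regularization term.

```latex
\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\bigl(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\bigr) + \Omega(f_t)
\approx \sum_{i=1}^{n} \Bigl[\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \Bigr] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\bigl(y_i, \hat{y}_i^{(t-1)}\bigr)}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2}.
```

A first-order method keeps only the $g_i$ term; retaining $h_i$ as well lets each new tree account for the curvature of the loss.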
A deep neural network is composed of an input layer, one or more hidden layers and an output layer. The dimension
of the input layer equals the number of input features. The output layer is a classification or regression layer
matching the output space. All layers are composed of neurons, the basic units of the model. In the classical feed-forward
architecture, each neuron is fully connected to all neurons in the previous layer, and each connection carries a weight.
Moreover, the input layer and hidden layers have bias units, which act as activation
thresholds for the neurons in the subsequent layer.
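
A forward pass through such a fully connected network can be sketched in plain Python. The layer sizes, the weights and the choice of ReLU and sigmoid activations are assumptions for illustration only.

```python
import math
from typing import List


def dense(inputs: List[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """One fully connected layer: each neuron has one weight per input plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    """Hidden-layer activation (assumed here)."""
    return [max(0.0, v) for v in values]


def sigmoid(value: float) -> float:
    """Output activation that maps a score to a probability."""
    return 1.0 / (1.0 + math.exp(-value))


def forward(features, hidden_w, hidden_b, out_w, out_b) -> float:
    """Input layer -> one hidden layer -> single-neuron classification layer."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return sigmoid(dense(hidden, out_w, out_b)[0])


# Hypothetical network with 2 inputs, 2 hidden neurons and 1 output neuron.
prob = forward(
    features=[0.5, -1.0],
    hidden_w=[[1.0, 0.0], [0.0, 1.0]],
    hidden_b=[0.0, 0.0],
    out_w=[[2.0, 1.0]],
    out_b=[-1.0],
)
print(prob)
```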
An RNN is a recurrent network structure with the ability to retain information. The recurrent module of an RNN
passes information from one step of the network to the next, so the output of the hidden layer
at each time step depends on the information from the previous time step. This chain-like structure makes RNNs
closely related to sequence labeling. Training an RNN suffers from exploding and vanishing gradients,
and an RNN has difficulty retaining memory over long periods. The LSTM network is an extension of the RNN that is
specially designed to avoid the long-term dependency problem. The repeating module of an LSTM has
a different structure from that of the naive RNN: it contains four neural network layers that interact in a
special way.
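
In the standard LSTM formulation, these four layers are the forget gate, the input gate, the candidate layer and the output gate. The notation below is the common textbook one and is an assumption, not taken from this paper: $x_t$ is the input, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic function and $\odot$ the element-wise product.

```latex
\begin{aligned}
f_t &= \sigma\bigl(W_f [h_{t-1}, x_t] + b_f\bigr) && \text{(forget gate)} \\
i_t &= \sigma\bigl(W_i [h_{t-1}, x_t] + b_i\bigr) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\bigl(W_c [h_{t-1}, x_t] + b_c\bigr) && \text{(candidate cell state)} \\
o_t &= \sigma\bigl(W_o [h_{t-1}, x_t] + b_o\bigr) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The additive update of $c_t$ is what allows information, and gradients, to persist over many time steps.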
3. Performance analysis
The research method of this paper is divided into four steps. First, the data set in each study period is
divided into two parts, the training set and the trading set. The training set is used to train the machine learning
model, and the trading set is used to verify its predictive performance. Second, the
input features and output labels are generated. Third, the random forest model is trained on the training set
and its optimal parameter setting is determined. Fourth, the random forest is used
to predict on the trading set, the stocks are ranked by the prediction results, and the strategy goes long the stocks
with the highest probability of rising and short the stocks with the highest probability of falling.
The cumulative returns of the five algorithms during the trading period are shown in Figure 1. The
cumulative profit of the random forest algorithm is much higher than that of the other four
algorithms. The performance of the LSTM algorithm is close to that of the SVM algorithm, and during the whole trading
period its cumulative return is better than that of the HS300 index. The XGBoost algorithm performs
worse than the HS300 index in the early stage, performs better in the late trading period, and finally outperforms the
HS300 index. During the whole empirical period, the cumulative return of the DNN algorithm is poor and finally
fails to exceed the HS300 index. In addition, to explore the profitability of the machine learning algorithms,
the proportion of trading days with returns greater than 0 is counted both before and after
transaction costs, as shown in Table 1.
Before transaction costs, the proportions for RF, SVM and XGBoost are higher than that of the HS300 index, while those
for the DNN and LSTM algorithms are slightly lower. After transaction costs, only the RF and
XGBoost algorithms have a higher proportion than the HS300 index, while the other three algorithms have a lower
proportion. This analysis suggests that most of the machine learning algorithms have some ability
to predict China's stock market, and that the RF algorithm performs better than the other four algorithms.
According to the accuracy of the random forest model on the training set, the number of estimators
is set to 39, the Gini index is used as the splitting criterion, the maximum depth of each tree is 20, the
maximum number of features is 20, the minimum number of samples required to split a node is 30, and the minimum number of
samples per leaf is 25. The random forest model adopted here thus uses 39 estimators, each with many branches and a
large width.
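
For readers who want to reproduce this setting, the configuration could be expressed as below, assuming scikit-learn is the library used (the text does not name it). Reading the two minimum-size settings as `min_samples_split` and `min_samples_leaf` is an interpretation, and `build_random_forest` is a hypothetical helper.

```python
# Hyperparameters reported in the text, written with scikit-learn's parameter names.
# The mapping of the two minimum-size settings is an assumption.
RF_PARAMS = {
    "n_estimators": 39,       # number of decision trees
    "criterion": "gini",      # splitting criterion
    "max_depth": 20,          # maximum depth of each tree
    "max_features": 20,       # features considered at each split (out of 33)
    "min_samples_split": 30,  # minimum samples needed to split a node
    "min_samples_leaf": 25,   # minimum samples in a leaf
}


def build_random_forest(random_state: int = 0):
    """Create the classifier; scikit-learn is imported lazily so the config stands alone."""
    from sklearn.ensemble import RandomForestClassifier

    return RandomForestClassifier(random_state=random_state, **RF_PARAMS)


print(RF_PARAMS)
```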
We display strategy performance over time from January 2013 to December 2019. The transaction cost is 1.5‰.
The evaluation indicators of profitability in the three stages are shown in Table 1. From January 2013 to May
2015, the profitability of the statistical arbitrage strategy is better than that of the HS300 index, with a daily
average return of 0.0036 against 0.0012 for the HS300 index. In addition, the annualized return of
the statistical arbitrage strategy is 1.2438, while that of the HS300 index is 0.3143. The alpha value of
the strategy is 1.4862. In terms of risk, the annualized volatility of the statistical arbitrage strategy is 0.4222, while
that of the HS300 index is 0.2216. The Sharpe ratio of the statistical arbitrage strategy is 2.1295,
while that of the HS300 index in the same period is 1.3451. According to the above analysis, although the
volatility of the strategy is higher than that of the HS300 index, the statistical arbitrage strategy before transaction
costs is much better than the HS300 index. After transaction costs, the risk-return trade-off
of the statistical arbitrage strategy based on momentum factors and random forest is roughly equivalent to
that of the HS300 index.
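
The indicators discussed here can be computed from a series of daily returns as sketched below. Three points are assumptions: a 252-day year, a zero risk-free rate, and a flat daily deduction of 1.5‰ for transaction costs. The last is consistent with the mean returns in Table 1, which differ by 0.0015 before and after costs in every period; likewise, the daily standard deviation of 0.0266 times the square root of 252 gives approximately the reported annualized volatility of 0.4222.

```python
import math
from typing import List, Sequence

TRADING_DAYS = 252  # assumed number of trading days per year


def net_returns(gross: Sequence[float], cost: float = 0.0015) -> List[float]:
    """Subtract a flat daily transaction cost (assumed 1.5 per mille) from each return."""
    return [r - cost for r in gross]


def annualized_volatility(returns: Sequence[float]) -> float:
    """Sample standard deviation of daily returns scaled by sqrt(252)."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    return math.sqrt(variance) * math.sqrt(TRADING_DAYS)


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0) -> float:
    """Annualized mean excess return divided by annualized volatility."""
    mean = sum(returns) / len(returns)
    return (mean - risk_free) * TRADING_DAYS / annualized_volatility(returns)


def max_drawdown(returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the cumulative wealth curve (a non-positive number)."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = min(worst, wealth / peak - 1.0)
    return worst


# Hypothetical daily returns, before and after costs.
gross = [0.02, -0.01, 0.03, -0.02]
net = net_returns(gross)
print(sharpe_ratio(gross), sharpe_ratio(net), max_drawdown(gross))
```

A flat deduction lowers the mean return but leaves volatility essentially unchanged, which is why the Sharpe ratio falls after costs while the volatility figures barely move.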
From May 2015 to August 2017, the profitability of the statistical arbitrage strategy is better than that of the HS300 index
whether or not transaction costs are deducted, with a daily average return of 0.0031 before transaction costs and
0.0016 after transaction costs, against -0.0002 for the HS300 index in the same period. In
addition, during this period, the annualized return of the statistical arbitrage strategy is 0.9331, the annualized
return after transaction costs is 0.3242, and the annualized return of the HS300 index is -0.0851. Compared with the
HS300 index, the alpha value of the statistical arbitrage strategy is 1.1764, and the alpha value after transaction costs is
0.4909. In terms of risk, the annualized volatility of the statistical arbitrage strategy is 0.4902, the annualized
volatility after transaction costs is 0.4895, and the annualized volatility of the HS300 index is 0.2818. The Sharpe
ratio of the statistical arbitrage strategy is 1.5893, the Sharpe ratio after transaction costs is 0.8171, and the Sharpe
ratio of the HS300 index is -0.1729. Considering risk and return together, the statistical arbitrage
strategy is much better than the HS300 index both before and after transaction costs.
From August 2017 to December 2019, although the cumulative return of the statistical arbitrage strategy
eventually exceeds that of the HS300 index before transaction costs, the strategy fluctuates greatly and
is less stable than the HS300 index. The average daily return of the statistical arbitrage strategy is 0.0009, the average
daily return after transaction costs is -0.0006, and the average daily return of the HS300 index in the same period
is 0.0001. In addition, during this period, the annualized return of the statistical arbitrage strategy is 0.1448,
the annualized return after transaction costs is -0.2158, and the annualized return of the HS300 index is 0.0184.
Compared with the HS300 index, the alpha value of the statistical arbitrage strategy is 0.2401, and the alpha value
after transaction costs is -0.1505. In terms of risk, the annualized volatility of the statistical arbitrage strategy is
0.4133, the annualized volatility after transaction costs is 0.4127, and the annualized volatility of the HS300 index in
the same period is 0.1971. The Sharpe ratio of the statistical arbitrage strategy based on momentum factors and
random forest is 0.5337, the Sharpe ratio after transaction costs is -0.3823, and the Sharpe ratio of the HS300 index in
the same period is 0.1908. The risk-return trade-off of the statistical arbitrage strategy
after transaction costs is far worse than that of the HS300 index. Therefore, the statistical arbitrage
strategy can predict the rise and fall of the market fairly accurately from August 2017 to December 2019, but its
predictive ability is not sufficient to profit from the market after transaction costs.
From the above empirical analysis of the statistical arbitrage strategy based on momentum factors and random
forest, the following conclusions can be drawn. First, the statistical arbitrage strategy obtains a return far
exceeding that of the HS300 index from January 2013 to August 2017. From August 2017 to December 2019, the
strategy can still predict the rise and fall of the market fairly accurately, but its predictive ability is
not sufficient to make a profit: the return after transaction costs is worse than that of the HS300 index.
Second, the profit of the statistical arbitrage strategy declines over time, which may be because
the random forest parameters were chosen as optimal in the first study period.
A common way to test the robustness of a machine learning algorithm is to vary its parameter settings
and observe whether the experimental results change, so as to test
whether the model is stable. This paper changes the classification accuracy of the model by changing the number of
estimators in the random forest, and observes how the cumulative return and Sharpe ratio vary with the number
of estimators. As shown in Table 1, when the number of estimators of the statistical arbitrage strategy based on momentum
factors and random forest is between 37 and 41, the daily average return, annualized return, annualized volatility, alpha
value, beta value, Sharpe ratio, Sortino ratio and Calmar ratio of the strategy are very stable. When the number of
estimators is more than 41 or less than 37, these indicators begin to fluctuate greatly. Therefore,
the random forest model constructed in this paper is stable when the number of estimators is between 37 and 41.
Columns: P1 = January 2013 to May 2015; P2 = May 2015 to August 2017; P3 = August 2017 to December 2019.
                         Before transaction costs    After transaction costs                HS300 index
                             P1       P2       P3       P1       P2       P3       P1       P2       P3
Mean return              0.0036   0.0031   0.0009   0.0021   0.0016  -0.0006   0.0012  -0.0002   0.0001
Max                      0.1238   0.1846   0.1075   0.1222   0.1829   0.1059   0.0461   0.0671   0.0595
Quartile 1               0.0187   0.0157   0.0145   0.0171   0.0141   0.0130   0.0080   0.0063   0.0068
Median                   0.0049   0.0033   0.0009   0.0033   0.0018  -0.0006   0.0005   0.0008   0.0001
Quartile 3              -0.0104  -0.0098  -0.0135  -0.0119  -0.0113  -0.0150  -0.0061  -0.0049  -0.0064
Min                     -0.1206  -0.1356  -0.0928  -0.1219  -0.1369  -0.0941  -0.0770  -0.0875  -0.0584
Standard deviation       0.0266   0.0309   0.0260   0.0266   0.0308   0.0260   0.0140   0.0178   0.0124
Skewness                -0.5212   0.3569  -0.0143  -0.5212   0.3569  -0.0143  -0.2538  -1.0446  -0.0563
Kurtosis                 2.8906   5.7621   1.5208   2.8906   5.7621   1.5208   3.0785   5.8067   2.7703
Annualized return        1.2438   0.9331   0.1448   0.5371   0.3242  -0.2158   0.3143  -0.0851   0.0184
Annualized volatility    0.4222   0.4902   0.4133   0.4216   0.4895   0.4127   0.2216   0.2818   0.1971
Cumulative returns       5.0248   3.3262   0.3505   1.5993   0.8665  -0.4173   0.8357  -0.1793   0.0413
Alpha                    1.4862   1.1764   0.2401   0.7032   0.4909  -0.1505   0.0000   0.0000   0.0000
Beta                    -0.0447  -0.0057   0.1399  -0.0447  -0.0057   0.1397   1.0000   1.0000   1.0000
Sharpe ratio             2.1295   1.5893   0.5337   1.2328   0.8171  -0.3823   1.3451  -0.1729   0.1908
Downside risk            0.2885   0.3185   0.2852   0.2988   0.3285   0.2971   0.1472   0.2195   0.1387
Sortino ratio            3.1167   2.4461   0.7734   1.7395   1.2177  -0.5309   2.0240  -0.2220   0.2713
Maximum drawdown        -0.2594  -0.4588  -0.4797  -0.2856  -0.4730  -0.5796  -0.2482  -0.4670  -0.3246
Calmar ratio             4.7951   2.0338   0.3018   1.8805   0.6855  -0.3723   1.2667  -0.1822   0.0566
Omega ratio              1.4437   1.3600   1.0955   1.2382   1.1718   0.9367   1.2673   0.9648   1.0339
Estimators 36 37 38 39 40 41 42
4. Conclusion
This paper constructs input features from the perspective of momentum and explores the profitability of a
statistical arbitrage strategy based on momentum factors and random forest in the Chinese stock market. Specifically,
to predict the rise or fall of a stock one day ahead, we first calculate 33 indicators from the stock's daily
and monthly cumulative returns. After calculating all daily and monthly
momentum factors, we compute the quantile of each momentum factor across all 84 stocks. The quantile
corresponding to each momentum factor is used as an input feature of the random forest to predict the rise and fall
of the stock.
This paper finds that the statistical arbitrage strategy based on momentum factors and random forest obtains a
return far beyond that of the HS300 index from January 2013 to August 2017. From August 2017 to December 2019, the
strategy can still predict the rise and fall of the market fairly accurately, but its predictive ability is
not sufficient to make profits. In general, the statistical arbitrage strategy can obtain returns beyond the market during
the trading period. However, over the whole empirical period its profitability declines as trading time
extends. This decline is likely because the training data change from one training cycle to the next, so the
patterns in the data may change accordingly, whereas the random forest parameters were selected as optimal for
the first training cycle only. In
addition, the empirical study of this paper shows that the statistical arbitrage strategy is better than the traditional
momentum and reversal strategies. It covers the momentum information of the market over the past period and contains
more information about the market, so it has stronger profitability.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant Nos. 71961004,
72061007, 71461005, and Scientific Research of Suzhou University of Science and Technology Grant No.
332111807, 332111801.
References
[1] Chui A, Titman S, Wei K.C.J. Momentum, ownership structure, and financial crises: An analysis of Asian stock markets. Working paper,
University of Texas at Austin, 2000.
[2] Fischer T, Krauss C. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational
Research, 2018, 270(2): 654-669.
[3] Huck N. Pairs selection and outranking: An application to the S&P 100 index. European Journal of Operational Research, 2009, 196(2):
819-825.
[4] Huck N. Pairs trading and outranking: The multi-step-ahead forecasting case. European Journal of Operational Research, 2010, 207(3):
1702-1716.
[5] Huck N. Large data sets and machine learning: Applications to statistical arbitrage. European Journal of Operational Research, 2019,
278(1): 330-342.
[6] Jegadeesh N, Titman S. Returns to buying winners and selling losers: Implications for stock market efficiency. Journal of Finance,
1993, 48: 65-91.
[7] Jegadeesh N, Titman S. Profitability of Momentum Strategies: An Evaluation of Alternative Explanations. Journal of Finance, 2001,
56(2): 699-720.
[8] Krauss C, Do X A, Huck N. Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500. European
Journal of Operational Research, 2017, 259(2): 689-702.
[9] Moritz B, Zimmermann T. Tree-Based Conditional Portfolio Sorts: The Relation between Past and Future Stock Returns. Available at
http://dx.doi.org/10.2139/ssrn.2740751, 2016.
[10] Barroso P, Santa-Clara P. Momentum has its moments. Journal of Financial Economics, 2015, 116: 111-120.
[11] Rouwenhorst K.G. International momentum strategies. Journal of Finance, 1998, 53: 267-284.
[12] Schiereck D, Weber B M. Contrarian and Momentum Strategies in Germany. Financial Analysts Journal, 1999,
55(6): 104-116.
[13] Takeuchi L, Lee Y.-Y. A. Applying deep learning to enhance momentum trading strategies in stocks. Working paper, Stanford, 2013.