Scientific Programming
Volume 2022, Article ID 4698656, 15 pages
https://doi.org/10.1155/2022/4698656
Research Article
Stock Trading Strategies Based on Deep Reinforcement Learning
Received 1 August 2021; Revised 30 December 2021; Accepted 4 February 2022; Published 1 March 2022
Copyright © 2022 Yawei Li et al. This is an open access article distributed under the Creative Commons Attribution License, which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The purpose of stock market investment is to obtain more profits. In recent years, an increasing number of researchers have tried to implement stock trading based on machine learning. Facing the complex stock market, it is difficult to obtain effective information from multisource data and to implement dynamic trading strategies. To solve these problems, this study proposes a new deep reinforcement learning model to implement stock trading: it analyzes the stock market through stock data, technical indicators, and candlestick charts and learns dynamic trading strategies. The features that deep neural networks extract from the different data sources are fused as the state of the stock market, and the reinforcement learning agent makes trading decisions on this basis. Experiments on a Chinese stock market dataset and an S&P 500 dataset show that our trading strategy obtains higher profits than other trading strategies.
different data sources, which is more conducive to the analysis of the stock market. However, the fusion of multisource data is difficult.

To analyze the stock market more deeply and learn the optimal dynamic trading strategy, this study proposes a deep reinforcement learning model that integrates multisource data to implement stock trading. Through the analysis of stock data, technical indicators, and candlestick charts, we obtain a deeper feature representation of the stock market, which is conducive to learning the optimal trading strategy. Besides, the setting of the reward function in reinforcement learning cannot be ignored. In stock trading, investment risk should be considered alongside returns, and the two should be balanced reasonably. The Sharpe ratio (SR) represents the profit that can be obtained under a certain level of risk [23]. In this study, the reward function takes investment risk into consideration and combines the SR and the profit rate (PR) to promote the learning of optimal trading strategies.

To verify the effectiveness of the trading strategy learned by our proposed model, we compare it with other trading strategies on practical trading data. For stocks with different trends, our trading strategy obtains higher PR and SR and is therefore more robust. In addition, we conduct ablation experiments, and the results show that the trading strategy learned from analyzing the stock market with multisource data is better than those learned from a single data source. The main contributions of this paper are as follows:

(i) A new deep reinforcement learning model is proposed to implement stock trading; it integrates stock data and candlestick charts to analyze the stock market, which is more helpful for learning the optimal dynamic trading strategy.

(ii) A new reward function is proposed. In this study, investment risk is taken into account, and the sum of the SR and PR is taken as the reward function.

(iii) The experimental results show that the trading strategy learned by the deep reinforcement learning model proposed in this paper can obtain better profits for stocks with different trends.

2. Related Work

In recent years, a mass of machine learning methods have been applied to stock trading. Investors make trading decisions based on their judgment of the stock market. However, due to the influence of many factors, they cannot always make correct trading decisions in time as the stock market changes. Compared with traditional trading strategies, machine learning methods can learn trading strategies by analyzing information related to the stock market and discovering profit patterns that people without professional financial knowledge do not know about, which gives them more advantages.

There are some studies based on deep learning methods to implement stock trading. Deep learning methods usually implement stock trading by predicting the future trend or price of a stock [24-27]. In the financial field, deep learning methods are used for stock price prediction because they can obtain temporal characteristics from financial data [28, 29]; Chen et al. [30] analyzed 2D images transformed from financial data through a convolutional neural network (CNN) to classify the future price trend of stocks. When implementing stock trading based on a deep learning method, the higher the accuracy of the prediction, the more helpful it is to the trading decision. Conversely, when the prediction deviates greatly from the actual situation, it leads to faulty trading decisions. In addition, the trading strategies implemented by such methods are static and cannot be adjusted in time according to changes in the stock market.

Reinforcement learning can be used to implement stock trading through self-learning and autonomous decision-making. Chakole et al. [31] used the Q-learning algorithm [32] to find the optimal trading strategy, in which the unsupervised learning method K-means and candlestick charts were, respectively, used to represent the state of the stock market. Deng et al. [33] proposed the Deep Direct Reinforcement Learning model with fuzzy learning, the first attempt to combine deep learning and reinforcement learning in the field of financial transactions. Wu et al. [34] proposed a long short-term memory based (LSTM-based) agent that could perceive stock market conditions and trade automatically by analyzing stock data and technical indicators. Lei et al. [35] proposed the time-driven feature-aware jointly deep reinforcement learning model (TFJ-DRL), which combines a gated recurrent unit (GRU) with a policy gradient algorithm to implement stock trading. Lee et al. [36] proposed the HW_LSTM_RL structure, which first uses wavelet transforms to remove noise from stock data and then makes trading decisions by analyzing the stock data with deep reinforcement learning.

Existing studies on stock trading based on deep reinforcement learning mostly analyze the stock market through a single data source. In this study, we propose a new deep reinforcement learning model to implement stock trading and analyze the state of the stock market through stock data, technical indicators, and candlestick charts. In our proposed model, firstly, different deep neural networks are used to extract the features of the different data sources. Secondly, the features of the different data sources are fused. Finally, reinforcement learning makes trading decisions according to the fused features and continuously optimizes the trading strategy according to the profits. The setting of the reward function in reinforcement learning cannot be ignored. In this study, the SR is added to the reward function so that investment risk is taken into account alongside the profits.

3. Methods

We propose a new deep reinforcement learning model and implement stock trading by analyzing the stock market with multisource data. In this section, first, we introduce the overall deep reinforcement learning model; then the feature extraction process for the different data sources is described in
detail. Finally, the specific application of reinforcement learning to stock trading is introduced.

3.1. The Overall Structure of the Model. When implementing stock trading based on deep reinforcement learning, correctly analyzing the state of the stock market is more conducive to learning the optimal dynamic trading strategy. To obtain a deeper feature representation of the stock market state and learn the optimal dynamic trading strategy, we fuse the features of stock data, technical indicators, and candlestick charts. Figure 1 shows the overall structure of the model.

The deep reinforcement learning model we propose can be divided into two modules: the deep learning module, which extracts features from the different data sources, and the reinforcement learning module, which makes trading decisions. Candlestick chart features are extracted by a CNN and a bidirectional long short-term memory (BiLSTM) network; stock data and technical indicators are the input of the LSTM network for feature extraction. After the features of the different data sources are extracted, they are concatenated to implement feature fusion. The fused features can be regarded as the state of the stock market, and the reinforcement learning module makes trading decisions on this basis. In the reinforcement learning module, the algorithms used are Dueling DQN [37] and Double DQN [38].

3.2. Deep Learning Module. The purpose of this study is to obtain a deeper feature representation of the stock market state through the fusion of multisource data and thereby learn the optimal dynamic trading strategy. Although raw stock data can reflect changes in the stock market, they contain considerable noise. To reduce the impact of noise and perceive the changes of the stock market more objectively and accurately, relevant technical indicators are used as one of the data sources for analyzing the stock market in this study. Candlestick charts reflect the changes in the stock market from another perspective, so this paper also fuses the features of the candlestick charts.

3.3. Stock Data and Technical Indicator Feature Extraction. Due to the noise in stock data, we use relevant technical indicators to reduce its impact; the technical indicators reflect the changes in the stock market from different perspectives. In this paper, stock data and technical indicators are used as inputs to the LSTM network to better capture the main trends of stocks. The raw stock data we use include the opening price, closing price, high price, low price, and trading volume. The technical indicators used in this paper are the MACD, EMA, DIFF, DEA, KDJ, BIAS, RSI, and WILLR. The indicators are calculated by mathematical formulas based on stock prices and trading volumes [39], as reported in Table 1.

To facilitate subsequent calculations, we perform missing value processing on the stock data. First, the stock data are cleaned, and the missing data are supplemented with 0. In addition, the input of the neural network must be a real value, so we replaced the NaNs in the stock data and technical indicators with 0. Data with different value ranges may cause gradient explosion during neural network training [42]. To prevent this problem, we normalize the stock data and technical indicators; normalization transforms the data to a fixed interval. In this work, each dimension of the stock data and technical indicators is normalized, and the data are converted into the range [0, 1]. The normalization formula is as follows:

Xnorm = (X − Xmin) / (Xmax − Xmin),  (1)

where X represents the original data, Xmin and Xmax represent the minimum and maximum values of the original data, respectively, and Xnorm represents the normalized data. The neural network structure for extracting features from stock data and technical indicators is shown in Figure 2. The LSTM network is a variant of the recurrent neural network (RNN), and its unit structure is shown in Figure 3. LSTM solves the problems of gradient disappearance and gradient explosion in long-sequence training. In the LSTM network, f, i, and o represent a forget gate, an input gate, and an output gate, respectively. The forget gate is responsible for removing information from the cell state, the input gate is responsible for adding information to the cell state, and the output gate decides the next hidden state. Ct is the state of the memory cell at time t; C̃t is the value of the candidate state of the memory cell at time t; σ and tanh are the sigmoid and tanh activation functions, respectively; W and b represent the weight and bias matrices, respectively; xt is the input vector and ht is the output vector. In this paper, xt is the concatenation of the stock data and technical indicators; xt and the other quantities are calculated as follows:

xt = (open, low, close, ..., MACD, RSI, Willr),
ft = σ(Wf · [ht−1, xt] + bf),
it = σ(Wi · [ht−1, xt] + bi),
C̃t = tanh(Wc · [ht−1, xt] + bc),
Ct = ft ∗ Ct−1 + it ∗ C̃t,
ot = σ(Wo · [ht−1, xt] + bo),
ht = ot ∗ tanh(Ct).  (2)

In the entire process of feature extraction from stock data and technical indicators, the stock data and technical indicators are first cleaned and normalized. Then, the normalized data are used as the input of the LSTM network for feature extraction. Finally, the final feature is obtained through a two-layer LSTM network.

3.4. Candlestick Chart Feature Extraction. To extract more informative features, in this study, historical stock data are transformed into candlestick charts. The candlestick charts contain not only the candlesticks but also other information and can be divided into two parts: the upper part is the candlesticks and the moving average of the closing price, and the lower part is the trading volume histogram and its moving average.
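The per-column min-max normalization of equation (1) can be sketched as follows; the zero-span guard for constant columns is an addition for robustness, not part of the paper.

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each column of a (time, feature) array into [0, 1] per equation (1)."""
    x_min = x.min(axis=0)
    x_max = x.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant columns
    return (x - x_min) / span

# Example: one price column and one indicator column.
data = np.array([[10.0, 0.2],
                 [12.0, 0.6],
                 [11.0, 0.4],
                 [14.0, 1.0]])
norm = min_max_normalize(data)
# Each column now lies in [0, 1]; the first column becomes (price - 10) / (14 - 10).
```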
Figure 1: The overall structure of the model. vi represents the candlestick chart feature, vd represents the feature of the stock data and technical indicators, and the feature vector obtained by concatenating these two feature vectors is used as the input of the two fully connected (FC) layers. In this paper, FC layers are used to construct the dueling DQN network; the two FC branches represent the advantage function A(s, a) and the state value function V(s) in dueling DQN. The final Q value is obtained by adding the outputs of the two functions.
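The fusion step in Figure 1 is a plain concatenation of the two feature vectors before the FC layers. A minimal sketch follows; the 128-dimensional vd matches the LSTM hidden size reported in Section 4.4, while the 64-dimensional vi is an assumed, illustrative size.

```python
import numpy as np

rng = np.random.default_rng(42)
v_i = rng.normal(size=64)    # candlestick chart features from the CNN + BiLSTM (size assumed)
v_d = rng.normal(size=128)   # stock data / indicator features from the LSTM (hidden size 128)

# The fused stock market state fed to the dueling FC layers.
state = np.concatenate([v_i, v_d])
```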
Figure 2: The network structure for extracting features of stock data and technical indicators.
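The gate equations of the LSTM unit (equation (2)) can be written out directly. The following single-cell sketch uses random weights and a deliberately small hidden size for illustration; it is not the paper's trained two-layer network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step following equation (2).

    W maps the concatenated [h_{t-1}, x_t] to the four gate pre-activations
    (input i, forget f, candidate c~, output o), each of hidden size H.
    """
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0:H])                 # input gate
    f = sigmoid(z[H:2 * H])             # forget gate
    c_tilde = np.tanh(z[2 * H:3 * H])   # candidate cell state
    o = sigmoid(z[3 * H:4 * H])         # output gate
    c_t = f * c_prev + i * c_tilde      # new cell state
    h_t = o * np.tanh(c_t)              # new hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 13, 8   # e.g. 5 price fields + 8 indicators; small hidden size for the sketch
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(30, D)):    # one 30-day sliding window
    h, c = lstm_step(x_t, h, c, W, b)
# h is the window's feature vector v_d that enters the fusion step.
```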
Figure 4: Candlestick representation.
Figure 5: The network structure for extracting features of a candlestick chart. vi is the final features.
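Section 3.4 renders 390 × 290 candlestick charts from OHLC rows before the CNN + BiLSTM of Figure 5. As a toy illustration of that data-to-image transformation only (a one-pixel-wide wick/body rasterizer, not the paper's charting code):

```python
import numpy as np

def rasterize_candles(ohlc, height=16):
    """Map each day's (open, high, low, close) to one column of a tiny image.

    Wick pixels are 1 (high..low), body pixels are 2 (open..close); row 0 is
    the top of the image. A toy stand-in for the paper's 390x290 charts.
    """
    ohlc = np.asarray(ohlc, dtype=float)
    lo, hi = ohlc.min(), ohlc.max()
    img = np.zeros((height, len(ohlc)), dtype=np.uint8)

    def to_row(price):  # price -> pixel row (0 = top)
        return int(round((hi - price) / (hi - lo) * (height - 1)))

    for day, (o, h, l, c) in enumerate(ohlc):
        img[to_row(h):to_row(l) + 1, day] = 1          # wick spans high..low
        top, bot = sorted((to_row(o), to_row(c)))
        img[top:bot + 1, day] = 2                      # body spans open..close
    return img

candles = [(10, 12, 9, 11), (11, 13, 10, 12), (12, 14, 11, 13)]
img = rasterize_candles(candles, height=16)
# img is the kind of 2D array a CNN front end would consume.
```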
Q(s, a; θ, α, β) = A(s, a; θ, α) + V(s; θ, β),  (4)

where α and β, respectively, represent the parameters of the value function V(s) and the advantage function A(s, a), and θ represents the other parameters of the deep reinforcement learning model.

Double DQN changes the calculation of the Q value of the target network and alleviates the Q value overestimation problem of the DQN algorithm; it can be combined with the Dueling DQN algorithm to improve the overall performance of the model. The target Q value in the Double DQN algorithm is calculated as follows:

Yt = rt+1 + γ·Q(st+1, argmax_a Q(st+1, a; θt); θ′t),  (5)

where θ and θ′ represent the parameters of the main network and the target network, respectively.

The loss function is the mean square error between the Q values of the main network and the target network. The formula is as follows:

L(θ) = E[(Yt − Q(s, a; θ))²].  (6)

In this study, we analyze the stock market from stock data, technical indicators, and candlestick charts and fuse the features of the different data sources to obtain a representation of the stock market state that helps the agent learn the optimal dynamic trading strategy. The trading action at ∈ {long, neutral, short} = {1, 0, −1}, where long, neutral, and short represent buy, hold, and sell, respectively. When the trading action is long, cash is converted into stock as much as possible, and when the trading action is short, all shares are sold for cash. In addition, transaction costs in stock trading cannot be ignored; high-frequency transactions result in higher costs, and the transaction cost in this paper is 0.1% of the stock value [40]. The trading process is shown in Algorithm 1.

Algorithm 1: The trading process.
Input: stock data, technical indicators, candlestick charts;
Initialize the experience replay memory D to capacity C;
Initialize the main Q network with random weights θ;
Initialize the target Q network with θ′ = θ;
for episode = 1 to N do
    for t = 1 to T do
        The fusion of the features extracted by the deep learning module represents the environment state st;
        With probability ε choose a random action at;
        otherwise select at = argmax_a Q(st, a; θ);
        Get the reward rt and the next state st+1;
        Store the transition (st, at, rt, st+1) in D;
        if t mod n = 0 then
            Sample a minibatch (st, at, rt, st+1) randomly from D;
            Set Yt = rt if the state st+1 is terminal, otherwise Yt = rt + γ·Q(st+1, argmax_a Q(st+1, a; θt); θ′t);
            Train the network with the loss function L(θ) = E[(Yt − Q(s, a; θ))²];
            Update the target network parameters θ′ = θ every N steps;
        end if
    end for
end for

4. Experiment and Results

This section mainly introduces the datasets, evaluation metrics, comparison methods, implementation details, and the analysis of the experimental results.

4.1. Datasets. In this study, we verify the dynamic trading strategy learned by the proposed model on datasets of Chinese stocks and S&P 500 stocks and compare it with other trading strategies. The period covered by the datasets is from January 2012 to January 2021. The training period ranges from January 2012 to December 2018; the testing period ranges from January 2019 to January 2021. The stock data include the daily open price, high price, low price, close price, and trading volume of each stock, as shown in Table 2.

4.2. Metrics. The evaluation indicators used in this paper are the PR, the annualized rate of return (AR), the SR, and the maximum drawdown (MDD). The details are as follows:

(i) PR refers to the difference between the assets owned at the end of the stock transaction and the original assets, divided by the original assets.
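As a sketch of equations (4) and (5), the dueling aggregation and the Double DQN target can be written directly in NumPy. The linear "networks" and the discount factor below are placeholders rather than the paper's trained FC layers, and equation (4) is used exactly as printed, without the mean-advantage correction some dueling implementations add.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.9    # discount factor; illustrative value, not reported in the paper
ACTIONS = 3    # long, neutral, short -> {1, 0, -1}

def q_values(state, params):
    """Dueling head per equation (4): Q(s, a) = A(s, a) + V(s)."""
    W_adv, w_val = params
    advantage = W_adv @ state        # A(s, a): one entry per action
    value = float(w_val @ state)     # V(s): scalar state value
    return advantage + value

def double_dqn_target(r, s_next, main, target, terminal):
    """Equation (5): the main net picks the action, the target net scores it."""
    if terminal:
        return float(r)
    a_star = int(np.argmax(q_values(s_next, main)))
    return float(r + GAMMA * q_values(s_next, target)[a_star])

STATE_DIM = 4
main = (rng.normal(size=(ACTIONS, STATE_DIM)), rng.normal(size=STATE_DIM))
target = (rng.normal(size=(ACTIONS, STATE_DIM)), rng.normal(size=STATE_DIM))
s_next = rng.normal(size=STATE_DIM)

y = double_dqn_target(0.05, s_next, main, target, terminal=False)
# The loss of equation (6) is then (y - Q_main(s, a))^2 averaged over a minibatch.
```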
Table 2: Stock data structure example.

Date | Open | High | Low | Close | Volume
... | ... | ... | ... | ... | ...
2018/4/5 | 177.01 | 185.49 | 175.75 | 91.98 | 2749769
2018/4/6 | 185.30 | 190.95 | 184.08 | 94.13 | 2079350
2018/4/7 | 189.06 | 194.00 | 186.98 | 94.28 | 2565820
2018/4/8 | 192.03 | 196.00 | 188.20 | 93.98 | 1271339
... | ... | ... | ... | ... | ...

(ii) AR is the ratio of the profits to the principal over an investment period of one year. The formula is defined as follows:

AR = (total profits / principal) × (365 / trading days) × 100.  (7)

(iii) SR is a standardized comprehensive evaluation index that considers both profits and risks at the same time, eliminating the adverse impact of risk factors on performance evaluation.

(iv) MDD refers to the maximum loss that can be borne during trading; the lower the value, the better the performance of the trading strategy.

4.3. Baselines

(i) Buy and Hold (B&H) [41] refers to constructing a portfolio according to a determined asset allocation ratio and maintaining this portfolio during the holding period without changing the asset allocation. The B&H strategy is a passive investment strategy.

(ii) Based on the Q-learning algorithm, two models were proposed to implement stock trading [31]. The two models perceive the stock market environment in different ways: model 1 analyzes the stock market through the K-means method, and model 2 analyzes the stock market through candlestick charts. The experimental results show that model 1 performs better than model 2, so we only compare with model 1. In model 1, the cluster size n is set to 3, 6, and 9, and we compare with each setting.

(iii) An LSTM-based agent was proposed to learn the temporal relationships in the data and implement stock trading [34], and the effects of different combinations of technical indicators on trading strategies were verified. In this paper, only the group of technical indicators with the best results is compared.

(iv) The time-driven feature-aware jointly deep reinforcement learning (TFJ-DRL) model [35] uses a GRU to extract temporal features of stock data and technical indicators and implements stock trading through the policy gradient algorithm.

(v) HW_LSTM_RL [36] is a structure that combines wavelet transformation and deep reinforcement learning for stock trading and is a relatively new method.

4.4. Implementation Details. This study implements stock trading based on deep reinforcement learning and fuses the features of stock data, technical indicators, and candlestick charts as the state of the stock market. The LSTM network extracts features of the stock data and technical indicators, the size of its hidden layer is 128, and the size of the candlestick chart is 390 × 290. In the process of learning the optimal trading strategy, an episode is the trading period from January 2012 to December 2018, and 200 episodes are used in training. In the ε-greedy strategy of reinforcement learning, ε = 0.8. The length of the sliding time window is set to 30 days, and the learning rate is 0.0001.

4.5. Experimental Results

4.5.1. Comparative Experiment on the Chinese Stock Dataset. We select 10 Chinese stocks with different trends for comparative experiments, and the initial amount is 100,000 Chinese Yuan (CNY). The results of the experiment are shown in Tables 3-6. We select three stocks with different trend changes to further demonstrate the PR changes, as shown in Figure 6.

The traditional trading strategy B&H is a passive trading strategy, which has an advantage for stocks with rising prices. However, it does not perform well for stocks with large price fluctuations or downward trends. It can be seen from Figure 6 that for stock 002460, which has an upward trend, the B&H trading strategy can obtain a higher PR, while for the other two stocks, 601101 and 600746, with different trends, the PRs obtained are not as good as those of the other trading strategies. Trading strategies learned with the Q-learning algorithm are dynamic compared with the traditional B&H strategy: in most cases, the trading strategies learned by model 1 obtain higher PR, AR, and SR and lower MDD for stocks with different trends in different fields. Nonetheless, reinforcement learning alone lacks the ability to perceive the environment, and compared with the trading strategies learned by deep reinforcement learning models, the strategies learned by model 1 have no obvious advantage. LSTM_based, TFJ-DRL, and HW_LSTM_RL are all methods based on deep reinforcement learning, although the data sources they analyze are relatively limited; compared with the traditional B&H strategy and the strategies learned by model 1, the trading strategies learned by these methods obtain more profits for stocks with different trends in different fields. From the experimental results, we can see that the dynamic trading strategies learned by our proposed model perform best. It can be seen from Tables 3-6 that for stocks with different trends, the trading strategy learned by the model proposed in this paper performs better: compared with the other trading strategies, the evaluation indicators of our trading strategy are the best in most cases. On the whole, the average PR of our trading strategy is 162.87, the AR is 34.37, the SR is 0.97, and the MDD is 29.98, which outperform the other trading strategies.

4.5.2. Comparative Experiment on the S&P 500 Stock Dataset. To further verify the performance of the trading strategy learned by the proposed model, we selected 10 stocks with different trends from the S&P 500 dataset and compared our strategy with the other trading strategies; the initial capital is 100,000 USD. The results of the experiment are shown in Tables 7-10. In this section, we also show the price changes and PR changes of selected stocks in detail, as shown in Figure 7.
Table 6: MDD comparison of different methods in the Chinese stock market dataset. All values are MDD (%); Model 1 [31] is reported for n = 3, 6, and 9.

Stock | B&H [41] | Model 1 (n=3) | Model 1 (n=6) | Model 1 (n=9) | LSTM-based [34] | TFJ-DRL [35] | HW_LSTM_RL [36] | Ours
002460 | 41.32 | 60.70 | 50.50 | 69.37 | 56.68 | 48.05 | 41.32 | 41.32
601101 | 56.98 | 42.17 | 39.50 | 43.90 | 56.29 | 59.38 | 57.80 | 46.44
600746 | 42.58 | 44.56 | 40.65 | 36.01 | 52.47 | 53.53 | 39.68 | 49.84
600316 | 25.36 | 26.71 | 26.59 | 26.83 | 28.91 | 30.12 | 34.40 | 25.12
600028 | 50.03 | 53.17 | 51.77 | 56.82 | 17.20 | 43.95 | 23.39 | 15.58
600900 | 16.19 | 15.29 | 19.40 | 16.38 | 19.95 | 17.23 | 18.34 | 17.42
002129 | 35.48 | 35.21 | 38.76 | 34.92 | 35.48 | 42.83 | 32.57 | 30.41
600704 | 25.77 | 26.91 | 25.08 | 26.96 | 30.79 | 36.24 | 24.16 | 23.37
600377 | 19.76 | 19.84 | 20.62 | 18.26 | 20.31 | 23.48 | 18.22 | 15.45
300122 | 38.07 | 38.28 | 40.96 | 36.05 | 42.60 | 39.12 | 38.55 | 34.89
Average | 35.15 | 36.28 | 35.58 | 36.55 | 36.07 | 39.39 | 32.84 | 29.98
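The Average row of Table 6 is simply the column mean; for example, the "Ours" average can be checked directly:

```python
import numpy as np

# MDD (%) of our strategy for the 10 Chinese stocks, from Table 6.
ours = [41.32, 46.44, 49.84, 25.12, 15.58, 17.42, 30.41, 23.37, 15.45, 34.89]
avg = round(float(np.mean(ours)), 2)
# avg == 29.98, matching the Average row of Table 6.
```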
Figure 6: Changes in the price and PR of stocks with different trends. (a) 002460. (b) 002460. (c) 600746. (d) 600746. (e) 601101. (f) 601101.
Table 10: MDD comparison of different methods in the S&P 500 market dataset. All values are MDD (%); Model 1 [31] is reported for n = 3, 6, and 9.

Stock | B&H [41] | Model 1 (n=3) | Model 1 (n=6) | Model 1 (n=9) | LSTM-based [34] | TFJ-DRL [35] | HW_LSTM_RL [36] | Ours
AMD | 34.28 | 34.06 | 34.82 | 33.95 | 35.47 | 39.05 | 30.41 | 32.11
AAL | 74.72 | 30.20 | 82.01 | 41.15 | 77.64 | 76.10 | 48.31 | 34.06
BIO | 19.72 | 19.23 | 21.65 | 19.28 | 25.83 | 22.64 | 20.16 | 18.09
BLK | 42.27 | 45.21 | 49.32 | 38.75 | 50.28 | 48.45 | 40.91 | 36.27
TSLA | 33.73 | 31.37 | 36.04 | 38.61 | 41.29 | 30.73 | 35.20 | 27.06
AAPL | 20.37 | 21.38 | 23.09 | 19.71 | 22.74 | 21.29 | 18.43 | 16.51
GOOGL | 32.41 | 43.69 | 38.62 | 44.41 | 20.22 | 30.84 | 22.52 | 24.01
IBM | 38.96 | 33.68 | 31.71 | 32.65 | 31.60 | 30.37 | 24.62 | 30.55
HST | 52.05 | 47.26 | 49.25 | 47.81 | 59.18 | 48.20 | 44.36 | 43.08
PG | 23.14 | 23.27 | 27.18 | 29.02 | 33.38 | 30.43 | 25.58 | 20.06
Average | 37.17 | 32.93 | 39.37 | 34.53 | 39.76 | 37.81 | 31.05 | 28.18
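Section 4.5.2 compares the per-stock profit rates of the two markets with a Mann-Whitney U test. A minimal rank-based computation of the U statistic is sketched below; it does no tie correction, and the return samples are invented for illustration.

```python
import numpy as np

def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y (distinct values assumed; no tie correction)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    pooled = np.concatenate([x, y])
    ranks = pooled.argsort().argsort() + 1     # 1-based ranks of the pooled sample
    r_x = ranks[: len(x)].sum()                # rank sum of sample x
    return r_x - len(x) * (len(x) + 1) / 2

# Hypothetical per-stock profit rates from the two markets.
cn_returns = [0.12, 0.35, -0.05, 0.40, 0.22]
sp_returns = [0.15, 0.30, 0.01, 0.38, 0.18]
u = mann_whitney_u(cn_returns, sp_returns)
# u == 13.0 here; values near len(x)*len(y)/2 = 12.5 suggest similar distributions.
```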
Figure 7: Changes in the price and PR of stocks with different trends. (a) GOOGL. (b) GOOGL. (c) IBM. (d) IBM. (e) AAL. (f) AAL.
For stocks with different trends in the S&P 500, our trading strategy also performs better. It can be seen from Tables 7-10 that, compared with the other trading strategies, our trading strategy obtains higher PR, SR, and AR and a better MDD, and on the whole, our PR reaches 212.96 and our SR reaches 1.16, obviously higher than the other trading strategies.

To further verify the performance of our proposed model in different stock markets, we conducted a Mann-Whitney U test on the profit rates of the 20 stocks selected from the Chinese stock market and the S&P 500 stock market. The result, P = 0.677 > 0.05, indicates that there is no significant difference between the returns obtained by our model in the Chinese stock market and the S&P 500 stock market, that is, it has good generalization ability.

4.5.3. Reward Function Comparison Experiment. In this section, we set the reward function with the SR and without the SR and select two stocks from each of the Chinese stock market and the S&P 500 stock market for comparison experiments. The stocks selected from the Chinese stock market are 600746 and 601101, and the stocks selected from the S&P 500 are GOOGL and IBM; the training period ranges from January 2012 to December 2018, and the testing period ranges from January 2019 to January 2021. The experimental results are shown in Table 11.

It can be seen from the experimental results in Table 11 that, compared with the reward function without the SR, the trading strategies learned when the reward function contains the SR perform better overall. Different from existing algorithmic trading based on deep reinforcement learning, most of which takes the profit rate as the reward function, this study takes investment risk into account and uses the sum of the SR and the profit rate as the reward function, and the learned trading strategy obtains better PR, AR, SR, and MDD.

4.5.4. Ablation Experiments. In this section, to verify the effectiveness of multisource data fusion, we conduct an ablation experiment. Three groups of comparative
experiments are carried out, all of which implement stock trading based on deep reinforcement learning. The first group analyzes the stock market through stock data and technical indicators; the second group analyzes the stock market through candlestick charts; and the third group analyzes the stock market through stock data, technical indicators, and candlestick charts. We select the trading data of GOOGL stock from January 2012 to January 2021 as the dataset for this section, in which January 2012 to December 2018 is the training data and January 2019 to January 2021 is the test data. The comparison results are shown in Figure 8 and Table 12.

The experimental results show that, compared with the trading strategies learned in the first two groups, the trading strategy learned in the third group obtains higher PR, SR, and AR and lower MDD. This also proves that the analysis of multisource data yields a deeper feature representation of the stock market, which is more conducive to learning the optimal trading strategy.

5. Conclusion

Correct analysis of the stock market state is one of the challenges when implementing stock trading based on deep reinforcement learning. In this research, we analyze multisource data based on deep reinforcement learning to implement stock trading. Stock data, technical indicators, and candlestick charts reflect the changes in the stock market from different perspectives; we use different deep neural networks to extract the features of each data source and fuse the features, and the fused features are more helpful for learning the optimal dynamic trading strategy.

It can be concluded from the experimental results that trading strategies learned with the deep reinforcement learning method can be dynamically adjusted according to stock market changes and therefore have more advantages. Compared with other trading strategies, our trading strategy performs better for stocks with different trends, and its average SR value is the highest, which means that under the same risk, our trading strategy can obtain more profits. However, textual information such as investor comments and news events also affects the fluctuations of stock prices and cannot be ignored; obtaining informative data from relevant texts is important for stock trading. In future research, we will consider different text information and train more stable trading strategies.

Data Availability

The experimental data in this article can be downloaded from Yahoo Finance (https://finance.yahoo.com/).