DeepScalper: A Risk-Aware Reinforcement Learning Framework to Capture Fleeting Intraday Trading Opportunities
Bo An
Reinforcement learning (RL) techniques have shown great success …

[Figure 1: The workflow of a professional human intraday trader.]

1 INTRODUCTION
Similar to other application scenarios of reinforcement learning (RL), quantitative trading also interacts with an environment (the financial market) and maximizes the accumulative profit. Recently, RL has been considered an attractive approach to quantitative trading, as it allows training agents that directly output profitable trading actions with better generalization ability across various market conditions [21]. Although there have been many successful RL-based quantitative trading methods [11, 35, 39], a vast majority of existing methods focus on relatively low-frequency trading scenarios (e.g., day-level) and fail to capture fleeting intraday investment opportunities. To design profitable RL methods for intraday trading, there are two major challenges. First, different from low-frequency scenarios, intraday trading involves a much larger, high-dimensional, fine-grained action space that represents the price and quantity of orders for more accurate control in the financial market. Second, we need to learn a meaningful multi-modality intraday market representation that takes the macro-level market, the micro-level market, and risk into consideration.

Consider the workflow of a professional human intraday trader (Figure 1): the trader first collects both micro-level and macro-level market information to analyze the market status. Then, he predicts the short-term and long-term price trends based on the market status. Later on, he conducts risk management and makes the final trading decisions (when and how much to long/short at what price). Among many successful trading firms, this workflow plays a key role in designing robust and profitable intraday trading strategies. Inspired by it, we propose DeepScalper, a novel RL framework for intraday trading that tackles the above challenges by mimicking the information collection, short-term and long-term market analysis, and risk management procedures of human intraday traders. Our main contributions are three-fold:

• We apply the dueling Q-network with action branching to effectively optimize intraday trading agents over a high-dimensional fine-grained action space. A novel reward function with a hindsight bonus is designed to encourage RL agents to make trading decisions with a long-term horizon over the entire trading day.
• We propose a multi-modality encoder-decoder architecture to learn a meaningful temporal intraday market embedding, which incorporates both micro-level and macro-level market information. Furthermore, we design a risk-aware auxiliary task to keep a balance between profit and risk.
• Through extensive experiments on real-world market data spanning over three years on six financial futures, we show that DeepScalper significantly outperforms many state-of-the-art baselines in terms of four financial criteria, and we demonstrate DeepScalper's practical applicability to intraday trading with a series of exploratory and ablative studies.

2 RELATED WORK

2.1 Traditional Finance Methods
Technical analysis [25], which believes that past price and volume data have the ability to reflect future market conditions [10], is the foundation of traditional finance methods. Millions of technical indicators have been designed to generate trading signals [18]. Momentum [14] and mean reversion [28] are two well-known types of traditional finance methods based on technical analysis. Momentum trading assumes that the past trend of financial assets has a tendency to continue in the future; Time Series Momentum [24] and Cross Sectional Momentum [16] are two classic momentum strategies. In contrast, mean reversion strategies such as Bollinger bands [4] assume that the price of financial assets will eventually revert to the long-term mean.

However, traditional finance methods are not perceptive enough to capture fleeting intraday patterns and only perform well in certain market conditions [7]. In recent years, many advanced machine learning methods have significantly outperformed traditional finance methods.

2.2 Prediction-Based Methods
Prediction-based methods first formulate quantitative trading as a supervised learning task that predicts the future return (regression) or price movement (classification). Trading decisions are then generated from the prediction results with a heuristic strategy generator (e.g., top-k in [40]). Specifically, Wang et al. [34] combine an attention mechanism with LSTM to model correlated time steps. To improve the robustness of LSTM, Feng et al. [12] apply adversarial training techniques for stock prediction. Zhang et al. [43] propose a novel State Frequency Memory (SFM) recurrent network with the Discrete Fourier Transform (DFT) to discover multi-frequency patterns in stock markets. Liu et al. [20] introduce a multi-scale two-way neural network to predict stock trends. Sun et al. [32] propose an ensemble learning framework to train a mixture of trading experts.

However, the high volatility and noisy nature of the financial market make it extremely difficult to accurately predict future prices [10]. Furthermore, there is a noticeable gap between prediction signals and profitable trading actions [13]. Thus, the overall performance of prediction-based methods is not satisfying either.

2.3 Reinforcement Learning Methods
Recent years have witnessed the successful marriage of reinforcement learning and quantitative trading, as RL allows training agents that directly output profitable trading actions with better generalization ability across various market conditions [31]. Neuneier [26] makes the first attempt to learn trading strategies using Q-learning. Moody and Saffell [23] propose a policy-based method, namely recurrent reinforcement learning (RRL), for quantitative trading. However, traditional RL approaches have difficulty selecting market features and learning good policies in large-scale scenarios. To tackle these issues, many deep RL approaches have been proposed to learn market embeddings from high-dimensional data. Jiang et al. [17] use DDPG to dynamically optimize cryptocurrency portfolios. Deng et al. [7] apply fuzzy learning and deep learning to improve financial signal representation. Yu et al. [41] propose a model-based RL framework for daily-frequency portfolio trading. Liu et al. [21] propose an adaptive DDPG-based framework with imitation learning. Ye et al. [39] propose a State-Augmented RL (SARL) framework based on DPG with financial news as additional states.
Although there are many efforts on utilizing RL for quantitative trading, a vast majority of existing RL methods focus on relatively low-frequency scenarios (e.g., day-level) and fail to capture fleeting intraday investment opportunities. We propose DeepScalper to fill this gap by mimicking the workflow of human intraday traders.

3 PROBLEM FORMULATION
In this section, we first introduce necessary preliminaries and the objective of intraday trading. Next, we provide a Markov Decision Process (MDP) formulation of intraday trading.

Figure 2: A snapshot of a 4-level limit order book (LOB), with bid price/quantity levels on the buy side and ask price/quantity levels on the sell side.
3.1 Intraday Trading
Intraday trading is a fundamental quantitative trading task, where traders actively long/short one pre-selected financial asset within the same trading day to maximize future profit. Below are some necessary definitions for understanding the problem:

Definition 1. (OHLCV) OHLCV is a type of bar chart directly obtained from the financial market. The OHLCV vector at time $t$ is denoted as $\mathbf{x}_t = (p^o_t, p^h_t, p^l_t, p^c_t, v_t)$, where $p^o_t$ is the open price, $p^h_t$ is the high price, $p^l_t$ is the low price, $p^c_t$ is the close price and $v_t$ is the volume.

Definition 2. (Technical Indicator) A technical indicator is a feature calculated by a formulaic combination of the original OHLCV to uncover an underlying pattern of the financial market. We denote the technical indicator vector at time $t$ as $\mathbf{y}_t = \bigcup_k y^k_t$, where $y^k_t = f_k(\mathbf{x}_{t-h}, ..., \mathbf{x}_t, \theta^k)$ and $\theta^k$ is the parameter of technical indicator $k$.

Definition 3. (Limit Order) A limit order is an order placed to long/short a certain number of shares at a specific price. It is defined as a tuple $l = (p_{target}, \pm q_{target})$, where $p_{target}$ represents the submitted target price, $q_{target}$ represents the submitted target quantity, and $\pm$ represents the trading direction (long/short).

Definition 4. (Limit Order Book) As shown in Figure 2, a limit order book (LOB) contains publicly available aggregate information on the limit orders of all market participants. It is widely used as a market microstructure [22] in finance to represent the relative strength of the buy and sell sides. We denote an $m$-level LOB at time $t$ as $\mathbf{b}_t = (p^{b_1}_t, p^{a_1}_t, q^{b_1}_t, q^{a_1}_t, ..., p^{b_m}_t, p^{a_m}_t, q^{b_m}_t, q^{a_m}_t)$, where $p^{b_i}_t$ is the level-$i$ bid price, $p^{a_i}_t$ is the level-$i$ ask price, and $q^{b_i}_t$ and $q^{a_i}_t$ are the corresponding quantities.

Definition 5. (Matching System) The matching system is an electronic system that matches buy and sell orders in the financial market. It is usually used to execute orders for different traders in the exchange.

Definition 6. (Position) The position $pos_t$ at time $t$ is the amount and direction of a financial asset owned by a trader. It represents a long (short) position when $pos_t$ is positive (negative).

Definition 7. (Net Value) Net value is the normalised sum of cash and the value of financial assets held by a trader. The net value at time $t$ is denoted as $n_t = (c_t + p^c_t \times pos_t)/c_1$, where $c_t$ is the cash at time $t$ and $c_1$ is the initial amount of cash.

In real-world intraday trading, traders are allocated some cash in the account at the beginning of each trading day. During trading hours, traders analyze the market and make their trading decisions. Then, they submit their orders (desired price and quantity) to the matching system. The matching system executes orders at the best available price (possibly at multiple prices when market liquidity is not enough for large orders) and then returns execution results to traders. At the end of the trading day, traders close all remaining positions at the market price to avoid overnight risk and hold 100% cash again. The objective of intraday trading is to maximize the accumulative profit over a period of multiple continuous trading days (e.g., half a year).

Compared to conventional low-frequency trading scenarios, intraday trading is more challenging, since intraday traders need to capture tiny price fluctuations with much less responsive time (e.g., 1 min). In addition, intraday trading involves a large fine-grained trading action space that represents a limit order, to pursue more accurate control of the market.

3.2 MDP Formulation
We formulate intraday trading as an MDP constructed by a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$. Specifically, $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a finite set of actions. $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is a state transition function, which consists of a set of conditional transition probabilities between states. $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, where $\mathbb{R}$ is a continuous set of possible rewards and $R$ indicates the immediate reward of taking an action in a state. $\gamma \in [0, 1)$ is the discount factor. A (stationary) policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ assigns each state $s \in \mathcal{S}$ a distribution over actions, where $a \in \mathcal{A}$ has probability $\pi(a|s)$. In intraday trading, $\mathcal{S}$, $\mathcal{A}$ and $R$ are set as follows.

State. Due to the particularity of the financial market, the state $s_t \in \mathcal{S}$ at time $t$ can be divided into three parts: the macro-level market state $s^a_t \in \mathcal{S}^a$, the micro-level market state $s^i_t \in \mathcal{S}^i$ and the trader's private state $s^p_t \in \mathcal{S}^p$. Specifically, following [40], we use a vector $\mathbf{y}_t$ of 11 technical indicators and the original OHLCV vector $\mathbf{x}_t$ as the macro-level state, the historical LOB sequence $(\mathbf{b}_{t-h}, ..., \mathbf{b}_t)$ with length $h+1$ as the micro-level state, and the trader's private state $\mathbf{z}_t = (pos_t, c_t, t_t)$, where $pos_t$, $c_t$ and $t_t$ are the current position, cash and remaining time. The entire state space can be denoted as $\mathcal{S} = (\mathcal{S}^a, \mathcal{S}^i, \mathcal{S}^p)$. Compared to previous formulations, we introduce the LOB and the trader's private state as additional information to effectively capture intraday trading opportunities.
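To make the three-part state concrete, the following minimal sketch (our own illustration, not the authors' released code; the array shapes and the helper names `build_state` and `net_value` are assumptions) shows how the macro, micro and private components could be assembled at each minute:

```python
import numpy as np

def build_state(ohlcv, indicators, lob_seq, pos, cash, steps_left):
    """Assemble the three-part MDP state s_t = (macro, micro, private).

    ohlcv:      (5,) array  -- (open, high, low, close, volume) at time t
    indicators: (11,) array -- technical indicator vector y_t computed from past OHLCV
    lob_seq:    (h+1, 4*m) array -- last h+1 snapshots of an m-level LOB, b_{t-h..t}
    pos, cash, steps_left: scalars -- trader's private information z_t
    """
    macro_state = np.concatenate([ohlcv, indicators])        # s_t^a
    micro_state = lob_seq                                     # s_t^i, kept as a sequence for the LSTM
    private_state = np.array([pos, cash, steps_left], float)  # s_t^p
    return {"macro": macro_state, "micro": micro_state, "private": private_state}

def net_value(cash, close_price, pos, initial_cash):
    """Net value n_t = (c_t + p_t^c * pos_t) / c_1 from Definition 7."""
    return (cash + close_price * pos) / initial_cash
```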
Figure 3: An overview of the proposed DeepScalper framework. We show four individual building blocks: (a) micro-level encoder, (b) macro-level encoder, (c) risk-aware auxiliary task, and (d) RL optimization with action branching.
Action. Previous works [7, 21] lie in low-frequency trading scenarios and generally stipulate that the agent trades a fixed quantity at the market price, applying a coarse action space with three options (long, hold, and short). However, when focusing on relatively high-frequency trading scenarios (e.g., intraday trading), tiny price fluctuations (e.g., 1 cent) are of vital importance to the final profit, which makes the market-price execution and fixed-quantity assumptions unacceptable. In the real-world financial market, traders have the freedom to decide both the target price and the quantity by submitting limit orders. We therefore use a more realistic two-dimensional fine-grained action space for intraday trading, which represents a limit order as a tuple $(p_{target}, \pm q_{target})$, where $p_{target}$ is the target price, $q_{target}$ is the target quantity and $\pm$ is the trading direction (long/short). It is also worth pointing out that when the quantity is zero, we skip the current time step with no order placement.

Reward. We define the reward function as the change of account P&L, which reflects the value fluctuation (profit & loss) of the account:
$$ r_t = \underbrace{(p^c_{t+1} - p^c_t) \times pos_t}_{\text{instant profit}} - \underbrace{\delta \times p^c_t \times |pos_t - pos_{t-1}|}_{\text{transaction fee}} $$
where $p^c_t$ is the close price at time $t$, $\delta$ is the transaction fee rate and $pos_t$ is the position at time $t$.

4 DEEPSCALPER
In this section, we introduce DeepScalper (overviewed in Figure 3) for intraday trading. We describe its four components in order: 1) RL optimization with action branching; 2) the reward function with a hindsight bonus; 3) the intraday market embedding; and 4) the risk-aware auxiliary task.

4.1 RL Optimization with Action Branching
Compared to conventional low-frequency trading scenarios, intraday trading tries to seize fleeting tiny price fluctuations with much less responsive time. To provide more accurate trading decisions, intraday trading involves a much larger two-dimensional (price and quantity) fine-grained action space. However, learning from scratch for tasks with large action spaces remains a critical challenge for RL algorithms [3, 42]. For intraday trading, while human traders can usually detect the subset of feasible trading actions in a given market condition, RL agents may attempt inferior actions, thus wasting computation time and leading to capital loss.

As possible intraday trading actions can be naturally divided into two components (i.e., desired price and quantity), we propose to adopt the Branching Dueling Q-Network (BDQ) [33] for decision-making. Particularly, as shown in Figure 3(d), BDQ distributes the representation of the state-dependent action advantages across the price and quantity branches, and adds a single additional branch to estimate the state-value function. Finally, the advantages and the state value are combined via an aggregating layer to output the Q-values for each action dimension. During inference, these Q-values are queried with argmax to generate a joint action tuple that determines the final trading action.

Formally, intraday trading is formulated as a sequential decision-making problem with two action dimensions of $|p| = n_p$ discrete relative price levels and $|q| = n_q$ discrete quantity proportions. The individual branch's Q-value $Q_d$ at state $s \in \mathcal{S}$ and action $a_d \in \mathcal{A}_d$ is expressed in terms of the common state value $V(s)$ and the corresponding (state-dependent) action advantage [37] $Adv_d(s, a_d)$ for $d \in \{p, q\}$:
$$ Q_d(s, a_d) = V(s) + \Big( Adv_d(s, a_d) - \frac{1}{n_d} \sum_{a'_d \in \mathcal{A}_d} Adv_d(s, a'_d) \Big) $$
We train our Q-value function approximator as a Q-network with parameters $\theta_q$ based on one-step temporal-difference learning with target $y_d$ in a recursive fashion:
$$ y_d = r + \gamma \max_{a'_d \in \mathcal{A}_d} Q^-_d(s', a'_d), \quad d \in \{p, q\} $$
where $Q^-_d$ denotes branch $d$ of the target network $Q^-$, $r$ is the reward and $\gamma$ is the discount factor.
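To make the branching architecture concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' released implementation; layer sizes, all names, and the index-to-order decoding scheme in `decode_action` are assumptions) of the dueling aggregation over the price and quantity branches and the argmax inference step:

```python
import torch
import torch.nn as nn

class BranchingDuelingQNet(nn.Module):
    """Shared trunk, a common state-value head, and one advantage head per action dimension."""

    def __init__(self, state_dim, n_price, n_quantity, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)             # V(s)
        self.adv_price = nn.Linear(hidden, n_price)   # Adv_p(s, .)
        self.adv_qty = nn.Linear(hidden, n_quantity)  # Adv_q(s, .)

    def forward(self, state):
        h = self.trunk(state)
        v = self.value(h)
        q_branches = []
        for head in (self.adv_price, self.adv_qty):
            adv = head(h)
            # Q_d(s, a_d) = V(s) + (Adv_d(s, a_d) - mean over a'_d of Adv_d(s, a'_d))
            q_branches.append(v + adv - adv.mean(dim=1, keepdim=True))
        return q_branches  # [(B, n_price), (B, n_quantity)]

def select_action(q_net, state):
    """Greedy joint action: independent argmax per branch (price index, quantity index)."""
    with torch.no_grad():
        q_p, q_q = q_net(state.unsqueeze(0))
    return int(q_p.argmax(dim=1)), int(q_q.argmax(dim=1))

def decode_action(price_idx, qty_idx, mid_price, tick, max_pos):
    """Map discrete indices to a limit order (p_target, signed q_target).

    Assumed scheme: price levels are ticks around the current price and quantity
    proportions are fractions of the maximum position, signed for long/short.
    """
    p_target = mid_price + (price_idx - 2) * tick   # e.g., 5 relative price levels centred on mid_price
    q_target = (qty_idx - 2) / 2.0 * max_pos        # e.g., {-1, -0.5, 0, 0.5, 1} * max_pos
    return p_target, q_target
```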
Finally, we calculate the following loss function:
$$ L_q(\theta_q) = \mathbb{E}_{(s,a,r,s') \sim D} \Big[ \frac{1}{N} \sum_{d \in \{p,q\}} \big( y_d - Q_d(s, a_d, \theta_q) \big)^2 \Big] $$
where $D$ denotes a prioritized experience replay buffer and $a$ denotes the joint-action tuple $(p, q)$. By differentiating the Q-network loss function with respect to $\theta_q$, we get the following gradient:
$$ \nabla_{\theta_q} L_q(\theta_q) = \mathbb{E}_{(s,a,r,s') \sim D} \Big[ \big( r + \gamma \max_{a'_d \in \mathcal{A}_d} Q^-_d(s', a'_d) - Q_d(s, a_d, \theta_q) \big) \nabla_{\theta_q} Q_d(s, a_d, \theta_q) \Big] $$
In practice, we optimize the loss function by stochastic gradient descent, rather than computing the full expectations in the above gradient, to maintain computational efficiency.
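Continuing the sketch above, the per-branch one-step TD loss could be written as follows (again our own hedged rendering; the prioritized replay buffer is abstracted as a plain mini-batch and the target-network synchronisation is only noted in the comment):

```python
import torch
import torch.nn.functional as F

def td_loss(q_net, target_net, batch, gamma=0.99):
    """Average over branches of (y_d - Q_d(s, a_d))^2, with y_d = r + gamma * max_a' Q_d^-(s', a')."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]  # a: (B, 2) long indices
    q_branches = q_net(s)
    with torch.no_grad():
        target_branches = target_net(s_next)
    losses = []
    for d, (q_d, q_d_target) in enumerate(zip(q_branches, target_branches)):
        q_taken = q_d.gather(1, a[:, d:d + 1]).squeeze(1)  # Q_d(s, a_d, theta_q)
        y_d = r + gamma * q_d_target.max(dim=1).values     # one-step TD target per branch
        losses.append(F.mse_loss(q_taken, y_d))
    return sum(losses) / len(losses)

# Usage sketch: sample a batch from the (prioritized) replay buffer, compute td_loss,
# take an SGD/Adam step on q_net's parameters, and periodically sync target_net with q_net.
```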
Figure 4: Illustration of the motivation of the hindsight bonus (a buy/sell pair that gets a positive return but misses the main increasing trend).
4.2 Reward Function with Hindsight Bonus
One major issue with training directly on the profit & loss reward is that RL agents tend to pay too much attention to short-term price fluctuations [36]. Although such an agent performs well in capturing local trading opportunities, ignoring the overall long-term price trend can lead to significant losses. Here, we design a novel reward function with a hindsight bonus to tackle this issue. To demonstrate the motivation of the hindsight bonus, consider the pair of buy/sell actions in Figure 4: the trader feels happy at the point of selling the stock, since the price of the stock has increased. However, this sell decision is actually a bad decision in the long run; the trader feels disappointed before 12:00 since he/she misses the main increasing wave due to the short horizon. It is more reasonable for RL agents to evaluate one trading action from both short-term and long-term perspectives. Inspired by this, we add a hindsight bonus, the expected profit of holding the asset for a longer period $h$ weighted by a term $w$, into the reward function to add a long-term horizon while training intraday RL agents:
$$ r^{hind}_t = r_t + w \times \underbrace{(p^c_{t+h} - p^c_t) \times pos_t}_{\text{hindsight bonus}} $$
where $p^c_t$ is the close price at time $t$, $w$ is the weight of the hindsight bonus, $h$ is the horizon of the hindsight bonus and $pos_t$ is the position at time $t$.

Noticeably, we only use the reward function with the hindsight bonus during training, to better understand the market. During the test period, we use the original reward $r_t$ to calculate the profits. Furthermore, the hindsight reward function largely ignores the details of price fluctuations between $t+2$ and $t+h-1$ and focuses on the trend of this period, which is computationally efficient and shows robust performance in practice.
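A compact sketch of both rewards (our own illustration; `closes` is a per-minute close-price series, we assume $t+h$ stays within the trading day, and the absolute position change in the fee term is our reading of the formula in Section 3.2):

```python
def pnl_reward(closes, t, pos_t, pos_prev, fee_rate):
    """Base reward r_t: instant profit minus transaction fee (Section 3.2)."""
    instant_profit = (closes[t + 1] - closes[t]) * pos_t
    transaction_fee = fee_rate * closes[t] * abs(pos_t - pos_prev)
    return instant_profit - transaction_fee

def hindsight_reward(closes, t, pos_t, pos_prev, fee_rate, w, h):
    """Training-time reward r_t^hind = r_t + w * (p_{t+h}^c - p_t^c) * pos_t."""
    bonus = (closes[t + h] - closes[t]) * pos_t
    return pnl_reward(closes, t, pos_t, pos_prev, fee_rate) + w * bonus
```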
4.3 Intraday Market Embedding
To learn a meaningful multi-modality intraday market embedding, we propose an encoder-decoder architecture to represent the market at the micro level and the macro level, respectively.

For the micro-level encoder, we choose LOB data and the trader's private state to learn the micro-level market embedding. The LOB is widely used to analyze the relative strength of the buy and sell sides based on micro-level trading behaviors, and the trader's private state is considered insightful for capturing micro-level trading opportunities [27]. At time $t$, we have a sequence of historical LOB embeddings $(\mathbf{b}_{t-k}, ..., \mathbf{b}_t)$ and private state embeddings $(\mathbf{z}_{t-k}, ..., \mathbf{z}_t)$, where $k+1$ is the sequence length. As shown in Figure 3(a), we feed them into two different LSTM layers and concatenate the last hidden states $\mathbf{h}^b_t$ and $\mathbf{h}^z_t$ of the two LSTMs as the micro-level embedding $\mathbf{e}^i_t$ at time $t$.

For the macro-level encoder, we pick raw OHLCV data and technical indicators to learn the macro-level embedding. The intuition here is that OHLCV reflects the original market status, while technical indicators offer additional information. At time $t$, we first concatenate the OHLCV vector $\mathbf{x}_t$ and the technical indicator vector $\mathbf{y}_t$ as the input $\mathbf{v}_t$. As shown in Figure 3(b), the concatenated vector is then fed into a multilayer perceptron (MLP), whose output is used as the macro-level embedding $\mathbf{e}^a_t$ at time $t$. Finally, we concatenate the micro-level embedding and the macro-level embedding as the market embedding $\mathbf{e}_t$. Our market embedding is richer than that of previous work, since it incorporates the micro-level market information.

4.4 Risk-Aware Auxiliary Task
As risk management is of vital importance for intraday trading, we propose a risk-aware auxiliary task that predicts volatility to take market risk into account, as shown in Figure 3(c). Volatility is widely used as a coherent measure of risk that describes the statistical dispersion of returns in finance [2]. We analyze why volatility prediction is an effective auxiliary task for improving trading policy learning as follows.

First, it is consistent with the general trading goal, which is to maximize long-term profit under a certain risk tolerance. Second, future volatility is easier to predict than the future price. For instance, consider the day when a presidential election result will be announced: nobody can know the result in advance, which will lead the stock market to increase or decrease; however, everyone knows that there will be a huge fluctuation in the stock market, which increases future volatility. Third, predicting the future price and the future volatility are two closely related tasks, and learning value function approximation and volatility prediction simultaneously can help the agent learn a more robust market embedding. The volatility is defined as the variance of the return sequence, $y^{vol} = \sigma(\mathbf{r})$, where …
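The sketch below (our own hedged rendering of Figure 3(a)-(c), not the authors' code; layer sizes, the joint loss weighting with $\eta$, and reading the volatility target as the variance of a forward return window are assumptions) shows how the micro/macro encoders and the volatility-prediction head could fit together:

```python
import torch
import torch.nn as nn

class MarketEncoder(nn.Module):
    """Micro-level (LOB + private state via LSTMs) and macro-level (OHLCV + indicators via MLP) encoders."""

    def __init__(self, lob_dim, priv_dim, macro_dim, hidden=64):
        super().__init__()
        self.lob_lstm = nn.LSTM(lob_dim, hidden, batch_first=True)
        self.priv_lstm = nn.LSTM(priv_dim, hidden, batch_first=True)
        self.macro_mlp = nn.Sequential(nn.Linear(macro_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.vol_head = nn.Linear(3 * hidden, 1)  # risk-aware auxiliary task: predict future volatility

    def forward(self, lob_seq, priv_seq, macro_vec):
        _, (h_b, _) = self.lob_lstm(lob_seq)          # last hidden state of the LOB LSTM
        _, (h_z, _) = self.priv_lstm(priv_seq)        # last hidden state of the private-state LSTM
        micro = torch.cat([h_b[-1], h_z[-1]], dim=1)  # e_t^i
        macro = self.macro_mlp(macro_vec)             # e_t^a
        market = torch.cat([micro, macro], dim=1)     # e_t, fed to the branching Q-network
        vol_pred = self.vol_head(market).squeeze(1)   # auxiliary volatility prediction
        return market, vol_pred

def volatility_target(future_returns):
    """y^vol: variance of the forward return sequence (our reading of sigma(r))."""
    return torch.var(future_returns, dim=1)

def joint_loss(q_loss, vol_pred, vol_target, eta=1.0):
    """Total objective: TD loss plus an eta-weighted auxiliary volatility loss (eta as discussed in Section 6.4)."""
    return q_loss + eta * nn.functional.mse_loss(vol_pred, vol_target)
```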
Reinforcement Learning Methods
• DQN [44] applies the deep Q-network with a novel state representation and reward function for quantitative trading, which shows stellar performance on more than 50 financial assets.
• DS-NH is a variant of DeepScalper (DS) that removes the hindsight bonus from the reward function.
• DS-NA is a variant of DeepScalper (DS) that removes the risk-aware auxiliary task.
5.4 Preprocessing and Experiment Setup
For macro-level features, we directly calculate the 11 technical indicators … practical real-world constraints. The transaction fee rate $\delta$ is set as $2.3 \times 10^{-5}$ and $3 \times 10^{-6}$ for stock index futures and treasury bond futures respectively, which is consistent with the real-world setting of intraday trading. We apply a fixed five-times leverage to amplify profit and volatility. Time is discretized into 1-minute intervals and we assume that the agent can only long/short a financial future at the end of each minute. The account of the RL agent is initialized with enough cash to buy 50 shares of the asset at the beginning, and the maximum holding position is 50.

We perform all experiments on a Tesla V100 GPU. Grid search is applied to find the optimal hyperparameters. We explore the look-ahead horizon $h$ in [30, 60, 90, 120, 150, 180], the importance of the hindsight bonus $w$ in [1e-3, 5e-3, 1e-2, 5e-2, 1e-1] and the importance of the auxiliary task $\eta$ in [0.5, 1.0]. As for neural network architectures, we search the hidden units of the MLP layers and the GRU layer in [32, 64, 128] with ReLU as the activation function. We use Adam as the optimizer with learning rate $\alpha \in (1e-5, 1e-3)$ and train DeepScalper for 5 epochs on all financial assets. Following the iterative training scheme in [27], we augment traders' private states repeatedly during training to improve data efficiency. We run experiments with 5 different random seeds and report the average performance. It takes 1.5 and 3.5 hours to train and test DeepScalper on each financial asset in the stock index and treasury bond datasets, respectively. As for other baselines, we use the default settings in their public implementations⁵ ⁶.
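For readers who want to reproduce a comparable sweep, a minimal grid-search configuration consistent with the ranges above might look as follows (our own sketch; the variable names and the hypothetical `train_deepscalper` entry point are not from the paper):

```python
from itertools import product

# Hyperparameter grid reported in Section 5.4 (values from the paper; structure is ours).
search_space = {
    "hindsight_horizon_h": [30, 60, 90, 120, 150, 180],
    "hindsight_weight_w": [1e-3, 5e-3, 1e-2, 5e-2, 1e-1],
    "aux_task_weight_eta": [0.5, 1.0],
    "hidden_units": [32, 64, 128],
}

def grid(space):
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

# Example: iterate over all 180 configurations and train one agent per setting.
# for cfg in grid(search_space):
#     train_deepscalper(cfg)   # hypothetical training entry point
```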
6 RESULTS AND ANALYSIS

6.1 Profitability Comparison with Baselines
We compare DeepScalper with 9 state-of-the-art baselines in terms of four financial metrics in Table 3. We observe that DeepScalper consistently generates significantly ($p < 0.01$) higher performance than all baselines on 7 out of 8 metrics across the two datasets. On the stock index dataset, DeepScalper performs best on all four metrics; specifically, it outperforms the second best by 30.80%, 33.33%, 21.42% and 7.50% in terms of TR, SR, CR and SoR, respectively. On the treasury bond dataset, DeepScalper outperforms the second best by 14.87%, 7.47% and 30.94% in terms of TR, SR and CR. For SoR, DS-NA performs slightly better than DS (by 2%, without statistical significance); one possible reason is that the volatility-prediction auxiliary task is not directly relevant to controlling downside return variance.

Furthermore, we show the trading day vs. net value curves over the test period for each financial future from the two datasets in Figure 5. We intentionally exclude BAH, DS-NH and DS-NA to make the figure easy to follow. For traditional methods, we find that MV achieves decent performance for most financial futures. In comparison, TSM's performance is much worse; one possible reason for TSM's failure is that there is no evident momentum effect within the market for intraday trading. For deep learning models, the overall performance of GRU is better than that of MLP due to its ability to learn the temporal dependency of indicators. As for LGBM, it achieves slightly better performance than the deep learning models. The average performance of the RL methods is the best.

Figure 5: Trading day vs. net value curves of different baselines and DeepScalper on the stock index (IC and IF) and treasury bond (TF01, TF02, T01, T02) datasets. DeepScalper achieves the highest profit on all six financial assets.

⁴ China Financial Futures Exchange: http://www.cffex.com.cn/en_new/index.html
⁵ Qlib: https://github.com/microsoft/qlib
⁶ FinRL: https://github.com/AI4Finance-Foundation/FinRL
…bonus (Figure 7a) performs well in capturing local trading opportunities but overlooks the long-term trend of the entire trading day. In comparison, an agent trained with the hindsight bonus (Figure 7b) trades a large volume of short actions at the beginning of the trading day, indicating that it is aware of the decreasing trend in advance. This kind of trading action is smart, since it captures the big price gap of the overall trend while somewhat ignoring the local gain or loss.

Figure 10: Price curve of TF02 and T02.
6.4 Effectiveness of Risk-aware Auxiliary Task
Since the financial market is noisy and the RL training process is unstable, the performance variance across different random seeds is a major concern for RL-based trading algorithms. Intuitively, taking market risk into account can help the RL agent behave more stably, with lower performance variance. We run experiments 5 times with different random seeds and report the relative variance relationship between RL agents trained with and without the risk-aware auxiliary task in Figure 8. We find that RL agents trained with the risk-aware auxiliary task achieve a lower TR variance on all six financial assets and a lower SR variance on 67% of the financial assets. Furthermore, we test the impact of the auxiliary task importance $\eta$ on DeepScalper's performance. Naturally, the volatility value scale is smaller than that of the return, which makes $\eta = 1$ a decent option to start with. In practice, we test $\eta \in [0, 0.5, 1]$ and find that the improvement from the auxiliary task is robust to different importance weights, as shown in Figure 9.

Figure 11: Net value curve of MV, GRU, LGBM and DeepScalper on TF02 and T02 to show the generalization ability.
7 CONCLUSION
In this article, we focus on intraday trading and propose DeepScalper to mimic the workflow of professional intraday traders. First, we apply the dueling Q-network with action branching to efficiently train intraday RL agents. Then, we design a novel reward function with a hindsight bonus to encourage a long-term horizon that captures the overall price trend. In addition, we design an encoder-decoder architecture to learn robust market embeddings by incorporating both micro-level and macro-level market information. Finally, we propose volatility prediction as an auxiliary task to help agents be aware of market risk while maximizing profit. Extensive experiments on two stock index futures and four treasury bond futures demonstrate that DeepScalper significantly outperforms many advanced methods.
many advanced methods. [26] Ralph Neuneier. 1996. Optimal asset allocation using adaptive dynamic program-
ming. In Proceedings of the 10th Neural Information Processing Systems. 952–958.
[27] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. 2006. Reinforcement learning
REFERENCES for optimized trade execution. In Proceedings of the 23rd International Conference
[1] Bo An, Shuo Sun, and Rundong Wang. 2022. Deep Reinforcement Learning for on Machine learning. 673–680.
Quantitative Trading: Challenges and Opportunities. IEEE Intelligent Systems 37, [28] James M Poterba and Lawrence H Summers. 1988. Mean reversion in stock prices:
2 (2022), 23–26. Evidence and implications. Journal of Financial Economics 22, 1 (1988), 27–59.
[2] Gurdip Bakshi and Nikunj Kapadia. 2003. Delta-hedged gains and the negative [29] Karsten Schierholt and Cihan H Dagli. 1996. Stock market prediction using
market volatility risk premium. The Review of Financial Studies 16, 2 (2003), different neural network classification architectures. In IEEE/IAFE 1996 Conference
527–566. on Computational Intelligence for Financial Engineering. 72–78.
[3] Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspec- [30] William F Sharpe. 1994. The sharpe ratio. Journal of Portfolio Management 21, 1
tive on reinforcement learning. In Proceedings of the 34th International Conference (1994), 49–58.
on Machine Learning. 449–458. [31] Shuo Sun, Rundong Wang, and Bo An. 2021. Reinforcement learning for quanti-
[4] John Bollinger. 2002. Bollinger on Bollinger Bands. McGraw-Hill New York. tative trading. arXiv preprint arXiv:2109.13851 (2021).
[5] Chi Chen, Li Zhao, Jiang Bian, Chunxiao Xing, and Tie-Yan Liu. 2019. Investment [32] Shuo Sun, Rundong Wang, and Bo An. 2022. Quantitative Stock Investment by
behaviors can tell what inside: Exploring stock intrinsic properties for stock Routing Uncertainty-Aware Trading Experts: A Multi-Task Learning Approach.
trend prediction. In Proceedings of the 25th ACM SIGKDD International Conference arXiv preprint arXiv:2207.07578 (2022).
on Knowledge Discovery & Data Mining. 2376–2384. [33] Arash Tavakoli, Fabio Pardo, and Petar Kormushev. 2018. Action branching
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. architectures for deep reinforcement learning. In Proceedings of the 32nd AAAI
Empirical evaluation of gated recurrent neural networks on sequence modeling. Conference on Artificial Intelligence. 4131–4138.
arXiv preprint arXiv:1412.3555 (2014). [34] Jia Wang, Tong Sun, Benyuan Liu, Yu Cao, and Hongwei Zhu. 2019. CLVSA:
[7] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2016. Deep A convolutional LSTM based variational sequence-to-sequence model with at-
direct reinforcement learning for financial signal representation and trading. tention for predicting trends of financial markets. In Proceedings of the 28th
IEEE Transactions on Neural Networks and Learning Systems 28, 3 (2016), 653–664. International Joint Conference on Artificial Intelligence (IJCAI). 3705–3711.
[8] Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo, and Jian Guo. 2020. Hier- [35] Rundong Wang, Hongxin Wei, Bo An, Zhouyan Feng, and Jun Yao. 2021. Com-
archical multi-scale Gaussian transformer for stock movement prediction.. In mission fee is not enough: A hierarchical reinforced framework for portfolio
Proceedings of the 29th International Joint Conference on Artificial Intelligence management. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.
(IJCAI). 4640–4646. [36] Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. 2021. Deep-
[9] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for Trader: A deep reinforcement learning approach to risk-return balanced portfolio
event-driven stock prediction. In Proceedings of the 24th International Conference management with market conditions embedding. In Proceedings of the 35th AAAI
on Artificial Intelligence. 2327–2333. Conference on Artificial Intelligence.
[10] Eugene F Fama. 1970. Efficient capital markets: A review of theory and empirical [37] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando
work. The Journal of Finance 25, 2 (1970), 383–417. Freitas. 2016. Dueling network architectures for deep reinforcement learning. In
[11] Yuchen Fang, Kan Ren, Weiqing Liu, Dong Zhou, Weinan Zhang, Jiang Bian, Proceedings of 35th International Conference on Machine Learning. 1995–2003.
Yong Yu, and Tie-Yan Liu. 2021. Universal trading for order execution with [38] Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and
oracle policy distillation. In Proceedings of the 35th AAAI Conference on Artificial historical prices. In Proceedings of the 56th Annual Meeting of the Association for
Intelligence. 107–115. Computational Linguistics. 1970–1979.
[12] Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. [39] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo
2018. Enhancing stock movement prediction with adversarial training. arXiv Li. 2020. Reinforcement-learning based portfolio management with augmented
preprint arXiv:1810.09936 (2018). asset movement prediction states. In Proceedings of the 34th AAAI Conference on
[13] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. Artificial Intelligence (AAAI). 1112–1119.
2019. Temporal relational ranking for stock prediction. ACM Transactions on [40] Jaemin Yoo, Yejun Soun, Yong-chan Park, and U Kang. 2021. Accurate multivariate
Information Systems (TOIS) 37, 2 (2019), 1–30. stock movement prediction via data-axis transformer with multi-level contexts.
[14] Harrison Hong and Jeremy C Stein. 1999. A unified theory of underreaction, In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &
momentum trading, and overreaction in asset markets. The Journal of Finance Data Mining. 2037–2045.
54, 6 (1999), 2143–2184. [41] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Das-
[15] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening gupta. 2019. Model-based deep reinforcement learning for dynamic portfolio
to chaotic whispers: A deep learning framework for news-oriented stock trend optimization. arXiv preprint arXiv:1901.08740 (2019).
prediction. In Proceedings of the 11th ACM International Conference on Web Search [42] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Man-
and Data Mining. 261–269. nor. 2018. Learn what not to learn: Action elimination with deep reinforcement
[16] Narasimhan Jegadeesh and Sheridan Titman. 2002. Cross-sectional and time- learning. Advances in Neural Information Processing Systems (2018).
series determinants of momentum returns. The Review of Financial Studies 15, 1 [43] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. 2017. Stock price prediction
(2002), 143–157. via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM
[17] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. 2017. A deep reinforcement SIGKDD International Conference on Knowledge Discovery & Data Mining. 2141–
learning framework for the financial portfolio management problem. arXiv 2149.
preprint arXiv:1706.10059 (2017). [44] Zihao Zhang, Stefan Zohren, and Stephen Roberts. 2020. Deep reinforcement
[18] Zura Kakushadze. 2016. 101 formulaic alphas. Wilmott 2016, 84 (2016), 72–81. learning for trading. The Journal of Financial Data Science 2, 2 (2020), 25–40.