
DeepScalper: A Risk-Aware Reinforcement Learning Framework

to Capture Fleeting Intraday Trading Opportunities


arXiv:2201.09058v3 [q-fin.TR] 21 Aug 2022

Shuo Sun (Nanyang Technological University)    Wanqi Xue (Nanyang Technological University)    Rundong Wang∗ (Nanyang Technological University)
Xu He (Huawei Noah's Ark Lab)    Junlei Zhu (WeBank)    Jian Li (Tsinghua University)
Bo An (Nanyang Technological University)


ABSTRACT
Reinforcement learning (RL) techniques have shown great success in many challenging quantitative trading tasks, such as portfolio management and algorithmic trading. Especially, intraday trading is one of the most profitable and risky tasks because of the intraday behaviors of the financial market that reflect billions of rapidly fluctuating capitals. However, a vast majority of existing RL methods focus on relatively low-frequency trading scenarios (e.g., day-level) and fail to capture fleeting intraday investment opportunities due to two major challenges: 1) how to effectively train profitable RL agents for intraday investment decision-making, which involves a high-dimensional fine-grained action space; 2) how to learn a meaningful multi-modality market representation to understand the intraday behaviors of the financial market at tick level. Motivated by the efficient workflow of professional human intraday traders, we propose DeepScalper, a deep reinforcement learning framework for intraday trading to tackle the above challenges. Specifically, DeepScalper includes four components: 1) a dueling Q-network with action branching to deal with the large action space of intraday trading for efficient RL optimization; 2) a novel reward function with a hindsight bonus to encourage RL agents to make trading decisions with a long-term horizon of the entire trading day; 3) an encoder-decoder architecture to learn multi-modality temporal market embedding, which incorporates both macro-level and micro-level market information; 4) a risk-aware auxiliary task to strike a balance between maximizing profit and minimizing risk. Through extensive experiments on real-world market data spanning over three years on six financial futures (2 stock index and 4 treasury bond), we demonstrate that DeepScalper significantly outperforms many state-of-the-art baselines in terms of four financial criteria. Furthermore, we conduct a series of exploratory and ablative studies to analyze the contributions of each component in DeepScalper.

[Figure 1: Workflow of a professional human intraday trader: data analysis over LOB and OHLCV, short-term and long-term prediction, risk management, and order placement.]

KEYWORDS
Quantitative investment; reinforcement learning

1 INTRODUCTION
The financial market, an ecosystem that involves over 90 trillion¹ market capitals globally in 2020, attracts the attention of hundreds of millions of investors to pursue desirable financial assets and achieve investment goals. Recent years have witnessed significant development of quantitative trading [1], due to its instant and accurate order execution and its capability of analyzing and processing large amounts of temporal financial data. Especially, intraday trading², where traders actively long/short pre-selected financial assets (at least a few times per day) to seize intraday trading opportunities, becomes one of the most profitable and risky quantitative trading tasks with the growth of discount brokerages (lower commission fees).

Traditional intraday trading strategies use finance knowledge to discover trading opportunities with heuristic rules. For instance, momentum trading [24] is designed based on the assumption that the trend of financial assets in the past has the tendency to continue in the future. Mean reversion [4] focuses on the investment opportunities at turning points. However, rule-based traditional methods exhibit poor generalization ability and only perform well in certain market conditions [7]. Another paradigm is to trade based on financial prediction. Many advanced machine learning models such as GCN [13], Transformer [8] and LGBM [19] have been introduced for predicting future prices [9]. Many other data sources such as economic news [15], frequency of price [43], social media [38] and investment behaviors [5] have been added as additional information to further improve prediction performance. However, the high volatility and noisy nature of the financial market make it extremely difficult to accurately predict future prices [10]. Furthermore, there is a noticeable gap between prediction signals and profitable trading actions [13]. Thus, the overall performance of prediction-based methods is not satisfying either.

∗ Corresponding author.
¹ https://data.worldbank.org/indicator/CM.MKT.LCAP.CD/
² https://www.investopedia.com/articles/trading/05/011705.asp

Similar to other application scenarios of reinforcement learning (RL), quantitative trading also interacts with the environment (the financial market) and maximizes the accumulative profit. Recently, RL has been considered an attractive approach to quantitative trading, as it allows training agents to directly output profitable trading actions with better generalization ability across various market conditions [21]. Although there have been many successful RL-based quantitative trading methods [11, 35, 39], a vast majority of existing methods focus on relatively low-frequency trading scenarios (e.g., day-level) and fail to capture fleeting intraday investment opportunities. To design profitable RL methods for intraday trading, there are two major challenges. First, different from the low-frequency scenarios, intraday trading involves a much larger high-dimensional fine-grained action space to represent the price and quantity of orders for more accurate control of the financial market. Second, we need to learn a meaningful multi-modality intraday market representation, which takes the macro-level market, the micro-level market and risk into consideration.

Considering the workflow of a professional human intraday trader (Figure 1), the trader first collects both micro-level and macro-level market information to analyze the market status. Then, he predicts the short-term and long-term price trends based on the market status. Later on, he conducts risk management and makes the final trading decisions (when and how much to long/short at what price). Among many successful trading firms, this workflow plays a key role in designing robust and profitable intraday trading strategies. Inspired by it, we propose DeepScalper, a novel RL framework for intraday trading to tackle the above challenges by mimicking the information collection, short-term and long-term market analysis, and risk management procedures of human intraday traders. Our main contributions are three-fold:
• We apply the dueling Q-network with action branching to effectively optimize intraday trading agents with a high-dimensional fine-grained action space. A novel reward function with a hindsight bonus is designed to encourage RL agents to make trading decisions with a long-term horizon of the entire trading day.
• We propose a multi-modality encoder-decoder architecture to learn meaningful temporal intraday market embedding, which incorporates both micro-level and macro-level market information. Furthermore, we design a risk-aware auxiliary task to keep a balance between profit and risk.
• Through extensive experiments on real-world market data spanning over three years on six financial futures, we show that DeepScalper significantly outperforms many state-of-the-art baseline methods in terms of four financial criteria and demonstrate DeepScalper's practical applicability to intraday trading with a series of exploratory and ablative studies.

2 RELATED WORK

2.1 Traditional Finance Methods
Technical analysis [25], which believes that past price and volume data have the ability to reflect future market conditions [10], is the foundation of traditional finance methods. Millions of technical indicators are designed to generate trading signals [18]. Momentum [14] and mean reversion [28] are two well-known types of traditional finance methods based on technical analysis. Momentum trading assumes the trend of financial assets in the past has the tendency to continue in the future. Time Series Momentum [24] and Cross Sectional Momentum [16] are two classic momentum strategies. In contrast, mean reversion strategies such as Bollinger bands [4] assume that the price of financial assets will finally revert to the long-term mean.

However, traditional finance methods are not perceptive enough to capture fleeting intraday patterns and only perform well in certain market conditions [7]. In recent years, many advanced machine learning methods have significantly outperformed traditional finance methods.

2.2 Prediction-Based Methods
Prediction-based methods first formulate quantitative trading as a supervised learning task to predict the future return (regression) or price movement (classification). Later on, trading decisions are generated from the prediction results with a heuristic strategy generator (e.g., top-k in [40]). Specifically, Wang et al. [34] combine an attention mechanism with LSTM to model correlated time steps. To improve the robustness of LSTM, Feng et al. [12] apply adversarial training techniques for stock prediction. Zhang et al. [43] propose a novel State Frequency Memory (SFM) recurrent network with the Discrete Fourier Transform (DFT) to discover multi-frequency patterns in stock markets. Liu et al. [20] introduce a multi-scale two-way neural network to predict the stock trend. Sun et al. [32] propose an ensemble learning framework to train a mixture of trading experts.

However, the high volatility and noisy nature of the financial market make it extremely difficult to accurately predict future prices [10]. Furthermore, there is a noticeable gap between prediction signals and profitable trading actions [13]. Thus, the overall performance of prediction-based methods is not satisfying either.

2.3 Reinforcement Learning Methods
Recent years have witnessed the successful marriage of reinforcement learning and quantitative trading, as RL allows training agents to directly output profitable trading actions with better generalization ability across various market conditions [31]. Neuneier [26] makes the first attempt to learn trading strategies using Q-learning. Moody and Saffell [23] propose a policy-based method, namely recurrent reinforcement learning (RRL), for quantitative trading. However, traditional RL approaches have difficulties in selecting market features and learning good policies in large-scale scenarios. To tackle these issues, many deep RL approaches have been proposed to learn market embeddings from high-dimensional data. Jiang et al. [17] use DDPG to dynamically optimize cryptocurrency portfolios. Deng et al. [7] apply fuzzy learning and deep learning to improve financial signal representation. Yu et al. [41] propose a model-based RL framework for daily-frequency portfolio trading. Liu et al. [21] propose an adaptive DDPG-based framework with imitation learning. Ye et al. [39] propose a State-Augmented RL (SARL) framework based on DPG with financial news as additional states.

Although there are many efforts on utilizing RL for quantitative trading, a vast majority of existing RL methods focus on relatively low-frequency scenarios (e.g., day-level) and fail to capture the fleeting intraday investment opportunities. We propose DeepScalper to fill this gap by mimicking the workflow of human intraday traders.

[Figure 2: A snapshot of a 4-level limit order book (LOB), showing bid and ask prices with their quantities on the buy and sell sides.]

3 PROBLEM FORMULATION
In this section, we first introduce necessary preliminaries and the objective of intraday trading. Next, we provide a Markov Decision Process (MDP) formulation of intraday trading.

3.1 Intraday Trading
Intraday trading is a fundamental quantitative trading task, where traders actively long/short one pre-selected financial asset within the same trading day to maximize future profit. Below are some necessary definitions for understanding the problem:

Definition 1. (OHLCV) OHLCV is a type of bar chart directly obtained from the financial market. The OHLCV vector at time $t$ is denoted as $\mathbf{x}_t = (p_t^o, p_t^h, p_t^l, p_t^c, v_t)$, where $p_t^o$ is the open price, $p_t^h$ is the high price, $p_t^l$ is the low price, $p_t^c$ is the close price and $v_t$ is the volume.

Definition 2. (Technical Indicator) A technical indicator is a feature calculated by a formulaic combination of the original OHLCV to uncover the underlying pattern of the financial market. We denote the technical indicator vector at time $t$ as $\mathbf{y}_t = \bigcup_k y_t^k$, where $y_t^k = f_k(\mathbf{x}_{t-h}, ..., \mathbf{x}_t, \theta^k)$ and $\theta^k$ is the parameter of technical indicator $k$.

Definition 3. (Limit Order) A limit order is an order placed to long/short a certain number of shares at a specific price. It is defined as a tuple $l = (p_{target}, \pm q_{target})$, where $p_{target}$ represents the submitted target price, $q_{target}$ represents the submitted target quantity, and $\pm$ represents the trading direction (long/short).

Definition 4. (Limit Order Book) As shown in Figure 2, a limit order book (LOB) contains publicly available aggregate information of the limit orders placed by all market participants. It is widely used as market microstructure [22] in finance to represent the relative strength between the buy and sell sides. We denote an $m$-level LOB at time $t$ as $\mathbf{b}_t = (p_t^{b_1}, p_t^{a_1}, q_t^{b_1}, q_t^{a_1}, ..., p_t^{b_m}, p_t^{a_m}, q_t^{b_m}, q_t^{a_m})$, where $p_t^{b_i}$ is the level-$i$ bid price, $p_t^{a_i}$ is the level-$i$ ask price, and $q_t^{b_i}$ and $q_t^{a_i}$ are the corresponding quantities.

Definition 5. (Matching System) The matching system is an electronic system that matches the buy and sell orders for the financial market. It is usually used to execute orders for different traders in the exchange.

Definition 6. (Position) Position $pos_t$ at time $t$ is the amount and direction of a financial asset owned by traders. It represents a long (short) position when $pos_t$ is positive (negative).

Definition 7. (Net Value) Net value is the normalised sum of cash and the value of financial assets held by a trader. The net value at time $t$ is denoted as $n_t = (c_t + p_t^c \times pos_t)/c_1$, where $c_t$ is the cash at time $t$ and $c_1$ is the initial amount of cash.

In real-world intraday trading, traders are allocated some cash into the account at the beginning of each trading day. During the trading time, traders analyze the market and make their trading decisions. Then, they submit their orders (desired price and quantity) to the matching system. The matching system executes orders at the best available price (possibly at multiple prices when market liquidity is not enough for large orders) and then returns execution results to traders. At the end of the trading day, traders close all remaining positions at market price to avoid overnight risk and hold 100% cash again. The objective of intraday trading is to maximize accumulative profit over a period of multiple continuous trading days (e.g., half a year).

Compared to conventional low-frequency trading scenarios, intraday trading is more challenging since intraday traders need to capture tiny price fluctuations with much less response time (e.g., 1 min). In addition, intraday trading involves a large fine-grained trading action space that represents a limit order to pursue more accurate control of the market.

3.2 MDP Formulation
We formulate intraday trading as an MDP, which is constructed by a 5-tuple $(\mathcal{S}, \mathcal{A}, T, R, \gamma)$. Specifically, $\mathcal{S}$ is a finite set of states. $\mathcal{A}$ is a finite set of actions. $T: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is a state transition function, which consists of a set of conditional transition probabilities between states. $R: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function, where $\mathbb{R}$ is a continuous set of possible rewards and $R$ indicates the immediate reward of taking an action in a state. $\gamma \in [0, 1)$ is the discount factor. A (stationary) policy $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$ assigns each state $s \in \mathcal{S}$ a distribution over actions, where $a \in \mathcal{A}$ has probability $\pi(a|s)$. In intraday trading, $\mathcal{S}$, $\mathcal{A}$ and $R$ are set as follows.

State. Due to the particularity of the financial market, the state $s_t \in \mathcal{S}$ at time $t$ can be divided into three parts: the macro-level market state $s_t^a \in \mathcal{S}^a$, the micro-level market state $s_t^i \in \mathcal{S}^i$ and the trader's private state $s_t^p \in \mathcal{S}^p$. Specifically, we use a vector $\mathbf{y}_t$ of 11 technical indicators and the original OHLCV vector $\mathbf{x}_t$ as the macro-level state following [40], the historical LOB sequence $(\mathbf{b}_{t-h}, ..., \mathbf{b}_t)$ with length $h+1$ as the micro-level state, and the trader's private state $\mathbf{z}_t = (pos_t, c_t, t_t)$, where $pos_t$, $c_t$ and $t_t$ are the current position, cash and remaining time. The entire set of states can be denoted as $\mathcal{S} = (\mathcal{S}^a, \mathcal{S}^i, \mathcal{S}^p)$. Compared to previous formulations, we introduce the LOB and the trader's private state as additional information to effectively capture intraday trading opportunities.
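To make the three-part state concrete, the sketch below assembles $s_t$ from raw inputs. It is a minimal illustration under our own assumptions (the helper name build_state and the exact array shapes are not from the paper):

```python
import numpy as np

def build_state(ohlcv, indicators, lob_history, pos, cash, time_left):
    """Assemble the three-part state s_t = (macro, micro, private).

    ohlcv:        length-5 array (open, high, low, close, volume) at time t
    indicators:   length-11 array of technical indicators y_t
    lob_history:  array of shape (h + 1, 4 * m) holding the last h + 1
                  snapshots of an m-level LOB, b_{t-h}, ..., b_t
    pos, cash, time_left: scalars forming the private state z_t
    """
    macro_state = np.concatenate([ohlcv, indicators])         # s_t^a
    micro_state = np.asarray(lob_history, dtype=np.float32)   # s_t^i (sequence)
    private_state = np.array([pos, cash, time_left])          # s_t^p
    return macro_state, micro_state, private_state
```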

[Figure 3: An overview of the proposed DeepScalper framework, with four individual building blocks: (a) micro-level encoder (LSTMs over the historical LOB and the private observation), (b) macro-level encoder (MLP over technical indicators and OHLCV), (c) risk-aware auxiliary task (volatility prediction), and (d) RL optimization with action branching (price and quantity branches combined with a shared state value).]

Action. Previous works [7, 21] lie in low-frequency trading scenarios, which generally stipulate that the agent trades a fixed quantity at market price and applies a coarse action space with three options (long, hold, and short). However, when focusing on relatively high-frequency trading scenarios (e.g., intraday trading), tiny price fluctuations (e.g., 1 cent) are of vital importance to the final profit, which makes the market-price execution and fixed-quantity assumptions unacceptable. In the real-world financial market, traders have the freedom to decide both the target price and the quantity by submitting limit orders. We use a more realistic two-dimensional fine-grained action space for intraday trading, which represents a limit order as a tuple $(p_{target}, \pm q_{target})$. $p_{target}$ is the target price, $q_{target}$ is the target quantity and $\pm$ is the trading direction (long/short). It is also worth pointing out that when the quantity is zero, we skip the current time step with no order placement.

Reward. We define the reward function as the change of account P&L, which shows the value fluctuation (profit & loss) of the account:
$$r_t = \underbrace{(p_{t+1}^c - p_t^c) \times pos_t}_{instant\ profit} - \underbrace{\delta \times p_t^c \times |pos_t - pos_{t-1}|}_{transaction\ fee}$$
where $p_t^c$ is the close price at time $t$, $\delta$ is the transaction fee rate and $pos_t$ is the position at time $t$.
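As a concrete reading of this reward, the sketch below computes one training step of the P&L-based reward. It is a minimal sketch assuming the transaction fee is charged on the absolute change in position; the function name pnl_reward is our own:

```python
def pnl_reward(close_t, close_next, pos_t, pos_prev, fee_rate):
    """One-step P&L reward: instant profit minus transaction fee.

    close_t, close_next: close prices p_t^c and p_{t+1}^c
    pos_t, pos_prev:     signed positions pos_t and pos_{t-1} (long > 0)
    fee_rate:            transaction fee rate delta
    """
    instant_profit = (close_next - close_t) * pos_t
    transaction_fee = fee_rate * close_t * abs(pos_t - pos_prev)
    return instant_profit - transaction_fee
```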
4 DEEPSCALPER
In this section, we introduce DeepScalper (an overview in Figure 3) for intraday trading. We describe its four components in order: 1) RL optimization with action branching; 2) a reward function with a hindsight bonus; 3) intraday market embedding; 4) a risk-aware auxiliary task.

4.1 RL Optimization with Action Branching
Compared to conventional low-frequency trading scenarios, intraday trading tries to seize fleeting tiny price fluctuations with much less response time. To provide more accurate trading decisions, intraday trading involves a much larger two-dimensional (price and quantity) fine-grained action space. However, learning from scratch for tasks with large action spaces remains a critical challenge for RL algorithms [3, 42]. For intraday trading, while human traders can usually detect the subset of feasible trading actions in a given market condition, RL agents may attempt inferior actions, thus wasting computation time and leading to capital loss.

As possible intraday trading actions can be naturally divided into two components (desired price and quantity), we propose to adopt the Branching Dueling Q-Network (BDQ) [33] for decision-making. Particularly, as shown in Figure 3(d), BDQ distributes the representation of the state-dependent action advantages across the price and quantity branches. It simultaneously adds a single additional branch to estimate the state-value function. Finally, the advantages and the state value are combined via an aggregating layer to output the Q-values for each action dimension. During inference, these Q-values are queried with argmax to generate a joint action tuple that determines the final trading action.

Formally, intraday trading is formulated as a sequential decision-making problem with two action dimensions of $|p| = n_p$ discrete relative price levels and $|q| = n_q$ discrete quantity proportions. The Q-value $Q_d$ of an individual branch at state $s \in \mathcal{S}$ and action $a_d \in \mathcal{A}_d$ is expressed in terms of the common state value $V(s)$ and the corresponding (state-dependent) action advantage [37] $Adv_d(s, a_d)$ for $d \in \{p, q\}$:
$$Q_d(s, a_d) = V(s) + \Big(Adv_d(s, a_d) - \frac{1}{n_d} \sum_{a_d' \in \mathcal{A}_d} Adv_d(s, a_d')\Big)$$
We train our Q-value function approximator as a Q-network with parameters $\theta_q$ based on one-step temporal-difference learning with target $y_d$ in a recursive fashion:
$$y_d = r + \gamma \max_{a_d' \in \mathcal{A}_d} Q_d^-(s', a_d'), \quad d \in \{p, q\}$$
where $Q_d^-$ denotes branch $d$ of the target network $Q^-$, $r$ is the reward and $\gamma$ is the discount factor.
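The aggregation above maps directly onto a small network head. The following PyTorch sketch is one possible reading of Figure 3(d), not the authors' code; layer sizes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class BranchingDuelingHead(nn.Module):
    """Shared state value plus per-branch advantages for price and quantity."""

    def __init__(self, embed_dim, n_price, n_qty, hidden=64):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.adv_price = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_price))
        self.adv_qty = nn.Sequential(nn.Linear(embed_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_qty))

    def forward(self, market_embedding):
        v = self.value(market_embedding)                      # V(s)
        q_values = []
        for adv_head in (self.adv_price, self.adv_qty):
            adv = adv_head(market_embedding)                  # Adv_d(s, a_d)
            # per-branch aggregation: Q_d = V + (Adv_d - mean Adv_d)
            q_values.append(v + adv - adv.mean(dim=-1, keepdim=True))
        return q_values                                       # [Q_p, Q_q]

# At inference, the joint action is the per-branch argmax:
# price_idx = q_values[0].argmax(dim=-1); qty_idx = q_values[1].argmax(dim=-1)
```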

Finally, we calculate the following loss function:
$$L_q(\theta_q) = \mathbb{E}_{(s,a,r,s') \sim D}\Big[\frac{1}{N} \sum_{d \in \{p,q\}} \big(y_d - Q_d(s, a_d, \theta_q)\big)^2\Big]$$
where $D$ denotes a prioritized experience replay buffer and $a$ denotes the joint-action tuple $(p, q)$. By differentiating the Q-network loss function with respect to $\theta_q$, we get the following gradient:
$$\nabla_{\theta_q} L_q(\theta_q) = \mathbb{E}_{(s,a,r,s') \sim D}\Big[\Big(r + \gamma \max_{a_d' \in \mathcal{A}_d} Q_d^-(s', a_d') - Q_d(s, a_d, \theta_q)\Big) \nabla_{\theta_q} Q_d(s, a_d, \theta_q)\Big]$$
In practice, we optimize the loss function by stochastic gradient descent, rather than computing the full expectations in the above gradient, to maintain computational efficiency.

[Figure 4: Illustration of the motivation of the hindsight bonus: a buy/sell pair that captures a small positive return but misses the main increasing trend of the day.]

4.2 Reward Function with Hindsight Bonus
One major issue with training directly on the profit & loss reward is that RL agents tend to pay too much attention to short-term price fluctuations [36]. Although the agent performs well in capturing local trading opportunities, ignoring the overall long-term price trend could lead to significant loss. Here, we design a novel reward function with a hindsight bonus to tackle this issue. To demonstrate the motivation of the hindsight bonus, consider the pair of buy/sell actions in Figure 4: the trader feels happy at the point of selling the stock, since the price of the stock increases. However, this sell decision is actually a bad decision in the long run. The trader feels disappointed before 12:00 since he/she misses the main increasing wave due to the short horizon. It is more reasonable for RL agents to evaluate one trading action from both short-term and long-term perspectives. Inspired by this, we add a hindsight bonus, which is the expected profit for holding the asset for a longer period $h$ with a weight term $w$, into the reward function to add a long-term horizon while training intraday RL agents:
$$r_t^{hind} = r_t + \underbrace{w \times (p_{t+h}^c - p_t^c) \times pos_t}_{hindsight\ bonus}$$
where $p_t^c$ is the close price at time $t$, $w$ is the weight of the hindsight bonus, $h$ is the horizon of the hindsight bonus and $pos_t$ is the position at time $t$.

Noticeably, we only use the reward function with the hindsight bonus during training to better understand the market. During the test period, we use the original reward $r_t$ to calculate the profits. Furthermore, the hindsight reward function somewhat ignores the details of the price fluctuation between $t+2$ and $t+h-1$ and focuses on the trend of this period, which is computationally efficient and shows robust performance in practice.
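A minimal sketch of the training-time reward with the hindsight bonus follows; it assumes access to the close price $h$ steps ahead (only available in hindsight during training), and the function name is ours:

```python
def hindsight_reward(prices, t, pos_t, pos_prev, fee_rate, w, h):
    """Training-time reward r_t^hind = r_t + w * (p_{t+h}^c - p_t^c) * pos_t.

    prices: sequence of close prices; indices t, t+1 and t+h must be valid
    w, h:   weight and horizon of the hindsight bonus
    """
    instant_profit = (prices[t + 1] - prices[t]) * pos_t
    transaction_fee = fee_rate * prices[t] * abs(pos_t - pos_prev)
    r_t = instant_profit - transaction_fee
    bonus = w * (prices[t + h] - prices[t]) * pos_t
    return r_t + bonus  # at test time, only r_t is used to report profit
```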
4.3 Intraday Market Embedding
To learn a meaningful multi-modality intraday market embedding, we propose an encoder-decoder architecture to represent the market from the micro level and the macro level, respectively.

For the micro-level encoder, we choose LOB data and the trader's private state to learn the micro-level market embedding. The LOB is widely used to analyze the relative strength of the buy and sell sides based on micro-level trading behaviors, and the private state of traders is considered insightful for capturing micro-level trading opportunities [27]. At time $t$, we have a sequence of historical LOB embeddings $(\mathbf{b}_{t-k}, ..., \mathbf{b}_t)$ and trader's private state embeddings $(\mathbf{z}_{t-k}, ..., \mathbf{z}_t)$, where $k+1$ is the sequence length. As shown in Figure 3(a), we feed them into two different LSTM layers and concatenate the last hidden states $\mathbf{h}_t^b$ and $\mathbf{h}_t^z$ of the two LSTM layers as the micro-level embedding $\mathbf{e}_t^i$ at time $t$.

For the macro-level encoder, we pick raw OHLCV data and technical indicators to learn the macro-level embedding. The intuition here is that OHLCV reflects the original market status, and technical indicators offer additional information. At time $t$, we first concatenate the OHLCV vector $\mathbf{x}_t$ and the technical indicator vector $\mathbf{y}_t$ as input $\mathbf{v}_t$. As shown in Figure 3(b), the concatenated embedding is then fed into a multilayer perceptron (MLP). The MLP output is used as the macro-level embedding $\mathbf{e}_t^a$ at time $t$. Finally, we concatenate the micro-level embedding and the macro-level embedding together as the market embedding $\mathbf{e}_t$. Our market embedding is better than that of previous work, since it incorporates the micro-level market information.
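The two encoders can be written as one compact module. The PyTorch sketch below follows the description above (two LSTMs for the micro level, an MLP for the macro level, concatenation into $\mathbf{e}_t$); dimensions and names are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class MarketEncoder(nn.Module):
    """Encoder producing the market embedding e_t = [e_t^i ; e_t^a]."""

    def __init__(self, lob_dim, private_dim, macro_dim, hidden=64):
        super().__init__()
        self.lob_lstm = nn.LSTM(lob_dim, hidden, batch_first=True)
        self.private_lstm = nn.LSTM(private_dim, hidden, batch_first=True)
        self.macro_mlp = nn.Sequential(nn.Linear(macro_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))

    def forward(self, lob_seq, private_seq, macro_vec):
        # lob_seq: (batch, k+1, lob_dim); private_seq: (batch, k+1, private_dim)
        _, (h_b, _) = self.lob_lstm(lob_seq)          # last hidden state h_t^b
        _, (h_z, _) = self.private_lstm(private_seq)  # last hidden state h_t^z
        micro = torch.cat([h_b[-1], h_z[-1]], dim=-1) # micro-level embedding e_t^i
        macro = self.macro_mlp(macro_vec)             # macro-level embedding e_t^a
        return torch.cat([micro, macro], dim=-1)      # market embedding e_t
```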

4.4 Risk-Aware Auxiliary Task
As risk management is of vital importance for intraday trading, we propose a risk-aware auxiliary task that predicts volatility to take market risk into account, as shown in Figure 3(c). Volatility is widely used as a coherent measure of risk that describes the statistical dispersion of returns in finance [2]. We analyze the reasons why volatility prediction is an effective auxiliary task for improving trading policy learning as follows.

First, it is consistent with the general trading goal, which is to maximize long-term profit under a certain risk tolerance. Second, future volatility is easier to predict than future price. For instance, consider the day when a presidential election result will be announced: nobody can know the result in advance, and it may lead the stock market to either increase or decrease. However, everyone knows that there will be a huge fluctuation in the stock market, which increases future volatility. Third, predicting future price and predicting volatility are two closely related tasks. Learning value function approximation and volatility prediction simultaneously can help the agent learn a more robust market embedding. Volatility is defined as the variance of the return sequence, $y^{vol} = \sigma(\mathbf{r})$, where $\mathbf{r}$ is the vector of returns at each time step. Volatility prediction is a regression task with the market embedding $\mathbf{e}_t$ as input and $y^{vol}$ as target. We feed the market embedding into a single-layer MLP with parameters $\theta_v$. The output $\hat{y}^{vol}$ is the predicted volatility. We train the network by minimizing the mean squared error:
$$\hat{y}^{vol} = MLP(\mathbf{e}_t, \theta_v)$$
$$L_{vol}(\theta_v) = (y^{vol} - \hat{y}^{vol})^2$$
The overall loss function is defined as:
$$L = L_q + \eta \cdot L_{vol}$$
where $L_q$ is the Q-value loss and $\eta$ is the relative importance of the auxiliary task.
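A minimal training-step sketch combining the Q-value loss with the volatility auxiliary loss is given below; it assumes the per-branch Q-values, TD targets and the scalar volatility prediction have already been computed (for example with modules like the sketches above), and is not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def joint_loss(q_values, td_targets, vol_pred, vol_target, eta):
    """Overall loss L = L_q + eta * L_vol.

    q_values:   list [Q_p, Q_q] of chosen-action Q-values per branch
    td_targets: list [y_p, y_q] of one-step TD targets per branch
    vol_pred:   predicted volatility from the auxiliary head
    vol_target: realized volatility y_vol of the return sequence
    """
    l_q = sum(F.mse_loss(q, y) for q, y in zip(q_values, td_targets)) / len(q_values)
    l_vol = F.mse_loss(vol_pred, vol_target)
    return l_q + eta * l_vol
```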
5 EXPERIMENT SETUP

5.1 Datasets and Features
To conduct a comprehensive evaluation of DeepScalper, we evaluate it on six financial assets from two real-world datasets (stock index and treasury bond) spanning over three years in the Chinese market, collected from Wind³. We summarize the statistics of the two datasets in Table 1 and further elaborate on them as follows:

Table 1: Dataset statistics detailing data frequency, number of financial assets, trading days and chronological period
Dataset         Freq    Number    Days    From        To
Stock index     1min    2         251     19/05/01    20/04/30
Treasury bond   1min    4         662     17/11/29    20/07/17

Stock index is a dataset containing the minute-level OHLCV and 5-level LOB data of two representative stock index futures (IC and IF) in the Chinese market. IC is a stock index future calculated based on 500 small and medium market capitalization stocks. IF is another stock index future that focuses on the top 300 large capitalization stocks. For each stock index future, we split the dataset with May-Dec 2019 for training and Jan-April 2020 for testing.

Treasury bond is a dataset containing the minute-level OHLCV and 5-level LOB data of four treasury bond futures (T01, T02, TF01, TF02). These treasury bond futures are mainstream treasury bond futures with the highest liquidity in the Chinese market. For each treasury bond, we use 2017/11/29 - 2020/4/29 for training and 2020/04/30 - 2020/07/17 for testing.

To describe macro-level financial markets, we generate 11 temporal features from the original OHLCV as shown in Table 2, following [40]. $z_{open}$, $z_{high}$ and $z_{low}$ represent the relative values of the open, high, and low prices compared to the close price at the current time step, respectively. $z_{close}$ and $z_{adj\_close}$ represent the relative values of the closing and adjusted closing prices compared to time step $t-1$. $z_{d\_k}$ represents a long-term moving average of the adjusted close prices during the last $k$ time steps compared to the current close price. For micro-level markets, we extract a 20-dimensional feature vector from the 5-level LOB, where each level contains the bid price, ask price, bid quantity and ask quantity, following [35].

Table 2: Features to describe macro-level financial markets
Features                                                          Calculation Formula
$z_{open}$, $z_{high}$, $z_{low}$                                 e.g., $z_{open} = open_t / close_t - 1$
$z_{close}$, $z_{adj\_close}$                                     e.g., $z_{close} = close_t / close_{t-1} - 1$
$z_{d\_5}$, $z_{d\_10}$, $z_{d\_15}$, $z_{d\_20}$, $z_{d\_25}$, $z_{d\_30}$    e.g., $z_{d\_5} = \frac{\sum_{i=0}^{4} adj\_close_{t-i}/5}{adj\_close_t} - 1$

³ https://www.wind.com.cn/
with May-Dec, 2019 for training and Jan-April, 2020 for testing. the pre-selected financial assets with full position at the
Treasury bond is a dataset containing the minute-level OHLCV beginning and holds until the end of the trading period.
and 5-level LOB data of four treasury bond futures (T01, T02, TF01, • Mean Reversion (MV) [28] is a traditional finance method
TF02). These treasury bond futures are mainstream treasury bond designed under the assumption that the price of financial as-
futures with the highest liquidity in the Chinese market. For each sets will eventually revert to the long-term mean. In practice,
treasury bond, we use 2017/11/29 - 2020/4/29 for training and it shows stellar performance under volatile market condi-
2020/04/30 - 2020/07/17 for testing. tions.
To describe macro-level financial markets, we generate 11 tempo- • Time Series Momentum (TSM) [24] is an influential momentum-
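For reference, the sketch below computes the four metrics from a net-value curve. It follows the definitions above literally (in particular, the downside deviation is taken as the variance of negative returns, as stated), and the function name is our own:

```python
import numpy as np

def evaluate(net_values):
    """Compute TR, SR, CR and SoR from a sequence of net values n_1, ..., n_T."""
    nv = np.asarray(net_values, dtype=float)
    returns = nv[1:] / nv[:-1] - 1

    tr = nv[-1] / nv[0] - 1                        # Total Return
    sr = returns.mean() / returns.std()            # Sharpe Ratio
    peak = np.maximum.accumulate(nv)
    mdd = np.max((peak - nv) / peak)               # Maximum Drawdown (largest peak-to-trough loss)
    cr = returns.mean() / mdd                      # Calmar Ratio
    dd = np.var(returns[returns < 0])              # downside deviation as defined above
    sor = returns.mean() / dd                      # Sortino Ratio
    return {"TR": tr, "SR": sr, "CR": cr, "SoR": sor}
```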
5.3 Baselines
We compare DeepScalper with nine baseline methods consisting of three traditional finance methods, three prediction-based methods, and three reinforcement learning methods.

Traditional Finance Methods
• Buy & Hold (BAH), which is usually used to reflect the market average, indicates the trading strategy that buys the pre-selected financial assets with full position at the beginning and holds until the end of the trading period.
• Mean Reversion (MV) [28] is a traditional finance method designed under the assumption that the price of financial assets will eventually revert to the long-term mean. In practice, it shows stellar performance under volatile market conditions.
• Time Series Momentum (TSM) [24] is an influential momentum-based method, which longs (shorts) financial assets with an increasing (decreasing) trend in the past. This is in line with the principle that the stronger is always stronger in the financial market.

Prediction-Based Methods
• MLP [29] uses the classic multi-layer perceptron for future return prediction. We apply a three-layer MLP with hidden size 128.
• GRU [6] uses a newer generation of recurrent networks with gated recurrent units for future return prediction. We apply a two-layer GRU with hidden size 64.

• LGBM [19] is an efficient implementation of the gradient boosting decision tree with gradient-based one-side sampling and exclusive feature bundling.

Reinforcement Learning Methods
• DQN [44] applies the deep Q-network with a novel state representation and reward function for quantitative trading, showing stellar performance on more than 50 financial assets.
• DS-NH is a variant of DeepScalper (DS) that removes the hindsight bonus from the reward function.
• DS-NA is a variant of DeepScalper (DS) that removes the risk-aware auxiliary task.

5.4 Preprocessing and Experiment Setup
For macro-level features, we directly calculate the 11 technical indicators following the formulas in Table 2. For micro-level features, we divide the order price and quantity of each level by the first-level price and quantity, respectively, for normalization. For missing values, we fill an empty price with the previous one and an empty quantity with zero to maintain the consistency of the time series data.

To make the evaluation more realistic, we further consider many practical real-world constraints. The transaction fee rate $\delta$ is set to $2.3 \times 10^{-5}$ and $3 \times 10^{-6}$ for stock index futures and treasury bond futures, respectively, which is consistent with the real-world scenario⁴. Since leverage such as margin loans is widely used for intraday trading, we apply a fixed five-times leverage to amplify profit and volatility. Time is discretized into 1-minute intervals and we assume that the agent can only long/short a financial future at the end of each minute. The account of the RL agents is initialized with enough cash to buy 50 shares of the asset at the beginning. The maximum holding position is 50.

We perform all experiments on a Tesla V100 GPU. Grid search is applied to find the optimal hyperparameters. We explore the look-ahead horizon $h$ in [30, 60, 90, 120, 150, 180], the importance of the hindsight bonus $w$ in [1e-3, 5e-3, 1e-2, 5e-2, 1e-1] and the importance of the auxiliary task $\eta$ in [0.5, 1.0]. As for neural network architectures, we search the hidden units of the MLP layers and GRU layers in [32, 64, 128] with ReLU as the activation function. We use Adam as the optimizer with learning rate $\alpha \in (1e-5, 1e-3)$ and train DeepScalper for 5 epochs on all financial assets. Following the iterative training scheme in [27], we augment traders' private state repeatedly during training to improve data efficiency. We run experiments with 5 different random seeds and report the average performance. It takes 1.5 and 3.5 hours to train and test DeepScalper on each financial asset in the stock index and treasury bond datasets, respectively. As for other baselines, we use the default settings in their public implementations⁵ ⁶.

⁴ China Financial Futures Exchange: http://www.cffex.com.cn/en_new/index.html
⁵ Qlib: https://github.com/microsoft/qlib
⁶ FinRL: https://github.com/AI4Finance-Foundation/FinRL
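A minimal sketch of the micro-level preprocessing described above (level-wise normalization by the first level, forward-filling missing prices and zero-filling missing quantities) is shown below; the column layout and names are our own assumptions:

```python
import numpy as np
import pandas as pd

def preprocess_lob(lob: pd.DataFrame) -> pd.DataFrame:
    """Normalize a 5-level LOB frame with columns like bid_price_1..5,
    ask_price_1..5, bid_qty_1..5, ask_qty_1..5 (20 features per row)."""
    lob = lob.copy()
    price_cols = [c for c in lob.columns if "price" in c]
    qty_cols = [c for c in lob.columns if "qty" in c]

    # Fill missing prices with the previous value, missing quantities with zero.
    lob[price_cols] = lob[price_cols].ffill()
    lob[qty_cols] = lob[qty_cols].fillna(0.0)

    # Divide each level by the corresponding first-level value.
    for side in ("bid", "ask"):
        ref_price = lob[f"{side}_price_1"].copy()
        ref_qty = lob[f"{side}_qty_1"].replace(0, np.nan)
        for level in range(1, 6):
            lob[f"{side}_price_{level}"] /= ref_price
            lob[f"{side}_qty_{level}"] /= ref_qty
    return lob
```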
[Figure 5: Trading day vs. net value curves of different baselines and DeepScalper on the stock index (IC and IF) and treasury bond (TF01, TF02, T01, T02) datasets. DeepScalper achieves the highest profit on all six financial assets.]

6 RESULTS AND ANALYSIS

6.1 Profitability Comparison with Baselines
We compare DeepScalper with 9 state-of-the-art baselines in terms of four financial metrics in Table 3. We observe that DeepScalper consistently generates significantly ($p < 0.01$) higher performance than all baselines on 7 out of 8 metrics across the two datasets. On the stock index dataset, DeepScalper performs best on all four metrics. Specifically, it outperforms the second best by 30.80%, 33.33%, 21.42% and 7.50% in terms of TR, SR, CR and SoR, respectively. As for the treasury bond dataset, DeepScalper outperforms the second best by 14.87%, 7.47% and 30.94% in terms of TR, SR and CR. For SoR, DS-NA performs slightly better than DS (2%, without statistical significance). One possible reason is that the volatility prediction auxiliary task is not directly relevant to controlling downside return variance.

Furthermore, we show the trading day vs. net value curves over the test period for each financial future from the two datasets in Figure 5. We intentionally exclude BAH, DS-NH and DS-NA to make the figure easy to follow. For traditional methods, we find that MV achieves decent performance for most financial futures. In comparison, TSM's performance is much worse. One possible reason for TSM's failure is that there is no evident momentum effect within the market for intraday trading. For deep learning models, the overall performance of GRU is better than that of MLP due to its ability to learn the temporal dependency of indicators. As for LGBM, it achieves slightly better performance than the deep learning models. The average performance of the RL methods is the best.

Table 3: Profitability comparison (mean and standard deviation of 5 individual runs) with 9 baselines including traditional finance (FIN), prediction-based (PRE) and reinforcement learning (RL) methods. All three FIN models are deterministic methods without a performance standard deviation. ♣ indicates that the improvement over the SOTA baseline is statistically significant (p < 0.01) under Wilcoxon's signed rank test.

                        |                    Stock Index                    |                   Treasury Bond
Type  Models            | TR(%)↑        SR↑          CR↑          SoR↑      | TR(%)↑       SR↑          CR↑          SoR↑
FIN   BAH               | 5.65          0.15         0.02         0.27      | -14.26       -3.42        -0.25        -4.40
FIN   MV                | 8.39          1.18         0.21         2.22      | -0.29        -0.59        -0.04        -0.62
FIN   TSM               | -27.62        -2.83        -0.21        -3.08     | -3.02        -5.35        -0.31        -6.20
PRE   MLP               | -0.73 ± 7.11  -0.14 ± 1.02 -0.01 ± 0.58 -0.24 ± 1.77 | 0.59 ± 1.11  1.42 ± 1.30  0.42 ± 0.51  2.33 ± 1.42
PRE   GRU               | 5.66 ± 4.98   1.25 ± 0.66  0.24 ± 0.18  2.40 ± 0.82  | 1.02 ± 2.10  1.90 ± 1.72  0.55 ± 0.57  3.69 ± 2.09
PRE   LGBM              | 7.62 ± 1.14   1.26 ± 0.22  0.28 ± 0.05  1.59 ± 0.22  | 1.45 ± 0.17  2.43 ± 0.43  0.58 ± 0.09  3.68 ± 0.52
RL    DQN               | 7.74 ± 3.52   1.25 ± 0.62  0.28 ± 0.17  1.79 ± 0.91  | 3.51 ± 1.05  4.01 ± 1.27  1.15 ± 0.39  5.66 ± 1.33
RL    DS-NH             | 8.17 ± 5.07   0.98 ± 0.77  0.17 ± 0.17  1.37 ± 0.88  | 3.38 ± 1.28  4.42 ± 1.21  1.39 ± 0.45  6.85 ± 1.19
RL    DS-NA             | 9.74 ± 5.12   1.32 ± 0.76  0.26 ± 0.21  2.19 ± 1.11  | 4.17 ± 1.44  4.27 ± 0.99  1.38 ± 0.43  7.59 ± 1.49
RL    DS                | 12.74♣ ± 4.65 1.76♣ ± 0.61 0.34♣ ± 0.16 2.58♣ ± 0.72 | 4.79♣ ± 0.99 4.75♣ ± 1.25 1.82♣ ± 0.41 7.40 ± 1.22
      % Improvement     | 30.80↑        33.33↑       21.42↑       7.50↑        | 14.87↑       7.47↑        30.94↑       2.57↓

Table 4: Ablation studies over different DeepScalper components. √ indicates adding the component to DeepScalper.
Macro  Micro  Hindsight  Volatility    TR(%)↑          SR↑
√                                      3.45            4.42
       √                               3.47            4.43
√      √                               3.62 (+0.15)    4.81 (+0.38)
√      √                 √             4.05 (+0.58)    5.03 (+0.60)
√      √      √                        5.36 (+1.89)    5.72 (+1.29)
√      √      √          √             6.97 (+3.50)    6.10 (+1.67)

[Figure 6: Hyperparameter sensitivity of the hindsight bonus on TF02 and T02: (a) effect of the importance weight, (b) effect of the horizon.]
6.2 Model Component Ablation Study
We conduct comprehensive ablation studies on how DeepScalper's investment profitability benefits from each of its components in Table 4. First, we observe that the encoder-decoder architecture learns a better multi-modality market embedding than agents trained with only macro-level or only micro-level market information, which leads to a 0.15% and 0.38 improvement in TR and SR, respectively. Next, we find that adding the volatility prediction auxiliary task to DeepScalper further improves performance, indicating that taking risk into consideration leads to a more robust market understanding. In addition, we observe that the hindsight bonus significantly improves DeepScalper's ability to evaluate trading decisions and further enhances profitability. Finally, we add all these components into DeepScalper and achieve the best performance in terms of TR and SR. These comprehensive ablation studies demonstrate that: 1) each individual component in DeepScalper is effective; 2) these components are largely orthogonal and can be fruitfully integrated to further improve performance.

[Figure 7: Trading behavior comparison of DS-NH and DS to show the effectiveness of the hindsight bonus: (a) May 26 (DS-NH), (b) May 26 (DS).]

6.3 Effectiveness of Hindsight Bonus
We analyze the effectiveness of the hindsight bonus from two perspectives. First, we explore the impact of the hindsight bonus horizon and weight. As shown in Figure 6a, with the increase of $w$, the agent tends to trade with a long-term horizon and achieves a higher profit. DeepScalper with $w = 0.1$ achieves the highest profit. Figure 6b shows the impact of the hindsight horizon $h$ on DeepScalper's performance. We observe that DeepScalper's total return gradually increases as $h$ moves from 30 to 180 and decreases when $h > 180$.

Moreover, we compare the detailed trading behaviors of agents trained with and without the hindsight bonus on a trading day with a decreasing trend in Figure 7. The generally good intraday trading strategy for that day is to short at the start of the day and long at the end of the day. We find that the agent trained without the hindsight bonus (Figure 7a) performs well in capturing local trading opportunities but overlooks the long-term trend of the entire trading day. In comparison, the agent trained with the hindsight bonus (Figure 7b) trades a large volume of short actions at the beginning of the trading day, indicating that it is aware of the decreasing trend in advance. This kind of trading action is smart, since it captures the big price gap of the overall trend and somewhat ignores the local gain or loss.

6.4 Effectiveness of Risk-Aware Auxiliary Task
Since the financial market is noisy and the RL training process is unstable, the performance variance across different random seeds is a major concern of RL-based trading algorithms. Intuitively, taking market risk into account can help the RL agent behave more stably with lower performance variance. We run experiments 5 times with different random seeds and report the relative variance relationship between RL agents trained with and without the risk-aware auxiliary task in Figure 8. We find that RL agents trained with the risk-aware auxiliary task achieve a lower TR variance on all six financial assets and a lower SR variance on 67% of the financial assets. Furthermore, we test the impact of the auxiliary task importance $\eta$ on DeepScalper's performance. Naturally, the volatility value scale is smaller than that of the return, which makes $\eta = 1$ a decent option to start with. In practice, we test $\eta \in [0, 0.5, 1]$ and find that the improvement from the auxiliary task is robust to different importance weights, as shown in Figure 9.

[Figure 8: Effect of the auxiliary task on performance variance across the six financial assets (a value > 0 means RL agents trained with the risk-aware auxiliary task get a lower standard deviation).]

[Figure 9: Sensitivity to the relative importance η (0, 0.5, 1) in terms of TR, SR, SoR and CR.]

6.5 Generalization Ability
We further test the generalization ability of our framework across different financial futures (TF02 and T02). In Figure 10, it is clear that the price trends of TF02 and T02 are similar. We assume that similar price curves share similar trading patterns. Then, we train DeepScalper using the TF02 training set and test it on the test sets of both TF02 and T02. We compare the performance of MV, GRU, LGBM, and DeepScalper in Figure 11. The red and blue lines represent the performance on TF02 and T02, respectively. We observe that the performance of MV, GRU, and LGBM on these two assets is quite different, demonstrating that they have poor generalization ability on our task. One possible reason is that the trading signals of MV, GRU and LGBM involve heuristic rules or thresholds. There will be some potential trading opportunities that come close to meeting the trading rules or thresholds, but MV, GRU, and LGBM will miss those opportunities. At the same time, our DeepScalper achieves robust performance on both TF02 and T02, as shown in Figure 11(d), although it has never seen T02 data before. All these experiments demonstrate that DeepScalper can learn a robust representation of the market and achieve good generalization ability.

[Figure 10: Price curves of TF02 and T02 over the test period (04/30 - 07/17).]

[Figure 11: Net value curves of (a) MV, (b) GRU, (c) LGBM and (d) DeepScalper on TF02 and T02, showing the generalization ability.]

7 CONCLUSION
In this article, we focus on intraday trading and propose DeepScalper to mimic the workflow of professional intraday traders. First, we apply the dueling Q-network with action branching to efficiently train intraday RL agents. Then, we design a novel reward function with a hindsight bonus to encourage a long-term horizon that captures the overall price trend. In addition, we design an encoder-decoder architecture to learn a robust market embedding by incorporating both micro-level and macro-level market information. Finally, we propose volatility prediction as an auxiliary task to help agents be aware of market risk while maximizing profit. Extensive experiments on two stock index futures and four treasury bond futures demonstrate that DeepScalper significantly outperforms many advanced methods.

REFERENCES
[1] Bo An, Shuo Sun, and Rundong Wang. 2022. Deep Reinforcement Learning for Quantitative Trading: Challenges and Opportunities. IEEE Intelligent Systems 37, 2 (2022), 23-26.
[2] Gurdip Bakshi and Nikunj Kapadia. 2003. Delta-hedged gains and the negative market volatility risk premium. The Review of Financial Studies 16, 2 (2003), 527-566.
[3] Marc G Bellemare, Will Dabney, and Rémi Munos. 2017. A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning. 449-458.
[4] John Bollinger. 2002. Bollinger on Bollinger Bands. McGraw-Hill, New York.
[5] Chi Chen, Li Zhao, Jiang Bian, Chunxiao Xing, and Tie-Yan Liu. 2019. Investment behaviors can tell what inside: Exploring stock intrinsic properties for stock trend prediction. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2376-2384.
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[7] Yue Deng, Feng Bao, Youyong Kong, Zhiquan Ren, and Qionghai Dai. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems 28, 3 (2016), 653-664.
[8] Qianggang Ding, Sifan Wu, Hao Sun, Jiadong Guo, and Jian Guo. 2020. Hierarchical multi-scale Gaussian transformer for stock movement prediction. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI). 4640-4646.
[9] Xiao Ding, Yue Zhang, Ting Liu, and Junwen Duan. 2015. Deep learning for event-driven stock prediction. In Proceedings of the 24th International Conference on Artificial Intelligence. 2327-2333.
[10] Eugene F Fama. 1970. Efficient capital markets: A review of theory and empirical work. The Journal of Finance 25, 2 (1970), 383-417.
[11] Yuchen Fang, Kan Ren, Weiqing Liu, Dong Zhou, Weinan Zhang, Jiang Bian, Yong Yu, and Tie-Yan Liu. 2021. Universal trading for order execution with oracle policy distillation. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. 107-115.
[12] Fuli Feng, Huimin Chen, Xiangnan He, Ji Ding, Maosong Sun, and Tat-Seng Chua. 2018. Enhancing stock movement prediction with adversarial training. arXiv preprint arXiv:1810.09936 (2018).
[13] Fuli Feng, Xiangnan He, Xiang Wang, Cheng Luo, Yiqun Liu, and Tat-Seng Chua. 2019. Temporal relational ranking for stock prediction. ACM Transactions on Information Systems (TOIS) 37, 2 (2019), 1-30.
[14] Harrison Hong and Jeremy C Stein. 1999. A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of Finance 54, 6 (1999), 2143-2184.
[15] Ziniu Hu, Weiqing Liu, Jiang Bian, Xuanzhe Liu, and Tie-Yan Liu. 2018. Listening to chaotic whispers: A deep learning framework for news-oriented stock trend prediction. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining. 261-269.
[16] Narasimhan Jegadeesh and Sheridan Titman. 2002. Cross-sectional and time-series determinants of momentum returns. The Review of Financial Studies 15, 1 (2002), 143-157.
[17] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. 2017. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059 (2017).
[18] Zura Kakushadze. 2016. 101 formulaic alphas. Wilmott 2016, 84 (2016), 72-81.
[19] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A highly efficient gradient boosting decision tree. (2017), 3146-3154.
[20] Guang Liu, Yuzhao Mao, Qi Sun, Hailong Huang, Weiguo Gao, Xuan Li, Jianping Shen, Ruifan Li, and Xiaojie Wang. 2020. Multi-scale two-way deep neural network for stock trend prediction. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI). 4555-4561.
[21] Yang Liu, Qi Liu, Hongke Zhao, Zhen Pan, and Chuanren Liu. 2020. Adaptive quantitative trading: An imitative deep reinforcement learning approach. In Proceedings of the 35th AAAI Conference on Artificial Intelligence. 2128-2135.
[22] Ananth Madhavan. 2000. Market microstructure: A survey. Journal of Financial Markets 3, 3 (2000), 205-258.
[23] John E Moody and Matthew Saffell. 1999. Reinforcement learning for trading. (1999), 917-923.
[24] Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. 2012. Time series momentum. Journal of Financial Economics 104, 2 (2012), 228-250.
[25] John J Murphy. 1999. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. Penguin.
[26] Ralph Neuneier. 1996. Optimal asset allocation using adaptive dynamic programming. In Proceedings of the 10th Neural Information Processing Systems. 952-958.
[27] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. 2006. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning. 673-680.
[28] James M Poterba and Lawrence H Summers. 1988. Mean reversion in stock prices: Evidence and implications. Journal of Financial Economics 22, 1 (1988), 27-59.
[29] Karsten Schierholt and Cihan H Dagli. 1996. Stock market prediction using different neural network classification architectures. In IEEE/IAFE 1996 Conference on Computational Intelligence for Financial Engineering. 72-78.
[30] William F Sharpe. 1994. The Sharpe ratio. Journal of Portfolio Management 21, 1 (1994), 49-58.
[31] Shuo Sun, Rundong Wang, and Bo An. 2021. Reinforcement learning for quantitative trading. arXiv preprint arXiv:2109.13851 (2021).
[32] Shuo Sun, Rundong Wang, and Bo An. 2022. Quantitative stock investment by routing uncertainty-aware trading experts: A multi-task learning approach. arXiv preprint arXiv:2207.07578 (2022).
[33] Arash Tavakoli, Fabio Pardo, and Petar Kormushev. 2018. Action branching architectures for deep reinforcement learning. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 4131-4138.
[34] Jia Wang, Tong Sun, Benyuan Liu, Yu Cao, and Hongwei Zhu. 2019. CLVSA: A convolutional LSTM based variational sequence-to-sequence model with attention for predicting trends of financial markets. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI). 3705-3711.
[35] Rundong Wang, Hongxin Wei, Bo An, Zhouyan Feng, and Jun Yao. 2021. Commission fee is not enough: A hierarchical reinforced framework for portfolio management. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.
[36] Zhicheng Wang, Biwei Huang, Shikui Tu, Kun Zhang, and Lei Xu. 2021. DeepTrader: A deep reinforcement learning approach to risk-return balanced portfolio management with market conditions embedding. In Proceedings of the 35th AAAI Conference on Artificial Intelligence.
[37] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. 2016. Dueling network architectures for deep reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning. 1995-2003.
[38] Yumo Xu and Shay B Cohen. 2018. Stock movement prediction from tweets and historical prices. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 1970-1979.
[39] Yunan Ye, Hengzhi Pei, Boxin Wang, Pin-Yu Chen, Yada Zhu, Ju Xiao, and Bo Li. 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI). 1112-1119.
[40] Jaemin Yoo, Yejun Soun, Yong-chan Park, and U Kang. 2021. Accurate multivariate stock movement prediction via data-axis transformer with multi-level contexts. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2037-2045.
[41] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. 2019. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740 (2019).
[42] Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J Mankowitz, and Shie Mannor. 2018. Learn what not to learn: Action elimination with deep reinforcement learning. In Advances in Neural Information Processing Systems (2018).
[43] Liheng Zhang, Charu Aggarwal, and Guo-Jun Qi. 2017. Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2141-2149.
[44] Zihao Zhang, Stefan Zohren, and Stephen Roberts. 2020. Deep reinforcement learning for trading. The Journal of Financial Data Science 2, 2 (2020), 25-40.
