Deep Reinforcement Learning for Quantitative Trading

arXiv:2312.15730v1 [q-fin.TR] 25 Dec 2023

1st Maochun Xu, dept. Financial and Actuarial Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou, China ([email protected])
2nd Zixun Lan, dept. Applied Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou, China ([email protected])
3rd Zheng Tao, dept. Financial and Actuarial Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou, China ([email protected])
4th Jiawei Du, dept. Financial and Actuarial Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou, China ([email protected])
5th Zongao Ye, dept. Applied Mathematics, Xi’an Jiaotong-Liverpool University, Suzhou, China ([email protected])

Abstract—Artificial Intelligence (AI) and Machine Learning (ML) are transforming the domain of Quantitative Trading (QT) through the deployment of advanced algorithms capable of sifting through extensive financial datasets to pinpoint lucrative investment openings. AI-driven models, particularly those employing ML techniques such as deep learning and reinforcement learning, have shown great prowess in predicting market trends and executing trades at a speed and accuracy that far surpass human capabilities. Their capacity to automate critical tasks, such as discerning market conditions and executing trading strategies, has been pivotal. However, persistent challenges exist in current QT methods, especially in effectively handling noisy and high-frequency financial data. Striking a balance between exploration and exploitation poses another challenge for AI-driven trading agents. To surmount these hurdles, our proposed solution, QTNet, introduces an adaptive trading model that autonomously formulates QT strategies through an intelligent trading agent. Incorporating deep reinforcement learning (DRL) with imitative learning methodologies, we bolster the proficiency of our model. To tackle the challenges posed by volatile financial datasets, we conceptualize the QT mechanism within the framework of a Partially Observable Markov Decision Process (POMDP). Moreover, by embedding imitative learning, the model can capitalize on traditional trading tactics, nurturing a balanced synergy between discovery and utilization. For a more realistic simulation, our trading agent undergoes training using minute-frequency data sourced from the live financial market. Experimental findings underscore the model’s proficiency in extracting robust market features and its adaptability to diverse market conditions.

Index Terms—Quantitative Trading, Reinforcement Learning

I. INTRODUCTION

In the realm of financial security investment, quantitative trading (QT) is distinguished by its substantial automation, utilizing computing technology to diminish dependence on personal discretion and mitigate illogical decision-making. As the dominance of quantitative hedge funds grows, there is an increasing focus on integrating machine learning into QT, especially in the context of Fintech. These technologies enable the creation of dynamic trading strategies that can adapt to market changes in real time, manage risks more effectively, and ultimately enhance the profitability and efficiency of trading operations. As the financial markets become increasingly complex, the integration of AI and ML in QT is becoming indispensable for maintaining competitive advantage.

In the fluctuating environment of financial markets, trading behaviors and economic events are inherently unpredictable, leading to the generation of volatile and non-stationary data. Despite technical analysis being a widely used methodology in Quantitative Trading (QT), its capacity for generalization has been questioned, underscoring the urgency for more resilient features that can be mined directly from raw financial data. As a response, machine learning techniques, especially deep learning models, have been investigated for their potential to predict market trends and improve generalization. Nonetheless, the scope of QT extends well beyond predicting trends; it necessitates the formulation of strategic trading methods. Although reinforcement learning (RL) provides a methodical framework for tackling tasks that involve a series of decisions, achieving an equilibrium between the discovery of novel strategies and the utilization of established ones poses a considerable challenge, especially in the context of the pragmatic constraints encountered in actual trading scenarios.

In response to these difficulties, we introduce Observational and Recurrent Deterministic Policy Gradients (QTNet). We cast QT within the framework of a Partially Observable Markov Decision Process (POMDP) to effectively tackle the representation of unpredictable financial data. Recurrent neural networks, as used in the Recurrent Deterministic Policy Gradient (RDPG), are employed to handle the POMDP, yielding an off-policy deep reinforcement learning (DRL) algorithm.

Balancing exploration and exploitation is addressed through imitative learning techniques. A demonstration buffer, initialized with actions from Dual Thrust, and behavior cloning are introduced to guide the trading agent. By integrating these techniques into the POMDP framework, QTNet benefits from enhanced financial domain knowledge. Real financial data from the futures market is used to test QTNet, demonstrating its ability to learn profitable trading policies and exhibit superior generalization across various futures markets.

II. RELATED WORK

Research within our field typically falls into two primary classifications. The first, extensively documented by Murphy in 1999, is Quantitative Trading (QT) that is fundamentally dependent on Technical Analysis. This method assumes that all pertinent market details are encoded within the price and volume figures, utilizing a variety of technical indicators (mathematically derived instruments) to signal trading actions. Although these indicators are prevalent, their rigidity often hinders their ability to conform to the market’s multifaceted and evolving patterns. Within this realm of indicators, two dominant types exist: moving averages, which aim to discern the direction of price trends, and oscillators, designed to gauge the market’s momentum. Each type, however, has its inherent constraints, as both are based on historical data, which might not always be a reliable predictor of future market behavior due to the dynamic nature of financial markets. This reliance on past data to forecast future trends often fails to accommodate the unpredictable fluctuations that characterize market movements, leading to a demand for more adaptable and nuanced trading tools.

Lately, the integration of machine learning techniques into the realm of securities investment has seen a notable surge, with particular emphasis on Reinforcement Learning (RL) for its aptitude in tackling sequential decision-making tasks. The exploration of RL within the context of Quantitative Trading (QT) is not new; Littman (1996) [9], [17] was a pioneer in applying Q-learning, a conventional value-based RL algorithm. The landscape of complex problem spaces in QT has acknowledged the limitations of traditional methods. In response, Murphy (1999) [4]–[6], [12] and his colleagues, around the turn of the millennium, shifted the focus toward policy-based RL approaches. They put forth the concept of recurrent reinforcement learning (RRL), a method better suited to handle the intricacies and temporal dependencies inherent in financial decision-making processes. This shift underscored the evolving nature of machine learning in finance, seeking algorithms that could not only predict but also learn and adapt strategies over time, harnessing the vast amounts of data to navigate the oft-turbulent waters of the stock market.

Traditional RL faces challenges in selecting appropriate market features. Deep Reinforcement Learning (DRL), combining RL and deep learning, is well-suited for high-dimensional data problems. DRL has shown advancements in tackling complex tasks, including video games, and extends its potential to QT. Jiang, Xu, and Liang (2017) [3], [7] leveraged Deep Deterministic Policy Gradient (DDPG) for cryptocurrency portfolios. Efforts have also been made to enhance financial signal representation using fuzzy learning and deep neural networks (DNNs). Model-free reinforcement learning algorithms, while effective, often struggle with sampling efficiency in extensive state spaces, a situation frequently encountered in QT. To address this, Yu et al. (2019) [18] and Lan et al. (2023) [20] developed a model-based RL approach for daily-frequency portfolio management. However, this framework does not fully cater to the needs of minute-frequency data, which is more common in QT. To bridge this gap, our study proposes a policy-based RL mechanism that operates in continuous time and is augmented with Recurrent Neural Networks (RNNs). This advanced DRL architecture is tailored to grasp the subtle nuances of minute-by-minute financial data and exhibits flexibility in various financial market conditions.

III. PROBLEM DEFINITION

In this section, we first define the mathematical symbols and then formally introduce the quantitative trading problem.

A. Foundations

For each discrete moment t, we define the OHLC price array as p_t = [o^p_t, h^p_t, l^p_t, c^p_t], where o^p_t, h^p_t, l^p_t, and c^p_t correspondingly represent the opening, highest, lowest, and closing prices. The comprehensive price array for a financial instrument is expressed as P = [p_1, ..., p_t, ...]. For convenience, P_{t−n:t} symbolizes the historical price array within a given timeframe, with n denoting the length of the look-back period or window length. The technical indicator vector at time t is denoted as I_t = [I^1_t, ..., I^j_t], where I^j_t is a function of the historical price sequence P_{t−n:t} through I^j_t = f(P_{t−n:t}; θ_j), with θ_j representing the parameters of technical strategy j. The sequence of technical indicators is I = [I_1, ..., I_t, ...]. Similarly, the account profit at time step t is r_t, and the sequence of account profits is R = [r_1, ..., r_t, ...].
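To make this notation concrete, the short sketch below is an illustrative assumption rather than code from the paper: it stacks minute-bar OHLC series into P, extracts a look-back window P_{t−n:t}, and evaluates one hypothetical technical strategy I^j_t = f(P_{t−n:t}; θ_j) (a simple moving average of closes; the function and column layout are our own choices).

```python
# Illustrative sketch of the Section III-A notation (not from the paper).
# A price element p_t is the OHLC vector [o_t, h_t, l_t, c_t]; an indicator
# I_t^j = f(P_{t-n:t}; theta_j) is any function of the look-back window.
import numpy as np

def ohlc_array(opens, highs, lows, closes):
    """Stack minute-bar series into P = [p_1, ..., p_T], each p_t = [o, h, l, c]."""
    return np.stack([opens, highs, lows, closes], axis=1)

def lookback(P, t, n):
    """Historical price array P_{t-n:t} (the last n bars up to and including t)."""
    return P[max(0, t - n + 1): t + 1]

def moving_average_indicator(P_window, theta_j=None):
    """A hypothetical technical strategy f(P_{t-n:t}; theta_j): mean close of the window."""
    closes = P_window[:, 3]
    return closes.mean()

# Tiny synthetic example (values are made up).
rng = np.random.default_rng(0)
closes = 100 + np.cumsum(rng.normal(0, 0.1, size=240))
opens = np.r_[closes[0], closes[:-1]]
highs = np.maximum(opens, closes) + 0.05
lows = np.minimum(opens, closes) - 0.05
P = ohlc_array(opens, highs, lows, closes)

n = 30
I = np.array([moving_average_indicator(lookback(P, t, n)) for t in range(n, len(P))])
print(P.shape, I.shape)  # (240, 4) (210,)
```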
B. POMDP

In this section, we explore the distinctive attributes of Quantitative Trading (QT) and elaborate on the rationale behind framing the entire QT process as a Partially Observable Markov Decision Process (POMDP). The financial market is characterized by security prices influenced by both bullish and bearish investors, macroeconomic and microeconomic activities, unpredictable occurrences, and diverse trading behaviors, all contributing to inherent noise. Direct observation of the true market state is unattainable due to these complexities. For instance, the impact of positive news or order executions remains uncertain, and the available data for analysis is limited to historical prices and volumes, offering only a partial view of the market state. QT, as a sequential decision-making challenge, revolves around determining what and when to trade.

Within the realm of QT, the analytical structure evolves from a classic Markov Decision Process (MDP) to a Partially Observable Markov Decision Process (POMDP). An MDP is defined by a quintuple [S, A, P, R, γ], where S denotes a distinct set of states, A represents a specific set of actions, P is the state transition probability function, and R signifies the reward function. By incorporating observational elements O and Z, with O being the observation set and Z the observation probability function, the framework is modified into a POMDP. In this context, the agent receives various observations at each interval, and the comprehension of the observable history until time t is fundamental in capturing the dynamics of QT.

Observation. In the financial market, observations are split into two distinct categories: the portfolio observation subset o^p_t and the economic observation subset o^e_t. The term o^p_t represents the aggregated profit of the portfolio, whereas o^e_t is associated with market pricing and technical metrics. The overall observation set O comprises {P, I, R}, leveraging technical indicators such as the BuyLine and SellLine from the Dual Thrust strategy.

Action. Trading actions involve a continuous probability vector a_t representing long and short positions. The agent executes the action associated with the maximum probability. Actions are represented as a_t ∈ {long, short} = {1, −1}, simplifying position management and mitigating the impact of market capacity. Trading actions are interpreted as signals, guiding the execution of trades based on specific rules.

Reward. Practical constraints such as transaction fees and slippage are incorporated into the trading simulation. The account profit r_t is computed considering these factors. To address the inefficiency of r_t as a reward function, the Sharpe ratio is adopted. The Sharpe ratio (Sr) serves as a risk-adjusted return measure, representing the ratio of excess return to one unit of total risk.

This comprehensive framework allows for the representation of QT challenges in a dynamic and uncertain financial environment while incorporating realistic market constraints.

IV. OBSERVATIONAL AND RECURRENT DPG

In this section, we introduce QTNet, our designed model, specifically devised for the POMDP setup in Quantitative Trading (QT). This model integrates elements from the Recurrent Deterministic Policy Gradient (RDPG) and imitative learning in a structured manner. Furthermore, a graphical representation of the QTNet architecture is provided in Figure 1.

A. Recurrent DPG

Silver (2015) [8] unveiled the Deterministic Policy Gradient (DPG), a specialized off-policy reinforcement learning (RL) algorithm designed for continuous action spaces. Its application is especially pertinent to the domain of high-frequency Quantitative Trading (QT), where decisions need to be made on a continuum. The DPG’s ability to handle the demands of constant trading activity, mindful of the costs associated with frequent transaction changes, makes it a suitable match for QT’s requirements. Building on DPG, the Recurrent Deterministic Policy Gradient (RDPG), introduced by Heess et al. (2015) [1], incorporates a recurrent structure to more adeptly navigate the QT landscape. This methodology acknowledges the importance of historical data sequences for QT, where the market’s state is not fully observable. Agents in this system utilize a historical compilation of market and personal account observations, including prices, indicators, and profits, encapsulated within a sequence H denoted by h_t = (o_1, a_1, ..., o_{t−1}, a_{t−1}, o_t). RDPG employs recurrent neural networks (RNNs) to assimilate this history, enhancing the agent’s ability to retain and utilize past information to inform future trades.

This approach aligns with the actor-critic methodology, where RDPG learns both a deterministic policy (actor) and an action-value function (critic). It aims to optimize the estimated action-value function Q^µ, facilitated by the actor function µ(h) and the critic function Q(h, a), parameterized by θ and ω respectively. A replay buffer is also maintained, archiving sequences of observations, actions, and rewards.

Our research innovates further by integrating Long Short-Term Memory (LSTM) networks into the QT framework, offering an alternative to Gated Recurrent Units (GRUs). The LSTM treats the prior observation-action history h_{t−1} as the latent state from the previous timestep, leading to the current state h_t being formulated as h_t = LSTM(h_{t−1}, a_{t−1}, o_t). This substitution underscores the potential of LSTMs in capturing the temporal intricacies of the market, a crucial aspect for effective trading strategies within our POMDP model for QT.
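As a minimal sketch of this recurrent state update, assuming a PyTorch implementation with hypothetical layer sizes (the paper does not specify them), an LSTM cell can fold the previous action and the current observation into the hidden state that the actor maps to long/short probabilities; the critic Q(h, a), targets, and replay buffer are omitted here.

```python
# Minimal sketch (assumed PyTorch implementation; dimensions are hypothetical).
# h_t = LSTM(h_{t-1}, a_{t-1}, o_t): the cell input is [o_t, a_{t-1}];
# the actor mu(h_t) outputs probabilities over {long, short}.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentActor(nn.Module):
    def __init__(self, obs_dim, action_dim=2, hidden_dim=64):
        super().__init__()
        self.cell = nn.LSTMCell(obs_dim + action_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, obs, prev_action, state=None):
        # obs: (batch, obs_dim), prev_action: (batch, action_dim)
        x = torch.cat([obs, prev_action], dim=-1)
        h, c = self.cell(x, state)                       # recurrent history update
        probs = torch.softmax(self.policy_head(h), -1)   # long/short probabilities
        return probs, (h, c)

# Usage: roll the actor over a synthetic minute-bar episode, executing argmax as the signal.
actor = RecurrentActor(obs_dim=8)
obs_seq = torch.randn(32, 120, 8)   # (batch, T, obs_dim), made-up data
a_prev = torch.zeros(32, 2)
state = None
for t in range(obs_seq.size(1)):
    probs, state = actor(obs_seq[:, t], a_prev, state)
    a_prev = F.one_hot(probs.argmax(-1), 2).float()
```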
Fig. 1. The overview of the QTNet model. At each time step the observation (price p_t, indicator I_t, profit r_t) is encoded by a GRU layer into a hidden state h_t; the hidden states feed the actor, critic, and target networks, which produce the action a_t, reward r_t, and training losses, while a demonstration buffer supplies expert episodes.

B. Imitative Learning Based on DB and BC

Effectively navigating the dynamic intricacies of financial market data poses a significant challenge due to the exponential growth of the exploration value space. Traditional model-free Reinforcement Learning (RL) algorithms face limitations in devising profitable policies efficiently within the context of Quantitative Trading (QT). Moreover, random exploration, lacking specific goals, becomes inefficient, especially when considering the imperatives of trading continuity and market friction factors. However, leveraging the model-free Recurrent Deterministic Policy Gradient (RDPG) algorithm with well-defined training goals proves to be a promising approach. As an off-policy algorithm, RDPG demonstrates adaptability to auxiliary data, providing the basis for the introduction of two pivotal components: the Demonstration Buffer and Behavior Cloning. These modules serve distinct roles in passive and active imitative learning, respectively [10]–[12], [16], [19], strategically guiding our RDPG agent through the complex landscape of QT.

Demonstration Buffer (DB). Initially, a prioritized replay buffer denoted as DB is established and pre-filled with demonstration episodes (o_1, a_1, r_1, ..., o_T, a_T, r_T) obtained from the Dual Thrust strategy. Drawing inspiration from the methodologies introduced in DQfD (Hester et al. (2018) [2]) and DDPGfD (Vecerik (2017) [17]), our approach involves pretraining the agent using these demonstrations prior to engaging in the actual interaction. This pre-training, enriched with insights from technical analysis, equips the agent with a foundational trading strategy from the outset.

During the training phase, each minibatch consists of a mixture of instructional and agent-based episodes, selected using prioritized experience replay (PER) (Schaul et al. (2015) [15]). PER promotes the selection of episodes with greater significance more frequently. The selection likelihood of an episode P(i) is directly linked to its importance, defined as P(i) = p_i / Σ_k p_k, where p_i denotes the significance of episode i. In our implementation, we utilize the episode importance criteria proposed by Vecerik (2017) [17]. Within this framework, p_i is established as follows:

p_i = E[(y^i_t − Q_ω(h^i_t, a^i_t))^2] + λ_0 ‖∇_a Q_ω(h^i_t, a^i_t)‖^2 + ϵ_D,   (1)

In this instance, the initial term signifies the episode’s critic loss L_i, and the subsequent term reflects the magnitude of the actor’s gradient. The constant ϵ_D, added only for demonstration episodes, enhances their sampling likelihood, while λ_0 weighs the actor gradient’s influence. With the alteration in sample allocation, network updates are adjusted using importance-sampling weights w_i:

w_i = (1/N) × (1/P(i)^ϕ)   (2)

Here, ϕ is a constant exponent. This prioritized demonstration buffer controls the data ratio between demonstration and agent episodes. Importantly, it facilitates the efficient propagation of rewards.

Behavior Cloning (BC). Behavior Cloning is utilized to define goals for every trading move. Intra-day opportunistic actions ā are integrated as expert-level maneuvers. In hindsight, we construct a visionary trading expert who invariably opts for a long position when prices are at their lowest and a short position at their peak. At each stage of training, the Behavior Cloning method (Ross and Bagnell (2010) [14]) is applied to measure the differences between the agent’s actions and the strategies of this foresighted expert. Specifically, we employ the Behavior Cloning loss (L_BC) selectively, only when the critic Q(h, a) indicates that the expert actions outperform the actor actions:

L_BC = −E[max(0, Q(h_t, ā_t) − Q(h_t, µ_θ(h_t)))]   (3)

This adjustment, referred to as the Q-Filter (Nair et al. (2018) [13]), ensures that the Behavior Cloning losses are only considered when the expert actions are superior. The Behavior Cloning loss L_BC serves as an auxiliary loss for updates. Consequently, a modified policy gradient ∇_θ J̄ is applied to the actor:

∇_θ J̄ = λ_1 ∇_θ J + λ_2 ∇_θ L_BC   (4)

In this context, ∇_θ J signifies the standard RDPG policy gradient, while λ_1 and λ_2 are parameters that govern the balance between the two loss terms. By integrating actions derived from expert insights, we set specific objectives for each stage of training, thereby minimizing phases of unproductive exploration. Behavior Cloning proves to be a valuable technique in aligning agent actions with expert strategies for improved trading performance.
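The snippet below is a small, self-contained sketch of how Equations (1)-(4) could be wired together, under assumed inputs (precomputed per-episode TD losses, actor-gradient norms, and critic values); the function names, the constants λ_0, ϵ_D, ϕ, and the normalization of the weights by their maximum are our own illustrative choices, not the paper's implementation.

```python
# Hedged sketch of the prioritized demonstration buffer and the Q-filtered
# behavior-cloning loss (Eqs. (1)-(4)); inputs are assumed to be precomputed.
import numpy as np

def episode_priorities(td_losses, actor_grad_norms, is_demo, lam0=0.1, eps_demo=0.2):
    """Eq. (1): p_i = critic loss + lam0 * |grad_a Q|^2, plus eps_D for demo episodes."""
    return td_losses + lam0 * actor_grad_norms**2 + eps_demo * is_demo.astype(float)

def sampling_probs_and_weights(p, phi=0.4):
    """P(i) = p_i / sum_k p_k and w_i = (1/N) * (1/P(i))**phi (Eq. (2))."""
    P = p / p.sum()
    w = (1.0 / len(p)) * (1.0 / P) ** phi
    return P, w / w.max()   # assumption: normalize weights for stable updates

def bc_loss_qfilter(q_expert, q_actor):
    """Eq. (3): penalize only where the critic rates the expert action higher (sign as printed)."""
    return -np.maximum(0.0, q_expert - q_actor).mean()

# Eq. (4) would then combine gradients as lam1 * grad_J + lam2 * grad_L_BC
# inside the actor update; gradients themselves are not computed in this sketch.

# Toy usage with made-up numbers.
td = np.array([0.8, 0.2, 0.5])
gn = np.array([0.1, 0.3, 0.2])
demo = np.array([True, False, False])
P, w = sampling_probs_and_weights(episode_priorities(td, gn, demo))
print(P.round(3), w.round(3), bc_loss_qfilter(np.array([1.2, 0.4]), np.array([0.9, 0.7])))
```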
V. EXPERIMENTS

In our empirical evaluation, we test the efficacy of our trading model by performing a back-test that incorporates data sampled at one-minute intervals from the Chinese financial futures market. This data encompasses two key stock-index futures: IF, which is constructed from a composite index reflecting the performance of the 300 foremost stocks traded on the Shanghai and Shenzhen stock exchanges, and IC, which similarly tracks an index but focuses on stocks with smaller market capitalizations. The back-testing process not only adheres to stringent, real-world trading conditions but also provides a detailed temporal resolution of market behavior. To illustrate our data’s granularity, Figure 2 and Figure 3 present a time series of the closing prices recorded on a minute-by-minute basis for both the IC and IF futures, thereby offering a precise visual account of their market movements during the testing period.

Fig. 2. Minute-by-minute closing prices of the IC futures.

Fig. 3. Minute-by-minute closing prices of the IF futures.

A. Setup

In our study, we deploy a data-driven approach, utilizing minute-bar OHLC (Open, High, Low, Close) data for futures trading, which captures the nuances of price movements within each minute. This high-frequency data, commonly found in actual financial markets, poses unique challenges for reinforcement learning (RL) algorithms, particularly in maintaining action continuity. The dataset is sourced from the Python package 'qstock', an open-source quantitative investment analysis library that currently includes four modules: data acquisition, visualization, stock selection, and quantitative backtesting. Its data module draws on open data from the Oriental Wealth network, Flush, Sina Finance, and other online resources.

Data spans a training period from September 28, 2015, to November 10, 2022, and a subsequent testing period from November 11, 2022, to November 10, 2023. To mirror real-market conditions more accurately, we integrate practical elements like a transaction fee (denoted δ = 2 × 10^−5) and a fixed slippage amount (ζ = 0.15).

For the purposes of our simulation, certain assumptions are made. We presume that each order is executed right at the opening of each minute bar, with the rewards calculated at the minute’s closing. Key futures-specific considerations, such as margin requirements and contract settlement peculiarities, are factored into our model. The training process is designed to conclude under either of two conditions: a 50% loss of positions or the depletion of the required margin. We set the initial account balance at $1,000,000 for the commencement of the testing phase.

The performance of our trading policy is assessed using a suite of standard quantitative trading (QT) metrics:
• Total Return Rate (Tr): calculated as (P_end − P_start)/P_start, where P represents the total value of the position and cash.
• Sharpe Ratio (Sr): originally introduced by Sharpe (1966), it is defined as E[r]/σ[r], providing a measure of excess return over unit risk.
• Volatility (Vol): expressed by σ[r], with r representing the historical series of return rates; this metric illustrates the variability of returns and the associated risk factors of the strategies.
• Maximum Drawdown (Mdd): determined by max (P_i − P_j)/P_i for j > i, this metric quantifies the most significant historical reduction, signifying the most adverse scenario encountered.

These metrics collectively offer a comprehensive evaluation of our strategy, with Tr providing a direct gauge of profitability, Sr and Vol offering insights into risk-adjusted returns and the stability of returns, and Mdd presenting a measure of potential worst-case scenarios.
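As a worked illustration of these four metrics, the sketch below computes them from a series of portfolio values (position plus cash); the toy account curve and the simple per-step return definition (no annualization) are our own assumptions, matching only the plain definitions above.

```python
# Sketch of the evaluation metrics (Tr, Sr, Vol, Mdd) from a value curve.
import numpy as np

def total_return_rate(values):
    """Tr = (P_end - P_start) / P_start."""
    return (values[-1] - values[0]) / values[0]

def sharpe_ratio(returns):
    """Sr = E[r] / sigma[r] over the historical return series."""
    return returns.mean() / returns.std()

def volatility(returns):
    """Vol = sigma[r]."""
    return returns.std()

def max_drawdown(values):
    """Mdd = max over j > i of (P_i - P_j) / P_i."""
    running_peak = np.maximum.accumulate(values)
    drawdowns = (running_peak - values) / running_peak
    return drawdowns.max()

# Toy example with a made-up account curve starting at $1,000,000.
values = np.array([1_000_000, 1_010_000, 995_000, 1_020_000, 980_000, 1_050_000], dtype=float)
returns = np.diff(values) / values[:-1]
print(total_return_rate(values), sharpe_ratio(returns), volatility(returns), max_drawdown(values))
```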
B. Imitative Learning Strategies

In our research, we incorporate the Dual Thrust strategy into the Demonstration Buffer module as a model trading policy. This strategy, rooted in the principles of technical analysis, particularly oscillators, sets a rational price oscillation range using a calculation based on the highest high (HH), lowest close (LC), highest close (HC), and lowest low (LL) of the past n periods. The range is defined as the maximum of either HH minus LC or HC minus LL. On each trading day, two key thresholds, the BuyLine and the SellLine, are established by respectively adding and subtracting a specified proportion of this Range to the day’s opening price, where K1 and K2 are constants dictating the market’s resistance levels for breaching these thresholds.

The Dual Thrust strategy creates trading signals when the current price breaches the Range’s threshold, either upwards or downwards. In the context of our Behavioral Cloning module, we also introduce intra-day greedy actions as the benchmark for expert behavior. This involves taking long positions at the day’s lowest price and short positions at the highest, mimicking a theoretically optimal, albeit opportunistic, strategy. These actions provide a vital reference for training the agent, with the expert action at each time step determined by the minimum or maximum opening prices over a trading day.

For every trading session, the upper boundary, termed the BuyLine, and the lower boundary, known as the SellLine, are calculated by adding or subtracting a specific percentage of the Range to or from the opening price of the day. The key components of the Dual Thrust strategy are expressed as follows:

Range = max[HH − LC, HC − LL],
BuyLine = Open + K1 × Range,   (5)
SellLine = Open − K2 × Range,

Here, Open represents the day’s opening price, and K1 and K2 are constants controlling the resistance levels of market prices against breaking the BuyLine and SellLine, respectively. Additionally, HH denotes the highest high price over the previous n time periods. In practical terms, trading signals are generated by the Dual Thrust strategy when the current price exceeds a certain percentage of the Range either upwards or downwards.

Moving into the Behavioral Cloning module, intra-day greedy actions are introduced as expert actions from a hindsight perspective. Adopting a long position at the lowest price and a short position at the highest price consistently represents a relatively optimal, greedy strategy. This prophetic policy serves as the expert action for the agent throughout the training process. At each time step t, the expert action is determined by:

ā_t = 1,   if t = arg min P_o^{t−nd:t}
ā_t = −1,  if t = arg max P_o^{t−nd:t}

Here, nd denotes the length of one trading day, and P_o^{t−nd:t} represents the sequence of opening prices for the day. This approach provides a rich foundation for understanding and incorporating expert behavior into the agent’s learning process.
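To make the demonstration and expert policies concrete, here is a hedged sketch of the Dual Thrust thresholds in Eq. (5) and the hindsight expert action ā_t; the helper names, the default K1 and K2 values, and the "0 = no new signal inside the range" convention are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of the Dual Thrust thresholds (Eq. (5)) and the hindsight expert action.
import numpy as np

def dual_thrust_lines(day_open, highs, lows, closes, k1=0.5, k2=0.5):
    """Range = max(HH - LC, HC - LL) over the look-back window; lines per Eq. (5)."""
    hh, ll = highs.max(), lows.min()
    hc, lc = closes.max(), closes.min()
    rng = max(hh - lc, hc - ll)
    return day_open + k1 * rng, day_open - k2 * rng   # BuyLine, SellLine

def dual_thrust_signal(price, buy_line, sell_line):
    """Demonstration action: +1 (long) above the BuyLine, -1 (short) below the SellLine."""
    if price > buy_line:
        return 1
    if price < sell_line:
        return -1
    return 0   # assumption: no new signal while the price stays inside the range

def hindsight_expert_actions(day_opens):
    """a_bar_t = +1 at the argmin of the day's opening prices, -1 at the argmax."""
    a = np.zeros(len(day_opens), dtype=int)
    a[np.argmin(day_opens)] = 1
    a[np.argmax(day_opens)] = -1
    return a

# Toy example with made-up bars from the previous n periods and one trading day.
prev_highs = np.array([101.2, 101.9, 102.4])
prev_lows = np.array([99.1, 99.8, 100.3])
prev_closes = np.array([100.5, 101.1, 101.8])
buy, sell = dual_thrust_lines(day_open=101.0, highs=prev_highs, lows=prev_lows, closes=prev_closes)
print(buy, sell, dual_thrust_signal(103.0, buy, sell),
      hindsight_expert_actions(np.array([101.0, 100.2, 102.5])))
```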
Our study also includes a comparison of the proposed QTNet method against several baseline strategies for context. These include the Long & Hold strategy, where a long position is maintained throughout the test period, and its counterpart, the Short & Hold strategy. Additionally, we examine the Deep Deterministic Policy Gradient (DDPG) algorithm, known for its effectiveness in continuous control scenarios, as a comparative benchmark. This juxtaposition allows us to evaluate the effectiveness of our approach in the dynamic and often unpredictable realm of quantitative trading.

C. Baseline Methods

Our proposed QTNet is compared with several baseline strategies:
• Long & Hold: this strategy involves initiating a long position at the outset and maintaining the position until the conclusion of the test period.
• Short & Hold: in contrast to the Long & Hold strategy, Short & Hold commences with taking a short position at the beginning and retaining the position throughout the test period.
• DDPG (Lillicrap et al. (2015) [8]): Deep Deterministic Policy Gradient, an off-policy, model-free actor-critic reinforcement learning algorithm. Typically, it exhibits robust performance in tasks characterized by continuous control.

D. Experimental Outcomes

In this section, we present a comprehensive set of experiments to compare the performance of our QTNet model against baseline methods. Ablation experiments are also conducted to highlight the significance of each module within QTNet. To evaluate the generalization capabilities of both QTNet and the Dual Thrust strategy, tests are performed across distinct futures markets.

E. Data Demonstration

In our study, we utilized minute-level frequency data for the IC futures to assess the performance of the QTNet model against several baseline trading strategies. As detailed in Table I, QTNet consistently outperforms traditional methods, particularly in terms of the total return rate (Tr) and Sharpe Ratio (Sr), indicating its adaptability and profitability in a high-frequency quantitative trading setting. Although DDPG is a renowned reinforcement learning algorithm that excels in certain scenarios, it displayed relatively lower performance in our tests, which may be attributed to its limited adaptability in handling high-frequency trading data.

QTNet also demonstrated impressive performance in the critical risk management metric of Maximum Drawdown (Mdd), showing more robustness during market downturns compared to straightforward strategies such as Long & Hold and Short & Hold, thereby minimizing potential maximum losses. This result highlights the resilience of the QTNet strategy in the face of market volatility and its capacity to maintain stability in complex market conditions. Overall, the comprehensive performance of QTNet validates the efficiency of recurrent GRU networks in capturing the complexities of market dynamics in quantitative trading.

TABLE I
PERFORMANCE OF COMPARISON METHODS ON IC.

Methods        Tr (%)   Sr       Vol     Mdd (%)
Long & Hold    -8.32    -0.318   0.746   62.84
Short & Hold    9.53     0.167   0.658   49.30
DDPG          -17.26    -0.428   0.498   57.22
QTNet          20.28     0.562   0.421   23.73
F. Ablation Experiments

The ablation studies focusing on IC futures shed light on the significance of each element within the QTNet framework with respect to its aggregate performance. The data in Table II demonstrate that the integrated return trajectory of QTNet stands out against its counterparts for the duration of the evaluation phase. This study dissects QTNet into three distinct variants: the basic RDPG utilizing only GRU networks, RDPG-DB, which integrates a demonstration buffer, and RDPG-BC, which employs a behavior cloning module. The analysis shows that the base RDPG variant grapples with achieving action coherence and mastering a lucrative trading approach. The assimilation of the demonstration buffer in RDPG-DB markedly augments the efficiency of experience sampling, while the behavior cloning feature in RDPG-BC curtails the detrimental effects of random exploration on the agent’s performance. The fully-fledged QTNet, amalgamating all the components, evidently excels in terms of both profit generation and risk management metrics, underlining the value of imitative learning strategies in this domain.

TABLE II
ABLATION EXPERIMENTS ON IC.

Methods    Tr (%)   Sr      Vol     Mdd (%)
RDPG        8.96    0.067   0.590   34.62
RDPG-DB    18.72    0.467   0.452   23.72
RDPG-BC    31.24    0.619   0.427   29.81
QTNet      36.26    0.742   0.411   25.37

G. Generalization Ability

The versatility of the QTNet model is assessed by its performance in varying market conditions, specifically in the IF and IC futures, as depicted in Table III. The table elucidates the stark contrast in Dual Thrust’s effectiveness between the IC and IF markets, which illustrates the potential constraints of fixed trading indicators. On the other hand, QTNet exhibits superior performance, especially in the IC market, despite being trained solely on IF data. This demonstrates QTNet’s inherent flexibility and its proficiency in discerning durable features that are crucial for formulating dynamic trading strategies applicable to different market environments. The model’s ability to perform well across various markets cements its standing as a robust and adaptable tool for real-world financial applications.

TABLE III
PERFORMANCE OF COMPARISON METHODS ON IC AND IF.

Methods       Data   Tr (%)    Sr       Vol     Mdd (%)
Dual Thrust   IC      26.40     0.810   0.469   19.24
              IF     -29.59    -0.577   0.796   55.81
QTNet         IC      34.27     0.742   0.457   24.56
              IF      28.73     0.523   0.517   26.25

VI. CONCLUSION

In our research, we unveiled QTNet, a dynamic, imitative learning framework designed for the challenges of Quantitative Trading (QT). We constructed a Partially Observable Markov Decision Process (POMDP) to adeptly handle the erratic nature of high-frequency trading data. Crucially, our model harnesses the principles of imitative learning to harmonize the exploration-exploitation trade-off, thus refining the decision-making process of our trading agent. We validated the performance of QTNet by conducting rigorous evaluations using authentic data from stock-index futures, acknowledging the operational limitations typical of financial trading. The outcomes affirmed the financial viability and risk mitigation prowess of QTNet. Through a series of comparative analyses, the model’s capacity to adapt to varied market conditions was clearly demonstrated. To conclude, QTNet underscores the advantage for trading agents in real-world markets to draw on established trading techniques. Our results demonstrate the enhanced adaptability and performance gains achieved by merging imitative learning with a reinforcement learning paradigm in the realm of QT solutions.

REFERENCES

[1] Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.
[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[3] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
[4] Zixun Lan, Binjie Hong, Ye Ma, and Fei Ma. More interpretable graph similarity computation via maximum common subgraph inference. arXiv preprint arXiv:2208.04580, 2022.
[5] Zixun Lan, Ye Ma, Limin Yu, Linglong Yuan, and Fei Ma. Aednet: Adaptive edge-deleting network for subgraph matching. Pattern Recognition, 133:109033, 2023.
[6] Zixun Lan, Limin Yu, Linglong Yuan, Zili Wu, Qiang Niu, and Fei Ma. Sub-gmn: The subgraph matching network model. arXiv preprint arXiv:2104.00186, 2021.
[7] Zixun Lan, Zuo Zeng, Binjie Hong, Zhenfu Liu, and Fei Ma. Rcsearcher: Reaction center identification in retrosynthesis via deep q-learning. arXiv preprint arXiv:2301.12071, 2023.
[8] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[9] ML Littman and AW Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
[10] Ye Ma, Zixun Lan, Lu Zong, and Kaizhu Huang. Global-aware beam search for neural abstractive summarization. Advances in Neural Information Processing Systems, 34:16545–16557, 2021.
[11] Malik Magdon-Ismail and Amir F Atiya. Maximum drawdown. Risk Magazine, 17(10):99–102, 2004.
[12] John J Murphy. Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Penguin, 1999.
[13] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.
[14] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010.
[15] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[16] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
[17] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
[18] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740, 2019.
[19] Hongke Zhao, Qi Liu, Hengshu Zhu, Yong Ge, Enhong Chen, Yan Zhu, and Junping Du. A sequential approach to market state modeling and analysis in online P2P lending. IEEE Transactions on Systems, Man, and Cybernetics: Systems, 48(1):21–33, 2017.
[20] JiaJun Zhu, Zichuan Yang, Binjie Hong, Jiacheng Song, Jiwei Wang, Tianhao Chen, Shuilan Yang, Zixun Lan, and Fei Ma. Use neural networks to recognize students’ handwritten letters and incorrect symbols. arXiv preprint arXiv:2309.06221, 2023.
