Deep Reinforcement Learning for Quantitative Trading
Abstract—Artificial Intelligence (AI) and Machine Learning (ML) are transforming the domain of Quantitative Trading (QT) through the deployment of advanced algorithms capable of sifting through extensive financial datasets to pinpoint lucrative investment openings. AI-driven models, particularly those employing ML techniques such as deep learning and reinforcement learning, have shown great prowess in predicting market trends and executing trades with a speed and accuracy that far surpass human capabilities. Their capacity to automate critical tasks, such as discerning market conditions and executing trading strategies, has been pivotal. However, persistent challenges remain in current QT methods, especially in effectively handling noisy and high-frequency financial data. Striking a balance between exploration and exploitation poses a further challenge for AI-driven trading agents. To surmount these hurdles, our proposed solution, QTNet, introduces an adaptive trading model that autonomously formulates QT strategies through an intelligent trading agent. By incorporating deep reinforcement learning (DRL) together with imitative learning methodologies, we bolster the proficiency of the model. To tackle the challenges posed by volatile financial datasets, we conceptualize the QT mechanism within the framework of a Partially Observable Markov Decision Process (POMDP). Moreover, by embedding imitative learning, the model can capitalize on traditional trading tactics, nurturing a balanced synergy between discovery and utilization. For a more realistic simulation, the trading agent is trained on minute-frequency data sourced from the live financial market. Experimental findings underscore the model's proficiency in extracting robust market features and its adaptability to diverse market conditions.

Keywords—Quantitative Trading, Reinforcement Learning

I. INTRODUCTION

In the realm of financial security investment, quantitative trading (QT) is distinguished by its substantial automation, utilizing computing technology to diminish dependence on personal discretion and mitigate illogical decision-making. As the dominance of quantitative hedge funds grows, there is an increasing focus on integrating machine learning into QT, especially in the context of Fintech. These technologies enable the creation of dynamic trading strategies that can adapt to market changes in real time, manage risks more effectively, and ultimately enhance the profitability and efficiency of trading operations. As financial markets become increasingly complex, the integration of AI and ML in QT is becoming indispensable for maintaining a competitive advantage.

In the fluctuating environment of financial markets, trading behaviors and economic events are inherently unpredictable, leading to volatile and non-stationary data. Although technical analysis is a widely used methodology in QT, its capacity for generalization has been questioned, underscoring the need for more resilient features that can be mined directly from raw financial data. In response, machine learning techniques, especially deep learning models, have been investigated for their potential to predict market trends and improve generalization. Nonetheless, the scope of QT extends well beyond predicting trends; it also necessitates the formulation of strategic trading methods. Although reinforcement learning (RL) provides a methodical framework for tackling sequential decision-making tasks, achieving an equilibrium between the discovery of novel strategies and the utilization of established ones remains a considerable challenge, especially under the pragmatic constraints encountered in actual trading scenarios. In response to these difficulties, we introduce Observational and Recurrent Deterministic Policy Gradients (QTNet). We cast QT within the framework of a Partially Observable Markov Decision Process (POMDP) to effectively represent unpredictable financial data, and we handle the POMDP with the Recurrent Deterministic Policy Gradient (RDPG), an off-policy deep reinforcement learning (DRL) algorithm built on recurrent neural networks.

Balancing exploration and exploitation is addressed through imitative learning techniques: a demonstration buffer, initialized with actions from the Dual Thrust strategy, and behavior cloning are introduced to guide the trading agent. By integrating these techniques into the POMDP framework, QTNet benefits from enhanced financial domain knowledge. Real financial data from the futures market is used to test QTNet, demonstrating its ability to learn profitable trading policies and its superior generalization across various futures markets.
II. RELATED WORK

Research within our field typically falls into two primary classifications. The first, extensively documented by Murphy (1999) [10], is Quantitative Trading (QT) grounded in technical analysis. This approach assumes that all pertinent market information is encoded within price and volume figures and uses a variety of technical indicators, mathematically derived instruments, to signal trading actions. Although these indicators are prevalent, their rigidity often hinders their ability to conform to the market's multi-faceted and evolving patterns. Within this realm of indicators, two dominant types exist: moving averages, which aim to discern the direction of price trends, and oscillators, designed to gauge the market's momentum. Each type, however, has inherent constraints, as both are built on historical data, which is not always a reliable predictor of future market behavior given the dynamic nature of financial markets. This reliance on past data to forecast future trends often fails to accommodate the unpredictable fluctuations that characterize market movements, creating demand for more adaptable and nuanced trading tools.

Lately, the integration of machine learning techniques into securities investment has seen a notable surge, with particular emphasis on Reinforcement Learning (RL) for its aptitude in sequential decision-making tasks. The exploration of RL within QT is not new; Littman and Moore (1996) [8] pioneered the application of Q-learning, a conventional value-based RL algorithm. As the limitations of traditional methods in complex QT problem spaces became apparent, research around the turn of the millennium shifted toward policy-based RL approaches, putting forth recurrent reinforcement learning (RRL), a method better suited to the intricacies and temporal dependencies inherent in financial decision-making. This shift underscored the evolving nature of machine learning in finance, seeking algorithms that could not only predict but also learn and adapt strategies over time, harnessing vast amounts of data to navigate the often turbulent waters of the stock market.

Traditional RL also faces challenges in selecting appropriate market features. Deep Reinforcement Learning (DRL), which combines RL with deep learning, is well suited to high-dimensional data and has shown advances on complex tasks such as video games, extending its potential to QT. Jiang, Xu, and Liang (2017) [3] leveraged the Deep Deterministic Policy Gradient (DDPG) for cryptocurrency portfolios, and further efforts have enhanced financial signal representation using fuzzy learning and deep neural networks (DNNs).

Model-free reinforcement learning algorithms, while effective, often struggle with sampling efficiency in extensive state spaces, a situation frequently encountered in QT. To address this, Yu et al. (2019) [15] developed a model-based RL approach for daily-frequency portfolio management. However, that framework does not fully cater to the minute-frequency data that is more common in QT. To bridge this gap, our study proposes a policy-based RL mechanism that operates in continuous time and is augmented with Recurrent Neural Networks (RNNs). This advanced DRL architecture is tailored to grasp the subtle nuances of minute-by-minute financial data and exhibits flexibility across financial market conditions.

III. PROBLEM DEFINITION

In this section, we first define the mathematical symbols and then formally introduce the quantitative trading problem.

A. Foundations

For each discrete moment t, we define the OHLC price array as p_t = [p_t^o, p_t^h, p_t^l, p_t^c], where p_t^o, p_t^h, p_t^l, and p_t^c respectively denote the opening, highest, lowest, and closing prices. The comprehensive price array for a financial instrument is expressed as P = [p_1, ..., p_t, ...]. For convenience, P_{t-n:t} symbolizes the historical price array within a given timeframe, with n denoting the length of the look-back period (the window length). The technical indicator vector at time t is denoted q_t = [q_t^1, ..., q_t^j, ...], where each entry is a function of the historical price sequence, q_t^j = f_j(P_{t-n:t}; θ_j), with θ_j representing the parameters of technical strategy j. The sequence of technical indicators is Q = [q_1, ..., q_t, ...]. Similarly, the account profit at time step t is r_t, and the sequence of account profits is R = [r_1, ..., r_t, ...].
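To make this notation concrete, the sketch below builds an illustrative indicator vector q_t from a look-back window P_{t-n:t} of OHLC bars; the specific indicators and helper names are our own illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def indicator_vector(prices, n=20):
    """Builds an illustrative q_t from the look-back window P_{t-n:t}.

    prices: array of shape (T, 4) holding [open, high, low, close] per bar.
    The three example indicators (moving average, momentum, range width)
    stand in for the parameterised technical strategies f_j(P_{t-n:t}; theta_j).
    """
    window = np.asarray(prices, dtype=float)[-n:]
    close = window[:, 3]
    sma = close.mean()                                     # trend-following style indicator
    momentum = close[-1] - close[0]                        # oscillator-style indicator
    hi_lo_range = window[:, 1].max() - window[:, 2].min()  # recent price range
    return np.array([sma, momentum, hi_lo_range])

# Example with synthetic minute bars standing in for real OHLC data.
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 0.1, size=300).cumsum()
ohlc = np.stack([close, close + 0.05, close - 0.05, close], axis=1)
q_t = indicator_vector(ohlc, n=20)
```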
B. Partially Observable Markov Decision Process

In this section, we explore the distinctive attributes of Quantitative Trading (QT) and explain the rationale for framing the entire QT process as a Partially Observable Markov Decision Process (POMDP). The financial market is characterized by security prices influenced by bullish and bearish investors, macroeconomic and microeconomic activities, unpredictable occurrences, and diverse trading behaviors, all contributing to inherent noise. Direct observation of the true market state is therefore unattainable: the impact of positive news or order executions remains uncertain, and the data available for analysis is limited to historical prices and volumes, offering only a partial view of the market state. QT, as a sequential decision-making challenge, revolves around determining what and when to trade.

Within the realm of QT, the analytical structure thus evolves from a classic Markov Decision Process (MDP) to a POMDP. An MDP is defined by the quintuple (S, A, P, R, γ), where S denotes the set of states, A the set of actions, P the state transition probability function, R the reward function, and γ the discount factor. By incorporating the observational elements O and Z, with O being the observation set and Z the observation probability function, the framework becomes a POMDP. In this context, the agent receives observations at each interval, and comprehension of the observable history up to time t is fundamental to capturing the dynamics of QT.

Observation. In the financial market, observations are split into two distinct categories: the portfolio observation subset o_t^p and the economic observation subset o_t^e. The term o_t^p represents the aggregated profit of the portfolio, whereas o_t^e is associated with market pricing and technical metrics. The overall observation is o_t = (o_t^p, o_t^e), leveraging technical indicators such as the BuyLine and SellLine from the Dual Thrust strategy.

Action. Trading actions are expressed as a continuous probability vector over long and short positions, and the agent executes the action with the maximum probability. Actions are represented as a_t ∈ {long, short} = {1, −1}, simplifying position management and mitigating the impact of market capacity. Trading actions are interpreted as signals that guide the execution of trades according to specific rules.

Reward. Practical constraints such as transaction fees and slippage are incorporated into the trading simulation, and the account profit is computed with these factors included. Because raw account profit is an inefficient reward function, the Sharpe ratio (Sr) is adopted as the reward: a risk-adjusted return measure representing the ratio of excess return to one unit of total risk.

This framework allows the representation of QT challenges in a dynamic and uncertain financial environment while incorporating realistic market constraints.

IV. OBSERVATIONAL AND RECURRENT DPG

In this section, we introduce QTNet, our model designed specifically for the POMDP formulation of Quantitative Trading (QT). The model integrates elements of the Recurrent Deterministic Policy Gradient (RDPG) and imitative learning in a structured manner; a graphical representation of the QTNet architecture is provided in Figure 1.

A. Recurrent DPG

The Deterministic Policy Gradient (DPG) and its deep variant DDPG (Lillicrap et al. (2015) [7]) are specialized off-policy reinforcement learning (RL) algorithms designed for continuous action spaces. Their application is especially pertinent to high-frequency Quantitative Trading (QT), where decisions must be made on a continuum; the ability to handle constant trading activity, mindful of the costs associated with frequent position changes, makes this family a suitable match for QT's requirements. Building on DPG, the Recurrent Deterministic Policy Gradient (RDPG), introduced by Heess et al. (2015) [1], incorporates a recurrent structure that more adeptly navigates the QT landscape. This methodology acknowledges the importance of historical data sequences in QT, where the market state is not fully observable. Agents utilize a historical compilation of market and personal account observations, including prices, indicators, and profits, encapsulated in the sequence h_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t). RDPG employs recurrent neural networks (RNNs) to assimilate this history, enhancing the agent's ability to retain and utilize past information to inform future trades. A replay buffer is also maintained, archiving sequences of observations, actions, and rewards.

Our research innovates further by integrating Long Short-Term Memory (LSTM) networks into the QT framework, offering an alternative to Gated Recurrent Units (GRUs). The LSTM treats the prior observation-action history h_{t-1} as the latent state from the previous timestep, so the current state is formulated as h_t = LSTM(h_{t-1}, o_t, a_t). This substitution underscores the potential of LSTMs to capture the temporal intricacies of the market, a crucial aspect of effective trading strategies within our POMDP model for QT.
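As a rough sketch of this recurrent encoding, the fragment below shows one way such a history-conditioned actor could be wired in PyTorch. The class name, layer sizes, and the choice to condition on the previous action a_{t-1} (rather than a_t as written above) are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Encodes the observation-action history into h_t and emits a trading signal."""

    def __init__(self, obs_dim, act_dim=1, hidden=64):
        super().__init__()
        # The recurrent cell consumes (o_t, a_{t-1}) and carries the latent state.
        self.rnn = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, prev_act, state=None):
        x = torch.cat([obs, prev_act], dim=-1)        # (batch, T, obs_dim + act_dim)
        out, state = self.rnn(x, state)               # hidden state carried across timesteps
        return torch.tanh(self.head(out)), state      # continuous long/short signal

# One decision step: feed the newest (o_t, a_{t-1}) and the carried hidden state.
actor = RecurrentActor(obs_dim=8)
o_t = torch.zeros(1, 1, 8)
a_prev = torch.zeros(1, 1, 1)
action, state = actor(o_t, a_prev)
```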
B. Imitative Learning Based on DB and BC

Effectively navigating the dynamic intricacies of financial market data poses a significant challenge due to the exponential growth of the exploration space. Traditional model-free Reinforcement Learning (RL) algorithms struggle to devise profitable policies efficiently in the context of Quantitative Trading (QT). Moreover, random exploration without specific goals is inefficient, especially given the imperatives of trading continuity and market friction. However, leveraging the model-free Recurrent Deterministic Policy Gradient (RDPG) algorithm with well-defined training goals proves to be a promising approach. As an off-policy algorithm, RDPG can readily exploit auxiliary data, which provides the basis for two pivotal components: the Demonstration Buffer and Behavior Cloning. These modules serve as passive and active imitative learning, respectively [9], [10], strategically guiding our RDPG agent through the complex landscape of QT.

Demonstration Buffer (DB). Initially, a prioritized replay buffer D is established and pre-filled with demonstration episodes (o_1, a_1, r_1, ..., o_T, a_T, r_T) obtained from the Dual Thrust strategy. Drawing inspiration from the methodologies introduced in DQfD (Hester et al. (2018) [2]) and DDPGfD (Vecerik et al. (2017) [14]), we pretrain the agent on these demonstrations before it begins interacting with the environment. This pre-training, enriched with insights from technical analysis, equips the agent with a foundational trading strategy from the outset.

During the training phase, each minibatch consists of a mixture of demonstration and agent episodes, selected using prioritized experience replay (PER) (Schaul et al. (2015) [13]). PER promotes the more frequent selection of episodes with greater significance: the selection likelihood of an episode is P(i) = p_i / Σ_k p_k, where p_i denotes the significance of episode i. In our implementation, we adopt the episode importance criteria proposed by Vecerik et al. (2017) [14].
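The sketch below illustrates this kind of prioritized episode sampling with the usual importance-sampling correction; the buffer layout and the weight normalisation are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np

class PrioritizedEpisodeBuffer:
    """Episode buffer sampled with P(i) = p_i / sum_k p_k plus an IS correction."""

    def __init__(self):
        self.episodes, self.priorities = [], []

    def add(self, episode, priority):
        # Demonstration episodes can be stored with an extra priority bonus here.
        self.episodes.append(episode)
        self.priorities.append(float(priority))

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                    # P(i)
        idx = np.random.choice(len(p), size=batch_size, p=probs)
        weights = 1.0 / (len(p) * probs[idx])                  # importance-sampling weights
        weights /= weights.max()                               # normalise for stable updates
        return [self.episodes[i] for i in idx], weights, idx
```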
Within this framework, the episode importance p_i is defined so that a constant added to demonstration episodes enhances their sampling likelihood, while a further term assesses the influence of the actor gradient; the exponent φ modifies the standard formulation slightly to ensure uniqueness. Because this alters the sample allocation, network updates are corrected with importance-sampling weights w_i:

    w_i = (1/N · 1/P(i))^φ,    (2)

where N is the number of episodes stored in the buffer.

Fig. 1. The QTNet architecture: GRU cells unroll hidden states h_1, ..., h_T over the observation-action history, coupled with the demonstration buffer and the target, actor, and critic networks.
This prioritized demonstration buffer controls the data ratio between demonstration and agent episodes and, importantly, facilitates the efficient propagation of rewards.

Behavior Cloning (BC). Behavior Cloning is utilized to define goals for every trading move. Intra-day opportunistic actions ā_t are integrated as expert-level maneuvers: looking back over a trading day, we construct a visionary trading expert who invariably opts for a long position when prices are at their lowest and a short position at their peak. At each stage of training, the Behavior Cloning method (Ross and Bagnell (2010) [12]) is applied to measure the difference between the agent's actions and the strategies of this foresighted expert.

Specifically, we apply the Behavior Cloning loss L_BC selectively, only when the critic Q(h, a) indicates that the expert action outperforms the actor's action:

    L_BC = − Σ_t max(0, Q(h_t, ā_t) − Q(h_t, μ(h_t))),    (3)

where μ denotes the actor. This adjustment, referred to as the Q-Filter (Nair et al. (2018) [11]), ensures that the Behavior Cloning loss is only applied when the expert actions are superior.

The Behavior Cloning loss serves as an auxiliary loss for the actor update. Consequently, a modified policy gradient ∇̄ is applied to the actor:

    ∇̄ = λ_1 ∇ + λ_2 ∇L_BC,    (4)

where ∇ signifies the original RDPG policy gradient and λ_1 and λ_2 are parameters that govern the balance between the loss terms: λ_1 scales the contribution of the original policy gradient to the final policy update, while λ_2 scales the gradient of the Behavior Cloning loss, regulating its influence on the update.

By integrating actions derived from expert insights, we set specific objectives for each stage of training, thereby minimizing phases of unproductive exploration. Behavior Cloning proves to be a valuable technique for aligning agent actions with expert strategies and improving trading performance.
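The fragment below sketches how such a Q-filtered behavior-cloning term can be combined with the deterministic policy-gradient term in the spirit of Eqs. (3)-(4). It uses a squared-error imitation penalty gated by the critic comparison, which differs in detail from Eq. (3); the stand-in networks and the λ weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

def actor_loss(actor, critic, history, expert_action, lam1=1.0, lam2=1.0):
    """Deterministic policy-gradient term plus a Q-filtered behavior-cloning term."""
    agent_action = actor(history)

    # Policy-gradient part: push the actor toward actions the critic values highly.
    pg_loss = -critic(history, agent_action).mean()

    # Q-filter: imitate the expert only where the critic currently rates the
    # expert action above the actor's own action.
    with torch.no_grad():
        better = critic(history, expert_action) > critic(history, agent_action)
    bc_loss = ((agent_action - expert_action) ** 2 * better.float()).mean()

    # Weighted combination mirroring the lambda_1 / lambda_2 trade-off of Eq. (4).
    return lam1 * pg_loss + lam2 * bc_loss

# Toy usage with stand-in networks (the history is assumed already encoded).
actor = nn.Sequential(nn.Linear(16, 1), nn.Tanh())
critic_net = nn.Sequential(nn.Linear(17, 1))
critic = lambda h, a: critic_net(torch.cat([h, a], dim=-1))
h = torch.randn(32, 16)
expert = torch.sign(torch.randn(32, 1))
actor_loss(actor, critic, h, expert).backward()
```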
V. EXPERIMENTS

In our empirical evaluation, we test the efficacy of the trading model by performing a back-test on data sampled at one-minute intervals from the Chinese financial futures market. The data encompass two key stock-index futures: IF, constructed from a composite index reflecting the performance of the 300 foremost stocks traded on the Shanghai and Shenzhen stock exchanges, and IC, which similarly tracks an index but focuses on stocks with smaller market capitalizations. The back-testing process not only adheres to stringent, real-world trading conditions but also provides a detailed temporal resolution of market behavior. To illustrate the data's granularity, Figures 2 and 3 present minute-by-minute time series of the closing prices of the IC and IF futures, offering a precise visual account of their market movements during the testing period.

Fig. 2. Minute-level closing prices of the IC futures.
Fig. 3. Minute-level closing prices of the IF futures.

A. Setup

In our study, we deploy a data-driven approach, utilizing minute-bar OHLC (Open, High, Low, Close) data for futures trading, which captures the nuances of price movements within each minute. Such high-frequency data, common in actual financial markets, poses unique challenges for reinforcement learning (RL) algorithms, particularly in maintaining action continuity. The dataset is sourced from the Python package 'qstock', an open-source quantitative investment analysis library that currently comprises four modules: data acquisition, visualization, stock selection, and quantitative back-testing. Its data module draws on open data from the Oriental Wealth network, Flush, Sina Finance, and other online resources.

The data span a training period from September 28, 2015, to November 10, 2022, and a subsequent testing period from November 11, 2022, to November 10, 2023. To mirror real-market conditions more accurately, we integrate practical elements such as a proportional transaction fee and a fixed slippage amount of 0.15.

For the purposes of our simulation, certain assumptions are made. We presume that each order is executed at the opening of each minute bar, with rewards calculated at the minute's close. Key futures-specific considerations, such as margin requirements and contract settlement peculiarities, are factored into the model. The training process concludes under either of two conditions: a 50% loss of positions or the depletion of the required margin. The initial account balance is set to $1,000,000 at the commencement of the testing phase.

The performance of our trading policy is assessed using a suite of standard quantitative trading (QT) metrics:

• Total Return Rate (Tr): calculated as (V_end − V_start) / V_start, where V represents the total value of the position and cash.
• Sharpe Ratio (Sr): originally introduced by Sharpe (1966), defined as E[R] / σ[R], a measure of excess return per unit of total risk.
• Volatility (Vol): expressed as σ[R], with R representing the historical series of return rates; it illustrates the variability of returns and the associated risk of a strategy.
• Maximum Drawdown (Mdd): determined as the maximum of (V_peak − V_t) / V_peak over all times t after the peak; it quantifies the most significant historical reduction, signifying the most adverse scenario encountered.

These metrics collectively offer a comprehensive evaluation of our strategy, with Tr providing a direct gauge of profitability, Sr and Vol offering insights into risk-adjusted returns and the stability of returns, and Mdd presenting a measure of potential worst-case scenarios.
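For reference, these four metrics can be computed from an account-value curve roughly as in the sketch below (a simplified version: annualisation factors and the risk-free rate are omitted).

```python
import numpy as np

def evaluate(values):
    """Tr, Sr, Vol and Mdd computed from a series of account values."""
    values = np.asarray(values, dtype=float)
    returns = np.diff(values) / values[:-1]           # per-period return rate series R
    tr = (values[-1] - values[0]) / values[0]         # total return rate
    sr = returns.mean() / returns.std()               # Sharpe ratio, E[R] / sigma[R]
    vol = returns.std()                               # volatility of returns
    peaks = np.maximum.accumulate(values)
    mdd = np.max((peaks - values) / peaks)            # maximum drawdown
    return {"Tr": tr, "Sr": sr, "Vol": vol, "Mdd": mdd}

print(evaluate([1_000_000, 1_012_000, 998_500, 1_030_000]))
```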
B. Imitative Learning Strategies

In our research, we incorporate the Dual Thrust strategy into the Demonstration Buffer module as the model trading policy. Rooted in the principles of technical analysis, particularly oscillators, Dual Thrust sets a rational price oscillation range using the highest high (HH), lowest close (LC), highest close (HC), and lowest low (LL) of the past n periods. For every trading session, an upper threshold, the BuyLine, and a lower threshold, the SellLine, are obtained by respectively adding and subtracting a specified proportion of this Range to and from the day's opening price. The key components of the Dual Thrust strategy are expressed as follows:

    Range    = max(HH − LC, HC − LL),
    BuyLine  = Open + K1 × Range,    (5)
    SellLine = Open − K2 × Range,

where Open represents the day's opening price, K1 and K2 are constants controlling the resistance levels of market prices against breaking the BuyLine and SellLine, respectively, and HH, LC, HC, and LL are taken over the previous n time periods. In practical terms, the Dual Thrust strategy generates a long signal when the current price breaks above the BuyLine and a short signal when it breaks below the SellLine.
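A compact sketch of these rules follows; the default window length and the K1/K2 values are arbitrary placeholders, and the +1/−1/0 signal convention is ours.

```python
import numpy as np

def dual_thrust_signal(daily_ohlc, open_today, price_now, n=5, k1=0.5, k2=0.5):
    """daily_ohlc: array of past daily [open, high, low, close] bars (at least n rows).
    Returns +1 (long breakout), -1 (short breakout) or 0 (no signal)."""
    window = np.asarray(daily_ohlc, dtype=float)[-n:]
    hh, ll = window[:, 1].max(), window[:, 2].min()
    hc, lc = window[:, 3].max(), window[:, 3].min()
    rng = max(hh - lc, hc - ll)                 # Range in Eq. (5)
    buy_line = open_today + k1 * rng            # upper threshold
    sell_line = open_today - k2 * rng           # lower threshold
    if price_now > buy_line:
        return 1
    if price_now < sell_line:
        return -1
    return 0
```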
Moving into the Behavioral Cloning module, intra-day greedy actions are introduced as expert actions from a hindsight perspective: adopting a long position at the day's lowest price and a short position at its highest price consistently represents a relatively optimal, if opportunistic, strategy. This prophetic policy supplies the expert actions for the agent throughout the training process. At each time step t, the expert action is determined by

    ā_t =  1,    if t = arg min(P_{1:T}^o),
          −1,    if t = arg max(P_{1:T}^o),    (6)

where T denotes the length of one trading day and P_{1:T}^o represents the sequence of opening prices for that day. This approach provides a rich foundation for understanding and incorporating expert behavior into the agent's learning process.
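Equation (6) can be realised directly from a day's opening prices, for example as below; treating all non-extreme timesteps as neutral (0) is our own convention, since the equation specifies only the two extreme points.

```python
import numpy as np

def expert_actions(day_opens):
    """Hindsight expert labels per Eq. (6): +1 at the day's lowest opening price,
    -1 at the highest, 0 elsewhere. Usable only for training, never live."""
    opens = np.asarray(day_opens, dtype=float)
    a = np.zeros(len(opens), dtype=int)
    a[np.argmin(opens)] = 1       # long at the minimum open
    a[np.argmax(opens)] = -1      # short at the maximum open
    return a
```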
Our study also compares the proposed QTNet against several baseline strategies for context: the Long & Hold strategy, in which a long position is maintained throughout the test period; its counterpart, the Short & Hold strategy; and the Deep Deterministic Policy Gradient (DDPG) algorithm, known for its effectiveness in continuous control scenarios. This juxtaposition allows us to evaluate the effectiveness of our approach in the dynamic and often unpredictable realm of quantitative trading.

C. Baseline Methods

Our proposed QTNet is compared with the following baselines:

• Long & Hold: initiate a long position at the outset and maintain it until the conclusion of the test period.
• Short & Hold: in contrast to Long & Hold, take a short position at the beginning and retain it throughout the test period.
• DDPG (Lillicrap et al. (2015) [7]): the Deep Deterministic Policy Gradient, an off-policy deep reinforcement learning algorithm for continuous control.

D. Experimental Outcomes

In this section, we present a comprehensive set of experiments comparing the performance of the QTNet model against the baseline methods. Ablation experiments are also conducted to highlight the significance of each module within QTNet, and tests across distinct futures markets evaluate the generalization capabilities of both QTNet and the Dual Thrust strategy.

E. Data Demonstration

We use minute-level frequency data for IC futures to assess the performance of the QTNet model against the baseline trading strategies. As detailed in Table I, QTNet consistently outperforms the traditional methods, particularly in terms of total return rate (Tr) and Sharpe ratio (Sr), indicating its adaptability and profitability in a high-frequency quantitative trading setting. Although DDPG is a renowned reinforcement learning algorithm that excels in certain scenarios, it displayed relatively weak performance in our tests, which may be attributed to its limited ability to handle high-frequency trading data.

QTNet also performs well on the critical risk-management metric of Maximum Drawdown (Mdd), showing more robustness during market downturns than straightforward strategies such as Long & Hold and Short & Hold and thereby limiting potential maximum losses. This result highlights the resilience of the QTNet strategy in the face of market volatility and its capacity to maintain stability in complex market conditions. Overall, the comprehensive performance of QTNet validates the efficiency of recurrent GRU networks in capturing the complexities of market dynamics in quantitative trading.

TABLE I. PERFORMANCE OF COMPARISON METHODS ON IC.

    Methods        Tr (%)     Sr        Vol      Mdd (%)
    Long & Hold     -8.32     -0.318    0.746    62.84
    Short & Hold     9.53      0.167    0.658    49.30
    DDPG           -17.26     -0.428    0.498    57.22
    QTNet           20.28      0.562    0.421    23.73

F. Ablation Experiments

The ablation studies on IC futures shed light on the contribution of each element of the QTNet framework to its aggregate performance. The data in Table II show that the integrated return trajectory of the full model stands out against its counterparts over the evaluation phase. The study dissects QTNet into three reduced variants: the basic RDPG, which uses only GRU networks; RDPG-DB, which adds the demonstration buffer; and RDPG-BC, which adds the behavior cloning module. The analysis shows that the base RDPG variant struggles to achieve action coherence and to master a lucrative trading approach. Incorporating the demonstration buffer in RDPG-DB markedly improves the efficiency of experience sampling, while the behavior cloning module in RDPG-BC curtails the detrimental effects of random exploration on the agent's performance. The full QTNet, combining all components, clearly excels on both profit and risk-management metrics, underlining the value of imitative learning strategies in this domain.

TABLE II. ABLATION EXPERIMENTS ON IC.

    Methods     Tr (%)    Sr       Vol      Mdd (%)
    RDPG         8.96     0.067    0.590    34.62
    RDPG-DB     18.72     0.467    0.452    23.72
    RDPG-BC     31.24     0.619    0.427    29.81
    QTNet       36.26     0.742    0.411    25.37
G. Generalization Ability

The versatility of the QTNet model is assessed by its performance under varying market conditions, specifically on the IF and IC futures, as reported in Table III. The table shows a stark contrast in Dual Thrust's effectiveness between the IC and IF markets, illustrating the potential constraints of fixed trading indicators. QTNet, on the other hand, exhibits superior performance, especially in the IC market, despite being trained solely on IF data. This demonstrates QTNet's inherent flexibility and its proficiency in discerning durable features that are crucial for formulating dynamic trading strategies applicable to different market environments. The model's ability to perform well across various markets cements its standing as a robust and adaptable tool for real-world financial applications.

TABLE III. PERFORMANCE OF COMPARISON METHODS ON IC AND IF.

    Methods       Data    Tr (%)     Sr        Vol      Mdd (%)
    Dual Thrust   IC       26.40      0.810    0.469    19.24
                  IF      -29.59     -0.577    0.796    55.81
    QTNet         IC       34.27      0.742    0.457    24.56
                  IF       28.73      0.523    0.517    26.25
VI. CONCLUSION

In this research, we presented QTNet, a dynamic, imitative learning framework designed for the challenges of Quantitative Trading (QT). We constructed a Partially Observable Markov Decision Process (POMDP) to adeptly handle the erratic nature of high-frequency trading data. Crucially, the model harnesses the principles of imitative learning to harmonize the exploration-exploitation trade-off, thus refining the decision-making process of the trading agent. We validated the performance of QTNet through rigorous evaluations on authentic stock-index futures data, acknowledging the operational limitations typical of financial trading. The outcomes affirmed the financial viability and risk-mitigation prowess of QTNet, and a series of comparative analyses demonstrated the model's capacity to adapt to varied market conditions. In conclusion, QTNet underscores the advantage for trading agents in real-world markets of drawing on established trading techniques. Our results highlight the enhanced adaptability and performance gains achieved by merging imitative learning with a reinforcement learning paradigm for QT solutions.
REFERENCES
[1] Nicolas Heess, Jonathan J. Hunt, Timothy P. Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.
[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[3] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
[4] Zixun Lan, Binjie Hong, Ye Ma, and Fei Ma. More interpretable graph similarity computation via maximum common subgraph inference. arXiv preprint arXiv:2208.04580, 2022.
[5] Zixun Lan, Ye Ma, Limin Yu, Linglong Yuan, and Fei Ma. AEDNet: Adaptive edge-deleting network for subgraph matching. Pattern Recognition, 133:109033, 2023.
[6] Zixun Lan, Limin Yu, Linglong Yuan, Zili Wu, Qiang Niu, and Fei Ma. Sub-GMN: The subgraph matching network model. arXiv preprint arXiv:2104.00186, 2021.
[7] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[8] M. L. Littman and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
[9] Malik Magdon-Ismail and Amir F. Atiya. Maximum drawdown. Risk Magazine, 17(10):99–102, 2004.
[10] John J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. Penguin, 1999.
[11] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.
[12] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010.
[13] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[14] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
[15] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740, 2019.