Deep Reinforcement Learning for Quantitative Trading
Abstract—Artificial Intelligence (AI) and Machine Learning (ML) are transforming the domain of Quantitative Trading (QT) through the deployment of advanced algorithms capable of sifting through extensive financial datasets to pinpoint lucrative investment openings. AI-driven models, particularly those employing ML techniques such as deep learning and reinforcement learning, have shown great prowess in predicting market trends and executing trades with a speed and accuracy that far surpass human capabilities. Their capacity to automate critical tasks, such as discerning market conditions and executing trading strategies, has been pivotal. However, persistent challenges remain in current QT methods, especially in effectively handling noisy and high-frequency financial data. Striking a balance between exploration and exploitation poses a further challenge for AI-driven trading agents. To surmount these hurdles, our proposed solution, QTNet, introduces an adaptive trading model that autonomously formulates QT strategies through an intelligent trading agent. By incorporating deep reinforcement learning (DRL) together with imitative learning methodologies, we bolster the proficiency of the model. To tackle the challenges posed by volatile financial datasets, we conceptualize the QT mechanism within the framework of a Partially Observable Markov Decision Process (POMDP). Moreover, by embedding imitative learning, the model can capitalize on traditional trading tactics, nurturing a balanced synergy between discovery and utilization. For a more realistic simulation, the trading agent is trained on minute-frequency data sourced from the live financial market. Experimental findings underscore the model's proficiency in extracting robust market features and its adaptability to diverse market conditions.

Keywords—Quantitative Trading, Reinforcement Learning

I. INTRODUCTION

In the realm of financial security investment, quantitative trading (QT) is distinguished by its substantial automation, utilizing computing technology to diminish dependence on personal discretion and mitigate illogical decision-making. As the dominance of quantitative hedge funds grows, there is an increasing focus on integrating machine learning into QT, especially in the context of Fintech. These technologies enable the creation of dynamic trading strategies that can adapt to market changes in real time, manage risks more effectively, and ultimately enhance the profitability and efficiency of trading operations. As financial markets become increasingly complex, the integration of AI and ML in QT is becoming indispensable for maintaining a competitive advantage.

In the fluctuating environment of financial markets, trading behaviors and economic events are inherently unpredictable, leading to volatile and non-stationary data. Although technical analysis is a widely used methodology in QT, its capacity for generalization has been questioned, underscoring the need for more resilient features that can be mined directly from raw financial data. In response, machine learning techniques, especially deep learning models, have been investigated for their potential to predict market trends and improve generalization. Nonetheless, the scope of QT extends well beyond predicting trends; it also necessitates the formulation of strategic trading methods. Although reinforcement learning (RL) provides a methodical framework for tackling sequential decision-making tasks, achieving an equilibrium between the discovery of novel strategies and the utilization of established ones remains a considerable challenge, especially under the pragmatic constraints encountered in actual trading scenarios. In response to these difficulties, we introduce Observational and Recurrent Deterministic Policy Gradients (QTNet). We cast QT within the framework of a Partially Observable Markov Decision Process (POMDP) to effectively represent unpredictable financial data, and we handle the POMDP with the Recurrent Deterministic Policy Gradient (RDPG), an off-policy deep reinforcement learning (DRL) algorithm built on recurrent neural networks.

Balancing exploration and exploitation is addressed through imitative learning techniques: a demonstration buffer, initialized with actions from the Dual Thrust strategy, and behavior cloning are introduced to guide the trading agent. By integrating these techniques into the POMDP framework, QTNet benefits from enhanced financial domain knowledge. Real financial data from the futures market is used to test QTNet, demonstrating its ability to learn profitable trading policies and its superior generalization across various futures markets.
II. RELATED WORK

Research within our field typically falls into two primary classifications. The first, extensively documented by Murphy (1999) [10], is Quantitative Trading (QT) grounded in technical analysis. This approach assumes that all pertinent market information is encoded within price and volume figures and uses a variety of technical indicators, mathematically derived instruments, to signal trading actions. Although these indicators are prevalent, their rigidity often hinders their ability to conform to the market's multi-faceted and evolving patterns. Within this realm of indicators, two dominant types exist: moving averages, which aim to discern the direction of price trends, and oscillators, designed to gauge the market's momentum. Each type, however, has inherent constraints, as both are built on historical data, which is not always a reliable predictor of future market behavior given the dynamic nature of financial markets. This reliance on past data to forecast future trends often fails to accommodate the unpredictable fluctuations that characterize market movements, creating demand for more adaptable and nuanced trading tools.

Lately, the integration of machine learning techniques into securities investment has seen a notable surge, with particular emphasis on Reinforcement Learning (RL) for its aptitude in sequential decision-making tasks. The exploration of RL within QT is not new; Littman and Moore (1996) [8] pioneered the application of Q-learning, a conventional value-based RL algorithm. As the limitations of traditional methods in complex QT problem spaces became apparent, research around the turn of the millennium shifted toward policy-based RL approaches, putting forth recurrent reinforcement learning (RRL), a method better suited to the intricacies and temporal dependencies inherent in financial decision-making. This shift underscored the evolving nature of machine learning in finance, seeking algorithms that could not only predict but also learn and adapt strategies over time, harnessing vast amounts of data to navigate the often turbulent waters of the stock market.

Traditional RL also faces challenges in selecting appropriate market features. Deep Reinforcement Learning (DRL), which combines RL with deep learning, is well suited to high-dimensional data and has shown advances on complex tasks such as video games, extending its potential to QT. Jiang, Xu, and Liang (2017) [3] leveraged the Deep Deterministic Policy Gradient (DDPG) for cryptocurrency portfolios, and further efforts have enhanced financial signal representation using fuzzy learning and deep neural networks (DNNs).

Model-free reinforcement learning algorithms, while effective, often struggle with sampling efficiency in extensive state spaces, a situation frequently encountered in QT. To address this, Yu et al. (2019) [15] developed a model-based RL approach for daily-frequency portfolio management. However, that framework does not fully cater to the minute-frequency data that is more common in QT. To bridge this gap, our study proposes a policy-based RL mechanism that operates in continuous time and is augmented with Recurrent Neural Networks (RNNs). This advanced DRL architecture is tailored to grasp the subtle nuances of minute-by-minute financial data and exhibits flexibility across financial market conditions.

III. PROBLEM DEFINITION

In this section, we first define the mathematical symbols and then formally introduce the quantitative trading problem.

A. Foundations

For each discrete moment t, we define the OHLC price array as p_t = [p_t^o, p_t^h, p_t^l, p_t^c], where p_t^o, p_t^h, p_t^l, and p_t^c respectively denote the opening, highest, lowest, and closing prices. The comprehensive price array for a financial instrument is expressed as P = [p_1, ..., p_t, ...]. For convenience, P_{t-n:t} symbolizes the historical price array within a given timeframe, with n denoting the length of the look-back period (the window length). The technical indicator vector at time t is denoted q_t = [q_t^1, ..., q_t^j, ...], where each entry is a function of the historical price sequence, q_t^j = f_j(P_{t-n:t}; θ_j), with θ_j representing the parameters of technical strategy j. The sequence of technical indicators is Q = [q_1, ..., q_t, ...]. Similarly, the account profit at time step t is r_t, and the sequence of account profits is R = [r_1, ..., r_t, ...].
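To make this notation concrete, the sketch below builds an illustrative indicator vector q_t from a look-back window P_{t-n:t} of OHLC bars; the specific indicators and helper names are our own illustrative assumptions, not definitions from the paper.

```python
import numpy as np

def indicator_vector(prices, n=20):
    """Builds an illustrative q_t from the look-back window P_{t-n:t}.

    prices: array of shape (T, 4) holding [open, high, low, close] per bar.
    The three example indicators (moving average, momentum, range width)
    stand in for the parameterised technical strategies f_j(P_{t-n:t}; theta_j).
    """
    window = np.asarray(prices, dtype=float)[-n:]
    close = window[:, 3]
    sma = close.mean()                                     # trend-following style indicator
    momentum = close[-1] - close[0]                        # oscillator-style indicator
    hi_lo_range = window[:, 1].max() - window[:, 2].min()  # recent price range
    return np.array([sma, momentum, hi_lo_range])

# Example with synthetic minute bars standing in for real OHLC data.
rng = np.random.default_rng(0)
close = 100 + rng.normal(0, 0.1, size=300).cumsum()
ohlc = np.stack([close, close + 0.05, close - 0.05, close], axis=1)
q_t = indicator_vector(ohlc, n=20)
```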
B. Partially Observable Markov Decision Process

In this section, we explore the distinctive attributes of Quantitative Trading (QT) and explain the rationale for framing the entire QT process as a Partially Observable Markov Decision Process (POMDP). The financial market is characterized by security prices influenced by bullish and bearish investors, macroeconomic and microeconomic activities, unpredictable occurrences, and diverse trading behaviors, all contributing to inherent noise. Direct observation of the true market state is therefore unattainable: the impact of positive news or order executions remains uncertain, and the data available for analysis is limited to historical prices and volumes, offering only a partial view of the market state. QT, as a sequential decision-making challenge, revolves around determining what and when to trade.

Within the realm of QT, the analytical structure thus evolves from a classic Markov Decision Process (MDP) to a POMDP. An MDP is defined by the quintuple (S, A, P, R, γ), where S denotes the set of states, A the set of actions, P the state transition probability function, R the reward function, and γ the discount factor. By incorporating the observational elements O and Z, with O being the observation set and Z the observation probability function, the framework becomes a POMDP. In this context, the agent receives observations at each interval, and comprehension of the observable history up to time t is fundamental to capturing the dynamics of QT.

Observation. In the financial market, observations are split into two distinct categories: the portfolio observation subset o_t^p and the economic observation subset o_t^e. The term o_t^p represents the aggregated profit of the portfolio, whereas o_t^e is associated with market pricing and technical metrics. The overall observation is o_t = (o_t^p, o_t^e), leveraging technical indicators such as the BuyLine and SellLine from the Dual Thrust strategy.

Action. Trading actions are expressed as a continuous probability vector over long and short positions, and the agent executes the action with the maximum probability. Actions are represented as a_t ∈ {long, short} = {1, −1}, simplifying position management and mitigating the impact of market capacity. Trading actions are interpreted as signals that guide the execution of trades according to specific rules.

Reward. Practical constraints such as transaction fees and slippage are incorporated into the trading simulation, and the account profit is computed with these factors included. Because raw account profit is an inefficient reward function, the Sharpe ratio (Sr) is adopted as the reward: a risk-adjusted return measure representing the ratio of excess return to one unit of total risk.

This framework allows the representation of QT challenges in a dynamic and uncertain financial environment while incorporating realistic market constraints.

IV. OBSERVATIONAL AND RECURRENT DPG

In this section, we introduce QTNet, our model designed specifically for the POMDP formulation of Quantitative Trading (QT). The model integrates elements of the Recurrent Deterministic Policy Gradient (RDPG) and imitative learning in a structured manner; a graphical representation of the QTNet architecture is provided in Figure 1.

A. Recurrent DPG

The Deterministic Policy Gradient (DPG) and its deep variant DDPG (Lillicrap et al. (2015) [7]) are specialized off-policy reinforcement learning (RL) algorithms designed for continuous action spaces. Their application is especially pertinent to high-frequency Quantitative Trading (QT), where decisions must be made on a continuum; the ability to handle constant trading activity, mindful of the costs associated with frequent position changes, makes this family a suitable match for QT's requirements. Building on DPG, the Recurrent Deterministic Policy Gradient (RDPG), introduced by Heess et al. (2015) [1], incorporates a recurrent structure that more adeptly navigates the QT landscape. This methodology acknowledges the importance of historical data sequences in QT, where the market state is not fully observable. Agents utilize a historical compilation of market and personal account observations, including prices, indicators, and profits, encapsulated in the sequence h_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t). RDPG employs recurrent neural networks (RNNs) to assimilate this history, enhancing the agent's ability to retain and utilize past information to inform future trades. A replay buffer is also maintained, archiving sequences of observations, actions, and rewards.

Our research innovates further by integrating Long Short-Term Memory (LSTM) networks into the QT framework, offering an alternative to Gated Recurrent Units (GRUs). The LSTM treats the prior observation-action history h_{t-1} as the latent state from the previous timestep, so the current state is formulated as h_t = LSTM(h_{t-1}, o_t, a_t). This substitution underscores the potential of LSTMs to capture the temporal intricacies of the market, a crucial aspect of effective trading strategies within our POMDP model for QT.
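As a rough sketch of this recurrent encoding, the fragment below shows one way such a history-conditioned actor could be wired in PyTorch. The class name, layer sizes, and the choice to condition on the previous action a_{t-1} (rather than a_t as written above) are our own assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RecurrentActor(nn.Module):
    """Encodes the observation-action history into h_t and emits a trading signal."""

    def __init__(self, obs_dim, act_dim=1, hidden=64):
        super().__init__()
        # The recurrent cell consumes (o_t, a_{t-1}) and carries the latent state.
        self.rnn = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, prev_act, state=None):
        x = torch.cat([obs, prev_act], dim=-1)        # (batch, T, obs_dim + act_dim)
        out, state = self.rnn(x, state)               # hidden state carried across timesteps
        return torch.tanh(self.head(out)), state      # continuous long/short signal

# One decision step: feed the newest (o_t, a_{t-1}) and the carried hidden state.
actor = RecurrentActor(obs_dim=8)
o_t = torch.zeros(1, 1, 8)
a_prev = torch.zeros(1, 1, 1)
action, state = actor(o_t, a_prev)
```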
B. Imitative Learning Based on DB and BC

Effectively navigating the dynamic intricacies of financial market data poses a significant challenge due to the exponential growth of the exploration space. Traditional model-free Reinforcement Learning (RL) algorithms struggle to devise profitable policies efficiently in the context of Quantitative Trading (QT). Moreover, random exploration without specific goals is inefficient, especially given the imperatives of trading continuity and market friction. However, leveraging the model-free Recurrent Deterministic Policy Gradient (RDPG) algorithm with well-defined training goals proves to be a promising approach. As an off-policy algorithm, RDPG can readily exploit auxiliary data, which provides the basis for two pivotal components: the Demonstration Buffer and Behavior Cloning. These modules serve as passive and active imitative learning, respectively [9], [10], strategically guiding our RDPG agent through the complex landscape of QT.

Demonstration Buffer (DB). Initially, a prioritized replay buffer D is established and pre-filled with demonstration episodes (o_1, a_1, r_1, ..., o_T, a_T, r_T) obtained from the Dual Thrust strategy. Drawing inspiration from the methodologies introduced in DQfD (Hester et al. (2018) [2]) and DDPGfD (Vecerik et al. (2017) [14]), we pretrain the agent on these demonstrations before it begins interacting with the environment. This pre-training, enriched with insights from technical analysis, equips the agent with a foundational trading strategy from the outset.

During the training phase, each minibatch consists of a mixture of demonstration and agent episodes, selected using prioritized experience replay (PER) (Schaul et al. (2015) [13]). PER promotes the more frequent selection of episodes with greater significance: the selection likelihood of an episode is P(i) = p_i / Σ_k p_k, where p_i denotes the significance of episode i. In our implementation, we adopt the episode importance criteria proposed by Vecerik et al. (2017) [14].
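The sketch below illustrates this kind of prioritized episode sampling with the usual importance-sampling correction; the buffer layout and the weight normalisation are assumptions for illustration rather than the paper's exact implementation.

```python
import numpy as np

class PrioritizedEpisodeBuffer:
    """Episode buffer sampled with P(i) = p_i / sum_k p_k plus an IS correction."""

    def __init__(self):
        self.episodes, self.priorities = [], []

    def add(self, episode, priority):
        # Demonstration episodes can be stored with an extra priority bonus here.
        self.episodes.append(episode)
        self.priorities.append(float(priority))

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                    # P(i)
        idx = np.random.choice(len(p), size=batch_size, p=probs)
        weights = 1.0 / (len(p) * probs[idx])                  # importance-sampling weights
        weights /= weights.max()                               # normalise for stable updates
        return [self.episodes[i] for i in idx], weights, idx
```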
Within this framework, the episode importance p_i is defined so that a constant added to demonstration episodes enhances their sampling likelihood, while a further term assesses the influence of the actor gradient; the exponent φ modifies the standard formulation slightly to ensure uniqueness. Because this alters the sample allocation, network updates are corrected with importance-sampling weights w_i:

    w_i = (1/N · 1/P(i))^φ,    (2)

where N is the number of episodes stored in the buffer.

Fig. 1. The QTNet architecture: GRU cells unroll hidden states h_1, ..., h_T over the observation-action history, coupled with the demonstration buffer and the target, actor, and critic networks.
This prioritized demonstration buffer controls the data ratio between demonstration and agent episodes and, importantly, facilitates the efficient propagation of rewards.

Behavior Cloning (BC). Behavior Cloning is utilized to define goals for every trading move. Intra-day opportunistic actions ā_t are integrated as expert-level maneuvers: looking back over a trading day, we construct a visionary trading expert who invariably opts for a long position when prices are at their lowest and a short position at their peak. At each stage of training, the Behavior Cloning method (Ross and Bagnell (2010) [12]) is applied to measure the difference between the agent's actions and the strategies of this foresighted expert.

Specifically, we apply the Behavior Cloning loss L_BC selectively, only when the critic Q(h, a) indicates that the expert action outperforms the actor's action:

    L_BC = − Σ_t max(0, Q(h_t, ā_t) − Q(h_t, μ(h_t))),    (3)

where μ denotes the actor. This adjustment, referred to as the Q-Filter (Nair et al. (2018) [11]), ensures that the Behavior Cloning loss is only applied when the expert actions are superior.

The Behavior Cloning loss serves as an auxiliary loss for the actor update. Consequently, a modified policy gradient ∇̄ is applied to the actor:

    ∇̄ = λ_1 ∇ + λ_2 ∇L_BC,    (4)

where ∇ signifies the original RDPG policy gradient and λ_1 and λ_2 are parameters that govern the balance between the loss terms: λ_1 scales the contribution of the original policy gradient to the final policy update, while λ_2 scales the gradient of the Behavior Cloning loss, regulating its influence on the update.

By integrating actions derived from expert insights, we set specific objectives for each stage of training, thereby minimizing phases of unproductive exploration. Behavior Cloning proves to be a valuable technique for aligning agent actions with expert strategies and improving trading performance.
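The fragment below sketches how such a Q-filtered behavior-cloning term can be combined with the deterministic policy-gradient term in the spirit of Eqs. (3)-(4). It uses a squared-error imitation penalty gated by the critic comparison, which differs in detail from Eq. (3); the stand-in networks and the λ weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

def actor_loss(actor, critic, history, expert_action, lam1=1.0, lam2=1.0):
    """Deterministic policy-gradient term plus a Q-filtered behavior-cloning term."""
    agent_action = actor(history)

    # Policy-gradient part: push the actor toward actions the critic values highly.
    pg_loss = -critic(history, agent_action).mean()

    # Q-filter: imitate the expert only where the critic currently rates the
    # expert action above the actor's own action.
    with torch.no_grad():
        better = critic(history, expert_action) > critic(history, agent_action)
    bc_loss = ((agent_action - expert_action) ** 2 * better.float()).mean()

    # Weighted combination mirroring the lambda_1 / lambda_2 trade-off of Eq. (4).
    return lam1 * pg_loss + lam2 * bc_loss

# Toy usage with stand-in networks (the history is assumed already encoded).
actor = nn.Sequential(nn.Linear(16, 1), nn.Tanh())
critic_net = nn.Sequential(nn.Linear(17, 1))
critic = lambda h, a: critic_net(torch.cat([h, a], dim=-1))
h = torch.randn(32, 16)
expert = torch.sign(torch.randn(32, 1))
actor_loss(actor, critic, h, expert).backward()
```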
V. EXPERIMENTS

In our empirical evaluation, we test the efficacy of the trading model by performing a back-test on data sampled at one-minute intervals from the Chinese financial futures market. The data encompass two key stock-index futures: IF, constructed from a composite index reflecting the performance of the 300 foremost stocks traded on the Shanghai and Shenzhen stock exchanges, and IC, which similarly tracks an index but focuses on stocks with smaller market capitalizations. The back-testing process not only adheres to stringent, real-world trading conditions but also provides a detailed temporal resolution of market behavior. To illustrate the data's granularity, Figures 2 and 3 present minute-by-minute time series of the closing prices of the IC and IF futures, offering a precise visual account of their market movements during the testing period.

Fig. 2. Minute-level closing prices of the IC futures.
Fig. 3. Minute-level closing prices of the IF futures.

A. Setup

In our study, we deploy a data-driven approach, utilizing minute-bar OHLC (Open, High, Low, Close) data for futures trading, which captures the nuances of price movements within each minute. Such high-frequency data, common in actual financial markets, poses unique challenges for reinforcement learning (RL) algorithms, particularly in maintaining action continuity. The dataset is sourced from the Python package 'qstock', an open-source quantitative investment analysis library that currently comprises four modules: data acquisition, visualization, stock selection, and quantitative back-testing. Its data module draws on open data from the Oriental Wealth network, Flush, Sina Finance, and other online resources.

The data span a training period from September 28, 2015, to November 10, 2022, and a subsequent testing period from November 11, 2022, to November 10, 2023. To mirror real-market conditions more accurately, we integrate practical elements such as a proportional transaction fee and a fixed slippage amount of 0.15.

For the purposes of our simulation, certain assumptions are made. We presume that each order is executed at the opening of each minute bar, with rewards calculated at the minute's close. Key futures-specific considerations, such as margin requirements and contract settlement peculiarities, are factored into the model. The training process concludes under either of two conditions: a 50% loss of positions or the depletion of the required margin. The initial account balance is set to $1,000,000 at the commencement of the testing phase.

The performance of our trading policy is assessed using a suite of standard quantitative trading (QT) metrics:

• Total Return Rate (Tr): calculated as (V_end − V_start) / V_start, where V represents the total value of the position and cash.
• Sharpe Ratio (Sr): originally introduced by Sharpe (1966), defined as E[R] / σ[R], a measure of excess return per unit of total risk.
• Volatility (Vol): expressed as σ[R], with R representing the historical series of return rates; it illustrates the variability of returns and the associated risk of a strategy.
• Maximum Drawdown (Mdd): determined as the maximum of (V_peak − V_t) / V_peak over all times t after the peak; it quantifies the most significant historical reduction, signifying the most adverse scenario encountered.

These metrics collectively offer a comprehensive evaluation of our strategy, with Tr providing a direct gauge of profitability, Sr and Vol offering insights into risk-adjusted returns and the stability of returns, and Mdd presenting a measure of potential worst-case scenarios.
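For reference, these four metrics can be computed from an account-value curve roughly as in the sketch below (a simplified version: annualisation factors and the risk-free rate are omitted).

```python
import numpy as np

def evaluate(values):
    """Tr, Sr, Vol and Mdd computed from a series of account values."""
    values = np.asarray(values, dtype=float)
    returns = np.diff(values) / values[:-1]           # per-period return rate series R
    tr = (values[-1] - values[0]) / values[0]         # total return rate
    sr = returns.mean() / returns.std()               # Sharpe ratio, E[R] / sigma[R]
    vol = returns.std()                               # volatility of returns
    peaks = np.maximum.accumulate(values)
    mdd = np.max((peaks - values) / peaks)            # maximum drawdown
    return {"Tr": tr, "Sr": sr, "Vol": vol, "Mdd": mdd}

print(evaluate([1_000_000, 1_012_000, 998_500, 1_030_000]))
```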
B. Imitative Learning Strategies

In our research, we incorporate the Dual Thrust strategy into the Demonstration Buffer module as the model trading policy. Rooted in the principles of technical analysis, particularly oscillators, Dual Thrust sets a rational price oscillation range using the highest high (HH), lowest close (LC), highest close (HC), and lowest low (LL) of the past n periods. For every trading session, an upper threshold, the BuyLine, and a lower threshold, the SellLine, are obtained by respectively adding and subtracting a specified proportion of this Range to and from the day's opening price. The key components of the Dual Thrust strategy are expressed as follows:

    Range    = max(HH − LC, HC − LL),
    BuyLine  = Open + K1 × Range,    (5)
    SellLine = Open − K2 × Range,

where Open represents the day's opening price, K1 and K2 are constants controlling the resistance levels of market prices against breaking the BuyLine and SellLine, respectively, and HH, LC, HC, and LL are taken over the previous n time periods. In practical terms, the Dual Thrust strategy generates a long signal when the current price breaks above the BuyLine and a short signal when it breaks below the SellLine.
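A compact sketch of these rules follows; the default window length and the K1/K2 values are arbitrary placeholders, and the +1/−1/0 signal convention is ours.

```python
import numpy as np

def dual_thrust_signal(daily_ohlc, open_today, price_now, n=5, k1=0.5, k2=0.5):
    """daily_ohlc: array of past daily [open, high, low, close] bars (at least n rows).
    Returns +1 (long breakout), -1 (short breakout) or 0 (no signal)."""
    window = np.asarray(daily_ohlc, dtype=float)[-n:]
    hh, ll = window[:, 1].max(), window[:, 2].min()
    hc, lc = window[:, 3].max(), window[:, 3].min()
    rng = max(hh - lc, hc - ll)                 # Range in Eq. (5)
    buy_line = open_today + k1 * rng            # upper threshold
    sell_line = open_today - k2 * rng           # lower threshold
    if price_now > buy_line:
        return 1
    if price_now < sell_line:
        return -1
    return 0
```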
Moving into the Behavioral Cloning module, intra-day greedy actions are introduced as expert actions from a hindsight perspective: adopting a long position at the day's lowest price and a short position at its highest price consistently represents a relatively optimal, if opportunistic, strategy. This prophetic policy supplies the expert actions for the agent throughout the training process. At each time step t, the expert action is determined by

    ā_t =  1,    if t = arg min(P_{1:T}^o),
          −1,    if t = arg max(P_{1:T}^o),    (6)

where T denotes the length of one trading day and P_{1:T}^o represents the sequence of opening prices for that day. This approach provides a rich foundation for understanding and incorporating expert behavior into the agent's learning process.
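Equation (6) can be realised directly from a day's opening prices, for example as below; treating all non-extreme timesteps as neutral (0) is our own convention, since the equation specifies only the two extreme points.

```python
import numpy as np

def expert_actions(day_opens):
    """Hindsight expert labels per Eq. (6): +1 at the day's lowest opening price,
    -1 at the highest, 0 elsewhere. Usable only for training, never live."""
    opens = np.asarray(day_opens, dtype=float)
    a = np.zeros(len(opens), dtype=int)
    a[np.argmin(opens)] = 1       # long at the minimum open
    a[np.argmax(opens)] = -1      # short at the maximum open
    return a
```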
Our study also compares the proposed QTNet against several baseline strategies for context: the Long & Hold strategy, in which a long position is maintained throughout the test period; its counterpart, the Short & Hold strategy; and the Deep Deterministic Policy Gradient (DDPG) algorithm, known for its effectiveness in continuous control scenarios. This juxtaposition allows us to evaluate the effectiveness of our approach in the dynamic and often unpredictable realm of quantitative trading.

C. Baseline Methods

Our proposed QTNet is compared with the following baselines:

• Long & Hold: initiate a long position at the outset and maintain it until the conclusion of the test period.
• Short & Hold: in contrast to Long & Hold, take a short position at the beginning and retain it throughout the test period.
• DDPG (Lillicrap et al. (2015) [7]): the Deep Deterministic Policy Gradient, an off-policy deep reinforcement learning algorithm for continuous control.

D. Experimental Outcomes

In this section, we present a comprehensive set of experiments comparing the performance of the QTNet model against the baseline methods. Ablation experiments are also conducted to highlight the significance of each module within QTNet, and tests across distinct futures markets evaluate the generalization capabilities of both QTNet and the Dual Thrust strategy.

E. Data Demonstration

We use minute-level frequency data for IC futures to assess the performance of the QTNet model against the baseline trading strategies. As detailed in Table I, QTNet consistently outperforms the traditional methods, particularly in terms of total return rate (Tr) and Sharpe ratio (Sr), indicating its adaptability and profitability in a high-frequency quantitative trading setting. Although DDPG is a renowned reinforcement learning algorithm that excels in certain scenarios, it displayed relatively weak performance in our tests, which may be attributed to its limited ability to handle high-frequency trading data.

QTNet also performs well on the critical risk-management metric of Maximum Drawdown (Mdd), showing more robustness during market downturns than straightforward strategies such as Long & Hold and Short & Hold and thereby limiting potential maximum losses. This result highlights the resilience of the QTNet strategy in the face of market volatility and its capacity to maintain stability in complex market conditions. Overall, the comprehensive performance of QTNet validates the efficiency of recurrent GRU networks in capturing the complexities of market dynamics in quantitative trading.

TABLE I. PERFORMANCE OF COMPARISON METHODS ON IC.

    Methods        Tr (%)     Sr        Vol      Mdd (%)
    Long & Hold     -8.32     -0.318    0.746    62.84
    Short & Hold     9.53      0.167    0.658    49.30
    DDPG           -17.26     -0.428    0.498    57.22
    QTNet           20.28      0.562    0.421    23.73

F. Ablation Experiments

The ablation studies on IC futures shed light on the contribution of each element of the QTNet framework to its aggregate performance. The data in Table II show that the integrated return trajectory of the full model stands out against its counterparts over the evaluation phase. The study dissects QTNet into three reduced variants: the basic RDPG, which uses only GRU networks; RDPG-DB, which adds the demonstration buffer; and RDPG-BC, which adds the behavior cloning module. The analysis shows that the base RDPG variant struggles to achieve action coherence and to master a lucrative trading approach. Incorporating the demonstration buffer in RDPG-DB markedly improves the efficiency of experience sampling, while the behavior cloning module in RDPG-BC curtails the detrimental effects of random exploration on the agent's performance. The full QTNet, combining all components, clearly excels on both profit and risk-management metrics, underlining the value of imitative learning strategies in this domain.

TABLE II. ABLATION EXPERIMENTS ON IC.

    Methods     Tr (%)    Sr       Vol      Mdd (%)
    RDPG         8.96     0.067    0.590    34.62
    RDPG-DB     18.72     0.467    0.452    23.72
    RDPG-BC     31.24     0.619    0.427    29.81
    QTNet       36.26     0.742    0.411    25.37
G. Generalization Ability

The versatility of the QTNet model is assessed by its performance under varying market conditions, specifically on the IF and IC futures, as reported in Table III. The table shows a stark contrast in Dual Thrust's effectiveness between the IC and IF markets, illustrating the potential constraints of fixed trading indicators. QTNet, on the other hand, exhibits superior performance, especially in the IC market, despite being trained solely on IF data. This demonstrates QTNet's inherent flexibility and its proficiency in discerning durable features that are crucial for formulating dynamic trading strategies applicable to different market environments. The model's ability to perform well across various markets cements its standing as a robust and adaptable tool for real-world financial applications.

TABLE III. PERFORMANCE OF COMPARISON METHODS ON IC AND IF.

    Methods       Data    Tr (%)     Sr        Vol      Mdd (%)
    Dual Thrust   IC       26.40      0.810    0.469    19.24
                  IF      -29.59     -0.577    0.796    55.81
    QTNet         IC       34.27      0.742    0.457    24.56
                  IF       28.73      0.523    0.517    26.25
VI. CONCLUSION

In this research, we presented QTNet, a dynamic, imitative learning framework designed for the challenges of Quantitative Trading (QT). We constructed a Partially Observable Markov Decision Process (POMDP) to adeptly handle the erratic nature of high-frequency trading data. Crucially, the model harnesses the principles of imitative learning to harmonize the exploration-exploitation trade-off, thus refining the decision-making process of the trading agent. We validated the performance of QTNet through rigorous evaluations on authentic stock-index futures data, acknowledging the operational limitations typical of financial trading. The outcomes affirmed the financial viability and risk-mitigation prowess of QTNet, and a series of comparative analyses demonstrated the model's capacity to adapt to varied market conditions. In conclusion, QTNet underscores the advantage for trading agents in real-world markets of drawing on established trading techniques. Our results highlight the enhanced adaptability and performance gains achieved by merging imitative learning with a reinforcement learning paradigm for QT solutions.
REFERENCES
[1] Nicolas Heess, Jonathan J. Hunt, Timothy P. Lillicrap, and David Silver. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.
[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, et al. Deep Q-learning from demonstrations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[3] Zhengyao Jiang, Dixing Xu, and Jinjun Liang. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.
[4] Zixun Lan, Binjie Hong, Ye Ma, and Fei Ma. More interpretable graph similarity computation via maximum common subgraph inference. arXiv preprint arXiv:2208.04580, 2022.
[5] Zixun Lan, Ye Ma, Limin Yu, Linglong Yuan, and Fei Ma. AEDNet: Adaptive edge-deleting network for subgraph matching. Pattern Recognition, 133:109033, 2023.
[6] Zixun Lan, Limin Yu, Linglong Yuan, Zili Wu, Qiang Niu, and Fei Ma. Sub-GMN: The subgraph matching network model. arXiv preprint arXiv:2104.00186, 2021.
[7] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[8] M. L. Littman and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
[9] Malik Magdon-Ismail and Amir F. Atiya. Maximum drawdown. Risk Magazine, 17(10):99–102, 2004.
[10] John J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications. Penguin, 1999.
[11] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 6292–6299. IEEE, 2018.
[12] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668. JMLR Workshop and Conference Proceedings, 2010.
[13] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
[14] Mel Vecerik, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv preprint arXiv:1707.08817, 2017.
[15] Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740, 2019.