Deep Reinforcement Learning for Optimal Portfolio Allocation
JPMorgan
Abstract

Portfolio Management is the process of overseeing a group of investments, referred to as a portfolio, with the objective of achieving predetermined investment goals. Portfolio Optimization is a key component that involves allocating the portfolio assets so as to maximize returns while minimizing the risk taken. Portfolio optimization is typically carried out by financial professionals who use a combination of quantitative techniques and investment expertise to make decisions about the portfolio allocation.

Recent applications of Deep Reinforcement Learning (DRL) have shown promising results when used to optimize portfolio allocation by training model-free agents on historical market data. Many of these methods compare their results against basic benchmarks or other state-of-the-art DRL agents, but often fail to compare their performance against traditional methods used by financial professionals in practical settings. One of the most commonly used methods for this task is Mean-Variance Portfolio Optimization (MVO), which uses historical timeseries information to estimate expected asset returns and covariances, which are then used to optimize for an investment objective.

Our work is a thorough comparison between model-free DRL and MVO for optimal portfolio allocation. We detail the specifics of how to make DRL for portfolio optimization work in practice, also noting the comparable tweaks needed to get MVO operational. Backtest results show strongly improved performance of the DRL agent in terms of many performance metrics, including Sharpe ratio, maximum drawdown, and absolute returns.

Introduction

Portfolio management is a key problem in the financial services domain. It consists of allocating funds across a diverse variety of assets, typically to generate uncorrelated returns while minimizing risk and operational costs. Portfolios can constitute holdings across asset classes (cash, bonds, equities, etc.), or can be optimized within a specific asset class (e.g., picking the appropriate composition of stocks for an equity portfolio). Investors may choose to optimize for various performance criteria, often centered around maximizing portfolio returns relative to the risk taken. Since the advent of Modern Portfolio Theory (Markowitz 1952), a lot of progress has been made in both theoretical and applied aspects of portfolio optimization. These range from improvements in the optimization process to the framing of additional constraints that might be desirable to rational investors (Cornuejols and Tütüncü 2006; Li and Hoi 2014; Kalayci et al. 2017; Ghahtarani, Saif, and Ghasemi 2022). Recently, the community has tapped the many advancements in Machine Learning (ML) to aid with feature selection, forecasting, and estimation of asset means and covariances, as well as using gradient-based methods for optimization.

Concurrently, the past decade has witnessed the success of Reinforcement Learning (RL) in fields such as gaming, robotics, and natural language processing (Silver et al. 2017; Nguyen and La 2019; Su et al. 2016). The sequential decision-making nature of Deep RL, along with its success in applied settings, has captured the attention of the finance research community. In particular, some of the most popular areas of focus for the application of DRL in finance have been automated stock trading (Yang et al. 2020; Théate and Ernst 2021; Zhang, Zohren, and Roberts 2020; Wu et al. 2020), risk management through deep hedging (Buehler et al. 2019; Du et al. 2020; Cao et al. 2021; Benhamou et al. 2020b), and portfolio optimization. In the upcoming section, we examine the landscape of DRL in portfolio optimization and trading problems. While these approaches exhibit improved performance over previous studies, they do have some shortcomings. For instance, some generate discrete asset trading signals, which limits their use in broader portfolio management. Additionally, the majority of these approaches compare results against ML or buy-and-hold baselines and do not consider classical portfolio optimization techniques such as Mean-Variance Optimization.

In our work, we aim to compare a simple and robust DRL framework, designed around risk-adjusted returns, with one of the traditional finance methods for portfolio optimization, MVO. We train policy gradient based agents on a multi-asset trading environment that simulates the US Equities market (using market data replay), and create observation states derived from the observed asset prices. The agents optimize for risk-adjusted returns, not dissimilar to the traditional MVO methods. We compare the performance of the DRL strategy against MVO through a series of systematic backtests, and observe improved performance along many performance metrics, including risk-adjusted returns, maximum drawdown, and portfolio turnover.
Related Work

There is a lot of recent research interest in the application of Deep RL to trading and portfolio management problems. For portfolio optimization, much of the research focuses on defining various policy network configurations and reports results that outperform various traditional baseline methods (Wang et al. 2019; Liang et al. 2018; Lu 2017; Jiang and Liang 2017; Wang et al. 2021; Deng et al. 2016; Cong et al. 2021). Other work explores frameworks that inject information into the RL agent's state by incorporating asset-endogenous information such as technical indicators (Liu et al. 2020; Sun et al. 2021; Du and Tanaka-Ishii 2020), as well as exogenous information such as signals extracted from news data (Ye et al. 2020; Lima Paiva et al. 2021).

The current benchmarks for DRL frameworks typically involve comparing results against other DRL or ML approaches, a buy-and-hold baseline, or market/index performance. However, these benchmarks may be overly simplistic or provide only a relative comparison. To truly gauge the effectiveness of a DRL agent, it would be more meaningful to benchmark it against methodologies used by financial professionals in practice, such as Mean-Variance Optimization (MVO).

While there are some approaches that compare DRL performance with MVO (Li et al. 2019; Koratamaddi et al. 2021; i Alonso, Srivastava et al. 2020), the comparison simply serves as another baseline, and the methodology is not clearly described because an in-depth comparison is not the primary focus of those studies. To our knowledge, there is only one study that provides a robust, in-depth comparison of MVO and DRL (Benhamou et al. 2020a). However, across all these studies, there is usually a discrepancy between the reward function used to train the RL agent and the objective function used for MVO (e.g., daily returns maximization vs. risk minimization). In order to make a fair comparison, it is crucial that both approaches optimize for the same goal. Additionally, some of these approaches provide exogenous information (e.g., signals from news data) to the DRL agent, which makes for a biased comparison with MVO. Finally, none of these works provide implementation details for the MVO frameworks they used for their comparison. We aim to address these issues by conducting a robust comparison of Deep RL and Mean-Variance Optimization for the portfolio allocation problem.

Background

The goal of portfolio optimization is to continuously diversify and reallocate funds across assets with the objective of maximizing realized rewards while simultaneously restraining the risk. In practice, portfolio management often aims to not only maximize risk-adjusted returns but also to perform as consistently as possible over a given time interval (e.g., on a quarterly or yearly basis).

Markowitz introduced Modern Portfolio Theory (MPT) (Markowitz 1952), a framework that allows an investor to mathematically balance risk tolerance and return expectations to obtain efficiently diversified portfolios. This framework relies on the assumption that a rational investor will prefer a portfolio with less risk for a specified level of return, and concludes that risk can be reduced by diversifying a portfolio. In this section, we introduce Mean-Variance Optimization (MVO) – one of the main techniques of MPT – which we later compare to the performance of our DRL framework. Additionally, we introduce RL preliminaries, describing the technique independent of portfolio optimization.

Mean-Variance Portfolio Optimization

Mean-Variance Optimization (MVO) is the mathematical process of allocating capital across a portfolio of assets (optimizing portfolio weights) to achieve a desired investment goal, usually one of: 1. maximize returns for a given level of risk, 2. achieve a desired rate of return while minimizing risk, or 3. maximize returns generated per unit of risk. Risk is usually measured by the volatility of a portfolio (or asset), which is the variance of its rate of return. For a given set of assets, this process requires as inputs the rates of return for each asset, along with their covariances. As the true asset returns are unknown, in practice these are estimated or forecasted using various techniques that leverage historical data.

The task is then framed as an optimization problem, single- or multi-objective, which can be solved in a variety of ways (Cornuejols and Tütüncü 2006; Kalayci et al. 2017; Ghahtarani, Saif, and Ghasemi 2022). A typical procedure is to solve it as a convex optimization problem and generate an efficient frontier of portfolios such that no portfolio can be improved without sacrificing some measure of performance (e.g., returns, risk). Let w be the weight vector for a set of assets and µ be the expected returns; the portfolio risk can be described as w^T Σ w for covariance matrix Σ. To achieve a desired rate of return µ*, we can solve the portfolio optimization problem:

$$
\begin{aligned}
\underset{w}{\text{minimize}} \quad & w^T \Sigma w \\
\text{subject to} \quad & w^T \mu \geq \mu^*, \\
& w_i \geq 0, \\
& \textstyle\sum_i w_i = 1
\end{aligned}
$$

Varying µ* gives us the aforementioned efficient frontier.
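For concreteness, the sketch below shows one way (not necessarily the one used in this work) to trace the efficient frontier with the cvxpy modelling library, assuming µ and Σ have already been estimated from historical returns (e.g., sample means and a sample or shrunk covariance); higher-level libraries such as PyPortfolioOpt (Martin 2021) wrap the same computation.

```python
# A minimal sketch (not the paper's implementation) of tracing the efficient
# frontier. `mu` and `Sigma` are assumed to be estimated from historical
# returns; Sigma is assumed positive semidefinite.
import cvxpy as cp
import numpy as np

def efficient_frontier(mu: np.ndarray, Sigma: np.ndarray, n_points: int = 20):
    """Solve min w'Σw s.t. w'µ >= µ*, w >= 0, sum(w) = 1 over a grid of µ*."""
    n = len(mu)
    w = cp.Variable(n)
    target = cp.Parameter()  # µ*, varied to sweep the frontier
    constraints = [w @ mu >= target, w >= 0, cp.sum(w) == 1]
    problem = cp.Problem(cp.Minimize(cp.quad_form(w, Sigma)), constraints)

    frontier = []
    for mu_star in np.linspace(mu.min(), mu.max(), n_points):
        target.value = mu_star
        problem.solve()
        if w.value is not None:  # skip infeasible targets
            frontier.append((mu_star, float(w.value @ Sigma @ w.value), w.value.copy()))
    return frontier  # list of (target return, portfolio variance, weights)
```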
Another common objective is the Sharpe Ratio (Sharpe 1998; Chen, He, and Zhang 2011), which measures the return per unit of risk. Formally, for portfolio p, the Sharpe Ratio is defined as:

$$
\text{Sharpe Ratio}_p = \frac{\mathbb{E}[R_p - R_f]}{\sigma_p}
$$

where R_p are the returns of the portfolio, σ_p is the standard deviation of these returns, and R_f is a constant risk-free rate (e.g., US Treasuries, approximated by 0.0% in recent history). Although tricky to solve in its direct form,

$$
\max_{w} \; \frac{\mu^T w - R_f}{(w^T \Sigma w)^{1/2}},
$$

it can be framed as a convex optimization problem through the use of a variable substitution (Cornuejols and Tütüncü 2006). We choose the Sharpe Ratio as our desired objective function for this study, as we can optimize for risk-adjusted returns without having to specify explicit figures for minimum expected returns or maximum risk tolerance.
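The variable substitution mentioned above can be made concrete: maximizing the Sharpe Ratio is equivalent to minimizing a quadratic form in an auxiliary variable y, with the weights recovered by normalization. The snippet below is our own schematic illustration of this standard transformation, not code from the paper, again assuming estimated µ, Σ, and a risk-free rate R_f.

```python
# Schematic max-Sharpe via the standard substitution y = κ·w:
# minimize y'Σy  s.t.  (µ − R_f)'y = 1,  y ≥ 0,  then recover w = y / sum(y).
# Assumes at least one asset has expected return above the risk-free rate.
import cvxpy as cp
import numpy as np

def max_sharpe_weights(mu: np.ndarray, Sigma: np.ndarray, rf: float = 0.0) -> np.ndarray:
    n = len(mu)
    y = cp.Variable(n)
    excess = mu - rf
    problem = cp.Problem(
        cp.Minimize(cp.quad_form(y, Sigma)),
        [excess @ y == 1, y >= 0],
    )
    problem.solve()
    y_val = np.asarray(y.value).ravel()
    return y_val / y_val.sum()  # long-only, fully invested weights
```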
Reinforcement Learning

Reinforcement Learning (RL) is a sub-field of machine learning that refers to a class of techniques involving learning by optimizing long-term reward sequences obtained through interactions with an environment (Sutton and Barto 2018). An environment is typically formalized by means of a Markov Decision Process (MDP). An MDP consists of a 5-tuple (S, A, P_a, R_a, γ), where:
• S is a set of states
• A is a set of actions
• P_a(s, s′) = Pr(s_{t+1} = s′ | s_t = s, a_t = a) is the probability that action a in state s at time t will lead to state s′ at time t + 1
• R_a(s, s′) is the immediate reward received after transitioning from state s to state s′, due to action a
• γ ∈ [0, 1] is a discount factor that represents the difference in importance between present and future rewards

A solution to an MDP is a policy π that specifies the action π(s) that the decision maker will choose when in state s. The objective is to choose a policy π that maximizes the expected discounted sum of rewards over a potentially infinite horizon:

$$
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]
$$

The field of Deep Reinforcement Learning (DRL) leverages the advancements in Deep Learning by using neural networks as function approximators to estimate state-action value functions, or to learn policy mappings π. These techniques have seen tremendous success in game-playing, robotics, continuous control, and finance (Mnih et al. 2013; Berner et al. 2019; Nguyen and La 2019; Hambly, Xu, and Yang 2021; Charpentier, Elie, and Remlinger 2021).
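As a reference point for the notation above, the following sketch shows the generic agent-environment loop that both value-based and policy-gradient methods build on, written against a gym-style interface. The `env` and `policy` objects are placeholders for illustration, not components of the paper's framework.

```python
# Generic MDP rollout accumulating the discounted return Σ γ^t R_t.
# `env` is assumed to expose a gym-style reset()/step() API and `policy`
# maps states to actions; both are stand-ins rather than the paper's objects.
def rollout(env, policy, gamma: float = 0.99, max_steps: int = 1_000) -> float:
    state = env.reset()
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                      # a_t = π(s_t)
        state, reward, done, _ = env.step(action)   # observe s_{t+1} and R_{a_t}(s_t, s_{t+1})
        discounted_return += discount * reward
        discount *= gamma
        if done:
            break
    return discounted_return
```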
RL for Portfolio Allocation. Given its success in stochastic control problems, RL extends nicely to the problem of portfolio optimization. Therefore, it is not surprising that the use of DRL to perform tasks such as trading and portfolio optimization has received a lot of attention lately. Recent methods focus on learning deep features and state representations, for example through the use of embedding features derived from deep neural networks such as autoencoders and LSTM models. These embeddings capture price-related features which can range from technical indicators (Wang et al. 2019; Soleymani and Paquet 2020; Wang et al. 2021) to information extracted from news in order to account for price fluctuations (Ye et al. 2020). Other proposed features use attention networks or graph structures (Wang et al. 2021, 2019) to perform cross-asset interrelationship feature extraction.

Problem Setup

We frame the portfolio optimization problem in the RL setting. As described in the Background section, RL entails learning in a framework with interactions between an agent and an environment. For the portfolio optimization setting, we create an environment that simulates the US Equities market (using market data replay), and create observation states derived from the observed asset prices. The agent's actions output a set of portfolio weights, which are used to rebalance the portfolio at each timestep.

Actions

For portfolio allocation over N assets, an agent selects portfolio weights w = [w_1, ..., w_N] such that Σ_{i=1}^{N} w_i = 1, where 0 ≤ w_i ≤ 1. An asset weight of 0 indicates zero holdings of a particular asset in the portfolio, whereas a weight of 1 means the entire portfolio is concentrated in that asset. In extensions of this framework, w_i < 0 would allow shorting an asset, whereas w_i > 1 indicates a leveraged position. However, for our case, we restrict actions to non-leveraged, long-only positions. These constraints can be enforced by applying the softmax function to an agent's continuous actions.

States

An asset's price at time t is denoted by P_t. The one-period simple return is defined as R_t = (P_t − P_{t−1}) / P_{t−1}; consequently, the one-period gross return can be defined as P_t / P_{t−1} = R_t + 1. Further, we can define the one-period log return as r_t = log(P_t / P_{t−1}) = log(R_t + 1). For our setting, we choose the time period to be daily, and therefore calculate daily log returns using end-of-day close prices. An asset's log returns over a lookback period T can then be captured as r_t = [r_{t−1}, r_{t−2}, ..., r_{t−T+1}]. In our case, the lookback period is T = 60 days.

For a selection of n + 1 assets – n securities and cash (denoted by c) – we form the agent's observation state at time t, S_t, as a [(n + 1) × T] matrix:

$$
S_t = \begin{bmatrix}
w_1 & r_{1,t-1} & \cdots & r_{1,t-T+1} \\
w_2 & r_{2,t-1} & \cdots & r_{2,t-T+1} \\
\vdots & \vdots & & \vdots \\
w_n & r_{n,t-1} & \cdots & r_{n,t-T+1} \\
w_c & \text{vol}_{20} & \frac{\text{vol}_{20}}{\text{vol}_{60}} & \text{VIX}_t & \cdots
\end{bmatrix}
$$

The first column is the agent's portfolio allocation vector w as it enters timestep t. This might differ slightly from the portfolio weights chosen at the previous timestep, as we convert the continuous weights into an actual allocation (whole shares only) and rebalance the allocation so that it sums to 1. For each non-cash asset, we include the log returns over the lookback period T, represented by the vector [r_{n,t−1}, ..., r_{n,t−T+1}] for asset n in the state matrix above. Additionally, in the last row, we include three market volatility indicators at time t: vol_20, vol_20/vol_60, and VIX.
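To make the state construction concrete, the sketch below assembles such an observation from a table of close prices and the three volatility indicators. The function and variable names are illustrative only, and the zero-padding of the last row is one possible convention rather than necessarily the paper's.

```python
# Illustrative assembly of the observation matrix S_t from end-of-day closes.
# `closes` is a (dates × assets) DataFrame, `weights` the current allocation
# (last entry = cash), and `vol20`, `vol60`, `vix` scalars at time t.
# Names and the zero-padding of the cash row are assumptions for illustration.
import numpy as np
import pandas as pd

def build_state(closes: pd.DataFrame, weights: np.ndarray,
                vol20: float, vol60: float, vix: float, T: int = 60) -> np.ndarray:
    log_ret = np.log(closes / closes.shift(1)).dropna()
    window = log_ret.iloc[-(T - 1):].to_numpy().T[:, ::-1]   # newest first, shape (n, T-1)
    n = window.shape[0]

    asset_rows = np.hstack([weights[:n, None], window])       # [w_i | r_{i,t-1..t-T+1}]
    indicators = np.array([vol20, vol20 / vol60, vix])
    cash_row = np.concatenate([[weights[-1]], indicators,
                               np.zeros(asset_rows.shape[1] - 1 - len(indicators))])
    return np.vstack([asset_rows, cash_row])                   # shape (n + 1, T)
```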
Figure 3: a) DRL Monthly Returns b) DRL Annual Returns c) DRL Monthly Distribution of Returns.
Figure 4: a) MVO Monthly Returns b) MVO Annual Returns c) MVO Monthly Distribution of Returns.
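The return comparisons summarized in Figures 3 and 4, and in the list below, rely on standard backtest statistics. A minimal, self-contained way to compute three of them from a daily returns series is sketched here; this is our own illustration, not the authors' evaluation code.

```python
# Standard backtest statistics from a series of daily simple returns.
# Our own illustration of the metrics referenced in the results;
# 252 trading days per year and a 0% risk-free rate are assumed.
import numpy as np

def performance_metrics(daily_returns: np.ndarray, periods_per_year: int = 252) -> dict:
    wealth = np.cumprod(1.0 + daily_returns)
    years = len(daily_returns) / periods_per_year

    annual_return = wealth[-1] ** (1.0 / years) - 1.0
    sharpe = np.sqrt(periods_per_year) * daily_returns.mean() / daily_returns.std()
    max_drawdown = np.min(wealth / np.maximum.accumulate(wealth) - 1.0)

    return {"annual_return": annual_return,
            "sharpe": sharpe,
            "max_drawdown": max_drawdown}
```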
• We have designed a simple environment that serves as a wrapper for the market, sliding over historical data using market replay (a schematic sketch of such a wrapper follows this list). The environment can allocate multiple assets and can be easily modified to reflect transaction costs.
• We compare our framework's performance during ten backtest experiments over different periods for the US Equities market using S&P 500 sector indices. Our experiments demonstrate the improved performance of the deep reinforcement learning framework over Mean-Variance portfolio optimization.
• The profitability of the framework surpasses MVO in terms of annual returns, Sharpe ratio, and maximum drawdown. Additionally, we observe that the DRL strategy leads to more consistent returns and more stable portfolios with decreased turnover. This has implications for live deployment, where transaction costs and slippage affect P&L.
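The sketch below outlines what such a market-replay wrapper might look like with a gym-style interface. Class and method names are our own placeholders, and the reward shown is a plain one-step portfolio log return rather than the risk-adjusted reward actually used to train the agents.

```python
# Hypothetical skeleton of a market-replay portfolio environment (gym-style).
# Names are placeholders; the reward here is a simple one-step log return,
# not the risk-adjusted reward used in the paper.
import numpy as np

class MarketReplayEnv:
    def __init__(self, asset_returns: np.ndarray, lookback: int = 60):
        self.returns = asset_returns          # (timesteps, n_assets) simple returns
        self.lookback = lookback
        self.t = lookback
        self.weights = np.ones(asset_returns.shape[1]) / asset_returns.shape[1]

    def reset(self) -> np.ndarray:
        self.t = self.lookback
        self.weights = np.ones_like(self.weights) / len(self.weights)
        return self._observation()

    def step(self, action: np.ndarray):
        # Map the agent's unconstrained action to long-only, fully-invested weights.
        exp_a = np.exp(action - action.max())
        self.weights = exp_a / exp_a.sum()

        portfolio_return = float(self.weights @ self.returns[self.t])
        reward = np.log1p(portfolio_return)   # one-step log return (placeholder reward)
        self.t += 1
        done = self.t >= len(self.returns)
        return self._observation(), reward, done, {}

    def _observation(self) -> np.ndarray:
        window = self.returns[self.t - self.lookback:self.t]   # (lookback, n_assets)
        return np.hstack([self.weights[:, None], window.T])    # (n_assets, lookback + 1)
```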
Future Work

In our future work, we would like to model transaction costs and slippage, either by explicitly calculating them during asset reallocation or as a penalty term in our reward. Moreover, we would like to explore adding a drawdown-minimization component to our reward that could help the agent learn a more consistent trading strategy.

Another area of exploration is training a regime-switching model that balances its funds between two agents depending on market volatility (low vs. high): one agent trained on low-volatility data and the other on high-volatility data. We would like to compare performance between our current implicit regime parametrization and an explicit one. Further exploration of these research directions may produce significant insights into practical trading behavior.

Disclaimer: This paper was prepared for information purposes by the Artificial Intelligence Research group of J. P. Morgan Chase & Co. and its affiliates ("J. P. Morgan"), and is not a product of the Research Department of J. P. Morgan. J. P. Morgan makes no representation and warranty whatsoever and disclaims all liability for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020a. Bridging the gap between Markowitz planning and deep reinforcement learning. arXiv preprint arXiv:2010.09108.
Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020b. Time your hedge with deep reinforcement learning. arXiv preprint arXiv:2009.14136.
Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
Buehler, H.; Gonon, L.; Teichmann, J.; and Wood, B. 2019. Deep hedging. Quantitative Finance, 19(8): 1271–1291.
Cao, J.; Chen, J.; Hull, J.; and Poulos, Z. 2021. Deep hedging of derivatives using reinforcement learning. The Journal of Financial Data Science, 3(1): 10–27.
Charpentier, A.; Elie, R.; and Remlinger, C. 2021. Reinforcement learning in economics and finance. Computational Economics, 1–38.
Chen, L.; He, S.; and Zhang, S. 2011. When all risk-adjusted performance measures are the same: In praise of the Sharpe ratio. Quantitative Finance, 11(10): 1439–1447.
Cong, L. W.; Tang, K.; Wang, J.; and Zhang, Y. 2021. AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI. Available at SSRN, 3554486.
Cont, R. 2001. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1(2): 223.
Cornuejols, G.; and Tütüncü, R. 2006. Optimization Methods in Finance, volume 5. Cambridge University Press.
Dempster, M. A.; and Leemans, V. 2006. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30(3): 543–552.
Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; and Dai, Q. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3): 653–664.
Du, J.; Jin, M.; Kolm, P. N.; Ritter, G.; Wang, Y.; and Zhang, B. 2020. Deep reinforcement learning for option replication and hedging. The Journal of Financial Data Science, 2(4): 44–57.
Du, X.; and Tanaka-Ishii, K. 2020. Stock embeddings acquired from news articles and price history, and an application to portfolio optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3353–3363.
Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; and Madry, A. 2019. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations.
Ghahtarani, A.; Saif, A.; and Ghasemi, A. 2022. Robust portfolio selection problems: a comprehensive review. Operational Research, 1–62.
Hambly, B.; Xu, R.; and Yang, H. 2021. Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.
Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
i Alonso, M. N.; Srivastava, S.; et al. 2020. Deep reinforcement learning for asset allocation in US equities. Technical report.
Jiang, Z.; and Liang, J. 2017. Cryptocurrency portfolio management with deep reinforcement learning. In 2017 Intelligent Systems Conference (IntelliSys), 905–913. IEEE.
Kalayci, C.; Ertenlice, O.; Akyer, H.; and Aygören, H. 2017. A review on the current applications of genetic algorithms in mean-variance portfolio optimization. Pamukkale University Journal of Engineering Sciences, 23: 470–476.
Koratamaddi, P.; Wadhwani, K.; Gupta, M.; and Sanjeevi, S. G. 2021. Market sentiment-aware deep reinforcement learning approach for stock portfolio allocation. Engineering Science and Technology, an International Journal, 24(4): 848–859.
Ledoit, O.; and Wolf, M. 2004. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4): 110–119.
Li, B.; and Hoi, S. C. 2014. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3): 1–36.
Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation. arXiv preprint arXiv:1907.01503.
Liang, Z.; Chen, H.; Zhu, J.; Jiang, K.; and Li, Y. 2018. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940.
Lima Paiva, F. C.; Felizardo, L. K.; Bianchi, R. A. d. C.; and Costa, A. H. R. 2021. Intelligent trading systems: a sentiment-aware reinforcement learning approach. In Proceedings of the Second ACM International Conference on AI in Finance, 1–9.
Liu, X.-Y.; Yang, H.; Chen, Q.; Zhang, R.; Yang, L.; Xiao, B.; and Wang, C. D. 2020. FinRL: A deep reinforcement learning library for automated stock trading in quantitative finance. arXiv preprint arXiv:2011.09607.
Lu, D. W. 2017. Agent inspired trading using recurrent reinforcement learning and LSTM neural networks. arXiv preprint arXiv:1707.07338.
Markowitz, H. 1952. Portfolio Selection. The Journal of Finance, 7(1): 77–91.
Martin, R. A. 2021. PyPortfolioOpt: portfolio optimization in Python. Journal of Open Source Software, 6(61): 3066.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937. PMLR.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moody, J.; and Saffell, M. 2001. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4): 875–889.
Moody, J.; Wu, L.; Liao, Y.; and Saffell, M. 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6): 441–470.
Nguyen, H.; and La, H. 2019. Review of deep reinforcement learning for robot manipulation. In 2019 Third IEEE International Conference on Robotic Computing (IRC), 590–595. IEEE.
Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; and Dormann, N. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268): 1–8.
Rao, N.; Aljalbout, E.; Sauer, A.; and Haddadin, S. 2020. How to make deep RL work in practice. arXiv preprint arXiv:2010.13083.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sharpe, W. F. 1998. The Sharpe ratio. Streetwise–the Best of the Journal of Portfolio Management, 169–185.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676): 354–359.
Soleymani, F.; and Paquet, E. 2020. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder—DeepBreath. Expert Systems with Applications, 156: 113456.
Su, P.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L.; Ultes, S.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669.
Sun, S.; Wang, R.; He, X.; Zhu, J.; Li, J.; and An, B. 2021. DeepScalper: A risk-aware deep reinforcement learning framework for intraday trading with micro-level market embedding. arXiv preprint arXiv:2201.09058.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.
Théate, T.; and Ernst, D. 2021. An application of deep reinforcement learning to algorithmic trading. Expert Systems with Applications, 173: 114632.
Wang, J.; Zhang, Y.; Tang, K.; Wu, J.; and Xiong, Z. 2019. AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1900–1908.
Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; and Xu, L. 2021. DeepTrader: A deep reinforcement learning approach for risk-return balanced portfolio management with market conditions embedding. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1): 643–650.
Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; and Fujita, H. 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences, 538: 142–158.
Yang, H.; Liu, X.-Y.; Zhong, S.; and Walid, A. 2020. Deep
reinforcement learning for automated stock trading: An en-
semble strategy. In Proceedings of the First ACM Interna-
tional Conference on AI in Finance, 1–8.
Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and
Li, B. 2020. Reinforcement-learning based portfolio man-
agement with augmented asset movement prediction states.
In Proceedings of the AAAI Conference on Artificial Intelli-
gence, volume 34, 1112–1119.
Zhang, Z.; Zohren, S.; and Roberts, S. 2020. Deep rein-
forcement learning for trading. The Journal of Financial
Data Science, 2(2): 25–40.