Deep Reinforcement Learning for Optimal Portfolio Allocation: A Comparative Study with Mean-Variance Optimization

Srijan Sood,¹ Kassiani Papasotiriou,¹ Marius Vaiciulis,² Tucker Balch¹
¹ J.P. Morgan AI Research
² J.P. Morgan Global Equities; Oxford-Man Institute of Quantitative Finance
{srijan.sood, kassiani.papasotiriou, marius.vaiciulis, tucker.balch}@jpmorgan.com

Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

Portfolio Management is the process of overseeing a group of investments, referred to as a portfolio, with the objective of achieving predetermined investment goals and objectives. Portfolio Optimization is a key component that involves allocating the portfolio assets so as to maximize returns while minimizing the risk taken. Portfolio optimization is typically carried out by financial professionals who use a combination of quantitative techniques and investment expertise to make decisions about the portfolio allocation.
Recent applications of Deep Reinforcement Learning (DRL) have shown promising results when used to optimize portfolio allocation by training model-free agents on historical market data. Many of these methods compare their results against basic benchmarks or other state-of-the-art DRL agents, but often fail to compare their performance against traditional methods used by financial professionals in practical settings. One of the most commonly used methods for this task is Mean-Variance Portfolio Optimization (MVO), which uses historical timeseries information to estimate expected asset returns and covariances, which are then used to optimize for an investment objective.
Our work is a thorough comparison between model-free DRL and MVO for optimal portfolio allocation. We detail the specifics of how to make DRL for portfolio optimization work in practice, also noting the comparable tweaks needed to get MVO operational. Backtest results display strongly improved performance of the DRL agent in terms of many performance metrics, including Sharpe ratio, maximum drawdowns, and absolute returns.

Introduction

Portfolio management is a key issue in the financial services domain. It constitutes allocating funds across a diverse variety of assets, typically to generate uncorrelated returns while minimizing risk and operational costs. Portfolios can constitute holdings across asset classes (cash, bonds, equities, etc.), or can also be optimized within a specific asset class (e.g., picking the appropriate composition of stocks for an equity portfolio). Investors may choose to optimize for various performance criteria, often centered around maximizing portfolio returns relative to the risk taken. Since the advent of Modern Portfolio Theory (Markowitz 1952), a lot of progress has been made in both theoretical and applied aspects of portfolio optimization. These range from improvements in the optimization process to the framing of additional constraints that might be desirable to rational investors (Cornuejols and Tütüncü 2006; Li and Hoi 2014; Kalayci et al. 2017; Ghahtarani, Saif, and Ghasemi 2022). Recently, the community has tapped the many advancements in Machine Learning (ML) to aid with feature selection, forecasting and estimation of asset means and covariances, as well as using gradient-based methods for optimization.
Concurrently, the past decade has witnessed the success of Reinforcement Learning (RL) in the fields of gaming, robotics, natural language processing, etc. (Silver et al. 2017; Nguyen and La 2019; Su et al. 2016). The sequential decision-making nature of Deep RL, along with its success in applied settings, has captured the attention of the finance research community. In particular, some of the most popular areas of application of DRL in finance have been automated stock trading (Yang et al. 2020; Théate and Ernst 2021; Zhang, Zohren, and Roberts 2020; Wu et al. 2020), risk management through deep hedging (Buehler et al. 2019; Du et al. 2020; Cao et al. 2021; Benhamou et al. 2020b), and portfolio optimization. In the upcoming section, we examine the landscape of DRL in portfolio optimization and trading problems. While these approaches exhibit improved performance over previous studies, they do have some shortcomings. For instance, some generate discrete asset trading signals, which limits their use in broader portfolio management. Additionally, the majority of these approaches compare results against ML or buy-and-hold baselines, and do not consider classical portfolio optimization techniques, such as Mean-Variance Optimization.
In our work, we aim to compare a simple and robust DRL framework, designed around risk-adjusted returns, with one of the traditional finance methods for portfolio optimization, MVO. We train policy gradient based agents on a multi-asset trading environment that simulates the US Equities market (using market data replay), and create observation states derived from the observed asset prices. The agents optimize for risk-adjusted returns, not dissimilar to the traditional MVO methods. We compare the performance of the DRL strategy against MVO through a series of systematic backtests, and observe improved performance along many performance metrics, including risk-adjusted returns, max drawdown, and portfolio turnover.
Related Work

There is a lot of recent research interest in the application of Deep RL to trading and portfolio management problems. For portfolio optimization, much of the research focuses on defining various policy network configurations and reports results that outperform various traditional baseline methods (Wang et al. 2019; Liang et al. 2018; Lu 2017; Jiang and Liang 2017; Wang et al. 2021; Deng et al. 2016; Cong et al. 2021). Other work explores frameworks that inject information into the RL agent's state by incorporating asset-endogenous information such as technical indicators (Liu et al. 2020; Sun et al. 2021; Du and Tanaka-Ishii 2020), as well as exogenous information such as information extracted from news data (Ye et al. 2020; Lima Paiva et al. 2021).
The current benchmarks for DRL frameworks typically involve comparing results against other DRL or ML approaches, a buy-and-hold baseline, or market/index performance. However, these benchmarks may be overly simplistic or provide only a relative comparison. To truly gauge the effectiveness of a DRL agent, it would be more meaningful to benchmark it against methodologies used by financial professionals in practice, such as Mean-Variance Optimization (MVO).
While there are some approaches that compare DRL performance with MVO (Li et al. 2019; Koratamaddi et al. 2021; i Alonso, Srivastava et al. 2020), the comparison simply serves as another baseline, and the methodology is not clearly described because an in-depth comparison is not the primary focus of their study. To our knowledge, there is only one study that goes into a robust, in-depth comparison of MVO and DRL (Benhamou et al. 2020a). However, across all these studies, there is usually a discrepancy between the reward function used to train the RL agent and the objective function used for MVO (e.g., daily returns maximization vs. risk minimization). In order to make a fair comparison, it is crucial that both approaches optimize for the same goal. Additionally, some of these approaches provide exogenous information (e.g., signals from news data) to the DRL agent, which makes for a biased comparison with MVO. Furthermore, none of these works provide implementation details for the MVO frameworks they used for their comparison. We aim to address these issues by conducting a robust comparison of Deep RL and Mean-Variance Optimization for the Portfolio Allocation problem.

Background

The goal of portfolio optimization is to continuously diversify and reallocate funds across assets with the objective of maximizing realized rewards while simultaneously restraining the risk. In practice, portfolio management often aims to not only maximize risk-adjusted returns but also to perform as consistently as possible over a given time interval (e.g., on a quarterly or yearly basis).
Markowitz introduced Modern Portfolio Theory (MPT) (Markowitz 1952), a framework that allows an investor to mathematically balance risk tolerance and return expectations to obtain efficiently diversified portfolios. This framework relies on the assumption that a rational investor will prefer a portfolio with less risk for a specified level of return, and concludes that risk can be reduced by diversifying a portfolio. In this section, we introduce Mean-Variance Optimization (MVO) – one of the main techniques of MPT – which we later compare to the performance of our DRL framework. Additionally, we introduce RL preliminaries, describing the technique independent of portfolio optimization.

Mean-Variance Portfolio Optimization

Mean-Variance Optimization (MVO) is the mathematical process of allocating capital across a portfolio of assets (optimizing portfolio weights) to achieve a desired investment goal, usually: 1. Maximize returns for a given level of risk, 2. Achieve a desired rate of return while minimizing risk, or 3. Maximize returns generated per unit risk. Risk is usually measured by the volatility of a portfolio (or asset), which is the variance of its rate of return. For a given set of assets, this process requires as inputs the rates of returns for each asset, along with their covariances. As the true asset returns are unknown, in practice these are estimated or forecasted using various techniques that leverage historical data.
This task is then framed as an optimization problem, single or multi-objective, which can be solved in a variety of ways (Cornuejols and Tütüncü 2006; Kalayci et al. 2017; Ghahtarani, Saif, and Ghasemi 2022). A typical procedure is to solve it as a convex optimization problem and generate an efficient frontier of portfolios such that no portfolio can be improved without sacrificing some measure of performance (e.g., returns, risk). Let w be the weight vector for a set of assets and µ the expected returns; the portfolio risk can then be described as wᵀΣw, for covariance matrix Σ. To achieve a desired rate of return µ*, we can solve the portfolio optimization problem:

$$\min_{w} \; w^T \Sigma w \quad \text{subject to} \quad w^T \mu \geq \mu^*, \;\; w_i \geq 0, \;\; \sum_i w_i = 1$$

Varying µ* gives us the aforementioned efficient frontier.
Another common objective is the Sharpe Ratio (Sharpe 1998; Chen, He, and Zhang 2011), which measures the return per unit risk. Formally, for portfolio p, the Sharpe Ratio is defined as:

$$\text{Sharpe Ratio}_p = \frac{E[R_p - R_f]}{\sigma_p}$$

where Rp are the returns of the portfolio, σp is the standard deviation of these returns, and Rf is a constant risk-free rate (e.g., US Treasuries, approximated by 0.0% in recent history). Although tricky to solve in its direct form –

$$\max_{w} \; \frac{\mu^T w - R_f}{(w^T \Sigma w)^{1/2}}$$

– it can be framed as a convex optimization problem through the use of a variable substitution (Cornuejols and Tütüncü 2006). We choose the Sharpe Ratio as our desired objective function for this study, as we can optimize for risk-adjusted returns without having to specify explicit figures for minimum expected returns or maximum risk tolerance.
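For concreteness, the minimum-variance program above can be written down directly with an off-the-shelf convex solver. The snippet below is a minimal sketch using cvxpy (a different library from the PyPortfolioOpt implementation used later in the paper); the toy returns matrix and the target return mu_star are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
R = rng.normal(0.0005, 0.01, size=(500, 4))   # toy daily returns, 4 assets (illustrative)
mu = R.mean(axis=0)                            # estimated expected returns
Sigma = np.cov(R, rowvar=False)                # estimated covariance matrix
mu_star = float(np.median(mu))                 # desired rate of return (assumption)

w = cp.Variable(len(mu))
problem = cp.Problem(
    cp.Minimize(cp.quad_form(w, Sigma)),       # minimize portfolio variance w' Sigma w
    [mu @ w >= mu_star,                        # hit the target expected return
     cp.sum(w) == 1,                           # fully invested
     w >= 0],                                  # long-only
)
problem.solve()
print(np.round(w.value, 3))                    # sweeping mu_star traces the efficient frontier
```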
Reinforcement Learning

Reinforcement Learning (RL) is a sub-field of machine learning that refers to a class of techniques that involve learning by optimizing long-term reward sequences obtained by interactions with an environment (Sutton and Barto 2018). An environment is typically formalized by means of a Markov Decision Process (MDP). An MDP consists of a 5-tuple (S, A, Pa, Ra, γ), where:
• S is a set of states
• A is a set of actions
• Pa(s, s′) = Pr(st+1 = s′ | st = s, at = a) is the probability that action a in state s at time t will lead to state s′ at time t + 1
• Ra(s, s′) is the immediate reward received after transitioning from state s to state s′, due to action a
• γ is a discount factor between [0, 1] that represents the difference in importance between present and future rewards
A solution to an MDP is a policy π that specifies the action π(s) that the decision maker will choose when in state s. The objective is to choose a policy π that will maximize the expected discounted sum of rewards over a potentially infinite horizon:

$$E\left[\sum_{t=0}^{\infty} \gamma^{t} R_{a_t}(s_t, s_{t+1})\right]$$

The field of Deep Reinforcement Learning (DRL) leverages the advancements in Deep Learning by using Neural Networks as function approximators to estimate state-action value functions, or to learn policy mappings π. These techniques have seen tremendous success in game-playing, robotics, continuous control, and finance (Mnih et al. 2013; Berner et al. 2019; Nguyen and La 2019; Hambly, Xu, and Yang 2021; Charpentier, Elie, and Remlinger 2021).

RL for Portfolio Allocation

Given its success in stochastic control problems, RL extends nicely to the problem of portfolio optimization. Therefore, it is not surprising that the use of DRL to perform tasks such as trading and portfolio optimization has received a lot of attention lately. Recent methods focus on learning deep features and state representations, for example, through the use of embedding features derived from deep neural networks such as autoencoders and LSTM models. These embeddings capture price-related features which can range from technical indicators (Wang et al. 2019; Soleymani and Paquet 2020; Wang et al. 2021) to information extracted from news in order to account for price fluctuations (Ye et al. 2020). Other proposed features use attention networks or graph structures (Wang et al. 2021, 2019) to perform cross-asset interrelationship feature extraction.

Problem Setup

We frame the portfolio optimization problem in the RL setting. As described in the Background section, RL entails learning in a framework with interactions between an agent and an environment. For the portfolio optimization setting, we create an environment that simulates the US Equities market (using market data replay), and create observation states derived from the observed asset prices. The agent's actions output a set of portfolio weights, which are used to rebalance the portfolio at each timestep.

Actions

For portfolio allocation over N assets, an agent selects portfolio weights w = [w1, . . . , wN] such that Σᵢ wᵢ = 1, where 0 ≤ wᵢ ≤ 1. An asset weight of 0 indicates zero holdings of a particular asset in a portfolio, whereas a weight of 1 means the entire portfolio is concentrated in said asset. In extensions of this framework, wᵢ < 0 would allow for shorting an asset, whereas wᵢ > 1 indicates a leveraged position. However, for our case, we restrict actions to non-leveraged, long-only positions. These constraints can be enforced by applying the softmax function to an agent's continuous actions.
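A long-only weight vector can be obtained from the agent's raw, unconstrained outputs with a standard softmax. A small illustrative helper (not taken from the paper's code):

```python
import numpy as np

def softmax_weights(raw_action: np.ndarray) -> np.ndarray:
    """Map an unconstrained action vector to non-negative weights that sum to 1."""
    z = raw_action - raw_action.max()   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Example: raw network outputs over [asset_1, asset_2, asset_3, cash]
print(softmax_weights(np.array([0.3, -1.2, 0.7, 0.0])))
```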
States

An asset's price at time t is denoted by Pt. The one-period simple return is defined as Rt = (Pt − Pt−1)/Pt−1. Consequently, the one-period gross return can be defined as Pt/Pt−1 = Rt + 1. Further, we can define the one-period log return as rt = log(Pt/Pt−1) = log(Rt + 1). For our setting, we choose the time period to be daily, and therefore calculate daily log returns using end-of-day close prices. An asset's log returns over a lookback period T can then be captured as rt = [rt−1, rt−2, . . . , rt−T+1]. In our case, the lookback period is T = 60 days.
For a selection of n + 1 assets – n securities and cash (denoted by c) – we form the agent's observation state at time t, St, as a [(n + 1) × T] matrix:

$$
S_t = \begin{bmatrix}
w_1 & r_{1,t-1} & \ldots & r_{1,t-T+1} \\
w_2 & r_{2,t-1} & \ldots & r_{2,t-T+1} \\
\vdots & \vdots & & \vdots \\
w_n & r_{n,t-1} & \ldots & r_{n,t-T+1} \\
w_c & \mathrm{vol}_{20} & \mathrm{vol}_{20}/\mathrm{vol}_{60} & \mathrm{VIX}_t & \ldots
\end{bmatrix}
$$

The first column is the agent's portfolio allocation vector w as it enters timestep t. This might differ slightly from the portfolio weights it chose at the timestep before, as we convert the continuous weights into an actual allocation (whole shares only) and rebalance the allocation such that it sums to 1.
For each non-cash asset, we include the log returns over T. These are represented by the vector [rn,t−1, . . . , rn,t−T+1] for asset n in the state matrix above. Additionally, in the last row, we include three market volatility indicators at time t: vol20, vol20/vol60, and VIXt, which we describe in detail in the Experiments section.
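To make the layout of St concrete, a sketch of how such an observation could be assembled from daily close prices is shown below; the helper name, the zero-padding of the cash row, and the ordering conventions are our assumptions rather than details given in the paper.

```python
import numpy as np
import pandas as pd

T = 60  # lookback: 1 weight column + T-1 daily log-return columns per row

def build_state(prices: pd.DataFrame, weights: np.ndarray,
                vol_feats: np.ndarray, t: int) -> np.ndarray:
    """Assemble a [(n + 1) x T] observation at integer time index t.

    prices    : [date x n assets] daily close prices
    weights   : length n + 1 current allocation (n assets followed by cash)
    vol_feats : standardized [vol20, vol20/vol60, VIX_t] regime features
    """
    log_ret = np.log(prices / prices.shift(1)).to_numpy()
    ret_block = log_ret[t - T + 1:t][::-1].T           # r_{t-1}, ..., r_{t-T+1}, shape [n, T-1]
    asset_rows = np.hstack([weights[:-1, None], ret_block])
    cash_row = np.zeros(T)
    cash_row[0] = weights[-1]                           # cash weight
    cash_row[1:1 + len(vol_feats)] = vol_feats          # the three volatility indicators
    return np.vstack([asset_rows, cash_row]).astype(np.float32)
```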


Reward

Rather than maximizing returns, most modern portfolio managers attempt to maximize risk-adjusted returns. Since we wish to utilize DRL for portfolio allocation, we want a reward function that helps optimize for risk-adjusted returns. The Sharpe ratio is the most widely-used measure for this; however, it is inappropriate for online learning settings as it is defined over a period of time T. To combat this, we use the Differential Sharpe Ratio Dt (Moody et al. 1998), which represents the risk-adjusted returns at each timestep t and has been found to yield more consistent returns than maximizing profit (Moody and Saffell 2001; Dempster and Leemans 2006). Therefore, an agent that aims to maximize its future Differential Sharpe rewards learns how to optimize for risk-adjusted returns.
We can define the Sharpe Ratio over a period of t returns Rt in terms of estimates of the first and second moments of the returns' distribution:

$$S_t = \frac{A_t}{K_t \, (B_t - A_t^2)^{1/2}}$$

with

$$A_t = \frac{1}{t}\sum_{i=1}^{t} R_i, \qquad B_t = \frac{1}{t}\sum_{i=1}^{t} R_i^2, \qquad K_t = \left(\frac{t}{t-1}\right)^{1/2}$$

where Kt is a normalizing factor.
A and B can be recursively estimated as exponential moving averages of the returns and standard deviation of returns on time scale η⁻¹. We can obtain a differential Sharpe ratio Dt by expanding St to first order in η:

$$S_t \approx S_{t-1} + \eta \left. D_t \right|_{\eta=0} + O(\eta^2)$$

where the Differential Sharpe Ratio Dt is:

$$D_t \equiv \frac{\partial S_t}{\partial \eta} = \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$

with

$$A_t = A_{t-1} + \eta \Delta A_t, \qquad B_t = B_{t-1} + \eta \Delta B_t,$$
$$\Delta A_t = R_t - A_{t-1}, \qquad \Delta B_t = R_t^2 - B_{t-1},$$

initialized with A0 = B0 = 0. We pick η ≈ 1/252 (a year has approximately 252 trading days).
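The recursion above translates into a few lines of stateful code. The sketch below is one possible incremental implementation; the guard against a vanishing denominator during the very first steps is our own assumption.

```python
class DifferentialSharpe:
    """Incremental Differential Sharpe Ratio D_t (Moody et al. 1998)."""

    def __init__(self, eta: float = 1 / 252):
        self.eta = eta      # time scale of the exponential moving moments
        self.A = 0.0        # A_0 = 0
        self.B = 0.0        # B_0 = 0

    def step(self, R_t: float) -> float:
        dA = R_t - self.A                       # Delta A_t = R_t - A_{t-1}
        dB = R_t ** 2 - self.B                  # Delta B_t = R_t^2 - B_{t-1}
        denom = (self.B - self.A ** 2) ** 1.5   # uses A_{t-1}, B_{t-1}
        D_t = 0.0 if denom < 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA                 # A_t = A_{t-1} + eta * Delta A_t
        self.B += self.eta * dB                 # B_t = B_{t-1} + eta * Delta B_t
        return D_t
```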
Learning Algorithm

RL algorithms can be broadly divided into two categories, model-based and model-free, depending on whether the agent has access to, or has to learn, a model of the environment. Model-free algorithms seek to learn the outcomes of their actions through collecting experience via algorithms such as Policy Gradient, Q-Learning, etc. Such an algorithm will try an action multiple times and adjust its policy (its strategy) based on the outcomes of its actions in order to optimize rewards.

Policy Optimization

Policy optimization methods are centered around the policy πθ(a|s), which is the function that maps the agent's state s to the distribution of its next action a. These methods optimize the parameters θ either by gradient ascent on the performance objective J(πθ) or by maximizing local approximations of J(πθ). This optimization is almost always performed on-policy: the experiences are collected using the latest learned policy, and that experience is then used to improve the policy. Some examples of popular policy optimization methods are A2C/A3C (Mnih et al. 2016) and PPO (Schulman et al. 2017). For our experiments we use PPO.

RL Environment Specifics

The environment serves as a wrapper for the market, sliding over historical data in an approach called market replay. It also serves as a broker and exchange; at every timestep, it processes the agent's actions and rebalances the portfolio using the latest prices and the given allocation. As the day shifts and new prices are received, it communicates these to the agent as observations, along with the Differential Sharpe reward. For the purposes of this study, we assume that there are no transaction costs in the environment, and we allow for immediate rebalancing of the portfolio.
At the beginning of each timestep t, the environment calculates the current portfolio value:

$$\mathrm{port\_val}_t = \sum_i P_{i,t} \cdot \mathrm{shares}_{i,t-1} + c_{t-1}$$

In the above expression, Pi,t is the price of index i at day t, sharesi,t−1 are the index shares held at t − 1, and ct−1 is the amount of cash at t − 1.
In order to calculate sharesi,t and ct, the environment allocates port_valt to the indices and cash according to the new weights wi. Next, it rebalances the portfolio weights wi to wi_reb by multiplying wi by the current portfolio value, rounding down the number of shares, and converting the remaining value into cash.
After rebalancing, the environment creates the next state St+1 and proceeds to the next timestep t + 1. It calculates the new portfolio value based on Pt+1 and computes the reward Rt = Dt, which it returns to the agent.
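A sketch of this whole-share rebalancing step, under the paper's assumptions of zero transaction costs and immediate fills (the function signature is ours):

```python
import numpy as np

def rebalance(prices_t: np.ndarray, shares_prev: np.ndarray,
              cash_prev: float, target_w: np.ndarray):
    """prices_t: current index prices; shares_prev/cash_prev: holdings from t-1;
    target_w: n + 1 agent weights (indices followed by cash), summing to 1."""
    port_val = float(prices_t @ shares_prev) + cash_prev   # port_val_t
    dollar_alloc = target_w[:-1] * port_val                # value assigned to each index
    shares = np.floor(dollar_alloc / prices_t)             # whole shares only
    cash = port_val - float(prices_t @ shares)             # residual value held as cash
    return shares, cash, port_val
```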
Experiments

Data & Features

For our experiments, we use daily adjusted close price data of the S&P500 sector indices (shown in Figure 1), the VIX index, and the S&P500 index between 2006 and 2021 (inclusive), extracted from Yahoo Finance. The price data is used to compute log returns, as described in a previous section.
To capture the market regime, we compute three volatility metrics from the S&P500 index. The first one, vol20, is the 20-day rolling-window standard deviation of the daily S&P500 index returns; the second, vol60, is the 60-day rolling-window standard deviation of the daily S&P500 index returns; and the third is the ratio of these two, vol20/vol60. This ratio indicates the short-term versus the long-term volatility trend. If vol20/vol60 > 1, the past 20-day daily returns of the S&P500 have been more volatile than the past 60-day daily returns, which might indicate a movement from a lower volatility to a higher volatility regime (and vice versa). We use the first and third metrics in the observation matrix, along with the value of the VIX index. These values are standardized by subtracting the mean and dividing by the standard deviation, where the mean and standard deviation are estimated using an expanding lookback window to prevent information leakage.
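The three regime features and their leakage-aware standardization can be computed with pandas as sketched below (column names are illustrative; a stricter variant would additionally lag the expanding statistics by one day):

```python
import pandas as pd

def regime_features(spx_close: pd.Series, vix_close: pd.Series) -> pd.DataFrame:
    """vol20, vol20/vol60 and the VIX, z-scored with expanding-window statistics."""
    ret = spx_close.pct_change()
    vol20 = ret.rolling(20).std()
    vol60 = ret.rolling(60).std()
    feats = pd.DataFrame({"vol20": vol20,
                          "vol_ratio": vol20 / vol60,
                          "vix": vix_close})
    mean = feats.expanding().mean()    # uses only data observed up to each date
    std = feats.expanding().std()
    return (feats - mean) / std
```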
Figure 1: S&P500 and its 11 sector indices between 2006 and 2021.

Deep RL Approach

Training Process

Although financial data is notoriously scarce (at least on the daily scale), we want to test the DRL framework across multiple years (backtests). Additionally, financial timeseries exhibit non-stationarity (Cont 2001); this can be tackled by retraining or fine-tuning models on the most recently available data. In light of these stylized facts, we devise our experiment framework as follows:
The data is split into 10 sliding window groups (shifted by 1 year). Each group contains 7 years' worth of data: the first 5 years are used for training, the next 1 year is a burn year used for training validation, and the last year is kept out-of-sample for backtesting.
During the first round of training, we initialize 5 agents (different seeds) with the hyperparameters described in the following section. All five agents start training on data from [2006−2011) and their performance is periodically evaluated using the validation period 2011. At the end of the first round of training, we save the best performing agent (based on highest mean episode validation reward). The final year (2012) is kept held out for backtesting.
This agent is used as a seed policy for the next group of 5 agents in the following training window [2007−2012), with validation year 2012 and testing year 2013, where this experiment is repeated. This process continues until we reach the final validation period of 2020, generating a total of 50 agents (10 periods × 5 agents) and 10 corresponding backtests (described in a following section).
PPO Implementation & Hyperparameters

We use the StableBaselines3 (Raffin et al. 2021) implementation of PPO, and report the hyperparameters used in Table 1. These were picked based on empirical studies (Henderson et al. 2018; Engstrom et al. 2019; Rao et al. 2020), as well as a coarse grid search over held-out validation data.

training timesteps    7.5M
n_envs                10
n_steps               756
batch_size            1260
n_epochs              16
gamma                 0.9
gae_lambda            0.9
clip_range            0.25
learning rate         3e-4 annealed to 1e-5

Table 1: Hyperparameters used for PPO.

Additionally, we make use of the vectorized SubprocVecEnv environment wrappers provided by StableBaselines3 to collect experience rollouts through multiprocessing across independent instances of our environment. Therefore, instead of training the DRL agent on one environment per step, we trained our model on n_envs = 10 environments per step in order to gain more diverse experience and speed up training.
Each round of training lasted a total of 7.5M timesteps, so as to have approximately 600 episodes per round per environment: (252 trading days per yr × 5 yrs per round) × (10 environments) × (600 episodes) ≈ 7.5M timesteps. The rollout buffer size was set to n_steps = 252 × 3 per environment, so as to collect sufficient experiences across environments. We set up the learning rate as a decaying function of the current progress remaining, starting from 3e−4 and annealed to a final value of 1e−5. We used a batch size of 1260 = (252 × 5), set the number of epochs when optimizing the surrogate loss to n_epochs = 16, picked the discount factor γ = 0.9, set the bias-variance trade-off factor for the Generalized Advantage Estimator gae_lambda = 0.9, and used clip_range = 0.25. Additionally, we use a [64, 64] fully-connected architecture with tanh activations, and initialize the policy with a log standard deviation log_std_init = −1.
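Putting the reported hyperparameters together, the training setup roughly corresponds to the StableBaselines3 configuration sketched below; make_env (a factory for the market-replay environment) is an assumption, since the environment itself is not released with the paper.

```python
import torch as th
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import SubprocVecEnv

def linear_schedule(start: float, end: float):
    """Anneal the learning rate as training progress goes from 1 down to 0."""
    return lambda progress_remaining: end + progress_remaining * (start - end)

# make_env is assumed to return one instance of the market-replay environment, e.g.
# make_env = lambda: PortfolioEnv(train_prices)   # hypothetical constructor

if __name__ == "__main__":                           # needed for SubprocVecEnv on some platforms
    env = SubprocVecEnv([make_env for _ in range(10)])   # n_envs = 10 environment copies

    model = PPO(
        "MlpPolicy",
        env,
        n_steps=756,                                 # 252 * 3 rollout steps per environment
        batch_size=1260,                             # 252 * 5
        n_epochs=16,
        gamma=0.9,
        gae_lambda=0.9,
        clip_range=0.25,
        learning_rate=linear_schedule(3e-4, 1e-5),
        policy_kwargs=dict(net_arch=[64, 64],        # [64, 64] fully-connected layers
                           activation_fn=th.nn.Tanh,
                           log_std_init=-1),
        verbose=1,
    )
    model.learn(total_timesteps=7_500_000)
```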
Mean-Variance Optimization Approach

As we wish to compare the model-free DRL approach with MVO, we equalize the training and operational conditions. For training, the MVO approach uses a 60-day lookback period (same as DRL) to estimate the means and covariances of assets. Asset means are simply the sample means over the lookback period. However, we do not directly use the sample covariance, as this has been shown to be subject to estimation error that is incompatible with MVO. To tackle this, we make use of the Ledoit-Wolf Shrinkage operator (Ledoit and Wolf 2004). Additionally, we enforce non-singular and positive-semi-definite conditions on the covariance matrices, setting negative eigenvalues to 0 and then rebuilding the non-compliant matrices.
Given the estimated means and covariances for a lookback period, we then optimize for the Sharpe Maximization problem and obtain the weights at every timestep. We use the implementation in PyPortfolioOpt (Martin 2021) to aid us with this process.
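A sketch of one rebalancing step of this pipeline with PyPortfolioOpt; the exact estimator settings are not given in the paper, so the return-estimation frequency, the zero risk-free rate, and the spectral covariance repair shown here are assumptions.

```python
from pypfopt import expected_returns, risk_models
from pypfopt.efficient_frontier import EfficientFrontier

def mvo_weights(price_window):
    """Max-Sharpe weights from a 60-day window of daily prices (DataFrame, one column per asset)."""
    mu = expected_returns.mean_historical_return(price_window)              # sample-mean based estimate
    S = risk_models.CovarianceShrinkage(price_window).ledoit_wolf()         # Ledoit-Wolf shrinkage
    S = risk_models.fix_nonpositive_semidefinite(S, fix_method="spectral")  # clip negative eigenvalues
    ef = EfficientFrontier(mu, S, weight_bounds=(0, 1))                     # long-only
    ef.max_sharpe(risk_free_rate=0.0)
    return ef.clean_weights()
```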
Evaluation & Backtesting

We evaluate the performance of both techniques through 10 independent backtests [2012−2021]. Both strategies start each backtest period with an all-cash portfolio allocation of $100,000. Then, the strategies trade daily using the portfolio weights obtained by each method, enforcing the weight constraints Σᵢ wᵢ = 1, 0 ≤ wᵢ ≤ 1, and ensuring only whole numbers of shares are purchased. By doing so, we obtain daily portfolio values (and returns), which we subsequently use to compute the statistics we discuss in the Results section. These are computed with the aid of the Python library Pyfolio.
DRL Agent: We evaluate the trained PPO agents in deterministic mode. For each backtest, the agent used has a gap burn year between the last day seen in training and the backtest period. For example, a DRL backtest carried out in 2012 would use an agent trained in [2006−2011), with 2011 being the burn year.
MVO: As the MVO approach does not require any training, it simply uses the past 60-day lookback period before any given day to calculate portfolio weights. For example, an MVO backtest starting January 2012 will use data starting October 2011 (this 60-day window shifts with each day).

Metric                 DRL        MVO
Annual return          0.1211     0.0653
Cumulative returns     0.1195     0.0650
Annual volatility      0.1249     0.1460
Sharpe ratio           1.1662     0.6776
Calmar ratio           2.3133     1.1608
Stability              0.6234     0.4841
Max drawdown          -0.3296    -0.3303
Omega ratio            1.2360     1.1315
Sortino ratio          1.7208     1.0060
Skew                  -0.4063    -0.3328
Kurtosis               2.7054     2.6801
Tail ratio             1.0423     0.9448
Daily value at risk   -0.0152    -0.0181

Table 2: Statistics for the DRL and MVO approaches. All metrics are averaged across 10 backtests (backtesting period: [2012−2021]), except Max Drawdown, which is reported as the maximum seen in any period.

Results

Figure 2 illustrates the performance metrics obtained by applying the aforementioned backtest process on all testing periods [2012−2021]. The DRL agent outperforms the MVO portfolio by exhibiting a higher Sharpe ratio and lower yearly maximum drawdowns in virtually every year throughout the backtest period (see Figure 2). It also outperforms the MVO portfolio in terms of having a marginally lower maximum drawdown.
To compare overall performance on the entire backtest period between the two methods, we compute the average performance across all 10 backtest periods. For DRL, we average the performance across the 5 agents (each trained with a different seed) for each year and then average performance across all backtest periods. Similarly, for MVO, we average its performance across all 10 years. Looking at Table 2, we observe that DRL annual returns and Sharpe ratio are 1.85x higher than those of the MVO portfolio. The DRL strategy's Sharpe ratio throughout the whole backtest period is 1.16, compared to 0.66 for MVO.
Figure 3a) and Figure 4a) plot the monthly returns over all backtest periods for the two methods. It is evident that DRL experiences steadier returns month-to-month than MVO. On the other hand, MVO swings between periods of high returns and periods of low returns a lot more frequently, without a steady positive return trajectory. Similarly, in Figure 3b) and Figure 4b), we plot the annual returns for the two methods. The vertical dashed line indicates the average annual return across the 10 backtests. For DRL we observe positive returns for almost all backtest years, which is a lot more consistent than the behavior of MVO's annual returns. Figure 3c) and Figure 4c) plot the distribution of monthly returns averaged across all months. The DRL monthly returns distribution has a lower standard deviation and spread than MVO, and a positive mean.
Further, we compute the daily portfolio change for each strategy by measuring the change in its portfolio weights. ∆pw is the sum of the absolute values of the element-wise differences between two allocations (ignoring the cash component). As buying and selling are treated as individual transactions, ∆pw ∈ [0.0, 2.0]. For example, take a case where the portfolio at time t − 1 is concentrated in non-cash asset A, and at time t is entirely concentrated in non-cash asset B. This requires selling all holdings of A and acquiring the equivalent shares in B, leading to ∆pw = 2.0.
Using the metric ∆pw, we observe that the Reinforcement Learning strategy has less frequent changes to its portfolio. In practice, this would result in lower average transaction costs. In particular, the average change in portfolio composition is nearly double for the Mean-Variance portfolio compared to the DRL strategy during the market downturn in March 2020, as shown in Figure 2, when trading conditions were particularly challenging (i.e., significantly lower market liquidity and elevated bid/ask spreads). Finally, the DRL strategy's performance is derived from the average of five individual agents initialized with different seeds, providing additional regularization, which is likely to result in a more stable out-of-sample strategy compared to the MVO strategy.
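For reference, the ∆pw turnover metric described above reduces to a one-liner; the convention that cash is the last element of the weight vector is our assumption.

```python
import numpy as np

def delta_pw(w_prev: np.ndarray, w_curr: np.ndarray) -> float:
    """Daily portfolio change: sum of absolute weight changes, excluding cash; lies in [0, 2]."""
    return float(np.abs(w_curr[:-1] - w_prev[:-1]).sum())

# All-in asset A at t-1, all-in asset B at t (cash last) -> delta_pw = 2.0
print(delta_pw(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))
```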
Figure 2: Backtest Results: MVO vs DRL Portfolio Allocation.

Figure 3: a) DRL Monthly Returns b) DRL Annual Returns c) DRL Monthly Distribution of Returns.

Figure 4: a) MVO Monthly Returns b) MVO Annual Returns c) MVO Monthly Distribution of Returns.
Conclusion

We highlight our key contributions as follows:
• We have designed a simple environment that serves as a wrapper for the market, sliding over historical data using market replay. The environment can allocate multiple assets and can be easily modified to reflect transaction costs.
• We compare our framework's performance during ten backtest experiments over different periods for the US Equities Market using S&P500 Sector indices. Our experiments demonstrate the improved performance of the deep reinforcement learning framework over Mean-Variance portfolio optimization.
• The profitability of the framework surpasses MVO in terms of Annual Returns, Sharpe ratio and Maximum Drawdown. Additionally, we observe that the DRL strategy leads to more consistent returns and more stable portfolios with decreased turnover. This has implications for live deployment, where transaction costs and slippage affect P&L.

Future Work

In our future work, we would like to model transaction costs and slippage, either by explicitly calculating them during asset reallocation or as a penalty term in our reward. Moreover, we would like to explore adding a drawdown minimization component to our reward that could potentially help the agent learn a more consistent trading strategy.
Another area of exploration is training a regime-switching model which balances its funds between two agents depending on market volatility (low vs. high): one a low-volatility trained agent and the other a high-volatility trained agent. We would like to compare performance between our current implicit regime parametrization and an explicit one. Further exploration of these research directions may produce significant insights into practical trading behavior.

Disclaimer: This paper was prepared for information purposes by the Artificial Intelligence Research group of J. P. Morgan Chase & Co. and its affiliates ("J. P. Morgan"), and is not a product of the Research Department of J. P. Morgan. J. P. Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020a. Bridging the gap between Markowitz planning and deep reinforcement learning. arXiv preprint arXiv:2010.09108.
Benhamou, E.; Saltiel, D.; Ungari, S.; and Mukhopadhyay, A. 2020b. Time your hedge with deep reinforcement learning. arXiv preprint arXiv:2009.14136.
Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Debiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. 2019. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680.
Buehler, H.; Gonon, L.; Teichmann, J.; and Wood, B. 2019. Deep hedging. Quantitative Finance, 19(8): 1271–1291.
Cao, J.; Chen, J.; Hull, J.; and Poulos, Z. 2021. Deep hedging of derivatives using reinforcement learning. The Journal of Financial Data Science, 3(1): 10–27.
Charpentier, A.; Elie, R.; and Remlinger, C. 2021. Reinforcement learning in economics and finance. Computational Economics, 1–38.
Chen, L.; He, S.; and Zhang, S. 2011. When all risk-adjusted performance measures are the same: In praise of the Sharpe ratio. Quantitative Finance, 11(10): 1439–1447.
Cong, L. W.; Tang, K.; Wang, J.; and Zhang, Y. 2021. AlphaPortfolio: Direct construction through deep reinforcement learning and interpretable AI. Available at SSRN, 3554486.
Cont, R. 2001. Empirical properties of asset returns: stylized facts and statistical issues. Quantitative Finance, 1(2): 223.
Cornuejols, G.; and Tütüncü, R. 2006. Optimization methods in finance, volume 5. Cambridge University Press.
Dempster, M. A.; and Leemans, V. 2006. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30(3): 543–552.
Deng, Y.; Bao, F.; Kong, Y.; Ren, Z.; and Dai, Q. 2016. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3): 653–664.
Du, J.; Jin, M.; Kolm, P. N.; Ritter, G.; Wang, Y.; and Zhang, B. 2020. Deep reinforcement learning for option replication and hedging. The Journal of Financial Data Science, 2(4): 44–57.
Du, X.; and Tanaka-Ishii, K. 2020. Stock embeddings acquired from news articles and price history, and an application to portfolio optimization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3353–3363.
Engstrom, L.; Ilyas, A.; Santurkar, S.; Tsipras, D.; Janoos, F.; Rudolph, L.; and Madry, A. 2019. Implementation matters in deep RL: A case study on PPO and TRPO. In International Conference on Learning Representations.
Ghahtarani, A.; Saif, A.; and Ghasemi, A. 2022. Robust portfolio selection problems: a comprehensive review. Operational Research, 1–62.
Hambly, B.; Xu, R.; and Yang, H. 2021. Recent advances in reinforcement learning in finance. arXiv preprint arXiv:2112.04553.
Henderson, P.; Islam, R.; Bachman, P.; Pineau, J.; Precup, D.; and Meger, D. 2018. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
i Alonso, M. N.; Srivastava, S.; et al. 2020. Deep reinforcement learning for asset allocation in US equities. Technical report.
Jiang, Z.; and Liang, J. 2017. Cryptocurrency portfolio management with deep reinforcement learning. In 2017 Intelligent Systems Conference (IntelliSys), 905–913. IEEE.
Kalayci, C.; Ertenlice, O.; Akyer, H.; and Aygören, H. 2017. A review on the current applications of genetic algorithms in mean-variance portfolio optimization. Pamukkale University Journal of Engineering Sciences, 23: 470–476.
Koratamaddi, P.; Wadhwani, K.; Gupta, M.; and Sanjeevi, S. G. 2021. Market sentiment-aware deep reinforcement learning approach for stock portfolio allocation. Engineering Science and Technology, an International Journal, 24(4): 848–859.
Ledoit, O.; and Wolf, M. 2004. Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management, 30(4): 110–119.
Li, B.; and Hoi, S. C. 2014. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3): 1–36.
Li, X.; Li, Y.; Zhan, Y.; and Liu, X.-Y. 2019. Optimistic bull or pessimistic bear: Adaptive deep reinforcement learning for stock portfolio allocation. arXiv preprint arXiv:1907.01503.
Liang, Z.; Chen, H.; Zhu, J.; Jiang, K.; and Li, Y. 2018. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940.
Lima Paiva, F. C.; Felizardo, L. K.; Bianchi, R. A. d. C.; and Costa, A. H. R. 2021. Intelligent trading systems: a sentiment-aware reinforcement learning approach. In Proceedings of the Second ACM International Conference on AI in Finance, 1–9.
Liu, X.-Y.; Yang, H.; Chen, Q.; Zhang, R.; Yang, L.; Xiao, B.; and Wang, C. D. 2020. FinRL: A deep reinforcement learning library for automated stock trading in quantitative finance. arXiv preprint arXiv:2011.09607.
Lu, D. W. 2017. Agent inspired trading using recurrent reinforcement learning and LSTM neural networks. arXiv preprint arXiv:1707.07338.
Markowitz, H. 1952. Portfolio Selection. The Journal of Finance, 7(1): 77–91.
Martin, R. A. 2021. PyPortfolioOpt: portfolio optimization in Python. Journal of Open Source Software, 6(61): 3066.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937. PMLR.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moody, J.; and Saffell, M. 2001. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4): 875–889.
Moody, J.; Wu, L.; Liao, Y.; and Saffell, M. 1998. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6): 441–470.
Nguyen, H.; and La, H. 2019. Review of deep reinforcement learning for robot manipulation. In 2019 Third IEEE International Conference on Robotic Computing (IRC), 590–595. IEEE.
Raffin, A.; Hill, A.; Gleave, A.; Kanervisto, A.; Ernestus, M.; and Dormann, N. 2021. Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268): 1–8.
Rao, N.; Aljalbout, E.; Sauer, A.; and Haddadin, S. 2020. How to make deep RL work in practice. arXiv preprint arXiv:2010.13083.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sharpe, W. F. 1998. The Sharpe ratio. Streetwise–the Best of the Journal of Portfolio Management, 169–185.
Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676): 354–359.
Soleymani, F.; and Paquet, E. 2020. Financial portfolio optimization with online deep reinforcement learning and restricted stacked autoencoder—DeepBreath. Expert Systems with Applications, 156: 113456.
Su, P.-H.; Gasic, M.; Mrksic, N.; Rojas-Barahona, L.; Ultes, S.; Vandyke, D.; Wen, T.-H.; and Young, S. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. arXiv preprint arXiv:1605.07669.
Sun, S.; Wang, R.; He, X.; Zhu, J.; Li, J.; and An, B. 2021. DeepScalper: A risk-aware deep reinforcement learning framework for intraday trading with micro-level market embedding. arXiv preprint arXiv:2201.09058.
Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT Press.
Théate, T.; and Ernst, D. 2021. An application of deep reinforcement learning to algorithmic trading. Expert Systems with Applications, 173: 114632.
Wang, J.; Zhang, Y.; Tang, K.; Wu, J.; and Xiong, Z. 2019. AlphaStock: A buying-winners-and-selling-losers investment strategy using interpretable deep reinforcement attention networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1900–1908.
Wang, Z.; Huang, B.; Tu, S.; Zhang, K.; and Xu, L. 2021. DeepTrader: A Deep Reinforcement Learning Approach for Risk-Return Balanced Portfolio Management with Market Conditions Embedding. Proceedings of the AAAI Conference on Artificial Intelligence, 35(1): 643–650.
Wu, X.; Chen, H.; Wang, J.; Troiano, L.; Loia, V.; and Fujita, H. 2020. Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences, 538: 142–158.
Yang, H.; Liu, X.-Y.; Zhong, S.; and Walid, A. 2020. Deep reinforcement learning for automated stock trading: An ensemble strategy. In Proceedings of the First ACM International Conference on AI in Finance, 1–8.
Ye, Y.; Pei, H.; Wang, B.; Chen, P.-Y.; Zhu, Y.; Xiao, J.; and Li, B. 2020. Reinforcement-learning based portfolio management with augmented asset movement prediction states. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 1112–1119.
Zhang, Z.; Zohren, S.; and Roberts, S. 2020. Deep reinforcement learning for trading. The Journal of Financial Data Science, 2(2): 25–40.
