Model-Free Reinforcement Learning For Asset Allocation: Practicum Final Report
This work would not have been possible without the advice and support of several
people. First and foremost, we would like to express our gratitude to Mahmoud
Mahfouz, Vice President of AI Research at J.P. Morgan Chase & Co., and his
colleague Srijan Sood for providing us with resources and valuable guidance during
this project. We would also like to thank our advisor, Prof. David Vernon, for his
guidance and constant support.
Table of Contents
1 Introduction
1.1 Background
1.2 Problem Definition
1.3 Aim and Objectives
1.4 Research Questions
1.5 Significance of Study
1.6 Limitations of Study
1.7 Structure of Report
2 Portfolio Management
2.1 Portfolio Optimization
2.2 Markowitz Model
2.3 Modern Portfolio Theory
2.4 Post-modern Portfolio Theory
4 Financial Environment
4.1 Assumptions
4.2 Description
4.3 State Space
4.4 Action Space
4.5 Reward functions
6 Trading Agents
6.1 Baseline Agents
6.2 Selection Criteria for RL Agents
6.3 Theoretical Description of Selected RL Agents
6.3.1 Normalized Advantage Function (NAF)
6.3.2 REINFORCE
6.3.3 Deep Deterministic Policy Gradient (DDPG)
6.3.4 Twin Delayed Deep Deterministic Policy Gradient (TD3)
6.3.5 Advantage Actor Critic (A2C)
6.3.6 Soft Actor Critic (SAC)
6.3.7 Trust Region Policy Optimization (TRPO)
6.3.8 Proximal Policy Optimization (PPO)
7 Experiments
7.1 Data - Dow Jones 30
7.2 Experiment Scope
7.2.1 Environment Parameters
7.2.2 RL Agent Hyper-parameters
7.3 Metrics
9 Conclusion
9.1 Contributions
9.2 Future Work
References
List of Figures
List of Tables
A.1 Table of Mean Performance at No Trading Cost & Log Returns Reward
A.2 Table of Mean Performance at No Trading Cost & Sharpe Ratio Reward
A.3 Table of Peak Performance at No Trading Cost
A.4 Table of Mean Performance at 0.1% Trading Costs & Log Returns Reward
A.5 Table of Mean Performance at 0.1% Trading Costs & Sharpe Ratio Reward
A.6 Table of Peak Performance at 0.1% Trading Costs
A.7 Table of Mean Performance at 1% Trading Costs & Log Returns Reward
A.8 Table of Mean Performance at 1% Trading Costs & Sharpe Ratio Reward
A.9 Table of Peak Performance at 1% Trading Costs & Sharpe Ratio Reward
Chapter 1
Introduction
1.1 Background
Asset allocation (or portfolio management) is the task of determining how to opti-
mally allocate funds of a finite budget into a range of financial instruments/assets
such as stocks (Filos, 2019). Coming up with a profitable trading strategy involves
making critical decisions on how to allocate capital across different stocks. Usually,
this allocation should maximize the expected return while minimizing the investment
risks involved (Gao et al., 2021). There are several existing portfolio management
strategies, and the state-of-the-art portfolio management frameworks are broadly
classified into baseline models, follow-the-winner, follow-the-loser, pattern-matching,
and meta-learning approaches (B. Li & Hoi, 2014).
While many of these state-of-the-art models achieve good results, they have some
limitations. First, they rely heavily on predictive models (Adämmer & Schüssler, 2020).
Such predictive models are usually not very successful at forecasting financial markets,
since these markets are highly stochastic and therefore very difficult to predict
accurately (Belousov, Abdulsamad, Klink, Parisi, & Peters, 2021).
Similarly, many of these models make simplistic and often unrealistic assumptions
about the financial signals' second-order and higher-order statistical moments (Gao
et al., 2021). Finally, these models are usually limited to discrete action spaces to
keep the resulting models tractable to solve (Filos, 2019).
Reinforcement learning (RL) has become increasingly popular in financial portfolio
management (J. Huang, Chai, & Cho, 2020). Reinforcement learning is the sub-field
of machine learning concerned with how intelligent machines ought to make deci-
sions in an environment to maximize the cumulative reward over time (Contributors,
2021). The addition of deep neural networks has been one of the breakthrough concepts
in reinforcement learning in recent years, achieving superhuman performance
in several tasks such as playing chess (D. Silver et al., 2018).
markets. Techniques that reduce the number of simplifying assumptions made could
yield better results. Thus, there may be a significant gap between what is possible
and what is currently available in the field.
the impact of the RL agent on the market. However, these assumptions are generally
considered reasonable and should not impact the portability of this study into the
real world.
Chapter 2
Portfolio Management
(Filos, 2019):
$$\sum_{i=1}^{M} w_{*,i} = 1, \qquad w_{*,i} \in \mathbb{R} \tag{2.1}$$
The Markowitz model assumes that the investor is risk-averse and thus determines
the optimal portfolio by selecting a portfolio that gives a maximum return for a
given risk or minimum risk for given returns. Therefore, the optimal portfolio is
selected as follows:
• From a set of portfolios with the same return, the investor will prefer the
portfolio with the lower risk.
• From a set of portfolios with the same risk, the investor will choose the portfolio
with the highest return.
The Markowitz model also presupposes that the analysis is based on a one-period
investment model and is thus only applicable in a one-period situation. This means
that the choice to distribute assets is made only at the start of the term. As a
result, the repercussions of this decision can only be seen after the term, and no
additional action may be taken during that time. This makes the model a static
model. Nevertheless, because the Markowitz model is simple to grasp and efficient to
compute, it is frequently used as the cornerstone of modern portfolio optimization
strategies. However, because it is based on a single-period investment model, it is
not ideal for continuous-time situations.
deviation of returns, is the most important aspect of portfolio creation. The MPT
is based on symmetrical risk, whereas the PMPT is based on asymmetrical risk.
The downside risk is quantified by target semi-deviation, also known as downside
deviation, and it reflects what investors dread the most: negative returns (J. Chen,
2021).
Chapter 3
3.1 Introduction
Machine learning has become increasingly important in the finance industry across
several tasks. Figure 3.1 shows a taxonomy of some of the typical applications of
machine learning in finance.
For financial tasks, different classes of machine learning algorithms have been fre-
quently used. These classes include generalized linear models, tree-based models,
kernel-based models, and neural networks. Similarly, the data has widely varied
from structured data such as tables to unstructured data such as text. Table 3.1
itemizes various finance research articles where some of these models were used.
Table 3.1: Table of Previous ML Research in Finance.
| Reference | Task | Data | Method |
| --- | --- | --- | --- |
| (L. Chen, Pelger, & Zhu, 2020) | Stochastic discount factor | Firm characteristics, historical return, and macroeconomic indicators | Generative Adversarial Networks |
| (Colombo & Pelagatti, 2020) | The direction of changes in exchange rates | Market uncertainty indicator data | Support Vector Machines |
| (Croux, Jagtiani, Korivi, & Vulanovic, 2020) | Loan default | Loan data, borrowers' characteristics, macroeconomic indicators | LASSO |
| (Gomes, Carvalho, & Carvalho, 2017) | Anomaly detection | Chamber of Deputies open data, companies data from Secretariat of Federal Revenue of Brazil | Deep Autoencoders |
| (Goumagias, Hristu-Varsakelis, & Assael, 2018) | Tax evasion prediction | Empirical data from Greek firms | Deep Q-Learning |
| (Gu, Kelly, & Xiu, 2020) | Stock returns forecasting | Firm characteristics, historical returns, macroeconomic indicators | Linear Regression, Random Forests, Boosted Trees, Neural Networks |
| (Gulen, Jens, & Page, 2020) | Estimation of heterogeneous treatment effects of debt covenant violations on firms' investment levels | Firms characteristic data | Causal Forests |
| (Y. Huang et al., 2016) | Price direction prediction | Tweets | Neural Networks |
| (Iwasaki & Chen, 2018) | Stock price prediction | Analyst reports | LSTM, CNN, Bi-LSTM |
| (Jiang & Liang, 2017) | Cryptocurrency portfolio management | Cryptocurrency price data | RNN, LSTM, CNN |
| (Kvamme, Sellereite, Aas, & Sjursen, 2018) | Mortgage default prediction | Mortgage data from Norwegian financial service group, DNB | Convolutional Neural Networks and Random Forest |
| (Lahmiri & Bekiros, 2019) | Corporate bankruptcy | Firms' financial statements, market data, and general risk indicators | Neural Networks |
| (H. Li, Shen, & Zhu, 2018) | Stock price prediction | Stocks data | RNN, LSTM, GRU |
| (Liang, Chen, Zhu, Jiang, & Li, 2018) | Portfolio allocation | Stocks data | Deep Reinforcement Learning |
| (Luo, Wu, & Wu, 2017) | Corporate credit rating | CDS data | Deep Belief Network and Restricted Boltzmann Machines |
| (Ozbayoglu, Gudelek, & Sezer, 2020) | Financial distress prediction | Financial news | SVM, Deep Belief Network, LSTM |
| (Damrongsakmethee & Neagoe, 2017) | Credit score classification | Credit scores data | MLP and CNN |
| (Paula, Ladeira, Carvalho, & Marzagao, 2016) | Financial fraud and money laundering | Databases of foreign trade of the Secretariat of Federal Revenue of Brazil | Deep Autoencoders |
| (Reichenbacher, Schuster, & Uhrig-Homburg, 2020) | Future bond liquidity | Bond transactions and characteristics data | Elastic Nets and Random Forests |
| (Antunes, Ribeiro, & Pereira, 2017) | Bankruptcy prediction | Financial statements | Deep Belief Network |
| (Rossi & Utkus, 2020) | Investors portfolio allocation and performance and effects of robo-advising | Investor characteristics | Regression Trees |
| (Ozbayoglu et al., 2020) | Credit card default | Account data and macroeconomic indicators | Decision Trees, Random Forest, Boosted Trees |
| (Taghian, Asadi, & Safabakhsh, 2020) | Trading signal generation | Market data | Neural Network, Genetic Programming, and Reinforcement Learning |
| (Spilak, 2018) | Dynamic portfolio allocation | Cryptocurrency data | LSTM, RNN, MLP |
| (Ozbayoglu et al., 2020) | Stock classification | Stocks data | Deep RBM Encoder-Classifier Network |
| (Tian, Yu, & Guo, 2015) | Corporate bankruptcy | Firms' financial statements and market data | LASSO |
| (Vamossy, 2021) | Investor's emotions | StockTwits posts | Deep Neural Networks |
| (Deng, Ren, Kong, Bao, & Dai, 2016) | Stock price prediction and trading signal generation | Stock price data | Fuzzy Deep Direct Reinforcement Learning |
the agent beat benchmarks, although its performance was not statistically significantly
different from that of the S&P 500 index.
In deep reinforcement learning for portfolio management (Hieu, 2020), the authors
examined three cutting-edge continuous policy gradient algorithms - deep determin-
istic policy gradient (DDPG), generalized deterministic policy gradient (GDPG),
and proximal policy optimization (PPO). The authors concluded that the first two
performed significantly better than the third.
A research work (Gao et al., 2021) proposed a hierarchical reinforcement learning
framework that can manage an arbitrary number of assets while accounting for
transaction fees. The framework performs well on real-world market data.
However, since only four equities were evaluated, there is some doubt about the
framework's capacity to scale to larger asset universes.
Chapter 4
Financial Environment
4.1 Assumptions
The real-world financial market is a highly complex system. In this work, we model
the financial environment as a discrete-time, stochastic dynamic system with the
following simplifying assumptions:
• There is no dependence on explicit stock price prediction (model-free).
• The actions of the RL agent should be continuous.
• There will be zero slippage.
• The RL agents have zero market impact.
• Short-selling is prohibited.
• The environment is a partially observable system.
These simplifying assumptions are consistent with similar works in the literature
(Filos, 2019; Gao et al., 2021; Betancourt & Chen, 2021; Jiang & Liang, 2017;
Hu & Lin, 2019).
4.2 Description
The environment takes in the following inputs:
• Data: This is either a dataframe or a list of stock tickers. If a dataframe is
provided, the index of the dataframe must be of type datetime (or can be
cast to datetime). Each column should contain the prices of the stock name
provided in the header over the time period.
• Episode Length: This refers to how long (in days) the agent is allowed to
interact with the environment.
• Returns: The environment has two reward signals (see section 4.5). The
returns variable is a boolean flag used to choose between these reward signals.
When set to true, the environment uses the log-returns as the reward signal.
When set to false, it uses the differential Sharpe ratio.
• Trading Cost Ratio: This is the percentage of the stock price that will be
attributed to the cost of either selling or buying a unit stock.
• Lookback Period: This is a fixed-sized window used to control how much
historical data to return to the agent as the observation at each timestep.
• Initial Investment: This refers to the initial amount available to the agent to
spend on all the available stocks in the environment.
• Retain Cash: This is a boolean flag value used to inform the environment
whether the agent can keep a cash element or not.
• Random Start Range: To reduce over-fitting, the agent starts each episode
from a random offset; this value controls the size of that range.
• DSR Constant: This is a smoothing parameter for the differential Sharpe ratio.
• Add Softmax: This is a boolean flag that controls whether the environment
should perform the softmax operation or not. This is required to support the
out-of-the-box RL agents from other libraries.
• Start Date: If a list of tickers was provided instead of a dataframe, this start
date parameter is used for the yahoo finance data download.
• End Date: If a list of tickers was provided instead of a dataframe, this end
date parameter is used for the yahoo finance data download.
• Seed: This is a seed value for environment reproducibility.
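Before turning to the state and action spaces, the following is a minimal, self-contained sketch of how an environment with these inputs might fit together. The class name PortfolioEnvSketch, its defaults, and the internal mechanics are illustrative assumptions rather than the project's actual implementation (ticker download, the retain-cash option, and the differential Sharpe ratio reward are omitted).

```python
import numpy as np
import pandas as pd


class PortfolioEnvSketch:
    """Illustrative environment: observations are a lookback window of log-returns,
    actions are raw scores mapped onto portfolio weights, and rewards are the
    log-returns of the portfolio net of trading costs."""

    def __init__(self, prices: pd.DataFrame, episode_length=252, lookback_period=30,
                 trading_cost_ratio=0.001, initial_investment=1_000_000,
                 random_start_range=100, add_softmax=True, seed=None):
        self.log_returns = np.log(prices / prices.shift(1)).dropna()
        self.episode_length = episode_length
        self.lookback = lookback_period
        self.cost_ratio = trading_cost_ratio
        self.initial_investment = initial_investment
        self.random_start_range = random_start_range
        self.add_softmax = add_softmax
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # random starting offset to discourage over-fitting to a single start date
        high = max(1, min(self.random_start_range,
                          len(self.log_returns) - self.lookback - self.episode_length))
        self.t = self.lookback + int(self.rng.integers(0, high))
        self.steps = 0
        self.portfolio_value = self.initial_investment
        n_assets = self.log_returns.shape[1]
        self.weights = np.full(n_assets, 1.0 / n_assets)
        return self._observation()

    def _observation(self):
        return self.log_returns.iloc[self.t - self.lookback:self.t].values

    def step(self, action):
        action = np.asarray(action, dtype=float)
        if self.add_softmax:                       # map raw scores onto the weight simplex
            action = np.exp(action - action.max())
            action = action / action.sum()
        prev_value = self.portfolio_value
        cost = self.cost_ratio * np.abs(action - self.weights).sum() * prev_value
        asset_returns = np.exp(self.log_returns.iloc[self.t].values) - 1.0
        self.portfolio_value = (prev_value - cost) * (1.0 + action @ asset_returns)
        reward = np.log(self.portfolio_value / prev_value)   # log-return reward signal
        self.weights = action
        self.t += 1
        self.steps += 1
        done = self.steps >= self.episode_length or self.t >= len(self.log_returns)
        return self._observation(), reward, done, {}
```

A gym-style wrapper around such a class would expose the same reset/step interface expected by off-the-shelf RL agents.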
$$P_t = \left[\log\frac{P_{1,t}}{P_{1,t-1}},\; \log\frac{P_{2,t}}{P_{2,t-1}},\; \ldots,\; \log\frac{P_{M,t}}{P_{M,t-1}}\right]^T \in \mathbb{R}^M \tag{4.1}$$

$$A_t = \left[A_{t,1}, A_{t,2}, \ldots, A_{t,M}\right]^T \in \mathbb{R}^M \tag{4.2}$$

$$\sum_{i=1}^{M} A_{i,t} = 1 \tag{4.3}$$

$$0 \le A_{i,t} \le 1 \quad \forall\, i, t \tag{4.4}$$

where M is the total number of assets in the portfolio.
If the agent is required to keep a cash element, the weight vector’s size is increased
by one yielding an extended portfolio vector that satisfies equations 4.2 to 4.4 with
the cash element treated as an additional asset.
where LR denotes the log-returns and $0 \le \alpha \le 1$ is a smoothing factor.
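The differential Sharpe ratio reward can be computed incrementally. The sketch below follows the standard Moody and Saffell formulation with exponentially weighted estimates of the first and second moments of returns; it is assumed to be close to, but not necessarily identical with, the implementation used in this work.

```python
import numpy as np


class DifferentialSharpeRatio:
    """Incremental differential Sharpe ratio with exponentially weighted
    moment estimates (Moody & Saffell style formulation)."""

    def __init__(self, alpha=1e-3):
        self.alpha = alpha      # smoothing factor, 0 <= alpha <= 1
        self.A = 0.0            # EWMA of returns
        self.B = 0.0            # EWMA of squared returns

    def step(self, r):
        dA = r - self.A
        dB = r ** 2 - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        dsr = 0.0 if denom < 1e-12 else (self.B * dA - 0.5 * self.A * dB) / denom
        # update the moving moment estimates after computing the reward
        self.A += self.alpha * dA
        self.B += self.alpha * dB
        return dsr
```

At each environment step, the latest portfolio return is passed to step(), and the returned value is used directly as the reward.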
Chapter 5
The environment can be anything that processes an agent's actions and the
consequences of those actions. The environment's input is an agent's action A(t)
performed in the current state S(t), and the environment’s output is the next
state S(t + 1) and reward R(t + 1). A state is a location or position in each
world/environment that an agent may reach or visit. The reward function R(t)
returns a numerical value to an agent for being in a state after performing an ac-
tion. Rewards indicate whether or not a state is valuable and how valuable that
state is (Contributors, 2021). According to the reward hypothesis, all objectives
may be characterized by maximizing the predicted cumulative reward. Actions are
anything that an agent is permitted to do in a given context within the environment.
An RL agent may include one or more of three components - a policy, a value
function, and a model. A policy directs an agent’s decision-making in each state. It
is a mapping between a set of states and a set of actions. An optimum policy provides
the best long-term benefits. A value function determines the quality of each state or
state-action pair. A model is an agent’s representation of the environment, through
which the agent predicts what the environment will do next (Filos, 2019; D. Silver
et al., 2018). RL agents can be classified into different classes based on which of
these three components they have (Figure 5.2). Model-free RL methods rely on trial
and error to update their experience and information about the given environment
because they lack knowledge of the transition model and reward function. Tables
5.1 and 5.2 provide a summary of some of the common model-free RL methods.
Table 5.1: Table of Model-Free RL Methods.
Algorithm Description
A2C Advantage Actor-Critic Algorithm
A3C Asynchronous Advantage Actor-Critic Algorithm
DDPG Deep Deterministic Policy Gradient
DQN Deep Q Network
Monte Carlo Every-visit Monte Carlo
NAF Q-Learning with Normalized Advantage Functions
PPO Proximal Policy Optimization
Q-learning State-action-reward-state
Q-learning - Lambda State action reward state with eligibility traces
REINFORCE Monte-Carlo sampling of policy gradient methods
SAC Soft Actor-Critic
SARSA State-action-reward-state-action
TD3 Twin Delayed Deep Deterministic Policy Gradient
TRPO Trust Region Policy Optimization
5.2 Reinforcement Learning Approaches
There are mainly four common approaches to reinforcement learning - dynamic pro-
gramming, Monte Carlo methods, temporal difference and policy gradient. Dynamic
programming is a two-step process that uses the Bellman equation to improve the
policy after the policy evaluation has been done. Dynamic programming is used
when the model is fully known. Monte Carlo methods learn from raw experience
episodes without modeling the environmental dynamics and compute the observed
mean returns as an approximation of the predicted return. This means that learning
can only occur when all episodes have been completed (Betancourt & Chen, 2021).
Temporal Difference methods learn from incomplete episodes by using bootstrap-
ping to estimate the value. Temporal difference may be thought of as a hybrid of
dynamic programming and Monte Carlo. Unlike temporal difference approaches,
which rely on the value estimate to establish the optimum policy, policy gradient
methods rely directly on policy estimation. The policy, $\pi(a \mid s; \theta)$, is parameterized
by $\theta$, and gradient ascent is applied to find the $\theta$ that yields
the highest return. Policy learning can take place on-policy or off-policy. On-policy RL
agents analyze the same policy that generated the action. Off-policy RL agents, on
the other hand, analyze a policy that is not necessarily the same as the one that
generated the action (B. Li & Hoi, 2014).
Chapter 6
Trading Agents
• Random Allocation
This involves setting the portfolio weights such that:
$$A_{t,i} = \frac{f(i)}{\sum_{i=1}^{M} f(i)}, \quad \text{where } f(i) \text{ is a function of a random variable} \tag{6.2}$$

$$A_{t,i} = c_i \quad \forall\, i, \qquad 0 \le c_i \le 1 \tag{6.3}$$
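A short sketch of how the baseline weight vectors implied by equations 6.2 and 6.3 might be generated follows; the choice of a uniform random draw for f(i) and the value of M are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 30  # number of assets in the portfolio (illustrative)


def random_allocation():
    """Eq. 6.2: weights proportional to f(i), here assumed to be uniform random draws."""
    f = rng.random(M)
    return f / f.sum()


def constant_allocation(c=None):
    """Eq. 6.3: fixed weights c_i in [0, 1]; the uniform baseline uses c_i = 1/M."""
    return np.full(M, 1.0 / M) if c is None else np.asarray(c, dtype=float)


weights = random_allocation()
assert np.isclose(weights.sum(), 1.0) and np.all((weights >= 0) & (weights <= 1))
```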
6.2 Selection Criteria for RL Agents
The RL agents were chosen using three criteria. First, the RL agent must be model-
free as the focus of this work is on model-free RL. Second, the agent must have
been used in similar works in literature. Finally, the agent must support continuous
action and state spaces as the financial environment has continuous action and state
spaces. Using Table 5.2 as the starting point, fifteen agents satisfied the first two
criteria. However, only nine of those fifteen agents satisfied the third criterion.
A3C was dropped because it is a modification of A2C whose main benefit is faster
computation when training is distributed. Since we do not use distributed training,
an A3C agent would not have offered an additional advantage. Thus, at the end of
the selection process, we were left with the following eight agents: A2C, DDPG,
NAF, PPO, REINFORCE, SAC, TD3, and TRPO.
• Periodically Updated Target
A target network, which is just a clone of the original Q network, is established
and updated on a regular basis. This improvement stabilizes the training by
removing the short-term fluctuations.
However, one significant limitation of the DQN and many of its variants is that they
cannot be used in continuous action spaces. (Gu, Lillicrap, Sutskever, & Levine, 2016)
proposed the normalized advantage function (NAF), a continuous variant of the
Q-learning algorithm, as an alternative to policy-gradient and actor-critic approaches.
The NAF model applies Q-learning with experience replay to continuous action spaces
(Algorithm 2). (Gu et al., 2016) also investigated the use of learned models for
speeding up model-free reinforcement learning and found that repeatedly refitted
local linear models are particularly effective. The authors represented the advantage
function (A) as a quadratic function of the action.
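In the notation of (Gu et al., 2016), where $x$ denotes the state and $u$ the action, this quadratic parameterization is

$$A(x, u \mid \theta^A) = -\tfrac{1}{2}\left(u - \mu(x \mid \theta^\mu)\right)^T P(x \mid \theta^P)\left(u - \mu(x \mid \theta^\mu)\right)$$

with $P(x \mid \theta^P)$ a state-dependent positive-definite matrix, so that $Q(x, u) = V(x \mid \theta^V) + A(x, u \mid \theta^A)$ is always maximized by $u = \mu(x \mid \theta^\mu)$.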
6.3.2 REINFORCE
Policy-Gradient algorithms learn a parameterized policy that can select actions with-
out consulting a value function. While a value function may be used to learn the
policy parameters, it is not required for selecting actions. In equation 6.7, we can
express the policy as the probability of action a being taken at time t given that the
environment is in state s at time t with parameter θ (Sutton & Barto, 2018).
Policy gradient (PG) approaches are model-free methods that attempt to maximize
the RL goal without using a value function. The RL goal, also known as the perfor-
mance measure J(θ), is defined as the total of rewards from the beginning state to
the terminal state for an episodic task and the average return for a continuous task
when policy πθ is followed.
$$J(\theta) \doteq v_{\pi_\theta}(s_0) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t} \gamma^t r(s_t, a_t)\right] \tag{6.8}$$
Where the value function vπθ (s0 ) is the value of the expected discounted sum of
rewards for a trajectory starting at state s0 and following policy πθ until the episode
terminates. This objective can be evaluated in an unbiased manner by sampling N
trajectories from the environment using policy πθ :
$$J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T_i-1} \gamma^t r(s_{i,t}, a_{i,t}) \tag{6.9}$$

where $T_i$ is the timestep at which trajectory $\tau_i$ terminates.
The probability distribution πθ (a|s) can be defined:
• over a discrete action space, in which case the distribution is usually categorical
with a softmax over the action logits.
• over a continuous action space, in which case the output is the parameters of
a continuous distribution (e.g., the mean and variance of a Gaussian).
The gradient with respect to θ according to the policy gradient theorem can be
approximated over N trajectories as:
$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T_i-1} G_{i,t}\, \nabla_\theta \log \pi_\theta\left[a_{i,t} \mid s_{i,t}\right]\right] \tag{6.10}$$
where $a_{i,t}$ is the action taken at time $t$ of episode $i$ in state $s_{i,t}$, $T_i$ is the
timestep at which trajectory $\tau_i$ terminates, and $G_{i,t}$ is a function of the reward
assigned to this action. For REINFORCE, $G_{i,t}$ is the sum of rewards in trajectory $i$.
Algorithm 3: REINFORCE: Monte-Carlo Policy Gradient Control (episodic)
Initialize policy network with weights $\theta$
for each episode $\{s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T\}$ sampled from policy $\pi_\theta$: do
    for $t = 0 \ldots T-1$: do
        Evaluate the gradient $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{t=0}^{T_i-1} G_{i,t}\, \nabla_\theta \log \pi_\theta\left[a_{i,t} \mid s_{i,t}\right]\right]$
        Update the policy parameters $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
    end
end
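A minimal PyTorch sketch of the update in Algorithm 3 for a continuous (Gaussian) policy is given below; the network architecture, the use of the discounted trajectory return as the weighting term, and all names are illustrative assumptions rather than the exact implementation used in this work.

```python
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Outputs the mean of a Gaussian over actions; the log-std is a free parameter."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, act_dim))
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def dist(self, obs):
        return torch.distributions.Normal(self.body(obs), self.log_std.exp())


def reinforce_update(policy, optimizer, trajectory, gamma=0.99):
    """One Monte-Carlo policy-gradient step on a single sampled trajectory,
    where `trajectory` is a list of (obs_tensor, action_tensor, reward) tuples."""
    obs = torch.stack([t[0] for t in trajectory])          # (T, obs_dim)
    acts = torch.stack([t[1] for t in trajectory])         # (T, act_dim)
    rews = torch.tensor([t[2] for t in trajectory])        # (T,)
    # G: the discounted return of the whole trajectory, used to weight every step
    discounts = gamma ** torch.arange(len(rews), dtype=torch.float32)
    G = (discounts * rews).sum()
    log_prob = policy.dist(obs).log_prob(acts).sum(-1)     # log pi(a_t | s_t)
    loss = -(G * log_prob).sum()                           # ascend J(theta) by descending -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice the sampled actions would be passed through the environment's softmax so that they form valid portfolio weights.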
Limitations
1. The update procedure is inefficient: each trajectory is discarded after the policy
has been executed and the parameters have been updated.
2. The gradient estimate is noisy, and there is a chance that the gathered trajectory
does not accurately represent the policy.
3. There is no explicit credit assignment. A trajectory can contain many good and
bad actions, and whether these actions are reinforced is determined only by the
final return.
Other Policy Gradient methods like A2C, DDPG, TD3, SAC, and PPO were created
to overcome the limitations of REINFORCE.
temporal difference loss $L$:

$$L(w) = \frac{1}{N}\sum_i \left(y_i - \hat{q}(s_i, a_i, w)\right)^2 \tag{6.12}$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i \nabla_a \hat{q}(s, a, w)\Big|_{s=S_i,\, a=\pi(S_i)}\, \nabla_\theta \pi(s, \theta)\Big|_{s=S_i} \tag{6.13}$$
To maximize the Q-value function while reducing the temporal difference loss, we
must improve the performance measure J. The actor takes the state as input and
outputs an action, whereas the critic takes both the state and the action as input
and outputs the value of the Q function. The critic is trained with gradient
temporal-difference learning, while the actor parameters are updated using the policy
gradient theorem. The essential principle of this design is that the policy network
acts, producing an action, and the Q-network critiques that action.
The use of non-linear function approximators such as neural networks, which are
required to generalize over vast state spaces, means that convergence is no longer
guaranteed, as it was with Q-learning. As a result, experience replay is required in
order to generate independent and identically distributed samples. In addition, target
networks must be used to avoid divergence when updating the critic network. In
DDPG, the target network parameters are updated differently than in DQN, where
the target network is copied every C steps. Instead, a "soft" update adjusts the
target network parameters at each time step as shown:
$$w^- \leftarrow \tau w + (1-\tau)\, w^- \qquad \theta^- \leftarrow \tau \theta + (1-\tau)\, \theta^- \tag{6.14}$$
control issues.
Algorithm 4: DDPG (Lillicrap et al., 2019)
Randomly initialize critic network $Q(s, a \mid \theta^Q)$ and actor $\mu(s \mid \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode = 1...M: do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for t = 1...T: do
        Select action $a_t = \mu(s_t \mid \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update the critic by minimizing the loss: $L(w) = \frac{1}{N}\sum_i (y_i - \hat{q}(s_i, a_i, w))^2$
        Update the actor policy using the sampled policy gradient:
        $\nabla_{\theta^\mu} J(\theta) \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\big|_{s_i}$
        Update the target networks:
        $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$
        $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
    end
end
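The inner-loop updates of Algorithm 4 can be written compactly in PyTorch. The sketch below assumes pre-built actor and critic modules and a replay minibatch of tensors; it is illustrative rather than the exact implementation used in this work.

```python
import torch
import torch.nn.functional as F


def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG step on a replay minibatch (s, a, r, s2); r is shaped (N, 1)."""
    s, a, r, s2 = batch
    with torch.no_grad():
        y = r + gamma * target_critic(s2, target_actor(s2))      # TD target
    critic_loss = F.mse_loss(critic(s, a), y)                    # eq. 6.12
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    actor_loss = -critic(s, actor(s)).mean()                     # sampled policy gradient (eq. 6.13)
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # "soft" target updates, eq. 6.14
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_targ in zip(net.parameters(), target.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()
```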
framework and the places where these algorithmic tricks were used.
$$\nabla_\theta J(\theta) \sim \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t) \tag{6.16}$$
Thus, the evaluation of an action examines not only how good the action is, but also
how much better it could be, so that the high variance of the policy network is reduced
and the model becomes more robust (Konda & Gao, 2000). The value function of the
critic component can also make use of the temporal difference error (TD error)
computed using the TD learning approach.
The advantage of the actor-critic algorithm is that it separates the policy from the
value function by learning both with function approximators: the critic is the
value-function approximator, which learns the value estimate and passes it to the
actor, while the actor is a policy approximator that learns a stochastic policy and
selects actions using a gradient-based policy update (Fig. 6.1).
Figure 6.1: Actor Critic Algorithm Framework (Sutton & Barto, 2018)

$$= \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)\right] \tag{6.21}$$
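A combined A2C objective, pairing the policy-gradient term of equation 6.16 with a critic regression term, can be sketched as follows. The entropy bonus and the coefficient values are common additions assumed here for illustration, not details taken from the text.

```python
import torch


def a2c_loss(policy_dist, values, actions, returns, value_coef=0.5, entropy_coef=0.01):
    """A2C loss for one rollout: advantage-weighted policy-gradient term,
    critic regression term, and an entropy bonus.
    `policy_dist` is a torch.distributions object over the rollout actions."""
    advantages = returns - values.detach()                       # A(s_t, a_t) estimate
    policy_loss = -(policy_dist.log_prob(actions).sum(-1) * advantages).mean()
    value_loss = (returns - values).pow(2).mean()                # critic regression
    entropy = policy_dist.entropy().sum(-1).mean()               # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```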
6.3.6 Soft Actor Critic(SAC)
As a bridge between stochastic policy optimization and DDPG techniques, the Soft
Actor-Critic Algorithm is an off-policy algorithm that optimizes a stochastic pol-
icy. Entropy regularization is the primary aspect of SAC. The policy is trained to
optimize a trade-off between anticipated return and entropy, which is a measure of
the policy’s unpredictability. SAC is thoroughly described by first establishing the
entropy regularized reinforcement learning setup and the value functions associated
with it (Haarnoja, Zhou, Abbeel, & Levine, 2018).
The SAC algorithm learns a policy as well as two Q-functions, Q1 and Q2. There
are two basic SAC variants: one that uses a constant entropy regularization
coefficient and another that enforces an entropy constraint by adapting the coefficient
during training. The constant-coefficient variant is the simpler of the two, although
the entropy-constrained variant is more widely used, according to (Haarnoja et al.,
2018). The Q-functions are learned in a manner similar to TD3, described earlier in
this study, with a few major modifications.
At each step, the policy should act to maximize both the expected future return and
the expected future entropy; that is, it should aim to maximize the entropy-regularized
value $V^\pi(s)$.
Algorithm 7: Soft Actor-Critic
Input: initial policy parameters $\theta$, Q-function parameters $\phi_1$, $\phi_2$, empty replay buffer $D$
Set target parameters equal to main parameters: $\phi_{\text{targ},1} \leftarrow \phi_1$, $\phi_{\text{targ},2} \leftarrow \phi_2$
Repeat the following steps until convergence:
    Observe state $s$ and select action $a \sim \pi_\theta(\cdot \mid s)$
    Execute $a$ in the environment
    Observe next state $s'$, reward $r$, and done signal $d$ indicating whether $s'$ is terminal; store $(s, a, r, s', d)$ in replay buffer $D$
    If $s'$ is terminal, reset the environment state
    if it's time to update then
        for k = 0, 1, 2, ... do
            Randomly sample a batch of transitions $B = \{(s, a, r, s', d)\}$ from $D$
            Compute targets for the Q-functions:
            $y(r, s', d) = r + \gamma(1-d)\left(\min_{i=1,2} Q_{\phi_{\text{targ},i}}(s', \tilde{a}') - \alpha \log \pi_\theta(\tilde{a}' \mid s')\right), \quad \tilde{a}' \sim \pi_\theta(\cdot \mid s')$
            where $\tilde{a}_\theta(s)$ is a sample from $\pi_\theta(\cdot \mid s)$ that is differentiable with respect to $\theta$ via the reparametrization trick. Update the Q-functions and the policy using these targets, then update the target networks.
        end
    end
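The Q-function targets in Algorithm 7 translate almost line-for-line into code. The sketch below assumes a policy callable that returns a reparameterized action sample together with its log-probability; names and shapes are illustrative.

```python
import torch


def sac_q_targets(q1_targ, q2_targ, policy, batch, gamma=0.99, alpha=0.2):
    """Entropy-regularized, clipped double-Q targets from Algorithm 7.
    `policy(s)` is assumed to return (sampled action, log-probability)."""
    s, a, r, s2, d = batch
    with torch.no_grad():
        a2, logp_a2 = policy(s2)                       # a' ~ pi_theta(. | s')
        q_min = torch.min(q1_targ(s2, a2), q2_targ(s2, a2))
        y = r + gamma * (1.0 - d) * (q_min - alpha * logp_a2)
    return y
```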
Let $\pi$ denote a stochastic policy $\pi: S \times A \to [0, 1]$, and let $\eta(\pi)$ denote its expected
discounted reward:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t)\right], \quad \text{where } s_0 \sim \rho_0(s_0),\; a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$$
In TRPO, the state-action value function $Q_\pi$, the value function $V_\pi$, and the advantage
function $A_\pi$ are defined as:

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$$

$$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\left[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\right]$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s), \quad \text{where } a_t \sim \pi(a_t \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t) \text{ for } t \ge 0$$
The following useful identity expresses the expected return of another policy $\tilde{\pi}$ in
terms of the advantage over $\pi$, accumulated over time-steps (Schulman et al., 2015):

$$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\left[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\right] \tag{6.22}$$

where the notation $\mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}[\ldots]$ indicates that actions are sampled $a_t \sim \tilde{\pi}(\cdot \mid s_t)$.
Let $\rho_\pi$ denote the (unnormalized) discounted state visitation frequencies, where
$s_0 \sim \rho_0$ and the actions are chosen according to $\pi$. Equation 6.22 can be re-written
with a sum over states instead of time steps:

$$\begin{aligned}
\eta(\tilde{\pi}) &= \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \\
&= \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)
\end{aligned} \tag{6.23}$$
Equation 6.23 implies that any policy update $\pi \to \tilde{\pi}$ that has a non-negative expected
advantage at every state $s$, i.e., $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) \ge 0$, is guaranteed
to increase the policy performance $\eta$, or leave it constant in the case that the
expected advantage is zero everywhere. This implies the classic result that the
update performed by exact policy iteration, which uses the deterministic policy
$\tilde{\pi}(s) = \arg\max_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair
with a positive advantage value and nonzero state visitation probability; otherwise,
the algorithm has converged to the optimal policy. However, in the approximate
setting, it will typically be unavoidable, due to estimation and approximation error,
that there will be some states $s$ for which the expected advantage is negative, that
is, $\sum_a \tilde{\pi}(a \mid s) A_\pi(s, a) < 0$. The complex dependency of $\rho_{\tilde{\pi}}(s)$ on $\tilde{\pi}$ makes Equation
6.23 difficult to optimize directly. Instead, the following local approximation to $\eta$ is
introduced (Schulman et al., 2015):

$$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \tag{6.24}$$

$L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde{\pi}}$, ignoring changes in state visitation
density due to changes in the policy. However, when using a parameterized
policy $\pi_\theta$, where $\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$, then
$L_\pi$ matches $\eta$ to first order; that is, for any parameter value $\theta_0$,
$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$ and
$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta=\theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta=\theta_0}$.
Equation 6.24 thus implies that a sufficiently small step $\pi_{\theta_{old}} \to \tilde{\pi}$ that improves $L_{\pi_{\theta_{old}}}$
will also improve $\eta$, but it does not give us any guidance on how big a step to take
(Schulman et al., 2015).
To address this issue, (Kakade & Langford, 2002) proposed a policy updating scheme
called conservative policy iteration, for which they could provide explicit lower
bounds on the improvement of $\eta$. To define the conservative policy iteration update,
let $\pi_{old}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$. The new
policy $\pi_{new}$ was defined to be the mixture
$\pi_{new}(a \mid s) = (1-\alpha)\,\pi_{old}(a \mid s) + \alpha\, \pi'(a \mid s)$,
for which the following lower bound holds (Schulman et al., 2015):

$$\eta(\pi_{new}) \ge L_{\pi_{old}}(\pi_{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \quad \text{where } \epsilon = \max_s \left|\mathbb{E}_{a \sim \pi'(a \mid s)}\left[A_\pi(s, a)\right]\right| \tag{6.27}$$
Equation 6.27, which applies to conservative policy iteration, implies that a policy
update that improves the right-hand side is guaranteed to improve the true
performance $\eta$. The principal theoretical result of (Schulman et al., 2015) is that the
policy improvement bound in Equation 6.27 can be extended to general stochastic
policies, rather than just mixture policies, by replacing $\alpha$ with a distance measure
between $\pi$ and $\tilde{\pi}$, and changing the constant appropriately. Since mixture policies
are rarely used in practice, this result is crucial for extending the improvement
guarantee to practical problems. The particular distance measure used is the total
variation divergence, which is defined by $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete
probability distributions $p, q$. Define $D_{TV}^{\max}(\pi, \tilde{\pi})$ as
$$D_{TV}^{\max}(\pi, \tilde{\pi}) = \max_s D_{TV}\left(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\right) \tag{6.28}$$

$$\underset{\theta}{\text{maximize}}\; \left[L_{\theta_{old}}(\theta) - C\, D_{KL}^{\max}(\theta_{old}, \theta)\right] \tag{6.29}$$

$$\underset{\theta}{\text{maximize}}\; L_{\theta_{old}}(\theta) \quad \text{subject to } \bar{D}_{KL}^{\rho_{old}}(\theta_{old}, \theta) \le \delta \tag{6.30}$$
Algorithm 8: Trust Region Policy Optimization (Schulman et al., 2015)
Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$
Hyperparameters: KL-divergence limit $\delta$, backtracking coefficient $\alpha$, maximum number of backtracking steps $K$
for k = 0, 1, 2, ... do
    Collect a set of trajectories $D_k = \{\tau_i\}$ by running policy $\pi_k = \pi(\theta_k)$ in the
    environment. Compute rewards-to-go $\hat{R}_t$. Compute advantage estimates $\hat{A}_t$
    (using any method of advantage estimation) based on the current value
    function $V_{\phi_k}$. Estimate the policy gradient as
    $$\hat{g}_k = \frac{1}{|D_k|}\sum_{\tau \in D_k}\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big|_{\theta_k}\, \hat{A}_t$$
    Compute the update direction $\hat{x}_k \approx \hat{H}_k^{-1}\hat{g}_k$ (e.g., with the conjugate gradient
    algorithm), where $\hat{H}_k$ is the Hessian of the sample average KL-divergence. Update the
    policy by backtracking line search with
    $$\theta_{k+1} = \theta_k + \alpha^j \sqrt{\frac{2\delta}{\hat{x}_k^T \hat{H}_k \hat{x}_k}}\, \hat{x}_k$$
    where $j \in \{0, \ldots, K\}$ is the smallest value that improves the surrogate objective
    while satisfying the KL constraint.
end
(Schulman et al., 2017).
This objective function of PPO can be represented as (Schulman et al., 2017):
$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{old}}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\; g\left(\epsilon, A^{\pi_{\theta_k}}(s, a)\right)\right) \tag{6.33}$$

where

$$g(\epsilon, A) = \begin{cases} (1+\epsilon)A & \text{if } A \ge 0 \\ (1-\epsilon)A & \text{if } A < 0 \end{cases} \tag{6.34}$$
In the implementation, PPO maintains two policy networks: the current policy that
is being refined, and the policy that was last used to collect samples (Schulman et
al., 2017). PPO alternates between sampling data by interacting with the environment
and optimizing the objective. In order to increase sample efficiency, the new policy is
evaluated using samples obtained from the earlier policy via importance sampling
(Schulman et al., 2017).
$$\underset{\theta}{\text{maximize}}\; \hat{\mathbb{E}}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t\right] \tag{6.37}$$
As the current policy is updated, the gap between it and the old policy grows, the
variance of this importance-sampled estimate increases, and the resulting inaccuracy
leads to poor updates. Therefore, the second network is periodically resynchronized
with the updated policy, for example every four iterations (Schulman et al., 2017).
With the clipped objective, we compute a ratio between the new policy and the old
policy (Schulman et al., 2017):

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)} \tag{6.38}$$
This ratio compares the two policies. If the new policy is distant from the previous
policy, a new objective function is created to clip the estimated advantage function.
The new objective function is now (Schulman et al., 2017):

$$L_{\theta_k}^{CLIP}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\left[\sum_{t=0}^{T} \min\left(r_t(\theta)\,\hat{A}_t^{\pi_k},\; \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right)\hat{A}_t^{\pi_k}\right)\right] \tag{6.39}$$
40
The advantage function is clipped if the probability ratio between the new and old
policies moves outside the range $(1-\epsilon,\, 1+\epsilon)$. The clipping restricts the
amount of effective change that can be made at each step, which increases stability
and inhibits large policy changes outside this trusted range (Schulman et al., 2017).
As a result, the method can be expressed as shown in Algorithm 9.
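Equations 6.38 and 6.39 can be sketched in PyTorch as follows; the distribution interface and names are assumptions, and the loss is negated so that a standard optimizer can minimize it.

```python
import torch


def ppo_clip_loss(policy, old_log_prob, obs, actions, advantages, clip_eps=0.2):
    """Clipped surrogate objective of eqs. 6.38-6.39 (negated for minimization).
    `policy.dist(obs)` is assumed to return a torch.distributions object."""
    log_prob = policy.dist(obs).log_prob(actions).sum(-1)
    ratio = torch.exp(log_prob - old_log_prob)                     # r_t(theta), eq. 6.38
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```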
Chapter 7
Experiments
3. Trading Costs
We experimented with three different trading cost scenarios: no trading costs,
0.1% of the stock's price, and 1% of the stock's price.
The Dow Jones stock companies used are 3M, American Express, Amgen, Apple, Boeing, Caterpillar,
Chevron, Cisco Systems, Coca-Cola, Disney, Dow, Goldman Sachs, Home Depot, Honeywell, IBM,
Intel, Johnson and Johnson, JP Morgan Chase, McDonald's, Merck, Microsoft, Nike, Procter &
Gamble, Salesforce, Travelers, UnitedHealth, Visa, Walgreens, and Walmart.
2. REINFORCE
• Discount Factor (gamma): 0.99
• hidden size for linear layers: 128
• memory_dim: 100,000
• max_action: 1
• discount: 0.99
• update_freq: 2
• tau: 0.005
• policy_noise_std: 0.2
• policy_noise_clip: 0.5
• actor_lr: 1e-3
• critic_lr: 1e-3
• batch_size: 128
• exploration_noise: 0.1
• num_layers: 3
• dropout: 0.2
• add_lstm: False
• warmup_steps: 100
• soft_tau: 1e-2
• replay_buffer_size: 1,000,000
• batch_size: 128
7.3 Metrics
In this work, we use the following metrics when backtesting to evaluate and compare
the performance of the RL agents:
1. Annualized Returns
This is the yearly average profit from the trading strategy.
where:
N = Number of periods measured
2. Cumulative Return
This is the total return obtained from a trading strategy over a period, given an
initial investment:

$$\text{Cumulative Return} = \frac{P_{\text{current}} - P_{\text{initial}}}{P_{\text{initial}}}$$

where $P_{\text{current}}$ is the current price and $P_{\text{initial}}$ is the original price.
3. Sharpe Ratio
This is the reward/risk ratio or risk-adjusted rewards of the trading strategy.
$$\text{Sharpe Ratio} = \frac{R_p - R_f}{\sigma_p}$$
Where:
Rp = return of portfolio
Rf = risk-free rate
σp = standard deviation of the portfolio’s excess return
4. Maximum Drawdown (Max DD)
This is the difference between the maximum and minimum values of a portfolio
over a time horizon used to measure the downside risk of a trading strategy.
It is usually represented as a percentage, and lower values indicate good per-
formance.
$$MDD = \frac{\text{Trough Value} - \text{Peak Value}}{\text{Peak Value}}$$
5. Calmar Ratio
This measures the trading strategy’s performance relative to its risk. It is
calculated by dividing the average annual rate of return by the maximum
drawdown. Similar to the Sharpe ratio, higher values indicate better risk-
adjusted performance.
$$\text{Calmar Ratio} = \frac{R_p - R_f}{\text{Maximum Drawdown}}$$
Where:
Rp = Portfolio return
Rf = Risk-free rate
Rp − Rf = Annual rate of return
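These metrics can be computed from a backtested series of portfolio values as sketched below; the 252-trading-day annualization convention and the use of daily returns are assumptions rather than details taken from the text.

```python
import numpy as np


def backtest_metrics(portfolio_values, risk_free_rate=0.0, periods_per_year=252):
    """Backtest metrics computed from a series of daily portfolio values."""
    values = np.asarray(portfolio_values, dtype=float)
    daily_returns = values[1:] / values[:-1] - 1.0

    cumulative_return = values[-1] / values[0] - 1.0
    annualized_return = (1.0 + cumulative_return) ** (periods_per_year / len(daily_returns)) - 1.0

    excess = daily_returns - risk_free_rate / periods_per_year
    sharpe = np.sqrt(periods_per_year) * excess.mean() / excess.std()

    running_peak = np.maximum.accumulate(values)
    max_drawdown = ((values - running_peak) / running_peak).min()   # most negative dip from a peak

    calmar = (annualized_return - risk_free_rate) / abs(max_drawdown)
    return {"annualized_return": annualized_return, "cumulative_return": cumulative_return,
            "sharpe": sharpe, "max_drawdown": max_drawdown, "calmar": calmar}
```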
Chapter 8
8.1 Results
This chapter presents the results from the experiments described in chapter 7. Tables
8.1 to 8.3 present a comparison between the RL agents’ mean and peak performance
ranks at different trading costs across both reward functions. A large variation between
mean and peak performance indicates that an agent's portfolio management
strategy is unstable. Figure 8.1 shows the final average position across all experiments
of all the agents, ordered from best to worst. Appendix A shows the raw metric values
aggregated into Figure 8.1. Figures 8.2 to 8.10 show plots of cumulative returns
and portfolio management strategies for the best performing baseline model (MPT)
and the RL agents (A2C and SAC) that consistently outperformed MPT based on
the average rank metric in Tables 8.1 to 8.3. The mean of portfolio weights graphs
provide information on how an agent distributes its portfolio among the available
stocks while the standard deviation of portfolio weights graphs provide information
about how much an agent changes its portfolio distribution. Together, these graphs
explain an agent’s portfolio management strategy.
Table 8.1: Table of Rank Comparison at No Trading Cost
Table 8.3: Table of Rank Comparison at 1% Trading Cost
Figure 8.1: Graph of Final Average Rank of All Agents
Figure 8.3: Graph of Mean of Portfolio Weights For Each Stock at No Trading Costs
Figure 8.4: Graph of Mean of Portfolio Weights For Each Stock at No Trading Costs
Figure 8.5: Graph of Cumulative Returns Plot at 0.1% Trading Costs
Figure 8.6: Graph of Mean of Portfolio Weights For Each Stock at 0.1% Trading
Costs
Figure 8.7: Graph of Mean of Portfolio Weights For Each Stock at 0.1% Trading
Costs
Figure 8.9: Graph of Mean of Portfolio Weights For Each Stock at 1% Trading Costs
Figure 8.10: Graph of Mean of Portfolio Weights For Each Stock at 1% Trading
Costs
8.2 Discussion
8.2.1 RL vs. Baselines
From Tables 8.1 to 8.3, we see that only two baseline agents compare favourably
with the RL agents. These baseline agents are Buy & Hold and MPT, with the latter
being the stronger one. Trading costs have significant effects on the performance of
the trading agents. Every RL agent outperforms Buy & Hold at no trading costs.
However, when trading costs are introduced, only four RL agents outperform Buy &
Hold. This is still a significant achievement and provides evidence that the RL agents
are able to discover good portfolio management strategies.
an average of 5% across all the stocks over the testing period. It should be noted that
A2C had cumulative returns similar to those of the MPT baseline, which also put 90%
of its holdings into just three stocks - HD, UNH, and V. This confirms a general
observation in portfolio management: different market strategies can yield similar results.
When a trading cost of 0.1% is introduced, both A2C and SAC agents change their
strategy. Rather than put 80% of its holdings into one stock only, the SAC agent
put a similar percentage into four different stocks (JPM, V, GS, CSCO) and kept
its portfolio spread over them. It is interesting to note that these stocks are entirely
different from those it chose at no trading costs, and yet, it outperformed itself on
most of the metrics. Similarly, rather than spread its portfolio into most of the
available stocks, the A2C agent put about half of its portfolio into three stocks
(MRK, GS, KO) this time. Nevertheless, it still kept its portfolio spread across all
stocks but usually chose to trade one of CVX, GS, KO, MCD, MRK, RTX, or V.
When the trading cost was 1%, the strategy landscape changed dramatically. The
SAC agent chose a buy and hold strategy and held an almost uniform proportion
of stocks across all the available stocks. On the other hand, the A2C agent put
most of its stocks (about 90%) into just three stocks - KO, PFE, and RTX. Also, it
kept the portfolio spread across just these three stocks. It is necessary to note that
while both SAC and A2C underperformed compared to MPT at 1% trading costs
on returns-related metrics, the A2C agent’s strategy enabled it to outperform MPT
on risk-related metrics and overall, on average.
On-policy and off-policy agents appear similar in terms of mean performance,
but on-policy agents consistently and significantly outperform off-policy
agents at peak performance. Three out of the four on-policy agents (A2C, PPO,
TRPO) are ranked in the top five at different trading costs, consistently outperforming
the Buy and Hold baseline. In contrast, only one of the four off-policy
agents (SAC) ranks consistently in the top five. The SAC agent's performance can
be attributed to its maximum entropy learning framework, which allows it to perform
stochastic optimization of its policy.
While the performance of SAC shows that off-policy agents can perform as well as
on-policy agents in the task of portfolio management, the evidence of this analysis
suggests that on-policy agents are more suited to the task of portfolio management
in comparison to off-policy agents. The good performance of on-policy agents is
because they are better at policy evaluation and because sample efficiency is not a
significant problem in portfolio management. While unlikely, the off-policy agents
may improve with hyperparameter optimization.
Chapter 9
Conclusion
9.1 Contributions
This study investigated the performance of RL when applied to portfolio manage-
ment using model-free deep reinforcement learning agents. We trained several RL
agents on real-world stock prices to learn how to perform asset allocation. We com-
pared the performance of these RL agents against some baseline agents. We also
compared the RL agents among themselves to understand which classes of agents
performed better.
From our analysis, RL agents can perform the task of portfolio management since
they significantly outperformed two of the baseline agents (random allocation and
uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed
the best baseline, MPT, overall. This shows the ability of RL agents to uncover
more profitable trading strategies.
Furthermore, there were no significant performance differences between value-based
and policy-based RL agents. Actor-critic agents performed better than other types
of agents. Also, on-policy agents performed better than off-policy agents because
they are better at policy evaluation and sample efficiency is not a significant problem
in portfolio management.
In summary, this study shows that RL agents can substantially improve asset al-
location since they outperform strong baselines. On-policy, actor-critic RL agents
showed the most promise based on our analysis. The next section discusses some
directions that future works may want to explore to build on this work.
proposed hyperparameters seen in the original papers. Thus, another possible ex-
tension to this work will be to carry out extensive hyperparameter optimization for
all the eight agents studied in this work to see how the performance of these agents
changes.
Furthermore, in this project, we focused only on using feedforward neural networks
as the function approximators for all the agents, as proposed by the original authors.
However, financial market data is a time series. Using architectures such as recurrent
neural networks, convolutional neural networks, and transformers, which can take the
temporal nature of the market into account, could yield better results.
Finally, while we have used the Dow Jones market as requested by the client, there
is potential for comparative analysis across several other markets. It would be
interesting to see if the RL agents perform better when the market is small (e.g.,
just the top 5 technology companies) or large (e.g. the S&P 500 market). Similarly,
other markets from other locations around the world (e.g., the DAX market of
Germany, the HK50 market of Hong Kong, and the JSE market of South Africa)
can be studied to see how the insights garnered from the Dow Jones market transfer
to these new markets.
Appendix A
Tables A.1 to A.9 show all the trading agents' mean and peak performances at
different reward functions and trading costs. The rank columns show an algorithm's
position, based on its ranks across all the metrics.
Table A.1: Table of Mean Performance at No Trading Cost & Log Returns Reward
Table A.2: Table of Mean Performance at No Trading Cost & Sharpe Ratio Reward
Table A.4: Table of Mean Performance at 0.1% Trading Costs & Log Returns Reward
Table A.5: Table of Mean Performance at 0.1% Trading Costs & Sharpe Ratio Reward
Table A.6: Table of Peak Performance at 0.1% Trading Costs
Table A.7: Table of Mean Performance at 1% Trading Costs & Log Returns Reward
Table A.8: Table of Mean Performance at 1% Trading Costs & Sharpe Ratio Reward
Table A.9: Table of Peak Performance at 1% Trading Costs & Sharpe Ratio Reward
References
Adämmer, P., & Schüssler, R. A. (2020). Forecasting the equity premium: mind
the news! Review of Finance, 24 (6), 1313–1355.
Albanesi, S., & Vamossy, D. F. (2019). Predicting consumer default: A deep learning
approach (Tech. Rep.). National Bureau of Economic Research.
Amel-Zadeh, A., Calliess, J.-P., Kaiser, D., & Roberts, S. (2020). Machine learning-
based financial statement analysis. Available at SSRN 3520684 .
Ang, Y. Q., Chia, A., & Saghafian, S. (2020). Using machine learning to demystify
startups funding, post-money valuation, and success. Post-Money Valuation,
and Success (August 27, 2020).
Antunes, F., Ribeiro, B., & Pereira, F. (2017). Probabilistic modeling and visual-
ization for bankruptcy prediction. Applied Soft Computing, 60 , 831–843.
Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? the information
content of internet stock message boards. The Journal of finance, 59 (3), 1259–
1294.
Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial time
series using stacked autoencoders and long-short term memory. PloS one,
12 (7), e0180944.
Bao, Y., Ke, B., Li, B., Yu, Y. J., & Zhang, J. (2020). Detecting accounting fraud
in publicly traded us firms using a machine learning approach. Journal of
Accounting Research, 58 (1), 199–235.
Bari, O. A., & Agah, A. (2020). Ensembles of text and time-series models for
automatic generation of financial trading signals from social media content.
Journal of Intelligent Systems, 29 (1), 753–772.
Belousov, B., Abdulsamad, H., Klink, P., Parisi, S., & Peters, J. (Eds.). (2021).
Reinforcement Learning Algorithms: Analysis and Applications (Vol. 883).
Cham: Springer International Publishing. Retrieved 2021-10-25, from http://
link.springer.com/10.1007/978-3-030-41188-6 doi: 10.1007/978-3-030
-41188-6
Betancourt, C., & Chen, W.-H. (2021, February). Deep reinforcement learning
for portfolio management of markets with a dynamic number of assets. Expert
Systems with Applications, 164 , 114002. Retrieved 2021-10-25, from https://
linkinghub.elsevier.com/retrieve/pii/S0957417420307776 doi: 10
.1016/j.eswa.2020.114002
Björkegren, D., & Grissen, D. (2020). Behavior revealed in mobile phone usage
predicts credit repayment. The World Bank Economic Review , 34 (3), 618–
634.
Chen, J. (2021, May). Post-Modern portfolio theory (PMPT). https://www
.investopedia.com/terms/p/pmpt.asp. (Accessed: 2021-11-14)
Chen, L., Pelger, M., & Zhu, J. (2020). Deep learning in asset pricing. Available at
SSRN 3350138 .
Colombo, E., & Pelagatti, M. (2020). Statistical learning and exchange rate fore-
casting. International Journal of Forecasting, 36 (4), 1260–1289.
Contributors, W. (2021, October). Reinforcement learning. Retrieved 2021-10-
26, from https://en.wikipedia.org/w/index.php?title=Reinforcement
_learning&oldid=1051236695 (Page Version ID: 1051236695)
Croux, C., Jagtiani, J., Korivi, T., & Vulanovic, M. (2020). Important factors deter-
mining fintech loan default: Evidence from a lendingclub consumer platform.
Journal of Economic Behavior & Organization, 173 , 270–296.
Damrongsakmethee, T., & Neagoe, V.-E. (2017). Data mining and machine learning
for financial analysis. Indian Journal of Science and Technology, 10 (39), 1–7.
Deng, Y., Ren, Z., Kong, Y., Bao, F., & Dai, Q. (2016). A hierarchical fused
fuzzy deep neural network for data classification. IEEE Transactions on Fuzzy
Systems, 25 (4), 1006–1012.
E., S., & E., R. (2021). Investment portfolio: Traditional approach. Norwegian
Journal of Development of the International Science.
Filos, A. (2019, September). Reinforcement Learning for Portfolio Management.
arXiv:1909.09571 [cs, q-fin, stat] . Retrieved 2021-10-25, from http://arxiv
.org/abs/1909.09571 (arXiv: 1909.09571)
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation
error in actor-critic methods. In International conference on machine learning
(pp. 1587–1596).
Gao, Y., Gao, Z., Hu, Y., Song, S., Jiang, Z., & Su, J. (2021, October). A
Framework of Hierarchical Deep Q-Network for Portfolio Management. In
(pp. 132–140). Retrieved 2021-10-25, from https://www.scitepress.org/
PublicationsDetail.aspx?ID=0fLwyxE3WOE=&t=1
Gomes, T. A., Carvalho, R. N., & Carvalho, R. S. (2017). Identifying anomalies
in parliamentary expenditures of brazilian chamber of deputies with deep au-
toencoders. In 2017 16th ieee international conference on machine learning
and applications (icmla) (pp. 940–943).
Goumagias, N. D., Hristu-Varsakelis, D., & Assael, Y. M. (2018). Using deep q-
learning to understand the tax evasion behavior of risk-averse firms. Expert
Systems with Applications, 101 , 258–270.
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning.
The Review of Financial Studies, 33 (5), 2223–2273.
Gu, S., Lillicrap, T., Sutskever, I., & Levine, S. (2016). Continuous deep q-learning
with model-based acceleration. In International conference on machine learn-
ing (pp. 2829–2838).
Gulen, H., Jens, C., & Page, T. B. (2020). An application of causal forest in
corporate finance: How does financing affect investment? Available at SSRN
3583685 .
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor.
In International conference on machine learning (pp. 1861–1870).
Harmon, M. E., & Harmon, S. S. (1996). Reinforcement Learning: A Tutorial.
Hayes, A. (2021, November). Portfolio management. https://www.investopedia
.com/terms/p/portfoliomanagement.asp. (Accessed: 2021-11-17)
Hieu, L. T. (2020, October). Deep Reinforcement Learning for Stock Portfolio Op-
timization. International Journal of Modeling and Optimization, 10 (5), 139–
144. Retrieved 2021-10-25, from http://arxiv.org/abs/2012.06325 (arXiv:
2012.06325) doi: 10.7763/IJMO.2020.V10.761
Hu, Y., & Lin, S.-J. (2019). Deep Reinforcement Learning for Optimizing Finance
Portfolio Management. 2019 Amity International Conference on Artificial
Intelligence (AICAI). doi: 10.1109/AICAI.2019.8701368
Huang, J., Chai, J., & Cho, S. (2020, June). Deep learning in finance and banking: A
literature review and classification. Frontiers of Business Research in China,
14 (1), 13. Retrieved 2021-10-25, from https://doi.org/10.1186/s11782
-020-00082-6 doi: 10.1186/s11782-020-00082-6
Huang, Y., Huang, K., Wang, Y., Zhang, H., Guan, J., & Zhou, S. (2016). Exploit-
ing twitter moods to boost financial trend prediction based on deep network
models. In International conference on intelligent computing (pp. 449–460).
Huotari, T., Savolainen, J., & Collan, M. (2020, December). Deep Reinforce-
ment Learning Agent for S&P 500 Stock Selection. Axioms, 9 (4),
130. Retrieved 2021-10-25, from https://www.mdpi.com/2075-1680/9/4/
130 (Number: 4 Publisher: Multidisciplinary Digital Publishing Institute)
doi: 10.3390/axioms9040130
Iwasaki, H., & Chen, Y. (2018). Topic sentiment asset pricing with dnn supervised
learning. Available at SSRN 3228485 .
Jiang, Z., & Liang, J. (2017). Cryptocurrency portfolio management with deep
reinforcement learning. In 2017 intelligent systems conference (intellisys) (pp.
905–913).
Jiang, Z., Xu, D., & Liang, J. (2017, July). A Deep Reinforcement Learning Frame-
work for the Financial Portfolio Management Problem. arXiv:1706.10059
[cs, q-fin] . Retrieved 2021-10-25, from http://arxiv.org/abs/1706.10059
(arXiv: 1706.10059)
Joyce, J. M. (2011). Kullback-leibler divergence. In M. Lovric (Ed.), Inter-
national encyclopedia of statistical science (pp. 720–722). Berlin, Heidel-
berg: Springer Berlin Heidelberg. Retrieved from https://doi.org/10.1007/
978-3-642-04898-2_327 doi: 10.1007/978-3-642-04898-2_327
Kakade, S., & Langford, J. (2002, 01). Approximately optimal approximate rein-
forcement learning. In (p. 267-274).
Konda, V., & Gao, V. (2000, January). Actor-critic algorithms.
Kvamme, H., Sellereite, N., Aas, K., & Sjursen, S. (2018). Predicting mortgage
default using convolutional neural networks. Expert Systems with Applications,
102 , 207–217.
Lahmiri, S., & Bekiros, S. (2019). Can machine learning approaches predict corpo-
rate bankruptcy? evidence from a qualitative experimental design. Quantita-
tive Finance, 19 (9), 1569–1577.
Levišauskait, K. (2010). Investment analysis and portfolio management.
Li, B., & Hoi, S. C. H. (2014, January). Online portfolio selection: A survey. ACM
Computing Surveys, 46 (3), 35:1–35:36. Retrieved 2021-10-25, from https://
doi.org/10.1145/2512962 doi: 10.1145/2512962
Li, H., Shen, Y., & Zhu, Y. (2018). Stock price prediction using attention-based
multi-input lstm. In Asian conference on machine learning (pp. 454–469).
Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforce-
ment learning in portfolio management. arXiv preprint arXiv:1808.09940 .
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., . . . Wier-
stra, D. (2019, July). Continuous control with deep reinforcement learning.
arXiv:1509.02971 [cs, stat] . Retrieved 2021-11-07, from http://arxiv.org/
abs/1509.02971 (arXiv: 1509.02971)
Luo, C., Wu, D., & Wu, D. (2017). A deep learning approach for credit scoring
using credit default swaps. Engineering Applications of Artificial Intelligence,
65 , 465–470.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7 (1), 77–91.
Retrieved from http://www.jstor.org/stable/2975974
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &
Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602 .
Ozbayoglu, A. M., Gudelek, M. U., & Sezer, O. B. (2020). Deep learning for financial
applications: A survey. Applied Soft Computing, 93 , 106384.
Paula, E. L., Ladeira, M., Carvalho, R. N., & Marzagao, T. (2016). Deep learning
anomaly detection as support fraud investigation in brazilian exports and anti-
money laundering. In 2016 15th ieee international conference on machine
learning and applications (icmla) (pp. 954–960).
Reichenbacher, M., Schuster, P., & Uhrig-Homburg, M. (2020). Expected bond
liquidity. Available at SSRN 3642604 .
Rossi, A. G., & Utkus, S. P. (2020). Who benefits from robo-advising? evidence
from machine learning. Evidence from Machine Learning (March 10, 2020).
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust
region policy optimization. In International conference on machine learning (pp. 1889–1897).
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017).
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Silver, C. (2021, November). Modern portfolio theory (MPT). https://
www.investopedia.com/terms/m/modernportfoliotheory.asp. (Accessed:
2021-11-14)
Silver, D. (2015). Introduction to Reinforcement Learning with David Silver.
Retrieved 2021-10-25, from https://deepmind.com/learning-resources/
-introduction-reinforcement-learning-david-silver
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., . . .
Hassabis, D. (2018, December). A general reinforcement learning algorithm
that masters chess, shogi, and Go through self-play. Science (New York, N.Y.),
362 (6419), 1140–1144. doi: 10.1126/science.aar6404
Spilak, B. (2018). Deep neural networks for cryptocurrencies price prediction (Un-
published master’s thesis). Humboldt-Universität zu Berlin.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction
(Second ed.). The MIT Press. Retrieved from http://incompleteideas.net/
book/the-book-2nd.html
Taghian, M., Asadi, A., & Safabakhsh, R. (2020). Learning financial asset-
specific trading rules via deep reinforcement learning. arXiv preprint
arXiv:2010.14194 .
Tang, L. (2018, December). An actor-critic-based portfolio investment method
inspired by benefit-risk optimization. Journal of Algorithms & Computa-
tional Technology, 12 (4), 351–360. Retrieved 2021-10-28, from http://
journals.sagepub.com/doi/10.1177/1748301818779059 doi: 10.1177/
1748301818779059
Tian, S., Yu, Y., & Guo, H. (2015). Variable selection and corporate bankruptcy
forecasts. Journal of Banking & Finance, 52 , 89–100.
Vamossy, D. F. (2021). Investor emotions and earnings announcements. Journal of
Behavioral and Experimental Finance, 30 , 100474.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4), 279–292.
Weng, L. (2018, February). A (Long) Peek into Reinforcement Learning.
Retrieved 2021-11-02, from https://lilianweng.github.io/2018/02/19/
a-long-peek-into-reinforcement-learning.html
Yiu, T. (2020, October). Understanding portfolio optimization. https://
towardsdatascience.com/understanding-portfolio-optimization
-795668cef596. (Accessed: 2021-11-17)