
arXiv:2209.10458v1 [q-fin.PM] 21 Sep 2022

Model-Free Reinforcement Learning for Asset Allocation

Practicum Final Report

Authors: Adebayo Oshingbesan, Eniola Ajiboye, Peruth Kamashazi, Timothy Mbaka

Industry Advisor & Client: Mahmoud Mahfouz, Srijan Sood

Faculty Supervisor: David Vernon

September 22, 2022


Acknowledgments

This work would not have been possible without the advice and support of several
people. First and foremost, we would like to express our gratitude to Mahmoud
Mahfouz, Vice President of AI Research at J.P. Morgan Chase & Co., and his
colleague Srijan Sood for providing us with resources and valuable guidance during
this project. We would also like to thank our advisor, Prof. David Vernon, for his
guidance and constant support.

Table of Contents

1 Introduction 6
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Aim and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.4 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.5 Significance of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.6 Limitations of Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.7 Structure of Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Portfolio Management 9
2.1 Portfolio Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Markowitz Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Modern Portfolio Theory . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Post-modern Portfolio Theory . . . . . . . . . . . . . . . . . . . . . . 10

3 Survey of Machine Learning in Finance 12


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 RL Applications in Portfolio Management . . . . . . . . . . . . . . . 16

4 Financial Environment 18
4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.5 Reward functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 Survey of Reinforcement Learning Techniques 21


5.1 Introduction to Reinforcement Learning . . . . . . . . . . . . . . . . 21
5.2 Reinforcement Learning Approaches . . . . . . . . . . . . . . . . . . . 24

6 Trading Agents 25
6.1 Baseline Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
6.2 Selection Criteria for RL Agents . . . . . . . . . . . . . . . . . . . . . 26
6.3 Theoretical Description of Selected RL Agents . . . . . . . . . . . . . 26
6.3.1 Normalized Advantage Function (NAF) . . . . . . . . . . . . . 26
6.3.2 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
6.3.3 Deep Deterministic Policy Gradient (DDPG) . . . . . . . . . . 29
6.3.4 Twin Delayed Deep Deterministic Policy Gradient (TD3) . . . 31

6.3.5 Advantage Actor Critic (A2C) . . . . . . . . . . . . . . . . . . 32
6.3.6 Soft Actor Critic(SAC) . . . . . . . . . . . . . . . . . . . . . . 34
6.3.7 Trust Region Policy Optimization (TRPO) . . . . . . . . . . . 35
6.3.8 Proximal Policy Optimization (PPO) . . . . . . . . . . . . . . 39

7 Experiments 42
7.1 Data - Dow Jones 30 . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.2 Experiment Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
7.2.1 Environment Parameters . . . . . . . . . . . . . . . . . . . . . 42
7.2.2 RL Agent Hyper-parameters . . . . . . . . . . . . . . . . . . . 43
7.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

8 Results & Discussion 47


8.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
8.2 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.2.1 RL vs. Baselines . . . . . . . . . . . . . . . . . . . . . . . . . 55
8.2.2 Value-Based RL vs. Policy-Based RL . . . . . . . . . . . . . . 55
8.2.3 On-Policy vs. Off-Policy . . . . . . . . . . . . . . . . . . . . . 55

9 Conclusion 57
9.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
9.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

A Raw Metric Scores for All Experiments 59

References 64

List of Figures

3.1 Taxonomy of Common Applications of Machine Learning in Finance . 12

5.1 Components of an RL System (D. Silver, 2015) . . . . . . . . . . . . 21


5.2 Taxonomy of RL Agents (Weng, 2018) . . . . . . . . . . . . . . . . . 22

6.1 Actor Critic Algorithm Framework (Sutton & Barto, 2018) . . . . . . 33

8.1 Graph of Final Average Rank of All Agents . . . . . . . . . . . . . . . 50


8.2 Graph of Cumulative Returns Plot at No Trading Costs . . . . . . . 50
8.3 Graph of Mean of Portfolio Weights For Each Stock at No Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8.4 Graph of Mean of Portfolio Weights For Each Stock at No Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
8.5 Graph of Cumulative Returns Plot at 0.6% Trading Costs . . . . . . 52
8.6 Graph of Mean of Portfolio Weights For Each Stock at 0.6% Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
8.7 Graph of Mean of Portfolio Weights For Each Stock at 0.6% Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.8 Graph of Cumulative Returns Plot at 1% Trading Costs . . . . . . . 53
8.9 Graph of Mean of Portfolio Weights For Each Stock at 1% Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
8.10 Graph of Mean of Portfolio Weights For Each Stock at 1% Trading
Costs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

List of Tables

3.1 Table of Previous ML Research in Finance. . . . . . . . . . . . . . . . 13

5.1 Table of Model-Free RL Methods. . . . . . . . . . . . . . . . . . . . . 23


5.2 Table of Model-Free RL Methods and Their Learning Mechanics. . . 23

8.1 Table of Rank Comparison at No Trading Cost . . . . . . . . . . . . 48


8.2 Table of Rank Comparison at 0.1% Trading Cost . . . . . . . . . . . 48
8.3 Table of Rank Comparison at 1% Trading Cost . . . . . . . . . . . . 49

A.1 Table of Mean Performance at No Trading Cost & Log Returns Reward 59
A.2 Table of Mean Performance at No Trading Cost & Sharpe Ratio Reward 60
A.3 Table of Peak Performance at No Trading Cost . . . . . . . . . . . . 60
A.4 Table of Mean Performance at 0.1% Trading Costs & Log Returns
Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.5 Table of Mean Performance at 0.1% Trading Costs & Sharpe Ratio
Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
A.6 Table of Peak Performance at 0.1% Trading Costs . . . . . . . . . . . 62
A.7 Table of Mean Performance at 1% Trading Costs & Log Returns Reward 62
A.8 Table of Mean Performance at 1% Trading Costs & Sharpe Ratio
Reward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.9 Table of Peak Performance at 1% Trading Costs & Sharpe Ratio Reward 63

Chapter 1

Introduction

1.1 Background
Asset allocation (or portfolio management) is the task of determining how to opti-
mally allocate funds of a finite budget into a range of financial instruments/assets
such as stocks (Filos, 2019). Coming up with a profitable trading strategy involves
making critical decisions on allocating capital into different stock options. Usually,
this allocation should maximize the expected return while minimizing the investment
risks involved (Gao et al., 2021). There are several existing portfolio management
strategies, and the state-of-the-art portfolio management frameworks are broadly
classified into baseline models, follow the winner, follow the loser, pattern-matching,
and meta-learning (B. Li & Hoi, 2014).
While many of these state-of-the-art models achieve good results, there are some
limitations. First, they are overly reliant on using predictive models (Adämmer &
Schüssler, 2020). These predictive models are usually not very successful because financial
markets are highly stochastic and therefore difficult to predict
accurately (Belousov, Abdulsamad, Klink, Parisi, & Peters, 2021).
Similarly, many of these models make simplistic and usually unrealistic assumptions
around the financial signals’ second-order and higher-order statistical moments (Gao
et al., 2021). Finally, these models are usually limited to discrete action spaces to
make the resulting models tractable to solve (Filos, 2019).
Reinforcement learning (RL) has become increasingly popular in financial portfolio
management (J. Huang, Chai, & Cho, 2020). Reinforcement learning is the sub-field
of machine learning concerned with how intelligent machines ought to make deci-
sions in an environment to maximize the cumulative reward over time (Contributors,
The addition of deep neural networks has been one of the breakthrough concepts
in reinforcement learning in recent years, achieving superhuman performance
in several tasks such as playing chess (D. Silver et al., 2018).

1.2 Problem Definition


While there are a lot of tools currently available for portfolio management, they
are limited in their abilities because of inherent assumptions about the financial
markets. Techniques that reduce the number of simplifying assumptions made could
yield better results. Thus, there may be a significant gap between what is possible
and what is currently available in the field.

1.3 Aim and Objectives


This study investigates the effectiveness of model-free deep reinforcement learning
agents for portfolio management. The specific objectives are:
• Training RL agents on real-world prices of a finite set of stocks to optimally
allocate a finite cash budget into a range of securities in a portfolio.
• Comparing the performance of RL agents to baseline agents.
• Comparing the performance of value-based RL agents to policy-based RL
agents.
• Comparing the performance of off-policy RL agents to on-policy RL agents.

1.4 Research Questions


At the end of this report, we should be able to answer the following questions:
1. How well can RL agents perform the task of portfolio management?
2. Are RL agents markedly better than the classical state-of-the-art portfolio
management techniques?
3. Are there certain classes of RL agents that are consistently better at portfolio
management?

1.5 Significance of Study


Since Harry Markowitz proposed the idea of portfolio formation in 1952 to improve
expected returns and reduce investment risk, portfolio trading has been the
dominant strategy for many traders. Any technique that helps improve this strategy
will have many positive ripple effects (Levišauskait, 2010). Reinforcement learning
has shown that it is capable of improving performance in certain tasks substantially.
An excellent example of this is the game of chess where AlphaZero and LeelaZero,
two RL agents, beat state-of-the-art chess engines. These RL agents won by using
the strategy of long-term positional advantage over materialism leading many chess
players to modify their playing styles (D. Silver et al., 2018). This study aims to
understand if RL can uncover strategies for portfolio management that yield better
results than the current state of the art methods.

1.6 Limitations of Study


While this study will not be making any strong assumptions regarding the financial
signals obtained from the market, it will make some simplifying assumptions about
the impact of the RL agent on the market. However, these assumptions are generally
considered reasonable and should not impact the portability of this study into the
real world.

1.7 Structure of Report


The remainder of this report is structured as follows:
• Chapter 2: Portfolio Management - This chapter provides an overview of the
field of portfolio management.
• Chapter 3: Survey of Machine Learning in Finance - This chapter provides an
overview of the different applications of machine learning in finance, especially
in reinforcement learning for portfolio management.
• Chapter 4: Financial Environment - This chapter describes how the trading
environment is modeled as a reinforcement learning environment.
• Chapter 5: Survey of Reinforcement Learning Techniques - This chapter pro-
vides an overview of the field of reinforcement learning.
• Chapter 6: Trading Agents - This chapter describes the baseline and RL agents
considered in this work.
• Chapter 7: Experiments - This chapter describes all the experiments that were
carried out in this study.
• Chapter 8: Results and Discussion - This chapter documents and discusses
the results of this study.
• Chapter 9: Conclusion - This chapter summarizes the results, answers the ini-
tial research questions, ensures that the study aim & objectives were achieved,
and recommends future directions for the study.

Chapter 2

Portfolio Management

Portfolio management involves selecting and managing a collection of assets in order


to meet long-term financial goals while adhering to a risk tolerance level. Diversifi-
cation and portfolio optimization are critical components of efficient portfolio man-
agement. Diversification is a risk management approach that involves combining a
wide range of investments in a portfolio. A diversified portfolio comprises various
asset types and investment vehicles to reduce exposure to any particular asset or
risk (Hayes, 2021).

2.1 Portfolio Optimization


A portfolio is a collection of several financial assets. Portfolio optimization is the
process of determining the optimum portfolio from a set of all possible portfolios
with a specific goal in mind. An optimum portfolio mixes candidate assets so that
the chance of the portfolio generating a positive return is maximized for a given
level of risk (Yiu, 2020). Portfolio optimization systematically solves the asset allo-
cation problem by constructing and optimizing an objective function expressing the
investor’s preferences concerning the portfolio vector (Filos, 2019).
One method for accomplishing this is to assign equal weights to each asset. When the
returns across the assets are random and there is no historical data, assigning equal
weights is considered the most basic type of asset allocation and can be effective.
This strategy, however, is inefficient since it does not take into account past data.
Thus, more efficient methods have been developed. One of these methods is the
Markowitz Model.

2.2 Markowitz Model


The Markowitz model (Markowitz, 1952) is regarded as one of the first efforts to
codify and propose an optimization strategy for portfolio management mathemati-
cally. The Markowitz model formulates portfolio allocation as discovering a portfolio
vector w among a universe of M assets. The Markowitz model gives the optimal
portfolio vector w∗ which minimizes volatility for a given return level, such that

(Filos, 2019):

    Σ_{i=1}^{M} w_{*,i} = 1,    w_* ∈ ℝ^M    (2.1)

The Markowitz model assumes that the investor is risk-averse and thus determines
the optimal portfolio by selecting a portfolio that gives a maximum return for a
given risk or minimum risk for given returns. Therefore, the optimal portfolio is
selected as follows:
• From a set of portfolios with the same return, the investor will prefer the
portfolio with the lower risk.
• From a set of portfolios with the same risk, the investor will choose the portfolio
with the highest return.
The Markowitz model also presupposes that the analysis is based on a one-period
investment model and is thus only applicable in a one-period situation. This means
that the choice to distribute assets is made only at the start of the term. As a
result, the repercussions of this decision can only be seen after the term, and no
additional action may be taken during that time. This makes the model a static
model. Nevertheless, because the Markowitz model is simple to grasp and efficient, it is
frequently used as the cornerstone of current portfolio optimization strategies.
However, because it is based on a single-period investment
model, it is not ideal for continuous-time situations.

2.3 Modern Portfolio Theory


The Markowitz model serves as the foundation for the modern portfolio theory
(MPT). MPT is a mathematical framework for constructing a portfolio of assets
to maximize the expected return for a given amount of risk. Diversification is an
essential component of the MPT. Diversification refers to the concept that having
a variety of financial assets is less hazardous than owning only one type (C. Silver,
2021).
A core principle of the modern portfolio theory is that an asset’s risk should be
measured not by itself but by how it contributes to the total risk and return of
the portfolio. Modern portfolio theory uses the standard deviation of all returns to
assess the risk of a specific portfolio (E. & E., 2021). Modern portfolio theory may
be used to diversify a portfolio in order to get a higher overall return with less risk.
The efficient frontier is a key concept in MPT. It is the line indicating the investment
combination that will deliver the best level of return for the lowest degree of risk
(C. Silver, 2021).

2.4 Post-modern Portfolio Theory


The post-modern portfolio theory (PMPT) extends the modern portfolio theory
(MPT). PMPT is an optimization approach that uses the downside risk of returns
rather than the expected variance of investment returns employed by MPT. The
most important difference between PMPT and MPT in portfolio construction is how risk
is measured: MPT measures risk by the standard deviation of all returns and is therefore
based on symmetrical risk, whereas PMPT is based on asymmetrical, downside risk.
The downside risk is quantified by target semi-deviation, also known as downside
deviation, and it reflects what investors dread the most: negative returns (J. Chen,
2021).

Chapter 3

Survey of Machine Learning in


Finance

3.1 Introduction
Machine learning has become increasingly important in the finance industry across
several tasks. Figure 3.1 shows a taxonomy of some of the typical applications of
machine learning in finance.

Figure 3.1: Taxonomy of Common Applications of Machine Learning in Finance

For financial tasks, different classes of machine learning algorithms have been fre-
quently used. These classes include generalized linear models, tree-based models,
kernel-based models, and neural networks. Similarly, the data has widely varied
from structured data such as tables to unstructured data such as text. Table 3.1
itemizes various finance research articles where some of these models were used.
Table 3.1: Table of Previous ML Research in Finance.

Reference | Task | Data | ML Algorithm(s)
(Adämmer & Schüssler, 2020) | Information extraction and risk assessment (equity risk premium prediction) | News articles | Correlated Topic Model (CTM), LASSO, Support Vector Regression, Ridge Regression, Elastic Net, Decision Trees, Random Forests, Boosted Trees
(Albanesi & Vamossy, 2019) | Credit default behavior | Credit bureau data | Boosted Trees and Deep Neural Networks
(Amel-Zadeh, Calliess, Kaiser, & Roberts, 2020) | Abnormal stock returns | Financial statement variables | LASSO, Random Forests, Neural Networks
(Ang, Chia, & Saghafian, 2020) | Start-up valuations and probability of success | Start-up funding data and descriptions from Crunchbase over ten years | LDA and Gradient Boosting Regressor
(Antweiler & Frank, 2004) | Stock market sentiment analysis | Yahoo Finance message board | Naïve Bayes and SVM
(W. Bao, Yue, & Rao, 2017) | Stock price forecasting | Technical indicators | LSTM and Stacked Autoencoders
(Y. Bao, Ke, Li, Yu, & Zhang, 2020) | Accounting fraud | Raw financial statements | Boosted Trees
(Bari & Agah, 2020) | Trading signal generation | Tweets and financial news | LSTMs and GRUs
(Björkegren & Grissen, 2020) | Payment of bills – credit risk prediction | Mobile phone metadata | Random Forests
(L. Chen, Pelger, & Zhu, 2020) | Stochastic discount factor | Firm characteristics, historical returns, and macroeconomic indicators | Generative Adversarial Networks
(Colombo & Pelagatti, 2020) | Direction of changes in exchange rates | Market uncertainty indicator data | Support Vector Machines
(Croux, Jagtiani, Korivi, & Vulanovic, 2020) | Loan default | Loan data, borrowers’ characteristics, macroeconomic indicators | LASSO
(Gomes, Carvalho, & Carvalho, 2017) | Anomaly detection | Chamber of Deputies open data, companies data from the Secretariat of Federal Revenue of Brazil | Deep Autoencoders
(Goumagias, Hristu-Varsakelis, & Assael, 2018) | Tax evasion prediction | Empirical data from Greek firms | Deep Q-Learning
(Gu, Kelly, & Xiu, 2020) | Stock returns forecasting | Firm characteristics, historical returns, macroeconomic indicators | Linear Regression, Random Forests, Boosted Trees, Neural Networks
(Gulen, Jens, & Page, 2020) | Estimation of heterogeneous treatment effects of debt covenant violations on firms’ investment levels | Firm characteristic data | Causal Forests
(Y. Huang et al., 2016) | Price direction prediction | Tweets | Neural Networks
(Iwasaki & Chen, 2018) | Stock price prediction | Analyst reports | LSTM, CNN, Bi-LSTM
(Jiang & Liang, 2017) | Cryptocurrency portfolio management | Cryptocurrency price data | RNN, LSTM, CNN
(Kvamme, Sellereite, Aas, & Sjursen, 2018) | Mortgage default prediction | Mortgage data from the Norwegian financial service group DNB | Convolutional Neural Networks and Random Forest
(Lahmiri & Bekiros, 2019) | Corporate bankruptcy | Firms’ financial statements, market data, and general risk indicators | Neural Networks
(H. Li, Shen, & Zhu, 2018) | Stock price prediction | Stocks data | RNN, LSTM, GRU
(Liang, Chen, Zhu, Jiang, & Li, 2018) | Portfolio allocation | Stocks data | Deep Reinforcement Learning
(Luo, Wu, & Wu, 2017) | Corporate credit rating | CDS data | Deep Belief Network and Restricted Boltzmann Machines
(Ozbayoglu, Gudelek, & Sezer, 2020) | Financial distress prediction | Financial news | SVM, Deep Belief Network, LSTM
(Damrongsakmethee & Neagoe, 2017) | Credit score classification | Credit scores data | MLP and CNN
(Paula, Ladeira, Carvalho, & Marzagao, 2016) | Financial fraud and money laundering | Databases of foreign trade of the Secretariat of Federal Revenue of Brazil | Deep Autoencoders
(Reichenbacher, Schuster, & Uhrig-Homburg, 2020) | Future bond liquidity | Bond transactions and characteristics data | Elastic Nets and Random Forests
(Antunes, Ribeiro, & Pereira, 2017) | Bankruptcy prediction | Financial statements | Deep Belief Network
(Rossi & Utkus, 2020) | Investors’ portfolio allocation and performance and effects of robo-advising | Investor characteristics | Regression Trees
(Ozbayoglu et al., 2020) | Credit card default | Account data and macroeconomic indicators | Decision Trees, Random Forest, Boosted Trees
(Taghian, Asadi, & Safabakhsh, 2020) | Trading signal generation | Market data | Neural Network, Genetic Programming, and Reinforcement Learning
(Spilak, 2018) | Dynamic portfolio allocation | Cryptocurrency data | LSTM, RNN, MLP
(Ozbayoglu et al., 2020) | Stock classification | Stocks data | Deep RBM Encoder-Classifier Network
(Tian, Yu, & Guo, 2015) | Corporate bankruptcy | Firms’ financial statements and market data | LASSO
(Vamossy, 2021) | Investors’ emotions | StockTwits posts | Deep Neural Networks
(Deng, Ren, Kong, Bao, & Dai, 2016) | Stock price prediction and trading signal generation | Stock price data | Fuzzy Deep Direct Reinforcement Learning

3.2 RL Applications in Portfolio Management


As seen in Table 3.1, reinforcement learning has been applied to stock price prediction,
portfolio management/allocation, tax evasion prediction, and trading signal
generation, among other tasks. The following paragraphs provide an
overview of several significant works on reinforcement learning for portfolio
management from the past few years.
A study proposed an extendable reinforcement learning framework for handling a
generic portfolio management challenge (Jiang, Xu, & Liang, 2017). At the lowest
level, the framework is based on the Ensemble of Identical Independent Evaluators
(EIIE) meta topology, which allows for many different forms of weight-sharing neural
networks. This framework was tested on the bitcoin market, and it outperformed
other trading algorithms by a wide margin.
Another study (Filos, 2019) introduced a global model-free reinforcement learning
family of agents. Because it generalizes across assets and markets independent of
the training environment, this methodology proved economical memory-wise and
computation-wise. Furthermore, the author used pre-training, data augmentation,
and simulation to ensure more robust training.
A group of researchers (Huotari, Savolainen, & Collan, 2020) studied the portfolio
performance of a trading agent based on a convolutional neural network model. The
agent’s activity correlated with an investor’s high risk-taking behavior. Furthermore,
the agent beat benchmarks, although its performance did not differ from the
S&P 500 index in a statistically significant way.
In deep reinforcement learning for portfolio management (Hieu, 2020), the authors
examined three cutting-edge continuous policy gradient algorithms - deep determin-
istic policy gradient (DDPG), generalized deterministic policy gradient (GDPG),
and proximal policy optimization (PPO). The authors concluded that the first two
performed significantly better than the third.
A research work (Gao et al., 2021) suggested a hierarchical reinforcement learn-
ing framework that can manage an arbitrary number of assets while accounting for
transaction fees. On real-world market data, the framework performs admirably.
However, since just four equities were evaluated, there is some doubt about the
framework’s capacity to deal with much larger amounts of data.

Chapter 4

Financial Environment

4.1 Assumptions
The real-world financial market is a highly complex system. In this work, we model
the financial environment as a discrete-time, stochastic dynamic system with the
following simplifying assumptions:
• There is no dependence on explicit stock price prediction (model-free).
• The actions of the RL agent should be continuous.
• There will be zero slippage.
• The RL agents have zero market impact.
• Short-selling is prohibited.
• The environment is a partially observable system.
These simplifying assumptions are consistent with similar works in literature ((Filos,
2019), (Gao et al., 2021), (Betancourt & Chen, 2021), (Jiang & Liang, 2017), (Hu
& Lin, 2019)).

4.2 Description
The environment takes in the following inputs:
• Data: This is either a dataframe or a list of stock tickers. If a dataframe is
provided, the index of the dataframe must be of type datetime (or can be
cast to datetime). Each column should contain the prices of the stock name
provided in the header over the time period.
• Episode Length: This refers to how long (in days) the agent is allowed to
interact with the environment.
• Returns: The environment has two reward signals (see section 4.5). The
returns variable is a boolean flag used to choose between these reward signals.
When set to true, the environment uses the log-returns as the reward signal.
When set to false, the environment uses the differential Sharpe ratio.
• Trading Cost Ratio: This is the percentage of the stock price that will be
attributed to the cost of either selling or buying a unit stock.
• Lookback Period: This is a fixed-sized window used to control how much
historical data to return to the agent as the observation at each timestep.
• Initial Investment: This refers to the initial amount available to the agent to
spend on all the available stocks in the environment.
• Retain Cash: This is a boolean flag value used to inform the environment
whether the agent can keep a cash element or not.
• Random Start Range: The agent is encouraged to start from a random range
to avoid over-fitting. This value controls what that range should be.
• DSR Constant: This is a smoothing parameter for the differential Sharpe ratio.
• Add Softmax: This is a boolean flag that controls whether the environment
should perform the softmax operation or not. This is required to support the
out-of-the-box RL agents from other libraries.
• Start Date: If a list of tickers was provided instead of a dataframe, this start
date parameter is used for the Yahoo Finance data download.
• End Date: If a list of tickers was provided instead of a dataframe, this end
date parameter is used for the Yahoo Finance data download.
• Seed: This is a seed value for environment reproducibility.
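For concreteness, the snippet below sketches what these inputs might look like in code, using synthetic prices; the dictionary keys simply mirror the parameter names listed above and are illustrative rather than an exact constructor signature.

```python
# A minimal sketch of the environment inputs described above, using synthetic prices.
# The dictionary keys mirror the parameters in this section; they are illustrative only.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
dates = pd.date_range("2015-01-01", periods=1000, freq="B")   # datetime index (business days)
tickers = ["AAPL", "MSFT", "JPM"]
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0.0005, 0.01, (1000, len(tickers))), axis=0)),
    index=dates, columns=tickers,
)

env_config = {
    "data": prices,               # or a list of tickers plus start/end dates
    "episode_length": 252,        # days the agent interacts with the environment
    "returns": True,              # True: log-returns reward, False: differential Sharpe ratio
    "trading_cost_ratio": 0.001,  # 0.1% cost per unit bought or sold
    "lookback_period": 30,        # length T of the observation window
    "initial_investment": 1_000_000,
    "retain_cash": True,          # add a cash element to the portfolio vector
    "random_start_range": 100,    # random episode start offset to reduce over-fitting
    "dsr_constant": 0.02,         # smoothing parameter for the differential Sharpe ratio
    "add_softmax": True,          # project raw agent outputs onto valid weights
    "seed": 42,
}
```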

4.3 State Space


At any time step t, the agent will observe a stack of T vectors such that T is the
amount of lookback context that the agent can observe. Each of the vectors will be
an asset information vector denoted as Pt where:

    P_t = [ log(P_{1,t}/P_{1,t−1}), log(P_{2,t}/P_{2,t−1}), ..., log(P_{M,t}/P_{M,t−1}) ]^T ∈ ℝ^M    (4.1)
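The following numpy sketch illustrates equation 4.1: at time t the agent receives the last T log price-ratio vectors. The function and variable names are illustrative.

```python
# A numpy sketch of the observation in equation 4.1: at time t the agent sees
# the last `lookback` vectors of per-asset log price ratios.
import numpy as np

def observation(prices: np.ndarray, t: int, lookback: int) -> np.ndarray:
    """Return a (lookback x M) stack of log-return vectors P_{t-lookback+1}, ..., P_t."""
    log_ratios = np.log(prices[1:] / prices[:-1])   # row k corresponds to P_{k+1} in eq. 4.1
    return log_ratios[t - lookback:t]               # most recent `lookback` rows

prices = np.abs(np.random.randn(100, 5).cumsum(axis=0)) + 50.0   # toy data, M = 5 assets
obs = observation(prices, t=60, lookback=30)
print(obs.shape)   # (30, 5)
```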

4.4 Action Space


The action represents the weights of the stocks in a portfolio at any time t where
its i-th component represents the ratio of the i-th asset such that:

    A_t = [A_{t,1}, A_{t,2}, ..., A_{t,M}]^T ∈ ℝ^M    (4.2)

    Σ_{i=1}^{M} A_{t,i} = 1    (4.3)

    0 ≤ A_{t,i} ≤ 1  ∀ i, t    (4.4)

where M is the total number of assets in the portfolio.
If the agent is required to keep a cash element, the weight vector’s size is increased
by one yielding an extended portfolio vector that satisfies equations 4.2 to 4.4 with
the cash element treated as an additional asset.
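The sketch below shows one way raw agent outputs could be projected onto weights satisfying equations 4.2 to 4.4, mirroring the optional softmax operation and cash element described above; the function name is illustrative.

```python
# A sketch of mapping a raw agent output vector onto a valid portfolio weight
# vector (equations 4.2-4.4), with an optional cash element appended.
import numpy as np

def to_portfolio_weights(raw_action: np.ndarray, retain_cash: bool = False) -> np.ndarray:
    if retain_cash:
        raw_action = np.append(raw_action, 0.0)   # cash treated as an additional asset
    z = raw_action - raw_action.max()             # shift for numerical stability
    w = np.exp(z) / np.exp(z).sum()               # softmax: each w_i in [0, 1], sum w_i = 1
    return w

weights = to_portfolio_weights(np.array([0.2, -1.3, 0.7, 0.0]), retain_cash=True)
print(weights, weights.sum())   # weights sum to 1
```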

4.5 Reward functions


There are two reward functions Rt - log-returns and differential Sharpe ratio. The
specific training reward returned for a particular training episode will be determined
by the returns input value to the environment. The reward functions are defined as
follows:
• Log-returns
This is defined as the weighted sum of log-returns for the portfolio such that:

    R_t = ln(1 + P_{t+1} · A_t)    (4.5)

• Differential Sharpe Ratio


This is defined as an instantaneous risk-adjusted Sharpe ratio. Equations 4.6
to 4.10 provide its mathematical formulation (Filos, 2019).

    R_t = (Y_{t−1} δX_t + 0.5 X_{t−1} δY_t) / (Y_{t−1} − X_{t−1}^2)^{1.5}    (4.6)

    X_t = X_{t−1} + α δX_t    (4.7)

    Y_t = Y_{t−1} + α δY_t    (4.8)

    δX_t = LR_t − X_{t−1}    (4.9)

    δY_t = LR_t^2 − Y_{t−1},    (4.10)

where:
LR is the log-return
0 ≤ α ≤ 1; α is a smoothing factor.
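A small numpy sketch of the two reward signals, implementing equations 4.5 to 4.10 as written above; the class and variable names, and the default value of α, are illustrative.

```python
# Numpy sketch of the two reward signals. `log_return_vec` is P_{t+1} from eq. 4.1
# and `weights` is A_t; DifferentialSharpe keeps the running moments X_t and Y_t.
import numpy as np

def log_return_reward(log_return_vec: np.ndarray, weights: np.ndarray) -> float:
    """Equation 4.5: R_t = ln(1 + P_{t+1} . A_t)."""
    return float(np.log(1.0 + log_return_vec @ weights))

class DifferentialSharpe:
    def __init__(self, alpha: float = 0.02):
        self.alpha = alpha   # smoothing factor, 0 <= alpha <= 1
        self.x = 0.0         # X_t: running first moment of log-returns
        self.y = 0.0         # Y_t: running second moment of log-returns

    def step(self, lr: float) -> float:
        dx = lr - self.x                  # eq. 4.9
        dy = lr ** 2 - self.y             # eq. 4.10
        denom = (self.y - self.x ** 2) ** 1.5
        reward = 0.0 if denom <= 0 else (self.y * dx + 0.5 * self.x * dy) / denom   # eq. 4.6
        self.x += self.alpha * dx         # eq. 4.7
        self.y += self.alpha * dy         # eq. 4.8
        return reward

dsr = DifferentialSharpe(alpha=0.02)
rewards = [dsr.step(lr) for lr in np.random.normal(0.0005, 0.01, 100)]
```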

Chapter 5

Survey of Reinforcement Learning


Techniques

5.1 Introduction to Reinforcement Learning


Reinforcement learning (RL) involves learning by interaction with an environment
led by a specified goal. The agent learns without being explicitly programmed,
selecting actions based on prior experience (J. Huang et al., 2020). RL is growing
more prominent as computer technologies such as artificial intelligence (AI) have
advanced over the years. Machine learning (ML) is a sub-field of AI that focuses
on the use of data and algorithms to emulate the way humans learn, eventually
improving its performance (Filos, 2019). There are mainly three categories of ML,
namely supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning involves learning from a set of labeled training data. The objec-
tive is to give the learning agent the capacity to generalize responses to circumstances
not in the training set (D. Silver et al., 2018). On the other hand, unsupervised learn-
ing is concerned with discovering hidden patterns and information in an unlabelled
data set (Harmon & Harmon, 1996). Finally, in reinforcement learning, an agent
learns by interacting with an unfamiliar environment. The agent receives input from
the environment in the form of a reward (or punishment) which it then utilizes to
gain experience and knowledge about the environment (Figure 5.1).

Figure 5.1: Components of an RL System (D. Silver, 2015)

The environment can be anything that processes an agent’s actions and the con-
sequences of those actions. The environment’s input is an agent’s action A(t)
performed in the current state S(t), and the environment’s output is the next
state S(t + 1) and reward R(t + 1). A state is a location or position in each
world/environment that an agent may reach or visit. The reward function R(t)
returns a numerical value to an agent for being in a state after performing an ac-
tion. Rewards indicate whether or not a state is valuable and how valuable that
state is (Contributors, 2021). According to the reward hypothesis, all objectives
may be characterized by maximizing the predicted cumulative reward. Actions are
anything that an agent is permitted to do in a given context within the environment.
An RL agent may include one or more of three components - a policy, a value
function, and a model. A policy directs an agent’s decision-making in each state. It
is a mapping between a set of states and a set of actions. An optimum policy provides
the best long-term benefits. A value function determines the quality of each state or
state-action pair. A model is an agent’s representation of the environment, through
which the agent predicts what the environment will do next (Filos, 2019; D. Silver
et al., 2018). RL agents can be classified into different classes based on which of
these three components they have (Figure 5.2). Model-free RL methods rely on trial
and error to update their experience and information about the given environment
because they lack knowledge of the transition model and reward function. Tables
5.1 and 5.2 provide a summary of some of the common model-free RL methods.

Figure 5.2: Taxonomy of RL Agents (Weng, 2018)

Table 5.1: Table of Model-Free RL Methods.

Algorithm Description
A2C Advantage Actor-Critic Algorithm
A3C Asynchronous Advantage Actor-Critic Algorithm
DDPG Deep Deterministic Policy Gradient
DQN Deep Q Network
Monte Carlo Every-visit Monte Carlo
NAF Q-Learning with Normalized Advantage Functions
PPO Proximal Policy Optimization
Q-learning State-action-reward-state
Q-learning - Lambda State action reward state with eligibility traces
REINFORCE Monte-Carlo sampling of policy gradient methods
SAC Soft Actor-Critic
SARSA State-action-reward-state-action
TD3 Twin Delayed Deep Deterministic Policy Gradient
TRPO Trust Region Policy Optimization

Table 5.2: Table of Model-Free RL Methods and Their Learning Mechanics.

Algorithm Policy Action Space State Space Operator Class


A2C On-policy Continuous Continuous Advantage Actor-Critic
A3C On-policy Continuous Continuous Advantage Actor-Critic
DDPG Off-policy Continuous Continuous Q-value Actor-Critic
DQN Off-policy Discrete Continuous Q-value Value-based
Monte Carlo Either Discrete Discrete Sample means Value-based
NAF Off-policy Continuous Continuous Advantage Value-based
PPO On-policy Continuous Continuous Advantage Actor-Critic
Q-learning Off-policy Discrete Discrete Q-value Value-based
Q-learning - Lambda Off-policy Discrete Discrete Q-value Value-based
REINFORCE On-policy Continuous Continuous Q-value Policy-based
SAC Off-policy Continuous Continuous Advantage Actor-Critic
SARSA On-policy Discrete Discrete Q-value Value-based
SARSA - Lambda On-policy Discrete Discrete Q-value Value-based
TD3 Off-policy Continuous Continuous Q-value Actor-Critic
TRPO On-policy Continuous Continuous Advantage Actor-Critic

5.2 Reinforcement Learning Approaches
There are mainly four common approaches to reinforcement learning - dynamic pro-
gramming, Monte Carlo methods, temporal difference and policy gradient. Dynamic
programming is a two-step process that uses the Bellman equation to improve the
policy after the policy evaluation has been done. Dynamic programming is used
when the model is fully known. Monte Carlo methods learn from raw experience
episodes without modeling the environmental dynamics and compute the observed
mean returns as an approximation of the predicted return. This means that learning
can only occur when all episodes have been completed (Betancourt & Chen, 2021).
Temporal Difference methods learn from incomplete episodes by using bootstrap-
ping to estimate the value. Temporal difference may be thought of as a hybrid of
dynamic programming and Monte Carlo. Unlike temporal difference approaches,
which rely on the value estimate to establish the optimum policy, policy gradient
methods rely directly on policy estimation. The policy, π(a|s; θ), is parameterized
by θ, and we apply gradient ascent to discover the best θ producing
the highest return. Policy learning can take place on- or off-policy. On-policy RL
agents analyze the same policy that generated the action. Off-policy RL agents, on
the other hand, analyze a policy that is not necessarily the same as the one that
generated the action (B. Li & Hoi, 2014).

Chapter 6

Trading Agents

6.1 Baseline Agents


To properly benchmark our RL agents, we compared their performance to four
baseline models. These benchmarks were chosen based on literature review and the
client’s recommendation. They include:
• Uniform Allocation
This involves setting the portfolio weights such that:

    A_{t,i} = 1/M  ∀ i    (6.1)

• Random Allocation
This involves setting the portfolio weights such that:

    A_{t,i} = f(i) / Σ_{i=1}^{M} f(i),  where f(i) is a function of a random variable    (6.2)

• Buy And Hold
This involves setting the portfolio weights such that:

    A_{t,i} = c_i  ∀ i,  0 ≤ c_i ≤ 1,    (6.3)

where c_i is a constant portfolio weight chosen based on the mean returns of the initial observation.

• Markowitz Model
This involves setting the portfolio weights such that:

    minimize_A  A^T Σ A,    (6.4)
    subject to  A^T µ = µ_target,
                1_M^T A = 1,
                A_i ≥ 0  ∀ i,

where Σ is the covariance matrix.
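For illustration, the sketch below computes the four baseline allocations with numpy and scipy; the choice of SLSQP, the target return, and the buy-and-hold scoring rule are assumptions, not prescribed by this report.

```python
# Sketches of the baseline allocations (equations 6.1-6.4); constants are illustrative.
import numpy as np
from scipy.optimize import minimize

def uniform_weights(m):                                   # eq. 6.1
    return np.full(m, 1.0 / m)

def random_weights(m, rng=np.random.default_rng(0)):      # eq. 6.2
    f = rng.random(m)
    return f / f.sum()

def buy_and_hold_weights(initial_returns):                # eq. 6.3 (one possible scoring rule)
    scores = initial_returns.mean(axis=0)
    scores = scores - scores.min() + 1e-8                 # shift so all scores are non-negative
    return scores / scores.sum()

def markowitz_weights(mu, cov, mu_target):                # eq. 6.4
    m = len(mu)
    constraints = [
        {"type": "eq", "fun": lambda a: a @ mu - mu_target},  # A^T mu = mu_target
        {"type": "eq", "fun": lambda a: a.sum() - 1.0},       # 1^T A = 1
    ]
    res = minimize(lambda a: a @ cov @ a, uniform_weights(m),
                   bounds=[(0.0, 1.0)] * m, constraints=constraints, method="SLSQP")
    return res.x

returns = np.random.default_rng(1).normal(0.0005, 0.01, (500, 4))   # toy daily returns
mu, cov = returns.mean(axis=0), np.cov(returns.T)
w = markowitz_weights(mu, cov, mu_target=mu.mean())
```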

6.2 Selection Criteria for RL Agents
The RL agents were chosen using three criteria. First, the RL agent must be model-
free as the focus of this work is on model-free RL. Second, the agent must have
been used in similar works in literature. Finally, the agent must support continuous
action and state spaces as the financial environment has continuous action and state
spaces. Using table 5.2 as the starting point, fifteen agents satisfied the first two
criteria. However, only nine agents out of the fifteen agents in that table satisfied
the third criterion. A3C was dropped because it is a modification of A2C that enjoys
the benefits of faster computational speed when the training is distributed. Since
we do not use distributed training, an A3C agent would not have offered additional
advantage. Thus, at the end of the selection process, we were left with the following
eight agents - A2C, DDPG, NAF, PPO, REINFORCE, SAC, TD3, TRPO.

6.3 Theoretical Description of Selected RL Agents


6.3.1 Normalized Advantage Function (NAF)
Q-learning is a temporal difference algorithm introduced in 1992 (Watkins & Dayan,
1992). Algorithm 1 provides the Q-learning algorithm.

Algorithm 1: Q-learning (Watkins & Dayan, 1992)


Initialize Q(s, a) arbitrarily
for each episode do
    Initialize s
    for each step of episode do
        Choose a from s using a policy derived from Q
        Take action a and observe reward r and new state s'
        Q(s, a) ← Q(s, a) + α (r + γ max_{a'} Q(s', a') − Q(s, a))
        s ← s'
    end
end

Because Q-learning is typically computationally impractical in large action and state


spaces, the Q-value function must be estimated. This may be accomplished by em-
ploying a neural network (Mnih et al., 2013). This structure is known as a Deep
Q-network (DQN), and it has served as the foundation for numerous successful ap-
plications of reinforcement learning to various tasks in recent years. DQN additionally
enhances and stabilizes the Q-learning training procedure
through the use of two novel techniques:
• Experience Replay
Over several episodes, experience tuples comprising (St , At , Rt , St+1 ) are kept
in a replay memory. During Q-learning updates, samples from the replay
memory are taken at random. This increases data utilization efficiency, breaks
down correlations in observation sequences, and smooths out variations in data
distribution.

• Periodically Updated Target
A target network, which is just a clone of the original Q network, is established
and updated on a regular basis. This improvement stabilizes the training by
removing the short-term fluctuations.
However, one significant limitation of the DQN and many of its variants is that they
cannot be used in continuous action spaces. (Gu, Lillicrap, Sutskever, & Levine,
2016) proposed the normalized advantage function, a continuous variant of the Q-
learning algorithm, as an alternative to policy-gradient and actor-critic approaches
in 2016. The NAF model applies Q-learning with experience replay to con-
tinuous action spaces (Algorithm 2). (Gu et al., 2016) also investigated the use of
learned models for speeding model-free reinforcement learning and discovered that
repeatedly updated local linear models are particularly successful. The authors
represented the advantage function (A) such that:

    Q(x, u | θ^Q) = A(x, u | θ^A) + V(x | θ^V)    (6.5)

    A(x, u | θ^A) = −0.5 (u − µ(x | θ^µ))^T P(x | θ^P) (u − µ(x | θ^µ))    (6.6)

P(x | θ^P) is a state-dependent, positive definite square matrix, parametrized as
P(x | θ^P) = L(x | θ^P) L(x | θ^P)^T, where L(x | θ^P) is a lower-triangular matrix whose
entries come from a linear output of a neural network, with the diagonal terms exponentiated.
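A minimal numpy sketch of this advantage parameterization (equations 6.5 and 6.6), assuming the network has already produced the lower-triangular entries, the mean action µ, and the state value V for a single state.

```python
# Numpy sketch of the NAF quadratic advantage and Q-value for one state x;
# `l_entries`, `mu` and `v` stand in for network outputs.
import numpy as np

def naf_q_value(u, mu, l_entries, v, action_dim):
    L = np.zeros((action_dim, action_dim))
    tril = np.tril_indices(action_dim)
    L[tril] = l_entries                                      # raw linear outputs of the network
    L[np.diag_indices(action_dim)] = np.exp(np.diag(L))      # exponentiate the diagonal terms
    P = L @ L.T                                              # state-dependent positive definite matrix
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff                       # eq. 6.6
    return advantage + v                                     # eq. 6.5: Q = A + V

q = naf_q_value(u=np.array([0.1, -0.2]), mu=np.zeros(2),
                l_entries=np.array([0.3, 0.1, -0.4]), v=1.7, action_dim=2)
```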

Algorithm 2: Continuous Q-Learning With NAF (Gu et al., 2016)


Randomly initialize normalized Q network Q(x, u | θ^Q)
Initialize target network with weights θ^{Q'} ← θ^Q
Initialize replay buffer R
for episode = 1...M do
    Initialize a random process N for action exploration
    Receive initial observation state x_1 ∼ p(x_1)
    for t = 1...T do
        Select action u_t = µ(x_t | θ^µ) + N_t
        Execute u_t and observe r_t and x_{t+1}
        Store transition (x_t, u_t, r_t, x_{t+1}) in R
        for iteration = 1...I do
            Sample a random minibatch of m transitions from R
            Set y_i = r_i + γ V'(x_{i+1} | θ^{Q'})
            Update θ^Q by minimizing the loss L = (1/m) Σ_i (y_i − Q(x_i, u_i | θ^Q))^2
            Update the target network: θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
        end
    end
end

6.3.2 REINFORCE
Policy-Gradient algorithms learn a parameterized policy that can select actions with-
out consulting a value function. While a value function may be used to learn the
policy parameters, it is not required for selecting actions. In equation 6.7, we can
express the policy as the probability of action a being taken at time t given that the
environment is in state s at time t with parameter θ (Sutton & Barto, 2018).

    π(a | s, θ) = Pr{A_t = a | S_t = s, θ_t = θ}    (6.7)

Policy gradient (PG) approaches are model-free methods that attempt to maximize
the RL goal without using a value function. The RL goal, also known as the perfor-
mance measure J(θ), is defined as the total of rewards from the beginning state to
the terminal state for an episodic task and the average return for a continuous task
when policy πθ is followed.

    J(θ) ≐ v_{π_θ}(s_0) = E_{τ∼π_θ(τ)} [ Σ_t γ^t r(s_t, a_t) ]    (6.8)

Where the value function vπθ (s0 ) is the value of the expected discounted sum of
rewards for a trajectory starting at state s0 and following policy πθ until the episode
terminates. This objective can be evaluated in an unbiased manner by sampling N
trajectories from the environment using policy πθ :

    J(θ) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=0}^{T_i − 1} γ^t r(s_{i,t}, a_{i,t})    (6.9)
Ti is the timestep in which trajectory τi terminates.
The probability distribution πθ (a|s) can be defined:
• over a discrete action space, in which case the distribution is usually categorical
with a softmax over the action logits.
• over a continuous action space, in which case the output is the parameters of
a continuous distribution (e.g. the mean and variance of a gaussian).
The gradient with respect to θ according to the policy gradient theorem can be
approximated over N trajectories as:

    ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=0}^{T_i − 1} G_{i,t} ∇_θ log π_θ(a_{i,t} | s_{i,t}) ]    (6.10)
where a_{i,t} is the action taken at time t of episode i in state s_{i,t}, T_i is the timestep at
which trajectory τ_i terminates, and G_{i,t} is a function of the reward assigned to this
action. For REINFORCE, G_{i,t} is the sum of rewards in trajectory i:

    G_{i,t} = Σ_{t'=0}^{T_i − 1} r(s_{i,t'}, a_{i,t'})    (6.11)
Algorithm 3: REINFORCE: Monte-Carlo Policy Gradient Control (episodic)

Initialize policy network with weights θ
for each episode {s_0, a_0, r_1, ..., s_{T−1}, a_{T−1}, r_T} sampled from policy π_θ do
    for t = 0...T − 1 do
        Evaluate the gradient
            ∇_θ J(θ) ≈ (1/N) Σ_{i=1}^{N} [ Σ_{t=0}^{T_i − 1} G_{i,t} ∇_θ log π_θ(a_{i,t} | s_{i,t}) ]
        Update the policy parameters
            θ ← θ + α ∇_θ J(θ)
    end
end
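The following PyTorch sketch shows one REINFORCE update for a Gaussian policy; the network architecture, the optimizer, and the toy trajectory are illustrative assumptions, with only the loss mirroring equation 6.10.

```python
# PyTorch sketch of a single REINFORCE update for a Gaussian policy.
import torch
from torch.distributions import Normal

obs_dim, act_dim = 30 * 5, 5                        # flattened lookback window, 5 assets (illustrative)
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.Tanh(),
                             torch.nn.Linear(64, act_dim))
log_std = torch.zeros(act_dim, requires_grad=True)
optimizer = torch.optim.Adam(list(policy.parameters()) + [log_std], lr=3e-4)

# One sampled episode (toy tensors standing in for environment interaction).
states = torch.randn(100, obs_dim)
actions = torch.randn(100, act_dim)
rewards = torch.randn(100)

G = rewards.sum()                                   # eq. 6.11: sum of rewards in the trajectory
dist = Normal(policy(states), log_std.exp())
log_probs = dist.log_prob(actions).sum(dim=-1)      # log pi_theta(a_t | s_t)
loss = -(G * log_probs).mean()                      # ascend eq. 6.10 by minimizing -J

optimizer.zero_grad()
loss.backward()
optimizer.step()
```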

Limitations
1. The update procedure is inefficient: each trajectory is discarded after it has been
used once to update the parameters.
2. The gradient estimate is noisy, and there is a chance that the gathered trajectory
does not accurately represent the policy.
3. There is no explicit credit assignment. A trajectory can contain numerous
good or bad actions, and whether or not these actions are reinforced
is determined only by the final outcome.
Other Policy Gradient methods like A2C, DDPG, TD3, SAC, and PPO were created
to overcome the limitations of REINFORCE.

6.3.3 Deep Deterministic Policy Gradient (DDPG)


Following the success of the Deep-Q Learning algorithm, which beat humans in Atari
games, DeepMind applied the same concept to physics challenges, where the action
space is considerably larger than in Atari games. Deep Q-Learning performed well
in high-dimensional state spaces but not in high-dimensional action spaces (con-
tinuous action). To cope with high-dimensional (continuous) action spaces, DDPG
blends Deep Learning and Reinforcement Learning approaches. DDPG employs the
concepts of an experience replay buffer, in which the network is trained off-policy
by sampling experience batches, and target networks, in which copies of the net-
work are created for use in objective functions to avoid divergence and instability in
complex and non-linear function approximators such as neural networks (Lillicrap
et al., 2019).
Aside from using a neural network to parameterize the Q-function "critic," as shown
in DQN, we also have the policy network called "actor" to parameterize the policy
function. The policy is simply the behaviour of the agent, "a mapping from state to
action" in the case of a deterministic policy or "a distribution of actions" in the case
of a stochastic policy. Since we have two networks, there are two sets of parameters
to update:
1. The parameters of the policy network have to be updated in order to maximize
the performance measure J defined in the policy gradient theorem
2. The parameters of the critic network are updated in order to minimize the
temporal difference loss L

    L(w) = (1/N) Σ_i (y_i − q̂(s_i, a_i, w))^2    (6.12)

    ∇_θ J(θ) ≈ (1/N) Σ_i ∇_a q̂(s, a, w)|_{s=s_i, a=π(s_i)} ∇_θ π(s, θ)|_{s=s_i}    (6.13)

To maximize the Q-value function while reducing the temporal difference loss, we
must enhance the performance measure J. The actor takes the state as input and
outputs an action, whereas the critic takes both the state and the action as input
and outputs the value of the Q function. The critic uses gradient temporal-difference
learning, whilst the actor parameters are discovered using the Policy gradient theo-
rem. The essential principle of this design is that the policy network acts, resulting
in an action, and the Q-network critiques that action.
The use of non-linear function approximators such as neural networks, which are
required to generalize over vast state spaces, means that convergence is no longer
assured, as it was with Q-learning. As a result, experience replay is required in
order to generate independent and identically distributed samples. In addition, target
networks must be used to avoid divergence when updating the critic network. In
DDPG, the target parameters are changed differently than in DQN, where the target network
is updated every C steps. Using "soft" updates, the parameters of the target
networks are changed at each time step as shown:

    w⁻ ← τ w + (1 − τ) w⁻,    θ⁻ ← τ θ + (1 − τ) θ⁻    (6.14)

where τ ≪ 1, w⁻ denotes the weights of the target Q network, and θ⁻ denotes the weights
of the target policy network.
The weights of target networks are limited to fluctuate slowly using "soft" updates,
boosting the stability of learning and convergence outcomes. The target network is
then utilized instead of the Q-network in the temporal difference loss. In algorithms
such as DDPG, the problem of exploration may be addressed quite easily and in-
dependently of the learning process. The actor policy is then supplemented with
noise taken from a noise process N to create the exploration policy. The exploration
policy then becomes:

    π'(S_t) = π(S_t, θ) + v    (6.15)

where v is sampled from an Ornstein-Uhlenbeck process, a stochastic process capable of
producing temporally correlated noise that provides smooth exploration in physical
control problems.
Algorithm 4: DDPG (Lillicrap et al., 2019)
Randomly initialize critic network Q(s, a | θ^Q) and actor µ(s | θ^µ) with weights θ^Q and θ^µ
Initialize target networks Q' and µ' with weights θ^{Q'} ← θ^Q, θ^{µ'} ← θ^µ
Initialize replay buffer R
for episode = 1...M do
    Initialize a random process N for action exploration
    Receive initial observation state s_1
    for t = 1...T do
        Select action a_t = µ(s_t | θ^µ) + N_t according to the current policy and exploration noise
        Execute action a_t and observe reward r_t and new state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in R
        Sample a random minibatch of N transitions (s_i, a_i, r_i, s_{i+1}) from R
        Set y_i = r_i + γ Q'(s_{i+1}, µ'(s_{i+1} | θ^{µ'}) | θ^{Q'})
        Update the critic by minimizing the loss: L = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))^2
        Update the actor policy using the sampled policy gradient:
            ∇_{θ^µ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=µ(s_i)} ∇_{θ^µ} µ(s | θ^µ)|_{s_i}
        Update the target networks:
            θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}
            θ^{µ'} ← τ θ^µ + (1 − τ) θ^{µ'}
    end
end
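A short PyTorch sketch of the "soft" target-network update in equation 6.14; the helper name and the stand-in networks are illustrative.

```python
# PyTorch sketch of the soft target update used by DDPG (eq. 6.14).
import copy
import torch

def soft_update(target_net, source_net, tau=0.005):
    with torch.no_grad():
        for t_param, s_param in zip(target_net.parameters(), source_net.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)   # theta' <- tau*theta + (1-tau)*theta'

critic = torch.nn.Linear(10, 1)            # stand-in for the critic network
target_critic = copy.deepcopy(critic)      # target starts as an exact copy
soft_update(target_critic, critic, tau=0.005)
```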

6.3.4 Twin Delayed Deep Deterministic Policy Gradient (TD3)


The DQN method is known to exhibit overestimation bias, which means that it
overestimates the value function. This is because the target Q value
is an approximation, and taking the maximum over an estimate reinforces
the approximation error. This difficulty prompted various enhancements
to the underlying DQN algorithm. TD3 applies several algorithmic
tricks to DDPG, itself an algorithm designed to improve on DQN, to drastically limit
the possibility of overestimation bias. The algorithmic tricks are clipped double
Q-learning, delayed policy/target updates, and target policy smoothing.
TD3 employs six neural networks: one actor, two critics, and their corresponding target
networks. Clipped double Q-learning uses the lower of the two critics' estimates,
favoring underestimation of the value function, which, unlike overestimation, does not
propagate as readily through the training process. To limit the volatility
in the value estimation, TD3 updates the policy at a reduced frequency (Delayed
Policy and Targets Updates).
The policy network remains unchanged until the value error is small enough. Furthermore,
rather than simply copying the weights every k steps, the target networks
are updated using Polyak averaging. Finally, to avoid overfitting, TD3
smooths the value function by adding a small amount of clipped random noise to
the chosen action and averaging over mini-batches. Algorithm 5 depicts the TD3
framework and the places where these algorithmic tricks were used.

Algorithm 5: TD3 (Fujimoto, Hoof, & Meger, 2018)


Initialize critic networks Q_{θ1}, Q_{θ2} and actor network π_φ with random parameters θ1, θ2, φ
Initialize target networks θ1' ← θ1, θ2' ← θ2, φ' ← φ
Initialize replay buffer B
for t = 1...T do
    Select action with exploration noise a ∼ π(s) + ε, ε ∼ N(0, σ), and observe reward r and new state s'
    Store transition tuple (s, a, r, s') in B
    Sample a mini-batch of N transitions (s, a, r, s') from B
    ã ← π_{φ'}(s') + ε, ε ∼ clip(N(0, σ̃), −c, c)          // Target policy smoothing
    y ← r + γ min_{i=1,2} Q_{θi'}(s', ã)                    // Clipped double Q-learning
    Update critics: θ_i ← argmin_{θ_i} N^{−1} Σ (y − Q_{θi}(s, a))^2
    if t mod d then
        // Delayed update of policy and target networks
        Update φ by the deterministic policy gradient:
            ∇_φ J(φ) = N^{−1} Σ ∇_a Q_{θ1}(s, a)|_{a=π_φ(s)} ∇_φ π_φ(s)
        Update the target networks:
            θ_i' ← τ θ_i + (1 − τ) θ_i'
            φ' ← τ φ + (1 − τ) φ'
    end
end
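The PyTorch sketch below illustrates TD3's target computation (target policy smoothing followed by the clipped double-Q minimum) from Algorithm 5; the networks, noise scales, and batch are illustrative stand-ins, and clipping the action to its valid range is omitted for brevity.

```python
# PyTorch sketch of TD3's smoothed, clipped double-Q target.
import torch

def td3_target(reward, next_state, actor_target, critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        noise = (torch.randn_like(actor_target(next_state)) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = actor_target(next_state) + noise                    # target policy smoothing
        q1 = critic1_target(torch.cat([next_state, next_action], dim=-1))
        q2 = critic2_target(torch.cat([next_state, next_action], dim=-1))
        return reward + gamma * torch.min(q1, q2)                         # clipped double Q-learning

state_dim, act_dim = 10, 4
actor_t = torch.nn.Linear(state_dim, act_dim)
critic1_t = torch.nn.Linear(state_dim + act_dim, 1)
critic2_t = torch.nn.Linear(state_dim + act_dim, 1)
y = td3_target(torch.zeros(32, 1), torch.randn(32, state_dim), actor_t, critic1_t, critic2_t)
```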

6.3.5 Advantage Actor Critic (A2C)


A2C is a policy gradient method that combines two types of reinforcement learning
algorithms: policy-based and value-based. The actor-critic algorithm is composed
of two distinct structures: one for storing and updating the value function and the
other for storing the updated policy. The agent chooses the action based on the
policy rather than the value function, where the policy component is called the
actor, which conducts an action, changes the value of the function, and uses the
value function to assess the action, and the value function part is called the critic.
To lower the variance of the policy gradient, it employs an advantage (equation
6.16). The critic network assesses the advantage function rather than the value
function only (Tang, 2018).

    ∇_θ J(θ) ∼ Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t | s_t) A(s_t, a_t)    (6.16)

Thus, the evaluation of an action considers not only how good the action is,
but also how much better it is than expected, which lowers the high variance of the policy
gradient and makes the model more robust (Konda & Gao, 2000).
The value function of the critic component can also make use of temporal difference
error (TD error) computed using the TD learning approach.
The advantage of the actor-critic algorithm is that it separates the policy from the
value function: the critic is a value-function approximator that learns a value estimate
and passes it to the actor, while the actor is a policy approximator that learns a
stochastic policy and selects actions using gradient-based policy updates (Fig. 6.1).

Figure 6.1: Actor Critic Algorithm Framework (Sutton & Barto, 2018)

The Policy gradient is defined as follows:


"T −1 #
X
Oθ J(θ) = Eτ Oθ logπθ (at |st )Gt (6.17)
t=0

Introducing baseline b(s):


"T −1 #
X
Oθ J(θ) = E Oθ logπθ (at |st )(Gt − b(st )) (6.18)
t=0

Ert+1 ,st+1 ,...,rT ,sT [Gt ] = Q(st , at ) (6.19)

Plugging that in, we can rewrite the update equation as such:


"T −1 #
X
Oθ J(θ) = Es0 ,a0 ,...,st ,at Oθ logπθ (at |st ) Qw (st , at ) (6.20)
t=0

"T −1 #
X
= Eτ Oθ logπθ (at |st )Qw (st , at ) (6.21)
t=0

Algorithm 6: Actor Critic


Initialize parameters s, θ, w and learning rates α_θ, α_w; sample a ∼ π_θ(a|s)
for t = 1, 2, ...T do
    Sample reward r_t ∼ R(s, a) and next state s' ∼ P(s'|s, a)
    Then sample the next action a' ∼ π_θ(a'|s')
    Update the policy parameters: θ ← θ + α_θ Q_w(s, a) ∇_θ log π_θ(a|s)
    Compute the correction (TD error) for the action-value at time t:
        δ_t = r_t + γ Q_w(s', a') − Q_w(s, a)
    and use it to update the parameters of the Q function:
        w ← w + α_w δ_t ∇_w Q_w(s, a)
    Move to a ← a' and s ← s'
end
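A small Python sketch of the TD error used as the correction signal in Algorithm 6; the numeric inputs are toy values.

```python
# One-step TD error, the critic's correction signal in Algorithm 6.
def td_error(reward, q_sa, q_next, gamma=0.99, done=False):
    """delta_t = r_t + gamma * Q_w(s', a') - Q_w(s, a)."""
    target = reward + (0.0 if done else gamma * q_next)
    return target - q_sa

delta = td_error(reward=0.01, q_sa=0.52, q_next=0.55)
```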

6.3.6 Soft Actor Critic (SAC)
As a bridge between stochastic policy optimization and DDPG techniques, the Soft
Actor-Critic Algorithm is an off-policy algorithm that optimizes a stochastic pol-
icy. Entropy regularization is the primary aspect of SAC. The policy is trained to
optimize a trade-off between anticipated return and entropy, which is a measure of
the policy’s unpredictability. SAC is thoroughly described by first establishing the
entropy regularized reinforcement learning setup and the value functions associated
with it (Haarnoja, Zhou, Abbeel, & Levine, 2018).
The SAC algorithm learns a policy as well as two Q-functions, Q1 and Q2. There
are two basic SAC variants: one that uses a constant entropy regularization
coefficient and another that enforces an entropy constraint by adapting the coefficient
during training. The constant-coefficient variant is simpler to implement, although the
entropy-constrained variant is more widely used, according to (Haarnoja et al., 2018).
The Q-functions are trained in a manner similar to TD3 as described above, with a few
key modifications.
At each step, the policy should act so as to maximize the expected future return plus
the expected future entropy. That is, it should maximize V^π(s), which
we expand out into

$V^\pi(s) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a)\big] + \alpha H\big(\pi(\cdot \mid s)\big) = \mathbb{E}_{a \sim \pi}\big[Q^\pi(s, a) - \alpha \log \pi(a \mid s)\big]$

Algorithm 7: Soft Actor Critic
Input: initial policy parameters θ, Q-function parameters φ1, φ2, empty replay buffer D
Set target parameters equal to main parameters: φtarg,1 ← φ1, φtarg,2 ← φ2
Repeat the following steps until convergence:
    Observe state s and select action a ∼ πθ(· | s)
    Execute a in the environment
    Observe next state s′, reward r, and done signal d indicating whether s′ is terminal
    Store (s, a, r, s′, d) in replay buffer D
    If s′ is terminal, reset the environment state
    if it's time to update then
        for k = 0, 1, 2, ... do
            Randomly sample a batch of transitions B = {(s, a, r, s′, d)} from D
            Compute targets for the Q-functions:
                y(r, s′, d) = r + γ(1 − d) ( min_{i=1,2} Qφtarg,i(s′, ã′) − α log πθ(ã′ | s′) ),   ã′ ∼ πθ(· | s′)
            Update the Q-functions by one step of gradient descent using
                ∇φi (1/|B|) Σ_{(s,a,r,s′,d)∈B} ( Qφi(s, a) − y(r, s′, d) )²   for i = 1, 2
            Update the policy by one step of gradient ascent using
                ∇θ (1/|B|) Σ_{s∈B} ( min_{i=1,2} Qφi(s, ãθ(s)) − α log πθ(ãθ(s) | s) )
            where ãθ(s) is a sample from πθ(· | s) which is differentiable w.r.t. θ via the
            reparametrization trick.
            Update the target networks with
                φtarg,i ← ρ φtarg,i + (1 − ρ) φi   for i = 1, 2
        end
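The clipped double-Q target in Algorithm 7 is the heart of the critic update. The following numpy sketch shows how that target might be computed for a sampled batch, assuming the target Q-networks and the policy sampler are supplied as callables; these names and the batch layout are illustrative assumptions rather than the implementation used in this project.

```python
import numpy as np

def sac_q_targets(batch, q1_targ, q2_targ, policy_sample, alpha, gamma):
    """Compute y(r, s', d) = r + gamma*(1-d)*(min_i Q_targ_i(s', a') - alpha*log pi(a'|s'))."""
    next_states = batch["next_states"]

    # Sample a' ~ pi_theta(.|s') together with its log-probability
    next_actions, next_logp = policy_sample(next_states)

    # Clipped double-Q estimate reduces overestimation bias
    q_min = np.minimum(q1_targ(next_states, next_actions),
                       q2_targ(next_states, next_actions))

    # Entropy-regularized Bellman backup; the done flags mask out the bootstrap term
    return batch["rewards"] + gamma * (1.0 - batch["dones"]) * (q_min - alpha * next_logp)
```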

6.3.7 Trust Region Policy Optimization (TRPO)


TRPO is based on trust region optimization, which ensures monotonic improvement
by adding a trust region constraint that limits how far the new policy may move
from the old one (Schulman, Levine, Moritz, Jordan, & Abbeel, 2015). The constraint
is defined in terms of the KL divergence, a measure of the distance between
probability distributions (Joyce, 2011).
In TRPO, the goal is to optimize an objective function denoted η(π). Consider an
infinite-horizon discounted Markov decision process (MDP), defined by the tuple
(S, A, P, r, ρ0, γ), where S is a finite set of states, A is a finite set of actions,
P : S × A × S → R is the transition probability distribution, r : S → R is the reward
function, ρ0 : S → R is the distribution of the initial state s0, and γ ∈ (0, 1) is the
discount factor (Schulman et al., 2015).

Let π denote a stochastic policy π : S × A → [0, 1], and let η(π) denote its expected
discounted reward:

$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\!\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right]$, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$

In TRPO, the state-action value function Qπ, the value function Vπ, and the advantage
function Aπ are defined as:

$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\!\left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right]$

$V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\!\left[ \sum_{l=0}^{\infty} \gamma^l r(s_{t+l}) \right]$

$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$, where $a_t \sim \pi(a_t \mid s_t)$, $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ for $t \geq 0$

The following useful identity expresses the expected return of another policy π̃ in
terms of the advantage over π, accumulated over time-steps (Schulman et al., 2015):

$\eta(\tilde{\pi}) = \eta(\pi) + \mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}\!\left[ \sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t) \right]$   (6.22)

where the notation $\mathbb{E}_{s_0, a_0, \ldots \sim \tilde{\pi}}[\ldots]$ indicates that actions are sampled $a_t \sim \tilde{\pi}(\cdot \mid s_t)$.
Let ρπ be the (unnormalized) discounted visitation frequencies

$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \ldots$

where s0 ∼ ρ0 and the actions are chosen according to π. Equation 6.22 can be re-written
with a sum over states instead of time steps:

$\eta(\tilde{\pi}) = \eta(\pi) + \sum_{t=0}^{\infty} \sum_s P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, \gamma^t A_\pi(s, a)$
$\qquad\; = \eta(\pi) + \sum_s \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \tilde{\pi}) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$
$\qquad\; = \eta(\pi) + \sum_s \rho_{\tilde{\pi}}(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$   (6.23)

Equation 6.23 implies that any policy update π → π̃ that has a non-negative
expected advantage at every state s, i.e., $\sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) \geq 0$, is guaranteed
to increase the policy performance η, or leave it constant in the case that the
expected advantage is zero everywhere. This implies the classic result that the
update performed by exact policy iteration, which uses the deterministic policy
$\tilde{\pi}(s) = \arg\max_a A_\pi(s, a)$, improves the policy if there is at least one state-action pair
with a positive advantage value and nonzero state visitation probability; otherwise,
the algorithm has converged to the optimal policy. However, in the approximate
setting, it will typically be unavoidable, due to estimation and approximation error,
that there will be some states s for which the expected advantage is negative, that
is, $\sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a) < 0$. The complex dependency of $\rho_{\tilde{\pi}}(s)$ on π̃ makes Equation
6.23 difficult to optimize directly. Instead, the following local approximation to η is
introduced (Schulman et al., 2015):

$L_\pi(\tilde{\pi}) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde{\pi}(a \mid s)\, A_\pi(s, a)$   (6.24)

Lπ uses the visitation frequency ρπ rather than ρπ̃, ignoring changes in state visitation
density due to changes in the policy. However, when using a parameterized
policy πθ, where πθ(a | s) is a differentiable function of the parameter vector θ, then
Lπ matches η to first order. That is, for any parameter value θ0,

$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0})$
$\nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\,\big|_{\theta=\theta_0} = \nabla_\theta\, \eta(\pi_\theta)\,\big|_{\theta=\theta_0}$   (6.25)

Equations 6.24 and 6.25 imply that a sufficiently small step $\pi_{\theta_{old}} \to \tilde{\pi}$ that improves
$L_{\pi_{\theta_{old}}}$ will also improve η, but they do not give us any guidance on how big of a step
to take (Schulman et al., 2015).
To address this issue, (Kakade & Langford, 2002) proposed a policy updating scheme
called conservative policy iteration, for which they could provide explicit lower
bounds on the improvement of η. To define the conservative policy iteration update,
let $\pi_{old}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_{old}}(\pi')$. The new
policy $\pi_{new}$ was defined to be the following mixture (Schulman et al., 2015):

$\pi_{new}(a \mid s) = (1 - \alpha)\,\pi_{old}(a \mid s) + \alpha\,\pi'(a \mid s)$   (6.26)

(Kakade & Langford, 2002) derived the following lower bound:

$\eta(\pi_{new}) \geq L_{\pi_{old}}(\pi_{new}) - \dfrac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2$, where $\epsilon = \max_s \big|\, \mathbb{E}_{a \sim \pi'(a \mid s)}[A_\pi(s, a)] \,\big|$   (6.27)

Equation 6.27, which applies to conservative policy iteration, implies that a policy
update that improves the right-hand side is guaranteed to improve the true
performance η. The principal theoretical result of (Schulman et al., 2015) is that the
policy improvement bound in Equation 6.27 can be extended to general stochastic
policies, rather than just mixture policies, by replacing α with a distance measure
between π and π̃, and changing the constant ε appropriately. Since mixture policies
are rarely used in practice, this result is crucial for extending the improvement
guarantee to practical problems. The particular distance measure used is the total
variation divergence, which is defined by $D_{TV}(p \,\|\, q) = \frac{1}{2}\sum_i |p_i - q_i|$ for discrete
probability distributions p, q. Define $D_{TV}^{max}(\pi, \tilde{\pi})$ as

$D_{TV}^{max}(\pi, \tilde{\pi}) = \max_s D_{TV}\big(\pi(\cdot \mid s) \,\|\, \tilde{\pi}(\cdot \mid s)\big)$   (6.28)
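As a small numerical illustration (not taken from this project's code), the snippet below computes the total variation divergence and the KL divergence between two discrete action distributions at a single state; the probability values are made up for the example.

```python
import numpy as np

def total_variation(p, q):
    """D_TV(p || q) = 0.5 * sum_i |p_i - q_i| for discrete distributions."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

pi_old = np.array([0.5, 0.3, 0.2])   # old policy's action probabilities at some state
pi_new = np.array([0.4, 0.4, 0.2])   # new policy's action probabilities at the same state
print(total_variation(pi_old, pi_new))  # 0.1
print(kl_divergence(pi_old, pi_new))    # ~0.025
```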

Since the total variation divergence is related to the KL divergence by
$D_{TV}(p \,\|\, q)^2 \leq D_{KL}(p \,\|\, q)$, the improvement bound can be stated in terms of the
KL divergence. By performing the following maximization, we are guaranteed to
improve the true objective η (Schulman et al., 2015):

$\underset{\theta}{\text{maximize}} \;\; L_{\theta_{old}}(\theta) - C\, D_{KL}^{max}(\theta_{old}, \theta)$   (6.29)

In practice, however, the penalty coefficient C suggested by the theory yields very
small step sizes. Trust region policy optimization therefore uses a constraint on the
KL divergence rather than a penalty, which robustly allows larger updates. TRPO
thus solves the following optimization problem to generate a policy update
(Schulman et al., 2015):

$\underset{\theta}{\text{maximize}} \;\; L_{\theta_{old}}(\theta) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$   (6.30)

TRPO seeks to solve the following optimization problem, obtained by expanding
$L_{\theta_{old}}$ as defined in Equation 6.24:

$\underset{\theta}{\text{maximize}} \;\; \sum_s \rho_{\theta_{old}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_{old}}(s, a) \quad \text{subject to} \quad \bar{D}_{KL}^{\rho_{\theta_{old}}}(\theta_{old}, \theta) \leq \delta$   (6.31)

The optimization problem in Equation 6.31 is exactly equivalent to the following
one, written in terms of expectations (Schulman et al., 2015):

$\underset{\theta}{\text{maximize}} \;\; \mathbb{E}_{s \sim \rho_{\theta_{old}},\, a \sim q}\!\left[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_{old}}(s, a) \right] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{old}}}\!\big[ D_{KL}\big(\pi_{\theta_{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \big] \leq \delta$   (6.32)

Algorithm 8: Trust Region Policy Optimization (Schulman et al., 2015)
Input: initial policy parameters θ0, initial value function parameters φ0
Hyperparameters: KL-divergence limit δ, backtracking coefficient α, maximum number of backtracking steps K
for k = 0, 1, 2, ... do
    Collect a set of trajectories Dk = {τi} by running policy πk = π(θk) in the environment.
    Compute rewards-to-go R̂t and advantage estimates Ât (using any method of advantage
    estimation) based on the current value function Vφk.
    Estimate the policy gradient as
        ĝk = (1/|Dk|) Σ_{τ∈Dk} Σ_{t=0}^{T} ∇θ log πθ(at | st)|θk Ât
    Use the conjugate gradient algorithm to compute
        x̂k ≈ Ĥk⁻¹ ĝk
    where Ĥk is the Hessian of the sample-average KL-divergence.
    Update the policy by backtracking line search with
        θk+1 = θk + α^j sqrt( 2δ / (x̂kᵀ Ĥk x̂k) ) x̂k
    where j ∈ {0, 1, 2, ..., K} is the smallest value that improves the sample loss and
    satisfies the sample KL-divergence constraint.
    Fit the value function by regression on mean-squared error,
        φk+1 = arg min_φ (1/(|Dk| T)) Σ_{τ∈Dk} Σ_{t=0}^{T} ( Vφ(st) − R̂t )²
    typically via some gradient descent algorithm.
end
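The step x̂k ≈ Ĥk⁻¹ĝk in Algorithm 8 is usually obtained without forming Ĥk explicitly. Below is a minimal conjugate gradient sketch that solves Hx = g using only Hessian-vector products; the callable name `hvp`, the iteration count, and the tolerance are assumptions for illustration rather than this project's implementation.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g given only the product hvp(v) = H v."""
    x = np.zeros_like(g)
    r = g.copy()              # residual g - H x (x starts at zero)
    p = g.copy()              # current search direction
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        step = rs_old / (p @ Hp)
        x += step * p
        r -= step * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Illustrative check with an explicit symmetric positive-definite matrix
H = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -1.0])
x_hat = conjugate_gradient(lambda v: H @ v, g)
print(np.allclose(H @ x_hat, g))  # True
```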

6.3.8 Proximal Policy Optimization (PPO)


As stated in the original paper (Schulman, Wolski, Dhariwal, Radford, & Klimov,
2017), PPO is an algorithm that achieves the data efficiency and reliability of TRPO
while using only first-order optimization. PPO provides a novel objective function
with clipped probability ratios that gives a pessimistic estimate (i.e., a lower bound)
of the policy's performance. To optimize policies, PPO alternates between collecting
data from the policy and executing many epochs of optimization on the sampled
data (Schulman et al., 2017).
PPO is a family of policy gradient techniques that alternates between sampling
data through interaction with the environment and maximizing a "surrogate"
objective function using stochastic gradient ascent. Unlike traditional policy gradient
approaches, which perform one gradient update per data sample, PPO employs an
objective function that allows for several epochs of mini-batch updates
(Schulman et al., 2017).
The objective function of PPO can be represented as (Schulman et al., 2017):

$L(s, a, \theta_k, \theta) = \min\!\left( \frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\; g\big(\epsilon, A^{\pi_{\theta_k}}(s, a)\big) \right)$   (6.33)

where

$g(\epsilon, A) = \begin{cases} (1+\epsilon)A & \text{if } A \geq 0 \\ (1-\epsilon)A & \text{if } A < 0 \end{cases}$   (6.34)

In the implementation, PPO maintains two policy networks. The first is the current
policy that is being refined (Schulman et al., 2017):

$\pi_\theta(a_t \mid s_t)$   (6.35)

The second is the policy that was last used to collect samples:

$\pi_{\theta_k}(a_t \mid s_t)$   (6.36)

PPO alternates between sampling data by interacting with the environment and
optimizing the policy. To increase sample efficiency, the new policy is evaluated using
samples collected under an earlier policy via importance sampling (Schulman et al.,
2017):

$\underset{\theta}{\text{maximize}} \;\; \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \hat{A}_t \right]$   (6.37)

As the current policy is updated, the gap between it and the previous policy grows.
The variance of this importance-sampling estimate grows with it, which leads to
poor updates. Therefore, every few iterations (say, every four), the second network
is resynchronized with the updated policy (Schulman et al., 2017).
With the clipped objective, we compute a ratio between the new policy and the old
policy (Schulman et al., 2017):

$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}$   (6.38)

This ratio compares the two policies. If the new policy moves far from the previous
policy, the objective clips the estimated advantage function. The new objective
function is now (Schulman et al., 2017):

$L^{CLIP}_{\theta_k}(\theta) = \mathbb{E}_{\tau \sim \pi_k}\!\left[ \sum_{t=0}^{T} \min\!\big( r_t(\theta)\, \hat{A}^{\pi_k}_t,\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}^{\pi_k}_t \big) \right]$   (6.39)

The advantage function is clipped whenever the probability ratio between the new
and old policies falls outside the range $(1-\epsilon,\, 1+\epsilon)$. The clipping restricts how much
the policy can effectively change at each step, which increases stability and prevents
excessively large policy updates (Schulman et al., 2017). As a result, the method can
be expressed as shown in Algorithm 9.

Algorithm 9: PPO with Clipped Objective (Schulman et al., 2017)

Input: initial policy parameters θ0, clipping threshold ε
for k = 0, 1, 2, ... do
    Collect a set of partial trajectories Dk on policy πk = π(θk)
    Estimate advantages Ât^{πk} using any advantage estimation algorithm
    Compute the policy update
        θk+1 = arg max_θ L^{CLIP}_{θk}(θ)
    by taking K steps of minibatch SGD (via Adam), where
        L^{CLIP}_{θk}(θ) = E_{τ∼πk}[ Σ_{t=0}^{T} min( rt(θ) Ât^{πk}, clip(rt(θ), 1 − ε, 1 + ε) Ât^{πk} ) ]
end
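To make Equation 6.39 concrete, the following numpy sketch evaluates the clipped surrogate objective for a batch of log-probabilities and advantage estimates. The array names, the averaging over the batch, and the default clipping threshold are assumptions for illustration, not the implementation used in this report.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective L^CLIP averaged over a batch of timesteps."""
    ratio = np.exp(logp_new - logp_old)                       # r_t(theta) = pi_theta / pi_theta_k
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # The elementwise minimum is the pessimistic (lower-bound) estimate of improvement
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Illustrative batch: the third ratio (2.5) exceeds 1 + eps and is therefore clipped
logp_old = np.log(np.array([0.20, 0.50, 0.10]))
logp_new = np.log(np.array([0.30, 0.45, 0.25]))
adv = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, adv))
```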

Chapter 7

Experiments

7.1 Data - Dow Jones 30


The Dow Jones 30, also known as the Dow Jones Industrial Average (DJIA), refers to
thirty blue-chip publicly traded U.S. companies¹. Daily stock data from January
2011 to November 2021 (roughly a ten-year period) was extracted and used. 70
percent of the data was used for training while 30 percent was used for testing the RL agents.
¹The Dow Jones stock companies are 3M, American Express, Amgen, Apple, Boeing, Caterpillar,
Chevron, Cisco Systems, Coca-Cola, Disney, Dow, Goldman Sachs, Home Depot, Honeywell, IBM,
Intel, Johnson and Johnson, JP Morgan Chase, McDonald's, Merck, Microsoft, Nike, Procter &
Gamble, Salesforce, Travelers, UnitedHealth, Visa, Walgreens, and Walmart.
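As an illustration of this data preparation step, the sketch below shows one way a chronological 70/30 split could be performed with pandas; the file name, column layout, and pivoting step are placeholder assumptions rather than the actual pipeline used in this project.

```python
import pandas as pd

# Hypothetical file with columns: date, ticker, close (one row per stock per trading day)
prices = pd.read_csv("djia_daily_prices.csv", parse_dates=["date"])
prices = prices.pivot(index="date", columns="ticker", values="close").sort_index()

# Chronological split: the first 70% of trading days for training, the rest for testing
split_idx = int(len(prices) * 0.7)
train_prices = prices.iloc[:split_idx]
test_prices = prices.iloc[split_idx:]
print(train_prices.shape, test_prices.shape)
```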

7.2 Experiment Scope


To conduct this study, we carried out experiments involving the eight agents described
in chapter 6, using the environment described in chapter 4. We carried out
three training runs and one hundred test runs, and stored the peak and mean
performance of each agent for analysis. We also ran training for 10,000 and
100,000 timesteps for all the RL agents. The rest of this section describes the
parameters and hyperparameters used for both the environment and the agents.

7.2.1 Environment Parameters


1. Reward Function
In the experiments, we used both reward functions available in the environment
- log returns and differential Sharpe ratio.
2. Lookback Period
The lookback period is the duration the agent observes the environment before
taking action. A lookback period of 64 days was used. This was determined
based on literature and recommendations from the industry advisors.

3. Trading Costs
We experimented with three trading cost scenarios, as sketched after this list: no
trading costs, 0.1% of the stock's price, and 1% of the stock's price.
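The exact cost model is part of the environment described in chapter 4. Purely as an illustration, one common way to apply such a proportional cost is to charge the cost rate on the turnover implied by a re-balance, as in the sketch below; the function and variable names are assumptions, not the environment's actual implementation.

```python
import numpy as np

def rebalance_log_return(weights_new, weights_old, price_relatives, cost_rate):
    """Log return of one portfolio step with a proportional trading cost.

    weights_new, weights_old : portfolio weight vectors (each sums to 1)
    price_relatives          : p_t / p_{t-1} for each asset over the step
    cost_rate                : e.g. 0.0, 0.001 (0.1%) or 0.01 (1%)
    """
    turnover = np.abs(weights_new - weights_old).sum()   # fraction of the portfolio traded
    gross_growth = float(weights_new @ price_relatives)  # growth of the re-balanced portfolio
    net_growth = gross_growth * (1.0 - cost_rate * turnover)
    return np.log(net_growth)

# Example: moving half of the portfolio between two assets at a 1% trading cost
print(rebalance_log_return(np.array([0.5, 0.5]), np.array([1.0, 0.0]),
                           np.array([1.01, 0.99]), cost_rate=0.01))
```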

7.2.2 RL Agent Hyper-parameters


We enumerate the parameters for each of the RL agents below.
1. Normalized Advantage Function (NAF)
• layer size: 256
• batch size: 128
• buffer size: 10,000
• LR: 1e-3
• TAU: 1e-3
• GAMMA (discount factor): 0.99
• update_every: 2
• number_of_updates: 1
• Seed: 0

2. REINFORCE
• Discount Factor (gamma): 0.99
• hidden size for linear layers: 128

3. Deep Deterministic Policy Gradient (DDPG)


• memory capacity: 10000
• num_memory_fill_episodes: 10
• gamma (discount factor): 0.99
• tau: 0.005
• sigma: 0.2
• theta: 0.15
• actor_lr: 1e-4
• critic_lr: 1e-3
• batch_size: 64
• warmup_steps: 100

4. Twin Delayed Deep Deterministic Policy Gradient (TD3)


• hidden_dim: 256

• memory_dim: 100,000
• max_action: 1
• discount: 0.99
• update_freq: 2
• tau: 0.005
• policy_noise_std: 0.2
• policy_noise_clip: 0.5
• actor_lr: 1e-3
• critic_lr: 1e-3
• batch_size: 128
• exploration_noise: 0.1
• num_layers: 3
• dropout: 0.2
• add_lstm: False
• warmup_steps: 100

5. Advantage Actor Critic (A2C)


• hidden_dim: 256
• entropy_beta: 0
• gamma (discount factor): 0.9
• actor_lr: 4e-4
• critic_lr: 4e-3
• max_grad_norm: 0.5

6. Soft Actor Critic (SAC)


• hidden_dim: 256
• value_lr: 3e-4
• soft_q_lr: 3e-4
• policy_lr: 3e-4
• gamma (discount factor): 0.99
• mean_lambda: 1e-3
• std_lambda: 1e-3
• z_lambda: 0.0

• soft_tau: 1e-2
• replay_buffer_size: 1,000,000
• batch_size: 128

7. Trust Region Policy Optimization (TRPO)


• damping: 0.1
• episode_length: 2000
• fisher_ratio: 1
• gamma (discount factor): 0.995
• l2_reg: 0.001
• lambda_: 0.97
• lr (learning-rate): 0.001
• max_iteration_number: 200
• max_kl (kl-divergence): 0.01
• val_opt_iter: 200
• value_memory: 1

8. Proximal Policy Optimization (PPO)


• timesteps_per_batch: 50,000
• max_timesteps_per_episode: 2,000
• n_updates_per_iteration: 5
• lr (learning-rate): 0.005
• gamma (discount factor): 0.95
• clip: 0.2

7.3 Metrics
In this work, we use the following metrics when backtesting to evaluate and compare
the performance of the RL agents (a short computation sketch follows the list):
1. Annualized Returns
This is the yearly average profit from the trading strategy.

Annualized Return = (1 + Return)^(1/N) − 1

where:
N = Number of periods measured

2. Cumulative Return
This is the total return obtained from a trading strategy over a period, relative to a
known initial investment.

Cumulative Return = (Pcurrent − Pinitial) / Pinitial

where:
Pcurrent = Current Price
Pinitial = Original Price
3. Sharpe Ratio
This is the reward/risk ratio or risk-adjusted rewards of the trading strategy.

Sharpe Ratio = (Rp − Rf) / σp

Where:
Rp = return of portfolio
Rf = risk-free rate
σp = standard deviation of the portfolio’s excess return
4. Maximum Drawdown (Max DD)
This is the decline from the maximum (peak) value to the minimum (trough) value
of a portfolio over a time horizon, and it is used to measure the downside risk of a
trading strategy. It is usually represented as a percentage, and lower values indicate
better performance.
MDD = (Trough Value − Peak Value) / Peak Value
5. Calmar Ratio
This measures the trading strategy’s performance relative to its risk. It is
calculated by dividing the average annual rate of return by the maximum
drawdown. Similar to the Sharpe ratio, higher values indicate better risk-
adjusted performance.
Calmar Ratio = (Rp − Rf) / Maximum Drawdown
Where:
Rp = Portfolio return
Rf = Risk-free rate
Rp − Rf = Annual rate of return
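The following numpy sketch shows how these metrics could be computed from a series of portfolio values produced during backtesting. The 252-trading-day annualization, the zero risk-free-rate default, and the function name are assumptions for illustration rather than the exact code used in this study.

```python
import numpy as np

def backtest_metrics(portfolio_values, risk_free_rate=0.0, periods_per_year=252):
    """Compute the evaluation metrics above from a series of daily portfolio values."""
    values = np.asarray(portfolio_values, dtype=float)
    returns = values[1:] / values[:-1] - 1.0

    cumulative = values[-1] / values[0] - 1.0
    years = len(returns) / periods_per_year
    annualized = (1.0 + cumulative) ** (1.0 / years) - 1.0

    excess = returns - risk_free_rate / periods_per_year
    sharpe = np.sqrt(periods_per_year) * excess.mean() / excess.std()

    running_peak = np.maximum.accumulate(values)
    max_drawdown = ((values - running_peak) / running_peak).min()   # a negative number

    calmar = (annualized - risk_free_rate) / abs(max_drawdown)
    return {"cumulative": cumulative, "annualized": annualized, "sharpe": sharpe,
            "max_drawdown": max_drawdown, "calmar": calmar}

# Example with a short synthetic series of portfolio values
print(backtest_metrics([1.00, 1.01, 0.99, 1.03, 1.05]))
```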

Chapter 8

Results & Discussion

8.1 Results
This chapter presents the results from the experiments described in chapter 7. Tables
8.1 to 8.3 present a comparison between the RL agents' mean and peak performance
ranks at different trading costs across both reward functions. A large gap between
mean and peak performance indicates that an agent's portfolio management strategy
is unstable. Figure 8.1 shows the final average position across all experiments of all
the agents, from best to worst. Appendix A shows the raw metric values aggregated
into Figure 8.1. Figures 8.2 to 8.10 show plots of cumulative returns and portfolio
management strategies for the best-performing baseline model (MPT) and the RL
agents (A2C and SAC) that consistently outperformed MPT based on the average
rank metric in Tables 8.1 to 8.3. The mean-of-portfolio-weights graphs show how an
agent distributes its portfolio among the available stocks, while the standard-deviation-of-portfolio-weights
graphs show how much an agent changes its portfolio distribution over time. Together,
these graphs explain an agent's portfolio management strategy.

Table 8.1: Table of Rank Comparison at No Trading Cost

Peak Performance Rank Mean Performance Rank Difference Avg Rank


TRPO 1 2 1 1.5
SAC 3 1 2 2
A2C 5 2 3 3.5
MPT 6 4 2 5
PPO 4 7 3 5.5
REINFORCE 2 11 9 6.5
DDPG 7 8 1 7.5
TD3 10 5 5 7.5
NAF 7 9 2 8
Buy And Hold 11 6 5 8.5
Random 9 12 3 10.5
Uniform 12 10 2 11

Table 8.2: Table of Rank Comparison at 0.1% Trading Cost

Peak Performance Rank Mean Performance Rank Difference Avg Rank


A2C 1 1 0 1
SAC 4 2 2 3
TRPO 2 6 4 4
PPO 5 5 0 5
MPT 7 4 3 5.5
Buy And Hold 10 3 7 6.5
NAF 6 8 2 7
REINFORCE 3 11 8 7
DDPG 8 7 1 7.5
TD3 11 9 2 10
Random 9 12 3 10.5
Uniform 12 10 2 11

Table 8.3: Table of Rank Comparison at 1% Trading Cost

Peak Performance Rank Mean Performance Rank Difference Avg Rank


A2C 1 1 0 1
SAC 5 2 3 3.5
PPO 4 4 0 4
MPT 6 3 3 4.5
TRPO 2 9 7 5.5
Buy And Hold 9 4 5 6.5
REINFORCE 3 11 8 7
NAF 7 8 1 7.5
DDPG 10 6 4 8
TD3 11 6 5 8.5
Random 8 12 4 10
Uniform 12 10 2 11

Figure 8.1: Graph of Final Average Rank of All Agents

Figure 8.2: Graph of Cumulative Returns Plot at No Trading Costs

Figure 8.3: Graph of Mean of Portfolio Weights For Each Stock at No Trading Costs

Figure 8.4: Graph of Standard Deviation of Portfolio Weights For Each Stock at No Trading Costs

Figure 8.5: Graph of Cumulative Returns Plot at 0.1% Trading Costs

Figure 8.6: Graph of Mean of Portfolio Weights For Each Stock at 0.1% Trading Costs

Figure 8.7: Graph of Standard Deviation of Portfolio Weights For Each Stock at 0.1% Trading Costs

Figure 8.8: Graph of Cumulative Returns Plot at 1% Trading Costs

Figure 8.9: Graph of Mean of Portfolio Weights For Each Stock at 1% Trading Costs

Figure 8.10: Graph of Standard Deviation of Portfolio Weights For Each Stock at 1% Trading Costs

8.2 Discussion
8.2.1 RL vs. Baselines
From Tables 8.1 to 8.3, we see that only two baseline agents compare favourably
with the RL agents. These baseline agents are Buy & Hold and MPT, with the latter
being the stronger one. Trading costs have a significant effect on the performance of
the trading agents. Every RL agent outperforms Buy & Hold at no trading costs.
However, when trading costs are introduced, only four RL agents outperform Buy &
Hold. This is still a significant achievement and provides evidence that the RL agents
are able to discover good portfolio management strategies.

8.2.2 Value-Based RL vs. Policy-Based RL


In this section, we compare the performance of the NAF agent (a value-based agent)
against the REINFORCE agent (a policy-based agent). Tables 8.1 to 8.3 show that
both the NAF and REINFORCE agents are not exceptional performers, usually
underperforming the MPT and Buy & Hold baseline agents. On mean performance,
the NAF agent outperforms the REINFORCE agent by a small margin regardless of
the trading cost used. However, at peak performance, the REINFORCE agent
significantly outperforms the NAF agent. This shows how unstable the policy
generated by the REINFORCE agent is, which is further illustrated by the fact that
the mean and peak performance ranks of the NAF agent across all trading costs vary
by at most two positions, while those of the REINFORCE agent vary by as much as
nine positions. The results obtained from these two agents are consistent with the
theoretical understanding of how they work. The REINFORCE agent's policies have
high variance because of sample inefficiency caused by estimating policy gradients
from rollouts. The NAF agent's policies are much more stable, but its performance is
sub-optimal, never outperforming two of the baseline agents.

8.2.3 On-Policy vs. Off-Policy


In this project, there are four on-policy agents (A2C, PPO, REINFORCE, and
TRPO) and four off-policy agents (DDPG, NAF, SAC, and TD3). In this section,
we analyze the performance of these two groups of agents. First, we note that the
two agents (A2C and SAC) that consistently outperform the MPT baseline include
one agent from each group. This provides evidence that both on-policy and off-policy
RL agents can perform portfolio management. On mean performance, SAC slightly
outperforms A2C at no trading costs, but A2C slightly outperforms SAC when any
form of trading cost is introduced. The same holds at peak performance. Since
trading costs are usually involved in the real world, A2C is better suited to
real-world portfolio management.
Comparing the holding strategies of A2C and SAC in figures 8.2 to 8.10, we see that
the strategy changes with the trading cost. At no trading cost, the SAC agent put
about 80% of its portfolio into the AXP stock on average. Furthermore, it spread the
rest of its portfolio primarily across three other stocks (CAT, DD, MSFT) over the
entire testing period. The A2C agent took a different strategy: it distributed its
portfolio over most of the available stocks, and the spread changed by an average of
5% across all the stocks over the testing period. It should be noted that A2C had
similar cumulative returns to the MPT baseline, which put 90% of its holdings into
just three stocks - HD, UNH, and V. This confirms a general notion in portfolio
management - different market strategies can yield similar results.
When a trading cost of 0.1% is introduced, both A2C and SAC agents change their
strategy. Rather than put 80% of its holdings into one stock only, the SAC agent
put a similar percentage into four different stocks (JPM, V, GS, CSCO) and kept
its portfolio spread over them. It is interesting to note that these stocks are entirely
different from those it chose at no trading costs, and yet, it outperformed itself on
most of the metrics. Similarly, rather than spread its portfolio into most of the
available stocks, the A2C agent put about half of its portfolio into three stocks
(MRK, GS, KO) this time. Nevertheless, it still kept its portfolio spread across all
stocks but usually chose to trade one of CVX, GS, KO, MCD, MRK, RTX, or V.
When the trading cost was 1%, the strategy landscape changed dramatically. The
SAC agent chose a buy-and-hold strategy and held an almost uniform allocation
across all the available stocks. On the other hand, the A2C agent put most of its
portfolio (about 90%) into just three stocks - KO, PFE, and RTX - and kept the
portfolio spread across just these three stocks. It should be noted that while both
SAC and A2C underperformed MPT at 1% trading costs on returns-related metrics,
the A2C agent's strategy enabled it to outperform MPT on risk-related metrics and,
overall, on average.
The performance of on-policy and off-policy agents seems to be similar at mean
performance, but on-policy agents consistently and significantly outperform off-policy
agents at peak performance. Three of the four on-policy agents (A2C, PPO,
TRPO) rank in the top five at the different trading costs, consistently outperforming
the Buy and Hold baseline. In contrast, only one of the four off-policy agents (SAC)
consistently ranks in the top five. The SAC agent's performance can be attributed to
its maximum entropy learning framework, which allows it to perform stochastic
optimization of its policy.
While the performance of SAC shows that off-policy agents can perform as well as
on-policy agents in the task of portfolio management, the evidence from this analysis
suggests that on-policy agents are better suited to portfolio management than
off-policy agents. The good performance of on-policy agents is likely because they are
better at policy evaluation, and sample efficiency is not a significant constraint in
portfolio management. While unlikely, the off-policy agents may improve with
hyperparameter optimization.

Chapter 9

Conclusion

9.1 Contributions
This study investigated the performance of RL when applied to portfolio manage-
ment using model-free deep reinforcement learning agents. We trained several RL
agents on real-world stock prices to learn how to perform asset allocation. We com-
pared the performance of these RL agents against some baseline agents. We also
compared the RL agents among themselves to understand which classes of agents
performed better.
From our analysis, RL agents can perform the task of portfolio management since
they significantly outperformed two of the baseline agents (random allocation and
uniform allocation). Four RL agents (A2C, SAC, PPO, and TRPO) outperformed
the best baseline, MPT, overall. This shows the abilities of RL agents to uncover
more profitable trading strategies.
Furthermore, there were no significant performance differences between value-based
and policy-based RL agents. Actor-critic agents performed better than other types
of agents. Also, on-policy agents performed better than off-policy agents because
they are better at policy evaluation and sample efficiency is not a significant problem
in portfolio management.
In summary, this study shows that RL agents can substantially improve asset al-
location since they outperform strong baselines. On-policy, actor-critic RL agents
showed the most promise based on our analysis. The next section discusses some
directions that future works may want to explore to build on this work.

9.2 Future Work


While this work has tried to do a comparative analysis of more RL agents than
what is typically available in the literature, we have not exhausted every possible
RL agent. A possible extension to this work is applying the same methodology to
other potentially useful RL agents and seeing how they perform compared to the
analysis done in this report.
Due to time and compute constraints, we chose to stay close to the hyperparameters
initially proposed in the original papers. Thus, another possible extension of this
work would be to carry out extensive hyperparameter optimization for all eight
agents studied here and see how their performance changes.
Furthermore, in this project we used only feedforward neural networks as the
function approximator for all the agents, as proposed by the original authors.
However, financial market data form a time series. Using networks such as recurrent
neural networks, convolutional neural networks, or transformers, which can take the
temporal nature of the market into account, could yield better results.
Finally, while we used the Dow Jones market as requested by the client, there is
potential for comparative analysis across several other markets. It would be
interesting to see whether the RL agents perform better when the market is small
(e.g., just the top five technology companies) or large (e.g., the S&P 500). Similarly,
other markets from around the world (e.g., the DAX market of Germany, the HK50
market of Hong Kong, and the JSE market of South Africa) could be studied to see
how the insights garnered from the Dow Jones market transfer to these new markets.

Appendix A

Raw Metric Scores for All


Experiments

Tables A.1 to A.9 show all the trading agents' mean and peak performances for the
different reward functions and trading costs. The rank columns show an algorithm's
position based on its ranks across all the metrics.

Table A.1: Table of Mean Performance at No Trading Cost & Log Returns Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.61 0.24 0.92 0.69 34% 2
Buy And Hold 1.48 0.2 0.82 0.58 34% 6
DDPG 1.41 0.17 0.76 0.52 33% 8
MPT 1.62 0.25 0.9 0.66 38% 4
NAF 1.42 0.17 0.76 0.5 34% 9
PPO 1.43 0.18 0.79 0.54 33% 7
REINFORCE 1.37 0.15 0.69 0.46 33% 11
Random 1.4 0.17 0.72 0.47 35% 12
SAC 1.77 0.3 0.83 0.68 33% 1
TD3 1.44 0.18 0.81 0.55 33% 5
TRPO 1.59 0.24 0.94 0.69 34% 2
Uniform 1.4 0.17 0.73 0.48 35% 10

Table A.2: Table of Mean Performance at No Trading Cost & Sharpe Ratio Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.35 0.15 0.68 0.44 34% 12
Buy And Hold 1.48 0.2 0.82 0.58 34% 2
DDPG 1.41 0.17 0.76 0.52 33% 6
MPT 1.62 0.25 0.9 0.66 38% 4
NAF 1.42 0.17 0.76 0.48 34% 7
PPO 1.4 0.17 0.73 0.48 35% 9
REINFORCE 1.4 0.17 0.74 0.49 34% 8
Random 1.4 0.17 0.72 0.47 35% 11
SAC 1.41 0.17 0.77 0.52 33% 5
TD3 1.43 0.18 0.78 0.52 33% 3
TRPO 1.59 0.24 0.94 0.69 34% 1
Uniform 1.4 0.17 0.73 0.48 35% 9

Table A.3: Table of Peak Performance at No Trading Cost

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.61 0.24 0.92 0.69 34% 5
Buy And Hold 1.48 0.2 0.82 0.58 34% 11
DDPG 1.46 0.28 0.83 0.58 33% 7
MPT 1.62 0.25 0.9 0.66 38% 6
NAF 1.5 0.21 0.87 0.64 32% 7
PPO 1.59 0.24 0.96 0.7 33% 4
REINFORCE 1.64 0.25 1.03 0.8 32% 2
Random 1.54 0.22 0.9 0.63 35% 9
SAC 1.77 0.3 0.83 0.68 33% 3
TD3 1.44 0.26 0.81 0.55 33% 10
TRPO 1.93 0.35 1.34 1.3 27% 1
Uniform 1.4 0.17 0.73 0.48 35% 12

Table A.4: Table of Mean Performance at 0.1% Trading Costs & Log Returns Re-
ward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.62 0.25 0.88 0.68 28% 1
Buy And Hold 1.48 0.2 0.82 0.58 34% 3
DDPG 1.42 0.18 0.78 0.52 34% 7
MPT 1.62 0.25 0.9 0.66 38% 4
NAF 1.41 0.17 0.76 0.5 34% 8
PPO 1.44 0.18 0.78 0.54 34% 5
REINFORCE 1.36 0.15 0.69 0.46 33% 11
Random 1.37 0.16 0.69 0.45 35% 12
SAC 1.82 0.31 0.99 0.9 35% 2
TD3 1.4 0.17 0.74 0.49 34% 9
TRPO 1.42 0.18 0.76 0.53 33% 6
Uniform 1.4 0.17 0.73 0.48 35% 10

Table A.5: Table of Mean Performance at 0.1% Trading Costs & Sharpe Ratio
Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.5 0.2 0.82 0.68 28% 2
Buy And Hold 1.48 0.2 0.82 0.58 34% 3
DDPG 1.43 0.18 0.78 0.52 34% 6
MPT 1.62 0.25 0.9 0.66 38% 4
NAF 1.41 0.17 0.76 0.48 34% 8
PPO 1.44 0.18 0.78 0.54 34% 5
REINFORCE 1.4 0.17 0.74 0.49 34% 9
Random 1.37 0.16 0.69 0.45 35% 12
SAC 1.82 0.31 0.99 0.9 33% 1
TD3 1.4 0.17 0.75 0.5 34% 7
TRPO 1.4 0.17 0.74 0.49 34% 9
Uniform 1.4 0.17 0.73 0.48 35% 11

Table A.6: Table of Peak Performance at 0.1% Trading Costs

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.97 0.36 1.2 0.91 29% 1
Buy And Hold 1.48 0.2 0.82 0.58 34% 10
DDPG 1.48 0.2 0.86 0.6 33% 8
MPT 1.62 0.25 0.9 0.66 38% 7
NAF 1.5 0.21 0.87 0.63 32% 6
PPO 1.6 0.24 0.97 0.73 33% 5
REINFORCE 1.64 0.25 1.03 0.8 32% 3
Random 1.49 0.2 0.84 0.59 34% 9
SAC 1.85 0.33 1.02 0.93 35% 4
TD3 1.4 0.17 0.74 0.49 34% 11
TRPO 1.84 0.32 1.14 0.97 33% 2
Uniform 1.4 0.17 0.73 0.48 35% 12

Table A.7: Table of Mean Performance at 1% Trading Costs & Log Returns Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.57 0.23 0.86 0.85 27% 1
Buy And Hold 1.48 0.2 0.82 0.58 34% 4
DDPG 1.4 0.17 0.75 0.51 33% 6
MPT 1.62 0.25 0.9 0.66 38% 3
NAF 1.41 0.17 0.75 0.5 34% 8
PPO 1.43 0.18 0.8 0.57 31% 4
REINFORCE 1.36 0.15 0.69 0.48 32% 11
Random 1.35 0.15 0.68 0.41 36% 12
SAC 1.49 0.2 0.87 0.61 33% 2
TD3 1.41 0.17 0.76 0.5 34% 6
TRPO 1.37 0.15 0.71 0.5 31% 9
Uniform 1.4 0.17 0.73 0.48 35% 10

Table A.8: Table of Mean Performance at 1% Trading Costs & Sharpe Ratio Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.58 0.23 0.97 0.85 27% 1
Buy And Hold 1.48 0.2 0.82 0.58 34% 2
DDPG 1.4 0.17 0.75 0.51 33% 5
MPT 1.62 0.25 0.9 0.66 38% 3
NAF 1.4 0.17 0.75 0.46 36% 8
PPO 1.39 0.16 0.72 0.48 34% 10
REINFORCE 1.4 0.17 0.62 0.37 36% 11
Random 1.35 0.15 0.68 0.41 36% 12
SAC 1.43 0.18 0.79 0.54 33% 4
TD3 1.41 0.17 0.76 0.5 34% 6
TRPO 1.39 0.16 0.71 0.5 31% 9
Uniform 1.4 0.17 0.73 0.48 35% 7

Table A.9: Table of Peak Performance at 1% Trading Costs & Sharpe Ratio Reward

Cumulative Returns   Annualized Return   Sharpe   Calmar   Max DD   Rank
A2C 1.73 0.28 1.02 1.19 24% 1
Buy And Hold 1.48 0.2 0.82 0.58 34% 9
DDPG 1.43 0.18 0.79 0.55 33% 10
MPT 1.62 0.25 0.9 0.66 38% 6
NAF 1.5 0.2 0.86 0.63 32% 7
PPO 1.6 0.29 0.99 0.73 33% 4
REINFORCE 1.64 0.25 1.03 0.82 31% 3
Random 1.53 0.21 0.88 0.61 35% 8
SAC 1.49 0.29 0.86 0.61 33% 5
TD3 1.41 0.17 0.76 0.5 34% 11
TRPO 1.62 0.25 1.06 0.82 30% 2
Uniform 1.4 0.17 0.73 0.48 35% 12

References

Adämmer, P., & Schüssler, R. A. (2020). Forecasting the equity premium: mind
the news! Review of Finance, 24 (6), 1313–1355.
Albanesi, S., & Vamossy, D. F. (2019). Predicting consumer default: A deep learning
approach (Tech. Rep.). None: National Bureau of Economic Research.
Amel-Zadeh, A., Calliess, J.-P., Kaiser, D., & Roberts, S. (2020). Machine learning-
based financial statement analysis. Available at SSRN 3520684 .
Ang, Y. Q., Chia, A., & Saghafian, S. (2020). Using machine learning to demystify
startups funding, post-money valuation, and success. Post-Money Valuation,
and Success (August 27, 2020).
Antunes, F., Ribeiro, B., & Pereira, F. (2017). Probabilistic modeling and visual-
ization for bankruptcy prediction. Applied Soft Computing, 60 , 831–843.
Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? the information
content of internet stock message boards. The Journal of finance, 59 (3), 1259–
1294.
Bao, W., Yue, J., & Rao, Y. (2017). A deep learning framework for financial time
series using stacked autoencoders and long-short term memory. PloS one,
12 (7), e0180944.
Bao, Y., Ke, B., Li, B., Yu, Y. J., & Zhang, J. (2020). Detecting accounting fraud
in publicly traded us firms using a machine learning approach. Journal of
Accounting Research, 58 (1), 199–235.
Bari, O. A., & Agah, A. (2020). Ensembles of text and time-series models for
automatic generation of financial trading signals from social media content.
Journal of Intelligent Systems, 29 (1), 753–772.
Belousov, B., Abdulsamad, H., Klink, P., Parisi, S., & Peters, J. (Eds.). (2021).
Reinforcement Learning Algorithms: Analysis and Applications (Vol. 883).
Cham: Springer International Publishing. Retrieved 2021-10-25, from http://
link.springer.com/10.1007/978-3-030-41188-6 doi: 10.1007/978-3-030
-41188-6
Betancourt, C., & Chen, W.-H. (2021, February). Deep reinforcement learning
for portfolio management of markets with a dynamic number of assets. Expert
Systems with Applications, 164 , 114002. Retrieved 2021-10-25, from https://
linkinghub.elsevier.com/retrieve/pii/S0957417420307776 doi: 10
.1016/j.eswa.2020.114002
Björkegren, D., & Grissen, D. (2020). Behavior revealed in mobile phone usage
predicts credit repayment. The World Bank Economic Review , 34 (3), 618–
634.
Chen, J. (2021, May). Post-Modern portfolio theory (PMPT). https://www
.investopedia.com/terms/p/pmpt.asp. (Accessed: 2021-11-14)

Chen, L., Pelger, M., & Zhu, J. (2020). Deep learning in asset pricing. Available at
SSRN 3350138 .
Colombo, E., & Pelagatti, M. (2020). Statistical learning and exchange rate fore-
casting. International Journal of Forecasting, 36 (4), 1260–1289.
Contributors, W. (2021, October). Reinforcement learning. Retrieved 2021-10-
26, from https://en.wikipedia.org/w/index.php?title=Reinforcement
_learning&oldid=1051236695 (Page Version ID: 1051236695)
Croux, C., Jagtiani, J., Korivi, T., & Vulanovic, M. (2020). Important factors deter-
mining fintech loan default: Evidence from a lendingclub consumer platform.
Journal of Economic Behavior & Organization, 173 , 270–296.
Damrongsakmethee, T., & Neagoe, V.-E. (2017). Data mining and machine learning
for financial analysis. Indian Journal of Science and Technology, 10 (39), 1–7.
Deng, Y., Ren, Z., Kong, Y., Bao, F., & Dai, Q. (2016). A hierarchical fused
fuzzy deep neural network for data classification. IEEE Transactions on Fuzzy
Systems, 25 (4), 1006–1012.
E., S., & E., R. (2021). INVESTMENT PORTFOLIO: TRADITIONAL AP-
PROACH. Norwegian Journal of Development of the International Science.
Filos, A. (2019, September). Reinforcement Learning for Portfolio Management.
arXiv:1909.09571 [cs, q-fin, stat] . Retrieved 2021-10-25, from http://arxiv
.org/abs/1909.09571 (arXiv: 1909.09571)
Fujimoto, S., Hoof, H., & Meger, D. (2018). Addressing function approximation
error in actor-critic methods. In International conference on machine learning
(pp. 1587–1596).
Gao, Y., Gao, Z., Hu, Y., Song, S., Jiang, Z., & Su, J. (2021, October). A
Framework of Hierarchical Deep Q-Network for Portfolio Management. In
(pp. 132–140). Retrieved 2021-10-25, from https://www.scitepress.org/
PublicationsDetail.aspx?ID=0fLwyxE3WOE=&t=1
Gomes, T. A., Carvalho, R. N., & Carvalho, R. S. (2017). Identifying anomalies
in parliamentary expenditures of brazilian chamber of deputies with deep au-
toencoders. In 2017 16th ieee international conference on machine learning
and applications (icmla) (pp. 940–943).
Goumagias, N. D., Hristu-Varsakelis, D., & Assael, Y. M. (2018). Using deep q-
learning to understand the tax evasion behavior of risk-averse firms. Expert
Systems with Applications, 101 , 258–270.
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical asset pricing via machine learning.
The Review of Financial Studies, 33 (5), 2223–2273.
Gu, S., Lillicrap, T., Sutskever, I., & Levine, S. (2016). Continuous deep q-learning
with model-based acceleration. In International conference on machine learn-
ing (pp. 2829–2838).
Gulen, H., Jens, C., & Page, T. B. (2020). An application of causal forest in
corporate finance: How does financing affect investment? Available at SSRN
3583685 .
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor.
In International conference on machine learning (pp. 1861–1870).
Harmon, M. E., & Harmon, S. S. (1996). Reinforcement Learning: A Tutorial.
Hayes, A. (2021, November). Portfolio management. https://www.investopedia
.com/terms/p/portfoliomanagement.asp. (Accessed: 2021-11-17)

Hieu, L. T. (2020, October). Deep Reinforcement Learning for Stock Portfolio Op-
timization. International Journal of Modeling and Optimization, 10 (5), 139–
144. Retrieved 2021-10-25, from http://arxiv.org/abs/2012.06325 (arXiv:
2012.06325) doi: 10.7763/IJMO.2020.V10.761
Hu, Y., & Lin, S.-J. (2019). Deep Reinforcement Learning for Optimizing Finance
Portfolio Management. 2019 Amity International Conference on Artificial
Intelligence (AICAI). doi: 10.1109/AICAI.2019.8701368
Huang, J., Chai, J., & Cho, S. (2020, June). Deep learning in finance and banking: A
literature review and classification. Frontiers of Business Research in China,
14 (1), 13. Retrieved 2021-10-25, from https://doi.org/10.1186/s11782
-020-00082-6 doi: 10.1186/s11782-020-00082-6
Huang, Y., Huang, K., Wang, Y., Zhang, H., Guan, J., & Zhou, S. (2016). Exploit-
ing twitter moods to boost financial trend prediction based on deep network
models. In International conference on intelligent computing (pp. 449–460).
Huotari, T., Savolainen, J., & Collan, M. (2020, December). Deep Reinforce-
ment Learning Agent for S&amp;P 500 Stock Selection. Axioms, 9 (4),
130. Retrieved 2021-10-25, from https://www.mdpi.com/2075-1680/9/4/
130 (Number: 4 Publisher: Multidisciplinary Digital Publishing Institute)
doi: 10.3390/axioms9040130
Iwasaki, H., & Chen, Y. (2018). Topic sentiment asset pricing with dnn supervised
learning. Available at SSRN 3228485 .
Jiang, Z., & Liang, J. (2017). Cryptocurrency portfolio management with deep
reinforcement learning. In 2017 intelligent systems conference (intellisys) (pp.
905–913).
Jiang, Z., Xu, D., & Liang, J. (2017, July). A Deep Reinforcement Learning Frame-
work for the Financial Portfolio Management Problem. arXiv:1706.10059
[cs, q-fin] . Retrieved 2021-10-25, from http://arxiv.org/abs/1706.10059
(arXiv: 1706.10059)
Joyce, J. M. (2011). Kullback-leibler divergence. In M. Lovric (Ed.), Inter-
national encyclopedia of statistical science (pp. 720–722). Berlin, Heidel-
berg: Springer Berlin Heidelberg. Retrieved from https://doi.org/10.1007/
978-3-642-04898-2_327 doi: 10.1007/978-3-642-04898-2_327
Kakade, S., & Langford, J. (2002, 01). Approximately optimal approximate rein-
forcement learning. In (p. 267-274).
Konda, V., & Gao, V. (2000, January). Actor-critic algorithms. None.
Kvamme, H., Sellereite, N., Aas, K., & Sjursen, S. (2018). Predicting mortgage
default using convolutional neural networks. Expert Systems with Applications,
102 , 207–217.
Lahmiri, S., & Bekiros, S. (2019). Can machine learning approaches predict corpo-
rate bankruptcy? evidence from a qualitative experimental design. Quantita-
tive Finance, 19 (9), 1569–1577.
Levišauskait, K. (2010). Investment Analysis and Portfolio Management. None,
167.
Li, B., & Hoi, S. C. H. (2014, January). Online portfolio selection: A survey. ACM
Computing Surveys, 46 (3), 35:1–35:36. Retrieved 2021-10-25, from https://
doi.org/10.1145/2512962 doi: 10.1145/2512962
Li, H., Shen, Y., & Zhu, Y. (2018). Stock price prediction using attention-based
multi-input lstm. In Asian conference on machine learning (pp. 454–469).

Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforce-
ment learning in portfolio management. arXiv preprint arXiv:1808.09940 .
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., . . . Wier-
stra, D. (2019, July). Continuous control with deep reinforcement learning.
arXiv:1509.02971 [cs, stat] . Retrieved 2021-11-07, from http://arxiv.org/
abs/1509.02971 (arXiv: 1509.02971)
Luo, C., Wu, D., & Wu, D. (2017). A deep learning approach for credit scoring
using credit default swaps. Engineering Applications of Artificial Intelligence,
65 , 465–470.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7 (1), 77–91.
Retrieved from http://www.jstor.org/stable/2975974
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &
Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602 .
Ozbayoglu, A. M., Gudelek, M. U., & Sezer, O. B. (2020). Deep learning for financial
applications: A survey. Applied Soft Computing, 93 , 106384.
Paula, E. L., Ladeira, M., Carvalho, R. N., & Marzagao, T. (2016). Deep learning
anomaly detection as support fraud investigation in brazilian exports and anti-
money laundering. In 2016 15th ieee international conference on machine
learning and applications (icmla) (pp. 954–960).
Reichenbacher, M., Schuster, P., & Uhrig-Homburg, M. (2020). Expected bond
liquidity. Available at SSRN 3642604 .
Rossi, A. G., & Utkus, S. P. (2020). Who benefits from robo-advising? evidence
from machine learning. Evidence from Machine Learning (March 10, 2020).
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., & Abbeel, P. (2015). Trust
region policy optimization. None.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017, 07).
Proximal policy optimization algorithms. None.
Silver, C. (2021, November). Modern portfolio theory (MPT). https://
www.investopedia.com/terms/m/modernportfoliotheory.asp. (Accessed:
2021-11-14)
Silver, D. (2015). Introduction to Reinforcement Learning with David Silver.
Retrieved 2021-10-25, from https://deepmind.com/learning-resources/
-introduction-reinforcement-learning-david-silver
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., . . .
Hassabis, D. (2018, December). A general reinforcement learning algorithm
that masters chess, shogi, and Go through self-play. Science (New York, N.Y.),
362 (6419), 1140–1144. doi: 10.1126/science.aar6404
Spilak, B. (2018). Deep neural networks for cryptocurrencies price prediction (Un-
published master’s thesis). Humboldt-Universität zu Berlin.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction
(Second ed.). The MIT Press. Retrieved from http://incompleteideas.net/
book/the-book-2nd.html
Taghian, M., Asadi, A., & Safabakhsh, R. (2020). Learning financial asset-
specific trading rules via deep reinforcement learning. arXiv preprint
arXiv:2010.14194 .
Tang, L. (2018, December). An actor-critic-based portfolio investment method
inspired by benefit-risk optimization. Journal of Algorithms & Computa-

tional Technology, 12 (4), 351–360. Retrieved 2021-10-28, from http://
journals.sagepub.com/doi/10.1177/1748301818779059 doi: 10.1177/
1748301818779059
Tian, S., Yu, Y., & Guo, H. (2015). Variable selection and corporate bankruptcy
forecasts. Journal of Banking & Finance, 52 , 89–100.
Vamossy, D. F. (2021). Investor emotions and earnings announcements. Journal of
Behavioral and Experimental Finance, 30 , 100474.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8 (3-4), 279–292.
Weng, L. (2018, February). A (Long) Peek into Reinforcement Learning.
Retrieved 2021-11-02, from https://lilianweng.github.io/2018/02/19/
a-long-peek-into-reinforcement-learning.html
Yiu, T. (2020, October). Understanding portfolio optimization. https://
towardsdatascience.com/understanding-portfolio-optimization
-795668cef596. (Accessed: 2021-11-17)

