
Expert Systems With Applications 164 (2021) 114002


Deep reinforcement learning for portfolio management of markets with a dynamic number of assets

Carlos Betancourt ∗, Wen-Hui Chen
Graduate Institute of Automation Technology, National Taipei University of Technology, Taipei, Taiwan
∗ Corresponding author. E-mail addresses: [email protected] (C. Betancourt), [email protected] (W.-H. Chen).

https://doi.org/10.1016/j.eswa.2020.114002
Received 1 April 2020; Received in revised form 10 September 2020; Accepted 11 September 2020; Available online 16 September 2020
0957-4174/© 2020 Elsevier Ltd. All rights reserved.

ARTICLE INFO

Keywords: Reinforcement learning; Deep learning; Portfolio management; Transaction costs; Multiple assets

ABSTRACT

This work proposes a novel portfolio management method using deep reinforcement learning on markets with a dynamic number of assets. This problem is especially important in cryptocurrency markets, which already support the trading of hundreds of assets with new ones being added every month. A novel neural network architecture is proposed, which is trained using deep reinforcement learning. Our architecture considers all assets in the market, and automatically adapts when new ones are suddenly introduced, making our method more general and sample-efficient than previous methods. Further, transaction cost minimization is considered when formulating the problem. For this purpose, a novel algorithm to compute optimal transactions given a desired portfolio is integrated into the architecture. The proposed method was tested on a dataset of one of the largest cryptocurrency markets in the world, outperforming state-of-the-art methods, achieving average daily returns of over 24%.

1. Introduction

Generating financial profits by trading cryptocurrencies is challenging due to their erratic price changes. Cryptocurrencies are decentralized electronic financial assets that appeared as an alternative to fiat currencies (Nakamoto, 2008). However, according to Corbet et al. (2014), the prices of cryptocurrencies are affected by government announcements, policies and actions, in spite of the fact that they are decentralized assets. Additionally, cryptocurrency prices show higher volatility than those of traditional assets. For instance, in early 2017, the price of Bitcoin, the well-known cryptocurrency, reached its maximum historical peak of approximately 19,000 USD per unit, but during the subsequent months it plunged to 3,000 USD, followed by a strong bounce to its current price of approximately 8,000 USD per unit. Because of this price behavior, formulating cryptocurrency trading strategies is a non-trivial task.

Reinforcement learning (RL) is a suitable framework to process complex data and handle difficult decision-making processes such as asset trading. A trading process can be naturally formulated as an RL process. In this type of process, an agent takes actions over an environment based on observations of the states of that environment; rewards are received by that agent as a consequence of both the states visited and the actions taken. In the specific case of asset trading, a state of the environment is equivalent to the recent history of the assets; actions are the transactions made to get rid of some of the assets held by the agent and acquire new ones; and the rewards are scalar functions of the earnings or losses seen by the agent for taking those actions. The vector containing the information of the assets held by an agent at any moment is called the portfolio, hence this type of process is also known as portfolio management.

Typically, RL algorithms have fixed state and action spaces. However, new assets are often added to cryptocurrency markets (Narayanan et al., 2016). Hence, to rapidly incorporate those assets into the process, adaptable state and action spaces are necessary. Most works on automatic asset trading assume the number of assets is static (Bu & Cho, 2018; Jiang & Liang, 2017; Liang et al., 2018; Pendharkar & Cusatis, 2018). Thereby, convolutional layers of neural networks can extract useful information about the prices of that specific set of assets. But, by doing so, a large portion of data is wasted, because only a small number of assets is used for training the algorithms while the datasets contain information collected from dozens and even hundreds of assets. This is an important issue, not only from a sample-efficiency point of view, but also because critical earnings may be accomplished by trading assets that are suddenly incorporated into a market. For instance, in Fig. 1, Dock (coin) reached 2.2 times its original price during the first 20 days in the market, then it fell and settled at about 1.2 times its original price on the subsequent days. This behavior has been observed often in assets recently added to markets; however, we are confident it can be predicted and exploited.

Fig. 1. Price evolution of Dock (cryptocurrency) during the first month after being released on the Binance website. Source: binance.com.

This work proposes an RL method using a recurrent neural network (RNN) to perform portfolio management on markets in which the number of assets may change over time. Proximal Policy Optimization (PPO) (Schulman et al., 2017) was adapted for this purpose. PPO is a popular deep RL algorithm, with an actor–critic architecture, that has been shown to perform well on difficult tasks such as video-game playing and dexterous robotic control (OpenAI et al., 2020; Schulman et al., 2017). PPO has recently been applied to portfolio management in markets with a fixed number of assets (Liang et al., 2018). However, to adapt to a dynamic number of assets, we propose a particular architecture that processes assets individually and uses the current portfolio entries for weighting. This results in a network able to effectively process assets that were never seen during training, without requiring extra training or memory. The proposed method was backtested on data of a cryptocurrency market along with state-of-the-art baselines in three different setups, which correspond to episodes with lengths of one day, 30 days and 16 weeks with holding periods of 30 min, one day and one week, respectively. The performances of the methods were evaluated using two standard measures for investing and trading: total return and Sharpe ratio. Our method outperformed the baselines in all the tested setups.

Keeping the number of transactions as small as possible is an important issue to consider while doing asset trading. Markets obtain revenues from their services in the form of transaction costs. Any agent that buys or sells assets gives a small percentage of those transactions to the service provider. Cryptocurrency market transaction fees are typically lower than 1%, which is among the lowest compared to fees found in other types of financial asset markets. However, these seemingly negligible fees become important when transactions are made frequently, for instance in periods of minutes or hours. This is because the changes in the assets acquired by the agent may not compensate for the losses suffered through transaction costs. Hence, the algorithm should aim to keep the number of transactions low. To cope with this issue, in our design, the current portfolio vector is given to the network in the output layers, penalizing assets not held by the agent. Additionally, a novel algorithm to compute the optimal transactions is given in this work. In a market where transaction costs exist, if an agent wants to obtain a portfolio vector satisfying certain specific proportions, the agent needs to perform transactions, thus giving up some amount for doing that. Ormos and Urbán (2013) proposed an iterative method to compute the values of the transactions needed to convert some portfolio into another with minimal cost. However, this method assumes transaction costs are the same for all assets, and this is not always the case. We propose a different approach to this problem in which this assumption is not needed. To do this, the problem of finding the optimal transactions given some desired portfolio proportions is converted into a linear program (LP). The main contributions of this work are:

• Formulation of a trading system without the limitation of having a market with a fixed number of assets. Our method is sample-efficient, its implementation is straightforward, and during deployment it is able to integrate assets into the process that appear suddenly in the market without the need of extra training.
• Transaction costs are considered and managed in our work. A novel algorithm to compute the optimal transactions is proposed and integrated into the system.
• Implementation of the proposed method using the dataset of a cryptocurrency market. The reliability of our method is tested under three different trading setups to show its adaptability.

The rest of this paper is organized as follows. Section 2 presents related works in the field. Section 3 describes the mathematical definition of the portfolio management problem. Section 4 describes the proposed method. Section 5 describes the transaction optimization process. Section 6 explains the experiment setups and the metrics used to evaluate them. In Section 7, the results of the experiments are discussed. Finally, conclusions and directions for future work are presented in Section 8.

2. Related works

Deep learning approaches for portfolio management can be divided into two groups: model-based and model-free methods. Model-based methods, as their name suggests, assume models of the asset behavior exist, and deep neural networks (DNNs) are used to approximate these models using supervised learning on price datasets. Model-based methods do not cope with asset trading directly; instead, they require secondary methods to process the predicted prices, which typically use conventional heuristics based on human knowledge. Works that follow this approach include (Heaton et al., 2017; Niaki & Hoseinzade, 2013). Model-free solutions, on the other hand, compute trading actions without explicitly predicting prices. This is done by neural networks that directly map asset features into portfolio vectors. The training of these networks is typically formulated using RL, where financial performance measures such as daily returns, Sharpe ratios, maximum drawdowns, etc., are optimized to obtain agents that combine both profit-seeking and risk-aversion behaviors. Works that follow this strategy include (Dempster & Leemans, 2006; Moody & Saffell, 2001; Zhang & Maringer, 2013), which used RL with recurrent networks for portfolio management in stock and foreign exchange markets.

In recent years, algorithms such as deep Q-learning (Mnih et al., 2013) allowed researchers to train deeper neural network architectures using RL. Since then, deep RL approaches became dominant in portfolio management research. Jiang and Liang (2017) used a Monte Carlo policy gradient method to train a convolutional neural network (CNN) for cryptocurrency trading. They reported high returns; however, they also stated that, to test their method in a real market, it had to be modified to include real-life constraints. Bu and Cho (2018) trained a deep long short-term memory network using double Q-learning (Van Hasselt et al., 2016), obtaining positive rewards in a cryptocurrency market with a decreasing tendency. Pendharkar and Cusatis (2018) compared Q-learning and other value-based RL methods for asset trading in stock markets. Liang et al. (2018) proposed an adversarial training method that improves the performance of deep RL methods, including actor–critic and policy-gradient based methods. They used a deep residual network (He et al., 2016) in their designs, and tested them on a Chinese stock market reporting positive returns. Jeong and Kim (2019) proposed a transfer learning method to pre-train neural networks when data amounts are not sufficiently large. They applied this technique along with Q-learning to trade financial indexes, such as the S&P500 and KOSPI, by augmenting an index dataset with data of stocks that have statistical similarities to the index. Aboussalah and Lee (2020) proposed a Gaussian process that searches for good neural network topologies for stock market trading. Wu et al. (2020) compared Q-learning and policy-gradient methods for stock market trading, finding that policy gradient methods give better results. Park et al. (2020) proposed a two-part network trained with Q-learning. The first part extracts features using a set of recurrent layers to process each asset separately. The second part, named the DNN regressor, is a small dense neural network that processes the extracted features and returns the portfolio. This work uses a similar strategy because this type of architecture saves memory resources. However, the second part of our design is different. The DNN regressor is fixed; hence, if new assets are suddenly added to the process, the second part has to be modified and retrained. Our architecture does not have this limitation. Lei et al. (2020) proposed a design similar to that of Park et al. (2020), but they used attention layers (Vaswani et al., 2017) along with recurrent layers for feature extraction.

The approach presented in this work falls into the model-free category, but it is distinct from previous works in some key respects. The aforementioned approaches assume a fixed number of assets during training and testing. Therefore, adding new assets to those methods requires a change in design, since the input and output layers of those networks are a function of that fixed number. Additionally, if the number of assets in the market is large, the size of those networks may become inconveniently large as well. To make our design independent of the number of assets, the input layers of the network separately process each of them, and the results are combined using weighting layers in the network output. These changes make our approach sample-efficient because the entire dataset is used instead of a segment of some specific number of assets. Additionally, the network does not need to be adjusted or retrained if new assets are suddenly introduced into the market.

3. Problem definition

This section introduces the concepts of market, investor and trading session, as well as the mathematical definition of portfolio management.

3.1. Market

A financial market is an environment where investors trade assets to obtain profit. The prices of assets change over time. Hence, by predicting these changes, investors can purchase assets which may increase in value and generate profit from those assets.

In all markets, there is at least one asset which is considered to keep its value constant over time. Prices of all assets are measured with respect to this special asset, and it can be used to purchase any asset in the market. Hence, it is referred to as cash. In stock and fiat currency markets, the most common cash asset is the U.S. Dollar (USD). In cryptocurrency markets, on the other hand, the most popular is the Tether (USDT, https://tether.to). This is because the USDT was designed to keep its exchange rate constant with respect to the USD (Berentsen & Schär, 2019); hence, cryptocurrency investors who want to close their positions or halt their trading sessions can rapidly exchange their assets for this risk-free asset.

3.2. Portfolio

In stock markets, the specific amounts of assets held by an investor are named shares. Thus, we term the vector containing the specific amounts of assets held by an investor the share vector. The share vector is denoted by q and its entries are denoted as q_i, where i ∈ {0, 1, …, n−1} in a market with n assets. The entries are measured in cash units, and the index 0 is used for the cash asset. The sum of the entries of q is the 'total portfolio value'; it represents the money an investor would make by selling all the assets at that moment (without counting transaction costs), and it is denoted by Q. A vector containing the proportional contributions of each asset to the total portfolio value is named the portfolio vector, denoted by p, and it is computed by taking the share vector divided by the total portfolio value, i.e. q/Q. The entries of p are computed using Eq. (1).

p_i = q_i / Q    (1)

Only long trading is considered in this formulation. Investors begin a trading session with some capital, and all profits are obtained by purchasing assets that gain value during that session. Further, borrowing is not allowed, and neither is short selling nor margin trading. Hence, the entries of both p and q at every moment are non-negative, and the sum of the entries of p adds up to 1. Therefore, there must be at least one p_k and q_k strictly positive at every moment.
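As an illustration of the notation above, the short sketch below computes the total portfolio value Q and the portfolio vector p of Eq. (1) from a share vector q expressed in cash units. It is a minimal NumPy sketch with variable names of our own choosing, not part of the authors' implementation.

```python
import numpy as np

def portfolio_vector(q):
    """Compute total portfolio value Q and portfolio vector p from share vector q.

    q[0] is the cash asset; all entries are expressed in cash units (Section 3.2).
    """
    q = np.asarray(q, dtype=float)
    assert np.all(q >= 0), "long-only setting: share entries are non-negative"
    Q = q.sum()          # total portfolio value
    p = q / Q            # Eq. (1): proportional contribution of each asset
    return Q, p

# Example: 100 USDT in cash, 250 USDT worth of BTC, 150 USDT worth of ETH
Q, p = portfolio_vector([100.0, 250.0, 150.0])
print(Q)   # 500.0
print(p)   # [0.2 0.5 0.3]; the entries sum to 1
```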

3.3. Trading session

In stock markets, trading sessions are specific hours during working days in which investors are allowed to exchange stocks. However, cryptocurrency markets do not have this limitation, since they are open 24 h. Nonetheless, we use the term trading session for the time in which the implemented algorithms are allowed to perform transactions in a simulated environment. In our experiments, trading sessions are subdivided into periods of equal length, which are named holding periods, since transactions are only allowed at the end of each of them.

The change of asset prices at every moment due to the interactions between investors and the market is called the market dynamics. Hence, the portfolio vectors at the beginning and end of each period generally differ; let us denote these vectors by p_{[t]} and p′_{[t]}, respectively, for some period t during a trading session. The entries of these vectors are related by Eq. (2), where c_{i[t]} and c′_{i[t]} are the prices of asset i at the beginning and end of period t, respectively. Eq. (2) states that the change in the value of some asset i during period t is proportional to the amount held by the investor and to the ratio between the prices at the beginning and the end of that period.

q′_{i[t]} = q_{i[t]} c′_{i[t]} / c_{i[t]}    (2)
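A minimal sketch of the market dynamics in Eq. (2): each share entry is scaled by the ratio of end-of-period to start-of-period prices, after which the portfolio vector is recomputed with Eq. (1). The function and variable names are ours, chosen only for illustration.

```python
import numpy as np

def evolve_holding_period(q_t, c_start, c_end):
    """Apply Eq. (2) to every asset for one holding period.

    q_t:     share vector at the beginning of period t (cash units)
    c_start: prices c_{i[t]} at the beginning of the period (the cash asset has price 1)
    c_end:   prices c'_{i[t]} at the end of the period
    Returns the evolved share vector q'_{[t]} and the portfolio vector p'_{[t]}.
    """
    q_t, c_start, c_end = map(np.asarray, (q_t, c_start, c_end))
    q_next = q_t * c_end / c_start     # Eq. (2), applied element-wise
    p_next = q_next / q_next.sum()     # Eq. (1) on the evolved shares
    return q_next, p_next

# Cash stays at price 1; BTC gains 5%, ETH loses 2% over the period
q_next, p_next = evolve_holding_period(
    q_t=[100.0, 250.0, 150.0],
    c_start=[1.0, 10000.0, 200.0],
    c_end=[1.0, 10500.0, 196.0],
)
```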
At the end of each period, the recent history of the market is analyzed to propose a portfolio vector that is likely to increase its value in the subsequent period. The market history consists of prices, volumes, market capitalizations and other features of the assets recorded in the latest periods of the trading session. Fig. 2 depicts the steps of a trading session. In this study, the analysis of the market history and the proposition of portfolio vectors is carried out by a deep neural network, which is trained using RL. The design of our network and the training process are described in Section 4.

Fig. 2. Trading process diagram.

In general, the portfolio vector chosen by the network for period t+1 and the portfolio held by the agent at the end of period t differ, because the purpose of the network is to propose assets that have the potential to increase their value in the near future, which are not necessarily those assets held by the agent at that moment. Therefore, some assets need to be exchanged to obtain the portfolio proposed by the network. The exchange is made in two steps. First, shares of some assets are sold, and the earnings from those transactions are added to the cash. Then, new assets are purchased using the accumulated cash.

However, all transactions have costs proportional to the traded amounts, and those costs decrease the total portfolio value. Thus, it is important to trade efficiently. The problem of finding a set of transactions that yield a desired portfolio, in this case the one proposed by the network, has multiple solutions in general, and different solutions give different decrements in portfolio value. Hence, the use of optimization techniques is necessary to obtain a set of optimal transactions. The ratio between the portfolio values before and after the transactions gives a measure of the loss due to those transactions; this ratio is shown in Eq. (3). The best set of transactions is the one leading to the maximum value of μ. The steps to obtain those optimal transactions are described in Section 5.

μ_{[t+1]} = Q_{[t+1]} / Q′_{[t]}    (3)

Finally, the performances of investors are evaluated using the earnings and losses obtained during trading sessions. The main objective of investors is to obtain net increments in total portfolio value, i.e. profits from the invested capital. However, profit is not the only way to analyze the performance of a set of investments. The oscillations of the earnings and losses seen by an investor during a trading session are used to measure the risk of those investments. Thus, our trading methods are evaluated using both profit-seeking and risk-aversion metrics, which are described in Section 6.2.

In short, due to the market dynamics, it is possible to obtain profits by trading assets. To do this, appropriate portfolios have to be periodically selected, and optimal transactions have to be computed and implemented to satisfy the selected portfolios. The details of our solutions to these problems are described below.

4. Proposed method

A trading session can be naturally formulated in the RL framework. In an RL process, an agent visits the states of an environment and takes actions in each visited state. In return, the environment gives rewards to the agent for taking those actions. After the agent executes an action, the environment evolves into a new state due to both the environmental dynamics and the action itself. An episode is the set of interactions between the agent and the environment from its initialization until a final state is visited. The environmental dynamics are the set of rules by which the environment changes over time, which are not necessarily known by the agent, but can be indirectly measured using the data generated in the process, i.e. states, actions and rewards. These data are also used to improve the performances of agents in future episodes. This is typically done by optimizing a policy function which depends on the rewards received in previous episodes. This framework is suitable for asset trading because the concepts of environment, agent and reward are analogous to those of market, investor and profit. A state consists of the market history at some specific point in time in which the investor is allowed to perform transactions. The action is the set of transactions made by the investor, guided by the market history. The transition between states is a waiting period of time in which prices change due to the market dynamics. The rewards are the actual profits or losses obtained by the investor at the end of the period due to the transactions made. States, actions and rewards are represented by s, a and r, respectively. The trading process in the framework of RL is depicted in Fig. 3.

Fig. 3. Block diagram of the trading environment.

4.1. Portfolio management as an RL process

The iteration t of the trading process begins when the agent arrives at state s_{[t]}; this is depicted in the upper part of Fig. 3. State s_{[t]} is the market history at that specific time, which is represented by a tensor with dimensions n × f × k, where n is the number of available assets in the market, f is the number of features per asset, and k represents the latest steps of the process that the agent is allowed to observe. The chosen features are: the open, close, maximum and minimum prices in the period, the volume, the quote-asset volume and the entry of the portfolio corresponding to that asset, a total of 7 features.
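The sketch below builds an n × f × k state tensor of the kind just described from a rolling window of per-asset features. It is a hypothetical NumPy illustration of the shapes involved; the feature ordering and array names are our assumptions, not the authors' implementation.

```python
import numpy as np

def build_state(history, portfolio):
    """Stack the last k feature vectors of each asset into the state tensor s[t].

    history:   array of shape (n, k, 6) with open, close, max, min prices,
               volume and quote-asset volume for the latest k holding periods
    portfolio: current portfolio vector p' of length n
    Returns a tensor of shape (n, f, k) with f = 7 (the 6 market features
    plus the asset's portfolio entry, repeated along the window).
    """
    n, k, _ = history.shape
    port_feat = np.repeat(portfolio[:, None, None], k, axis=1)   # shape (n, k, 1)
    state = np.concatenate([history, port_feat], axis=2)         # shape (n, k, 7)
    return state.transpose(0, 2, 1)                              # shape (n, 7, k)

state = build_state(np.random.rand(85, 10, 6), np.full(85, 1 / 85))
print(state.shape)   # (85, 7, 10)
```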

The state s_{[t]} is analyzed by the agent to decide on the best possible action a_{[t]} to take. The action a_{[t]} is a vector containing the shares of the assets to be sold and acquired for the next period. These values are computed by the policy. A policy is any map taking states as inputs and assigning actions as outputs. This map is typically implemented using neural networks. In this work, the policy has two parts: a policy network and a transaction optimizer. Our policy network has two gated recurrent unit (GRU) layers (Cho et al., 2014). The policy network is denoted by π_θ, where the subindex θ represents the weights of the network that need to be optimized. The output of the policy network is the proposed portfolio vector p_{[t]}. The action a_{[t]} is computed using the desired portfolio p_{[t]}, the portfolio at that moment p′_{[t−1]}, and the fees for trading the assets. This process is described in Section 5.

The action a_{[t]}, which carries the desired amounts to be sold and acquired, is executed in the asset exchange of the environment, resulting in a new set of assets satisfying the portfolio vector proposed by the network. The following step in the process is a waiting time, in which the state s_{[t]} evolves into s_{[t+1]}, and consequently q_{[t]} evolves into q′_{[t]} due to the market dynamics. The entries of q′_{[t]} are passed through the reward function, which gives the reward for step t. Finally, once the new state s_{[t+1]} is reached, the reward r_{[t]} is given to the agent and a new cycle begins.

The reward r_{[t]} is a scalar value representing the performance of the agent in the current period. The reward is computed using a reward function, which in a trading environment depends on the earnings and losses received by the agent in recent periods. In this work, two financial measures are used for this purpose: the period return and the differential Sharpe ratio (Moody et al., 1998). Reward functions are analogous to overall performance measures, but they are computed for individual periods instead of the entire process. Hence, both reward functions and overall performance measures are described together in Section 6.

The process finishes when a maximum allowed number of states have been visited. When this happens, the rewards are used to compute the total discounted reward R, shown in Eq. (4). This is a performance measure used to train the policy. The variable γ in the equation is named the discount factor, and it lies in the interval (0, 1]; t is the index of each step in the process, and T is the index of the terminal state. The interpretation of Eq. (4) is that agents value earlier rewards more than later ones. From a financial point of view, this means that if large earnings are obtained early, the potential to obtain large final profits increases, because profit depends on both the initial capital and enough time for the assets to gain value.

R = \sum_{t=1}^{T} γ^{t−1} r_{[t]}    (4)
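A short sketch of Eq. (4), accumulating per-period rewards with the discount factor γ; this is only a minimal illustration with names of our own choosing.

```python
def total_discounted_reward(rewards, gamma=0.99):
    """Eq. (4): R = sum_{t=1..T} gamma^(t-1) * r[t]."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three holding periods: earning early is worth more than earning late
print(total_discounted_reward([0.05, 0.0, -0.01], gamma=0.99))
```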

4.2. Policy optimization

A policy is considered optimal if it maximizes the total expected discounted reward of the process. The problem of finding optimal policies is generally hard, but reasonable approximations are obtained using RL algorithms. Since our architecture uses an RNN, we chose Proximal Policy Optimization (PPO) (Schulman et al., 2017) to train our policies, which is an RL algorithm compatible with this type of architecture. PPO is an actor–critic algorithm, meaning it not only retrieves a policy π (actor), but it also retrieves a value function v_π(s) (critic), which estimates the value of the discounted reward the agent will receive at the end of the process following policy π starting from any state s.

On each iteration cycle of PPO, a specific number of copies of the policy (named workers) are created and assigned to copies of the environment. Workers interact with their environments during a specific number of steps, and data of those interactions are stored in memory. These data are named rollouts, and they are used to improve the policy at each iteration. Even though the workers are identical, the data collected by them differ, because PPO uses stochastic policies during training. This means noise is added to the outputs of the workers; consequently, when they land in the same state, they will take slightly different actions, thus adding variations to the generated data, which leads to training robust policies. Once the rollouts are complete, the generated data are mixed and divided into mini-batches, which are used to improve the policy by applying the Adam optimizer (Kingma & Ba, 2014) to the PPO objective function, which is shown in Eq. (5). This objective function is equal to the clip objective function minus a constant coefficient multiplied by the value-function loss.

L^{PPO}(θ) = L^{clip}(θ) − w_v L^{v}(θ)    (5)

The key feature of PPO is its clip objective function, shown in Eq. (6). In this formula, u(θ) represents the probability ratio between the new and old policies. This ratio is computed by dividing the likelihood of the action taken by the current policy at a state s by the likelihood of the previous policy choosing the same action at the same state; this is shown in Eq. (7). Â_{[t]} in Eq. (6) is the generalized advantage (GA), introduced by Schulman et al. (2015), computed using Eqs. (8) and (9). The idea behind the PPO objective function is that small differences between consecutive policies result in stable training processes. The clip function and the hyper-parameter ε in Eq. (6) ensure that the differences between the actions of the policies before and after an iteration cycle are enclosed in a small range.

L^{clip}(θ) = E[ min( u(θ) Â_{[t]}, clip(u(θ), 1 − ε, 1 + ε) Â_{[t]} ) ]    (6)

u(θ) = π_θ(a|s) / π_{θ′}(a|s)    (7)

Â_{[t]} = \sum_{i=0}^{T−t−1} (γλ)^{i} δ_{[t+i]}    (8)

δ_{[t]} = r_{[t]} − v_π(s_{[t]}) + γ v_π(s_{[t+1]})    (9)

The value-function loss, shown in Eq. (10), is the squared error between the predictions of the value function and the real discounted returns obtained by the agents in the rollouts. The variable R_{[t]} in the equation is the discounted reward computed from state t, and w_v is the weight of the value-function loss. The expectations in these formulas are averages over the samples stored in the rollouts.

L^{v}(θ) = E[ ( v_π(s_{[t]}) − R_{[t]} )^2 ]    (10)
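A compact sketch of Eqs. (5)–(10) in PyTorch, assuming that the log-probabilities of the current and previous policies, the generalized-advantage estimates and the discounted returns have already been collected from the rollouts. Tensor names are ours and the defaults follow the hyper-parameters listed in Table A.1 (ε = 0.2, w_v = 0.99); this is not the authors' training code.

```python
import torch

def ppo_loss(new_logp, old_logp, advantages, values, returns,
             eps=0.2, w_v=0.99):
    """Clipped PPO objective of Eq. (5), returned as a loss to minimize."""
    u = torch.exp(new_logp - old_logp)               # Eq. (7): probability ratio
    unclipped = u * advantages
    clipped = torch.clamp(u, 1.0 - eps, 1.0 + eps) * advantages
    l_clip = torch.min(unclipped, clipped).mean()    # Eq. (6)
    l_value = ((values - returns) ** 2).mean()       # Eq. (10)
    # Eq. (5) is maximized, so the optimizer minimizes its negative
    return -(l_clip - w_v * l_value)
```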
4.3. Market with a dynamic number of assets

Fig. 4. Actor–Critic neural network architecture.

Several changes were made to the original PPO architecture to make the process independent of the number of assets in the market. The schematic of our actor–critic architecture is shown in Fig. 4. The architecture allows both the observation space and the action space to change dynamically depending on the number of assets. Note that the values of the inputs are normalized before being passed to the recurrent layers. The normalization is done for each asset individually, separating the features into two groups: price features (open, close, maximum and minimum) and volume features (volume and quote-asset volume). In both groups, features are mapped to values in the interval [0, 1]. This is done by a linear mapping which transforms the minimum and maximum values of each group to zero and one, respectively. The normalization map is shown in Eq. (11), where F represents a group of features and f an individual feature.

f_normed = (f − min(F)) / (max(F) − min(F))    (11)
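A minimal sketch of the per-asset normalization of Eq. (11), applied separately to the price group and the volume group of one asset's feature window; the array layout and names are our own assumptions.

```python
import numpy as np

def normalize_asset(window):
    """Min-max normalize one asset's feature window as in Eq. (11).

    window: array of shape (6, k) whose rows are open, close, max, min prices,
            volume and quote-asset volume over the latest k periods.
    Price rows and volume rows are normalized as two separate groups,
    each mapped linearly to [0, 1].
    """
    out = window.astype(float).copy()
    for rows in (slice(0, 4), slice(4, 6)):           # price group, volume group
        group = out[rows]
        lo, hi = group.min(), group.max()
        out[rows] = (group - lo) / (hi - lo + 1e-12)  # small epsilon avoids 0/0
    return out

normalized = normalize_asset(np.random.rand(6, 10) * 100)
```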
The feature tensor of each asset is passed through the network separately and independently. This is equivalent to cloning the network for each asset. In this way, an independent evaluation of the potential increase in the value of each asset is obtained. Additionally, an artificial feature tensor is used for the cash asset as well. In this case, the price features were set to 1 and the volume features to 0. This is done to give the network a reference point to evaluate the rest of the assets. Hence, a measure of potential growth is given to each asset as an output of the recurrent layers. These outputs are concatenated to the portfolio vector entry of each asset. These new features are passed through a linear layer and a softmax layer that retrieves the desired output vector. The reason for having these final layers is to penalize assets that are not held by the agent. This is done to avoid unnecessary transactions, since they reduce the total portfolio value.
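The sketch below illustrates this asset-wise design: a shared GRU stack scores every asset from its normalized feature window, the score is concatenated with the asset's current portfolio entry, and a shared linear layer followed by a softmax produces the proposed portfolio. It is a simplified PyTorch reading of the description above (layer sizes follow Table A.1, but the module names are ours, the critic head is omitted, and this is not the authors' implementation).

```python
import torch
import torch.nn as nn

class AssetWisePolicy(nn.Module):
    """Policy network whose size is independent of the number of assets."""

    def __init__(self, n_features=6, hidden=10):
        super().__init__()
        # The same recurrent layers are applied to every asset ("cloning" the network)
        self.gru = nn.GRU(input_size=n_features, hidden_size=hidden,
                          num_layers=2, batch_first=True)
        # Shared output layer: asset score concatenated with its portfolio entry
        self.head = nn.Linear(hidden + 1, 1)

    def forward(self, features, portfolio):
        """features: (n_assets, k, n_features) normalized windows, cash asset included.
        portfolio: (n_assets,) current portfolio vector p'."""
        _, h = self.gru(features)                 # h: (num_layers, n_assets, hidden)
        score = h[-1]                             # last recurrent state per asset
        joint = torch.cat([score, portfolio.unsqueeze(1)], dim=1)
        logits = self.head(joint).squeeze(1)      # one logit per asset
        return torch.softmax(logits, dim=0)       # proposed portfolio vector p

policy = AssetWisePolicy()
p = policy(torch.rand(85, 10, 6), torch.full((85,), 1 / 85))
print(p.shape, float(p.sum()))                    # torch.Size([85]) 1.0
```

Because the recurrent and output layers are shared across assets, the same weights can score a market of 3 assets or 85 assets, which is what allows newly listed assets to enter the process without retraining.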
Note that the information is passed separately through the network for each asset, and it is only combined before the output layers; this makes the network independent of the number of assets in the market. This architecture differs in several key respects from those designed for a fixed number of assets. Our design uses all the available data of the market, instead of only using the data of a small subset of assets. This makes the resulting network more robust at processing assets that were never seen by the network. Additionally, our network requires less memory, because the input does not depend on the number of assets but only on the number of features. A possible disadvantage of our design is that correlations between assets may not be found. However, our architecture is able to process higher amounts of data during training; therefore, a more general space of solutions is explored, obtaining higher robustness than other approaches. The method used to train our agents is summarized in Algorithm 1. The values of all hyper-parameters are listed in Table A.1.

Algorithm 1 Trading with dynamic number of assets (DNA)
Inputs: S (# of environment steps), N (# of parallel environments), R (# of rollouts per epoch), M (# of mini-batches), b and s (fee vectors).
Outputs: π_θ  ▷ Policy network
1:  Initialize action-state-reward buffer.
2:  Reset all environments at random states.
3:  for epoch i = 1, 2, ..., E do  ▷ E = S/(N × R)
4:      for worker j = 1, 2, ..., N do
5:          for rollout k = 1, 2, ..., R do
6:              p_{[t+1]} ← π_θ(s_{[t]})
7:              a_{[t]} ← Trans-Opt(p′_{[t]}, p_{[t+1]}, b, s)  ▷ LP of Eq. (22)
8:              Save to buffer: s_{[t]}, a_{[t]}, r_{[t]} and s_{[t+1]}
9:              if t + 1 = T then  ▷ Terminal state
10:                 Reset environment at a random state.
11:     Shuffle samples in buffer.
12:     for mini-batch j = 1, 2, ..., M do
13:         s_{[j]}, a_{[j]}, r_{[j]}, s_{[j+1]} ← draw samples from buffer.
14:         Â_{[j]} ← compute GA from samples.
15:         L^{PPO} ← compute using Eq. (5).
16:         π_θ ← Adam(L^{PPO}, α, β_1, β_2)  ▷ Optimize π_θ
5. Transaction optimization problem

The outputs of the policy network are the entries of the portfolio vector p_{i[t]}. In general p_{i[t]} ≠ p′_{i[t−1]}; consequently, some assets have to be sold and others purchased to satisfy the desired portfolio vector. For simplicity, let us drop the time dependency of the expressions, since it is understood by context; for instance, p_{i[t]} and p′_{i[t−1]} are written as p_i and p′_i, respectively. Let us also represent the shares acquired and sold per asset at some period by the non-negative variables x̂_i and ŷ_i, respectively. Then, the resulting amounts for non-cash assets due to the transactions at that period are given by Eq. (12). This formula states that the shares of any asset after the transactions are the shares before those transactions minus the sold amount plus the acquired amount. Intuitively, either x̂_i or ŷ_i should be exactly zero, because it does not make sense to sell an asset and then buy it again in the same period; however, the optimization algorithm described at the end of this section takes care of this issue.

q_i = q′_i − ŷ_i + x̂_i,  i ∈ {1, 2, …, n−1}    (12)

The shares sold for each asset are added to the cash, and that amount is used to buy new assets. However, both buying and selling assets reduce the total portfolio value. These conditions are represented in Eq. (13), where b_i and s_i are the buying and selling fees for the i-th asset, and lie in the interval [0, 1). Similar to the previous formula, Eq. (13) states that the amount of cash after the transactions is the amount before the transactions, plus the cash obtained by selling the undesired assets, minus the cash used to purchase new assets. In addition, it contains the amounts subtracted from the cash due to transaction costs. Note that the losses due to transaction costs are proportional to both the traded shares and the fees for each asset.

q_0 = q′_0 + \sum_{j=1}^{n−1} ŷ_j − \sum_{j=1}^{n−1} x̂_j − \sum_{j=1}^{n−1} s_j ŷ_j − \sum_{j=1}^{n−1} b_j x̂_j    (13)

Eqs. (12) and (13) need to be modified to incorporate the variable μ defined in Eq. (3), which represents the decrement in total portfolio value that needs to be maximized to obtain a set of optimal transactions. Additionally, the entries of the vector p also have to be incorporated, because this is the vector computed by the neural network. To obtain explicit expressions for μ and the entries of p, Eqs. (1) and (3) were substituted into Eqs. (12) and (13), and the resulting expressions were divided by Q′. The results of these operations are shown in Eqs. (14) and (15), where x_i and y_i are a new set of variables defined by x_i = x̂_i/Q′ and y_i = ŷ_i/Q′.

μ p_i = p′_i − y_i + x_i,  i ∈ {1, 2, …, n−1}    (14)

μ p_0 = p′_0 + \sum_{j=1}^{n−1} (1 − s_j) y_j − \sum_{j=1}^{n−1} (1 + b_j) x_j    (15)

To optimize μ, an explicit function for this variable was obtained by adding the n − 1 expressions represented by Eq. (14) to Eq. (15); the resulting expression is shown in Eq. (16).

μ = 1 − \sum_{j=1}^{n−1} s_j y_j − \sum_{j=1}^{n−1} b_j x_j    (16)

Note that Eqs. (14)–(16) are linear functions of μ, x_i and y_i; hence, using these equations the transaction optimization problem can be formulated as an LP. If a feasible solution exists for some LP, then the optimal solution to that LP can always be written as a closed-form expression using Dantzig's Simplex Method (Dantzig, 1998), or approximated up to any desired accuracy using other convex optimization techniques (Boyd & Vandenberghe, 2004) (there are plenty of LP solvers written in C++, Python and other languages available on the internet at no cost).

To reduce one decision variable in the problem, Eq. (16) is substituted into Eq. (14), resulting in Eq. (17); in this way μ only appears explicitly in the objective function. Note that Eq. (15) is implicit in Eqs. (16) and (17), so it is omitted in the formulation.

(1 − \sum_{j=1}^{n−1} s_j y_j − \sum_{j=1}^{n−1} b_j x_j) p_i = p′_i − y_i + x_i    (17)

Additionally, no more than the available shares held by the investor can be sold. This condition is explicitly shown in Eq. (18). Similarly, the amount expended purchasing assets and paying transaction fees has to be at most the cash before the transactions plus the amount obtained by selling shares; therefore, Eq. (19) has to be satisfied as well.

q′_i ≥ ŷ_i    (18)

q′_0 + \sum_{j=1}^{n−1} ŷ_j ≥ \sum_{j=1}^{n−1} x̂_j + \sum_{j=1}^{n−1} b_j x̂_j + \sum_{j=1}^{n−1} s_j ŷ_j    (19)

Dividing Eqs. (18) and (19) by Q′ and rearranging terms results in Eqs. (20) and (21), which are equivalent to Eqs. (18) and (19) but contain x_i and y_i, the decision variables of the LP described below.

p′_i ≥ y_i    (20)

p′_0 ≥ −\sum_{j=1}^{n−1} (1 − s_j) y_j + \sum_{j=1}^{n−1} (1 + b_j) x_j    (21)

The LP shown in Eq. (22) was stated using Eq. (16) as the objective function, Eq. (17) as the set of equality constraints and Eqs. (20) and (21) as the set of inequality constraints. This LP is always feasible; thus, an optimal solution for it can always be found efficiently using convex optimization techniques. A proof of this fact is given in Appendix B.

Linear Program:
Maximize:  μ = 1 − \sum_{j=1}^{n−1} s_j y_j − \sum_{j=1}^{n−1} b_j x_j
Subject to:
1.  y_i − x_i − (\sum_{j=1}^{n−1} s_j y_j + \sum_{j=1}^{n−1} b_j x_j) p_i = p′_i − p_i
2.  y_i ≤ p′_i
3.  −\sum_{j=1}^{n−1} (1 − s_j) y_j + \sum_{j=1}^{n−1} (1 + b_j) x_j ≤ p′_0
4.  y_i, x_i ≥ 0,  for i ∈ {1, 2, …, n−1}    (22)
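Under the assumption that a generic LP solver is acceptable, the transaction optimizer of Eq. (22) can be prototyped with scipy.optimize.linprog as sketched below. The decision vector stacks y (sales) and x (purchases) for the n−1 non-cash assets; SciPy minimizes, so the objective of Eq. (16) is negated. This is our own sketch of the formulation, not the authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def optimal_transactions(p_prev, p_target, sell_fee, buy_fee):
    """Solve the LP of Eq. (22) for the sale (y) and purchase (x) fractions.

    p_prev:   current portfolio p' (index 0 is cash), length n
    p_target: portfolio p proposed by the network, length n
    sell_fee, buy_fee: fee vectors s and b for the n-1 non-cash assets
    Returns (y, x, mu), all expressed as fractions of the portfolio value Q'.
    """
    p_prev, p_target = np.asarray(p_prev), np.asarray(p_target)
    s, b = np.asarray(sell_fee), np.asarray(buy_fee)
    p0_prev, pi_prev, pi = p_prev[0], p_prev[1:], p_target[1:]
    m = pi.size

    c = np.concatenate([s, b])                      # minimizing c.z maximizes mu, Eq. (16)
    # Equality constraints from Eq. (17): y_i - x_i - (s.y + b.x) p_i = p'_i - p_i
    A_eq = np.hstack([np.eye(m) - np.outer(pi, s), -np.eye(m) - np.outer(pi, b)])
    b_eq = pi_prev - pi
    # Cash-balance inequality of Eq. (21)
    A_ub = np.concatenate([-(1.0 - s), 1.0 + b])[None, :]
    b_ub = np.array([p0_prev])
    # Eq. (20) and non-negativity expressed as variable bounds
    bounds = [(0.0, float(v)) for v in pi_prev] + [(0.0, None)] * m

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    y, x = res.x[:m], res.x[m:]
    mu = 1.0 - c @ res.x
    return y, x, mu

# Example: move half of the cash into a single asset with 0.1% fees
# y, x, mu = optimal_transactions([1.0, 0.0], [0.5, 0.5], [0.001], [0.001])
```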
6. Experiments

This section describes the setups of our experiments. These include the dataset features, the metrics for evaluating the algorithms and implementation details. We assume the amounts traded by our agents are sufficiently small that the prices of assets are not affected by these transactions, and that the available shares in the market are large enough that transactions are executed immediately.

6.1. Dataset

The dataset of Binance (www.binance.com) was used in all our experiments. Binance is one of the largest cryptocurrency markets in the world (according to Coin Market Cap, www.coinmarketcap.com), and its dataset can be accessed at no cost through the website's API. The data used in this work correspond to the market history from 2017-08-17 to 2019-11-01. Only assets that can be directly exchanged for USDT were considered; these include Bitcoin, Ethereum, Litecoin and others. The Binance market has grown rapidly since its creation in 2017. Note that, during the selected period, the number of active assets that can be exchanged for USDT increased from three to 85, as shown in Fig. 5. The dataset consists of asset features recorded in equal-length sampling periods. The dataset has multiple available sampling periods; three of them were chosen for our experiments: 30 min, six hours and one day. There are nine features available for each asset at each sampled period, which include four price features (open, high, low, and close prices), four volume features (standard, quote-asset, taker-buy asset and taker-buy-quote-asset volumes), and the number of trades in the sampled period. The first six features were used as inputs for our network. The data were divided chronologically into two parts: the first 60% of the data was used for training and the rest for testing. This type of division was chosen because investors use past price behavior of assets to predict future tendencies or repetitions of patterns. The test dataset corresponds to 11 months of asset history. Hence, the results can give a general idea of the expected average performance of our methods at any point of the year. Missing data were completed using linear interpolation. Fees were set as in the Binance exchange website: 0.1% for all assets except BNB (coin), which has a fee of 0.05% (this discount changes after the first 12 months of use of the platform). Table 1 summarizes the main properties of the dataset.

Fig. 5. Number of active assets in the Binance market in the period 2017-08-17 to 2019-11-01 (only assets directly exchangeable for USDT were counted).

Table 1
Dataset properties.

Feature | Value
Total days | 806
Training days | 484 (60%)
Test days | 322 (40%)
# of assets | 3–85
# of features | 9
Fees (except BNB) | 0.1%
Fees BNB | 0.05%
# of entries per sampling period:
  30 minutes | 38,688
  six hours | 3,224
  one day | 806

6.2. Metrics

Two metrics were used in this study for evaluating the profit-seeking and risk-aversion behaviors of the implemented algorithms: the total return (TR) and the Sharpe ratio (SR) (Sharpe, 1994). The total return is the total profit or loss obtained by the investor in a trading session, and it is computed using Eq. (23). This measure only considers the portfolio values at the beginning and end of a session. Thus, it gives high scores to trading strategies that seek high profits.

TR = (Q_{[T]} − Q′_{[0]}) / Q′_{[0]}    (23)

On the other hand, the SR was used to evaluate the risks taken by the algorithms during the trading sessions. The SR is computed using Eq. (24), where the mean and standard deviation are taken over the period returns (PR) of all steps in the trading session. PRs are computed using Eq. (25). The risk of an investment is the uncertainty the investor has about the future price of an asset that he or she intends to buy or sell, which is hard to evaluate. Instead, the SR measures the risk of a set of investments after observing their outcome. This is done by computing statistical measures of the earnings and losses at each period, which are represented by the PR. Note that PR and TR are similar expressions; the difference between them is that PR is computed for each individual step while TR is computed for the entire process. The SR and the TR favor algorithms which seek high returns; however, the difference between them is that the SR penalizes portfolios that experience large oscillations in gains during the trading sessions.

SR = mean_t(PR_{[t]}) / std_t(PR_{[t]})    (24)

PR_{[t]} = (Q_{[t]} − Q′_{[t−1]}) / Q′_{[t−1]}    (25)
the total profit or loss obtained by the investor in a trading session, and
6.3. Reward functions

3
www.binance.com. The value of 𝑟[𝑡] is a measure of performance of an agent at period
4
According to Coin Market Cap (www.coinmarketcap.com). 𝑡. Any real scalar function that maps good performances to high values
5
This discount changes after the first 12 months of use of the platform. can be used to compute rewards. In trading environments, the most

7
C. Betancourt and W.-H. Chen Expert Systems With Applications 164 (2021) 114002

Table 2 Table 4
Trading session lengths and holding periods. Score summary for the trading sessions with length: one day and holding periods:
Session length Holding period # of periods 30 min.
Algorithm Total Return Sharpe Ratio
1 day 30 min 48
14 days 6 h 56 TD(𝜆) (Pendharkar & Cusatis, 2018) 0.014 ± 0.021 0.064 ± 0.107
30 days 1 day 30 CNN (Jiang & Liang, 2017) 0.088 ± 0.086 0.274 ± 0.127
DQN (Bu & Cho, 2018) −0.019 ± 0.024 −0.180 ± 0.176
DNA-R (ours) 0.041 ± 0.042 0.166 ± 0.153
Table 3
DNA-S (ours) 𝟎.𝟐𝟒𝟒 ± 𝟎.𝟏𝟓𝟒 𝟎.𝟒𝟔𝟖 ± 𝟎.𝟐𝟎𝟎
Summary of the implemented algorithms.
Algorithm # of # of Description
assets features Table 5
TD(𝜆) (Pendharkar & Cusatis, 2018) 2 2 Value function Score summary for the trading sessions with length: 14 days and holding periods: 6 h.
CNN (Jiang & Liang, 2017) 12 14 Policy gradient Algorithm Total Return Sharpe Ratio
DQN (Bu & Cho, 2018) 8 8 Double-Q learning
TD(𝜆) (Pendharkar & Cusatis, 2018) 0.224 ± 0.183 0.317 ± 0.125
DNA-R (ours) 3 - 85 6 PPO, period return CNN (Jiang & Liang, 2017) 0.680 ± 0.257 0.411 ± 0.088
DNA-S (ours) 3 - 85 6 PPO, Diff. Sharpe ratio DQN (Bu & Cho, 2018) 1.108 ± 0.503 𝟎.𝟔𝟏𝟖 ± 𝟎.𝟎𝟖𝟖
DNA-R (ours) 𝟐.𝟐𝟖𝟑 ± 𝟏.𝟑𝟕𝟏 0.533 ± 0.095
DNA-S (ours) 0.468 ± 0.314 0.336 ± 0.150

natural choice for this purpose is the PR (Eq. (25)), which is the profit
or loss in each period. However, this function does not consider risks. Table 6
Score summary for the trading sessions with length: 30 days and holding periods: one
The issue of finding a risk function that can be computed at each
day.
step, and can be used as reward function in RL processes, was first
Algorithm Total Return Sharpe Ratio
studied by Moody et al. (1998). They derived an approximation for the
TD(𝜆) (Pendharkar & Cusatis, 2018) 0.329 ± 0.265 0.404 ± 0.126
contributions to the SP at each step, termed the differential Sharpe ratio
CNN (Jiang & Liang, 2017) 1.012 ± 0.495 0.527 ± 0.143
(DSR), which is computed using Eq. (26). They reported agents trained DQN (Bu & Cho, 2018) 1.151 ± 0.521 0.661 ± 0.158
with the DSR performed better than those trained with measures that
DNA-R (ours) 𝟕.𝟏𝟏𝟔 ± 𝟒.𝟓𝟔𝟔 𝟎.𝟕𝟓𝟖 ± 𝟎.𝟏𝟖𝟔
do not consider risks. Both PR and DSR were used to train our agents. DNA-S (ours) 2.798 ± 2.065 0.609 ± 0.201
We named our agents DNA for the dynamic number of assets. The agent
trained with the PR is named DNA-R, and the one trained with the DSR
is named DNA-S.
an average TR of 0.041. The other two approaches: DQN and TD(𝜆)
𝐵[𝑡−1] 𝛥𝐴[𝑡] − 0.5𝐴[𝑡−1] 𝛥𝐵[𝑡]
𝐷𝑆𝑅[𝑡] = ( )3∕2 obtained the lowest scores, with average TRs close to zero. The average
𝐵[𝑡−1] − 𝐴2[𝑡−1] SRs obtained by all the algorithms correlate to their TRs, for instance
( ) (26) DNA-S has the best average SR among the competing algorithms, which
𝐴[𝑡] = 𝐴[𝑡−1] + 𝜂 𝑟[𝑡] − 𝐴[𝑡−1]
( ) is 0.468. Hence, in this setup, DNA-S not only sought high profits, but
𝐵[𝑡] = 𝐵[𝑡−1] + 𝜂 𝑟2[𝑡] − 𝐵[𝑡−1] also managed investment risks better than its competing approaches.
The results of the second experiment (trading sessions of 14 h
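A sketch of the differential Sharpe ratio of Eq. (26) as an incremental reward: the moving estimates A and B of the first and second moments of the period returns are updated with rate η, and the DSR is the resulting step-wise contribution to the Sharpe ratio. The class and attribute names are ours; the zero-denominator guard for the first steps is an assumption of this sketch.

```python
class DifferentialSharpeRatio:
    """Incremental reward of Eq. (26) (Moody et al., 1998)."""

    def __init__(self, eta=0.01):
        self.eta = eta
        self.A = 0.0   # running estimate of the mean period return
        self.B = 0.0   # running estimate of the second moment

    def step(self, r):
        dA = self.eta * (r - self.A)
        dB = self.eta * (r ** 2 - self.B)
        denom = (self.B - self.A ** 2) ** 1.5
        dsr = 0.0 if denom == 0 else (self.B * dA - 0.5 * self.A * dB) / denom
        # Update the moving estimates after computing the reward for this step
        self.A += dA
        self.B += dB
        return dsr

reward_fn = DifferentialSharpeRatio(eta=0.01)
rewards = [reward_fn.step(r) for r in [0.02, -0.01, 0.03]]
```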
6.4. Experiment setups

Three different trading session lengths were chosen for our backtest experiments: 24 h, 14 days and 30 days, each of them with holding periods of 30 min, six hours and one day, respectively. These setups were chosen to test the robustness of our methods to changes in holding periods and trading session lengths. The setups are summarized in Table 2.

Table 2
Trading session lengths and holding periods.

Session length | Holding period | # of periods
1 day | 30 min | 48
14 days | 6 h | 56
30 days | 1 day | 30

During training, episodes were drawn randomly across the training dataset. For testing, the algorithms were run through all the episodes of the test dataset; the performances shown in this work are the average TR and SR obtained by each algorithm during testing.

6.5. Baselines

Our agents were compared to the methods presented in (Bu & Cho, 2018; Jiang & Liang, 2017; Pendharkar & Cusatis, 2018); these baselines were described in Section 2. The main features of our agents and the baselines are summarized in Table 3. All baselines trade a fixed number of assets, which are the cash asset and those assets with the highest capitalizations in the market.

Table 3
Summary of the implemented algorithms.

Algorithm | # of assets | # of features | Description
TD(λ) (Pendharkar & Cusatis, 2018) | 2 | 2 | Value function
CNN (Jiang & Liang, 2017) | 12 | 14 | Policy gradient
DQN (Bu & Cho, 2018) | 8 | 8 | Double-Q learning
DNA-R (ours) | 3–85 | 6 | PPO, period return
DNA-S (ours) | 3–85 | 6 | PPO, Diff. Sharpe ratio

7. Results

In the first experiment, which corresponds to trading sessions of one day and holding periods of 30 min, DNA-S obtained the highest results; these are shown in Table 4. DNA-S obtained an average TR of 0.224, which is more than double the score of the closest competitor, CNN, which obtained 0.088. DNA-R obtained the third best results with an average TR of 0.041. The other two approaches, DQN and TD(λ), obtained the lowest scores, with average TRs close to zero. The average SRs obtained by all the algorithms correlate with their TRs; for instance, DNA-S has the best average SR among the competing algorithms, which is 0.468. Hence, in this setup, DNA-S not only sought high profits, but also managed investment risks better than its competing approaches.

Table 4
Score summary for the trading sessions with length: one day and holding periods: 30 min.

Algorithm | Total Return | Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018) | 0.014 ± 0.021 | 0.064 ± 0.107
CNN (Jiang & Liang, 2017) | 0.088 ± 0.086 | 0.274 ± 0.127
DQN (Bu & Cho, 2018) | −0.019 ± 0.024 | −0.180 ± 0.176
DNA-R (ours) | 0.041 ± 0.042 | 0.166 ± 0.153
DNA-S (ours) | 0.244 ± 0.154 | 0.468 ± 0.200

The results of the second experiment (trading sessions of 14 days and holding periods of 6 h) show important contrasts with respect to the previous experiment. DNA-R obtained the best average TR in this setup, as shown in Table 5, but it did not obtain the best average SR. The average TR of DNA-R in this setup was 2.283, but the standard deviation was 1.371. This indicates that the performance of this algorithm over the course of a year could have large variations. The SR obtained by this algorithm supports this observation. The best average SR in this setup belongs to DQN (0.618). Hence, DQN had higher stability in the increments in portfolio value in the trading sessions. However, this approach only obtained half as much average profit (1.108) as that of DNA-R. CNN and DNA-S have the next best scores, which is surprising, since these two approaches obtained the best results in the previous experiment. Again, TD(λ) had the lowest performance. Nonetheless, its results are better in this setup than in the previous one.

Table 5
Score summary for the trading sessions with length: 14 days and holding periods: 6 h.

Algorithm | Total Return | Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018) | 0.224 ± 0.183 | 0.317 ± 0.125
CNN (Jiang & Liang, 2017) | 0.680 ± 0.257 | 0.411 ± 0.088
DQN (Bu & Cho, 2018) | 1.108 ± 0.503 | 0.618 ± 0.088
DNA-R (ours) | 2.283 ± 1.371 | 0.533 ± 0.095
DNA-S (ours) | 0.468 ± 0.314 | 0.336 ± 0.150

In the third setup (trading sessions of 30 days and holding periods of one day), our agents obtained significantly higher scores than those obtained by the baselines. These results are shown in Table 6. DNA-R and DNA-S obtained the highest average TRs: 7.116 and 2.798, respectively. On the other hand, the results of the baselines are actually very similar to those of the previous experiment, even though the trading session was twice as long. The SR scores correlate with those of the TR, except for DQN, which obtained a higher score than DNA-S.

Table 6
Score summary for the trading sessions with length: 30 days and holding periods: one day.

Algorithm | Total Return | Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018) | 0.329 ± 0.265 | 0.404 ± 0.126
CNN (Jiang & Liang, 2017) | 1.012 ± 0.495 | 0.527 ± 0.143
DQN (Bu & Cho, 2018) | 1.151 ± 0.521 | 0.661 ± 0.158
DNA-R (ours) | 7.116 ± 4.566 | 0.758 ± 0.186
DNA-S (ours) | 2.798 ± 2.065 | 0.609 ± 0.201

In general, our methods adapted better than the baselines to changes in holding periods and trading session lengths. Our experiments show DNA-R is the most robust approach for both profit-seeking and risk-aversion. The TRs and SRs obtained by this agent are good throughout all setups, even though DNA-S had the best performance in the setup with the shortest holding period. We also found that being able to analyze a large pool of assets is beneficial for the agent. Tables 4–6 show the correlation between the number of assets processed by the competing approaches and their overall performances; for instance, the agent that performed worst in all setups was TD(λ), which processed only two assets. Fig. 6 depicts the evolution of the portfolio values of all competing approaches in the three setups.

Fig. 6. Examples of tests on the three setups: (a) Trading session with length: one day and holding periods: 30 min (January, 2019). (b) Trading session with length: 14 days and holding periods: 6 h (September, 2019). (c) Trading session with length: 30 days and holding periods: one day (June and July, 2019).

Based on these results, we believe our method is a step forward in the automatic trading of cryptocurrencies. In this work, we included transaction costs to make our simulations as close to real markets as possible. Trading assets is risky, especially cryptocurrencies, which are extremely volatile. Our recommendation for institutional investors is to use these methods along with loss limiting functions, such as stop-loss, to mitigate the volatility of assets. Additionally, since it was assumed that the traded amounts have to be sufficiently small that the prices of assets are not affected, we recommend trading only small amounts. Further research is necessary to assess the impact of the size of the investments on automatic trading with RL.

8. Conclusions and future work

We introduced a method that performs portfolio management on markets with transaction costs in which the number of assets is dynamic. Even though our method is able to integrate new assets into the process during deployment, it does not require extra training or memory, and its implementation is straightforward. Our method was tested on a cryptocurrency market, outperforming state-of-the-art methods under three distinct setups. Additionally, a novel algorithm to compute transactions with minimal costs, formulated as an LP, was given in this work.

Future works include the implementation of the proposed method in traditional markets such as stock and fiat currency markets. Additionally, due to its straightforward implementation, our method is compatible with advanced trading strategies such as limit, trailing and stop-loss orders; hence, future directions can also include the integration of the mentioned mechanisms into the method.

CRediT authorship contribution statement

Carlos Betancourt: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Writing - original draft, Writing - review & editing, Visualization. Wen-Hui Chen: Resources, Writing - review & editing, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Hyperparameters

See Table A.1.

Table A.1
List of hyperparameters.

Hyperparameter | Value
# of environments (N) | 10
# of mini-batches (M) | 10
# of environment steps (S) | 100,000
# of rollouts per epoch (R) | 10
# of optimization steps | 10,000
PPO decay rate (γ) | 0.99
PPO value-loss coefficient (w_v) | 0.99
PPO clip parameter (ε) | 0.2
PPO noise distribution N(μ, σ²) | 0, 0.01
GA parameter (λ) | 0.95
# of recurrent layers | 2
Recurrent layer length | 10
Adam optimizer (α, β_1, β_2) | 7e−4, 0.9, 0.999
DSR parameter (η) | 0.01

Appendix B. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2020.114002.

References

Aboussalah, A. M., & Lee, C.-G. (2020). Continuous control with stacked deep dynamic recurrent reinforcement learning for portfolio optimization. Expert Systems with Applications, 140, Article 112891.
Berentsen, A., & Schär, F. (2019). Stablecoins: The quest for a low-volatility cryptocurrency. In The economics of fintech and digital currencies (pp. 65–71). CEPR Press.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Bu, S.-J., & Cho, S.-B. (2018). Learning optimal Q-function using deep Boltzmann machine for reliable trading of cryptocurrency. In International conference on intelligent data engineering and automated learning (pp. 468–480). Springer.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on empirical methods in natural language processing (EMNLP 2014).
Corbet, S., McHugh, G., & Meegan, A. (2014). The influence of central bank monetary policy announcements on cryptocurrency return volatility. In Investment management and financial innovations, 14, Iss. 4 (pp. 60–72). Business Perspectives, Publishing Company.
Dantzig, G. B. (1998). Linear programming and extensions. Princeton University Press.
Dempster, M. A., & Leemans, V. (2006). An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30, 543–552.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). IEEE.
Heaton, J., Polson, N., & Witte, J. H. (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry, 33, 3–12.
Jeong, G., & Kim, H. Y. (2019). Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Systems with Applications, 117, 125–138.
Jiang, Z., & Liang, J. (2017). Cryptocurrency portfolio management with deep reinforcement learning. In 2017 Intelligent systems conference (IntelliSys) (pp. 905–913). IEEE.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
Lei, K., Zhang, B., Li, Y., Yang, M., & Shen, Y. (2020). Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading. Expert Systems with Applications, 140, Article 112872.
Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moody, J., & Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12, 875–889.
Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17, 441–470.
Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system.
Narayanan, A., Bonneau, J., Felten, E., Miller, A., & Goldfeder, S. (2016). Bitcoin and cryptocurrency technologies: A comprehensive introduction. Princeton University Press.
Niaki, S. T. A., & Hoseinzade, S. (2013). Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International, 9(1).
OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., & Zaremba, W. (2020). Learning dexterous in-hand manipulation. International Journal of Robotics Research, 39, 3–20.
Ormos, M., & Urbán, A. (2013). Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13, 1587–1597.
Park, H., Sim, M. K., & Choi, D. G. (2020). An intelligent financial portfolio trading strategy using deep Q-learning. Expert Systems with Applications, Article 113573.
Pendharkar, P. C., & Cusatis, P. (2018). Trading financial indices with reinforcement learning agents. Expert Systems with Applications, 103, 1–13.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21, 49–58.
Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Wu, X., Chen, H., Wang, J., Troiano, L., Loia, V., & Fujita, H. (2020). Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences.
Zhang, J., & Maringer, D. (2013). Indicator selection for daily equity trading with recurrent reinforcement learning. In Proceedings of the 15th annual conference companion on genetic and evolutionary computation (pp. 1757–1758).

