Deep reinforcement learning for portfolio management of markets with a dynamic number of assets

Keywords: Reinforcement learning; Deep learning; Portfolio management; Transaction costs; Multiple assets

Abstract: This work proposes a novel portfolio management method using deep reinforcement learning on markets with a dynamic number of assets. This problem is especially important in cryptocurrency markets, which already support the trading of hundreds of assets, with new ones being added every month. A novel neural network architecture is proposed, which is trained using deep reinforcement learning. Our architecture considers all assets in the market and automatically adapts when new ones are suddenly introduced, making our method more general and sample-efficient than previous methods. Further, transaction cost minimization is considered when formulating the problem. For this purpose, a novel algorithm to compute optimal transactions given a desired portfolio is integrated into the architecture. The proposed method was tested on a dataset of one of the largest cryptocurrency markets in the world, outperforming state-of-the-art methods and achieving average daily returns of over 24%.
1. Introduction

Generating financial profits by trading cryptocurrencies is challenging due to their erratic price changes. Cryptocurrencies are decentralized electronic financial assets that appeared as an alternative to fiat currencies (Nakamoto, 2008). However, according to Corbet et al. (2014), the prices of cryptocurrencies are affected by government announcements, policies and actions, in spite of the fact that they are decentralized assets. Additionally, cryptocurrency prices show higher volatility than those of traditional assets. For instance, in early 2017, the price of Bitcoin, the well-known cryptocurrency, reached its historical peak of approximately 19,000 USD per unit, but during the subsequent months it plunged to 3,000 USD, followed by a strong bounce to its current price of approximately 8,000 USD per unit. Because of this price behavior, formulating cryptocurrency trading strategies is a non-trivial task.

Reinforcement learning (RL) is a suitable framework to process complex data and handle difficult decision-making processes such as asset trading. A trading process can be naturally formulated as an RL process. In this type of process, an agent takes actions over an environment based on observations of the states of that environment; rewards are received by that agent as a consequence of both the states visited and the actions taken. In the specific case of asset trading, a state of the environment is equivalent to the recent history of the assets; actions are the transactions made to get rid of some of the assets held by the agent and acquire new ones; and the rewards are scalar functions of the earnings or losses seen by the agent for taking those actions. The vector containing the information of the assets held by an agent at any moment is called the portfolio, hence this type of process is also known as portfolio management.

Typically, RL algorithms have fixed state and action spaces. However, new assets are often added to cryptocurrency markets (Narayanan et al., 2016). Hence, to rapidly incorporate those assets into the process, adaptable state and action spaces are necessary. Most works on automatic asset trading assume the number of assets is static (Bu & Cho, 2018; Jiang & Liang, 2017; Liang et al., 2018; Pendharkar & Cusatis, 2018). In this way, convolutional layers of neural networks can extract useful information about the prices of that specific set of assets. But, by doing so, a large portion of data is wasted, because only a small number of assets is used for training the algorithms while datasets contain information collected from dozens and even hundreds of assets. This is an important issue, not only from a sample-efficiency point of view, but also because substantial earnings may be obtained by trading assets that are suddenly incorporated into a market. For instance, in Fig. 1, Dock (coin) reached 2.2 times its original price during the first 20 days in the market, then fell and settled at about 1.2 times its original price on the subsequent days. This behavior has been observed often in assets recently added to markets; however, we are confident it can be predicted and exploited.

This work proposes an RL method using a recurrent neural network (RNN) to perform portfolio management on markets in which the
policy gradient methods give better results. Park et al. (2020) proposed a two-part network trained with Q-learning. The first part extracts features using a set of recurrent layers to process each asset separately. The second part, named the DNN regressor, is a small dense neural network that processes the extracted features and returns the portfolio. This work uses a similar strategy because this type of architecture saves memory resources. However, the second part of our design is different. Their DNN regressor has a fixed size; hence, if new assets are suddenly added to the process, the second part has to be modified and retrained. Our architecture does not have this limitation. Lei et al. (2020) proposed a design similar to that of Park et al. (2020), but they used attention layers (Vaswani et al., 2017) along with recurrent layers for feature extraction.

The approach presented in this work falls into the model-free category, but is distinct from previous works in some key respects. The aforementioned approaches assume a fixed number of assets during training and testing. Therefore, adding new assets to those methods requires a change in design, since the input and output layers of those networks are a function of that fixed number. Additionally, if the number of assets in the market is large, the size of those networks may become inconveniently large as well. To make our design independent of the number of assets, the input layers of the network separately process each of them, and the results are combined using weighting layers in the network output. These changes make our approach sample-efficient because the entire dataset is used instead of a segment of some specific number of assets. Additionally, the network does not need to be adjusted or retrained if new assets are suddenly introduced into the market.

3. Problem definition

This section introduces the concepts of market, investor and trading session, as well as the mathematical definition of portfolio management.

3.1. Market

A financial market is an environment where investors trade assets to obtain profit. The prices of assets change over time. Hence, by predicting these changes, investors can purchase assets which may increase in value and generate profit from those assets.

In all markets, there is at least one asset which is considered to keep its value constant over time. Prices of all assets are measured with respect to this special asset, and it can be used to purchase any asset in the market. Hence, it is referred to as cash. In stock and fiat currency markets, the most common cash asset is the U.S. Dollar (USD). In cryptocurrency markets, on the other hand, the most popular is the Tether¹ (USDT). This is because the USDT was designed to keep its exchange rate constant with respect to the USD (Berentsen & Schär, 2019); hence, cryptocurrency investors who want to close their positions or halt their trading sessions can rapidly exchange their assets for this risk-free asset.

¹ https://tether.to.

3.2. Portfolio

In stock markets, the specific amounts of assets held by an investor are named shares. Thus, we term the vector containing the specific amounts of assets held by an investor the share vector. The share vector is denoted by 𝒒 and its entries are denoted as q_i, where i ∈ {0, 1, …, n − 1} in a market with n assets. The entries are measured in cash units, and the index 0 is used for the cash asset. The sum of the entries of 𝒒 is the total portfolio value; it represents the money an investor would make by selling all the assets at that moment (without counting transaction costs), and it is denoted by Q. A vector containing the proportional contributions of each asset to the total portfolio value is named the portfolio vector, denoted by 𝒑, and it is computed by taking the share vector divided by the total portfolio value, i.e. 𝒒/Q. The entries of 𝒑 are computed using Eq. (1).

p_i = q_i / Q    (1)

Only long trading is considered in this formulation. Investors begin a trading session with some capital, and all profits are obtained by purchasing assets that gain value during that session. Further, borrowing is not allowed, and neither is short selling nor margin trading. Hence, the entries of both 𝒑 and 𝒒 are non-negative at every moment, and the sum of the entries of 𝒑 adds up to 1. Therefore, there must be at least one p_k and q_k strictly positive at every moment.

3.3. Trading session

In stock markets, trading sessions are specific hours during working days in which investors are allowed to exchange stocks. However, cryptocurrency markets do not have this limitation, since they are open 24 hours a day. Nonetheless, we use the term trading session for the time in which the implemented algorithms are allowed to perform transactions in a simulated environment. In our experiments, trading sessions are subdivided into periods of equal length, which are named holding periods, since transactions are only allowed at the end of each of them.

The change of asset prices at every moment due to the interactions between investors and the market is called the market dynamics. Hence, portfolio vectors at the beginning and end of each period generally differ; let us denote these vectors by 𝒑[t] and 𝒑'[t], respectively, for some period t during a trading session. The entries of these vectors are related by Eq. (2), where c_i[t] and c'_i[t] are the prices of each asset at the beginning and end of period t, respectively. Eq. (2) states that the change in the value of some asset i during period t is proportional to the amount held by the investor and the ratio between the prices at the beginning and the end of that period.

q'_i[t] = q_i[t] c'_i[t] / c_i[t]    (2)

At the end of each period, the recent history of the market is analyzed to propose a portfolio vector that is likely to increase its value in the subsequent period. The market history consists of prices, volumes, market capitalizations and other features of the assets recorded in the latest periods of the trading session. Fig. 2 depicts the steps of a trading session. In this study, the analysis of the market history and the proposition of portfolio vectors is carried out by a deep neural network, which is trained using RL. The design of our network and the training process are described in Section 4.

Fig. 2. Trading process diagram.

In general, the portfolio vector chosen by the network for period t+1 and the portfolio held by the agent at the end of period t differ, because the purpose of the network is to propose assets that have the potential to increase their value in the near future, which are not necessarily those assets held by the agent at that moment. Therefore, some assets need to be exchanged to obtain the portfolio proposed by the network. The exchange is made in two steps. First, shares of some assets are sold,
and the earnings from those transactions are added to the cash. Then, new assets are purchased using the accumulated cash.

However, all transactions have costs proportional to the traded amounts, and those costs decrease the total portfolio value. Thus, it is important to trade efficiently. The problem of finding a set of transactions that yield a desired portfolio, in this case the one proposed by the network, has multiple solutions in general, and different solutions give different decrements in portfolio value. Hence, the use of optimization techniques is necessary to obtain a set of optimal transactions. The ratio between the portfolio values before and after the transactions gives a measure of the loss due to those transactions; this ratio is shown in Eq. (3). The best set of transactions is the one leading to the maximum value of μ. The steps to obtain those optimal transactions are described in Section 5.

μ[t+1] = Q[t+1] / Q'[t]    (3)

Finally, the performances of investors are evaluated using the earnings and losses obtained during trading sessions. The main objective of investors is to obtain net increments in total portfolio value, i.e. profits from the invested capital. However, profit is not the only way to analyze the performance of a set of investments. The oscillations of the earnings and losses seen by an investor during a trading session are used to measure the risk of those investments. Thus, our trading methods are evaluated using both profit-seeking and risk-aversion metrics, which are described in Section 6.2.

In short, due to the market dynamics, it is possible to obtain profits by trading assets. To do this, appropriate portfolios have to be periodically selected, and optimal transactions have to be computed and implemented to satisfy the selected portfolios. The details of our solutions to these problems are described below.

4. Proposed method

A trading session can be naturally formulated in the RL framework. In an RL process, an agent visits the states of an environment and takes actions in each visited state. In return, the environment gives rewards to the agent for taking those actions. After the agent executes an action, the environment evolves into a new state due to both the environmental dynamics and the action itself. An episode is the set of interactions between the agent and the environment from its initialization until a final state is visited. The environmental dynamics are the set of rules by which the environment changes over time, which are not necessarily known by the agent, but can be indirectly measured using the data generated in the process, i.e. states, actions and rewards. These data are also used to improve the performances of agents in future episodes. This is typically done by optimizing a policy function which depends on the rewards received in previous episodes. This framework is suitable for asset trading because the concepts of environment, agents and rewards are analogous to those of markets, investors and profits. A state consists of the market history at some specific point in time at which the investor is allowed to perform transactions. The action is the set of transactions made by the investor, guided by the market history. The transition between states is a waiting period of time in which prices change due to the market dynamics. The rewards are the actual profits or losses obtained by the investor at the end of the period due to the transactions made. States, actions and rewards are represented by s, a and r, respectively. The trading process in the framework of RL is depicted in Fig. 3.

Fig. 3. Block diagram of the trading environment.

4.1. Portfolio management as an RL process

The iteration t of the trading process begins when the agent arrives at state s[t]; this is depicted in the upper part of Fig. 3. State s[t] is the market history at that specific time, which is represented by a tensor with dimensions n × f × k, where n is the number of available assets in the market, f is the number of features per asset, and k represents the latest steps of the process that the agent is allowed to observe. The chosen features are: open, close, maximum and minimum prices in the period, volume, quote-asset volume and the entry of the portfolio corresponding to that asset, a total of 7 features.

The state s[t] is analyzed by the agent to decide on the best possible action a[t] to take. The action a[t] is a vector containing the shares of the assets to be sold and acquired for the next period. These values are computed by the policy. A policy is any map taking states as inputs and assigning actions as outputs. This map is typically implemented using neural networks. In this work, the policy has two parts: the policy network and the transaction optimizer. Our policy network has two gated-recurrent-unit (GRU) layers (Cho et al., 2014). The policy network is denoted by π_θ, where the subindex θ represents the weights of the network that need to be optimized. The output of the policy network is the proposed portfolio vector 𝒑[t]. The action a[t] is computed using the desired portfolio 𝒑[t], the portfolio at that moment 𝒑'[t−1], and the fees for trading the assets. This process is described in Section 5.

The action a[t], which carries the desired amounts to be sold and acquired, is executed in the asset exchange of the environment, resulting in a new set of assets satisfying the portfolio vector proposed by the network. The following step in the process is a waiting time, in which the state s[t] evolves into s[t+1], and consequently 𝒒[t] evolves into 𝒒'[t] due to the market dynamics. The entries of 𝒒'[t] are passed through the reward function, which gives the reward for step t. Finally, once the new state s[t+1] is reached, the reward r[t] is given to the agent and a new cycle begins.

The reward r[t] is a scalar value representing the performance of the agent in the current period. The reward is computed using a reward function, which in a trading environment depends on the earnings and losses received by the agent in recent periods. In this work, two financial measures are used for this purpose: the period return and the differential Sharpe ratio (Moody et al., 1998). Reward functions are analogous to overall performance measures, but they are computed for individual periods instead of the entire process. Hence, both reward functions and overall performance measures are described together in Section 6.

The process finishes when a maximum allowed number of states have been visited. When this happens, the rewards are used to compute the total discounted reward (R), shown in Eq. (4). This is a performance measure used to train the policy. The variable γ in the equation is named the discount factor, and it lies in the half-open interval (0, 1]; t is the index of each step in the process, and T is the index of the terminal state. The interpretation of Eq. (4) is that agents value earlier rewards more than later ones. From a financial point of view, this means that if large earnings are obtained early, the potential to obtain large final profits
increases, because profit depends on both the initial capital and enough time for the assets to gain value.

R = Σ_{t=1}^{T} γ^{t−1} r[t]    (4)
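As a concrete illustration (not part of the original paper), the total discounted reward in Eq. (4) can be computed directly from the sequence of per-period rewards; the following is a minimal Python sketch, where the rewards r[1..T] are assumed to come from the trading environment described above.

    # Minimal sketch of Eq. (4); rewards are the per-period rewards r[1..T].
    def total_discounted_reward(rewards, gamma=0.99):
        """R = sum_{t=1}^{T} gamma**(t-1) * r[t]; earlier rewards weigh more."""
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Example: three holding periods whose period returns are used as rewards.
    R = total_discounted_reward([0.02, -0.01, 0.03], gamma=0.99)   # ~0.0395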
the network independent of the number of assets in the market. This architecture differs in several key respects from those designed for a fixed number of assets. Our design uses all the available data of the market, instead of only using the data of a small subset of assets. This makes the resulting network more robust at processing assets that were never seen by the network. Additionally, our network requires less memory because the input does not depend on the number of assets but only on the number of features. A possible disadvantage of our design is that correlations between assets may not be found. However, our architecture is able to process higher amounts of data during training; therefore, a more general space of solutions is explored, obtaining higher robustness than other approaches. The method used to train our agents is summarized in Algorithm 1. The values of all hyper-parameters are listed in Table A.1.
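To make the per-asset weight-sharing idea above concrete, the following is a minimal sketch, not the authors' exact architecture: the choice of PyTorch, the hidden size, the scoring head and the softmax output are assumptions for illustration only, and the handling of the cash asset is omitted. What it does reflect from the text is that two GRU layers with shared weights process each asset's feature history independently, and the per-asset results are combined into a non-negative portfolio vector that sums to 1, so the same parameters handle any number of assets.

    import torch
    import torch.nn as nn

    class PerAssetPolicy(nn.Module):
        def __init__(self, n_features: int = 6, hidden: int = 32):
            super().__init__()
            # Two GRU layers process the history of ONE asset at a time; the same
            # weights are reused for every asset, so the network size depends only
            # on the number of features, not on the number of assets n.
            self.gru = nn.GRU(input_size=n_features, hidden_size=hidden,
                              num_layers=2, batch_first=True)
            self.score = nn.Linear(hidden, 1)   # one scalar score per asset

        def forward(self, history: torch.Tensor) -> torch.Tensor:
            # history: (n_assets, k, n_features) -- the market history tensor of
            # Section 4.1, with the asset dimension used as the batch dimension.
            _, h = self.gru(history)                 # h: (num_layers, n_assets, hidden)
            scores = self.score(h[-1]).squeeze(-1)   # (n_assets,)
            # Softmax across assets yields a non-negative portfolio summing to 1.
            return torch.softmax(scores, dim=0)

    # Usage: the same weights handle 10 or 85 assets without changing the network.
    net = PerAssetPolicy()
    p_small = net(torch.randn(10, 50, 6))   # 10 assets, 50 periods of history
    p_large = net(torch.randn(85, 50, 6))   # 85 assets, same parameters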
Algorithm 1 Trading with dynamic number of assets (DNA)
Inputs: # of environment steps, # of parallel environments, # of rollouts per epoch, # of mini-batches, 𝒃 and 𝒔 (fee vectors).
Outputs: π_θ  ▷ Policy network
1: Initialize the action-state-reward buffer.
2: Reset all environments at random states.
3: for epoch i = 1, 2, … do  ▷ # of epochs = # of environment steps / (# of parallel environments × # of rollouts per epoch)
4:   for worker j = 1, 2, … do
5:     for rollout k = 1, 2, … do
6:       𝒑[t+1] ← π_θ(s[t])
7:       a[t] ← Trans-Opt(𝒑'[t], 𝒑[t+1], 𝒃, 𝒔)  ▷ LP (Section 5)
8:       Save to buffer: s[t], a[t], r[t] and s[t+1]
9:       if t + 1 = T then  ▷ Terminal state
10:        Reset environment at a random state.
11:  Shuffle samples in buffer.
12:  for mini-batch j = 1, 2, … do
13:    s[j], a[j], r[j], s[j+1] ← draw samples from buffer.
14:    Â[j] ← compute GA from samples.
15:    L_PPO ← compute using Eq. (10).
16:    π_θ ← Adam(L_PPO, α, β1, β2)  ▷ Optimize π_θ

5. Transaction optimization problem

The outputs of the policy network are the entries of the portfolio vector p_i[t]. In general p_i[t] ≠ p'_i[t−1]; consequently, some assets have to be sold and others purchased to satisfy the desired portfolio vector. For simplicity, let us drop the time dependency of the expressions, since it is understood by context; for instance, p_i[t] and p'_i[t−1] are written as p_i and p'_i, respectively. Let us also represent the shares acquired and sold per asset at some period by the non-negative variables x̂_i and ŷ_i, respectively. Then, the resulting amounts for non-cash assets due to the transactions at that period are given by Eq. (12). This formula states that the shares of any asset after the transactions are the shares before those transactions minus the sold amount plus the acquired amount. Intuitively, either x̂_i or ŷ_i should be exactly zero, because it does not make sense to sell an asset and then buy it again in the same period; however, the optimization algorithm described at the end of this section takes care of this issue.

q_i = q'_i − ŷ_i + x̂_i,   i ∈ {1, 2, …, n − 1}    (12)

The shares sold for each asset are added to the cash, and that amount is used to buy new assets. However, both buying and selling assets reduce the total portfolio value. These conditions are represented in Eq. (13), where b_i and s_i are the buying and selling fees for the i-th asset, and lie in the half-open interval [0, 1). Similar to the previous formula, Eq. (13) states that the amount of cash after the transactions is the amount before the transactions plus the cash obtained by selling the undesired assets minus the cash used to purchase new assets. In addition, it contains the amounts subtracted from the cash due to transaction costs. Note that the losses due to transaction costs are proportional to both the traded shares and the fees for each asset.

q_0 = q'_0 + Σ_{j=1}^{n−1} ŷ_j − Σ_{j=1}^{n−1} x̂_j − Σ_{j=1}^{n−1} s_j ŷ_j − Σ_{j=1}^{n−1} b_j x̂_j    (13)

Eqs. (12) and (13) need to be modified to incorporate the variable μ defined in Eq. (3), which represents the decrement in total portfolio value that needs to be maximized to obtain a set of optimal transactions. Additionally, the entries of the vector 𝒑 also have to be incorporated, because this is the vector computed by the neural network. To obtain explicit expressions for μ and the entries of 𝒑, Eqs. (1) and (3) were substituted into Eqs. (12) and (13), and the resulting expressions were divided by Q'. The results of these operations are shown in Eqs. (14) and (15), where x_i and y_i are a new set of variables defined by x_i = x̂_i/Q' and y_i = ŷ_i/Q'.

μ p_i = p'_i − y_i + x_i,   i ∈ {1, 2, …, n − 1}    (14)

μ p_0 = p'_0 + Σ_{j=1}^{n−1} (1 − s_j) y_j − Σ_{j=1}^{n−1} (1 + b_j) x_j    (15)

To optimize μ, an explicit function for this variable was obtained by adding the n − 1 expressions represented by Eq. (14) to Eq. (15); the resulting expression is shown in Eq. (16).

μ = 1 − Σ_{j=1}^{n−1} s_j y_j − Σ_{j=1}^{n−1} b_j x_j    (16)

Note that Eqs. (14)–(16) are linear functions of μ, x_i and y_i; hence, using these equations the transaction optimization problem can be formulated as an LP. If a feasible solution exists for some LP, then the optimal solution to that LP can always be written as a closed-form expression using Dantzig's Simplex Method (Dantzig, 1998), or approximated up to any desired accuracy using other convex optimization techniques (Boyd & Vandenberghe, 2004).²

To reduce one decision variable in the problem, Eq. (16) is substituted into Eq. (14), resulting in Eq. (17), and in this way μ only appears explicitly in the objective function. Note that Eq. (15) is implicit in Eqs. (16) and (17), so it is omitted in the formulation.

(1 − Σ_{j=1}^{n−1} s_j y_j − Σ_{j=1}^{n−1} b_j x_j) p_i = p'_i − y_i + x_i    (17)

Additionally, no more than the available shares held by the investor can be sold. This condition is explicitly shown in Eq. (18). Similarly, the amount expended purchasing assets and paying transaction fees has to be at most the cash before the transactions plus the amount obtained by selling shares; therefore, Eq. (19) has to be satisfied as well.

q'_i ≥ ŷ_i    (18)

q'_0 + Σ_{j=1}^{n−1} ŷ_j ≥ Σ_{j=1}^{n−1} x̂_j + Σ_{j=1}^{n−1} b_j x̂_j + Σ_{j=1}^{n−1} s_j ŷ_j    (19)

Dividing Eqs. (18) and (19) by Q' and rearranging terms results in Eqs. (20) and (21), which are equivalent to Eqs. (18) and (19) but contain x_i and y_i, the decision variables of the LP described below.

p'_i ≥ y_i    (20)

p'_0 ≥ −Σ_{j=1}^{n−1} (1 − s_j) y_j + Σ_{j=1}^{n−1} (1 + b_j) x_j    (21)

The LP shown in Eq. (22) was stated using Eq. (16) as the objective function, Eq. (17) as the set of equality constraints and Eqs. (20) and (21) as the set of inequality constraints. This LP is always feasible; thus, an optimal set of transactions can always be computed.

² There are plenty of LP solvers written in C++, Python and other languages available on the internet at no cost.
Maximize:

1 − Σ_{j=1}^{n−1} s_j y_j − Σ_{j=1}^{n−1} b_j x_j

Subject to:

1. y_i − x_i + (Σ_{j=1}^{n−1} s_j y_j + Σ_{j=1}^{n−1} b_j x_j) p_i = p'_i − p_i    (22)
2. y_i ≤ p'_i
3. −Σ_{j=1}^{n−1} (1 − s_j) y_j + Σ_{j=1}^{n−1} (1 + b_j) x_j ≤ p'_0
4. y_i, x_i ≥ 0,   for i ∈ {1, 2, …, n − 1}

Fig. 5. Number of active assets in the Binance market in the period 2017-08-17 to 2019-11-01 (only assets directly exchangeable for USDT were counted).
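As an illustration (not the authors' implementation), the LP in Eq. (22) can be handed to an off-the-shelf solver. The sketch below uses scipy.optimize.linprog, orders the decision variables as [y_1..y_{n−1}, x_1..x_{n−1}], and rewrites the maximization of μ as the minimization of Σ_j s_j y_j + b_j x_j; the function name and the example values are hypothetical.

    import numpy as np
    from scipy.optimize import linprog

    def optimal_transactions(p_prime, p, buy_fees, sell_fees):
        """p_prime: current portfolio (index 0 = cash); p: target portfolio from the network."""
        p_prime, p = np.asarray(p_prime, float), np.asarray(p, float)
        s, b = np.asarray(sell_fees, float), np.asarray(buy_fees, float)  # fees for assets 1..n-1
        m = len(p) - 1                     # number of non-cash assets

        # Objective: minimize s.y + b.x  (equivalent to maximizing mu in Eq. (16)).
        c = np.concatenate([s, b])

        # Equality constraints (constraint 1 of Eq. (22)):
        #   y_i - x_i + (sum_j s_j y_j + b_j x_j) * p_i = p'_i - p_i,  i = 1..n-1
        A_eq = np.hstack([np.eye(m) + np.outer(p[1:], s), -np.eye(m) + np.outer(p[1:], b)])
        b_eq = p_prime[1:] - p[1:]

        # Inequality constraints: y_i <= p'_i (constraint 2) and the cash budget (constraint 3).
        A_ub = np.vstack([np.hstack([np.eye(m), np.zeros((m, m))]),
                          np.concatenate([-(1 - s), 1 + b])[None, :]])
        b_ub = np.concatenate([p_prime[1:], [p_prime[0]]])

        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        y, x = res.x[:m], res.x[m:]
        mu = 1.0 - s @ y - b @ x           # Eq. (16): remaining fraction of portfolio value
        return y, x, mu

    # Example: move from all-cash to a 50/50 split with one asset, 0.1% fees (mu ~ 0.9995).
    y, x, mu = optimal_transactions([1.0, 0.0], [0.5, 0.5], buy_fees=[0.001], sell_fees=[0.001])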
6. Experiments

This section describes the setups of our experiments. These include dataset features, metrics for evaluating the algorithms and implementation details. We assume the amounts traded by our agents are sufficiently small that the prices of assets are not affected by these transactions, and the available shares in the market are large enough that transactions are executed immediately.

6.1. Dataset

The dataset of Binance³ was used in all our experiments. Binance is one of the largest cryptocurrency markets in the world,⁴ and its dataset can be accessed at no cost through the website's API. The data used in this work corresponds to the market history from 2017-08-17 to 2019-11-01. Only assets that can be directly exchanged for USDT were considered; these include Bitcoin, Ethereum, Litecoin and others. The Binance market has grown rapidly since its creation in 2017. Note that, during the selected period, the number of active assets that can be exchanged for USDT increased from three to 85, as shown in Fig. 5. The dataset consists of asset features recorded in equal-length sampling periods. The dataset has multiple available sampling periods; three of them were chosen for our experiments: 30 min, six hours and one day. There are nine features available for each asset at each sampled period, which include four price features: open, high, low, and close prices; four volume features: standard, quote-asset, taker-buy-asset and taker-buy-quote-asset volumes; and the number of trades in the sampled period. The first six features were used as inputs for our network. The data was divided chronologically into two parts. The first 60% of the data was used for training and the rest for testing. This type of division was chosen because investors use the past price behavior of assets to predict future tendencies or repetitions of patterns. The test dataset corresponds to 11 months of asset history. Hence, the results can give a general idea of the expected average performances of our methods at any point of the year. Missing data were completed using linear interpolation. Fees were set as in the Binance exchange website, 0.1% for all assets except BNB (coin), which has a fee of 0.05%.⁵ Table 1 summarizes the main properties of the dataset.

³ www.binance.com.
⁴ According to Coin Market Cap (www.coinmarketcap.com).
⁵ This discount changes after the first 12 months of use of the platform.

Table 1
Dataset properties.

Feature                              Value
Total days                           806
Training days                        484 (60%)
Test days                            322 (40%)
# of assets                          3–85
# of features                        9
Fees (except BNB)                    0.1%
Fees BNB                             0.05%
# of entries per sampling period
  30 minutes                         38688
  six hours                          3224
  one day                            806

6.2. Metrics

Two metrics were used in this study for evaluating the profit-seeking and risk-aversion behaviors of the implemented algorithms: the total return (TR) and the Sharpe ratio (SR) (Sharpe, 1994). The total return is the total profit or loss obtained by the investor in a trading session, and it is computed using Eq. (23). This measure only considers the portfolio values at the beginning and end of a session. Thus, it gives high scores to trading strategies that seek high profits.

TR = (Q[T] − Q'[0]) / Q'[0]    (23)

On the other hand, the SR was used to evaluate the risks taken by the algorithms during the trading sessions. The SR is computed using Eq. (24), where the mean and standard deviation are taken over the period returns (PR) of all steps in the trading session. PRs are computed using Eq. (25). The risk of an investment is the uncertainty the investor has about the future price of an asset that he or she intends to buy or sell, which is hard to evaluate. Instead, the SR measures the risk of a set of investments after observing their outcome. This is done by computing the statistical measures of the earnings and losses at each period, which are represented by PR. Note that PR and TR are similar expressions; the difference between them is that PR is computed for each individual step while TR is computed for the entire process. The SR and the TR favor algorithms which seek high returns; however, the difference between them is that the SR penalizes portfolios that experience large oscillations in gain during the trading sessions.

SR = mean_t(PR[t]) / std_t(PR[t])    (24)

PR[t] = (Q[t] − Q'[t−1]) / Q'[t−1]    (25)

6.3. Reward functions

The value of r[t] is a measure of the performance of an agent at period t. Any real scalar function that maps good performances to high values can be used to compute rewards. In trading environments, the most
natural choice for this purpose is the PR (Eq. (25)), which is the profit or loss in each period. However, this function does not consider risks. The issue of finding a risk function that can be computed at each step, and can be used as a reward function in RL processes, was first studied by Moody et al. (1998). They derived an approximation for the contributions to the SR at each step, termed the differential Sharpe ratio (DSR), which is computed using Eq. (26). They reported that agents trained with the DSR performed better than those trained with measures that do not consider risks. Both PR and DSR were used to train our agents. We named our agents DNA, for dynamic number of assets. The agent trained with the PR is named DNA-R, and the one trained with the DSR is named DNA-S.

DSR[t] = (B[t−1] ΔA[t] − 0.5 A[t−1] ΔB[t]) / (B[t−1] − A[t−1]²)^{3/2}
A[t] = A[t−1] + η (r[t] − A[t−1])
B[t] = B[t−1] + η (r[t]² − B[t−1])    (26)
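As an illustration (not taken from the paper's code), the performance measures of Eqs. (23)–(25) and the DSR update of Eq. (26) can be computed as in the sketch below; the initial values A = B = 0 and the zero-denominator guard are assumptions, and η corresponds to the DSR parameter listed in Table A.1.

    import numpy as np

    def total_return(q_final, q_initial):
        """TR = (Q[T] - Q'[0]) / Q'[0], Eq. (23)."""
        return (q_final - q_initial) / q_initial

    def period_return(q_end, q_start):
        """PR[t] = (Q[t] - Q'[t-1]) / Q'[t-1], Eq. (25)."""
        return (q_end - q_start) / q_start

    def sharpe_ratio(period_returns):
        """SR = mean_t(PR[t]) / std_t(PR[t]), Eq. (24)."""
        pr = np.asarray(period_returns, float)
        return pr.mean() / pr.std()

    class DifferentialSharpeRatio:
        """Per-period contribution to the Sharpe ratio, Eq. (26) (Moody et al., 1998)."""
        def __init__(self, eta=0.01):
            self.eta = eta
            self.A = 0.0   # running estimate of the first moment of the rewards
            self.B = 0.0   # running estimate of the second moment of the rewards

        def update(self, r):
            dA = self.eta * (r - self.A)          # Delta A[t]
            dB = self.eta * (r ** 2 - self.B)     # Delta B[t]
            denom = (self.B - self.A ** 2) ** 1.5
            dsr = 0.0 if denom <= 0 else (self.B * dA - 0.5 * self.A * dB) / denom
            self.A += dA
            self.B += dB
            return dsr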
6.4. Experiment setups

Three different trading session lengths were chosen for our backtest experiments: 24 h, 14 days and 30 days, with holding periods of 30 min, six hours and one day, respectively. These setups were chosen to test the robustness of our methods to changes in holding periods and trading session lengths. The setups are summarized in Table 2.

During training, episodes were drawn randomly across the training dataset. For testing, the algorithms were run through all the episodes of the test dataset; the performances shown in this work are the average TR and SR obtained by each algorithm during testing.

Table 2
Trading session lengths and holding periods.

Session length    Holding period    # of periods
1 day             30 min            48
14 days           6 h               56
30 days           1 day             30

6.5. Baselines

Our agents were compared to the methods presented in Bu and Cho (2018), Jiang and Liang (2017) and Pendharkar and Cusatis (2018); these baselines were described in Section 2. The main features of our agents and baselines are summarized in Table 3. All baselines trade a fixed number of assets, which are the cash asset and those assets with the highest capitalizations in the market.

Table 3
Summary of the implemented algorithms.

Algorithm                              # of assets    # of features    Description
TD(λ) (Pendharkar & Cusatis, 2018)     2              2                Value function
CNN (Jiang & Liang, 2017)              12             14               Policy gradient
DQN (Bu & Cho, 2018)                   8              8                Double-Q learning
DNA-R (ours)                           3–85           6                PPO, period return
DNA-S (ours)                           3–85           6                PPO, Diff. Sharpe ratio

7. Results

In the first experiment, which corresponds to trading sessions of 1 day and holding periods of 30 min, DNA-S obtained the highest results; these are shown in Table 4. DNA-S obtained an average TR of 0.244, which is more than double the score of the closest competitor, CNN, which obtained 0.088. DNA-R obtained the third best results with an average TR of 0.041. The other two approaches, DQN and TD(λ), obtained the lowest scores, with average TRs close to zero. The average SRs obtained by all the algorithms correlate with their TRs; for instance, DNA-S has the best average SR among the competing algorithms, which is 0.468. Hence, in this setup, DNA-S not only sought high profits, but also managed investment risks better than its competing approaches.

Table 4
Score summary for the trading sessions with length: one day and holding periods: 30 min.

Algorithm                              Total Return      Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018)     0.014 ± 0.021     0.064 ± 0.107
CNN (Jiang & Liang, 2017)              0.088 ± 0.086     0.274 ± 0.127
DQN (Bu & Cho, 2018)                   −0.019 ± 0.024    −0.180 ± 0.176
DNA-R (ours)                           0.041 ± 0.042     0.166 ± 0.153
DNA-S (ours)                           0.244 ± 0.154     0.468 ± 0.200

The results of the second experiment (trading sessions of 14 days and holding periods of 6 h) show important contrasts with respect to the previous experiment. DNA-R obtained the best average TR in this setup, as shown in Table 5, but it did not obtain the best average SR. The average TR of DNA-R in this setup was 2.283, but the standard deviation was 1.371. This indicates that the performance of this algorithm over the course of a year could have large variations. The SR obtained by this algorithm supports this observation. The best average SR in this setup belongs to DQN (0.618). Hence, DQN had higher stability in the increments in portfolio value during the trading sessions. However, this approach only obtained half as much average profit (1.108) as that of DNA-R. CNN and DNA-S have the next best scores, which is surprising, since these two approaches obtained the best results in the previous experiment. Again, TD(λ) had the lowest performance. Nonetheless, its results are better in this setup than in the previous one.

Table 5
Score summary for the trading sessions with length: 14 days and holding periods: 6 h.

Algorithm                              Total Return      Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018)     0.224 ± 0.183     0.317 ± 0.125
CNN (Jiang & Liang, 2017)              0.680 ± 0.257     0.411 ± 0.088
DQN (Bu & Cho, 2018)                   1.108 ± 0.503     0.618 ± 0.088
DNA-R (ours)                           2.283 ± 1.371     0.533 ± 0.095
DNA-S (ours)                           0.468 ± 0.314     0.336 ± 0.150

In the third setup (trading sessions of 30 days and holding periods of one day), our agents obtained significantly higher scores than those obtained by the baselines. These results are shown in Table 6. DNA-R and DNA-S obtained the highest average TRs: 7.116 and 2.798, respectively. On the other hand, the results of the baselines are actually very similar to those of the previous experiment, even though the trading session was twice as long. The SR scores correlate with those of the TR, except for DQN, which obtained a higher score than DNA-S.

Table 6
Score summary for the trading sessions with length: 30 days and holding periods: one day.

Algorithm                              Total Return      Sharpe Ratio
TD(λ) (Pendharkar & Cusatis, 2018)     0.329 ± 0.265     0.404 ± 0.126
CNN (Jiang & Liang, 2017)              1.012 ± 0.495     0.527 ± 0.143
DQN (Bu & Cho, 2018)                   1.151 ± 0.521     0.661 ± 0.158
DNA-R (ours)                           7.116 ± 4.566     0.758 ± 0.186
DNA-S (ours)                           2.798 ± 2.065     0.609 ± 0.201

In general, our methods adapted better than the baselines to changes in holding periods and trading session lengths. Our experiments show DNA-R is the most robust approach for both profit-seeking and risk-aversion. The TRs and SRs obtained by this agent are good throughout all setups, even though DNA-S had the best performance in the setup with the shortest holding period. We also found that being able to analyze a large pool of assets is beneficial for the agent. Tables 4–6 show the
correlation between the number of assets processed by the competing approaches and their overall performances; for instance, the agent that performed worst in all setups was TD(λ), which processed only two assets. Fig. 6 depicts the evolution of the portfolio values of all competing approaches in the three setups.

Fig. 6. Examples of tests on the three setups: (a) Trading session with length: one day and holding periods: 30 min (January, 2019). (b) Trading session with length: 14 days and holding periods: 6 h (September, 2019). (c) Trading session with length: 30 days and holding periods: one day (June and July, 2019).

Based on these results, we believe our method is a step forward in the automatic trading of cryptocurrencies. In this work, we included transaction costs to make our simulations as close to real markets as possible. Trading assets is risky, especially cryptocurrencies, which are extremely volatile. Our recommendation for institutional investors is to use these methods along with loss-limiting functions, such as stop-loss.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Hyperparameters

See Table A.1.

Table A.1
List of hyperparameters.

Hyperparameter                          Value
# of environments                       10
# of mini-batches                       10
# of environment steps                  100000
# of rollouts per epoch                 10
# of optimization steps                 10000
PPO decay rate (γ)                      0.99
PPO value-loss coefficient (w_v)        0.99
PPO clip parameter (ε)                  0.2
PPO noise distribution (μ, σ²)          0, 0.01
GA parameter (λ)                        0.95
# of recurrent layers                   2
Recurrent layers length                 10
Adam-Optimizer (α, β1, β2)              7e−4, 0.9, 0.999
DSR parameter (η)                       0.01

Appendix B. Supplementary data

Supplementary material related to this article can be found online at https://doi.org/10.1016/j.eswa.2020.114002.
References

Aboussalah, A. M., & Lee, C.-G. (2020). Continuous control with stacked deep dynamic recurrent reinforcement learning for portfolio optimization. Expert Systems with Applications, 140, Article 112891.
Berentsen, A., & Schär, F. (2019). Stablecoins: The quest for a low-volatility cryptocurrency. In The economics of fintech and digital currencies (pp. 65–71). CEPR Press.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.
Bu, S.-J., & Cho, S.-B. (2018). Learning optimal Q-function using deep Boltzmann machine for reliable trading of cryptocurrency. In International conference on intelligent data engineering and automated learning (pp. 468–480). Springer.
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on empirical methods in natural language processing (EMNLP 2014).
Corbet, S., McHugh, G., & Meegan, A. (2014). The influence of central bank monetary policy announcements on cryptocurrency return volatility. Investment Management and Financial Innovations, 14(4), 60–72. Business Perspectives Publishing Company.
Dantzig, G. B. (1998). Linear programming and extensions. Princeton University Press.
Dempster, M. A., & Leemans, V. (2006). An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30, 543–552.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). IEEE.
Heaton, J., Polson, N., & Witte, J. H. (2017). Deep learning for finance: Deep portfolios. Applied Stochastic Models in Business and Industry, 33, 3–12.
Jeong, G., & Kim, H. Y. (2019). Improving financial trading decisions using deep Q-learning: Predicting the number of shares, action strategies, and transfer learning. Expert Systems with Applications, 117, 125–138.
Jiang, Z., & Liang, J. (2017). Cryptocurrency portfolio management with deep reinforcement learning. In 2017 Intelligent Systems Conference (IntelliSys) (pp. 905–913). IEEE.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lei, K., Zhang, B., Li, Y., Yang, M., & Shen, Y. (2020). Time-driven feature-aware jointly deep reinforcement learning for financial signal representation and algorithmic trading. Expert Systems with Applications, 140, Article 112872.
Liang, Z., Chen, H., Zhu, J., Jiang, K., & Li, Y. (2018). Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Moody, J., & Saffell, M. (2001). Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12, 875–889.
Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17, 441–470.
Nakamoto, S. (2008). Bitcoin: A peer-to-peer electronic cash system.
Narayanan, A., Bonneau, J., Felten, E., Miller, A., & Goldfeder, S. (2016). Bitcoin and cryptocurrency technologies: A comprehensive introduction. Princeton University Press.
Niaki, S. T. A., & Hoseinzade, S. (2013). Forecasting S&P 500 index using artificial neural networks and design of experiments. Journal of Industrial Engineering International, 9(1).
OpenAI, Andrychowicz, M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., & Zaremba, W. (2020). Learning dexterous in-hand manipulation. International Journal of Robotics Research, 39, 3–20.
Ormos, M., & Urbán, A. (2013). Performance analysis of log-optimal portfolio strategies with transaction costs. Quantitative Finance, 13, 1587–1597.
Park, H., Sim, M. K., & Choi, D. G. (2020). An intelligent financial portfolio trading strategy using deep Q-learning. Expert Systems with Applications, Article 113573.
Pendharkar, P. C., & Cusatis, P. (2018). Trading financial indices with reinforcement learning agents. Expert Systems with Applications, 103, 1–13.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sharpe, W. F. (1994). The Sharpe ratio. Journal of Portfolio Management, 21, 49–58.
Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).
Wu, X., Chen, H., Wang, J., Troiano, L., Loia, V., & Fujita, H. (2020). Adaptive stock trading strategies with deep reinforcement learning methods. Information Sciences.
Zhang, J., & Maringer, D. (2013). Indicator selection for daily equity trading with recurrent reinforcement learning. In Proceedings of the 15th annual conference companion on genetic and evolutionary computation (pp. 1757–1758).