
A reinforcement learning extension to the Almgren-Chriss framework for optimal trade execution

Dieter Hendricks and Diane Wilcox
School of Computational and Applied Mathematics
University of the Witwatersrand, Johannesburg, South Africa
Email: [email protected], [email protected]

arXiv:1403.2229v1 [q-fin.TR] 10 Mar 2014

Abstract—Reinforcement learning is explored as a candidate machine learning technique to enhance existing analytical solutions for optimal trade execution with elements from the market microstructure. Given a volume-to-trade, fixed time horizon and discrete trading periods, the aim is to adapt a given volume trajectory such that it is dynamic with respect to favourable/unfavourable conditions during realtime execution, thereby improving overall cost of trading. We consider the standard Almgren-Chriss model with linear price impact as a candidate base model. This model is popular amongst sell-side institutions as a basis for arrival price benchmark execution algorithms. By training a learning agent to modify a volume trajectory based on the market's prevailing spread and volume dynamics, we are able to improve post-trade implementation shortfall by up to 10.3% on average compared to the base model, based on a sample of stocks and trade sizes in the South African equity market.

I. INTRODUCTION

A critical problem faced by participants in investment markets is the so-called optimal liquidation problem, viz. how best to trade a given block of shares at minimal cost. Here, cost can be interpreted as in Perold's implementation shortfall ([21]), i.e. adverse deviations of actual transaction prices from an arrival price baseline when the investment decision is made. Alternatively, cost can be measured as a deviation from the market volume-weighted trading price (VWAP) over the trading period, effectively comparing the specific trader's performance to that of the average market trader. In each case, the primary problem faced by the trader/execution algorithm is the compromise between price impact and opportunity cost when executing an order.

Price impact here refers to adverse price moves due to a large trade size absorbing liquidity supply at available levels in the order book (temporary price impact). As market participants begin to detect the total volume being traded, they may also adjust their bids/offers downward/upward to anticipate order matching (permanent price impact) [16]. To avoid price impact, traders may split a large order into smaller child orders over a longer period. However, there may be exogenous market forces which result in execution at adverse prices (opportunity cost). This behaviour of institutional investors was empirically demonstrated in [9], where they observed that typical trades of large investment management firms are almost always broken up into smaller trades and executed over the course of a day or several days.

Several authors have studied the problem of optimal liquidation, with a strong bias towards stochastic dynamic programming solutions. See [7], [17], [26], [1] as examples. In this paper, we consider the application of a machine learning technique to the problem of optimal liquidation. Specifically, we consider a case where the popular Almgren-Chriss closed-form solution for a trading trajectory (see [1]) can be enhanced by exploiting microstructure attributes over the trading horizon using a reinforcement learning technique.

Reinforcement learning in this context is essentially a calibrated policy mapping states to optimal actions. Each state is a vector of observable attributes which describe the current configuration of the system. It proposes a simple, model-free mechanism for agents to learn how to act optimally in a controlled Markovian domain, where the quality of action chosen is successively improved for a given state [10]. For the optimal liquidation problem, the algorithm examines the salient features of the current order book and current state of execution in order to decide which action (e.g. child order price or volume) to select to service the ultimate goal of minimising cost.

The first documented large-scale empirical application of reinforcement learning algorithms to the problem of optimised trade execution in modern financial markets was conducted by [20]. They set up their problem as a minimisation of implementation shortfall for a buying/selling program over a fixed time horizon with discrete time periods. For actions, the agent could choose a price to repost a limit order for the remaining shares in each discrete period. State attributes included elapsed time, remaining inventory, current spread, immediate cost and signed volume. In their results, they found that their reinforcement learning algorithm improved the execution efficiency by 50% or more over traditional submit-and-leave or market order policies.
Instead of a pure reinforcement learning solution to the problem, as in [20], we propose a hybrid approach which enhances a given analytical solution with attributes from the market microstructure. Using the Almgren-Chriss (AC) model as a base, for a finite liquidation horizon with discrete trading periods, the algorithm determines the proportion of the AC-suggested trajectory to trade based on prevailing volume/spread attributes. One would expect, for example, that allowing the trajectory to be more aggressive when volumes are relatively high and spreads are tight may reduce the ultimate cost of the trade. In our implementation, a static volume trajectory is preserved for the duration of the trade, however the proportion traded is dynamic with respect to market dynamics. As in [20], a market order is executed at the end of the trade horizon for the remaining volume, to ensure complete liquidation. An important consideration in our analysis is the specification of the problem as a finite-horizon Markov Decision Process (MDP) and the consequences for optimal policy convergence of the reinforcement learning algorithm. In [20], they use an approximation in their framework to address this issue by incorporating elapsed time as a state attribute, however they do not explicitly discuss convergence. We will use the findings of [14] in our model specification and demonstrate near-optimal policy convergence of the finite-horizon MDP problem.

The model described above is compared with the base Almgren-Chriss model to determine whether it increases/decreases the cost of execution for different types of trades consistently and significantly. This study will help determine whether reinforcement learning is a viable technique which can be used to extend existing closed-form solutions to exploit the nuances of the microstructure where the algorithms are applied.

This paper proceeds as follows: Section 2 introduces the standard Almgren-Chriss model. Section 3 describes the specific hybrid reinforcement learning technique proposed, along with a discussion regarding convergence to optimum action values. Section 4 discusses the data used and results, comparing the 2 models for multiple trade types. Section 5 concludes and proposes some extensions for further research.

II. THE ALMGREN-CHRISS MODEL

Bertsimas and Lo are pioneers in the area of optimal liquidation, treating the problem as a stochastic dynamic programming problem [7]. They employed a dynamic optimisation procedure which finds an explicit closed-form best execution strategy, minimising trading costs over a fixed period of time for large transactions. Almgren and Chriss extended the work of [7] to allow for risk aversion in their framework [1]. They argue that incorporating the uncertainty of execution of an optimal solution is consistent with a trader's utility function. In particular, they employ a price process which permits linear permanent and temporary price impact functions to construct an efficient frontier of optimal execution. They define a trading strategy as being efficient if there is no strategy which has lower execution cost variance for the same or lower level of expected execution cost.

The exposition of their solution is as follows: They assume that the security price evolves according to a discrete arithmetic random walk:

    Sk = Sk−1 + σ τ^(1/2) ξk − τ g(nk/τ),

where:

    Sk = price at time k,
    σ = volatility of the security,
    τ = length of the discrete time interval,
    ξk = draws from independent random variables,
    nk = volume traded at time k and
    g(·) = permanent price impact.

Here, permanent price impact refers to changes in the equilibrium price as a direct function of our trading, which persists for at least the remainder of the liquidation horizon. Temporary price impact refers to adverse deviations as a result of absorbing available liquidity supply, but where the impact dissipates by the next trading period due to the resilience of the order book. Almgren and Chriss introduce a temporary price impact function h(v) to their model, where h(v) causes a temporary adverse move in the share price as a function of our trading rate v [1]. Given this addition, the actual security transaction price at time k is given by:

    S̃k = Sk−1 − h(nk/τ).

Assuming a sell program, we can then define the total trading revenue as:

    Σ_{k=1}^{N} nk S̃k = X S0 + Σ_{k=1}^{N} (σ τ^(1/2) ξk − τ g(nk/τ)) xk − Σ_{k=1}^{N} nk h(nk/τ),

where xk = X − Σ_{j=1}^{k} nj = Σ_{j=k+1}^{N} nj for k = 0, 1, ..., N.

The total cost of trading is thus given by x = X S0 − Σ_{k} nk S̃k, i.e. the difference between the target revenue value and the total actual revenue from the execution. This definition refers to Perold's implementation shortfall measure (see [21]), and serves as the primary transaction cost metric which is minimised in order to maximise trading revenue. Since implementation shortfall is a random variable, Almgren and Chriss compute the following:

    E(x) := Σ_{k=1}^{N} τ xk g(nk/τ) + Σ_{k=1}^{N} nk h(nk/τ)

and

    Var(x) := σ² τ Σ_{k=1}^{N} xk².

The distribution of implementation shortfall is Gaussian if the ξk are Gaussian.
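As a concrete illustration of these two quantities, the short Python sketch below (not part of the original paper) evaluates E(x) and Var(x) for an arbitrary volume trajectory, assuming the linear impact forms g(v) = ρv and h(v) = ηv implied by the permanent and temporary impact parameters used later in the text; all numerical values are placeholders rather than calibrated inputs.

import numpy as np

def ac_cost_moments(n, X, sigma, tau, eta, rho):
    """E(x) and Var(x) for child-order volumes n[0..N-1] summing to X,
    under linear permanent impact g(v) = rho*v and temporary impact h(v) = eta*v."""
    n = np.asarray(n, dtype=float)
    x = X - np.cumsum(n)            # shares still to trade after each period
    rates = n / tau                 # trading rate in each period
    exp_cost = np.sum(tau * x * rho * rates) + np.sum(n * eta * rates)
    var_cost = sigma**2 * tau * np.sum(x**2)
    return exp_cost, var_cost

# Example: an even schedule for 100 000 shares over 12 periods (placeholder parameters).
E, V = ac_cost_moments(n=[100_000 / 12] * 12, X=100_000,
                       sigma=0.02, tau=1.0, eta=2.5e-6, rho=2.5e-7)
print(E, V)

Any candidate schedule, including the modified trajectories discussed later, can be scored with the same two moments.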
Given the overall goal of minimising execution costs and the variance of execution costs, they specify their objective function as:

    min_x { E(x) + λ Var(x) },

where:

    x = implementation shortfall,
    λ = level of risk aversion.

The intuition of this objective function can be thought of as follows: Consider a stock which exhibits high price volatility and thus a high risk of price movement away from the reference price. A risk averse trader would prefer to trade a large portion of the volume immediately, causing a (known) price impact, rather than risk trading in small increments at successively adverse prices. Alternatively, if the price is expected to be stable over the liquidation horizon, the trader would rather split the trade into smaller sizes to avoid price impact. This trade-off between speed of execution and risk of price movement is what governs the shape of the resulting trade trajectory in the AC framework.

A detailed derivation of the general solution can be found in [1]. Here, we state the general solution:

    xj = [ sinh(κ(T − tj)) / sinh(κT) ] X    for j = 0, ..., N.

The associated trade list is:

    nj = [ 2 sinh(κτ/2) / sinh(κT) ] cosh(κ(T − t_{j−1/2})) X    for j = 1, ..., N,

where:

    κ = (1/τ) cosh⁻¹( (τ²/2) κ̃² + 1 ),
    κ̃² = λσ² / ( η (1 − ρτ/(2η)) ),
    η = temporary price impact parameter,
    ρ = permanent price impact parameter,
    τ = length of the discrete time period.

This implies that for a program of selling an initially long position, the solution decreases monotonically from its initial value to zero at a rate determined by the parameter κ. If trading intervals are short, κ² is essentially the ratio of the product of volatility and risk-intolerance to the temporary transaction cost parameter. We note here that a larger value of κ implies a more rapid trading program, again conceptually confirming the propositions of [17] that an intolerance for execution risk leads to a larger concentration of quantity traded early in the trading program. Another consequence of this analysis is that different sized baskets of the same securities will be liquidated in the same manner, barring scale differences and provided the risk aversion parameter λ is held constant. This may be counter-intuitive, since one would expect larger baskets to be effectively less liquid, and thus follow a less rapid trading program to minimise price impact costs.
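To make the closed form concrete, the following Python sketch (an illustration added here, not the authors' code) produces the holdings trajectory xj and the trade list nj; the impact and risk-aversion values are arbitrary placeholders rather than calibrated parameters.

import numpy as np

def ac_trade_list(X, N, T, sigma, eta, rho, lam):
    """Almgren-Chriss holdings x_j and trade list n_j under linear impact."""
    tau = T / N
    kappa_tilde_sq = lam * sigma**2 / (eta * (1 - rho * tau / (2 * eta)))
    kappa = np.arccosh(kappa_tilde_sq * tau**2 / 2 + 1) / tau
    t = tau * np.arange(N + 1)                                   # t_0, ..., t_N
    x = X * np.sinh(kappa * (T - t)) / np.sinh(kappa * T)        # holdings
    t_mid = tau * (np.arange(1, N + 1) - 0.5)                    # t_{j-1/2}
    n = (2 * np.sinh(0.5 * kappa * tau) / np.sinh(kappa * T)
         * np.cosh(kappa * (T - t_mid)) * X)                     # trade per period
    return x, n

x, n = ac_trade_list(X=100_000, N=12, T=1.0, sigma=0.02,
                     eta=2.5e-6, rho=2.5e-7, lam=0.01)
print(n.sum())   # sums (numerically) to X

Increasing λ or σ increases κ and front-loads the schedule, which is the behaviour described above.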
It should be noted that the AC solution yields a suggested volume trajectory over the liquidation horizon, however there is no discussion in [1] as to the prescribed order type to execute the trade list. We have assumed that the trade list can be executed as a series of market orders. Given that this implies we are always crossing the spread, one needs to consider that traversing an order book with thin volumes and widely-spaced prices could have a significant transaction cost impact. We thus consider a reinforcement learning technique which learns when and how much to cross the spread, based on the current order book dynamics.

The general solution outlined above assumes linear price impact functions, however the model was extended by Almgren in [2] to account for non-linear price impact. This extended model can be considered as an alternative base model in future research.

III. A REINFORCEMENT LEARNING APPROACH

The majority of reinforcement learning research is based on a formalism of Markov Decision Processes (MDPs) [4]. In this context, reinforcement learning is a technique used to numerically solve for a calibrated policy mapping states to optimal or near-optimal actions. It is a framework within which a learning agent repeatedly observes the state of its environment, and then performs a chosen action to service some ultimate goal. Performance of the action has an immediate numeric reward or penalty and changes the state of the environment [10].

The problem of solving for an optimal policy mapping states to actions is well-known in stochastic control theory, with a significant contribution by Bellman [5]. Bellman showed that the computational burden of an MDP can be significantly reduced using what is now known as dynamic programming. It was however recognised that two significant drawbacks exist for classical dynamic programming: Firstly, it assumes that a complete, known model of the environment exists, which is often not realistically obtainable. Secondly, problems rapidly become computationally intractable as the number of state variables increases, and hence, the size of the state space for which the value function must be computed increases. This problem is referred to as the curse of dimensionality [25].

Reinforcement learning offers two advantages over classical dynamic programming: Firstly, agents learn online and continually adapt while performing the given task. Secondly, the methods can employ function approximation algorithms to represent their knowledge. This allows them to generalize across the state space so that the learning time scales much better [12]. Reinforcement learning algorithms do not require knowledge about the exact model governing an MDP and thus can be applied to MDPs where exact methods become infeasible.

Although a number of implementations of reinforcement learning exist, we will focus on Q-learning. This is a model-free technique first introduced by [27], which can be used to find the optimal, or near-optimal, action-selection policy for a given MDP.
During Q-learning, an agent's learning takes place during sequential episodes. Consider a discrete finite world where at each step n, an agent is able to register the current state xn ∈ X and can choose from a finite set of actions an ∈ A. The agent then receives a probabilistic reward rn, whose mean value Rxn(an) depends only on the current state and action. According to [27], the state of the world changes probabilistically to yn according to:

    Prob(yn = y | xn, an) = Pxn,y(an).

The agent is then tasked to learn the optimal policy mapping states to actions, i.e. one which maximises total discounted expected reward. Under some policy mapping π and discount rate γ (0 < γ < 1), the value of state x is given by:

    V^π(x) = Rx(π(x)) + γ Σ_y Pxy(π(x)) V^π(y).

According to [6] and [23], the theory of dynamic programming says there is at least one optimal stationary policy π* such that

    V*(x) = V^{π*}(x) = max_a { Rx(a) + γ Σ_y Pxy(a) V^{π*}(y) }.

We also define Q^π(x, a) as the expected discounted reward from choosing action a in state x, and then following policy π thereafter, i.e.

    Q^π(x, a) = Rx(a) + γ Σ_y Pxy(π(x)) V^π(y).

The task of the Q-learning agent is to determine V*, π* and Q^{π*} where Pxy(a) is unknown, using a combination of exploration and exploitation techniques over the given domain. It can be shown that V*(x) = max_a Q*(x, a) and that an optimal policy can be formed such that π*(x) = a*. It thus follows that if the agent can find the optimal Q-values, the optimal action can be inferred for a given state x. It is shown in [10] that an agent can learn Q-values via experiential learning, which takes place during sequential episodes. In the nth episode, the agent:

• observes its current state xn,
• selects and performs an action an,
• observes the subsequent state yn as a result of performing action an,
• receives an immediate reward rn and
• uses a learning factor αn, which decreases gradually over time.

Q is updated as follows:

    Q(xn, an) = Q(xn, an) + αn ( rn + γ max_b Q(xn+1, b) − Q(xn, an) ).

Provided each state-action pair is visited infinitely often, [10] show that Q converges to Q* for any exploration policy. Singh et al. provide guidance as to specific exploration policies for asymptotic convergence to optimal actions and asymptotic exploitation under the Q-learning algorithm, which we incorporate in our analysis [24].

A. Implications of finite-horizon MDP

The above exposition presents an algorithm which guarantees optimal policy convergence of a stationary infinite-horizon MDP. The stationarity assumption, and hence validity of the above result, needs to be questioned when considering a finite-horizon MDP, since states, actions and policies are time-dependent [22]. In particular, we are considering a discrete period, finite trading horizon, which guarantees execution of a given volume of shares. At each decision step in the trading horizon, it is possible to have different state spaces, actions, transition probabilities and reward values. Hence the above model needs revision. Garcia and Ndiaye consider this problem and provide a model specification which suits this purpose [14]. They propose a slight modification to the Bellman optimality equations shown above:

    Vn*(x) = max_{an} { Rx(an) + γ Σ_y P^n_{xy}(an) V*_{n+1}(y) }

for all x ∈ Sn, y ∈ Sn+1, an ∈ An, n ∈ {0, 1, ..., N} and V*_{N+1}(x) = 0. This optimality equation has a single solution V* = {V1*, V2*, ..., VN*} that can be obtained using dynamic programming techniques. The equivalent discounted expected reward specification thus becomes:

    Q^π_n(x, an) = Rx(an) + γ Σ_y P^n_{xy}(π(x)) V^π_{n+1}(y).

They propose a novel transformation of an N-step non-stationary MDP into an infinite-horizon process ([14]). This is achieved by adding an artificial final reward-free absorbing state xabs, such that all actions aN ∈ AN lead to xabs with probability 1. Hence the revised Q-learning update equation becomes:

    Qn+1(xn, an) = Qn(xn, an) + αn(xn, an) Un,

where

    Un = rn + γ max_b Qn(yn, b) − Qn(xn, an)    if xn ∈ Si, i < N,
    Un = rn − Qn(xn, an)                        if xn ∈ SN,
    Un = 0                                      otherwise.

If xn ∉ SN, yn = xn+1, otherwise choose randomly in S1. If xn+1 ∈ Sj, select an+1 ∈ Aj. The learning rule for SN is thus equivalent to setting V*_{N+1}(xabs) = Q*_{N+1}(xabs, aj) = 0 for all aj ∈ AN+1.

Garcia and Ndiaye further show that the above specification (in the case where γ = 1) will converge to the optimal policy with probability one, provided that each state-action pair is visited infinitely often, Σ_n αn(x, a) = ∞ and Σ_n αn²(x, a) < ∞ [14].

B. Implementation for optimal liquidation

Given the above description, we are able to discuss our specific choices for state attributes, actions and rewards in the context of the optimal liquidation problem.
We need to consider a specification which adequately accounts for our state of execution and the current state of the limit order book, representing the opportunity set for our ultimate goal of executing a volume of shares over a fixed trading horizon.

1) States: We acknowledge that the complexity of the financial system cannot be distilled into a finite set of states and is not likely to evolve according to a Markov process. However, we conjecture that the essential features of the system can be sufficiently captured with some simplifying assumptions such that meaningful insights can still be inferred. For simplicity, we have chosen a look-up table representation of Q. Function approximation variants may be explored in future research for more complex system configurations. As described above, each state xn ∈ X represents a vector of observable attributes which describe the configuration of the system at time n. As in [20], we use Elapsed Time t and Remaining Inventory i as private attributes which capture our state of execution over a finite liquidation horizon T. Since our goal is to modify a given volume trajectory based on favourable market conditions, we include spread and volume as candidate market attributes. The intuition here is that the agent will learn to increase (decrease) trading activity when spreads are narrow (wide) and volumes are high (low). This would ensure that a more significant proportion of the total volume-to-trade would be secured at a favourable price and, similarly, less at an unfavourable price, ultimately reducing the post-trade implementation shortfall. Given the look-up table implementation, we have simplified each of the state attributes as follows:

• T = Trading Horizon, V = Total Volume-to-Trade,
• H = Hour of day when trading will begin,
• I = Number of remaining inventory states,
• B = Number of spread states,
• W = Number of volume states,
• spn = %ile Spread of the nth tuple,
• vpn = %ile Bid/Ask Volume of the nth tuple,
• Elapsed Time: tn = 1, 2, 3, ..., T,
• Remaining Inventory: in = 1, 2, 3, ..., I,
• Spread State: sn = 1 if 0 < spn ≤ 1/B, 2 if 1/B < spn ≤ 2/B, ..., B if (B−1)/B < spn ≤ 1,
• Volume State: vn = 1 if 0 < vpn ≤ 1/W, 2 if 1/W < vpn ≤ 2/W, ..., W if (W−1)/W < vpn ≤ 1.

Thus, for the nth episode, the state attributes can be summarised as the following tuple:

    zn = < tn, in, sn, vn >.

For spn and vpn, we first construct a historical distribution of spreads and volumes based on the training set. It has been empirically observed that major equity markets exhibit U-shaped trading intensity throughout the day, i.e. more activity in mornings and closer to closing auction. A further discussion of these insights can be found in [3] and [8]. In fact, [13] empirically demonstrates that South African stocks exhibit similar characteristics. We thus consider simulations where training volume/spread tuples are H-hour dependent, such that the optimal policy is further refined with respect to trading time (H).
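As an illustration of this discretisation (a sketch added here with hypothetical function and variable names, not the authors' code), the spread and volume percentiles can be computed against the training-set distributions and binned into B and W states as follows:

import numpy as np

def make_state(t, i, spread, volume, hist_spreads, hist_volumes, B=5, W=5):
    """Map raw observations to the discrete tuple <t, i, s, v>."""
    # Percentile of the current observation within the training-set distribution.
    sp = np.mean(np.asarray(hist_spreads) <= spread)   # in [0, 1]
    vp = np.mean(np.asarray(hist_volumes) <= volume)
    # Bin the percentiles into B spread states and W volume states (1-indexed).
    s = min(int(np.ceil(sp * B)), B) if sp > 0 else 1
    v = min(int(np.ceil(vp * W)), W) if vp > 0 else 1
    return (t, i, s, v)

# Example with a synthetic training distribution.
rng = np.random.default_rng(0)
hist_sp, hist_vol = rng.exponential(10, 5_000), rng.lognormal(8, 1, 5_000)
print(make_state(t=3, i=4, spread=12.0, volume=2_500.0,
                 hist_spreads=hist_sp, hist_volumes=hist_vol))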
2) Actions: Based on the Almgren-Chriss (AC) model specified above, we calculate the AC volume trajectory (AC1, AC2, ..., ACT) for a given volume-to-trade (V), fixed time horizon (T) and discrete trading periods (t = 1, 2, ..., T). ACt represents the proportion of V to trade in period t, such that Σ_{t=1}^{T} ACt = V. For the purposes of this study, we assume that each child order is executed as a market order based on the prevailing limit order book structure. We would like our learning agent to modify the AC volume trajectory based on prevailing volume and spread characteristics in the market. As such, the possible actions for our agent include:

• βj = Proportion of ACt to trade,
• βLB = Lower bound of volume proportion to trade,
• βUB = Upper bound of volume proportion to trade,
• Action: ajt = βj ACt, where βLB ≤ βj ≤ βUB and βj = βj−1 + βincr.

The aim here is to train the learning agent to trade a higher (lower) proportion of the overall volume when conditions are favourable (unfavourable), whilst still broadly preserving the volume trajectory suggested by the AC model. To ensure that the total volume-to-trade is executed over the given time horizon, we execute any residual volume at the end of the trading period with a market order.
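A minimal sketch of this action grid (an illustration added here; the specific β bounds and increment are taken from the parameter choices listed in Section IV and should be read as an assumption at this point):

import numpy as np

# Each action scales the AC-suggested child order AC_t by a coefficient beta.
beta_LB, beta_UB, beta_incr = 0.0, 2.0, 0.25
betas = np.arange(beta_LB, beta_UB + beta_incr, beta_incr)   # 9 actions: 0, 0.25, ..., 2.0

def child_order(ac_t, action_index):
    """Volume to submit as a market order in period t for a given action index."""
    return betas[action_index] * ac_t

def residual(total_traded, V):
    """Any untraded volume is executed with a market order at the horizon end."""
    return max(V - total_traded, 0)

print([child_order(12_500, a) for a in range(len(betas))])   # one slice of a 100 000-share program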
3) Rewards: Each of the actions described above results in a volume to execute with a market order, based on the prevailing structure of the limit order book. The size of the child order volume will determine how deep we will need to traverse the order book. For example, suppose we have a BUY order with a volume-to-trade of 20000, split into child orders of 10000 in period t and 10000 in period t + 1. If the structure of the limit order book at time t is as follows:

• Level-1 Ask Price = 100.00; Level-1 Ask Volume = 3000
• Level-2 Ask Price = 100.50; Level-2 Ask Volume = 4000
• Level-3 Ask Price = 102.30; Level-3 Ask Volume = 5000
• Level-4 Ask Price = 103.00; Level-4 Ask Volume = 6000
• Level-5 Ask Price = 105.50; Level-5 Ask Volume = 2000

the volume-weighted execution price will be:

    ((3000 × 100) + (4000 × 100.5) + (3000 × 102.3)) / 10000 = 100.9.

Trading more (less) given this limit order book structure will result in a higher (lower) volume-weighted execution price. If the following trading period t + 1 has the following structure:

• Level-1 Ask Price = 99.80; Level-1 Ask Volume = 6000
• Level-2 Ask Price = 99.90; Level-2 Ask Volume = 2000
• Level-3 Ask Price = 101.30; Level-3 Ask Volume = 7000
• Level-4 Ask Price = 107.00; Level-4 Ask Volume = 3000
• Level-5 Ask Price = 108.50; Level-5 Ask Volume = 1000

the volume-weighted execution price for the second child order will be:

    ((6000 × 99.8) + (2000 × 99.9) + (2000 × 101.3)) / 10000 = 100.12.

If the reference price of the stock at t = 0 is 99.5, then the implementation shortfall from this trade is:

    ((20000 × 99.5) − (10000 × 100.9 + 10000 × 100.12)) / (20000 × 99.5) = −0.0101 = −101 bps.

Since the conditions of the limit order book were more favourable for BUY orders in period t + 1, if we had modified the child orders to, say, 8000 in period t and 12000 in period t + 1, the resulting implementation shortfall would be:

    ((20000 × 99.5) − (8000 × 100.54 + 12000 × 100.32)) / (20000 × 99.5) = −0.0091 = −91 bps.

In this example, increasing the child order volume when Ask Prices are lower and Level-1 Volumes are higher decreases the overall cost of the trade. It is for this reason that implementation shortfall is a natural candidate for the rewards matrix in our reinforcement learning system. Each action implies a child order volume, which has an associated volume-weighted execution price. The agent will learn the consequences of each action over the trading horizon, with the ultimate goal of minimising the total trade's implementation shortfall.
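The reward computation just described can be summarised in a short sketch (added here for illustration, not taken from the paper): walk the ask side of the book for a BUY child order, compute the volume-weighted execution price, and express the total cost as implementation shortfall against the arrival price. The book levels below reproduce the period-t example.

def walk_book(ask_levels, volume):
    """Volume-weighted execution price for a BUY market order of `volume`
    against ask levels given as (price, size) pairs, best price first."""
    remaining, cost = volume, 0.0
    for price, size in ask_levels:
        take = min(remaining, size)
        cost += take * price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("order exceeds displayed depth")
    return cost / volume

def implementation_shortfall(ref_price, fills):
    """fills: list of (volume, execution price); returns IS as a fraction of the
    arrival-price value (negative values are a cost for a BUY program)."""
    total_vol = sum(v for v, _ in fills)
    paid = sum(v * p for v, p in fills)
    return (total_vol * ref_price - paid) / (total_vol * ref_price)

book_t = [(100.00, 3000), (100.50, 4000), (102.30, 5000),
          (103.00, 6000), (105.50, 2000)]
vwap_t = walk_book(book_t, 10_000)                  # 100.89, quoted as 100.9 above
print(implementation_shortfall(99.5, [(10_000, vwap_t), (10_000, 100.12)]))
# approximately -0.0101, i.e. about -101 bps, as in the worked example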
4) Algorithm and Methodology: Given the above specification, we followed the following steps to generate our results:

• Specify a stock (S), volume-to-trade (V), time horizon (T), and trading datetime (from which the trading hour H is inferred),
• Partition the dataset into independent training sets and testing sets to generate results (the training set always pre-dates the testing set),
• Calibrate the parameters for the Almgren-Chriss (AC) volume trajectory (σ, η) using the historical training set; set ρ = 0, since we assume the order book is resilient to trading activity (see below),
• Generate the AC volume trajectory (AC1, ..., ACT),
• Train the Q-matrix based on the state-action tuples generated by the training set,
• Execute the AC volume trajectory at the specified trading datetime (H) on each day in the testing set, recording the implementation shortfall,
• Use the trained Q-matrix to modify the AC trajectory as we execute V at the specified trading datetime, recording the implementation shortfall and
• Determine whether the reinforcement learning (RL) model improved/worsened realised implementation shortfall.

In order to train the Q-matrix to learn the optimal policy mapping, we need to traverse the training data set (T × I × A) times, where A is the total number of possible actions. The following pseudo-code illustrates the algorithm used to train the Q-matrix:

Optimal_strategy<V,T,I,A>
  For (Episode 1 to N) {
    Record reference price at t=0
    For t = T to 1 {
      For i = 1 to I
        Calculate episode's STATE attributes <s,v>
        For a = 1 to A {
          Set x = <t,i,s,v>
          Determine the action volume a
          Calculate IS from trade, R(x,a)
          Simulate transition x to y
          Look up max_p Q(y,p)
          Update Q(x,a) = Q(x,a) + alpha*U }}}
  Select the lowest-IS action max_p Q(y,p) for optimal policy
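A rough Python transcription of this loop is sketched below (an illustration, not the authors' implementation). The Q-matrix is a look-up table keyed on the tuple <t, i, s, v> and the action index, the update follows the finite-horizon rule of Section III-A with the last trading period treated as absorbing (γ = 1), and the two stub functions stand in for replay of the recorded training set; their names and the random values they return are purely hypothetical.

import numpy as np
from collections import defaultdict

T, I, A = 4, 5, 9                  # trading periods, inventory states, actions
alpha, gamma = 1.0, 1.0            # learning parameters (gamma = 1, per Section IV)
Q = defaultdict(float)             # look-up table over ((t, i, s, v), action)
rng = np.random.default_rng(1)

def episode_state(t, i):
    """Stub: would return the <s, v> spread/volume states observed in the
    training set for this tuple visit; random here for illustration only."""
    return int(rng.integers(1, 6)), int(rng.integers(1, 6))

def episode_reward(state, action):
    """Stub: would return the implementation shortfall of the child order
    implied by `action` (higher, i.e. less negative, is better)."""
    return float(rng.normal(-0.001, 0.0005))

for episode in range(200):
    for t in range(T, 0, -1):                  # traverse the horizon backwards
        for i in range(1, I + 1):
            s, v = episode_state(t, i)
            x = (t, i, s, v)
            for a in range(A):
                r = episode_reward(x, a)
                if t == T:                      # final period: absorbing next state
                    U = r - Q[(x, a)]
                else:                           # bootstrap from the next period
                    y = (t + 1, i, *episode_state(t + 1, i))
                    U = r + gamma * max(Q[(y, b)] for b in range(A)) - Q[(x, a)]
                Q[(x, a)] += alpha * U

# Greedy policy implied by the trained Q-matrix.
policy = {x: max(range(A), key=lambda a: Q[(x, a)]) for (x, _a) in list(Q)}

In the actual system the stubs would be driven by the recorded 5-minute order book states, and a separate Q-matrix would be trained for each stock, volume-to-trade, horizon and trading hour H.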
An important assumption in this model specification is that our trading activity does not affect the market attributes. Although temporary price impact is incorporated into execution prices via depth participation of the market order in the prevailing limit order book, we assume the limit order book is resilient with respect to our trading activity. Market resiliency can be thought of as the number of quote updates before the market's spread reverts to its competitive level. Degryse et al. showed that a pure limit order book market (Euronext Paris) is fairly resilient with respect to most order sizes, taking on average 50 quote updates for the spread to normalise following the most aggressive orders [11]. Since we are using 5-minute trading intervals and small trade sizes, we will assume that any permanent price impact effects dissipate by the next trading period. A preliminary analysis of South African stocks revealed that there were on average over 1000 quote updates during the 5-minute trading intervals and the pre-trade order book equilibrium is restored within 2 minutes for large trades. The validity of this assumption however will be tested in future research, as well as other model specifications explored which incorporate permanent effects in the system configuration.

IV. DATA AND RESULTS

A. Data used

For this study, we collected 12 months of market depth tick data (Jan-2012 to Dec-2012) from the Thomson Reuters Tick History (TRTH) database, representing a universe of 166 stocks that make up the South African local benchmark index (ALSI) as at 31-Dec-2012. This includes 5 levels of order book depth (bid/ask prices and volumes) at each tick. The raw data was imported into a MongoDB database and aggregated into 5-minute intervals showing average level prices and volumes, which was used as the basis for the analysis.
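As an illustration of this preprocessing step (not the authors' pipeline; the column names are hypothetical stand-ins for the TRTH fields), raw depth ticks can be aggregated into 5-minute bars of average level prices and volumes as follows:

import pandas as pd

ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2012-01-03 09:00:01", "2012-01-03 09:02:40",
                                 "2012-01-03 09:06:15"]),
    "l1_bid": [100.0, 100.1, 100.2], "l1_ask": [100.3, 100.4, 100.5],
    "l1_bid_vol": [3000, 2500, 2700], "l1_ask_vol": [2800, 2600, 3100],
})

bars = (ticks.set_index("timestamp")
             .resample("5min")
             .mean()              # average level prices and volumes per interval
             .dropna())
print(bars)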
B. Stocks, parameters and assumptions

To test the robustness of the proposed model in the South African (SA) equity market we tested a variety of stock types, trade sizes and model parameters. Due to space constraints, we will only show a representative set of results here that illustrate the insights gained from the analysis. The following summarises the stocks, parameters and assumptions used for the results that follow:

• Stocks
  – SBK (Large Cap, Financials)
  – AGL (Large Cap, Resources)
  – SAB (Large Cap, Industrials)
• Model Parameters
  – βLB: 0, βUB: 2, βincr: 0.25
  – λ: 0.01, τ: 5-min, α0: 1, γ: 1
  – V: 100 000, 1 000 000
  – T: 4 (20-min), 8 (40-min), 12 (60-min)
  – H: 9, 10, 11, 12, 13, 14, 15, 16
  – I, B, W: 5, 10
  – Buy/Sell: BUY
• Assumptions
  – Max volume participation rate in order book: 20%
  – Market is resilient to our trading activity

Note, we set γ = 1 since [14] states that this is a necessary condition to ensure convergence to the optimal policy with probability one for a finite-horizon MDP. We also choose an arbitrary value for λ, although sensitivities to these parameters will be explored in future work. AC parameters are calibrated and the Q-matrix trained over a 6-month training set from 1-Jan-2012 to 30-Jun-2012. The resultant AC and RL trading trajectories are then executed on each day at the specified trading time H in the testing set from 1-Jul-2012 to 20-Dec-2012. The implementation shortfall for both models is calculated and the difference recorded. This allows us to construct a distribution of implementation shortfall for each of the AC and RL models, and for all trading hours H = 9, 10, ..., 16.

C. Results

Table 1 shows the average % improvement in median implementation shortfall for the complete set of stocks and parameter values. These results suggest that the model is more effective for shorter trading horizons (T = 4), with an average improvement of up to 10.3% over the base AC model. This result may be biased due to the assumption of order book resilience. Indeed, the efficacy of the trained Q-matrix may be less reliable for stocks which exhibit slow order book resilience, since permanent price effects would affect the state space transitions. In future work, we plan to relax this order book resilience assumption and incorporate permanent effects into state transitions.

TABLE I: Average % improvement in median implementation shortfall for various parameter values, using AC and RL models. Training H-dependent.

    Parameters             Trading time (hour)                                              Average
    V        T   I,B,W      9      10     11     12     13     14     15     16
    100000   4    5        23.9   -1.4    4.7   13.4    1.8    3.3    1.8   35.1            10.3
    100000   8    5        25.3    4.3    8.3    2.3    1.4    9.9   -0.6   -1.9             6.1
    100000   12   5        32.7  -25.2    7.2   -2.7   -1.5    4.6    4.5   -3.3             2.1
    1000000  4    5        23.3   -1.3    4.8    9.3    1.9    3.5    1.8   35.0             9.8
    1000000  8    5        28.8    5.6    8.2    1.9    1.4    9.9   -0.3   -2.6             6.6
    1000000  12   5        33.1  -25.0    7.2   -4.0   -0.8    4.8    4.8    1.2             2.7
    100000   4    10       22.9    1.3    3.0    9.7    2.7    5.8    3.5  -26.1             2.8
    100000   8    10       26.0    4.3    6.7   -0.2    3.5    8.6    1.6   -3.1             5.9
    100000   12   10       27.8  -21.9    7.5   -4.1    0.6    1.8    6.2   -9.5             1.1
    1000000  4    10       22.6    1.4    3.1    9.3    2.5    6.0    3.6  -26.1             2.8
    1000000  8    10       26.3    5.0    7.2   -0.5    3.3    7.0    2.3   -1.8             6.1
    1000000  12   10       27.9  -24.3    8.3   -6.9    0.5    1.8    7.5   -3.3             1.4

Figure 1 illustrates the improvement in median post-trade implementation shortfall when executing the volume trajectories generated by each of the models, for each of the candidate stocks at the given trading times. In general, the RL model is able to improve (lower) ex-post implementation shortfall, however the improvement seems more significant for early morning/late afternoon trading hours. This could be due to the increased trading activity at these times, resulting in more state-action visits in the training set to refine the associated Q-matrix values. We also notice more dispersed performance between 10:00 and 11:00. This time period coincides with the UK market open, where global events may drive local trading activity and skew results, particularly since certain SA stocks are dual-listed on the London Stock Exchange (LSE). The improvement in implementation shortfall ranges from 15 bps (85.3%) for trading 1 000 000 of SBK between 16:00 and 17:00, to -7 bps (-83.4%) for trading 100 000 SAB between 16:00 and 17:00. Overall, the RL model is able to improve implementation shortfall by 4.8%.

[Fig. 1: Difference between median implementation shortfall generated using RL and AC models, with given parameters (I, B, W = 5). Training H-dependent.]

Figure 2 shows the % of correct actions implied by the Q-matrix, as it evolves through the training process after each tuple visit. Here, a correct action is defined as a reduction (addition) in the volume-to-trade based on the max Q-value action, in the case where spreads are above (below) the 50%ile and volumes are below (above) the 50%ile level. This coincides with the intuitive behaviour we would like the RL agent to learn. These results suggest that finer state granularity (I, B, W = 10) improves the overall accuracy of the learning agent, as demonstrated by the higher % correct actions achieved. All model configurations seem to converge to some stationary accuracy level after approximately 1000 tuple visits, suggesting that a shorter training period may yield similar results. We do however note that improving the % of correct actions by increasing the granularity of the state space does not necessarily translate into better model performance. This can be seen in Table 1, where the results where I, B, W = 10 do not show any significant improvement over those with I, B, W = 5. This suggests that the market dynamics may not be fully represented by volume and spread state attributes, and alternative state attributes should be explored in future work to improve ex-post model efficacy.

[Fig. 2: % correct actions implied by Q-matrix after each training set tuple. Training H-dependent.]

Table 2 shows the average standard deviation of the resultant implementation shortfall when using each of the AC and RL models. Since we have not explicitly accounted for variance of execution in the RL reward function, we see that the resultant trading trajectories generate a higher standard deviation compared to the base AC model. Thus, although the RL model provides a performance improvement over the AC model, this is achieved with a higher degree of execution risk, which may not be acceptable for the trader. We do note that the RL model exhibits comparable risk for T = 4, thus validating the use of the RL model to reliably improve IS over short trade horizons. A future refinement on the RL model should incorporate variance of execution, such that it is consistent with the AC objective function. In this way, a true comparison of the techniques can be done, and one can conclude as to whether the RL model indeed outperforms the AC model at a statistically significant level.

TABLE II: Standard deviation (%) of implementation shortfall when using AC vs RL models.

    V        T   I,B,W    AC      RL     % improvement in IS
    100000   4    5       0.13    0.17   10.3
    100000   8    5       0.14    0.23    6.1
    100000   12   5       0.14    0.26    2.1
    1000000  4    5       0.13    0.17    9.8
    1000000  8    5       0.14    0.23    6.6
    1000000  12   5       0.14    0.26    2.7
    100000   4    10      0.13    0.17    2.8
    100000   8    10      0.14    0.22    5.9
    100000   12   10      0.14    0.26    1.1
    1000000  4    10      0.13    0.17    2.8
    1000000  8    10      0.14    0.22    6.1
    1000000  12   10      0.14    0.26    1.4
    Average               0.14    0.22    4.8
V. CONCLUSION

In this paper, we introduced reinforcement learning as a candidate machine learning technique to enhance a given optimal liquidation volume trajectory. Nevmyvaka, Feng and Kearns showed that reinforcement learning delivers promising results where the learning agent is trained to choose the optimal limit order price at which to place the remaining inventory, at discrete periods over a fixed liquidation horizon [20]. Here, we show that reinforcement learning can also be used successfully to modify a given volume trajectory based on market attributes, executed via a sequence of market orders based on the prevailing limit order book.

Specifically, we showed that a simple look-up table Q-learning technique can be used to train a learning agent to modify a static Almgren-Chriss volume trajectory based on prevailing spread and volume dynamics, assuming order book resiliency. Using a sample of stocks and trade sizes in the South African equity market, we were able to reliably improve post-trade implementation shortfall by up to 10.3% on average for short trade horizons, demonstrating promising potential applications of this technique. Further investigations include incorporating variance of execution in the RL reward function, relaxing the order book resiliency assumption and alternative state attributes to govern market dynamics.

ACKNOWLEDGMENT

The authors thank Dr Nicholas Westray for his contribution in the initiation of this work, as well as the insightful comments from the anonymous reviewers. This work is based on the research supported in part by the National Research Foundation of South Africa (Grant Number CPRR 70643).

REFERENCES

[1] R. Almgren, N. Chriss. Optimal execution of portfolio transactions, Journal of Risk, 3, pp. 5-40, 2000.
[2] R. Almgren. Optimal execution with nonlinear impact functions and trading-enhanced risk, Applied Mathematical Finance, 10(1), pp. 1-18, 2003.
[3] A. Admati, P. Pfleiderer. A theory of intraday patterns: volume and price variability, Review of Financial Studies, 1(1), pp. 3-40, 1988.
[4] A. Barto, S. Mahadevan. Recent advances in hierarchical reinforcement learning, Discrete Event Dynamic Systems, 13(4), pp. 341-379, 2003.
[5] R. Bellman. The theory of dynamic programming, Bulletin of the American Mathematical Society, 1954.
[6] R. Bellman, S. Dreyfus. Applied dynamic programming, Princeton University Press, Princeton, New Jersey, 1962.
[7] D. Bertsimas, A. Lo. Optimal control of execution costs, Journal of Financial Markets, 1(1), pp. 1-50, 1998.
[8] W. Brock, A. Kleidon. Periodic market closure and trading volume: a model of intraday bids and asks, Journal of Economic Dynamics and Control, 16(3), pp. 451-489, 1992.
[9] L. Chan, J. Lakonishok. The behavior of stock prices around institutional trades, Journal of Finance, 50(4), pp. 1147-1174, 1995.
[10] P. Dayan, C. Watkins. Reinforcement learning, Encyclopedia of Cognitive Science, 2001.
[11] H. Degryse, F. de Jong, M. Ravenswaaij, G. Wuyts. Aggressive orders and the resiliency of a limit order market, Review of Finance, 9(2), pp. 201-242, 2003.
[12] T. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition, Abstraction, Reformulation and Approximation, pp. 26-44, 2000.
[13] B. Du Preez. JSE Market Microstructure, MSc Dissertation, University of the Witwatersrand, School of Computational and Applied Mathematics, 2013.
[14] F. Garcia, S. Ndiaye. A learning rate analysis of reinforcement learning algorithms in finite-horizon, Proceedings of the 15th International Conference on Machine Learning, 1998.
[15] A. Gosavi. Reinforcement learning: a tutorial survey and recent advances, INFORMS Journal on Computing, 21(2), pp. 178-192, 2009.
[16] R. Holthausen, R. Leftwich, D. Mayers. Large-block transactions, the speed of response and temporary and permanent stock-price effects, Journal of Financial Economics, 26(1), pp. 71-95, 1990.
[17] G. Huberman, W. Stanzl. Optimal liquidity trading, Yale School of Management, Working Paper, 2001.
[18] L. Kaelbling, M. Littman, A. Moore. Reinforcement learning: a survey, Journal of Artificial Intelligence Research, 4, pp. 237-285, 1996.
[19] J. McCulloch. Relative volume as a doubly stochastic binomial point process, Quantitative Finance, 7(1), pp. 55-62, 2007.
[20] Y. Nevmyvaka, Y. Feng, M. Kearns. Reinforcement learning for optimal trade execution, Proceedings of the 23rd International Conference on Machine Learning, pp. 673-680, 2006.
[21] A. Perold. The implementation shortfall: paper vs reality, Journal of Portfolio Management, 14(3), pp. 4-9, 1988.
[22] M. Puterman. Markov Decision Processes, John Wiley and Sons, New York, 1994.
[23] S. Ross. Introduction to stochastic dynamic programming, Academic Press, New York, 1983.
[24] S. Singh, T. Jaakkola, M. Littman, C. Szepesvari. Convergence results for single-step on-policy reinforcement learning algorithms, Machine Learning, 38(3), pp. 287-308, 2000.
[25] R. Sutton, A. Barto. Reinforcement learning, Cambridge, MA: MIT Press, 1998.
[26] D. Vayanos. Strategic trading in a dynamic noisy market, Journal of Finance, 56(1), pp. 131-171, 2001.
[27] C. Watkins. Learning from delayed rewards, PhD Thesis, Cambridge University, 1989.
