A Reinforcement Learning Extension To The Almgren-Chriss Framework For Optimal Trade Execution
Abstract—Reinforcement learning is explored as a candidate machine learning technique to enhance existing analytical solutions for optimal trade execution with elements from the market microstructure. Given a volume-to-trade, fixed time horizon and discrete trading periods, the aim is to adapt a given volume trajectory such that it is dynamic with respect to favourable/unfavourable conditions during real-time execution, thereby improving the overall cost of trading. We consider the standard Almgren-Chriss model with linear price impact as a candidate base model. This model is popular amongst sell-side institutions as a basis for arrival price benchmark execution algorithms. By training a learning agent to modify a volume trajectory based on the market's prevailing spread and volume dynamics, we are able to improve post-trade implementation shortfall by up to 10.3% on average compared to the base model, based on a sample of stocks and trade sizes in the South African equity market.
I. INTRODUCTION
A critical problem faced by participants in investment markets is the so-called optimal liquidation problem, viz. how best to trade a given block of shares at minimal cost. Here, cost can be interpreted as in Perold's implementation shortfall ([21]), i.e. adverse deviations of actual transaction prices from an arrival price baseline when the investment decision is made. Alternatively, cost can be measured as a deviation from the market volume-weighted average price (VWAP) over the trading period, effectively comparing the specific trader's performance to that of the average market trader. In each case, the primary problem faced by the trader/execution algorithm is the compromise between price impact and opportunity cost when executing an order.
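To make the two cost definitions concrete, consider the small worked example below; for a sell order, each benchmark compares the achieved average execution price against a reference price. All prices and sizes here are hypothetical.

```python
import numpy as np

# Hypothetical fills for a sell order of 10,000 shares (all values assumed)
fill_px = np.array([100.00, 99.95, 99.90])    # child order execution prices
fill_sz = np.array([4000, 3000, 3000])        # child order sizes
arrival = 100.05       # prevailing price when the investment decision was made
market_vwap = 99.97    # market VWAP over the same trading period

exec_avg = float(np.dot(fill_px, fill_sz) / fill_sz.sum())
shortfall_bps = (arrival - exec_avg) / arrival * 1e4              # vs arrival price
vwap_slippage_bps = (market_vwap - exec_avg) / market_vwap * 1e4  # vs market VWAP
print(exec_avg, shortfall_bps, vwap_slippage_bps)
```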
Price impact here refers to adverse price moves due to a large trade size absorbing liquidity supply at available levels in the order book (temporary price impact). As market participants begin to detect the total volume being traded, they may also adjust their bids/offers downward/upward to anticipate order matching (permanent price impact) [16]. To avoid price impact, traders may split a large order into smaller child orders over a longer period. However, there may be exogenous market forces which result in execution at adverse prices (opportunity cost). This behaviour of institutional investors was empirically demonstrated in [9], where they observed that typical trades of large investment management firms are almost always broken up into smaller trades and executed over the course of a day or several days.

Several authors have studied the problem of optimal liquidation, with a strong bias towards stochastic dynamic programming solutions; see [7], [17], [26], [1] as examples. In this paper, we consider the application of a machine learning technique to the problem of optimal liquidation. Specifically, we consider a case where the popular Almgren-Chriss closed-form solution for a trading trajectory (see [1]) can be enhanced by exploiting microstructure attributes over the trading horizon using a reinforcement learning technique.

Reinforcement learning in this context is essentially a calibrated policy mapping states to optimal actions. Each state is a vector of observable attributes which describe the current configuration of the system. It provides a simple, model-free mechanism for agents to learn how to act optimally in a controlled Markovian domain, where the quality of the action chosen is successively improved for a given state [10]. For the optimal liquidation problem, the algorithm examines the salient features of the current order book and current state of execution in order to decide which action (e.g. child order price or volume) to select to service the ultimate goal of minimising cost.
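As a deliberately minimal sketch of such a state-action mapping, consider tabular one-step Q-learning, a standard way to successively improve action-quality estimates from experience. The state encoding, reward and parameter values below are illustrative assumptions, not the specification used in this paper (which appears in Section 3).

```python
import numpy as np
from collections import defaultdict

class QLearner:
    """Tabular one-step Q-learning: Q(s, a) is the learned 'quality' of
    taking action a in state s, improved successively from experience."""

    def __init__(self, n_actions, alpha=0.1, gamma=1.0, epsilon=0.1):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor (1.0 suits a finite horizon)
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(lambda: np.zeros(n_actions))

    def choose(self, state):
        # epsilon-greedy: mostly exploit current estimates, sometimes explore
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.q[state]))

    def update(self, state, action, reward, next_state, done):
        # Move Q(s, a) towards the observed reward plus the best next value
        target = reward if done else reward + self.gamma * np.max(self.q[next_state])
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

In the execution setting, a state could be a discretised tuple such as (elapsed time, remaining inventory, spread bucket, volume bucket), with the reward defined as the negative cost of the executed child order, so that maximising value minimises cost.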
The first documented large-scale empirical application of reinforcement learning algorithms to the problem of optimised trade execution in modern financial markets was conducted by [20]. They set up their problem as a minimisation of implementation shortfall for a buying/selling program over a fixed time horizon with discrete time periods. For actions, the agent could choose a price to repost a limit order for the remaining shares in each discrete period. State attributes included elapsed time, remaining inventory, current spread, immediate cost and signed volume. In their results, they found that their reinforcement learning algorithm improved the execution efficiency by 50% or more over traditional submit-and-leave or market order policies.
Instead of a pure reinforcement learning solution to the problem, as in [20], we propose a hybrid approach which enhances a given analytical solution with attributes from the market microstructure. Using the Almgren-Chriss (AC) model as a base, for a finite liquidation horizon with discrete trading periods, the algorithm determines the proportion of the AC-suggested trajectory to trade based on prevailing volume/spread attributes. One would expect, for example, that allowing the trajectory to be more aggressive when volumes are relatively high and spreads are tight may reduce the ultimate cost of the trade. In our implementation, a static volume trajectory is preserved for the duration of the trade; however, the proportion traded is dynamic with respect to market dynamics. As in [20], a market order is executed at the end of the trade horizon for the remaining volume, to ensure complete liquidation. An important consideration in our analysis is the specification of the problem as a finite-horizon Markov Decision Process (MDP) and the consequences for optimal policy convergence of the reinforcement learning algorithm. In [20], they use an approximation in their framework to address this issue by incorporating elapsed time as a state attribute; however, they do not explicitly discuss convergence. We will use the findings of [14] in our model specification and demonstrate near-optimal policy convergence of the finite-horizon MDP problem.
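To make the proposed mechanism concrete, the sketch below trades a given AC volume trajectory while letting a policy scale each child order according to the prevailing spread/volume state. The state buckets, scaling values and function names are illustrative assumptions, not the calibrated policy developed later in the paper.

```python
import numpy as np

def hybrid_execution(ac_trajectory, market_states, policy):
    """Follow a precomputed AC volume trajectory, but scale each child
    order by a state-dependent proportion chosen by a learned policy.
    The final period sends a market order for all remaining volume."""
    remaining = float(np.sum(ac_trajectory))
    executed = []
    for k, (n_k, state) in enumerate(zip(ac_trajectory, market_states)):
        if k == len(ac_trajectory) - 1:
            child = remaining                        # complete liquidation
        else:
            child = min(policy[state] * n_k, remaining)
        executed.append(child)
        remaining -= child
    return executed

# Illustrative (hypothetical) policy: trade a larger proportion of the
# AC child order when spreads are tight and volumes are high.
policy = {("tight", "high"): 1.5, ("tight", "low"): 1.0,
          ("wide", "high"): 1.0, ("wide", "low"): 0.5}
states = [("tight", "high"), ("wide", "low"), ("tight", "low")]
schedule = hybrid_execution([4000, 3000, 3000], states, policy)
```

The final-period fallback mirrors the complete-liquidation constraint described above.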
The model described above is compared with the base Almgren-Chriss model to determine whether it increases/decreases the cost of execution for different types of trades consistently and significantly. This study will help determine whether reinforcement learning is a viable technique which can be used to extend existing closed-form solutions to exploit the nuances of the microstructure where the algorithms are applied.

This paper proceeds as follows: Section 2 introduces the standard Almgren-Chriss model. Section 3 describes the specific hybrid reinforcement learning technique proposed, along with a discussion regarding convergence to optimal action values. Section 4 discusses the data used and results, comparing the two models for multiple trade types. Section 5 concludes and proposes some extensions for further research.
II. THE ALMGREN-CHRISS MODEL

Bertsimas and Lo are pioneers in the area of optimal liquidation, treating the problem as a stochastic dynamic programming problem [7]. They employed a dynamic optimisation procedure which finds an explicit closed-form best execution strategy, minimising trading costs over a fixed period of time for large transactions. Almgren and Chriss extended the work of [7] to allow for risk aversion in their framework [1]. They show that a risk-averse trader can select strategies with lower execution cost variance for the same or lower level of expected execution cost.

The exposition of their solution is as follows: They assume that the security price evolves according to a discrete arithmetic random walk:

S_k = S_{k-1} + \sigma \tau^{1/2} \xi_k - \tau g(n_k / \tau),

where:

S_k = price at time k,
\sigma = volatility of the security,
\tau = length of the discrete time interval,
\xi_k = draws from independent random variables,
n_k = volume traded at time k, and
g(\cdot) = permanent price impact.
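For intuition, this price process can be simulated directly. The sketch below assumes, for illustration, the linear permanent impact form g(v) = \gamma v (consistent with the linear-price-impact setting named in the abstract); all numerical values are arbitrary.

```python
import numpy as np

def simulate_prices(S0, sigma, tau, volumes, gamma, rng):
    """S_k = S_{k-1} + sigma * tau**0.5 * xi_k - tau * g(n_k / tau),
    with the illustrative linear permanent impact g(v) = gamma * v."""
    prices = [S0]
    for n_k in volumes:
        xi_k = rng.standard_normal()          # i.i.d. standard normal draw
        impact = gamma * (n_k / tau)          # g(n_k / tau)
        prices.append(prices[-1] + sigma * tau**0.5 * xi_k - tau * impact)
    return np.array(prices)

rng = np.random.default_rng(42)
volumes = np.full(10, 10_000.0)               # sell 100k shares in 10 equal slices
path = simulate_prices(S0=100.0, sigma=0.5, tau=1.0,
                       volumes=volumes, gamma=1e-6, rng=rng)
```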
Here, permanent price impact refers to changes in the equilibrium price as a direct function of our trading, which persist for at least the remainder of the liquidation horizon. Temporary price impact refers to adverse deviations as a result of absorbing available liquidity supply, but where the impact dissipates by the next trading period due to the resilience of the order book. Almgren and Chriss introduce a temporary price impact function h(v) to their model, where h(v) causes a temporary adverse move in the share price as a function of our trading rate v [1]. Given this addition, the actual security transaction price at time k is given by:

\tilde{S}_k = S_{k-1} - h(n_k / \tau).

Assuming a sell program, we can then define the total trading revenue as:

\sum_{k=1}^{N} n_k \tilde{S}_k = X S_0 + \sum_{k=1}^{N} \left( \sigma \tau^{1/2} \xi_k - \tau g(n_k / \tau) \right) x_k - \sum_{k=1}^{N} n_k h(n_k / \tau),

where x_k = X - \sum_{j=1}^{k} n_j = \sum_{j=k+1}^{N} n_j for k = 0, 1, \ldots, N.

The total cost of trading is thus given by x = X S_0 - \sum_{k=1}^{N} n_k \tilde{S}_k, i.e. the difference between the target revenue value and the total actual revenue from the execution. This definition refers to Perold's implementation shortfall measure (see [21]), and serves as the primary transaction cost metric which is minimised in order to maximise trading revenue. Since implementation shortfall is a random variable, Almgren and Chriss compute the following:

E(x) := \sum_{k=1}^{N} \tau x_k g(n_k / \tau) + \sum_{k=1}^{N} n_k h(n_k / \tau)