Reinforcement Learning For Trade Execution
This research will form the basis for solving our execution problems using an RL framework.
INTRODUCTION
We are interested in developing trade execution algorithms that minimize losses and
limit risk across a variety of markets (e.g., cash, futures), market regimes (efficient,
challenging), benchmarks (arrival price, volume-weighted average price), and risk
measures (variance, expected shortfall). Traditionally, execution algorithms are often
developed using some combination of optimal solutions to stylized versions of the
problem expressed using market models and heuristic algorithms. Here, we explore an
alternate strategy: training execution policies using deep reinforcement learning.
For execution, we are interested in buying or selling a number of lots of an asset under some constraints and with some final utility. The utility can vary, but it revolves around achieving a lower cost relative to a specified benchmark with an acceptable degree of variance. Finishing the order in the underlying symbol within a given duration is the most critical constraint, but there may be other constraints arising from market conditions. These problems are, in reality, sequential games in which we compete in markets with many other actors. However, we will approximate them as Markov decision processes (MDPs) in ways that preserve the essential properties needed to learn execution strategies that transfer well to real markets.
Deep reinforcement learning methods are a class of solution methods for large Markov decision processes such as those with which we can approximate trade execution. By training neural network policies that map order and market states to actions, these methods can learn from observations with complex structure, such as a limit order book.
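To make this concrete, the sketch below shows what such a policy network might look like. It is a minimal illustration in Python/PyTorch; the feature set, layer sizes, and three-action output are assumptions for exposition, not a description of the policies discussed later.

```python
# Minimal sketch of a neural-network execution policy (illustrative only).
# The three-action output, the feature set, and the layer sizes are
# assumptions for exposition, not a description of the policies below.
import torch
import torch.nn as nn

class ExecutionPolicy(nn.Module):
    def __init__(self, n_features: int = 8, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # state: (batch, n_features) order and market features, e.g. time
        # remaining, inventory vs. schedule, spread, quote sizes, imbalance.
        return torch.distributions.Categorical(logits=self.net(state))

policy = ExecutionPolicy()
state = torch.randn(1, 8)          # placeholder feature vector
action = policy(state).sample()    # sampled action index
```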
While there are many different reinforcement learning methods, and we find success using Proximal Policy Optimization (PPO) in the experiments described below, these experiments could be solved in various ways. PPO is a highly successful RL algorithm introduced in [3]. At a high
level, reinforcement learning methods take a similar approach to humans when
optimizing execution strategies. The learning algorithm explores different variants of
how it can act and exploits actions that lead to good performance. However, the nature
of the exploration-exploitation behavior is different. Deep reinforcement learning
algorithms often use simple randomization for exploration and exploit advantageous behavior by slowly increasing the likelihood of actions in states that have historically resulted in improved outcomes. While humans decide which new policies to explore and adopt based on their market understanding, reinforcement learning methods can access much more data and explore much more rapidly than humans, allowing for outperformance in many tasks. For the interested reader, we refer to [4] for a detailed introduction to reinforcement learning.
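For concreteness, the core of PPO [3] is its clipped surrogate objective. The fragment below shows only that loss term; advantage estimation, the value and entropy terms, and the surrounding training loop are omitted, so it is a sketch rather than our implementation.

```python
# PPO clipped surrogate loss, following [3]; a fragment, not a full trainer.
import torch

def ppo_clip_loss(log_probs_new: torch.Tensor,
                  log_probs_old: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio r = pi_new(a|s) / pi_old(a|s).
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximise the clipped surrogate, i.e. minimise its negative.
    return -torch.min(unclipped, clipped).mean()
```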
We will restrict this set of problems in several ways to simplify their solution. However,
we demonstrate how some of the most restrictive simplifications can be lifted. Deep
reinforcement learning can potentially learn solutions to very general versions of the
trade execution problem. In the first part of this article, we describe in detail an
experiment where the trading agent is only allowed to send limit and market orders at
the touch. In the second part, we describe an experiment where the agent also controls the depth of its limit order placements.
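As a schematic summary of the two set-ups (our own shorthand, not a specification of the actual agents), the action spaces can be written as follows.

```python
# Schematic action spaces for the two experiments (our own shorthand).
from enum import IntEnum

class TouchAction(IntEnum):
    """First experiment: the agent acts only at the touch."""
    DO_NOTHING = 0
    LIMIT_AT_TOUCH = 1   # rest a limit order at the best quote
    MARKET_ORDER = 2     # cross the spread

# Second experiment: the action additionally selects a placement depth,
# e.g. in ticks relative to the best quote (0 = at the touch,
# -1 = inside the spread). The exact grid here is an assumption.
DEPTH_GRID_TICKS = [-1, 0, 1, 2, 3, 5]
```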
finding an optimal solution is not possible. However, we can train the RL algorithm directly on the historical market simulator, which leads to an approximately optimal policy. To evaluate the performance of the RL solution, we will run the RL solution and the model-based solution on a hold-out test set.
In this simple model, we do not assume market impact. However, we prevent the agent
from trading too quickly by forcing it to follow the TWAP schedule, which is defined by
$$Q_n^{\mathrm{TWAP}} = Q_0\,\frac{N-n}{N}, \qquad n \in \{0, \dots, N\}. \tag{5}$$
Our objective is to maximise the final cash while staying close to the TWAP schedule. Therefore, we want to maximise
$$\mathbb{E}\left[ X_N - \lambda \sum_{n=0}^{N} \left( Q_n - Q_n^{\mathrm{TWAP}} \right)^{2} \right], \qquad \lambda > 0, \tag{6}$$
where 𝜆 is a risk preference, controlling the degree of deviation from the target schedule.
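For reference, (5) and (6) translate directly into code. The sketch below assumes the cash and inventory paths come from a simulator and evaluates a single realisation of the objective; the expectation in (6) would be estimated by averaging over many simulated paths.

```python
# Sketch: the TWAP target schedule (5) and the risk-penalised objective (6).
import numpy as np

def twap_schedule(q0: float, n_steps: int) -> np.ndarray:
    # Q_n^TWAP = Q_0 * (N - n) / N for n = 0, ..., N.
    n = np.arange(n_steps + 1)
    return q0 * (n_steps - n) / n_steps

def realised_objective(final_cash: float,
                       inventory_path: np.ndarray,  # realised Q_0, ..., Q_N
                       q0: float,
                       lam: float = 0.01) -> float:
    target = twap_schedule(q0, len(inventory_path) - 1)
    return final_cash - lam * np.sum((inventory_path - target) ** 2)

# Example: ten lots over five minutes in ten-second intervals (N = 30).
print(twap_schedule(q0=10, n_steps=30))
```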
Assuming that the spread is constant, an optimal solution of (6) is easy to find through dynamic programming. The optimal solution has a simple form with an ahead boundary and a behind boundary. We do not give the details of the derivation here. Instead, we
illustrate the behavior of the solution in Figure 1. In this example, the agent trades ten
lots of gold over five minutes in ten-second intervals. The market parameters are
calibrated based on historical data. The grey dashed line represents the TWAP inventory
during the execution period. If the agent’s inventory is far ahead of the target inventory (below the green line), the agent does not send any orders. If the inventory is roughly in line with the target schedule (above the green and below the red line), the agent sends only limit orders. Finally, if the agent’s inventory is far behind schedule (above the red line), the agent sends limit and market orders.
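In code, this decision rule is a simple threshold policy around the TWAP target. The sketch below treats the two boundaries as fixed offsets from the target for illustration; in the model-based solution they come from the dynamic-programming recursion.

```python
# Sketch of the model-based threshold policy illustrated in Figure 1.
# The two boundaries are treated as fixed offsets from the TWAP target
# here; in the model-based solution they come from dynamic programming.
def model_based_action(inventory: float,
                       twap_target: float,
                       ahead_offset: float,
                       behind_offset: float) -> str:
    if inventory <= twap_target - ahead_offset:
        return "do_nothing"        # far ahead of schedule (below the green line)
    if inventory >= twap_target + behind_offset:
        return "limit_and_market"  # far behind schedule (above the red line)
    return "limit_only"            # roughly on schedule
```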
FIGURE 2 Baseline vs. Reinforcement Learning Policies. Performance comparison of the model-based dynamic programming solution and reinforcement learning policies trained on (i) order state only and (ii) order and market state. Mean profit vs. the benchmark improves very slightly, while schedule deviation decreases dramatically.
Therefore, a higher number of buy orders 𝐵𝑛 leads to a higher fill probability for our limit
orders. While this fill mechanism is still simplistic, it captures the relationship between
fill probabilities and liquidity and provides a good trade-off between realism and
practicality.
At each time 𝑛, the agent still earns the profit (3) and wants to optimize the overall objective (6). The difference from the stochastic model of the previous section is that we model the fill probabilities 𝐹𝑛, the ask price 𝑆𝑛, and the spread 𝛼𝑛 as described in the last paragraph. Otherwise, the optimization problem stays the same.
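One way to read this fill mechanism (our interpretation, consistent in form with the calibration formula given below for the model-based benchmark) is that a resting limit order at the touch is filled in interval 𝑛 with probability $1 - (1 - p)^{B_n}$:

```python
# Sketch of a liquidity-driven fill model: our reading of the mechanism,
# using the same functional form as the calibration formula given below.
import numpy as np

rng = np.random.default_rng(0)

def fill_probability(p: float, n_buy_orders: int) -> float:
    # If each arriving buy order independently fills our resting limit
    # order with probability p, at least one fill occurs with
    # probability 1 - (1 - p)^{B_n}.
    return 1.0 - (1.0 - p) ** n_buy_orders

def limit_order_filled(p: float, n_buy_orders: int) -> bool:
    return rng.random() < fill_probability(p, n_buy_orders)

print(fill_probability(p=0.1, n_buy_orders=5))  # ~0.41
```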
FIGURE 3 Trade Execution for RL Policy. The filled green triangles represent filled limit orders, the empty green triangles represent limit orders that were placed but not filled, and the red triangles represent market orders.
We then compare the RL solution against the model-based solution by running both
solutions on the test set in the historical simulator. To find the model-based solution, we
fit the price volatility in (1) and the fill probability in (2) based on the training data set. In particular, we estimate the fill probability by $\rho = 1 - (1 - p)^{\bar{B}}$, where $\bar{B}$ is the average
number of buy trades arriving at the best ask in the training period. This calibration
makes the model close in structure to the historical market data simulator, the main
difference being that the fill probability and price volatility are constant in the former.
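This calibration step is straightforward to express in code. The sketch below assumes we have the per-interval buy-trade counts from the training set and a per-order fill parameter p.

```python
# Sketch: calibrating the constant fill probability of the model-based
# solution from the training set. The per-interval buy-trade counts and
# the per-order fill parameter p are assumed to come from the training data.
import numpy as np

def constant_fill_probability(buy_trades_per_interval: np.ndarray,
                              p: float) -> float:
    # rho = 1 - (1 - p)^B_bar, where B_bar is the average number of buy
    # trades arriving at the best ask per interval in the training period.
    b_bar = buy_trades_per_interval.mean()
    return 1.0 - (1.0 - p) ** b_bar

counts = np.array([3, 7, 5, 4, 6])               # made-up example counts
print(constant_fill_probability(counts, p=0.1))
```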
In Figure 2, we show the performance of the RL solution relative to the model solution
for a fixed risk aversion 𝜆 = 0.01. We fit two RL policies using the same reinforcement learning method. The first one uses only the current time and inventory as state variables. The second one uses time, inventory, and additional market states as state variables. To represent the market state, we choose the current quote size, spread, imbalance, and some lags of the total number of orders and volatility. Compared to the
benchmark solution, the RL policies improve the mean cash shortfall against the TWAP
marginally (upper panel in Figure 2). However, the schedule deviation reduces
dramatically from the model to the RL solutions (lower panel in Figure 2). Furthermore,
including additional market states reduces the schedule deviation even further. This behavior is in line with our expectations. Improving the cash shortfall against TWAP would stem from short-term price predictions, which are generally hard to make and require more sophisticated feature engineering. Staying closer to a target schedule is related to liquidity forecasting, which is easier than price forecasting. The RL algorithm
effectively learns dynamic ahead and behind boundaries depending on market conditions,
whereas the model-based agent uses static boundaries as illustrated in Figure 1.
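For illustration, the two state representations can be assembled as below. The specific features, lag lengths, and any scaling are placeholders rather than the precise feature set we used.

```python
# Sketch: assembling the two state vectors given to the RL policies.
import numpy as np

def order_state(time_remaining: float,
                inventory: float,
                twap_target: float) -> np.ndarray:
    # Policy 1: order state only (time and inventory relative to schedule).
    return np.array([time_remaining, inventory, inventory - twap_target])

def order_and_market_state(time_remaining: float,
                           inventory: float,
                           twap_target: float,
                           quote_size: float,
                           spread: float,
                           imbalance: float,
                           lagged_order_counts: np.ndarray,
                           lagged_volatility: np.ndarray) -> np.ndarray:
    # Policy 2: order state plus current market state and short lags.
    return np.concatenate([
        order_state(time_remaining, inventory, twap_target),
        [quote_size, spread, imbalance],
        lagged_order_counts,
        lagged_volatility,
    ])
```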
historical data. The reinforcement learning method once again learned a policy to control schedule deviation, this time by placing deeper in the book ahead of schedule
and closer to the top of the book or even inside the spread when behind. Figure 4 shows
the resulting placement distribution across price levels and the distribution of schedule
deviations throughout the order. During higher volatility periods, the learned policy
places limit orders deeper in the book, as these deeper orders are more likely to be filled.
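Schematically, the learned depth-control behaviour can be caricatured as a mapping from schedule deviation and volatility to a placement depth in ticks. The thresholds below are illustrative only and are not the learned policy itself.

```python
# Caricature of the learned depth-control behaviour described above.
# The thresholds are illustrative only; the actual policy is a neural
# network mapping order and market state to a placement depth.
def placement_depth_ticks(schedule_deviation_lots: float,
                          high_volatility: bool) -> int:
    """Depth relative to the best quote: positive = deeper in the book,
    0 = at the touch, -1 = inside the spread. schedule_deviation_lots is
    inventory minus the TWAP target (positive = behind schedule)."""
    if schedule_deviation_lots < -1:      # well ahead of schedule: be patient
        depth = 3
    elif schedule_deviation_lots <= 1:    # roughly on schedule
        depth = 1
    else:                                 # behind schedule: be aggressive
        depth = -1                        # quote inside the spread
    if high_volatility and depth >= 0:    # place deeper when volatility is high
        depth += 1
    return depth
```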
CONCLUSION
The above-mentioned and other ongoing experiments have convinced us that combining
deep reinforcement learning methods and historical simulation is a feasible strategy for
learning complex adaptive execution strategies. While the examples covered here are not
meant to be implemented in production, they and future experiments will serve to guide
our research on optimal execution strategies. Future iterations may more directly impact
our production trading systems, as we continue to advance the complexity of execution
strategies we are able to solve efficiently with these methods.
References
[1] Alvaro Cartea and Sebastian Jaimungal. “Optimal execution with limit and market
orders”. In: Quantitative Finance 15.8 (2015), pp. 1279–1291.
[2] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. “Reinforcement learning for optimized trade execution”. In: Proceedings of the 23rd International Conference on Machine Learning. 2006, pp. 673–680.
[3] John Schulman et al. “Proximal policy optimization algorithms”. In: arXiv preprint
arXiv:1707.06347 (2017).
[4] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.