
REINFORCEMENT LEARNING FOR TRADE EXECUTION:
EMPIRICAL EVIDENCE BASED ON SIMULATIONS

ISAAC SCHEINFELD
MORITZ WEISS
MARCH 31, 2023

EXECUTIVE SUMMARY
We provide proof-of-concept solutions to trade execution problems using deep reinforcement learning (RL) and classical methods.

We develop stand-alone simulations to train execution policies that we expect to work well in practice.

This research will form the basis for solving our execution problems within an RL framework.

INTRODUCTION
We are interested in developing trade execution algorithms that minimize losses and
limit risk across a variety of markets (e.g., cash, futures), market regimes (efficient,
challenging), benchmarks (arrival price, volume-weighted average price), and risk
measures (variance, expected shortfall). Traditionally, execution algorithms are developed using some combination of optimal solutions to stylized versions of the problem, expressed through market models, and heuristic algorithms. Here, we explore an alternative strategy: training execution policies using deep reinforcement learning.

For execution, we are interested in buying or selling a number of lots of an asset under certain constraints and with some final utility. The utility may vary, but it revolves around achieving a lower cost relative to a specified benchmark with a limited degree of variance. Completing the order in the underlying symbol within a given duration is the most critical constraint, but there may be other constraints arising from market conditions. These problems are, in reality, sequential games in which we compete in markets with many actors. However, we will approximate them as Markov decision processes (MDPs) in ways that preserve the essential properties necessary to learn execution strategies that transfer well to real markets.

Deep reinforcement learning methods are a class of solution methods for large Markov decision processes such as those with which we can approximate trade execution. By training neural network policies that map order and market states to actions, these methods can learn from observations with complex structure, such as a limit order book.

While there are many different reinforcement learning methods, and we find success using Proximal Policy Optimization (PPO) in the experiments described below, these experiments could be solved in various ways. Proximal Policy Optimization is a highly successful RL algorithm introduced in [3]. At a high level, reinforcement learning methods take a similar approach to humans when optimizing execution strategies. The learning algorithm explores different variants of how it can act and exploits actions that lead to good performance. However, the nature of the exploration-exploitation behavior is different. Deep reinforcement learning algorithms often use simple randomization for exploration and exploit advantageous behavior by slowly increasing the likelihood of actions in states that historically result in improved outcomes. While humans can decide which new policies to explore and adopt informed by market understanding, reinforcement learning methods can access much more data and explore much more rapidly than humans, allowing for outperformance in many tasks. For the interested reader, we refer to [4] for a detailed introduction to reinforcement learning.
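
As a concrete illustration of the training interface (a minimal sketch of our own, not the authors' setup), the snippet below trains a PPO agent with the stable-baselines3 library on a placeholder gymnasium environment; for execution, the environment would instead expose order and market state as observations and order placements as actions.

```python
# Minimal PPO training sketch (illustrative only; "CartPole-v1" is a placeholder
# environment standing in for an execution simulator).
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=0)  # stochastic policy provides the exploration
model.learn(total_timesteps=10_000)       # clipped policy updates exploit good actions

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```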

We will restrict this set of problems in several ways to simplify their solution. However,
we demonstrate how some of the most restrictive simplifications can be lifted. Deep
reinforcement learning can potentially learn solutions to very general versions of the
trade execution problem. In the first part of this article, we describe in detail an
experiment where the trading agent is only allowed to send limit and market orders at
the touch. In the second part, we describe an experiment where the agent can control the
optimal depth of its limit order placements.

FIGURE 1 Decision Boundaries for Optimal Solution
The agent starts sending limit orders if the inventory is above the green line and starts sending market orders if the inventory is above the red line.

POSTING OPTIMALLY AT THE TOUCH


In this section, we introduce two market simulators. The first is a stochastic simulator,
while the second relies on replaying historical market data. The stochastic simulator is a discrete-time version of the market model introduced in [1]. The main advantage of the
stochastic simulator is that it allows finding an optimal solution through dynamic
programming, with the drawback that the assumptions on the simulation’s behavior are
simplistic. On the other hand, the historical market simulator is more realistic, but finding an optimal solution is not possible. However, we can train the RL algorithm directly on the historical market simulator, which leads to an approximately optimal
policy. To evaluate the performance of the RL solution, we will run the RL solution and
the model-based solution on a hold-out test set.

STOCHASTIC MARKET SIMULATOR


We are interested in selling an initial inventory of 𝑄0 ∈ ℕ lots over 𝑁 discrete, constant time steps {𝑛Δ𝑡, 𝑛 ∈ {0, 1, …, 𝑁}}. We assume that the ask price follows the dynamics
$$S_{n+1} = S_n + \sigma \xi_n, \qquad (1)$$
for a volatility 𝜎 > 0 and standard normal random variables 𝜉𝑛. At each non-terminal time step 𝑛 ∈ {0, 1, …, 𝑁 − 1} we can choose whether to place a single-lot limit order 𝑙𝑛 ∈ {0, 1} and whether to place a single-lot market order 𝑚𝑛 ∈ {0, 1}. The limit order fill is encoded through a Bernoulli random variable
$$F_n \sim \mathrm{Bernoulli}(\rho), \qquad (2)$$
where 𝜌 is the probability of a limit order fill. The profit at each time 𝑛 < 𝑁 is then given by
$$P_n = l_n F_n S_n + m_n (S_n - \alpha_n), \qquad (3)$$
where 𝛼𝑛 > 0 denotes the spread. Thus, the limit order earns the spread with uncertainty on the fill, while the market order loses the spread but is sure to be filled. Furthermore, any leftover inventory 𝑄𝑁 is traded as a market order under a sweep cost encoded by 𝑓(𝑄𝑁). The cash at the terminal time 𝑁 is then given by
$$X_N = \sum_{n=0}^{N-1} P_n + Q_N \left( S_N - \alpha_N - f(Q_N) \right). \qquad (4)$$

In this simple model, we do not assume market impact. However, we prevent the agent
from trading too quickly by forcing it to follow the TWAP schedule, which is defined by
$$Q_n^{\mathrm{TWAP}} = \frac{N-n}{N} Q_0, \qquad n \in \{0, \dots, N\}. \qquad (5)$$
Our objective is to maximise the final cash while staying close to the TWAP schedule. Therefore, we want to maximise
$$\mathbb{E}\left[ X_N - \lambda \sum_{n=0}^{N} \left( Q_n - Q_n^{\mathrm{TWAP}} \right)^2 \right], \qquad \lambda > 0, \qquad (6)$$
where 𝜆 is a risk preference, controlling the degree of deviation from the target schedule.
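
To make the model concrete, the sketch below (our own illustration, not the authors' implementation) simulates one episode of equations (1)–(6) under an arbitrary policy; the parameter values and the sweep-cost function are assumptions chosen for illustration.

```python
# Illustrative simulation of equations (1)-(6); parameter values and the sweep
# cost function are assumptions, not the paper's calibration.
import numpy as np

def simulate_episode(policy, q0=10, n_steps=30, sigma=0.5, spread=1.0,
                     rho=0.1, risk=0.01, sweep_cost=lambda q: 0.5 * q,
                     s0=100.0, seed=0):
    """Run one episode; `policy(n, q, twap)` returns (limit, market) in {0, 1}."""
    rng = np.random.default_rng(seed)
    s, q, cash, penalty = s0, q0, 0.0, 0.0
    for n in range(n_steps):
        twap = q0 * (n_steps - n) / n_steps            # TWAP schedule, eq. (5)
        penalty += risk * (q - twap) ** 2              # deviation term in eq. (6)
        limit, market = policy(n, q, twap)
        fill = int(limit and rng.random() < rho)       # F_n ~ Bernoulli(rho), eq. (2)
        fill = min(fill, q)                            # cannot sell more than we hold
        market = min(int(market), q - fill)
        cash += fill * s + market * (s - spread)       # profit P_n, eq. (3)
        q -= fill + market
        s += sigma * rng.normal()                      # ask price dynamics, eq. (1)
    cash += q * (s - spread - sweep_cost(q))           # terminal sweep, eq. (4)
    penalty += risk * q ** 2                           # n = N term: target inventory is zero
    return cash - penalty                              # one sample of objective (6)

# Example: rest a limit order every step and add a market order when behind schedule.
value = simulate_episode(lambda n, q, twap: (1, int(q > twap + 1)))
```

The default q0 and n_steps mirror the ten-lot, five-minute example below, while the remaining parameters are arbitrary.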

Assuming that the spread is constant, finding an optimal solution of (6) through dynamic programming is straightforward. The optimal solution has a simple form with an "ahead" boundary and a "behind" boundary. We do not give the details of how to derive the solution here. Instead, we
illustrate the behavior of the solution in Figure 1. In this example, the agent trades ten
lots of gold over five minutes in ten-second intervals. The market parameters are
calibrated based on historical data. The grey dashed line represents the TWAP inventory
during the execution period. If the agent’s inventory is far ahead of the target inventory
(below the green line) the agent does not send any orders. If the inventory is roughly in
line with the target schedule (above the green and below the red line) the agent sends
only limit orders. Finally, if the agent’s inventory is far behind schedule (above the red
line), the agent sends limit and market orders.
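
The decision rule in Figure 1 can be written compactly; the sketch below (with hypothetical boundary values supplied by the caller) maps the agent's current inventory to order placements at a single time step.

```python
# Boundary policy illustrated in Figure 1; the boundaries themselves would come
# from the dynamic programming solution and generally depend on the time step.
def boundary_policy(inventory, green_boundary, red_boundary):
    """Return (send_limit, send_market) for the current time step."""
    if inventory < green_boundary:   # far ahead of the target schedule: wait
        return 0, 0
    if inventory < red_boundary:     # roughly on schedule: limit order only
        return 1, 0
    return 1, 1                      # far behind schedule: limit and market orders
```

Plugged into the simulator sketch above, such a policy can be evaluated over many episodes.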

FIGURE 2 Baseline vs. Reinforcement Learning Policies
Performance comparison of the model-based dynamic programming solution and reinforcement learning policies trained on the order state only and on the order and market state. Mean profit vs. the benchmark improves very slightly, while schedule deviation decreases dramatically.

HISTORICAL MARKET SIMULATOR


It is more complex to pick a mechanism for limit order fills, which were modeled by (2) in the previous section. Limit orders are filled on a first-in, first-out basis. The fill
probability of a limit order increases as its queue position advances due to cancellations
or fills further up the queue. We do not model queue position explicitly. Instead, we
assume that the probability of a fill conditional on a market order arriving at the ask
price is given by a fixed probability 𝑝 ∈ [0, 1]. For example, 𝑝 = 1 corresponds to the
case where the limit order is always at the front of the queue, and every arriving market
order leads to a fill. In our examples, we choose 𝑝 conservatively between three and five
percent. We do not calculate the fill probability for every market order arriving but rather
aggregate orders within the time intervals. To be more precise, let 𝐵𝑛 be the number of buy orders arriving in the interval [𝑛Δ𝑡, (𝑛 + 1)Δ𝑡); then the limit order fill probability is encoded by
$$F_n \sim \mathrm{Bernoulli}(\rho_n), \qquad \rho_n = 1 - (1 - p)^{B_n}. \qquad (7)$$

Therefore, a higher number of buy orders 𝐵𝑛 leads to a higher fill probability of our limit
orders. While this fill mechanism is still simplistic, it captures the relationship between
fill probabilities and liquidity and provides a good trade-off between realism and
practicality.
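
A minimal sketch of this fill mechanism under the aggregation in (7) (function and variable names are ours):

```python
# Draw whether a resting limit order fills in one interval, given the number of
# aggregated buy market orders B_n and the per-order fill probability p.
import numpy as np

def limit_fill(buy_orders: int, p: float, rng: np.random.Generator) -> bool:
    rho = 1.0 - (1.0 - p) ** buy_orders   # rho_n in equation (7)
    return bool(rng.random() < rho)

rng = np.random.default_rng(0)
filled = limit_fill(buy_orders=12, p=0.03, rng=rng)   # p = 3% as used later in the experiments
```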

At each time 𝑛, the agent still earns the profit (3) and wants to optimize the overall objective (6). The difference from the stochastic model in the previous section is that the ask price 𝑆𝑛 and the spread 𝛼𝑛 are taken from the replayed historical data, while the fill probabilities 𝐹𝑛 are modeled as described in the last paragraph. Otherwise, the optimization problem stays the same.

FIGURE 3 Trade Execution for RL Policy
The filled green triangles represent filled limit orders. The empty green triangles represent limit orders that were placed but not filled. The red triangles represent market orders.

REINFORCEMENT LEARNING VS MODEL SOLUTION


In this section, we compare the RL solution against the model solution in the historical market simulator described above. We trade ten lots of gold over five minutes in ten-second intervals. We train the RL algorithm on all trading days from March 2022 to June 2022 and test it on all trading days in July 2022. Furthermore, we restrict trading to the most liquid trading hours. In total, we have 200,000 training episodes and 50,000 test episodes. The historical simulator works as described in the previous section.
We choose 𝑝 = 3% in (7) for the limit order fill probability conditional on a market order
arriving.

We then compare the RL solution against the model-based solution by running both
solutions on the test set in the historical simulator. To find the model-based solution, we
fit the price volatility in (1) and the fill probability in (2) based on the training data set. In particular, we estimate the fill probability by $\rho = 1 - (1 - p)^{\bar{B}}$, where $\bar{B}$ is the average number of buy trades arriving at the best ask in the training period. This calibration makes the model close in structure to the historical market data simulator, the main difference being that the fill probability and price volatility are constant in the stochastic model.
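
A sketch of this calibration step (our own code; the input arrays are assumed to hold the training-period ask prices and per-interval buy trade counts):

```python
# Estimate the constant volatility in (1) and fill probability in (2) from training data.
import numpy as np

def calibrate_model(ask_prices: np.ndarray, buy_counts: np.ndarray, p: float = 0.03):
    sigma = np.std(np.diff(ask_prices))          # volatility of ask price increments
    rho = 1.0 - (1.0 - p) ** buy_counts.mean()   # rho = 1 - (1 - p)^B_bar
    return sigma, rho
```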

In Figure 2, we show the performance of the RL solution relative to the model solution for a fixed risk aversion 𝜆 = 0.01. We fit two RL policies using the same reinforcement learning method. The first one uses only the current time and inventory as state variables. The second one uses time, inventory, and additional market states as state variables. To represent the market state, we choose the current quote size, spread, imbalance, and some lags of the total number of orders and of the volatility. Compared to the benchmark solution, the RL policies improve the mean cash shortfall against the TWAP marginally (upper panel in Figure 2). However, the schedule deviation reduces dramatically from the model to the RL solutions (lower panel in Figure 2). Furthermore,
including the additional market state reduces the schedule deviation even further. This behavior is in line with our expectations. Improving the cash shortfall against TWAP would require short-term price predictions, which are generally hard to make and call for more sophisticated feature engineering. Staying closer to a target schedule is related to liquidity forecasting, which is easier than price forecasting. The RL algorithm effectively learns dynamic ahead and behind boundaries depending on market conditions, whereas the model-based agent uses static boundaries, as illustrated in Figure 1.
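
The exact feature definitions are not given above; the sketch below is one plausible construction of such a market-state vector, and every definition in it is our assumption.

```python
# Illustrative market-state features: quote size, spread, imbalance, and lags of
# the total order count and realized volatility (all definitions are assumptions).
import numpy as np

def market_state(bid_size, ask_size, bid, ask, order_counts, volatilities, n_lags=3):
    spread = ask - bid
    imbalance = (bid_size - ask_size) / (bid_size + ask_size)
    features = [bid_size + ask_size, spread, imbalance]
    features += list(order_counts[-n_lags:])     # lags of the total number of orders
    features += list(volatilities[-n_lags:])     # lags of realized volatility
    return np.array(features, dtype=np.float32)
```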

We illustrate the typical behavior of a learned strategy by showing a single trade execution example in Figure 3. The upper panel in this figure shows bid and ask prices
and the trades of the agent. An empty green triangle corresponds to a limit order that
was placed by the agent but not filled. On the other hand, a filled green triangle
corresponds to a filled limit order. The red triangles denote market orders. The lower
panel shows the agent's inventory. At the beginning of the trading episode, the agent
places limit orders. In the middle of the execution period, the agent stops placing limit
orders because it is ahead of the trading schedule. Toward the end of the episode, the
agent starts placing limit orders again. However, it does not get enough fills and starts
sending market orders to get back on schedule and avoid the terminal sweep cost.

FIGURE 4 Limit Order Placement Policy Targeting a Constant Execution Rate
Limit order depth distribution and schedule deviation distribution for a limit order placement policy targeting a constant execution rate.
OPTIMAL LIMIT ORDER DEPTH


In the previous sections, we restricted the agent to only post at the touch. We also
conducted an experiment in which the agent could choose the optimal depth of its limit
order placement. The authors in [2] study a similar trade execution problem, and our
results align with theirs. In contrast to the previous section, we do not have an optimal
model-based solution for this experiment. Instead, we train the agent directly on a
historical market simulator and analyze its behavior. To ensure that the simulated
results transfer reasonably to the real market, we constrained the algorithm to place up
to a small fraction of the existing quote size at a single price level. In addition, we
assumed orders remained at the back of their price queues when simulating fills from historical data. The reinforcement learning method once again learned a policy to control schedule deviation, this time by placing deeper in the book ahead of schedule
and closer to the top of the book or even inside the spread when behind. Figure 4 shows
the resulting placement distribution across price levels and the distribution of schedule
deviations throughout the order. During higher volatility periods, the learned policy
places limit orders deeper in the book, as these deeper orders are more likely to be filled.
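
To give a flavor of the conservative simulation assumptions described above (our reading, with hypothetical parameter values):

```python
# Cap each child order at a small fraction of the displayed quote, and treat a
# resting sell limit order as filled only once the market trades through its price
# (a conservative reading of the back-of-queue assumption).
def max_child_size(quote_size: int, fraction: float = 0.05) -> int:
    return max(1, int(fraction * quote_size))   # fraction value is hypothetical

def sell_limit_filled(level: float, trade_prices: list[float]) -> bool:
    return any(px > level for px in trade_prices)
```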

CONCLUSION
The experiments described above, together with other ongoing experiments, have convinced us that combining deep reinforcement learning methods with historical simulation is a feasible strategy for learning complex adaptive execution strategies. While the examples covered here are not
meant to be implemented in production, they and future experiments will serve to guide
our research on optimal execution strategies. Future iterations may more directly impact
our production trading systems, as we continue to advance the complexity of execution
strategies we are able to solve efficiently with these methods.

References

[1] Alvaro Cartea and Sebastian Jaimungal. "Optimal execution with limit and market orders". In: Quantitative Finance 15.8 (2015), pp. 1279–1291.
[2] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. "Reinforcement learning for optimized trade execution". In: Proceedings of the 23rd International Conference on Machine Learning. 2006, pp. 673–680.
[3] John Schulman et al. "Proximal policy optimization algorithms". In: arXiv preprint arXiv:1707.06347 (2017).
[4] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

