DeepLOB: Deep Convolutional Neural Networks for Limit Order Books
Abstract—We develop a large-scale deep learning model to predict price movements from limit order book (LOB) data of cash equities. The architecture utilises convolutional filters to capture the spatial structure of the limit order books as well as LSTM modules to capture longer time dependencies. The proposed network outperforms all existing state-of-the-art algorithms on the benchmark LOB dataset [1]. In a more realistic setting, we test our model using one year of market quotes from the London Stock Exchange, and the model delivers a remarkably stable out-of-sample prediction accuracy for a variety of instruments. Importantly, our model translates well to instruments which were not part of the training set, indicating the model's ability to extract universal features. In order to better understand these features and to go beyond a "black box" model, we perform a sensitivity analysis to understand the rationale behind the model predictions and reveal the components of LOBs that are most relevant. The ability to extract robust features which translate well to other instruments is an important property of our model with many other applications.

The authors are with the Oxford-Man Institute of Quantitative Finance, Department of Engineering Science, University of Oxford (e-mail: [email protected]). Github: https://github.com/zcakhaa
I. INTRODUCTION

In today's competitive financial world more than half of the markets use electronic Limit Order Books (LOBs) [2] to record trades [3]. Unlike traditional quote-driven marketplaces, where traders can only buy or sell an asset at one of the prices made publicly by market makers, traders now can directly view all resting limit orders¹ in the limit order book of an exchange. Because limit orders are arranged into different levels based on their submitted prices, the evolution in time of a LOB represents a multi-dimensional problem, with elements representing the numerous prices and order volumes/sizes at multiple levels of the LOB on both the buy and sell sides.

¹ Limit orders are orders that do not match immediately upon submission and are also called passive orders. This is opposed to orders that match immediately, so-called aggressive orders, such as a market order. A LOB is simply a record of all resting/outstanding limit orders at a given point in time.

A LOB is a complex dynamic environment with high dimensionality, inducing modelling complications that make traditional methods difficult to cope with. Mathematical finance is often dominated by models of evolving price sequences. This leads to a range of Markov-like models with stochastic driving terms, such as the vector autoregressive model (VAR) [4] or the autoregressive integrated moving average model (ARIMA) [5]. These models, to avoid excessive parameter spaces, often rely on handcrafted features of the data. However, given the billions of electronic market quotes that are generated every day, it is natural to employ more modern data-driven machine learning techniques to extract such features.

In addition, limit order data, like any other financial time-series data, is notoriously non-stationary and dominated by stochastics. In particular, orders at deeper levels of the LOB are often placed and cancelled in anticipation of future price moves and are thus even more prone to noise. Other problems, such as auctions and dark pools [6], add further difficulties, bringing ever more unobservability into the environment. The interested reader is referred to [7], in which a number of these issues are reviewed.

In this paper we design a novel deep neural network architecture that incorporates both convolutional layers and Long Short-Term Memory (LSTM) units to predict future stock price movements in large-scale high-frequency LOB data. One advantage of our model over previous research [8] is its ability to adapt to many stocks by extracting representative features from highly noisy data.

In order to avoid the limitations of handcrafted features, we use a so-called Inception Module [9] to wrap convolutional and pooling layers together. The Inception Module helps to infer local interactions over different time horizons. The resulting feature maps are then passed into LSTM units which can capture dynamic temporal behaviour. We test our model on a publicly available LOB dataset, known as FI-2010 [1], and our method remarkably outperforms all existing state-of-the-art algorithms. However, the FI-2010 dataset is only made up of 10 consecutive days of down-sampled, pre-normalised data from a less liquid market. While it is a valuable benchmark set, it is arguably not sufficient to fully verify the robustness of an algorithm. To ensure the generalisation ability of our model, we further test it using one year of order book data for 5 stocks from the London Stock Exchange (LSE). To minimise the problem of overfitting to backtest data, we carefully optimise all hyper-parameters on a separate validation set before moving to the out-of-sample test set. Our model delivers robust out-of-sample prediction accuracy across stocks over a test period of three months.

As well as presenting results on out-of-sample data (in a timing sense) from stocks used to form the training set, we also test our model on out-of-sample (in both a timing and a data-stream sense) stocks that are not part of the training set. Interestingly, we still obtain good results over the whole testing period. We believe this observation shows not only that the proposed model is able to extract robust features from order books, but also indicates the existence of universal features in the order book that modulate stock demand and price. The ability to transfer the model to new instruments opens up a number of possibilities that we consider for future work.
To show the practicability of our model we use it in a simple trading simulation. We focus on sufficiently liquid stocks so that slippage and market impact are small. Indeed, these stocks are generally harder to predict than less liquid ones. Since our trading simulation is mainly meant as a method of comparison between models, we assume trading takes place at the mid-price² and compare gross profits before fees. The former assumption is equivalent to assuming that one side of the trade may be entered into passively, and the latter assumes that different models trade similar volumes and would thus be subject to similar fees. Our focus here is using a simulation as a measure of the relative value of the model predictions in a trading setting. Under these simplifications, our model delivers significantly positive returns with a relatively small risk.

² The average of the best buy and best sell prices in the market at the time.
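As an illustration, such a simulation reduces to a few lines (a sketch under the stated assumptions; mid and signals are hypothetical arrays holding the mid-price series and model predictions in {−1, 0, +1}):

```python
import numpy as np

def gross_pnl_at_mid(mid: np.ndarray, signals: np.ndarray) -> float:
    """Simplified simulation: hold +1/0/-1 units according to the model's
    signal and mark every trade at the mid-price; fees, slippage and
    market impact are ignored by assumption."""
    pnl = 0.0
    for t in range(len(signals) - 1):
        position = float(signals[t])              # rebalance to the signalled direction
        pnl += position * (mid[t + 1] - mid[t])   # P&L from the next mid-price move
    return pnl
```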
Although our network achieves good performance, a complex "black box" system, such as a deep neural network, has limited use for financial applications without some understanding of the rationale behind the model predictions. Here we exploit the model-agnostic LIME method [10] to highlight highly relevant components in the order book and so gain a better understanding of the relationship between our predictions and model inputs. Reassuringly, these conform to sensible (though arguably unusual) patterns of activity in both price and volume within the order book.

Outline: The remainder of the paper is organised as follows. Section II introduces background and related work. Section III describes limit order data and the various stages of data preparation. We present our network architecture in Section IV and give justifications for each component of the model. In Section V we compare our work with a large group of popular methods. Section VI summarises our findings and considers extensions and future work.

II. BACKGROUND AND RELATED WORK

Research on the predictability of stock markets has a long history in the financial literature, e.g., [11, 12]. Although opinions differ regarding the efficiency of markets, many widely accepted studies show that financial markets are to some extent predictable [13, 14, 15, 16]. Two major classes of work which attempt to forecast financial time-series are, broadly speaking, statistical parametric models and data-driven machine learning approaches [17]. Traditional statistical methods generally assume that the time-series under study are generated from a parametric process [18]. There is, however, agreement that stock returns behave in more complex ways, typically highly nonlinearly [19, 20]. Machine learning techniques are able to capture such arbitrary nonlinear relationships with little, or no, prior knowledge regarding the input data [21].

Recently, there has been a surge of interest in predicting limit order book data by using machine learning algorithms [1, 22, 23, 24, 25, 26, 27, 20, 28, 29]. Among many machine learning techniques, pre-processing or feature extraction is often performed, as financial time-series data is highly stochastic. Generic feature extraction approaches have been implemented, such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) in the work of [24]. However, these extraction methods are static pre-processing steps, which are not optimised to maximise the overall objective of the model that observes them. In the work of [25, 24], the Bag-of-Features model (BoF) is expressed as a neural layer and the model is trained end-to-end using the back-propagation algorithm, leading to notably better results on the FI-2010 dataset [1]. These works suggest the importance of a data-driven approach to extract representative features from a large amount of data. In our work, we advocate end-to-end training and show that the deep neural network by itself not only leads to even better results but also transfers well to new instruments (not part of the training set), indicating the ability of networks to extract "universal" features from the raw data.

Arguably, one of the key contributions of modern deep learning is the addition of feature extraction and representation as part of the learned model. The Convolutional Neural Network (CNN) [30] is a prime example, in which information extraction, in the form of filter banks, is automatically tuned to the utility function that the entire network aims to optimise. CNNs have been successfully applied to various application domains, for example, object tracking [31], object detection [32] and segmentation [33]. However, there have been but a few published works that adopt CNNs to analyse financial microstructure data [34, 35, 26], and the existing CNN architectures are rather unsophisticated and lack thorough investigation. Just as when moving from "AlexNet" [36] to "VGGNet" [37], we show that a careful design of the network architecture can lead to better results compared with all existing methods.

The Long Short-Term Memory (LSTM) [38] was originally proposed to solve the vanishing gradients problem [39] of recurrent neural networks, and has been widely used in applications such as language modelling [40] and sequence-to-sequence learning [41]. Unlike CNNs, which are less widely applied in financial markets, the LSTM has been popular in recent years, with [42, 28, 43, 44, 45, 46, 47, 20] all utilising LSTMs to analyse financial data. In particular, [20] uses limit order data from 1000 stocks to test a four-layer LSTM model. Their results show a stable out-of-sample prediction accuracy across time, indicating the potential benefits of deep learning methods. To the best of our knowledge, there is no prior work that combines CNNs with LSTMs to predict stock price movements, and this is the first extensive study to apply a nested CNN-LSTM model to raw market data. In particular, the usage of the Inception Module in this context is novel and is essential in inferring the optimal "decay rates" of the extracted features.

III. DATA, NORMALISATION AND LABELLING

A. Limit Order Books

We first introduce some basic definitions of limit order books (LOBs). For classical references on market microstructure the reader is referred to [48, 49], and for a short review of LOBs in particular we refer to [7]. Here we follow the conventions of [7]. A LOB has two types of orders: bid orders and ask orders. A bid (ask) order is an order to buy (sell) an asset at or below (above) a specified price.
After normalising the limit order data, we use the mid-price

$$ p_t = \frac{p_a^{(1)}(t) + p_b^{(1)}(t)}{2}, \qquad (1) $$

to create labels that represent the direction of price changes. Although no order can transact exactly at the mid-price, it expresses a general market value for an asset and it is frequently quoted when we want a single number to represent an asset price.

Because financial data is highly stochastic, if we simply compare p_t and p_{t+k} to decide the price movement, the resulting label set will be noisy. In the works of [1] and [26], two smoothed labelling methods are introduced. We briefly recall the two methods here. First, let m_- denote the mean of the previous k mid-prices and m_+ denote the mean of the next k mid-prices:

$$ m_-(t) = \frac{1}{k}\sum_{i=0}^{k} p_{t-i}, \qquad m_+(t) = \frac{1}{k}\sum_{i=1}^{k} p_{t+i}, \qquad (2) $$

where p_t is the mid-price defined in Equation (1) and k is the prediction horizon. Both methods use the percentage change (l_t) of the mid-price to decide directions. We can now define

$$ l_t = \frac{m_+(t) - p_t}{p_t} \qquad (3) $$

$$ l_t = \frac{m_+(t) - m_-(t)}{m_-(t)} \qquad (4) $$

Both are methods to define the direction of price movement at time t, where the former, Equation (3), was used in [1] and the latter, Equation (4), in [26].

The labels are then decided based on a threshold (α) for the percentage change (l_t). If l_t > α or l_t < −α, we define the movement as up (+1) or down (−1); anything else we consider stationary (0). Figure 2 provides a graphical illustration of the two labelling methods with the same threshold (α) and the same prediction horizon (k). All the labels classified as down (−1) are shown as red areas and up (+1) as green areas. The uncoloured (white) regions correspond to stationary (0) labels.

Figure 2. An example of the two smoothed labelling methods based on the same threshold (α) and the same prediction horizon (k). Green shading represents a +1 signal and red a −1. Top: the method of [1]; Bottom: the method of [26].

The FI-2010 dataset [1] adopts the method in Equation (3) and we directly use their labels for a fair comparison to other methods. However, the produced labels are less consistent, as shown at the top of Figure 2, because this method fits closer to real prices, as smoothing is only applied to future prices. This is detrimental for designing trading algorithms: inconsistent signals lead to many redundant trading actions and thus incur larger transaction costs.

Further, the FI-2010 dataset was collected in 2010 and the instruments were less liquid compared to now. We experimented with the approach of [1] on our data from the London Stock Exchange and found the resulting labels to be rather stochastic; we therefore adopt the method in Equation (4) for our LSE dataset to produce more consistent signals.
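For illustration, the labelling procedure can be sketched as follows (our own code with hypothetical names; it follows Equations (1), (2) and (4) and the threshold rule literally, including Equation (2)'s 1/k normalisation of the (k+1)-term sum defining m_-(t)):

```python
import numpy as np

def mid_price(best_ask: np.ndarray, best_bid: np.ndarray) -> np.ndarray:
    # Equation (1): the average of the best ask and best bid prices.
    return (best_ask + best_bid) / 2.0

def smooth_labels(p: np.ndarray, k: int, alpha: float) -> np.ndarray:
    """Label each time step +1 (up), -1 (down) or 0 (stationary) from the
    smoothed percentage change l_t of Equation (4)."""
    labels = np.zeros(len(p), dtype=int)
    for t in range(k, len(p) - k):
        m_minus = p[t - k:t + 1].sum() / k      # Eq. (2), i = 0, ..., k
        m_plus = p[t + 1:t + k + 1].sum() / k   # Eq. (2), i = 1, ..., k
        l_t = (m_plus - m_minus) / m_minus      # Eq. (4)
        if l_t > alpha:
            labels[t] = 1
        elif l_t < -alpha:
            labels[t] = -1
    return labels
```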
IV. MODEL ARCHITECTURE

A. Overview

We here detail our network architecture, which comprises three main building blocks: standard convolutional layers, an Inception Module and a LSTM layer, as shown in Figure 3. The main idea of using CNNs and Inception Modules is to automate the process of feature extraction, which is often difficult in financial applications since financial data is notoriously noisy, with a low signal-to-noise ratio. Technical indicators such as MACD and the Relative Strength Index are often included as inputs, and preprocessing mechanisms such as principal component analysis (PCA) [51] are often used to transform raw inputs. However, none of these processes is trivial: they make tacit assumptions, and further, it is questionable whether financial data can be well described by parametric models with fixed parameters. In our work, we only require the history of LOB prices and sizes as inputs to our algorithm. Weights are learned during training and features, learned from a large training set, are data-adaptive, removing the above constraints. A LSTM layer is then used to capture additional time dependencies among the resulting features. We note that very short time-dependencies are already captured in the convolutional layer, which takes "space-time images" of the LOB as inputs.
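As a sketch of this overall design (a simplified illustration on our part, with indicative layer sizes rather than the exact published configuration), the three blocks can be assembled in Keras as follows:

```python
from tensorflow.keras import layers, Model

def inception_module(x, n_filters=32):
    """Parallel branches over different time horizons, merged by concatenation
    (a sketch of the module detailed in Section IV-B and Figure 4)."""
    b1 = layers.Conv2D(n_filters, (1, 1), padding="same", activation="relu")(x)
    b1 = layers.Conv2D(n_filters, (3, 1), padding="same", activation="relu")(b1)
    b2 = layers.Conv2D(n_filters, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(n_filters, (5, 1), padding="same", activation="relu")(b2)
    b3 = layers.MaxPooling2D((3, 1), strides=(1, 1), padding="same")(x)
    b3 = layers.Conv2D(n_filters, (1, 1), padding="same", activation="relu")(b3)
    return layers.concatenate([b1, b2, b3], axis=-1)

def build_model(T=100, n_features=40):
    """Illustrative CNN + Inception + LSTM network in the spirit of Figure 3."""
    inp = layers.Input(shape=(T, n_features, 1))   # a "space-time image" of the LOB
    # Convolutional block: fuse price/size pairs, then neighbouring levels.
    x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(inp)
    x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(16, (1, 2), strides=(1, 2), activation="relu")(x)
    x = layers.Conv2D(16, (4, 1), padding="same", activation="relu")(x)
    x = layers.Conv2D(16, (1, 10), activation="relu")(x)  # collapse remaining levels
    x = inception_module(x, n_filters=32)
    # Flatten the level axis and capture longer time dependencies with a LSTM.
    x = layers.Reshape((T, -1))(x)
    x = layers.LSTM(64)(x)
    out = layers.Dense(3, activation="softmax")(x)        # up / stationary / down
    return Model(inp, out)
```

The strided (1, 2) convolutions first pair each price with its size and then fuse neighbouring levels, so the LSTM receives one compact feature vector per time step.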
B. Details of Each Component

a) Convolutional Layer: Modern electronic trading algorithms often submit and cancel vast numbers of limit orders over short periods of time as part of their trading strategies [52]. These actions often take place deep in a LOB, and it is observed [7] that more than 90% of orders end in cancellation rather than matching; practitioners therefore consider levels further away from the best bid and ask to be less useful in any LOB. In addition, the work of [53] suggests that the best ask and best bid (L1-Ask and L1-Bid) contribute most to price discovery and that the contribution of all other levels is considerably less, estimated at as little as 20%. As a result, it would be otiose to feed all level information to a neural network, as levels deep in a LOB are less useful and can potentially even be misleading. Naturally, we can smooth these signals by summarising the information contained in deeper levels. We note that convolution filters used in any CNN architecture are discrete convolutions, or finite impulse response (FIR) filters, from the viewpoint of signal processing [54].
… solution to extract invariant features.

b) Inception Module: We note that all filters of a standard convolutional layer have a fixed size. If, for example, we employ filters of size (4 × 1), we capture local interactions amongst data over four time steps. However, we can capture dynamic behaviours over multiple timescales by using Inception Modules to wrap several convolutions together. We find that this offers a performance improvement to the resultant model.

The idea of the Inception Module can also be considered as using different moving averages in technical analysis. Practitioners often use moving averages with different decay weights to observe time-series momentum [63]. If a large decay weight is adopted, we get a smoother time-series that well represents the long-term trend, but we could miss small variations that are important in high-frequency data. In practice, it is a daunting task to set the right decay weights. Instead, we can use Inception Modules, and the weights are then learned during back-propagation.

In our case, we split the input into a small set of lower-dimensional representations by using 1 × 1 convolutions, transform the representations by a set of filters, here 3 × 1 and 5 × 1, and then merge the outputs. A max-pooling layer is used inside the Inception Module, with stride 1 and zero padding. "Inception@32" represents one module and indicates that all convolutional layers in this module have 32 filters; the approach is depicted schematically in Figure 4. The 1 × 1 convolutions form the Network-in-Network approach proposed in [64]. Instead of applying a simple convolution to our data, the Network-in-Network method uses a small neural network to capture non-linear properties of our data. We find this method to be effective and it gives us an improvement in prediction accuracy.

Figure 4. The Inception Module used in the model. For example, 3 × 1@32 represents a convolutional layer with 32 filters of size (3 × 1).

c) LSTM Module and Output: In general, a fully connected layer is used to classify the input data. However, all inputs to the fully connected layer are assumed independent of each other unless multiple fully connected layers are used. Due to the usage of the Inception Module in our work, we have a large number of features at the end. Just using one fully connected layer with 64 units would result in more than 630,000 parameters.

V. EXPERIMENTAL RESULTS

A. Experiments Settings

We apply the same architecture to all our experiments in this section and the proposed model is denoted as DeepLOB. We learn the parameters by minimising the categorical cross-entropy loss. The Adaptive Moment Estimation algorithm, ADAM [65], is utilised; we set the parameter "epsilon" to 1 and the learning rate to 0.01. The learning is stopped when validation accuracy does not improve for 20 more epochs. This is about 100 epochs for the FI-2010 dataset and 40 epochs for the LSE dataset.

We train with mini-batches of size 32. We choose a small mini-batch size due to the findings in [66], which suggest that large-batch methods tend to converge to narrow deep minima of the training functions, whereas small-batch methods consistently converge to shallow broad minima. All models are built using Keras [67] on the TensorFlow backend [68], and we train them using a single NVIDIA Tesla P100 GPU.
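As an illustration, these settings translate into the following Keras sketch (build_model is the illustrative network from Section IV; the random placeholder arrays only fix the assumed input shapes):

```python
import numpy as np
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Placeholder data with the shapes assumed in Section IV: 100 time steps x 40
# LOB features, one-hot labels over {down, stationary, up}.
X_train = np.random.rand(1000, 100, 40, 1)
y_train = np.eye(3)[np.random.randint(0, 3, 1000)]
X_val = np.random.rand(200, 100, 40, 1)
y_val = np.eye(3)[np.random.randint(0, 3, 200)]

model = build_model()  # the illustrative network sketched in Section IV
model.compile(optimizer=Adam(learning_rate=0.01, epsilon=1.0),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Stop when validation accuracy has not improved for 20 epochs.
early_stop = EarlyStopping(monitor="val_accuracy", patience=20,
                           restore_best_weights=True)

model.fit(X_train, y_train,
          batch_size=32,               # small mini-batches, following [66]
          epochs=200,
          validation_data=(X_val, y_val),
          callbacks=[early_stop])
```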
B. Experiments on the FI-2010 Dataset

There are two experimental setups using the FI-2010 dataset. Following the convention of [24], we denote them as Setup 1 and Setup 2. Setup 1 splits the dataset into 9 folds on a day basis (a standard anchored forward split): in the i-th fold, we train our model on the first i days and test it on the (i + 1)-th day, where i = 1, ..., 9. The second setting, Setup 2, originates from the works [26, 28, 27, 25] in which deep network architectures were evaluated. As deep learning techniques often require a large amount of data to calibrate weights, the first 7 days are used as the training data and the last 3 days are used as the test data in this setup. We evaluate our model in both setups here.
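A minimal sketch of this anchored split (our own illustration, with days indexed from 0):

```python
def anchored_folds(n_days=10):
    """Setup 1: train on the first i days, test on day i + 1, i = 1, ..., 9."""
    for i in range(1, n_days):
        yield list(range(i)), i   # (training days, test day)

for train_days, test_day in anchored_folds():
    print(f"fold {test_day}: train on days {train_days}, test on day {test_day}")
```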
Table I shows the results of our model compared to other methods in Setup 1. Performance is measured by calculating the mean accuracy, recall, precision and F1 score over all folds. As the FI-2010 dataset is not well balanced, [1] suggests focusing on the F1 score for fair comparisons. We have compared our model to all existing experimental results, including Ridge Regression (RR) [1], a Single-Layer Feedforward Network (SLFN) [1], Linear Discriminant Analysis (LDA) [22], Multilinear Discriminant Analysis (MDA) [22], Multilinear Time-series Regression (MTR) [22], Weighted Multilinear Time-series Regression (WMTR) [22], Multilinear Class-specific Discriminant Analysis (MCSDA) [23], Bag-of-Features (BoF) [24], Neural Bag-of-Features (N-BoF) [24], and the Attention-augmented Bilinear Network with one hidden layer (B(TABL)) and two hidden layers (C(TABL)) [25]. More methods, such as PCA and Autoencoders (AE), are tested in their works but, for simplicity, we only report their best results; our model achieves better performance.

Table I. Setup 1: Experiment results for the FI-2010 dataset.

Table II. Setup 2: Experiment results for the FI-2010 dataset.

However, Setup 1 is not ideal for training deep learning models since, as we mentioned, deep networks often require a large amount of data to calibrate weights. This anchored forward setup leaves only one or two days of training data for the first few folds, and we observe worse performance in the first few days. As the training data grows, we observe remarkably better results, as shown in Table II, which compares our network with other methods in Setup 2. In particular, the important difference between our model and CNN-I [26] and CNN-II [27] is the network architecture, and we can see huge improvements in performance here. In Table III, we compare the parameter sizes of DeepLOB with CNN-I [26]. Although our model has many more layers, there are far fewer parameters in our network due to the usage of LSTM layers instead of fully connected layers.

We also report the computation time (forward pass) in milliseconds (ms) for the available algorithms in Table III. Due to the development of GPUs, training deep networks is now feasible and predictions are swift, making high-frequency trading applications possible. We will discuss this further in the next section.

Table III. Forward-pass time and number of parameters.

Models         Forward (ms)   Number of parameters
BoF [24]       0.972          86k
N-BoF [24]     0.524          12k
CNN-I [26]     0.025          768k
LSTM [28]      0.061          -
C(TABL) [25]   0.229          -
DeepLOB        0.253          60k

C. Experiments on the London Stock Exchange (LSE)

As we suggested, the FI-2010 dataset is not sufficient to verify a prediction model: it is far too short, down-sampled and taken from a less liquid market. To perform a meaningful evaluation that can hold up to modern applications, we further test our method on one year of stock data from the LSE, with a testing period of three months. As mentioned in Section III, we train our model on five stocks: Lloyds Bank (LLOY), Barclays (BARC), Tesco (TSCO), BT and Vodafone (VOD). Recent work of [20] suggests that deep learning techniques can extract universal features from limit order data. To test this universality, we directly apply our trained model to five further stocks that were not part of the training data set (transfer learning). We select HSBC, Glencore (GLEN), Centrica (CNA), BP and ITV for transfer learning because they are also among the most liquid stocks on the LSE. The testing period is the same three months as before, and the classes are roughly balanced.

Table IV presents the results of our model for all stocks on different prediction horizons. To better investigate the results, …
Table IV. Experiment results for the LSE dataset.

Results on LLOY, BARC, TSCO, BT and VOD:

Prediction Horizon   Accuracy %   Precision %   Recall %   F1 %
k=20                 70.17        70.17         70.17      70.15
k=50                 63.93        63.43         63.93      63.49
k=100                61.52        60.73         61.52      60.65

Results on transfer learning (GLEN, HSBC, CNA, BP, ITV):

Prediction Horizon   Accuracy %   Precision %   Recall %   F1 %
k=20                 68.62        68.64         68.63      68.48
k=50                 63.44        62.81         63.45      62.84
k=100                61.46        60.68         61.46      60.77

Figure 5. Confusion matrices over the classes down, stationary and up for the different prediction horizons.

Figure 6. Boxplots of daily accuracy for the different prediction horizons (k = 20, 50, 100). Top: results on LLOY, BARC, TSCO, BT and VOD; Bottom: results on transfer learning (GLEN, HSBC, CNA, BP, ITV).

Figure 7. Boxplots of normalised daily profits and t-statistics for different stocks and prediction horizons (k). Profits are in GBX (= GBP/100).
… stocks and prediction horizons. We use a t-test to check whether the profits are statistically greater than 0. The t-statistic is essentially the same as a Sharpe ratio, but a more consistent evaluation metric for high-frequency trading. Figure 8 shows the cumulative profits across the testing period. We observe consistent profits and significant t-values over the testing period for all stocks. Although we obtain worse accuracy for longer prediction horizons, the cumulative profits are actually higher, as a more robust signal is generated.
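Concretely, the statistic we use is the one-sample t-statistic of the normalised daily profits (a sketch; daily_profits is assumed to hold one entry per trading day):

```python
import numpy as np

def profit_t_statistic(daily_profits: np.ndarray) -> float:
    """t-statistic for the null hypothesis of zero mean daily profit; up to
    the sqrt(n) factor this is a (non-annualised) Sharpe ratio."""
    n = len(daily_profits)
    return daily_profits.mean() / (daily_profits.std(ddof=1) / np.sqrt(n))
```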
E. Sensitivity Analysis

Trust and risk are fundamental in any financial application. If we take actions based on predictions, it is always important to understand the reasons behind those predictions. Neural networks are often considered "black boxes" which lack interpretability. However, if we understand the relationship between the inputs' components (e.g. words in text, patches in an image) and the model's prediction, we can compare those relationships with our domain knowledge to decide if we can accept or reject a prediction.

The work of [10] proposes a method, which they call LIME, to obtain such explanations. In our case, we use LIME to reveal the components of LOBs that are most important for predictions and to understand why the proposed model DeepLOB works better than other network architectures such as CNN-I [26]. LIME uses an interpretable model to approximate the prediction of a complex model on a given input. It locally perturbs the input and observes variations in the model's predictions, thus providing some measure of information regarding input importance and sensitivity.
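The sketch below mimics this procedure in a simplified form (our illustration rather than the LIME package itself: components are perturbed by zeroing them out, and a ridge regression serves as the interpretable local surrogate; predict_fn stands for, e.g., model.predict):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_importance(predict_fn, x, n_samples=500, class_idx=0, seed=0):
    """Perturb components of one input at random, record the model's predicted
    probability for the class of interest, and fit a local linear surrogate
    whose coefficients rank component importance."""
    rng = np.random.default_rng(seed)
    d = x.size
    Z = rng.integers(0, 2, size=(n_samples, d))   # random on/off pattern per component
    X_pert = Z * x.reshape(1, -1)                 # "off" components are zeroed out
    probs = predict_fn(X_pert.reshape((n_samples,) + x.shape))[:, class_idx]
    surrogate = Ridge(alpha=1.0).fit(Z, probs)    # interpretable local model
    return surrogate.coef_                        # importance of each component
```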
Figure 9 presents an example that shows how DeepLOB and CNN-I [26] react to a given input. In the figure we show the top 10 regions for (in green) and against (in red) the predicted class, with yellow marking the boundary. Uncoloured areas represent the components of the inputs that are less influential on the predicted results, or "unimportant". We note that most components of the input are inactive for CNN-I [26]. We believe that this is due to the two max-pooling layers used in that architecture. Because [26] used large filters in the first convolutional layer, any representation deep in the network actually represents information gleaned from a large portion of the inputs. Our experiments applying LIME to many examples indicate that this observation is a common feature.

VI. CONCLUSION

In this paper, we introduce the first hybrid deep neural network to predict stock price movements using high-frequency limit order data. Unlike traditional hand-crafted models, where
features are carefully designed, we utilise a CNN and an Inception Module to automate feature extraction and use LSTM units to capture time dependencies.

The proposed method is evaluated against several baseline methods on the FI-2010 benchmark dataset and the results show that our model performs better than other techniques at predicting short-term price movements. We further test the robustness of our model by using one year of limit order data from the LSE with a testing period of three months. An interesting observation from our work is that the proposed model generalises well to instruments that did not form part of the training data. This suggests the existence of universal features that are informative for price formation; our model appears to capture these features, learning from a large data set that includes several instruments. A simple trading simulation is used to further test our model and we obtain good profits that are statistically significant.

To go beyond the often-criticised "black box" nature of deep learning models, we use LIME, a method for sensitivity analysis, to indicate the components of the inputs that contribute to predictions. A good understanding of the relationship between the input's components and the model's prediction can help us decide whether to accept a prediction. In particular, we see how the information of prices and sizes at different levels and horizons contributes to the prediction, in accordance with our econometric understanding.

In a recent extension of this work we have modified the DeepLOB model to use Bayesian neural networks [69]. This allows us to provide uncertainty measures on the network's outputs which, for example, can be used to upsize positions, as demonstrated in [69].

In subsequent continuations of this work we would like to investigate more detailed trading strategies, using Reinforcement Learning …

Figure 9. LIME plots. The x-axis represents time stamps and the y-axis represents levels of the LOB, as labelled in the top image. Top: original image. Middle: importance regions for CNN-I [26]. Bottom: importance regions for the DeepLOB model. Regions supporting the prediction are shown in green, and regions against in red. The boundary is shown in yellow.

REFERENCES

[1] A. Ntakaris, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Benchmark dataset for mid-price forecasting of limit order book data with machine learning methods," Journal of Forecasting, vol. 37, no. 8, pp. 852–866, 2018.
[2] C. A. Parlour and D. J. Seppi, "Limit order markets: A survey," Handbook of Financial Intermediation and Banking, vol. 5, pp. 63–95, 2008.
[3] I. Rosu et al., "Liquidity and information in order driven markets," Tech. Rep., 2010.
[4] E. Zivot and J. Wang, "Vector autoregressive models for multivariate time series," Modeling Financial Time Series with S-PLUS, pp. 385–429, 2006.
[5] A. A. Ariyo, A. O. Adewumi, and C. K. Ayo, "Stock price prediction using the ARIMA model," in Computer Modelling and Simulation (UKSim), 2014 UKSim-AMSS 16th International Conference on. IEEE, 2014, pp. 106–112.
[6] C. Carrie, "The new electronic trading regime of dark books, mashups and algorithmic trading," Trading, vol. 2006, no. 1, pp. 14–20, 2006.
[7] M. D. Gould, M. A. Porter, S. Williams, M. McDonald, D. J. Fenn, and S. D. Howison, "Limit order books," Quantitative Finance, vol. 13, no. 11, pp. 1709–1742, 2013.
[8] W.-C. Chiang, D. Enke, T. Wu, and R. Wang, "An adaptive stock index trading decision support system," Expert Systems with Applications, vol. 59, pp. 195–207, 2016.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[10] M. T. Ribeiro, S. Singh, and C. Guestrin, "Why should I trust you?: Explaining the predictions of any classifier," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016, pp. 1135–1144.
[11] A. Ang and G. Bekaert, "Stock return predictability: Is it there?" The Review of Financial Studies, vol. 20, no. 3, pp. 651–707, 2006.
[12] P. Bacchetta, E. Mertens, and E. Van Wincoop, "Predictability in financial markets: What do survey expectations tell us?" Journal of International Money and Finance, vol. 28, no. 3, pp. 406–426, 2009.
[13] T. Bollerslev, J. Marrone, L. Xu, and H. Zhou, "Stock return predictability and variance risk premia: Statistical inference and international evidence," Journal of Financial and Quantitative Analysis, vol. 49, no. 3, pp. 633–661, 2014.
[14] M. A. Ferreira and P. Santa-Clara, "Forecasting stock market returns: The sum of the parts is more than the whole," Journal of Financial Economics, vol. 100, no. 3, pp. 514–537, 2011.
[15] B. Mandelbrot and R. L. Hudson, The Misbehavior of Markets: A Fractal View of Financial Turbulence. Basic Books, 2007.
[16] B. B. Mandelbrot, "How fractals can explain what's wrong with Wall Street," Scientific American, vol. 15, no. 9, 2008.
[17] J. Agrawal, V. Chourasia, and A. Mittra, "State-of-the-art in stock prediction techniques," International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering, vol. 2, no. 4, pp. 1360–1366, 2013.
[18] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega, and A. L. Oliveira, "Computational intelligence and financial markets: A survey and future directions," Expert Systems with Applications, vol. 55, pp. 194–211, 2016.
[19] Q. Cao, K. B. Leggio, and M. J. Schniederjans, "A comparison between Fama and French's model and artificial neural networks in predicting the Chinese stock market," Computers & Operations Research, vol. 32, no. 10, pp. 2499–2512, 2005.
[20] J. Sirignano and R. Cont, "Universal features of price formation in financial markets: Perspectives from deep learning," arXiv preprint arXiv:1803.06917, 2018.
[21] G. S. Atsalakis and K. P. Valavanis, "Surveying stock market forecasting techniques – Part II: Soft computing methods," Expert Systems with Applications, vol. 36, no. 3, pp. 5932–5941, 2009.
[22] D. T. Tran, M. Magris, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Tensor representation in high-frequency financial data for price change prediction," in Computational Intelligence (SSCI), 2017 IEEE Symposium Series on. IEEE, 2017, pp. 1–7.
[23] D. T. Tran, M. Gabbouj, and A. Iosifidis, "Multilinear class-specific discriminant analysis," Pattern Recognition Letters, vol. 100, pp. 131–136, 2017.
[24] N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Temporal bag-of-features learning for predicting mid price movements using high frequency limit order book data," IEEE Transactions on Emerging Topics in Computational Intelligence, 2018.
[25] D. T. Tran, A. Iosifidis, J. Kanniainen, and M. Gabbouj, "Temporal attention-augmented bilinear network for financial time-series data analysis," IEEE Transactions on Neural Networks and Learning Systems, 2018.
[26] A. Tsantekidis, N. Passalis, A. Tefas, J. Kanniainen, M. Gabbouj, and A. Iosifidis, "Forecasting stock prices from the limit order book using convolutional neural networks," in Business Informatics (CBI), 2017 IEEE 19th Conference on, vol. 1. IEEE, 2017, pp. 7–12.
[27] ——, "Using deep learning for price prediction by exploiting stationary limit order book features," arXiv preprint arXiv:1810.09965, 2018.
[28] ——, "Using deep learning to detect price change indications in financial markets," in Signal Processing Conference (EUSIPCO), 2017 25th European. IEEE, 2017, pp. 2511–2515.
[29] M. Dixon, D. Klabjan, and J. H. Bang, "Classification-based financial markets prediction using deep neural networks," Algorithmic Finance, vol. 6, no. 3-4, pp. 67–77, 2017.
[30] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.
[31] N. Wang and D.-Y. Yeung, "Learning a deep compact image representation for visual tracking," in Advances in Neural Information Processing Systems, 2013, pp. 809–817.
[32] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[33] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[34] J.-F. Chen, W.-L. Chen, C.-P. Huang, S.-H. Huang, and A.-P. Chen, "Financial time-series data analysis using deep convolutional neural networks," in Cloud Computing and Big Data (CCBD), 2016 7th International Conference on. IEEE, 2016, pp. 87–92.
[35] J. Doering, M. Fairbank, and S. Markose, "Convolutional neural networks applied to high-frequency market microstructure forecasting," in Computer Science and Electronic Engineering (CEEC), 2017. IEEE, 2017, pp. 31–36.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[37] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015.
[38] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[39] Y. Bengio, P. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
[40] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Thirteenth Annual Conference of the International Speech Communication Association, 2012.
[41] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
[42] W. Bao, J. Yue, and Y. Rao, "A deep learning framework for financial time series using stacked autoencoders and long-short term memory," PLoS ONE, vol. 12, no. 7, p. e0180944, 2017.
[43] S. Selvin, R. Vinayakumar, E. Gopalakrishnan, V. K. Menon, and K. Soman, "Stock price prediction using LSTM, RNN and CNN-sliding window model," in Advances in Computing, Communications and Informatics (ICACCI), 2017 International Conference on. IEEE, 2017, pp. 1643–1647.
[44] T. Fischer and C. Krauss, "Deep learning with long short-term memory networks for financial market predictions," European Journal of Operational Research, vol. 270, no. 2, pp. 654–669, 2018.
[45] L. Di Persio and O. Honchar, "Artificial neural networks architectures for stock price prediction: Comparisons and applications," International Journal of Circuits, Systems and Signal Processing, vol. 10, pp. 403–413, 2016.
[46] M. Dixon, "Sequence classification of the limit order book using recurrent neural networks," Journal of Computational Science, vol. 24, pp. 277–286, 2018.
[47] D. M. Nelson, A. C. Pereira, and R. A. de Oliveira, "Stock market's price movement prediction with LSTM neural networks," in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 1419–1426.
[48] L. Harris, Trading and Exchanges: Market Microstructure for Practitioners. Oxford University Press, USA, 2003.
[49] M. O'Hara, Market Microstructure Theory. Blackwell Publishers, Cambridge, MA, 1995, vol. 108.
[50] A. N. Kercheval and Y. Zhang, "Modelling high-frequency limit order book dynamics with support vector machines," Quantitative Finance, vol. 15, no. 8, pp. 1315–1329, 2015.
[51] A. Abraham, B. Nath, and P. K. Mahanti, "Hybrid intelligent systems for stock market analysis," in International Conference on Computational Science. Springer, 2001, pp. 337–345.
[52] T. Hendershott, C. M. Jones, and A. J. Menkveld, "Does algorithmic trading improve liquidity?" The Journal of Finance, vol. 66, no. 1, pp. 1–33, 2011.
[53] C. Cao, O. Hansch, and X. Wang, "The information content of an open limit-order book," Journal of Futures Markets, vol. 29, no. 1, pp. 16–41, 2009.
[54] S. J. Orfanidis, Introduction to Signal Processing. Prentice-Hall, Inc., 1995.
[55] J. Gatheral and R. C. Oomen, "Zero-intelligence realized variance estimation," Finance and Stochastics, vol. 14, no. 2, pp. 249–283, 2010.
[56] Y. Nevmyvaka, Y. Feng, and M. Kearns, "Reinforcement learning for optimized trade execution," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 673–680.
[57] M. Avellaneda, J. Reed, and S. Stoikov, "Forecasting prices from Level-I quotes in the presence of hidden liquidity," Algorithmic Finance, vol. 1, no. 1, pp. 35–43, 2011.
[58] Y. Burlakov, M. Kamal, and M. Salvadore, "Optimal limit order execution in a simple model for market microstructure dynamics," 2012.
[59] L. Harris, "Maker-taker pricing effects on market quotations," USC Marshall School of Business Working Paper. Available at http://bschool.huji.ac.il/.upload/hujibusiness/Maker-taker.pdf, 2013.
[60] A. Lipton, U. Pesavento, and M. G. Sotiropoulos, "Trade arrival dynamics and quote imbalance in a limit order book," arXiv preprint arXiv:1312.0514, 2013.
[61] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. ICML, vol. 30, no. 1, 2013, p. 3.
[62] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[63] T. J. Moskowitz, Y. H. Ooi, and L. H. Pedersen, "Time series momentum," Journal of Financial Economics, vol. 104, no. 2, pp. 228–250, 2012.
[64] M. Lin, Q. Chen, and S. Yan, "Network in network," in International Conference on Learning Representations, 2014.
[65] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations, 2015.
[66] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," in International Conference on Learning Representations, 2017.
[67] F. Chollet et al., "Keras," https://keras.io, 2015.
[68] M. Abadi, A. Agarwal, P. Barham et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[69] Z. Zhang, S. Zohren, and S. Roberts, "BDLOB: Bayesian deep convolutional neural networks for limit order books," in Third Workshop on Bayesian Deep Learning (NeurIPS 2018), 2018.