
Enhancing a Pairs Trading strategy

with the application of Machine Learning


Simão Moraes Sarmento Nuno Horta
[email protected] [email protected]

Abstract—Pairs Trading is one of the most valuable market-neutral strategies used by hedge funds. It is particularly interesting as it overcomes the arduous process of valuing securities by focusing on relative pricing. By buying a relatively undervalued security and selling a relatively overvalued one, a profit can be made upon the pair's price convergence. However, with the growing availability of data, it became increasingly harder to find rewarding pairs. In this work we address two problems: (i) how to find profitable pairs while constraining the search space and (ii) how to avoid long decline periods due to prolonged divergent pairs. To manage these difficulties, the application of promising Machine Learning techniques is investigated in detail. We propose the integration of an Unsupervised Learning algorithm, OPTICS, to handle problem (i). The results obtained demonstrate the suggested technique can outperform the common pairs' search methods, achieving an average portfolio Sharpe ratio of 3.79, in comparison to 3.58 and 2.59 obtained by standard approaches. For problem (ii), we introduce a forecasting-based trading model, capable of reducing the periods of portfolio decline by 75%. Yet, this comes at the expense of decreasing overall profitability. The proposed strategy is tested using an ARMA model, an LSTM and an LSTM Encoder-Decoder. This work's results are simulated during varying periods between January 2009 and December 2018, using 5-minute price data from a group of 208 commodity-linked ETFs, and accounting for transaction costs.

Index Terms—Pairs Trading, Market Neutral, Machine Learning, Deep Learning, Unsupervised Learning

I. INTRODUCTION

Pairs Trading is a popular trading strategy widely used by hedge funds and investment banks. It is capable of obtaining profits irrespective of the market direction.

This is accomplished with a two-step procedure. First, a pair of assets whose prices have historically moved together is detected. Then, assuming the equilibrium relationship should persist in the future, the spread between the prices of the two assets is monitored and, in case it deviates from its historical mean, the investor shorts the overvalued asset and buys the undervalued one. Both positions are closed upon convergence.

However, with the growing availability of data, it is becoming increasingly harder to find robust pairs. In this work, we address two problems in specific: (i) how to find profitable pairs while constraining the search space and (ii) how to avoid long decline periods due to prolonged divergent pairs.

The remainder of this document is organized as follows: in section II we introduce the main concepts of Pairs Trading while describing the associated research work. In section III we suggest a new pairs selection framework to address the first problem motivating this research work. In section IV we propose a new trading model in response to the second problem on the origin of this work. Next, in section V we design the simulation environment to test the proposed approaches, for which the results are presented in section VI.

II. BACKGROUND AND RELATED WORK

Each stage composing a Pairs Trading strategy is described in detail along with the most relevant related work.

A. Pairs Selection

The pairs selection stage encompasses (i) finding the appropriate candidate pairs and (ii) selecting the most promising ones.

Starting with (i), the investor should select the securities of interest (e.g. stocks, ETFs, etc.) and search for possible combinations. In the literature, two methodologies are typically suggested for this stage: performing an exhaustive search for all possible combinations among the selected securities, or grouping them by sector and constraining the combinations to pairs formed by securities within the same sector. While the former may find more unusual interesting pairs, the latter reduces the likelihood of finding spurious relations. For example, [1, 2] impose no restriction on the universe from which to select the pairs. Contrarily, some research work, as [3–5], arranges the securities in category groups and selects pairs within the same group.

Concerning (ii), the investor must define what criteria should be used to select a pair. The most common approaches are the distance, correlation, and cointegration approaches.

The distance approach, suggested in [3], selects pairs which minimize the historic sum of squared distances between the two assets' price series. This method is widely used but according to [6] it is analytically suboptimal. If p_{i,t} is a price realization of the normalized price process P_i = (P_{i,t})_{t∈T} of an asset i, the average sum of squared distances ssd_{P_i,P_j} in the formation period¹ of a pair formed by assets i and j is given by

ssd_{P_i,P_j} = \frac{1}{T} \sum_{t=1}^{T} (p_{i,t} - p_{j,t})^2   (1)

Thus, an optimal pair would be one that minimizes Eq. (1). However, this implies a zero-spread pair is considered optimal, which logically may not be, as it would not provide trading chances.

¹ The formation period corresponds to the period in which securities are being analyzed to form potential pairs.
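As a side illustration of the distance criterion, Eq. (1) can be computed directly from two formation-period price series. The sketch below normalizes each series by its first value, which is one possible choice since the text only requires normalized prices:

import numpy as np
import pandas as pd

def ssd(price_i: pd.Series, price_j: pd.Series) -> float:
    """Average sum of squared distances of Eq. (1) over the formation period."""
    p_i = price_i / price_i.iloc[0]   # normalize each price series to start at 1
    p_j = price_j / price_j.iloc[0]
    return float(np.mean((p_i - p_j) ** 2))

Under the distance approach, candidate pairs would then be ranked by this value in ascending order.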
The application of Pearson correlation as a selection metric is analyzed in [7]. The authors examine its application on return series with the same data sample used in [3] and find that correlation shows better performance, with a reported monthly average of 1.70% raw returns, almost twice as high as the one obtained using the distance approach. Nevertheless, this criterion is not foolproof, as two return-level correlated securities might not share an equilibrium relationship, and divergence reversions cannot be explained theoretically.

At last, the cointegration approach entails selecting pairs for which the two constituents are found to be cointegrated. If two securities, Y_t and X_t, are found to be cointegrated, then by definition the series constructed as

S_t = Y_t - \beta X_t   (2)

where β is the cointegration factor, must be stationary. Defining the spread series in this way is particularly convenient since under these conditions the spread is expected to be mean-reverting, meaning that every spread divergence is expected to be followed by convergence. Hence, this approach finds econometrically more sound equilibrium relationships. The most cited work in this field is [8], which proposes a set of heuristics for cointegration-based strategies. Furthermore, [9] performs a comparison study between the cointegration approach and the distance approach and finds that the cointegration approach significantly outperforms the distance method.

B. Trading Models

The most common trading model follows from [3], and can be described as indicated below:
i. Calculate the pair's spread (S_t = Y_t - X_t) mean, µ_s, and standard deviation, σ_s, during the formation period.
ii. Define the model thresholds: the threshold that triggers a long position, α_L, the threshold that triggers a short position, α_S, and the exit threshold, α_exit, that defines the level at which a position should be exited.
iii. Monitor the evolution of the spread, S_t, and control if any threshold is crossed.
iv. In case α_L is crossed, go long the spread by buying Y and selling X. If α_S is triggered, short the spread by selling Y and buying X. Exit the position when α_exit is crossed.

The simplicity of this model is particularly appealing, motivating its frequent application in the field. Nonetheless, the entry points defined may not be optimal since no information concerning the spread's subsequent direction is incorporated in the trading decision. Some efforts have emerged trying to propose more robust models. Techniques from different fields, such as stochastic control theory, statistical process modelling and Machine Learning, have been studied. In particular, the results obtained by Machine Learning approaches have proved very promising. Dunis et al. [10, 11] explore the application of Artificial Neural Networks to forecast the spread change for two famous spreads. Thomaidis et al. [12] propose an experimental statistical arbitrage system based on Neural Network Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models for modeling the mispricing-correction mechanism between relative prices composing a pair. Huck [13, 14] uses RNNs to generate a one-week-ahead forecast, from which the predicted returns are calculated. Lastly, Krauss et al. [1] analyze the effectiveness of deep neural networks, gradient-boosted trees and random forests in the context of statistical arbitrage using S&P 500 stocks. Apart from this, Machine Learning techniques still remain fairly unexplored in this field and the results obtained indicate this is a promising direction for future research.

III. PROPOSED PAIRS SELECTION FRAMEWORK

At this research stage we aim to explore how an investor may find promising pairs without exposing himself to the adversities of the common pairs searching techniques. On the one hand, if the investor limits his search to securities within the same sector, he is less likely to find pairs not yet being traded in large volumes, leaving a small margin for profit. But on the other hand, if the investor does not impose any limitation on the search space, he might have to explore excessive combinations and possibly find spurious relations. We intend to reach an equilibrium with the application of an Unsupervised Learning algorithm, on the expectation that it will infer meaningful clusters of assets from which to select the pairs.

A. Dimensionality reduction

The first step towards this direction consists in finding a compact representation for each asset, starting from its price series. The application of Principal Component Analysis (PCA) is proposed. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables, the principal components. Each component can be seen as representing a risk factor. We suggest the application of PCA on the normalized return series, defined as

R_{i,t} = \frac{P_{i,t} - P_{i,t-1}}{P_{i,t-1}}   (3)

where P_{i,t} is the price series of an asset i. Using the price series could result in the detection of spurious correlations due to underlying time trends. The number of principal components used defines the number of features for each asset representation. Considering that an Unsupervised Learning algorithm will be applied to these data, the number of features should not be large. High data dimensionality presents a dual problem. The first is that in the presence of more attributes, the likelihood of finding irrelevant features increases. Additionally, there is the problem of the curse of dimensionality, caused by the exponential increase in volume associated with adding extra dimensions to the space. According to [15], this effect starts to be severe for dimensions greater than 15. Taking this into consideration, the number of PCA dimensions is upper bounded at this value and is chosen empirically.
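A minimal sketch of this step with scikit-learn is shown below, assuming a hypothetical `prices` DataFrame of 5-minute prices with one column per ETF. Describing each asset by its loadings on the principal components is one possible way to realize the text, and the component count is simply capped as argued above:

import pandas as pd
from sklearn.decomposition import PCA

def compact_representation(prices: pd.DataFrame, n_components: int = 5) -> pd.DataFrame:
    """Per-asset feature vectors (Section III-A): PCA on the normalized returns of Eq. (3)."""
    returns = prices.pct_change().dropna()      # R_{i,t} = (P_{i,t} - P_{i,t-1}) / P_{i,t-1}
    pca = PCA(n_components=n_components)        # kept well below the 15-dimension boundary
    pca.fit(returns.values)                     # observations = time steps, variables = assets
    return pd.DataFrame(pca.components_.T,      # one row per asset, one column per risk factor
                        index=prices.columns,
                        columns=[f"pc{i + 1}" for i in range(n_components)])

The resulting matrix, one row per ETF, is what the clustering step described next would operate on.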
B. Unsupervised Learning clustering
Having constructed a compact representation for each asset,
a clustering technique may be applied. To decide which algo-
rithm is more appropriate, some problem-specific requisites
are first defined:
– No need to specify the number of clusters in advance.
– No need to group all securities.
– Strict assignment that accounts for outliers.
– No assumptions regarding the clusters' shape.

The assignment should be strict, otherwise it would increase the number of combinations when looking for pairs, which is conflicting with the initial goal. Also, by making the number of clusters data-driven, we introduce as little bias as possible. In addition, outliers should not be incorporated in the clusters, and therefore grouping all assets should not be enforced. Finally, due to the nonexistence of prior information that indicates the clusters should be regularly shaped, the selected algorithm should not adopt this assumption.

Taking into consideration the previously described requirements, a density-based clustering algorithm seems an appropriate choice. It forms clusters with arbitrary shapes and thus no gaussianity assumptions need to be adopted. It is naturally robust to outliers as it does not group every point in the data set. Furthermore, it requires no specification of the number of clusters.

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is the most influential in this category. Briefly, DBSCAN detects clusters of points based on their density. To accomplish that, two parameters need to be defined: ε, which specifies how close points should be to each other to be considered "neighbors", and minPts, the minimum number of points to form a cluster. From these two parameters, in conjugation with some concepts that we omit here², clusters of neighboring points are formed. Points falling in regions with fewer than minPts points within a circle of radius ε are classified as outliers, hence not affecting the results. In spite of the advantages stated so far, DBSCAN still carries one drawback. The algorithm is appropriate under the assumption that clusters are evenly dense. However, if regions in space have different densities, a fixed ε may be well adapted to one given cluster density but it might be unrealistic for another, as depicted in Figure 1. It is evident that clusters A, B, and C could eventually be found using the same ε, but A1 and A2 would not be distinguished.

Fig. 1. Clusters with varying density. Adapted from: [17]

The OPTICS algorithm proposed in [17] addresses this problem. OPTICS is based on DBSCAN, with the introduction of some important concepts that enable a varying-ε implementation. In this enhanced setting, the investor is only required to specify the parameter minPts, as the algorithm is capable of detecting the most appropriate ε' for each cluster³. Therefore, we propose using OPTICS not just to account for varying cluster densities but also to facilitate the investor's task.

² Interested readers may refer to [16].
³ This description is very simplified. We suggest the interested readers refer to [17].

C. Pairs selection criteria

Having generated the clusters of assets, it is still necessary to define a set of rules for selecting the pairs to trade. It is critical that the pairs' equilibrium persists. To enforce this, we propose the unification of methods applied in separate research work. According to the proposed criteria, a pair is selected if it complies with the four conditions described next. First, a pair is only deemed eligible for trading if the two securities forming the pair are cointegrated. To test this condition, we propose the application of the Engle-Granger test due to its simplicity. To protect from the test's reliance on the choice of dependent variable, we propose that the test is run for both possible selections of the dependent variable, and that the combination generating the lowest t-statistic is selected. Secondly, to provide more confidence in the mean-reversion character of the spread, an extra validation step is suggested. We resort to the concept of the Hurst exponent, H, which quantifies the relative tendency of a time series either to regress strongly to the mean or to follow a trend [18]. If H belongs to the range 0–0.5 it indicates that a time series is mean-reverting. Hence, we require that a pair's spread Hurst exponent is less than 0.5. In third place, we intend to discard stationary pairs with unsuitable timings. A mean-reverting spread by itself does not necessarily generate profits. There must be coherence between the mean-reversion time and the trading period. The half-life of mean-reversion is an indicator of how long it takes for a time series to mean-revert [19]. Therefore, we propose filtering out pairs for which the half-life takes extreme values: less than one day or more than one year. Lastly, we suggest enforcing that every spread crosses its mean at least twelve times per year, enforcing one trade per month on average.

D. Framework diagram

The three building blocks of the proposed framework have been described. Figure 2 illustrates how they connect. As we may observe, the initial state should comprise the price series for all the possible pairs' constituents. We assume this information is available to the investor. Then, by reducing the data dimensionality, each security may be described not just by its price series but also by the compact representation emerging from the application of PCA on the return series (State 1). Using this simplified representation, the OPTICS algorithm is capable of organizing the securities into clusters (State 2). Finally, we may search for pair combinations within the clusters and select those that verify the rules imposed.

Fig. 2. Pairs selection diagram.
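The pieces above can be tied together in a short sketch: scikit-learn's OPTICS groups the PCA feature vectors, candidate pairs are formed only within clusters, and each candidate is screened with the four rules of Section III-C. Everything below is illustrative rather than a reproduction of the authors' implementation: the `features` and `prices` DataFrames, the min_samples value, the 5% Engle-Granger significance level, the lagged-difference Hurst estimator, the AR(1) half-life estimate and the 78 × 252 bars-per-year figure are assumptions made for this sketch.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint
from sklearn.cluster import OPTICS
from itertools import combinations

def cluster_assets(features: pd.DataFrame, min_samples: int = 3) -> dict:
    """Group assets with OPTICS (Section III-B); label -1 marks outliers left unclustered."""
    labels = OPTICS(min_samples=min_samples).fit(features.values).labels_
    clusters = {}
    for ticker, label in zip(features.index, labels):
        if label != -1:
            clusters.setdefault(label, []).append(ticker)
    return clusters

def hurst(series: np.ndarray, max_lag: int = 100) -> float:
    """Hurst exponent from the scaling of lagged differences; H < 0.5 suggests mean reversion."""
    lags = np.arange(2, max_lag)
    tau = [np.std(series[lag:] - series[:-lag]) for lag in lags]
    return float(np.polyfit(np.log(lags), np.log(tau), 1)[0])

def half_life(spread: np.ndarray) -> float:
    """Half-life of mean reversion from an AR(1) fit on the spread [19]."""
    beta = sm.OLS(np.diff(spread), sm.add_constant(spread[:-1])).fit().params[1]
    return float(-np.log(2) / beta)

def is_eligible(y: pd.Series, x: pd.Series, bars_per_year: int) -> bool:
    """The four selection rules of Section III-C (5% cointegration level assumed here)."""
    # 1) Engle-Granger test run for both choices of dependent variable;
    #    keep the ordering that yields the lowest t-statistic.
    t_yx, p_yx, _ = coint(y, x)
    t_xy, p_xy, _ = coint(x, y)
    dep, indep, p_value = (y, x, p_yx) if t_yx <= t_xy else (x, y, p_xy)
    if p_value > 0.05:
        return False
    beta = sm.OLS(dep, sm.add_constant(indep)).fit().params.iloc[1]
    spread = (dep - beta * indep).to_numpy()
    # 2) Mean-reverting character confirmed by the Hurst exponent.
    if hurst(spread) >= 0.5:
        return False
    # 3) Half-life between one day and one year (expressed in 5-minute bars).
    hl = half_life(spread)
    if not (bars_per_year / 252 < hl < bars_per_year):
        return False
    # 4) The spread must cross its mean at least twelve times per year.
    signs = np.sign(spread - spread.mean())
    crossings = np.count_nonzero(signs[1:] != signs[:-1])
    return crossings / (len(spread) / bars_per_year) >= 12

def select_pairs(clusters: dict, prices: pd.DataFrame,
                 bars_per_year: int = 78 * 252) -> list:   # ~78 five-minute bars per session (assumption)
    """Search pair combinations only inside each cluster (Section III-D)."""
    return [(a, b)
            for members in clusters.values()
            for a, b in combinations(members, 2)
            if is_eligible(prices[a], prices[b], bars_per_year)]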
IV. PROPOSED TRADING MODEL

We proceed to address the second problem this work intends to explore: handling the long decline periods due to prolonged divergent pairs.

A. Trading Model

A potential alternative to continuously monitoring the spread and tracking deviations consists of modeling the spread directly. This way, a prediction can be made regarding how the spread will vary in the future and a position is only entered if the predicted conditions are favourable. By taking advantage of a time-series forecasting algorithm to predict the spread at the next time instant, we may calculate the expected spread percentage change at time t + 1 as

\Delta_{t+1} = \frac{S^*_{t+1} - S_t}{S_t} \times 100   (4)

where S and S^* correspond to the real and the predicted spread, respectively. When the absolute value of the predicted change is larger than a predefined threshold, a position may be entered, on the expectation that the spread will undergo an abrupt movement from which the investor can benefit. Assuming the investor is not holding a position yet, the next position, P_{t+1}, may be described according to

P_{t+1} : \begin{cases} \text{Go long}, & \text{if } \Delta_{t+1} \geq \alpha_L \\ \text{Go short}, & \text{if } \Delta_{t+1} \leq \alpha_S \\ \text{Wait}, & \text{otherwise} \end{cases}   (5)

Once a position is entered, it is maintained while the predicted spread direction does not change. When it switches, the position is exited. This strategy defines the basis of the proposed trading model. It is still to be described how the thresholds (α_L, α_S) should be calculated. A possible approach consists of framing an optimization problem and trying to find the profit-maximizing values. However, this approach is rejected due to its risk of data-snooping and unnecessary added complexity. We propose a simpler, non-iterative, data-driven approach. We start by obtaining f(x), the spread percentage change distribution during the formation period, given that the spread percentage change at time t is defined as

x_t = \frac{S_{t+1} - S_t}{S_t} \times 100

From f(x), the set of negative percentage changes, f^-(x), and positive percentage changes, f^+(x), are considered separately. Since the proposed model targets abrupt changes but also requires that they occur frequently enough, looking for the extreme quantiles seems an adequate solution. Therefore, we recommend using the top decile and quintile from f^+(x) as candidates for defining α_L and the bottom ones, from f^-(x), for defining α_S. The quintile-based and decile-based thresholds are both tested in the validation set and the most optimistic combination is adopted. Formally,

\alpha_S, \alpha_L = \arg\max_{q} R_{val}(q), \quad q \in \left\{ \left[ Q_{f^-(x)}(0.20),\, Q_{f^+(x)}(0.80) \right], \left[ Q_{f^-(x)}(0.10),\, Q_{f^+(x)}(0.90) \right] \right\}   (6)

where R_val is the return obtained in the validation period.

To summarize, the model construction follows the diagram illustrated in Figure 3. For each pair, the investor starts by training the forecasting algorithms to predict the spread. Furthermore, the decile-based and quintile-based thresholds are collected to integrate the trading model. Having fitted the forecasting algorithms and obtained the two combinations for the thresholds (State 1), the model is applied on the validation set. From the validation performance, the best threshold combination is selected (State 2). At this point the model is ready to be applied on unseen data.

Fig. 3. Proposed model construction diagram.

An application example is illustrated in Figure 4. For the sake of illustration, the forecasting has perfect accuracy, meaning the positions can be set in optimal conditions.

Fig. 4. Example of the proposed forecasting-based strategy.
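A minimal sketch of this data-driven rule follows: the two candidate threshold pairs of Eq. (6) are read from the formation-period distribution of spread percentage changes, and Eqs. (4)-(5) are applied to a one-step prediction. Function and variable names are illustrative only.

import pandas as pd

def candidate_thresholds(formation_spread: pd.Series) -> list:
    """The two (alpha_S, alpha_L) candidates of Eq. (6), taken from f(x)."""
    pct = formation_spread.pct_change().dropna() * 100     # x_t, in percent
    neg, pos = pct[pct < 0], pct[pct > 0]
    return [(neg.quantile(0.20), pos.quantile(0.80)),      # quintile-based
            (neg.quantile(0.10), pos.quantile(0.90))]      # decile-based

def next_position(predicted_spread: float, current_spread: float,
                  alpha_L: float, alpha_S: float) -> str:
    """Entry rule of Eqs. (4)-(5) for an investor not currently holding a position."""
    delta = (predicted_spread - current_spread) / current_spread * 100   # Eq. (4)
    if delta >= alpha_L:
        return "long"
    if delta <= alpha_S:
        return "short"
    return "wait"

The choice between the quintile-based and decile-based candidates would then follow from whichever yields the higher validation return, as stated in Eq. (6).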
B. Applied forecasting algorithms
Forecasting algorithms commonly applied in the litera-
ture can be divided into two major classes: parametric and
non-parametric models. The former assumes that the un-
derlying process can be described using a small number
of parameters. The latter makes no structural assumptions
about the underlying structure of the process. We pro-
pose the application of a benchmark parametric approach,
the autoregressive–moving-average (ARMA), and two non-
parametric models, the Long Short-Term Memory (LSTM) and
the LSTM Encoder-Decoder. This will allow inferring to what extent the strategy profitability depends on the complexity of the time-series forecasting algorithm. The justifications for the choices adopted are described next.

Although financial time series are very complex in nature [20], the ones under analysis are by construction stationary, as they correspond to the linear combination of cointegrated price series. Thus it is fair to ask if an ARMA model may succeed at forecasting this series. This model describes a stationary stochastic process as the composition of two polynomials, the autoregression AR(p) and the moving average MA(q), as

X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}   (7)

where p and q represent the orders of the polynomials.

Nevertheless, there is an underlying motivation for applying more complex models, such as Artificial Neural Networks (ANN). First, ANNs have been an object of attention in many different fields, which makes their application in this context an interesting case study. Furthermore, ANN-based models have shown very promising results in predicting financial time series data in general [21]. From the vast amount of existing ANN configurations, the LSTM architecture is deemed appropriate due to its capabilities of learning non-linear representations of the data while memorizing long sequences. LSTMs assume the existence of a sequential dependency among inputs, and previous states might affect the decision of the neural network at a different point in time.

Furthermore, from a trading perspective, it might be particularly beneficial to collect information regarding the prediction not just of the next instant of a time series but also of later time steps. An LSTM Encoder-Decoder architecture is naturally fitted to such scenarios. This architecture is comprised of two LSTMs, one for encoding the input sequence into a fixed-length vector, the encoder, and a second for decoding the fixed-length vector and outputting the predicted sequence, the decoder, as illustrated in Figure 5.

Fig. 5. LSTM Encoder-Decoder.

In this multi-step forecasting scenario, the trading rules are adapted by simply calculating the prediction change N time steps in advance. Likewise, the thresholds α_L and α_S should be calculated with respect to the distribution of the percentage change between x(t + N) and x(t).

Given the limited computation resources, the neural network models' tuning is constrained to a set of the most relevant variables: the data sequence length, the number of hidden layers and the nodes in each hidden layer. Early-stopping and dropout are applied as regularization techniques.

V. RESEARCH DESIGN

The research design contemplates two stages, corresponding to each problem being addressed.

A. Research Stage 1 - Pairs selection

First, we intend to examine how the three different pairs' search techniques (unrestricted, grouping by category and unsupervised learning) compare to each other. For this purpose, the three methodologies are implemented. The proposed pairs selection rules are also constructed and applied on top of each search technique. As for the trading setup, since the focus lies on comparing the search techniques relative to each other, we are not concerned with meticulously optimizing the trading conditions. Therefore, we apply the standard threshold-based model proposed in [3], with the parameters specified in Table I. The spread's standard deviation, σ_s, and mean, µ_s, are calculated with respect to the entire formation period.

TABLE I
THRESHOLD-BASED MODEL PARAMETERS.

Parameters        Values
Long Threshold    µ_s − 2σ_s
Short Threshold   µ_s + 2σ_s
Exit Threshold    µ_s

To test the performance of the selected pairs, we implement three different test portfolios resembling probable trading scenarios. Portfolio 1 considers all the pairs identified in the formation period. Portfolio 2 takes advantage of the feedback collected from running the strategy in the validation set by selecting only the pairs that had a positive return. Lastly, Portfolio 3 corresponds to the situation in which the investor is limited to investing in a fixed number of k pairs. In such a case, we suggest selecting the top-k pairs according to the return obtained in the validation set. We consider k = 10, as it stands in between the choices of [3], which uses k = 5 and k = 20.
B. Research Stage 2 - Trading Model

At this stage, we aim to compare the robustness provided by the standard threshold-based model with the proposed forecasting-based model, simulated using an ARMA, an LSTM and an LSTM Encoder-Decoder with an output length of two. We propose to first evaluate the forecasting performance of each algorithm. As benchmark, a naive baseline is considered, which simply outputs Ŷ_{t+1} = Y_t. Then, to evaluate the trading strategy itself, we suggest using the pairs search technique which proved more appealing according to the results obtained in the previous research stage. As for the test portfolio, we consider using Portfolio 2.

C. Dataset

Trading ETFs is considered adequate since they are easy to trade, as they trade like stocks, and because their dynamics are expected to change much more slowly than those of a single stock. Adding to that, research in the field [7, 22] obtained more robust mean-reverting time series by using a linear combination of stocks to form each component of the spread. We presume using ETFs may be a proxy to accomplish that more efficiently.

This work fixates on a subset of ETFs which track single commodities, commodity-linked indexes or companies focused on exploring a commodity. This reduces the number of possible pairs, making the strategy computationally more efficient and leaving space for careful analysis of the selected pairs.

A total of 208 commodity-linked ETFs are available for trading in January 2019, for which five categories may be identified based on the ETFs' composition (Agriculture, Broad Market, Energy, Industrial Metals and Precious Metals). This information is collected from [23].

We considered price series data with 5-min frequency. The motivation for using intraday data is three-fold. First, with finer granularity the entry and exit points can be defined with more precision, providing opportunities for higher profit margins. Secondly, we may detect intraday mean-reversion patterns that could not be found otherwise. At last, it provides more data samples, allowing to train complex forecasting models with less risk of overfitting.

The periods considered for simulating each research stage are illustrated in Figure 6. There are essentially two possible configurations: (i) the 3-year-long formation periods, and (ii) the 9-year-long formation period. In both cases, the second-to-last year is used for validating the performance, before running the strategy on the test set. We define a 1-year-long trading period, based on the findings of [4], which claim the profitability can be increased if the initial 6-month trading period proposed in [3] is extended to 1 year.

Configuration (i) is adopted when using the threshold-based trading model (described in Table I). A formation period of three years seems appropriate. Although this period is slightly longer than what is commonly found in the literature⁴, we decide to proceed on the basis that a longer period may identify more robust pairs. Configuration (ii) is used for simulating the forecasting-based trading model, thus providing more formation data to fit the forecasting algorithms. In this case, the first 8 years are used for training, as indicated in Figure 6.

For the first research stage, we propose using three different periods to have more statistical evidence on the results obtained. In the second research stage, this is not conceivable due to the computational burden of training the forecasting algorithms. Hence, we consider one period using configuration (i), and a second period using configuration (ii), to evaluate how the standard model could have performed in the same test period.

Fig. 6. Trading periods.

As preprocessing steps, we start by removing all ETFs with missing values throughout the period being considered. Then, we remove ETFs that do not verify the minimum liquidity requisites, to ensure the considered transaction costs associated with the bid-ask spread are realistic⁵. The minimum liquidity requisites follow the criterion adopted in [3, 25], which discards all ETFs not traded during at least one day.

⁴ [3, 4, 24] use a 1-year-long formation period. [5] makes use of a 3-month formation period.
⁵ Trading illiquid ETFs would result in higher bid-ask spreads which could dramatically impact the profit margins.

D. Trading simulation

Concerning the portfolio construction, we impose that all pairs are equally weighted in the portfolio, so that the returns can be obtained by dividing the performance by the number of pairs, with no need to be concerned about relative weights.

An aspect that follows concerns the capital allocation within each pair. We consider that the capital resulting from the short position is immediately applied in the long position. This type of leverage is adopted by most hedge funds. On this basis, we construct a framework which ensures that every trade is set with just $1. This is accomplished by imposing that

max(asset1, asset2) = $1,

where asset1 and asset2 represent the capital invested in each pair's constituent, as illustrated in Figure 7. Although the gross exposure is higher, a $1 initial investment is always sufficient. As trading progresses, we consider that all the capital earned by a pair in the trading period is reinvested in the next trade.
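A small sketch of this $1-per-trade sizing rule, under the assumption that the nominal value of each leg at entry is known, could look as follows (names are illustrative):

def scale_legs(nominal_long: float, nominal_short: float) -> tuple:
    """Rescale both legs so that max(asset1, asset2) = $1 (Section V-D).
    The short proceeds are assumed to fund the long leg, so a $1 budget
    per trade is always sufficient even though gross exposure is larger."""
    scale = 1.0 / max(abs(nominal_long), abs(nominal_short))
    return nominal_long * scale, nominal_short * scale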
Fig. 7. Market position definition.

All the results presented in this work account for transaction costs. The transaction costs considered are based on estimates from [25], in which the authors perform an in-depth study on the impact of transaction costs in Pairs Trading. The costs comprise three components: commissions (8 bps), market impact (20 bps) and short-selling constraints (1% per annum). Besides, commission and market impact costs are adapted to account for both assets in the pair.

As a trading system cannot act instantaneously, there might be a small deviation in the entry price inherent to the delay of entering a position. To account for this factor and make sure the strategy is viable in practice, we assume a conservative one-period (5-min) delay for entering a position.

This work does not comprehend an implementation of a stop-loss system, under any circumstances. This means a position is only exited if the pair converges or the trading period ends.

E. Evaluation metrics

Regarding the trading evaluation, we propose analyzing the strategy Return on Investment (ROI), Sharpe Ratio (SR) and the portfolio Maximum Drawdown (MDD).

The ROI is calculated as the net profit divided by the initial capital, which we enforced to be $1.

The portfolio SR is calculated as

SR_{year} = \frac{R_{port} - R_f}{\sigma_{port}} \times \text{annualization factor}   (8)

where R_port represents the daily portfolio returns and R_f the risk-free rate⁶. The portfolio volatility, σ_port, is calculated as

\sigma_{port} = \sqrt{ \sum_{i=1}^{N} \sum_{j=1}^{N} \omega_i \, \mathrm{cov}(i, j) \, \omega_j }   (9)

where ω_i is the relative weight of asset i in the portfolio. The annualization factor is set according to the methodology proposed by Lo [27] (Table 2 in [27]), to prevent imprecise approximations.

⁶ The average 3-Month treasury bill rate, taken from [26], during the corresponding test period and converted to a daily basis for consistency with the formula.

F. Implementation environment

All the code developed in this work is built from scratch using Python. Some libraries are particularly useful. First, scikit-learn proves helpful in the implementation of PCA and the OPTICS algorithm. Second, statsmodels provides an already implemented version of the ADF test, useful for testing cointegration. Last, we make use of Keras to build the proposed neural networks. The code is publicly available in [28].

Concerning the running environment, most of the simulation code is run on a local CPU, except for the training of the LSTM models. They involve a huge amount of matrix multiplications which result in long processing times. These operations are massively sped up by taking advantage of the parallelization capabilities of a GPU.

VI. RESULTS

The results obtained at each research stage are presented next.

A. Analysis of the pairs selection framework

We start by presenting some relevant statistics in Table II concerning the number of pairs found for the three different pairs search techniques being compared at this stage.

TABLE II
SELECTED PAIRS USING DIFFERENT SEARCH METHODS.

As expected, when no restrictions are imposed on the search space, a larger set of ETFs emerges and consequently more pairs are selected. Contrarily, when grouping ETFs in five partitions (according to the categories described in section V-C) there is a reduction in the number of possible pair combinations. This is not more evident due to the underlying unbalance across the categories considered. Because energy-linked ETFs represent close to half of all ETFs, the combinations within this sector are still vast. Lastly, the number of possible pair combinations when using OPTICS is remarkably lower. Although the number of clusters is higher than when grouping by category, their smaller size results in fewer combinations. We proceed to analyze in more detail the results obtained with this algorithm.

The results concerning the OPTICS application are obtained using five principal components to describe the data. We empirically verified that up to the 15-dimensions boundary (motivated in section III-A) the results are not significantly affected. We adopt 5 dimensions since we find it more adequate to settle the ETFs' representation in a lower dimension, provided that there is no evidence favoring higher dimensions.
To validate the clusters formed and get an insight into their
composition we examine the results obtained in the period
of Jan 2014 to Dec 2017⁷. To represent the clusters in a 2-D setting, the data must be reduced from 5 dimensions. We consider the application of t-SNE [29] for this purpose. Figure 8 illustrates the clusters formed. The ETFs not clustered are represented by the smaller circles, which were not labeled to facilitate the visualization.

Fig. 8. Application of t-SNE to the clusters generated by OPTICS.

In order to evaluate the integrity of the clusters, we propose analyzing the composing price series. Therefore, we select two clusters and represent the logarithm of the price series⁸ of each ETF. Figure 9(a) illustrates a cluster in which the ETFs identified do not just belong to the same category but are also part of the same segment, the Equity US: MLPs. This evinces that the OPTICS approach is capable of detecting a specific segment just from the time series data. Figure 9(b) demonstrates that the OPTICS clustering capabilities extend beyond selecting ETFs within the same segment, as we may observe ETFs from distinct categories, such as Agriculture (CGW, FIW, PHO, and PIO), Industrial Metals (LIT and REMX) and Energy (YMLI). There is a visible relation among the identified price series, even though they do not all belong to the same category.

⁷ This period is chosen arbitrarily because an extensive analysis covering all periods does not fit this report.
⁸ The price series illustrated result from subtracting the mean of the original price series, to facilitate the visualization.

(a) Normalized prices in Cluster 1.
(b) Normalized prices in Cluster 2.
Fig. 9. Price series composition of some clusters.

We confirm the generated clusters display a tendency to group subsets of ETFs from the same category while not impeding clusters containing ETFs from different ones.

With respect to the trading performance, Table III unveils the test results obtained with each clustering type using the three different portfolios introduced in section V-A. To aggregate the information in a more concise way, the average over all years and portfolios is described in the rightmost column. Note also that the three evaluation metrics described in section V-E are accentuated, to differentiate them from the remaining, less critical, descriptive statistics. We can confirm the profitability in all the environments tested, which corroborates the idea that the pairs selection rules are robust.

Comparing the different clustering techniques, if an investor is focused on obtaining the highest ROI, regardless of the incurred risk, performing no clustering is particularly appealing. But when risk is taken into the equation, the OPTICS-based strategy proves more auspicious. It is capable of generating the highest average portfolio Sharpe ratio of 3.79, in comparison with 3.58 obtained when performing no clustering or 2.59 when grouping by category. Also, it shows more consistency w.r.t. the portion of profitable pairs in the portfolio, with an average of 86% profitable pairs, against 80% when grouping by category and 79% when performing no clustering at all. At last, it achieves more steady portfolio drawdowns, with the lowest average MDD. It is capable of maintaining the MDD values within an acceptable range even when the other two techniques display considerable deviations, as in 2017.
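The risk-adjusted figures quoted above follow the metrics of Section V-E. A possible sketch of the annualized Sharpe ratio of Eq. (8) and of the maximum drawdown is given below; using the standard deviation of the daily portfolio returns in place of the full covariance expression of Eq. (9) is a simplification made here, and the annualization factor is meant to be taken from Lo [27]:

import pandas as pd

def annualized_sharpe(daily_returns: pd.Series, rf_daily: float,
                      annualization_factor: float) -> float:
    """Sharpe ratio of Eq. (8) with a simplified volatility estimate."""
    excess = daily_returns - rf_daily
    return float(excess.mean() / daily_returns.std() * annualization_factor)

def max_drawdown(portfolio_value: pd.Series) -> float:
    """Maximum drawdown: worst peak-to-trough loss of the value curve."""
    running_peak = portfolio_value.cummax()
    return float(((portfolio_value - running_peak) / running_peak).min())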
TABLE III
TRADING PERFORMANCE FOR EACH PAIRS SEARCH TECHNIQUE.

B. Evaluation of the forecasting-based model

We start by selecting pairs using the OPTICS clustering, due to its demonstrated ability. On these conditions, we find 5 pairs during the formation period of Jan 2009 to Dec 2017 and 19 pairs during Jan 2015 to Dec 2017 (periods defined in Figure 6). Not surprisingly, the number of pairs found for the former period is greatly reduced as the active cointegrated ETFs throughout this interval are more scarce. But since training the Deep Learning forecasting models is computationally very expensive, having fewer pairs is actually convenient. The corresponding five spreads are illustrated in Figure 10. The spreads look indeed stationary. There is an evident difference in their volatility, which further supports the importance of enforcing data-driven trading thresholds.

Fig. 10. Pairs identified in Jan 2009-Dec 2017.

Each spread in Figure 10 is fitted by the forecasting algorithms. The forecasting score is obtained by averaging the mean-square error (MSE) over the five spreads.

A total of 31 forecasting model architectures are implemented in this work to find the one with the most trading potential, meaning 155 models are trained (31 architectures × 5 spreads). We experiment with increasingly complex configurations until signs of overfitting are evident. The forecasting performance obtained for the best configurations is described in Table IV.

TABLE IV
FORECASTING RESULTS COMPARISON.

We may verify that all the implemented models are capable of outperforming the naive implementation during the validation period. Curiously, we note that the LSTM-based models do not manage to surpass the ARMA model, at least w.r.t. the chosen metrics. Also, the results obtained in the test set indicate signs of overfitting despite the efforts taken in that regard, as the LSTM-based models are no longer superior to the naive performance. The incapability of the LSTM-based models to outperform the simpler approaches is in accordance with the findings of [30], which assert that time-series problems found in the literature are often conceptually simpler than many tasks already solved by LSTMs and that, more often than not, all relevant information about the next event is conveyed by a few recent events. We suspect this is the case in this work. At last, we analyze the performance obtained by the integration of the previous algorithms in the proposed trading model scheme. Based on the validation records, the quintile-based thresholds are used with ARMA and the decile-based thresholds with the LSTMs. The test results in these conditions are illustrated in Table V.

TABLE V
TRADING RESULTS COMPARISON USING AN 8-YEAR-LONG FORMATION PERIOD.

The results indicate that if robustness is evaluated by the number of days the portfolio value does not decline (accentuated in Table V), then the proposed trading model does provide
an improvement. The forecasting-based models display a total of 2 (LSTM), 11 (ARMA) and 22 (LSTM Encoder-Decoder) days of portfolio decline, in comparison with 87 days obtained when using the standard model. This finding suggests the forecasting-based model is capable of defining more precise entry points, and hence reduces the number of unprofitable days. However, that comes at the expense of a reduction in both portfolio SR and ROI, questioning the benefits provided by the proposed model after all. We suspect the long required formation period is also responsible for this profitability decline. Therefore we proceed to analyze the standard trading model in the 3-year-long period.

TABLE VI
TRADING RESULTS FOR STANDARD TRADING MODEL USING A 3-YEAR-LONG FORMATION PERIOD.

By comparison, the performance in the 10-year-long period seems greatly affected by the long required duration, suggesting the less satisfactory returns emerge not simply from the trading model itself, but also due to the underlying time settings. Following this line of reasoning, if the forecasting-based models' performance increases in the same proportion as the standard trading model when reducing the formation period, the results obtained could be much more satisfactory.

VII. CONCLUSIONS

We explored how Pairs Trading could be enhanced with the integration of Machine Learning. First, we proposed a new approach to search for pairs based on the application of the OPTICS algorithm followed by robust pairs selection criteria. The strategy achieved better risk-adjusted returns when using this method. Secondly, we introduced a forecasting-based model aiming to reduce decline periods associated with untimely market positions and prolonged divergent pairs. We demonstrated the proposed model is capable of reducing the average decline period by more than 75%, although that comes at the expense of declining profitability. In addition, this work also contributes with empirical evidence of the suitability of ETFs traded in a 5-minute setting in the context of Pairs Trading.

REFERENCES

[1] C. Krauss, X. A. Do, and N. Huck, "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500," European Journal of Operational Research, vol. 259, no. 2, pp. 689–702, 2017.
[2] J. Caldeira and G. V. Moura, "Selection of a portfolio of pairs based on cointegration: A statistical arbitrage strategy," Available at SSRN 2196391, 2013.
[3] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst, "Pairs trading: Performance of a relative-value arbitrage rule," The Review of Financial Studies, vol. 19, no. 3, pp. 797–827, 2006.
[4] B. Do and R. Faff, "Does simple pairs trading still work?" Financial Analysts Journal, vol. 66, no. 4, pp. 83–95, 2010. [Online]. Available: https://doi.org/10.2469/faj.v66.n4.1
[5] C. L. Dunis, G. Giorgioni, J. Laws, and J. Rudy, "Statistical arbitrage and high-frequency data with an application to Eurostoxx 50 equities," Liverpool Business School, Working paper, 2010.
[6] C. Krauss, "Statistical arbitrage pairs trading strategies: Review and outlook," Journal of Economic Surveys, vol. 31, no. 2, pp. 513–545, 2017.
[7] H. Chen, S. Chen, Z. Chen, and F. Li, "Empirical investigation of an equity pairs trading strategy," Management Science, 2017.
[8] G. Vidyamurthy, Pairs Trading: Quantitative Methods and Analysis. John Wiley & Sons, 2004, vol. 217.
[9] N. Huck and K. Afawubo, "Pairs trading and selection methods: is cointegration superior?" Applied Economics, vol. 47, no. 6, pp. 599–613, 2015.
[10] C. L. Dunis, J. Laws, and B. Evans, "Modelling and trading the gasoline crack spread: A non-linear story," Derivatives Use, Trading & Regulation, vol. 12, no. 1-2, pp. 126–145, 2006.
[11] C. L. Dunis, J. Laws, P. W. Middleton, and A. Karathanasopoulos, "Trading and hedging the corn/ethanol crush spread using time-varying leverage and nonlinear models," The European Journal of Finance, vol. 21, no. 4, pp. 352–375, 2015.
[12] N. S. Thomaidis, N. Kondakis, and G. Dounias, "An intelligent statistical arbitrage trading system," in SETN, 2006.
[13] N. Huck, "Pairs selection and outranking: An application to the S&P 100 index," European Journal of Operational Research, vol. 196, no. 2, pp. 819–825, 2009.
[14] N. Huck, "Pairs trading and outranking: The multi-step-ahead forecasting case," European Journal of Operational Research, vol. 207, no. 3, pp. 1702–1716, 2010.
[15] P. Berkhin, "A survey of clustering data mining techniques," in Grouping Multidimensional Data. Springer, 2006, pp. 25–71.
[16] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise."
[17] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in ACM SIGMOD Record, vol. 28, no. 2. ACM, 1999, pp. 49–60.
[18] T. Kleinow, "Testing continuous time models in financial markets," 2002.
[19] E. Chan, Algorithmic Trading: Winning Strategies and Their Rationale. John Wiley & Sons, 2013, vol. 625.
[20] Y.-W. Si and J. Yin, "OBST-based segmentation approach to financial time series," Engineering Applications of Artificial Intelligence, vol. 26, no. 10, pp. 2581–2596, 2013.
[21] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P. Nobrega, and A. L. Oliveira, "Computational intelligence and financial markets: A survey and future directions," Expert Systems with Applications, vol. 55, pp. 194–211, 2016.
[22] M. Perlin, "M of a kind: A multivariate approach at pairs trading," 2007.
[23] "Find the Right ETF - Tools, Ratings, News," https://www.etf.com/, accessed: 2019-06-30.
[24] H. Rad, R. K. Y. Low, and R. Faff, "The profitability of pairs trading strategies: distance, cointegration and copula methods," Quantitative Finance, vol. 16, no. 10, pp. 1541–1558, 2016. [Online]. Available: https://doi.org/10.1080/14697688.2016.1164337
[25] B. Do and R. Faff, "Are pairs trading profits robust to trading costs?" Journal of Financial Research, vol. 35, no. 2, pp. 261–287, 2012.
[26] "3-Month Treasury Bill: Secondary Market Rate," https://fred.stlouisfed.org/series/TB3MS, accessed: 2019-07-11.
[27] A. W. Lo, "The statistics of Sharpe ratios," Financial Analysts Journal, vol. 58, no. 4, pp. 36–52, 2002.
[28] S. Moraes Sarmento, "Github repository: Pairs trading," https://github.com/simaomsarmento/PairsTrading, 2019.
[29] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[30] F. A. Gers, D. Eck, and J. Schmidhuber, "Applying LSTM to time series predictable through time-window approaches," in Neural Nets WIRN Vietri-01. Springer, 2002, pp. 193–200.
