A Backtesting Protocol in the Era of Machine Learning
Rob Arnott
Research Affiliates, Newport Beach, CA 92660, USA
Campbell R. Harvey
Fuqua School of Business, Duke University, Durham, NC 27708, USA
National Bureau of Economic Research, Cambridge, MA 02138, USA
Harry Markowitz
Harry Markowitz Company, San Diego, CA 92109, USA
ABSTRACT
Machine learning offers a set of powerful tools that holds considerable promise for investment
management. As with most quantitative applications in finance, however, misapplying
these techniques can lead to disappointment. One crucial limitation involves data availability.
Many of machine learning’s early successes originated in the physical and biological sciences, in
which truly vast amounts of data are available. Machine learning applications often require far
more data than are available in finance, which is of particular concern in longer-horizon
investing. Hence, choosing the right applications before applying the tools is important. In
addition, capital markets reflect the actions of people, which may be influenced by others’ actions
and by the findings of past research. In many ways, the challenges that affect machine learning
are merely a continuation of the long-standing issues researchers have always faced in
quantitative finance. While investors need to be cautious—indeed, more cautious than in past
applications of quantitative methods—these new tools offer many potential applications in
finance. In this article, the authors develop a research protocol that pertains both to the
application of machine learning techniques and to quantitative finance in general.
Machine learning and other statistical tools, which were impractical to use in
the past, hold considerable promise for the development of successful trading
strategies, especially in higher-frequency trading. They might also hold great promise
in other applications such as risk management. Nevertheless, we need to be careful
in applying these tools. Indeed, we argue that given the limited nature of the standard
data that we use in finance, many of the challenges we face in the era of machine
learning are very similar to the issues we have long faced in quantitative finance in
general. We want to avoid backtest overfitting of investment strategies. And we want
a robust environment to maximize the discovery of new (true) strategies.
We believe the time is right to take a step back and to re-examine how we do our
research. Many have warned about the dangers of data mining in the past (e.g.,
Leamer, 1978; Lo and MacKinlay, 1990; and Markowitz and Xu, 1994), but the
problem is even more acute today. The playing field has leveled in computing power.
1 Harvey and Liu (2014) present a similar exhibit with purely simulated (fake) strategies.
This strategy might seem too good to be true. And it is. This data-mined strategy
forms portfolios based on letters in a company's ticker symbol. For example, A(1)-
B(1) goes long all stocks with "A" as the first letter of their ticker symbol and short
all stocks with "B" as the first letter, equally weighting within both portfolios. The
strategy in Exhibit 1, denoted S(3)-U(3), is the best performer among all such pairings
on the first three letters of the ticker symbol. With 26 letters in the alphabet and
ordered long-short pairings on each of the first three letter positions, thousands of
combinations are possible.
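To see how cheaply such a "discovery" can be manufactured, consider the following minimal Python sketch. It is our illustration, not the exhibit's actual code: it assumes `returns` is a months-by-stocks DataFrame of monthly returns and `tickers` is a Series mapping those columns to ticker strings, and the Sharpe-ratio ranking rule is our own choice.

```python
import itertools
import string

import numpy as np
import pandas as pd

def letter_portfolio(returns: pd.DataFrame, tickers: pd.Series,
                     long_letter: str, short_letter: str,
                     pos: int) -> pd.Series:
    """Equal-weight long-short return stream: long stocks whose ticker has
    `long_letter` at position `pos` (0-indexed), short those with `short_letter`."""
    longs = tickers.index[tickers.str[pos] == long_letter]
    shorts = tickers.index[tickers.str[pos] == short_letter]
    if len(longs) == 0 or len(shorts) == 0:
        return pd.Series(dtype=float)
    return returns[longs].mean(axis=1) - returns[shorts].mean(axis=1)

def mine_best(returns: pd.DataFrame, tickers: pd.Series):
    """Backtest all 26 * 25 ordered letter pairs at each of the first three
    positions (1,950 'strategies') and keep the best in-sample Sharpe ratio."""
    best_name, best_sharpe = None, -np.inf
    for pos in range(3):
        for lng, sht in itertools.permutations(string.ascii_uppercase, 2):
            pnl = letter_portfolio(returns, tickers, lng, sht, pos)
            if pnl.empty or pnl.std() == 0:
                continue
            sharpe = np.sqrt(12) * pnl.mean() / pnl.std()  # monthly data assumed
            if sharpe > best_sharpe:
                best_name = f"{lng}({pos + 1})-{sht}({pos + 1})"
                best_sharpe = sharpe
    return best_name, best_sharpe
```

Run over a few decades of data, the winning pair will almost always look statistically "significant," even though the selection rule is economically meaningless: with nearly two thousand trials, an impressive in-sample result is close to guaranteed.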
2 Online tools, such as those available at http://datagrid.lbl.gov/backtest/index.php, generate fake strategies that
are as impressive as the one illustrated in Exhibit 1.
3Economists have an advantage over physicists in that societies are human constructs. Economists research what
humans have created, and as humans, we know how we created it. Physicists are not so lucky.
4 In investing, two of these three outcomes have a twist to the winner’s curse: private gain and social loss. The
asset manager pockets the fees until the flaw of the strategy becomes evident, while the investor bears the losses
until the great reveal that it was a bad strategy all along.
5 See McLean and Pontiff (2016). Arnott, Beck, and Kalesnik (2016) examine eight of the most popular factors and
show an average return of 5.8% a year in the span before the factors’ publication and a return of only 2.4% after
publication. This loss of nearly 60% of the alpha on a long−short portfolio before any fees or trading costs is far
more slippage than most observers realize.
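The slippage figure in footnote 5 follows from simple arithmetic on the two reported returns:

$$\frac{5.8\% - 2.4\%}{5.8\%} \approx 0.59,$$

that is, roughly 60% of the pre-publication alpha disappears after publication.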
8 See Asness and Frazzini (2013). Hou, Xue, and Zhang (2017) show that most anomaly excess returns
disappear once microcaps are excluded.
9 Monte Carlo simulations are part of the toolkit, perhaps less used today than in the past. Of course,
simulations will produce results entirely consonant with the assumptions that drive the simulations.
10 Monthly macroeconomic data generally became available in 1959.
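The caution in footnote 9 is easy to demonstrate with a few lines of simulation, in the spirit of the fake strategies cited in footnote 1. This is a minimal sketch with illustrative assumptions (10,000 strategies, 20 years of monthly data, 15% annualized volatility):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

n_strategies, n_months = 10_000, 240       # 20 years of monthly observations
monthly_vol = 0.15 / np.sqrt(12)           # 15% annualized volatility

# Every simulated strategy has a true expected return of exactly zero.
fake = rng.normal(0.0, monthly_vol, size=(n_strategies, n_months))
sharpes = np.sqrt(12) * fake.mean(axis=1) / fake.std(axis=1)

# Best-of-many selection: pure noise, yet the winner looks like skill.
print(f"best in-sample Sharpe of {n_strategies:,} zero-alpha strategies: "
      f"{sharpes.max():.2f}")
```

The best of these skill-free strategies typically posts an in-sample Sharpe ratio near 0.9, exactly as consonant with the zero-alpha assumption driving the simulation as footnote 9 warns.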
Exhibit 2 condenses the foregoing discussion into a seven-point protocol for research
in quantitative finance.
1. Research Motivation
a) Does the model have a solid economic foundation?
b) Did the economic foundation or hypothesis exist before the research was conducted?
4. Cross-Validation
a) Are the researchers aware that true out-of-sample tests are only possible in live trading?
b) Are steps in place to eliminate the risk of out-of-sample “iterations” (i.e., an in-sample model
that is later modified to fit out-of-sample data)?
c) Is the out-of-sample analysis representative of live trading? For example, are trading costs and
data revisions taken into account? (A minimal walk-forward sketch follows this exhibit.)
5. Model Dynamics
a) Is the model resilient to structural change and have the researchers taken steps to minimize
the overfitting of the model dynamics?
b) Does the analysis take into account the risk/likelihood of overcrowding in live trading?
c) Do researchers take steps to minimize the tweaking of a live model?
6. Complexity
a) Does the model avoid the curse of dimensionality?
b) Have the researchers taken steps to produce the simplest practicable model specification?
c) Is an attempt made to interpret the predictions of the machine learning model rather than
using it as a black box?
7. Research Culture
a) Does the research culture reward quality of the science rather than finding the winning
strategy?
b) Do the researchers and management understand that most tests will fail?
c) Are expectations clear (that researchers should seek the truth, not just something that works)
when research is delegated?
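To make protocol point 4 concrete, here is a minimal walk-forward sketch in Python. It is a hedged illustration, not the authors' procedure: the 120-month training window, the sign-based position rule, and the monthly frequency are all our assumptions.

```python
import numpy as np

def walk_forward_sharpe(signal: np.ndarray, returns: np.ndarray,
                        train_window: int = 120) -> float:
    """Walk-forward evaluation: at each month t, fit a predictive
    regression using only data available through month t, then hold a
    position for month t + 1 based on the resulting forecast."""
    pnl = []
    for t in range(train_window, len(returns) - 1):
        x = signal[t - train_window:t]              # predictor values
        y = returns[t - train_window + 1:t + 1]     # following-month returns
        beta, alpha = np.polyfit(x, y, 1)           # OLS on trailing window only
        forecast = alpha + beta * signal[t]
        pnl.append(np.sign(forecast) * returns[t + 1])
    pnl = np.asarray(pnl)
    return float(np.sqrt(12) * pnl.mean() / pnl.std())
```

Even this design is only pseudo out-of-sample in the sense of point 4(a); rerunning it after each model tweak reintroduces exactly the out-of-sample "iterations" that point 4(b) warns against, and a realistic version would also net out trading costs per point 4(c).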
REFERENCES
López de Prado, Marcos. 2018. "The 10 Reasons Most Machine Learning Funds Fail." Journal of
Portfolio Management, vol. 44, no. 6 (Special Issue Dedicated to Stephen A. Ross): 120–133.
Markowitz, Harry, and Gan Lin Xu. 1994. "Data Mining Corrections." Journal of Portfolio
Management, vol. 21, no. 1 (Fall): 60–69.
McLean, R. David, and Jeffrey Pontiff. 2016. "Does Academic Research Destroy Stock Return
Predictability?" Journal of Finance, vol. 71, no. 1 (February): 5–32.