Chapter Six - Price Action Lab Blog
Chapter Six
Fooled by Technical Analysis: The perils of charting, backtesting, and data-mining – Chapter Six
Quantitative trading
is a broad term that refers to the use of mathematical models to identify an
edge and profit from the market. At the institutional level,
quantitative trading includes algos for
statistical arbitrage, the design and pricing of derivatives, liquidity-seeking algorithms, portfolio
allocation, risk analysis, and high-frequency trading. At the individual trader or fund
management level, quantitative trading usually
refers to the use of an edge that has been
identified through the analysis of historical price data. The edge may be in the form of a single
indicator, a pattern that repeats in the data, or a more complex set of rules that combine several
indicators and patterns of price, volume, and even fundamental information about a security or a
group of securities. In this book, we deal with the latter definition of quantitative trading. There is
a large group of traders, called Level III technical traders, who have suffered significant losses
due to the wrong application of quantitative analysis. We will discuss some of the major causes of
the failure of these traders and then provide guidelines
for a transition to Level IV, which is the
group of expert quants.
Quant
traders emerged in the early 1980s after personal computers became available and
affordable. They coded their systems in some high-level programming language, such as Basic,
and then tested their performance on historical data. The author of this book used FORTRAN 77
and Pascal during that time to program and backtest price patterns after identifying them visually
on charts. Many tests during those earlier times provided solid evidence about the existence of
statistically significant edges for trading the markets. The work of a quant during that time was
tedious and time-consuming because every idea had to be coded almost from scratch. By the late
1980s, the first backtesting software packages became available for sale to retail traders. One of
the first such programs was System Writer Plus by Omega Research, Inc., a
company that in the
late 1990s changed its name to Tradestation Technologies. System Writer Plus offered a high-
level trading language, called EasyLanguage, that one could use to code trading ideas and test
them on a large database of historical data that came with the program. This was a significant
development that allowed traders with basic programming skills to become quants. It was also
the beginning of a new era that marked the birth of a new group of traders that were fooled by
technical analysis because they did not understand the pitfalls and limitations of backtesting.
Backtesting itself was not the main problem; rather, it was the lack of knowledge of its proper use
that contributed to the losses of many of its users.
During
the years that followed after the advent of backtesting programs, the market for such
software exploded and more chartists started transitioning to quant trading to determine whether
their methods possessed a quantifiable edge. Unfortunately, in the case of chart patterns, it was
soon realized that it was difficult to describe them objectively in a computer language, as well as
to implement associated trading methods. As a result, many chartists abandoned quant trading and
returned to their visual chart analysis. Those who preferred indicators
and mathematical models
continued using backtesting programs in trying to confirm or discover an edge. Gradually, two
separate large groups developed, with a small one in between: the first large group involved traders
who strictly adhered to charting, i.e., to the visual identification of trading opportunities, either
manually or with the help of a data scanner, and the second large group involved quant traders who
would only employ trading systems that were tested on historical data and also allowed
automation of signal generation. In between, a small group used both methods, switching back
and forth without any success. However, the new quant method had even more problems than the
chartist method and at the same time, the transition to an expert quant was as difficult as was the
transition to an expert chartist. Thus, backtesting
software offered no free lunch as anyone with
knowledge of how the markets operate and also of statistical analysis could have guessed from
the start. The result was the second generation of traders fooled by technical analysis but due to
backtesting this time. To understand why and how this happened, the details of the two main
methods used for discovering a trading edge via backtesting will be discussed next after a
brief
introduction to the subject.
What is backtesting?
Backtesting has also been called historical testing or historical simulation. The term
simulation refers to the process of driving a mathematical model with input and
observing its output.
Figure
6-1 shows how the process of backtesting works in principle at a high level. The
mathematical model of the trading system is driven by historical data input and the output is the
sequence of market entry (Ei) and corresponding exit (Xi)
levels. Then, the output is used in
conjunction with the input to calculate a set of performance parameters that are in turn analyzed
to decide whether the model qualifies as a viable trading system (see Ref. 3 for more details).
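The loop of Figure 6-1 can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation; the model interface and the performance parameters shown are assumptions made for the example:

```python
# Minimal sketch of the Figure 6-1 loop: a toy model maps a price history
# to "enter"/"exit" signals, the loop records entry (E_i) and exit (X_i)
# levels as trades, and a second function summarizes them into a few
# performance parameters. All names here are illustrative.

def backtest(prices, model):
    """Run `model` over `prices`; return the list of per-trade returns."""
    trades, entry = [], None
    for i in range(1, len(prices)):
        signal = model(prices[:i + 1])
        if signal == "enter" and entry is None:
            entry = prices[i]                     # E_i: market entry level
        elif signal == "exit" and entry is not None:
            trades.append(prices[i] / entry - 1)  # X_i: corresponding exit
            entry = None
    return trades

def performance(trades):
    """A few of the parameters typically analyzed after a backtest."""
    wins = [t for t in trades if t > 0]
    return {
        "trades": len(trades),
        "win_rate": len(wins) / len(trades) if trades else 0.0,
        "total_return": sum(trades),
    }
```

The output of `performance` is then compared against the developer's acceptance criteria, which is the selection step discussed later in this chapter.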
Some
naive quants may think that backtesting is a process of finding a system that will be
profitable in the future but that is a gross misconception. All one can achieve with backtesting is
to determine historical profitability. If one assumes that systems that did not perform well in the
past have a high probability of failing in the future in actual trading, then backtesting can facilitate
the rejection of such systems.
file:///mnt/ST_320/Download/Chapter Six – Price Action Lab Blog_files/default_005.htm 3/68
10/1/22, 4:52 PM Chapter Six – Price Action Lab Blog
As
it turns out then, if the performance from backtesting is acceptable, then the system has a
probability p to stay profitable and (1-p) to fail. Expert quants have devised methods of selecting
systems with a high probability p of staying profitable for an extended period, some of
which will
be discussed later in this chapter.
Table 6-1 provides a partial list of some of the most important parameters calculated during
backtesting.
Pitfalls of backtesting
There
are three major pitfalls of backtesting that could lead to misleading results even when a
trading system has been coded correctly and the simulation of its performance on historical data
is realistic and does not involve basic errors such as “forward-looking”, such as buying or selling
at the high or low of a bar, or other common programming mistakes. These are: backtester
limitations, historical data integrity, and the lack of an effect on price action.
Backtester limitations
The
first two pitfalls are discussed in more detail in Reference 3. Backtester limitations, or errors
in the software, can affect the reliability of the results in many different ways, some of which are
hard to detect. For example, when backtesting programs were first released for sale to the public,
some of them would not open a new position on a bar where an exit occurred. Due to this
limitation, or a bug depending on perspective, backtesting of mid-frequency positions or swing
trading systems often produced unrealistic results. The majority of users of the two most popular
backtesting programs used in the 1990s never became aware of these limitations as they assumed
that the results
were accurate because they thought these programs were tested thoroughly before
being released to the public. Only when intraday trading became popular were these errors fixed,
because they became obvious.
Historical
data integrity is an important issue. Backtesting results can be affected by data errors,
adjustments, or differences in price series provided by different vendors. Errors due to misprints
are often cleaned
by data vendors but the issue of adjusted vs. unadjusted data has not been
settled yet. For example, stock data are adjusted for both splits and dividends and futures data are
adjusted for rollovers. In all cases,
backtesting results using the adjusted and unadjusted series
may vary depending on
entry and exit logic. In the case of futures, the best way of performing
backtests is via the use of front contracts and handling the rollovers in the software. However,
this way of backtesting is cumbersome because it involves handling multiple price series and, as
a result, continuous, backward-adjusted data in the form of a single price
series is used instead.
The result is that certain types of exits produce unrealistic backtesting results, for example,
percent profit targets and stop-loss levels. In the case of equities, exit levels in points often
produce unrealistic backtest results when adjusted data for
dividends is used. In many cases, any deviation from what would have been obtained had actual
data been used is not large enough to cause a robust and profitable system to turn unprofitable,
but the results still do not correspond accurately to what actual trading would have produced. In
the case of forex, there
are no issues with data adjustments but results may vary significantly when using historical data
from different vendors because of the lack of a centralized exchange. It is suggested that forex
trading systems be backtested using data from different vendors to establish their robustness,
because the variations in performance can be significant.
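The asymmetry between percent-based and point-based calculations on adjusted data can be seen with a toy split adjustment. The function and the numbers below are purely illustrative, but they show why exits defined in points can behave differently on adjusted series while percent-based rules survive a multiplicative adjustment:

```python
# Sketch: a 2-for-1 split adjustment (multiplicative) preserves percent
# changes between bars but halves point changes, which is one reason
# point-based exits can give unrealistic backtests on adjusted data.
# Additive adjustments (dividends, futures rollovers) distort percent
# targets instead. Numbers are illustrative.

def split_adjust(prices, ratio):
    """Backward-adjust pre-split prices by the split ratio."""
    return [p / ratio for p in prices]

raw = [100.0, 104.0]          # pre-split prices as actually traded
adj = split_adjust(raw, 2.0)  # the same bars in a split-adjusted series

pct_raw = raw[1] / raw[0] - 1  # 4% move as traded
pct_adj = adj[1] / adj[0] - 1  # still a 4% move after adjustment
pts_raw = raw[1] - raw[0]      # 4 points as traded
pts_adj = adj[1] - adj[0]      # only 2 points in the adjusted series
```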
The
lack of an effect on price action is the most serious pitfall of backtesting and especially
affects intraday and high-frequency trading systems. This pitfall arises from the fact that when a
trading system is
backtested on historical data it is not an actual participant in the market and it
does not impact order book liquidity in any way. A simulated system cannot cause reactions from
other market participants as it happens in reality. The adverse effects of this pitfall are often taken
into account by including additional trading friction in the form of slippage for each trade
executed during the backtest, but that may not be sufficient in very fast timeframes because the
impact on performance ultimately depends on the nature of the trading system. In slower
timeframes, such as when backtesting with hourly or daily bars in
long holding periods, the effect
due to the lack of actual trading participation is minimal in liquid markets. Backtesting scalping
systems
using order book historical data or high-frequency systems using tick data leads to
unrealistic and at times misleading results because of the lack of an effect on price action.
Although high-frequency trading system backtesting is popular nowadays, most of those that do
it are being fooled by randomness unless they have a clear and tested mathematical model of how
their system affects price action. Thus, another model is required on top of a trading system
model in some cases, and the number of assumptions and parameters increases, together with the
danger of curve-fitting performance to historical data. Trading systems
tested on daily timeframes
and with data from liquid markets usually are not affected by their lack of an effect on price
action as long as the profit per trade is many times the security tick value and any stop-loss levels
are set far enough from the entry price.
Backtesting
is an integral part of system development at every level of quant trading and of the
two methods of discovering models that are discussed in more detail below.
A
trading edge was defined in Chapter Two as a set of rules, also referred to as a trading system,
that maintains a positive expectation during a sufficiently long period and delivers profits at an
acceptable risk level that usually exceeds what could be obtained by following a benchmark
index. There are two ways of discovering an edge: (a) conceiving a strategy and then backtesting
it and (b) discovering a strategy by data-mining historical data. These are also referred to as
trading system analysis and trading system synthesis, respectively [Ref. 7].
A
single strategy, or hypothesis, is a set of rules that can be tested on
historical data and then
applied to actual trading. For example, a simple strategy may involve entering a long position if
the 5-day moving average crosses above the 25-day moving average and then exiting the long
position and reversing to short when the opposite occurs. This strategy can be expressed in
mathematical form, as shown in Pseudo Code 6-1. A trading strategy is also referred to as a
trading model. A
complete system that can be used in actual trading based on some model
usually involves additional rules that determine position size and also incorporates rules for risk
control but the objective is to first determine whether the concept is sound and can deliver an
edge. Note that the terms trading strategy, trading model, and trading system are used
interchangeably throughout this book.
If MA(5) crosses above MA(25) then Buy at the close //buy signal
If MA(5) crosses below MA(25) then Short at the close //short signal
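A minimal Python rendering of Pseudo Code 6-1 might look as follows. This is a sketch under simplifying assumptions (signals execute at the close of the bar on which the cross occurs, no position sizing, no costs), not a production implementation:

```python
# Stop-and-reverse moving-average cross, per Pseudo Code 6-1.
# Position is +1 (long), -1 (short), or 0 (flat before the slow MA exists).

def sma(values, n):
    """Simple moving average of the last n values."""
    return sum(values[-n:]) / n

def ma_cross_positions(closes, fast=5, slow=25):
    """Return the position (+1/-1/0) held after each close."""
    positions, pos = [], 0
    for i in range(len(closes)):
        if i + 1 >= slow:                  # need slow-MA history first
            f = sma(closes[:i + 1], fast)
            s = sma(closes[:i + 1], slow)
            if f > s:
                pos = 1                    # buy signal at the close
            elif f < s:
                pos = -1                   # short signal at the close
        positions.append(pos)
    return positions
```

Holding the position as state makes the rule stop-and-reverse: once long, the system stays long until the fast average falls below the slow one, at which point it flips short.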
Figure
6-2 shows the main steps that are followed for a single strategy analysis. The strategy is
first backtested on a portion of the available
data, called the in-sample, which usually covers 75%
of the available history of the price series. This step may also involve optimization of any
parameters of the model but this is not necessary as some trading models do not have any
parameters that can be optimized (for example, the close of today greater than the high of 10 days ago.)
A
selection is made if the model generates acceptable performance results
in the in-sample,
otherwise, the model is discarded. If selected, then the model is tested on the out-of-sample with
the parameter values that were determined during the first step. If the performance in the out-of-
sample is satisfactory, then it is accepted, otherwise, it is discarded. There is no consensus about
what constitutes acceptable out-of-sample performance. Some have argued that if the out-of-
sample performance based on a metric such as the compound annual return is at least half of that
obtained in the in-sample, then the model is acceptable. Others look for out-of-sample
performance that is at least as good as that from the in-sample. There are other more complex
criteria one could use, such as a metric that combines the return, drawdown, and other parameters
calculated during the backtest, including
the Sharpe ratio and the win rate to name a few.
However, in most cases, the choice of criteria to judge out-of-sample performance may not
matter
because this process is problematic to start with.
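As an illustration, the 75/25 split and the "out-of-sample at least half of in-sample" acceptance rule mentioned above can be sketched as follows; the mean trade return stands in for whatever metric, such as compound annual return, a developer prefers:

```python
# Sketch of the Figure 6-2 selection step: split the trade history into
# in-sample (first 75%) and out-of-sample (last 25%), then accept the
# model only if the out-of-sample metric is at least half the in-sample
# metric. The metric and threshold are illustrative choices.

def split_history(returns, in_frac=0.75):
    cut = int(len(returns) * in_frac)
    return returns[:cut], returns[cut:]

def accept(returns, min_ratio=0.5):
    ins, oos = split_history(returns)
    m_in = sum(ins) / len(ins)     # in-sample mean trade return
    m_oos = sum(oos) / len(oos)    # out-of-sample mean trade return
    return m_in > 0 and m_oos >= min_ratio * m_in
```

Raising `min_ratio` toward 1 implements the stricter view that out-of-sample performance should be at least as good as in-sample performance.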
If
the performance of the model in the out-of-sample is not acceptable, then it is either modified,
or discarded, or a new strategy is considered. This is the point where this method of discovering
trading systems becomes problematic because data-snooping bias is introduced, i.e., the
developer knows already how some particular model performed in
the out-of-sample, and then
based on that information she attempts to improve her chances of coming up with a better model.
In addition to that, the process involves model selection after the in-sample backtest,
as shown in
Figure 6-2. The combination of curve-fitting, if any is involved, with selection and data-snooping
leads to what is commonly referred to as data-mining bias: any accepted results (the final
selection as shown in Figure 6-2) ignore the fact that previous results were rejected.
Data
mining is a dangerous practice because as the number of trials to find an edge increases, the
probability of accepting a model that possesses no market timing ability but performs well in the
out-of-sample just by chance tends to 1 and, after enough trials, acceptance of a spurious model becomes practically certain. Thus,
the
result of following the procedure in Figure 6-2 in an iterative fashion, i.e., by testing and rejecting
models based on out-of-sample performance, increases data-mining bias and the probability of
the generation of spurious correlations. There are ways of mitigating these negative effects that
are used by expert quants and discussed later in this chapter, but naive quants, or Level III
technical traders according
to this book, do not employ them because they either underestimate
the negative impact of data-mining bias or are simply unaware of it. Level III quants that use
backtesting frequently are likely to get fooled by randomness and spurious correlations because
data-mining bias increases with every new backtest and gradually rises to high levels. Even in the
case of a single independent backtest, there is the possibility that the
strategy that was analyzed
was at some point discovered via data mining. This is the case with most strategies that are
included in trading publications. Using such strategies as an input to the process in Figure 6-2
will amplify the problems and most likely lead nowhere.
Naive
backtesting has resulted in a massive wealth transfer from Level III technical traders to
market professionals and other well-informed or highly skilled market participants, such as Level
II chartists and Level
IV quants. Yet, this practice is rarely challenged in technical analysis books; instead, it is highly
promoted since no alternatives are presented. A huge industry has developed around backtesting
that, beyond posting a few disclaimers in the fine print, hesitates to educate its customers about
its dangers for fear of losing them. However, it is an empirical fact that
discovering an edge via backtesting and by a process such as the one shown in Figure 6-2, leads
to a dead end unless data-mining bias is minimized. Someone with little trading experience but
with a fresh idea about a strategy that can offer
an edge has better chances than someone with a
lot of experience with the markets who has spent several years looking for an edge via
backtesting, because the data-mining bias in the former case is small whereas in the latter it is huge.
Figure 6-3 shows a schematic of how the process of discovering a market edge can be automated.
The
automated process of discovering a trading edge is exploratory and based on some criteria
for generating strategies combined with automation settings. The criteria may involve things such
as the rules of arithmetic and Boolean operators, basic rules of technical analysis indicators, risk
and money management methods, performance objectives in
the form of metrics, and ways of
defining an objective function to optimize. The automation settings may involve things such as
the maximum
number of rules in a strategy, the maximum number of free parameters in
a model,
and the number of iterations to perform to combine rules. The strategy generation may be
accomplished by random permutation of rules or via some more sophisticated algorithm based on
neural networks or principles from evolutionary biology, including genetic programming
algorithms.
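The strategy-generation step of Figure 6-3 can be illustrated with a toy random combiner of rules. The rule strings and the cap on rules per strategy below are illustrative stand-ins for the generation criteria and automation settings described above, not an actual generator:

```python
# Sketch of random-permutation strategy generation: candidate strategies
# are random combinations of simple comparison rules joined by a Boolean
# operator, capped by an automation setting (max_rules). The rule names
# are illustrative placeholders for real indicator rules.
import random

RULES = [
    "close > sma(close, 20)",
    "close < low_10_days_ago",
    "rsi(14) < 30",
    "high > high_5_days_ago",
    "volume > sma(volume, 50)",
]

def generate_strategy(max_rules=3, rng=random):
    """Pick 1..max_rules distinct rules and join them with AND."""
    n = rng.randint(1, max_rules)
    return " and ".join(rng.sample(RULES, n))
```

With even a modest rule set, the number of distinct combinations grows combinatorially, which is exactly why the number of trials, and with it the data-mining bias, explodes in synthesis.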
After
a strategy is generated by the procedure described in Figure 6-3, it is
backtested on an in-
sample of data and if any parameters are involved they are optimized in the process, based on
defined objectives, something that often introduces curve-fitting. If the results of the backtest are
not satisfactory in the sense of the performance objectives
defined, then a new strategy is
generated after modifying the previous one, which is a process that introduces selection bias.
When convergence
occurs in the sense of the particular objective function used, the best
strategy
is tested in an out-of-sample and if it passes the performance
tests it is added to a list of best
performing models, a step that introduces additional selection bias. If the information that a
particular strategy is not performing satisfactorily in the out-of-sample is used to modify the
strategy generation process, or worse to restart it, then data-snooping bias is introduced in
addition to selection bias. The combination of any curve-fitting, selection, and data-snooping is
what leads to what is commonly referred to as data-mining bias:
any accepted results (the final
selection as shown in Figure 6-3) ignore the fact that many other results were rejected and the
probability of discovering spurious models that perform well on the out-of-sample is high.
Note
that the process in Figure 6-3 has an advantage over the process in Figure 6-2 where the
developer tests new strategies sequentially until one is found that performs well in the out-of-
sample. In analysis, data-snooping bias is necessarily introduced because the human developer
is
aware of the out-of-sample performance results along the way to a final system, while in
synthesis data-snooping will not be introduced unless the developer elects to modify or restart
strategy generation based on out-of-sample results. Although this appears to be an advantage
of
automated data-mining methods for developing trading systems, it turns out that it is not a major
one because such a method may generate millions or even billions – and in certain cases
depending on the number
of rules used trillions – of strategies to backtest and the data-mining
bias due to selection is so high that the absence of data-snooping bias
has a minor positive effect.
A human developer is only able to test a limited number of distinct strategies in a lifetime but an
automated data-mining process can test a very large number of strategies in a matter of a few
hours. There are ways of reducing the data-mining bias and increasing the probability of a final
system not being a fluke but that is what expert quants try to achieve by following certain
procedures and guidelines that are discussed later in this chapter. The majority of Level III quants
that use data-mining software to discover a
trading edge are getting fooled by randomness
because the probability of finding spurious correlation is for all practical purposes equal to 1.
Many
quants have been fooled by efforts to present statistical hypothesis testing as a solution to
the problem of discovering a significant edge. Simple hypothesis testing is not useful when the
edge is found via multiple comparisons because of data-mining bias (Ref. 2.) Hypothesis testing
is not useful also in the case of a single independent test and thus not useful at all, as already
mentioned in Chapter Two. Proposed methods for hypothesis testing presented in the literature
involve the generation of a sampling distribution of a statistic, usually, the mean return, using
procedures such as bootstrapping or Monte Carlo simulation
(Ref. 10.) In the case of multiple
comparisons, proposed methods for hypothesis testing involve adjusting p-values by a suitable
factor. The p-value is the probability of obtaining profitability at least as extreme as the observed
profitability of a trading system given that the
null hypothesis that the trading system is not
profitable is true. For example, if the p-value is equal to 0.01, it means that there is one chance in
100 of obtaining the observed profitability assuming that the null hypothesis of zero profitability
is true.
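A bootstrap p-value of the kind described above, together with a crude multiple-comparisons adjustment of the Bonferroni type, can be sketched as follows. Centering the trades at zero imposes the null hypothesis of no profitability; the function names and the resampling details are illustrative assumptions, not a prescription:

```python
# Bootstrap sketch: resample trade returns after centering them at zero
# (the null of no profitability) and count how often the resampled mean
# is at least as large as the observed mean. With n_tests strategies
# examined, a Bonferroni-style adjustment multiplies the p-value by
# n_tests (capped at 1).
import random

def bootstrap_p_value(trades, n_boot=2000, rng=random):
    mean_obs = sum(trades) / len(trades)
    centered = [t - mean_obs for t in trades]   # impose the null
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(centered) for _ in trades]
        if sum(sample) / len(sample) >= mean_obs:
            hits += 1
    return hits / n_boot

def adjusted_p(p, n_tests):
    """Crude multiple-comparisons correction."""
    return min(1.0, p * n_tests)
```

Note how quickly the adjustment erodes apparent significance: a raw p-value of 0.01 found among 50 tested strategies adjusts to 0.5, which is no evidence of an edge at all.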
Hypothesis
testing is not useful because the actual null hypothesis is not simply that the system has zero
profitability, or that it is random and non-intelligent, but that any of those hold given the
specific results obtained. If market conditions do not change, then the results do not change and a
statistically significant edge will persist. However, if market conditions change, then the
performance results will differ and as a consequence, the sampling distribution will have different
mean and variance. If the sampling distribution parameters change substantially due to changing
market conditions, then the edge may no longer be significant. For example, in the presence of
adverse market conditions, such as those that developed during the financial crisis of 2008, an
edge may lose its significance, and this is actually what happened to many algos that were
profitable before that specific event. Therefore, the use of hypothesis testing implicitly assumes
large samples to guarantee stability, but this is often not the case when it is applied. In addition,
even when a large sample is available, naive quants employing
hypothesis testing do not often consider the impact of data mining. When market conditions
change substantially from those in the in-sample and out-of-sample, the system may fail
regardless of a low p-value, even when adjusted for multiple comparisons.
Hypothesis
testing can be useful in rejecting unprofitable ideas at the risk of also rejecting some
potentially good ideas but it cannot be used for rejecting the null hypothesis of zero profitability
because any inferences made are conditioned on the obtained results, which may not be
representative of the population. For example, a trend-following system that performs well in the
in-sample and out-of-sample that both consist of long trends in prices with minor corrections will
not perform
well in forward trading during a long period of sideways action. This system may
show high significance and low p-value in the out-of-sample but then generate large losses in the
forward sample where there is no trend. Market wizards of the 1980s knew these realities of the
markets but the mathematical jargon that was introduced after academia got involved with trading
has caused some level of confusion with the introduction of terminology like p-values and
t-statistics, multiple comparisons, bootstrapping, false discovery rate, etc., when in reality the key
is that all these tests fail if there is a significant change in
market conditions. Trading system
development cannot be elevated to the
status of an academic subject because it is also an art. If
trading system development were only a mathematical problem, then the math giants
of the 19th
century would have already solved it and there would be nothing new to say about it.
Getting
fooled by hypothesis testing during trading system development is very common
amongst Level III naive quants who embrace statistics more than common sense but lack in-depth
understanding of the factors that render those statistics inapplicable. Fortunately, several practical tests can
be performed to maximize the probability that a trading system will stay profitable, as we shall
see later in this chapter. But anyone relying solely on statistics to get the job done will be greatly
disappointed.
Data
mining of random data can result in trading systems that are profitable
in the out-of-sample
but are spurious correlations and do not possess any predictive ability. For example, consider the
equity curve in Figure
6-4 that was generated by tossing a coin with a payout equal to +1 for
heads and -1 for tails. The in-sample test covers about the first 630 trades and the rest of the
trades are from the out-of-sample test that was used for validation. The lesson here is that if many
equity curves are generated by mining the data, the probability of finding one that passes the
validation test is equal to 1. Therefore, out-of-sample tests
are for the most part useless for
validating results that arise from strategy data-mining but may have some value in the case of a
single strategy that is unbiased. However, most naive quants rely on out-of-sample validation
tests that, in addition to depriving the in-sample of important market conditions, are useless as
confirmatory tests because of the exploratory nature of the process.
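The coin-toss lesson of Figure 6-4 is easy to reproduce. The sketch below generates many random +1/-1 trade sequences, keeps those profitable in the first 75% of trades, and checks how many also look profitable in the remainder; with enough candidates, survivors always appear even though none has any predictive ability. The counts and seed are illustrative:

```python
# Many coin-toss "strategies" (+1/-1 per trade): select on the in-sample
# (first 75% of trades), then "validate" on the out-of-sample (last 25%).
# Roughly a quarter of purely random curves pass both hurdles.
import random

def coin_toss_curve(n_trades, rng):
    return [rng.choice((1, -1)) for _ in range(n_trades)]

def passes_validation(trades, in_frac=0.75):
    cut = int(len(trades) * in_frac)
    return sum(trades[:cut]) > 0 and sum(trades[cut:]) > 0

rng = random.Random(42)                 # fixed seed for reproducibility
candidates = [coin_toss_curve(200, rng) for _ in range(1000)]
survivors = [c for c in candidates if passes_validation(c)]
```

Every survivor here is a spurious correlation by construction, which is the sense in which out-of-sample validation fails as a confirmatory test for data-mined strategies.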
Even
in the case that analysis (back-testing a single strategy) or synthesis
(data-mining) involves
validation with data from several other markets,
when the number of iterations, or multiple
comparisons, gets extremely large, the probability of finding a spurious system that will validate
on all markets approaches 1. The same is true with the method that has been promoted in recent
years as a way of determining the best parameters to use in a system and goes by the names
walk-forward analysis or walk-forward optimization. This method involves
dividing the available data
into multiple in-sample and out-of-sample periods, optimizing in the former and walking forward
in the latter. Thus, this method involves not one in-sample backtest followed by an out-of-sample
validation test but several in-sample backtests each followed by out-of-sample tests. The claim is
that this is how traders use systems in real life, by re-optimizing their parameters and then using
the new parameter set forward. In the context of the two methods for discovering edges already
discussed, the analysis and synthesis, walk-forward optimization involves an added number of
iterations at the backtesting level and it does little, or even nothing, in reducing the data-mining
bias introduced by multiple comparisons. This validation test may have some value in the case of
testing a single strategy just once, on data used only once, but as long as a rejection of a strategy
occurs and a new or modified one is tested, walk-forward optimization becomes part of a process
that is plagued by data-mining bias regardless
of how the data is divided into multiple in-samples
and subsequent out-of-samples. As is the case with simple backtesting, walk-forward
optimization can be employed for supporting the hypothesis that a system
that does not perform
well in the past will also not perform well in actual trading but it cannot be used to support the
alternative hypothesis, i.e., that the system will perform well in the future.
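The walk-forward mechanics described above can be sketched as follows; the toy objective function and the threshold parameter are illustrative stand-ins for a real model's parameters, not anyone's actual method:

```python
# Walk-forward sketch: slide consecutive in-sample/out-of-sample windows
# over the trade history, pick the best parameter on each in-sample
# window, and apply it to the following out-of-sample window.

def window_score(returns, param):
    """Toy objective: total return of trades kept when |r| >= param."""
    return sum(r for r in returns if abs(r) >= param)

def walk_forward(returns, params, is_len, oos_len):
    oos_results = []
    start = 0
    while start + is_len + oos_len <= len(returns):
        ins = returns[start:start + is_len]
        oos = returns[start + is_len:start + is_len + oos_len]
        best = max(params, key=lambda p: window_score(ins, p))  # optimize in-sample
        oos_results.append(window_score(oos, best))             # walk forward
        start += oos_len                                        # slide the window
    return oos_results
```

Each pass through the loop is one more in-sample optimization, which is why repeating this procedure over the same history adds trials, and thus data-mining bias, rather than removing it.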
More
importantly, walk-forward optimization is primarily used for determining the best
parameters to use in the next out-of-sample period (forward sample) and this assumes that all
samples are representative and drawn from the same population. However, determining the best
parameters to use is not the fundamental problem of trading system development. It is an
important issue but a secondary problem and it is desirable if possible not to have any parameters
at all in the model to avoid over-fitting and its contribution to data-mining bias. A more difficult
problem arises from the fact that the historical data used may
generate trade samples that are not
representative of the population but then during actual trading conditions arise that were not
present during the analysis. Therefore, although walk-forward analysis may be a good exercise
for Level III quants and a nice feature of trading programs for marketing purposes, it adds little to
solving the fundamental problem of trading systems developed via the use of backtesting because
if this process is used repetitively with the same historical data, then spurious correlations may
arise due to data-mining
bias.
Monte Carlo simulations are used extensively by technical traders to determine the robustness of systems developed via analysis or synthesis, and even as part of the validation stage. These simulations are also used to determine capitalization requirements. Monte Carlo simulation usually involves varying the optimized system parameters and the exit parameters, or even adding noise to the data, and obtaining distributions of key statistics that then provide confidence intervals for the performance of the system, especially for its maximum equity drawdown, which is a key parameter in calculating capitalization requirements. However, it turns out that the robustness of over-fitted systems, especially those produced by neural networks and genetic algorithms, is high with respect to parameter variations. Therefore, Monte Carlo tests of optimized and over-fitted systems tend to be optimistic, and this is another reason they have not helped Level III quants in discovering edges. In addition, the use of Monte Carlo simulation suffers from the same problem as hypothesis testing: the calculated confidence intervals are based on limited samples of trades that may not be representative of the population. Another problem with Monte Carlo tests is that when they become part of an iterative process of discovering trading systems, by either analysis or synthesis, data-mining bias is still present, and a spurious system can be found when the number of strategies tested becomes large. Especially in the case of testing billions or even trillions of systems to obtain a list of best performers, as in Figure 6-3, Monte Carlo analysis turns out to be ineffective.
If T[0] is the original vector of trade results, then k new vectors may be generated by reshuffling it. The resulting k+1 vectors of trade results and corresponding equity curves can be used to assess the potential of the system, its risk, drawdown, etc. The rationale behind this method of analysis is that T[0] represents just one particular order of trades, and naturally the order could have been different; reshuffling the results is employed to arrive at the different possibilities. The main assumption here is that all outcomes in T[0] are independent, in the same way that one coin toss is independent of another. If the independence condition is violated, then this particular method of assessing risk does not apply and may produce misleading results.
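The reshuffling method just described can be sketched as follows; the trade results and function names are illustrative, and the sketch assumes the independence condition holds:

```python
import random

def max_drawdown(trades):
    """Largest peak-to-trough drop of the cumulative equity curve."""
    equity = peak = worst = 0.0
    for t in trades:
        equity += t
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst

def reshuffle_drawdowns(trades, k, seed=0):
    """Drawdowns of k reshuffled orderings of the same trade results."""
    rng = random.Random(seed)
    draws = []
    for _ in range(k):
        shuffled = trades[:]
        rng.shuffle(shuffled)  # a new ordering of T[0]
        draws.append(max_drawdown(shuffled))
    return draws

trades = [120, -80, 200, -50, 90, -130, 60, 40, -70, 150]  # hypothetical P/L per trade
draws = reshuffle_drawdowns(trades, 1000)
print(min(draws), max(draws))  # range of drawdowns across orderings
```

The spread between the smallest and largest drawdown across orderings is what feeds the confidence intervals mentioned above; if the trades are not independent, this spread is meaningless.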
Monte Carlo simulations may generate misleading results when the following conditions are true:
4. The systems take trades that depend on the outcome of other trades
The independence assumption is directly violated in cases 2, 4, 5, and 6 above, and indirectly but significantly in case 3.
All
validation schemes and robustness tests lose their effectiveness when the data-mining bias is
large, something that occurs when reusing historical data to test many strategies. In the next
section, we will discuss ways of minimizing the data-mining bias so that the chances of finding
robust systems with an edge increase. We will achieve that by minimizing each component of
data-mining bias separately.
An expert quant, also called a Level IV technical trader in Chapter One of this book, understands the limitations of backtesting and how its repetitive use on historical data increases data-mining bias and the probability of discovering spurious correlations. Expert quants accept backtesting as an indispensable tool for developing trading systems but use various rules, guidelines, and procedures to minimize its pitfalls and keep the adverse effects of data-mining bias to a minimum. This task is often complicated and time-consuming, but it is what allows separating statistical flukes from actual trading edges; as a result, the specifics are often kept proprietary and constitute a higher-level edge, or trading meta-edge. This, in a nutshell, is the large gap that separates expert quants from quants who use backtesting programs in naive ways.
In
this section, some specific rules, procedures, and guidelines are offered that could allow a
trading system developer to transition from Level III to Level IV. Based on 30 years of trading
system development experience, the author of this book believes that although these may not
constitute a formal or complete method for arriving at the final goal, which is the discovery of a
trading edge, they nevertheless offer a starting point. It would be ludicrous to claim that a unique formal method exists that applies in all cases, and no such claim is made in this book.
Instead, the reader should question the validity of methods that are presented as a panacea to this
problem. Such methods are often the result of a lack of experience with actual trading and system
development and at times reflect naive attempts to apply simple concepts from statistics where
they have limited effectiveness.
There are two key components to the expert quant methodology that are presented in more detail
in this section:
Note that some may wrongly assume that data-mining bias can be eliminated or even quantified. This bias is inherent in the process of developing trading systems via repeated backtesting, manual or automatic, and it is a qualitative aspect with several quantitative effects. Millions of technical traders around the world have used data mining, and many still do. Magazine articles, blog posts, and academic papers have attempted to improve trading systems that someone else published in the past. Any subsequent attempt to improve data-mined systems increases the bias due to the reuse of the data they were developed on. Furthermore, when a trader conceives an idea to test, it is quite possible that it was already data-mined in the past by someone else and then used, or even published in some form. Therefore, coming up with original ideas to test, or having a data-mining process that can look for unique ideas, is half the battle.
Recall from Figures 6-2 and 6-3 that three main processes contribute to data-mining bias:
Optimization and curve-fitting
Model selection
Data snooping
Below we consider ways of minimizing the contribution of each of these processes to data-mining bias.
In mathematics, curve fitting is the process of finding a curve that best fits a collection of data points, in the sense that some objective function, subject to constraints, is maximized (or minimized). For example, least squares is a curve-fitting method that minimizes the sum of squared residuals. A residual is the difference between a fitted and an actual value. Thus, the objective function to minimize when using this method is the sum of the squared residuals. Optimality can only be defined relative to an objective, and curve fitting is essentially the result of optimization.
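For instance, a least-squares line fit minimizes the sum of squared residuals; a minimal sketch with made-up data points:

```python
# Least-squares line fit: choose slope a and intercept b that minimize
# the objective sum((y_i - (a*x_i + b))**2), the sum of squared residuals.
# Data points are made up for illustration.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.1, 4.9, 7.2, 9.0]   # roughly y = 2x + 1 with small noise
a, b = fit_line(xs, ys)
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
print(round(a, 2), round(b, 2))  # → 2.01 1.02
```

The fitted line recovers approximately the slope and intercept used to generate the data; the residuals are what the objective function penalizes.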
There has been some confusion in the trading literature about the notion of curve fitting as it applies to trading system development, probably because of attempts to over-generalize the term. A trading system is a process that generates a sequence of entry and exit signals. Usually, the algorithm, or model, of a trading system includes a set of variables referred to as the system parameters. One of the objectives of trading system development is to select the values of these parameters so that the system performs best under actual market conditions. It is common practice to set these parameters by backtesting the system on historical data so that some objective function, usually referred to as a metric, is maximized (or minimized). For example, one can set the parameters so that the net profit is maximized, the maximum drawdown is minimized, or some linear or non-linear function of these two, or of any other performance parameters, is maximized or minimized.
What is essentially achieved when system parameters are adjusted via backtesting is that the timing of the signals is varied so that they are fitted to the historical data in a way that optimizes an objective function. This is not curve fitting in the traditional sense, because one is not merely trying to find a curve that best fits the historical data but instead determining the best sequence of entry signals that, in conjunction with the exit signals, maximizes some objective. This process is much more involved than simple curve fitting because it includes the selection of entry and exit signals that maximize an objective function. It is a highly non-linear optimization problem rather than just a curve-fitting problem. As already mentioned, curve fitting usually involves optimization, but the latter is a process with a much broader scope that includes many more possibilities.
As an example, let us consider a simple moving average crossover system that generates long entry signals when SMA(n1) > SMA(n2) and short entry signals when SMA(n1) < SMA(n2), where n1 and n2 are the moving average periods, with n2 > n1, as shown in Pseudocode 6-2. This is a stop-and-reverse system, i.e., when an opposite signal is generated, the previous position is closed and reversed. This system cannot be used in practice unless the values of n1 and n2 are selected. One could select the values via optimization of performance, after backtesting the system on historical data.
If MA(n1) crosses above MA(n2) then Buy N shares at the close //buy signal
If MA(n1) crosses below MA(n2) then Short N shares at the close //short signal
Pseudocode 6-2. A moving average crossover system with two free parameters
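The stop-and-reverse logic of Pseudocode 6-2 can be sketched in Python; the synthetic price series and function names below are illustrative, not part of the book's code:

```python
# Minimal stop-and-reverse SMA crossover per Pseudocode 6-2:
# long when SMA(n1) > SMA(n2), short when SMA(n1) < SMA(n2), with n2 > n1.
# Prices are synthetic; this illustrates the signal logic only.

def sma(prices, n, i):
    """Simple moving average of the n closes ending at index i."""
    return sum(prices[i - n + 1:i + 1]) / n

def crossover_positions(prices, n1, n2):
    positions = []  # +1 long, -1 short, 0 flat (before the first signal)
    pos = 0
    for i in range(len(prices)):
        if i >= n2 - 1:  # both averages are defined from here on
            fast, slow = sma(prices, n1, i), sma(prices, n2, i)
            if fast > slow:
                pos = 1
            elif fast < slow:
                pos = -1
        positions.append(pos)
    return positions

prices = [100 + 0.5 * i for i in range(30)]   # steady synthetic uptrend
pos = crossover_positions(prices, 5, 15)
print(pos[-1])  # → 1 (the uptrend keeps the fast SMA above the slow one)
```

Note how the system is flat until both averages exist, which is why the equity curves in the figures below start flat.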
Let us suppose that on January 1, 2013, the trading system developer thinks that {n1=10, n2=35} are good values for the moving average crossover system for trading SPY. He then backtests the system with these initial values on SPY data for 2013. (Note that the data sample is small, but this is an example to illustrate the concept.) The resulting equity curve is shown in Figure 6-5.
Figure 6-5. Backtest results for the system in Pseudocode 6-2 with n1 = 10 and n2=35 during 2013.
Created with Amibroker – advanced charting and technical analysis software: www.amibroker.com
The net profit of the system with a $0.01 commission per share is shown in the detailed trade report in Figure 6-6 and is equal to $442.77. The initial capital is $100K. Four trades were generated, with the last one being an open long and the only profitable one. Note that a trade is generated only when the 10-day moving average crosses above or below the 35-day moving average, and thus there is no trade for the first half of the year in this example, as shown by the flat equity curve in Figure 6-5. CAR is 0.44% and the maximum drawdown is -13.78%.
Figure 6-6. Trade-by-trade backtest results for the system in Pseudocode 6-2 with n1 = 10 and n2=35. Created with Amibroker – advanced charting and technical analysis software: www.amibroker.com
Next, the system developer decides to find the optimum values of n1 and n2 that maximize the net profit. He uses a genetic optimization algorithm that comes with his backtesting program and finds that the values n1=5 and n2=15 generate better results, as shown in Figures 6-7 and 6-8.
Figure 6-7. Backtest results for the system in Pseudocode 6-2 with n1 = 5 and n2=15 for 2013
The total net profit of the system with the optimum parameters and a $0.01 commission per share traded is shown in the detailed trade report in Figure 6-8 and has increased to $2,267. In this case, 18 trades were generated, with the last one being an open long. Note that a trade is generated only when the 5-day moving average crosses above or below the 15-day moving average, and thus there is no trade for the first two months, as shown by the flat equity curve in Figure 6-7. CAR is 3.27% and the maximum drawdown decreased to -5.47%.
Figure 6-8. Trade-by-trade backtest results for the system in Pseudocode 6-2 with n1 = 5 and n2=15. Created with Amibroker – advanced charting and technical analysis software: www.amibroker.com
It may be seen from the above simple example that optimization (and curve fitting) essentially amounts to determining the parameters that allow the selection of a system that fulfills certain objectives over the backtest period. This process introduces bias because the selected system and associated parameters may not reflect an edge but only survival due to chance, i.e., when market conditions change, the system will be unable to perform as expected and it will fail. This is the case with this particular example, as backtesting the system on data for 2012 results in a failure with a CAR of -17.60%, as shown in Figure 6-9. Thus, if the trader is lucky and the same conditions as in 2013 continue during 2014, the system will generate a profit, but this system lacks any ability to adapt to changing market conditions. Note that had the system been optimized using data from 2012, the best parameters would be n1=10 and n2=40, with 6 trades generated, a 100% win rate, and a CAR of 12.17%. But these parameters would have generated negative performance during 2013, with a CAR equal to -0.27%. This shows that the results of sequential optimization methods, such as walk-forward optimization, depend on the path a system follows and are not inherently more robust.
Figure 6-9. Backtest results for the system in Pseudocode 6-2 with n1 = 5 and n2=15 in 2012
From
the simple moving average crossover example with limited samples for illustration
purposes only, the following conclusions may be derived about the impact of optimization and
curve-fitting:
The result of the above two effects is the introduction of selection bias into the trading system development process. To minimize this bias, we next provide a classification of trading systems based on how they are affected by optimization and curve fitting.
Type I trading systems: When the parameters of Type I systems are adjusted, the timing of both the entry and the exit signals is affected. An example of a Type I trading system is a simple moving average crossover system, such as the one in Pseudocode 6-2. In this case, optimization and curve fitting result in different sequences of entry and exit signals, and selecting the one that performs best increases data-mining bias. These systems have the highest probability of failure in forward trading because the selection ignores variations that have failed or performed poorly.
Type II trading systems: When the parameters of Type II systems are adjusted, only the timing of their entry signals is affected. In this case, optimization and curve fitting result in collections of entry and exit signals that differ only in their entry part. A final selection usually introduces less bias compared to Type I systems. Example: Enter long if SMA(n1) > SMA(n2) and exit long at P1 or P2, where P1 and P2 are fixed price levels. Note that Type II systems are rarely used in practice.
Type III trading systems: When the parameters of Type III systems are adjusted, only the timing of the exit signals is affected. In this case, optimization and curve fitting result in sequences of entry and exit signals that differ only in their exit part. Example: Enter long if the close of today > the close of 2 days ago and exit long at entry price + X points or at entry price – Y points, where X and Y are the parameters to optimize (the profit target and stop-loss).
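A Type III system of exactly this form (entry rule fixed, only X and Y varying) can be sketched as follows; the price series and function names are illustrative:

```python
# Type III sketch: the entry rule is fixed (close > close of two days ago);
# only the exit parameters X (profit target) and Y (stop-loss) vary, so
# the entries of the trade sample stay the same across optimizations.

def type3_trades(closes, X, Y):
    """Return per-trade P/L in points; one position at a time."""
    trades, entry = [], None
    for i in range(2, len(closes)):
        c = closes[i]
        if entry is None:
            if c > closes[i - 2]:
                entry = c                          # long entry at the close
        elif c >= entry + X or c <= entry - Y:
            trades.append(c - entry)               # exit at target or stop
            entry = None
    return trades

closes = [10, 10.5, 11, 10.2, 10.8, 11.5, 11.0, 12.0, 11.2, 12.5]  # synthetic
trades = type3_trades(closes, X=1.0, Y=0.5)
print(len(trades))  # → 3 (an open position at the end is not counted)
```

Changing X and Y changes only where each trade closes, not where it opens, which is what keeps the selection bias of this system type comparatively low.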
In general, systems that include lagging indicators of price are Type I. Type II systems are not common, especially in trending markets such as equities. Type III systems include a broad class of systems based on price patterns of the open, high, low, and close. Working exclusively with Type III systems minimizes the effect of curve fitting and optimization in the sense that the trade sample does not vary, and selection bias is minimized because only the profit target and stop-loss values are optimized. The impact of optimization on the exit part can be minimized if, during the trading system development phase, sufficiently small values are used that avoid fitting the exits to the data and only allow establishing whether a system has market timing ability. The actual magnitudes of the exit values depend on the market traded, and they should be different for intraday and position trading systems. For example, for position and swing systems on daily data, profit-target and stop-loss values in the range of 0.5% – 3% may be used in some equities and ETFs. Then, after a system is developed that can successfully time the market, different exits can be used to increase the profit potential, such as trailing stops.
The use of exits that fit market data, such as trailing stops, should be avoided during the trading system development phase because the objective should be to discover an edge that can time price action properly, not to curve-fit some algo to past data. This is a common error of Level III quants, and especially of those who use off-the-shelf programs that discover trading systems through the use of neural networks, genetic algorithms, or permutation methods. This is also an error made by those who believe that market entries are not important as long as one can ride a trend.
The use of a genetic programming engine by a Level III quant, along with a bunch of indicators and some exit rules, including trailing stops, average true range, moving average crossovers, etc., is a certain path to failure. The data-mining bias is so large that the best systems obtained have a probability close to 1 of being flukes. This probability can be reduced if indicators that have no parameters are used, or if only one class of similar indicators is considered, such as moving averages, in a system with a maximum of two free parameters. Even when indicators with no parameters are used, for example price patterns, the use of exits that can fit the system signals to the data should be avoided. Figure 6-10 shows the equity curve of a system for trading SPY developed on data from 01/03/2003 to 12/29/2006 using a program based on genetic programming. The code of the best-performing system is shown as Pseudocode 6-3. The system generates an entry at the open of the next day when the low of six days ago is higher than the open of 18 days ago. A 10% trailing stop and a 2% stop-loss based on the entry price are used for exits. The equity performance of this system over the mentioned period is shown in Figure 6-10.
If the low of 6 days ago is greater than the open of 18 days ago then Buy N shares at the open of the next day //entry signal
Exit long with a 10% trailing stop or a 2% stop-loss from the entry price //exit rules
Pseudocode 6-3. The best-performing system discovered via genetic programming
Figure 6-10. Equity curve of the system in Pseudocode 6-3 in the period 01/02/2003 to 12/29/2006
The net return of the system in Pseudocode 6-3 is 50%, for a starting capital of $100K and a $0.01 commission per share traded. The CAR is 12.32% and the maximum drawdown is -10.33%. The system is fitted to the data because the market is in an uptrend and the trailing stop facilitates exactly that, while the stop-loss is hit only five times at the beginning of 2003; after February of the same year, there is only one long position, which stays open until the end of the backtesting period. However, when the system is tested forward on data from 01/02/2008 to 12/31/2008, it fails because market conditions change. The maximum drawdown increases to -38%, the net return is -30.5%, and the CAR is -16.6%. An expert quant could have suspected the deterioration in the performance of such a system just by looking at its code, because it appears random. However, Level III quants often get fooled by promises of quick trading system discovery schemes without understanding the impact data-mining bias has on the results. In this case, the combination of curve fitting through the use of a trailing stop and selection of the best-performing model produced a fluke that could not cope with changing market conditions. Many use similar or even worse systems that are discovered through relentless data-fishing and data-snooping.
Figure 6-11. Equity curve of the system in Pseudocode 6-3 in the period 01/03/2007 to 12/31/2008
The following measures could result in lower data-mining bias due to optimization and curve-
fitting:
If possible use only Type III systems or systems with a maximum of two parameters
Backtest the system after introducing a sufficiently small profit target and stop-loss
The use of Type III systems helps avoid optimization of the entry part and offers consistency as far as the sequence of entries is concerned, but there is still the exit part, which may have parameters that can be curve-fitted to the data. The use of exit rules that curve-fit to historical data should be avoided during development, because the entry rules then lose their significance and the resulting system may not possess any intelligence in timing market price action. Different exits may be added to a system, but only after it is determined that it has an edge. Removing the exit logic and testing a system with a small profit target and stop-loss may in some cases indicate whether the entry part does its job in determining proper market timing. Finally, introducing suitable randomness into the entry logic allows ranking the system and testing the null hypothesis that it has no intelligence on the available data sample.
Optimization of the free parameters of the moving average crossover system in Pseudocode 6-2 was performed on SPY data from inception to 07/23/2014. Based on the objective of maximum net profit, the resulting best values for the moving average periods were n1=25 and n2=295. The equity performance with the optimum values is shown in Figure 6-12 and the trade-by-trade report in Figure 6-13. The starting equity was $100K, the commission was $0.01 per share, and the available equity was fully invested.
Figure 6-12. Backtest results for the system in Pseudocode 6-2 with n1 = 25 and n2=295
Figure 6-13. Trade-by-trade backtest results for the system in Pseudocode 6-2 with n1 = 25 and n2=295. Created with Amibroker – advanced charting and technical analysis software: www.amibroker.com
The optimized system has a CAR equal to 10.30%, a 50% win rate, and a maximum drawdown of about -28%, as shown in Figure 6-13. Next, a small profit target and stop-loss of 2% are applied to each entry signal to determine whether the entry part has some market timing ability or the signals are essentially random and the real work is done by the exit signals. The resulting equity performance is shown in Figure 6-14.
Figure 6-14. Backtest results for the system in Pseudocode 6-2 with n1 = 25 and n2=295 and a 2% profit target and stop-loss
Figure 6-15. Trade-by-trade backtest results for the system in Pseudocode 6-2 with n1 = 25 and n2=295 with a 2% profit target and stop-loss. Created with Amibroker – advanced charting and technical analysis software: www.amibroker.com
A trade-by-trade report of the results in Figure 6-14 shows the same entries but different exits due to the applied profit target and stop-loss. Performance is now negative, with a CAR of -0.19%, a win rate of 43.75%, and a smaller drawdown of -9.77% due to the small stop-loss, as shown in Figure 6-15. These results offer an indication that the optimized system relies on its exits to perform well, while its entries do not offer any market timing ability. While relying on the exits is not a disadvantage in principle, and is even desirable, sole reliance on them opens the system to potential trouble when market conditions change, as already mentioned and demonstrated with the example in Figures 6-10 and 6-11, where a trailing stop was used. Essentially, in that case, the optimized system entered the market randomly, hoping that the reversal exit signal would be far enough from the entry for a profit to be generated. This is no different from the random trading of Level I chartists, whose entries are mostly random. A trading system with a random entry behaves similarly to a naive chartist who hopes that the exit price will be higher than the entry price although the trading is in reality random. This is equivalent to “buy (or short) and hope”.
A randomization study involves making the system entry logic random while maintaining the same exits. Note that this is just one possible form of randomization. The pseudocode for this is given in Pseudocode 6-4. It involves tossing a coin and opening a long position, after closing any open short position, when the outcome is heads and the fast moving average is above the slow one, or opening a short position, after closing any open long position, when the outcome is tails and the fast moving average is below the slow one. The exits remain the same, i.e., long positions are exited when the fast moving average crosses below the slow one, and the opposite holds for short positions.
If MA(25) > MA(295) AND Heads then Buy N shares at the close //buy signal
If MA(25) < MA(295) AND Tails then Short N shares at the close //short signal
Pseudocode 6-4. Code for a randomization study of a moving average crossover system
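Under the stated independence of the coin tosses, the randomization study of Pseudocode 6-4 can be sketched as follows; the synthetic prices, seed, and number of runs are illustrative assumptions:

```python
import random

# Sketch of the randomization study in Pseudocode 6-4: entries require
# both the moving-average condition and a coin toss; exits stay the same
# (the opposite crossover). Synthetic prices for illustration only.

def sma(prices, n, i):
    return sum(prices[i - n + 1:i + 1]) / n

def randomized_run(prices, n1, n2, rng):
    pos, entry, pnl = 0, 0.0, 0.0
    for i in range(n2 - 1, len(prices)):
        fast, slow, c = sma(prices, n1, i), sma(prices, n2, i), prices[i]
        heads = rng.random() < 0.5
        if pos <= 0 and fast > slow and heads:
            pnl += (entry - c) if pos < 0 else 0.0   # close any short, go long
            pos, entry = 1, c
        elif pos >= 0 and fast < slow and not heads:
            pnl += (c - entry) if pos > 0 else 0.0   # close any long, go short
            pos, entry = -1, c
        elif pos > 0 and fast < slow:                # exit rule unchanged
            pnl += c - entry
            pos = 0
        elif pos < 0 and fast > slow:                # exit rule unchanged
            pnl += entry - c
            pos = 0
    return pnl

rng = random.Random(42)
prices = [100 + i % 7 + 0.1 * i for i in range(120)]  # sawtooth plus drift
results = [randomized_run(prices, 5, 20, rng) for _ in range(500)]
print(sum(results) / len(results))  # mean P/L of the random-entry runs
```

Ranking the original system's result against the distribution of such runs is what produces the significance figures discussed below.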
The system in Pseudocode 6-4 was backtested 10,000 times to obtain a frequency distribution of the resulting CAR values, as shown in Figure 6-16. The distribution is close to normal, with a mean of 10.11% and a standard deviation of 0.221%. The kurtosis is low and slightly negative, and the skewness is also small and negative. As it turns out, after ranking the original system, 19.22% of the random-entry systems have a CAR higher than 10.30%, which is the CAR of the original system. Significance at the 5% level would require a CAR of 10.46%, or 16 basis points more than what was obtained in the original backtest.
Figure 6-16. Randomization study results for the system in Pseudocode 6-2 with n1 = 25 and
n2=295
Although the difference in values required for significance appears small in this example, which is used to illustrate the method, it translates to roughly $27,500 of additional profit over the backtest period.
For example, market conditions similar to those encountered in XLF from 01/03/2000 to 12/31/2002 would result in negative performance with excessive drawdown. In Figure 6-17, the backtest results of the moving average crossover system in Pseudocode 6-2 are shown for n1 = 25 and n2=295, when applied to XLF data. In this case, the entry part of the system is not intelligent enough to avoid the whipsaw market that begins in 2001, and drawdown accumulates due to consecutive losers.
Figure 6-17. Backtest results for the system in Pseudocode 6-2 with n1 = 25 and n2=295 in XLF
The conclusion from this example is that the original moving average crossover system was curve-fitted, and the tests that were performed allowed us to determine that this system lacks intelligence, as was expected. Note that this simple example was used only to illustrate a methodology for determining whether a system was curve-fitted during the design phase.
In recent years, several variations of a trading system based on the RSI indicator have appeared in books, magazines, and blog posts. The basic rule of the system involves going long when the 2-period RSI, denoted RSI(2), drops below a certain value, supposedly indicating an oversold condition, and exiting when the same indicator enters overbought territory. In Chapter Four, it was shown that the RSI is essentially a strength indicator, as its name implies, and not an overbought/oversold indicator, because that assessment relies on the gambler’s fallacy. Note that this seemingly “simple” system, as it has been described, already involves four parameters: the RSI period for the long entry, n1, the period for the long exit, n2, the overbought level OB, and the oversold level OS, as shown in Pseudocode 6-5.
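The described rules can be sketched in Python as follows; this uses a simple-average RSI rather than Wilder's smoothed version, and the data and variable names are illustrative, not the book's Pseudocode 6-5:

```python
# Sketch of the RSI(2) rules described above, with a simple-average RSI
# (not Wilder's smoothed version). Parameter names n1, n2, OS, OB follow
# the text; the price data are made up for illustration.

def rsi(closes, n, i):
    """Simple-average RSI of period n at index i (needs i >= n)."""
    gains = losses = 0.0
    for j in range(i - n + 1, i + 1):
        change = closes[j] - closes[j - 1]
        if change > 0:
            gains += change
        else:
            losses -= change
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)

closes = [100, 99, 98, 97, 99, 101, 102, 101, 103, 104]
n1, n2, OS, OB = 2, 2, 10, 90  # the four free parameters
signals = []
for i in range(n1, len(closes)):
    r = rsi(closes, n1, i)
    if r < OS:
        signals.append((i, "long entry"))   # supposedly oversold
    elif r > OB:
        signals.append((i, "long exit"))    # supposedly overbought
print(signals)
```

Even this toy version makes the point: four parameters must be chosen before a single signal can be generated, and each choice adds data-mining bias.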
According to the rules of trading system development for reducing the impact of curve fitting on data-mining bias that were presented in this section, the four parameters are already too many. Selecting values for those parameters immediately introduces data-mining bias, and optimizing this system further increases the data-mining bias due to curve fitting. But let us suppose that the recommended parameter values are as follows: n1 = 2, n2 = 2, OS = 10, and OB = 90. These values could have been the result of some past optimization based on some unknown objective function. A backtest of this system on unadjusted SPY data from 01/29/1993 to 07/23/2014, with a starting capital of $100K, fully invested equity at each position, and a $0.01 per share commission, yielded the equity curve shown in Figure 6-18. The equity performance of this system nearly matches that of buy and hold, with a CAR of 7.16%; the maximum drawdown is -38.84%, the win rate is 77.91%, the profit factor is 1.88, and the Sharpe ratio is 1.05. A total of 172 trades were generated, and the net return is 342%. These performance parameters may impress a naive quant, but an expert quant will try to debunk this system because (a) it involves too many parameters, (b) its payoff ratio (average win to average loss) is only 0.53, which is a low number, and (c) it is a system that works in markets with a strong upward bias in mean-reversion mode, as will be shown in more detail below.
Figure 6-18. Backtest results of the system in Pseudocode 6-5 with n1 = 2, n2 = 2, OS = 10 and OB = 90 in SPY
As
the first step in our effort to debunk this system, let us remove the exit logic and introduce a
small profit target and stop-loss of 2%. The resulting equity performance is shown in Figure 6-19.
Figure 6-19. Backtest results of the system in Pseudocode 6-5 with n1 = 2, OS = 10, and 2% profit target and stop-loss in SPY
It may be seen from Figure 6-19 that after the exit logic was removed, the performance of the system deteriorated significantly, and the win rate dropped to 54.63%. Still, the performance is positive, with a CAR of 2.37%, due to the longer-term upward bias of SPY. The small profit factor of 1.15 indicates that the results may not be better than random.
We next perform a random simulation on the same SPY data with a biased coin that generates heads (long signals) 75% of the time. When tails show up, the positions are exited. Thus, long positions stay in the market for a time comparable to the average bars in trades from the backtest in Figure 6-19, which is about 15 days. The frequency distribution of 20,000 such simulations with the same initial capital, fully invested equity at each position, and the same commission rate is shown in Figure 6-20. The net return of 342% for the RSI(2) system ranks higher than only 75.69% of the random systems, and the resulting p-value is 0.2430. Therefore, there is no evidence, based on the results of this simulation, against the null hypothesis that this system does not possess any intelligence. Note that, according to the results of the simulation shown in Figure 6-20, a net return of at least 385.82% is required to declare the RSI(2) system significant at the 5% level.
Figure 6-20. Randomization study results of a coin toss system for trading SPY with p{heads} =
0.75
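The biased-coin benchmark just described can be sketched as follows; the drifting price series, seed, and run count are illustrative assumptions, not the SPY data used in the figure:

```python
import random

# Sketch of the biased-coin benchmark: go long on heads (75% probability),
# exit on tails, and collect the distribution of total returns. A synthetic
# series with strong upward drift stands in for SPY here.

def coin_run(prices, p_heads, rng):
    entry, ret = None, 0.0
    for c in prices:
        if rng.random() < p_heads:
            if entry is None:
                entry = c                       # heads: enter long if flat
        elif entry is not None:
            ret += (c - entry) / entry          # tails: exit the position
            entry = None
    return ret

rng = random.Random(7)
prices = [100 * 1.002 ** i for i in range(500)]   # strong upward drift
returns = [coin_run(prices, 0.75, rng) for _ in range(2000)]
share_profitable = sum(r > 0 for r in returns) / len(returns)
print(round(share_profitable, 3))  # nearly all random runs profit in an uptrend
```

This is the essence of the argument in the text: in a market with a strong upward bias, random long-only entries are profitable almost by construction, so raw profitability says little about intelligence.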
In addition to the four parameters that open the door to over-fitting, this simple version of the RSI(2) system is naive because it tries to capture short-term trends as the strength of the indicator increases from low levels. However, in a strongly uptrending market such as SPY, 100% of all random position systems profit, as shown in Figure 6-20, with a minimum return of 86.69% and a maximum return of 521.93%. Also, about 18% of the random systems manage to make more than the buy-and-hold return. Therefore, a trader tossing a biased coin, entering long positions when heads show up and exiting them when tails show up, would have a little less than a 1-in-5 chance of making more than 352% and beating buy and hold. Furthermore, it is evident to an expert Level IV quant that a system such as the RSI(2) could fail during downtrends, as shown in Figure 6-21 when it is applied to GLD data.
Figure 6-21. Backtest results of the system in Pseudocode 6-5 with n1 = 2, n2 = 2, OS = 10 and OB = 90 in GLD
It may be seen from Figure 6-21 that the RSI(2) system does a good job of profiting during the uptrend in GLD, but when a downtrend starts after 2013 its performance deteriorates. This should be expected because during downtrends rebound rallies fizzle fast. However, instead of discarding this system, naive quants have tried to improve it by adding all sorts of filters, and thus more parameters that facilitate more curve-fitting, leading to more data-mining bias. One proposed variation added an extra condition to the entry logic, C > MA(200), to check for an uptrend in prices and thus avoid downtrends, and an extra condition to the exit logic, C < MA(5), to try to manage quick reversals. However, as may be seen from Figure 6-22, after repeating the backtest of Figure 6-21 with these two rules, which bring the total number of parameters in the system to six, the performance in GLD does not improve significantly in the period since the start of 2012.
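The filtered variation just described can be sketched as follows. Pseudo code 6-5 is not reproduced in this excerpt, so this is a reconstruction from the description: a long entry when RSI(2) drops below OS with C > MA(200), and an exit when RSI(2) exceeds OB or C < MA(5). A non-smoothed RSI and same-close fills are simplifying assumptions.

```python
import random

def rsi(closes, i, n=2):
    # Simple (non-smoothed) RSI over the last n bar-to-bar changes.
    gains = losses = 0.0
    for j in range(i - n + 1, i + 1):
        change = closes[j] - closes[j - 1]
        if change >= 0:
            gains += change
        else:
            losses -= change
    if gains + losses == 0:
        return 50.0
    return 100.0 * gains / (gains + losses)

def ma(closes, i, n):
    # Simple moving average of the last n closes, ending at bar i.
    return sum(closes[i - n + 1 : i + 1]) / n

def filtered_rsi2_trades(closes, os_level=10, ob_level=90):
    # Entry: RSI(2) < OS and C > MA(200); exit: RSI(2) > OB or C < MA(5).
    # Fills are taken at the same close for brevity (the book uses the next open).
    position, entry, trades = 0, 0.0, []
    for i in range(200, len(closes)):
        r = rsi(closes, i)
        if position == 0 and r < os_level and closes[i] > ma(closes, i, 200):
            position, entry = 1, closes[i]
        elif position == 1 and (r > ob_level or closes[i] < ma(closes, i, 5)):
            position = 0
            trades.append(closes[i] / entry - 1)
    return trades

# Synthetic series with drift and noise (illustration only, not GLD data).
rng = random.Random(1)
closes = [100.0]
for _ in range(1000):
    closes.append(max(1.0, closes[-1] * (1 + 0.0003 + rng.gauss(0, 0.012))))
trades = filtered_rsi2_trades(closes)
print(len(trades), "trades")
```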
Figure
6-22. Backtest results of the system in Pseudo code 6-5 with n1 = 2, n2
= 2, OS = 10 and OB
= 90 and with extra filter added in GLD
Although
the RSI(2) system with the added filter manages to avoid a bad trade during the
downtrend in 2013, it nevertheless suffers losses due to the sideways price action during 2014.
This should be expected because the filters work when the downtrend is smooth but when the
averages turn flat, they often do not. Thus, the filters that were added in hindsight helped little
when market conditions changed and there was prolonged whipsaw. Also, if the same filters are
used without optimizing the RSI period, the CAR is reduced to 4.86% from 7.16% in the case of
SPY. In general, when many parameters are already involved and filters are added, the result is a
random system with a high probability of failure.
If in addition the parameters are optimized, the
probability of a random result is close to 1 due to high data-mining bias.
The two rules presented in this section, i.e., considering systems with at most two parameters and avoiding, during development, exit rules that curve-fit equity performance to the data, can help reduce data-mining bias due to optimization and curve-fitting. Then, the two tests proposed, i.e., replacing the exit logic with a small profit target and stop-loss followed by a backtest, and performing a randomization study to rank performance, can assist an expert quant in evaluating the potential of trading systems developed either manually or via automated data-mining.
Note
that when the null hypothesis cannot be supported using the two methods
suggested, this
does not imply that the alternative is supported because the data sample used in the tests may not
be representative of the population. The aim of a trading system developer should be to try to find
support for the null hypothesis that the trading system does not
have an edge because true edges
are hard to find and the probability that a trading system is random is always high. While most
authors focus
on techniques that try to confirm an edge, this book aims at educating traders that it
is easy to be fooled by technical analysis into thinking that an edge has been discovered. A system
can be so intelligently optimized that it passes all tests and appears to possess an edge, i.e.,
it may
show timing ability when a small profit target and a stop loss are applied and it may also rank at a
high significance level during randomization studies. Therefore, it is important to always
remember that tests are used to reject systems. If the rejection is not possible,
that does not mean
that a system possesses an edge. Understanding
this is equivalent to winning half the battle in
becoming an expert quant. An expert quant should always look hard for reasons to reject a
system and strive towards achieving that goal.
Next,
we discuss how the bias due to system selection can be minimized and we
introduce
additional tests for supporting the null hypothesis of no profitability.
Selection of systems
In
the previous section, we discussed in some detail and with a few examples how optimization
and curve-fitting increase data-mining bias. Specific rules were suggested for minimizing the bias
and testing the null hypothesis that a trading system possesses no edge. In this section, we will
discuss the impact of selection on the data-mining bias. Trading system selection, via a manual or
an automated trading system development process, introduces data-mining bias because selecting
a model based on some performance criteria, even when they are calculated on out-of-sample data, ignores many other models that did not perform as well.
Figures
6-2 and 6-3 indicate that selection is involved in two stages of the process, after
backtesting and optimizing with the in-sample data and after backtesting with the out-of-sample
data. If the trading system development process in Figure 6-2 is used repeatedly to test new
strategies, then it becomes equivalent to the automated one in Figure 6-3. One difference between
the two processes is how the hypotheses to test are generated; in the manual process they are
conceived by a human via a complex cognitive process and in the automated process they are
generated by an algorithm that in the simplest case is some process based on permutations and in
the most complex it is based on an evolutionary algorithm. Essentially, both processes try to
achieve
the same thing and discover an edge. The automated process creates a list of best systems
that fulfill some performance criteria while the manual process generates a similar list but over a
longer period.
Another
difference between the automated and the manual process of discovering trading systems
is that the former can test millions, billions, or even trillions of strategies in a short period but
note that in both cases the act of selecting a final system to trade increases the data-mining bias
because a large number of systems are excluded. The key point to understand is that although the
selection is not necessarily bad in general and it is even unavoidable in many cases, in the case of
trading
systems whose performance is evaluated based on historical data, any selected result may
have survived due to luck alone and may not possess an edge. This is the reason that selection
introduces bias.
After
selecting a system one must test it to determine if it possesses an edge. The final decision
cannot be based only on out-of-sample backtesting because, as already explained, the results of
the tests can be due to luck when multiple comparisons are made. In other words, if repeated
data-fishing is performed, a system may be found that passed the out-of-sample validation test by
luck. In the case of automated trading system development via the use of neural networks,
evolutionary algorithms, or permutations, it is easier to see why after testing a huge number of
alternative hypotheses, one or even several can be found that pass the out-of-sample validation
test. That may not be evident to those who try to discover systems manually but it is a similar
process and in this case, only the speed at which data-mining bias accumulates is different. Due to the data-snooping bias that is often present in manual development, i.e., a new strategy is conceived after the out-of-sample backtest results of other strategies are known, any benefit from the slower accumulation of data-mining bias is minimal.
After
the above introduction, it may seem reasonable to eliminate out-of-sample testing and use
the available data history for developing trading systems, manually or automatically. This allows
for more market conditions to be encountered during backtesting and may help in eliminating
more random systems. However, selection must take place based on some criteria, and the final
system, or systems, must be validated on some other data sample. This new data sample could be
supplied by comparable markets. For example, if the objective is to find a system for trading QQQ, the final selection of strategies could be tested on a sufficiently large number of comparable securities decided in advance, before development begins, not afterward. The group of securities should involve price series that reflect a wide variety of market conditions and, at the same time, are not highly correlated. In the case of QQQ, the
comparable securities group could include, for example, SPY, IWM, DIA, TLT, GLD, DBC,
XLE, XLF, and EEM. However, this is where most naive quants make a mistake: they consider
the combined portfolio performance to decide whether the selected system
has an edge. The
results could be positively skewed due to outperformance in only a few securities. This is also a
frequent error in the analysis of longer-term trend-following systems tested on a portfolio of
securities or futures contracts. A better criterion is the number of failures. For example, if the
system fails in more than 10% of
the securities or contracts in the portfolio, then that could mean
that
market conditions existed in the past during which the system offered no edge. Since the
same conditions can repeat in any market, the system will fail in the future with a high
probability.
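The failure-count criterion can be expressed as a simple check. The CAR figures below are hypothetical and serve only to illustrate the 10% rule:

```python
def reject_by_failures(car_by_symbol, max_failure_rate=0.10):
    # Reject the system if it loses money in more than 10% of the portfolio.
    failures = sum(1 for car in car_by_symbol.values() if car <= 0.0)
    return failures / len(car_by_symbol) > max_failure_rate

# Hypothetical CAR results (%) on a comparable-securities portfolio chosen
# in advance; the numbers below are made up for illustration.
cars = {"SPY": 8.9, "QQQ": -1.2, "IWM": 3.5, "DIA": 4.2, "TLT": 4.3,
        "GLD": 5.0, "DBC": 0.6, "XLE": 5.6, "XLF": 4.1, "EEM": 6.0}
print("reject" if reject_by_failures(cars) else "keep")
```

One losing security out of ten is exactly the 10% boundary and does not trigger rejection; two would.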
Using all of the available histories to develop systems (no out-of-sample testing)
Setting a performance threshold and selecting all systems that rank above it
The
first example below shows how the test using comparable securities can expose a system that
was fitted intelligently and, as a result, passed the curve-fitting test. The second example involves
a machine-designed system and illustrates how the use of a performance threshold in addition to
a test on comparable securities can contribute towards the minimization of bias due to selection.
Variations
of a trading system based on the z-score became popular about 10 years ago. The z-
score measures the number of standard deviations the price is
away from the mean price. A
contrarian system can use the z-score to generate buy and sell signals in anticipation of reversion
when the price diverges from its short-term mean. A variation of this system and performance
results for SPY was published in November of 2010 on the website of a popular trading platform
(Ref. 8). According to the technical analyst who performed the study, the system was not curve-fitted and passed tests with data from comparable securities. Although no out-of-sample testing
was used, as is also suggested in this
book, it was not entirely clear whether the comparable
securities were selected after the results were known. However, this is not the main issue as the
objective of this example is to show how difficult it is at
times to reject a system that was
intelligently fitted to start with.
The
code of the z-score system is shown in Pseudocode 6-6. The average price is first calculated
via the use of an exponential moving average with period n1. The standard deviation is calculated
using the same period n1. Then, the z-score is calculated using the average price and the standard
deviation. The next step involves the calculation of the average z-score, using an exponential
moving average with period n1. Finally, the n2-day momentum of the average z-score is
calculated. A buy
signal is generated at the next open if the momentum is negative and a sell
short signal is generated at the next open if it is positive. This is a long/short system that is
always in the market, with short positions exiting long positions and vice versa. Position size is
determined based on fixed initial capital.
z-score = (price - avgp) / s
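A minimal sketch of this logic follows. Pseudocode 6-6 itself is not reproduced in this excerpt, so the EMA seeding and the population-standard-deviation convention are assumptions:

```python
import math

def ema(series, n):
    # Exponential moving average, seeded with the first value (an assumption).
    k = 2.0 / (n + 1)
    out = [series[0]]
    for x in series[1:]:
        out.append(out[-1] + k * (x - out[-1]))
    return out

def rolling_std(series, n):
    # Population standard deviation over a rolling n-bar window.
    out = [0.0] * len(series)
    for i in range(n - 1, len(series)):
        window = series[i - n + 1 : i + 1]
        m = sum(window) / n
        out[i] = math.sqrt(sum((x - m) ** 2 for x in window) / n)
    return out

def zscore_signals(closes, n1=10, n2=5):
    # +1 = long, -1 = short, 0 = warm-up; signals are acted on at the next open.
    avgp = ema(closes, n1)
    sd = rolling_std(closes, n1)
    z = [(c - a) / s if s > 0 else 0.0 for c, a, s in zip(closes, avgp, sd)]
    avgz = ema(z, n1)
    signals = [0] * len(closes)
    for i in range(n1 + n2, len(closes)):
        momentum = avgz[i] - avgz[i - n2]
        signals[i] = 1 if momentum < 0 else -1
    return signals

# Toy series: a drifting sine wave, for illustration only.
closes = [100 + 0.2 * i + 3.0 * math.sin(i / 4.0) for i in range(120)]
sigs = zscore_signals(closes)
print(sigs[-10:])
```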
The z-score system in Pseudocode 6-6 has three parameters: the period for calculating the average price and standard deviation, the period for calculating the average z-score, and the momentum period. The three parameters were reduced to two, n1 and n2, by equating the first two. The variation of this system presented in (Ref. 8) has n1 set to 10 and n2 set to 5.
Figure 6-23. Equity performance of the z-score system from 02/02/1993 to 11/30/2010
Figure
6-23 shows the equity performance of the z-score system, as determined by a backtest
with SPY data from the same period as in the reference mentioned, i.e., from 02/02/1993 to
11/30/2010. No commission was included. The system generated 646 trades with the following
values for the performance parameters:
CAR: 6.38%
The
net return is higher than the buy and hold return of 169.66% during the
test period. The
equity curve is relatively smooth but the low Sharpe and payoff ratios immediately raise some
questions about the robustness of this system. As argued previously, the primary goal of a trading
system developer should be to try to debunk a system and prove that it has no edge. Since at the
time this system was tested the maximum data history in SPY was fully utilized, we will not test
it on out-of-sample data but we will instead use other tests that could lead to rejection. The first
test involves applying a small profit target and stop-loss of 1.5%. The resulting equity curve after
backtesting the system with this change is shown in Figure 6-24 and shows performance
degradation but still a positive net return over the testing period. Specifically, the net return is
about 22%, CAR is 1.09% and the win rate is 52.40%.
Figure
6-24. Equity performance of the z-score system in SPY from 02/02/1993 to 11/30/2010 with a
1.5% profit target and stop-loss.
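The mechanics of the profit-target and stop-loss test can be sketched as follows. Exits are evaluated on closing prices for brevity (an actual backtest would honor intrabar fills); on a zero-drift random walk, entries with no timing ability should produce a win rate near 50% with a symmetric bracket:

```python
import random

def bracket_exit_returns(closes, entries, target=0.015, stop=0.015):
    # Replace the system's exits with a symmetric bracket: close the trade on
    # the first close at or beyond +target or -stop.
    trades = []
    for e in entries:
        entry_price = closes[e]
        for i in range(e + 1, len(closes)):
            r = closes[i] / entry_price - 1
            if r >= target or r <= -stop:
                trades.append(r)
                break
    return trades

def win_rate(trades):
    return sum(t > 0 for t in trades) / len(trades) if trades else 0.0

# Zero-drift random walk: entries with no information should score near 50%.
rng = random.Random(3)
closes = [100.0]
for _ in range(1000):
    closes.append(max(1.0, closes[-1] * (1 + rng.gauss(0, 0.01))))
entries = list(range(10, 900, 30))
trades = bracket_exit_returns(closes, entries)
print(f"trades: {len(trades)}, win rate: {win_rate(trades):.2%}")
```

A win rate meaningfully above 50% before costs suggests some timing ability; a rate near 50%, as with the z-score system's 52.40%, is weak evidence at best.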
After
the first test and the decrease of the win rate to 52.40%, we may suspect that this system is
fitted to the data but there is no conclusive evidence yet as performance is still positive. The next
step involves ranking the performance of this system after simulating a large
number of random
systems that use the same rule for position size determination. The logic that generates the
random systems is based on tossing a fair coin at the close of each day. If heads show up, any
open
short position is exited and a long position is established, which in turn is closed and
reversed to a short position when tails show up.
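A sketch of this fair-coin benchmark follows, with the rank expressed as the fraction of random systems the tested return beats. The return series and the tested return below are toy values, not the SPY data of the study:

```python
import random

def coin_flip_return(returns, seed):
    # Always in the market: long after heads, short after tails (fair coin),
    # with the toss at each close deciding the next day's position.
    rng = random.Random(seed)
    position = 0  # flat before the first toss
    total = 1.0
    for r in returns:
        total *= 1 + position * r
        position = 1 if rng.random() < 0.5 else -1
    return total - 1

def rank_against_random(system_return, returns, n_sims=2000):
    # Fraction of random long/short systems the tested return beats.
    sims = [coin_flip_return(returns, s) for s in range(n_sims)]
    return sum(system_return > x for x in sims) / n_sims

# Deterministic toy daily-return series with zero net drift (illustration only).
daily = [0.001 * ((i % 7) - 3) for i in range(1500)]
rank = rank_against_random(0.40, daily)
print(f"a 40% return beats {rank:.2%} of the random systems")
```

A rank near 1 means the tested return exceeds almost all random-system returns, which is the sense in which the 99.24% figure of Figure 6-25 is read.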
Figure
6-25. Randomization study results for the system in Pseudocode 6-6 with n1 = 10 and n2 = 5.
Generated by Price Action Lab software. www.priceactionlab.com
The
results of the randomization study are shown in Figure 6-25. It may be seen that the
minimum return for significance over the test period from 01/29/1993 to 11/30/2010 is 134.87%
and the z-score system, with a return of 201.38%, ranks better than 99.24% of the random systems. According to these results, there is evidence against the null hypothesis that the z-score system is not intelligent, i.e., random. However, one requirement for the rejection of the null hypothesis in
trading systems development is that the data sample used is representative of the population. As
we will see below in this case this requirement is not met and it is one of
the reasons this system
does not offer an edge.
We
have been unable so far to reject the z-score system based on simple tests and we must go one
step further and perform tests on comparable securities. The reason these tests are left last is that
they take more time to set up and analyze. Figure 6-26 shows the equity performance resulting
from a backtest of the z-score system on GLD data from 11/18/2004 (inception) to 11/30/2010.
The results show a consistently losing system with a net return of about -48%, CAR equal to
-10.24%, and
a win rate of 56.70%.
Figure 6-26. Equity performance of the z-score system in GLD from 11/18/2004 to 11/30/2010
Similar
consistently losing performance is obtained when the z-score system is tested on DBC
data from 02/05/2006 (inception) to 11/30/2010, as shown in Figure 6-27. The net return is
-56.58%, CAR is -15.90% and the win rate is 52.67%.
Figure 6-27. Equity performance of the z-score system in DBC from 02/05/2006 to 11/30/2010
Therefore,
we have succeeded in rejecting the z-score system after applying tests on comparable
securities. It appears that the system was intelligently optimized in the past or it might have been
generated by some data-mining program that used as its input descriptive statistics. The system
was tested on data from other securities but only those with acceptable results were considered,
something that increases the data-mining bias because the results were skewed by a few securities
that performed well. The deeper reasons that the system has no edge are not important at this
point and system developers should not waste any time trying to understand why systems fail
because that can distract them from the main objective, which is the rejection itself. The z-score
system does not perform well during smooth or steep trends. This was also mentioned in (Ref. 8)
to the credit of the system developer. However, in our opinion, the mistake was the selection of a
portfolio where this system performed well. Developers should always look for conditions where
a system does not perform well. Such conditions occurred in SPY after 2013 and the trading
system underperformed the buy
and hold return by a wide margin, as shown in Figure 6-28. Buy
and hold
for SPY (dividends not included) in the period 01/03/2013 to 10/17/2014
was about
29% but the z-score system gained only 9.57%. In that period,
the minimum significant return
based on a randomization study was 24.3%, a little less than the buy and hold return and a 9.57%
return was better than only 75% of the random systems, a result that provides support to the null
hypothesis that the z-score system does not possess any intelligence.
Figure 6-28. Equity performance of the z-score system in SPY from 01/02/2013 to 10/17/2014
In
this example, we considered a trading system that was backtested manually although it may
represent a strategy that was not conceived by a
human developer initially but was the outcome
of an automated data-mining process. When an automated process is used, selection bias can be
reduced by setting a performance threshold based on some metric, such as the profit factor or the
compound annual return, or even a combination of metrics, and then selecting all results that rank
above the threshold and treating them as a single system. The next example shows how this
additional rule fits with the other three rules mentioned
earlier.
This example involves a trading system that was machine-designed by a software program called Deep Learning Price Action Lab (DLPAL). DLPAL finds strategies based on the open, high, low, and close of a price series. These strategies are of Type III because they have no parameters in their entry logic. After the strategies are discovered, only the parameters of the exit logic remain to be determined.
Data
for SPY from inception on 01/29/1993 to 12/31/2012 was used for the machine design and
the program was instructed to find strategies that are based only on the close of daily bars.
DLPAL can find strategies that are formed by a minimum of 2 to a maximum of 9 bars. The
metrics chosen for the selection threshold were the win rate, the profit factor, the number of
trades, and the number of consecutive losers of each strategy discovered. Table 6-2 summarizes
the parameters of the machine design process.
Security: SPY
Timeframe: Daily
Data range: 01/29/1993-12/31/2012
Stop-loss: 2.5%
DLPAL
found 109 distinct strategies that fulfilled the threshold settings, 89
long and 20 short,
and the output is shown in Figure 6-29, sorted by win rate in descending order.
Figure
6-29. Results of a data-mining process for SPY price patterns with parameters in Table 6-
2. Generated by Deep Learning Price Action Lab software. www.priceactionlab.com
Each
line in the results of Figure 6-29 corresponds to a strategy that
satisfies the performance
parameters in Table 6-2. Index and Index Date
are used internally by the program to classify
strategies. Trade on is the entry point, in this case, the Open of the next bar. P is the success rate
of the strategy, PF is the profit factor, Trades is the number of historical trades, CL is the
maximum number of consecutive losers, Type is LONG for long and SHORT for short strategies, Target is the profit target, Stop is the stop-loss, and C indicates whether the exits are expressed in percent or in points; in this case, it is %. Last Date and First Date are the last and first dates in the historical data file.
To
reduce the bias, we select all strategies in the results based on the defined threshold and we
use them to develop a system that takes long and short positions in SPY with a 2.5% profit target
and stop-loss. Reverse signals cause an exit of any open position and no multiple positions are
allowed. Starting capital is set to $100,000 and positions
are sized based on a constant dollar
amount equal to the starting capital divided by the entry price. Note that when the system equity
is fully invested, the risk per trade of this system equals the 2.5% stop-loss.
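The arithmetic of this sizing rule is simple: because the dollar amount invested is always the starting capital, the dollar loss at the stop is the same for every trade. The entry price below is hypothetical:

```python
initial_capital = 100_000.0
stop_loss = 0.025

def shares_for_entry(entry_price):
    # Constant dollar amount: the starting capital divided by the entry price.
    return initial_capital / entry_price

entry_price = 145.32  # hypothetical entry price, for illustration only
shares = shares_for_entry(entry_price)
loss_if_stopped = shares * entry_price * stop_loss  # dollar loss at the stop
print(round(shares, 2), round(loss_if_stopped, 2))
```

Whatever the entry price, the loss at the stop works out to 2.5% of the starting capital, i.e., $2,500 on $100,000.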
Table 6-3 shows the compound annual return (CAR)
of the final system for each of 10 ETFs
with data from inception to 12/31/2012.
SPY 8.86%
QQQ 3.98%
IWM 3.53%
DIA 4.22%
EEM 5.99%
GLD 4.97%
TLT 4.25%
XLF 4.09%
XLE 5.55%
DBC 0.64%
It
may be seen from Table 6-3 that the SPY system based on the 109 strategies discovered by a
data-mining program was also profitable in the other nine major ETFs. The mean of the CAR values across all ten ETFs is 4.61% with a standard deviation of 2.08%. Excluding the low but still positive result for DBC, the mean is 5.05% and the standard deviation 1.63%, indicating small variability. These results allow us to conclude
that the system passes the comparable securities test.
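Recomputing from Table 6-3 suggests the quoted averages were taken over all ten ETFs, SPY included; on that assumption the excluding-DBC figures match almost exactly. A minimal check:

```python
from statistics import mean, stdev

# CAR values (%) from Table 6-3.
car = {"SPY": 8.86, "QQQ": 3.98, "IWM": 3.53, "DIA": 4.22, "EEM": 5.99,
       "GLD": 4.97, "TLT": 4.25, "XLF": 4.09, "XLE": 5.55, "DBC": 0.64}

all_vals = list(car.values())
no_dbc = [v for k, v in car.items() if k != "DBC"]
print(round(mean(all_vals), 2), round(stdev(all_vals), 2))
print(round(mean(no_dbc), 2), round(stdev(no_dbc), 2))
```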
In
this example, we first limited the number of strategies by considering only price patterns and
we also restricted the system to be of Type III.
Then we used all of the available data (at the time the system was developed). We also selected all strategies above a predefined threshold
and then
performed a test on comparable securities to validate the results. The system passed all tests but
that does not provide a guarantee that it will stay profitable in the future because market
conditions that have not been encountered in any of the tests may develop. However, the
probability of failure was minimized.
A
forward test was performed next in the period from 01/02/2013 to 06/30/2014, during which
the SPY system that was data-mined with data from 01/29/1993 to 12/31/2012 generated a net
return of 25.66% with the same position sizing method. The buy and hold return of SPY during
the same period was 34%. As shown in Figure 6-30, based on a randomization study the return of
the system was better than 97.53% of that of random systems, resulting in a p-value of 0.0248
and, as a result, there is evidence against the null hypothesis that the system is random.
However, as mentioned already several times, finding evidence against the null hypothesis does not imply that an edge will be maintained in the future, as market conditions may change from those encountered in all tests.
Figure
6-30. Randomization study results for a test return of 25.66% in SPY. Generated by Price
Action Lab software. www.priceactionlab.com
In
this section, we discussed ways of reducing selection bias during both manual and automated
trading system development. In the next section, we
deal with data-snooping and its impact on
the data-mining bias.
Data snooping
In trading system development, data-snooping bias arises when, after an out-of-sample test, the
process of system discovery is repeated, as indicated in Figures 6-2 and 6-3. When this is done,
the out-of-sample becomes part of the in-sample used to develop the system. Data-mining bias is
increased due to data snooping and the chances of discovering a random trading system that
performs well in the out-of-sample increase. Trading system developers who use backtesting repeatedly to test many ideas with out-of-sample data have small chances of discovering an edge as time goes by because the data-mining bias becomes very large. The following two rules may reduce the
impact of data snooping in the case of manual development:
Eliminate out-of-sample testing and validate with data from comparable securities
Most
naive quants continuously backtest ideas thinking that this increases their chances of
finding an edge when in fact the chances decrease as time goes by and at some point, they
discover a fluke that passes all out-of-sample and validation tests but then fails in actual trading.
Data
snooping bias can be eliminated when using automated data mining to discover trading
systems if any out-of-sample tests are performed only once and are not used to restart the
process. However, most users of data-mining software are likely to either restart the process after
changing the metrics they want to be maximized/minimized or even set an option that may be
offered by the software developer to automatically use out-of-sample results to cause a restart.
Needless to say, when data snooping is employed in an automated process, such as the one
shown in Figure 6-3, the chances of finding an edge before finding a fluke are zero for all
practical purposes. There is no need for formal mathematical proof; just consider the fact that
there are millions or even billions of flukes and only a small number of edges. If one backtests
hundreds of millions of trading strategies, then the chances that a fluke will be over-fitted to the
data are high, especially when such strategies employ several indicators with optimized
parameters. Therefore, using all available history as the in-sample and eliminating out-of-sample
tests eliminates data-snooping bias. Instead of out-of-sample validation, backtests on comparable
securities may be used
to also limit the data-mining bias due to selection.
The
notion that the extensive use of backtesting lowers the chances of finding a profitable edge
may sound counter-intuitive to some but it is what expert quants already know that separates
them from naive quants doing random tests and constantly discovering spurious correlations.
Backtesting is a science but it is also an art. Successful traders who use backtesting are like artists
that create a work of art from a canvas
(data) and some media (rules). Trying more often does not
make them better artists unless they have a good idea. Trying to paint every day can only result
in worthless work. One must wait for the right moment to
use the canvas with the right media.
Only then the chances of success are higher.
At
the highest level, data-mining bias results from testing multiple hypotheses on historical data.
As the number of tested hypotheses increases, so do the chances of accepting a random rule as
genuine. Data-mining bias has a hugely adverse effect on the quality of the data-mining process.
The myths this book exposes are the result of naive
efforts to deal with a qualitative feature of a
process from a quantitative perspective.
Data-mining
bias cannot be effectively measured because it refers to the quality of
a process and
not to some specific parameter. Any definitions of data-mining bias that attribute to it a specific
measure are the result of unjustified attempts to quantify a non-quantifiable notion. For example,
some quants attempt to measure this bias by generating random data and then applying data
mining to them. They hope that by repeating this process many times and by getting a
distribution of some metric based on the random results, they will be able to correct the original
results from the actual data and compensate for the data-mining bias. However, such methods
essentially rank the performance of an algo with respect to the performance of a large number of
algos mined on random data and nothing more than that. There is no justification for the claim
that an algo that ranks high will perform better in actual trading. For
example, if the original algo
found by data mining was over-fitted on historical data, then it still has low chances of
performing well if market conditions change regardless of whether or not it ranked high when
compared to algos obtained from random data.
Out-of-sample
tests are part of an overall algo development process that is plagued by data-
mining bias. Therefore, they cannot deal with it. If many hypotheses are tested on historical data,
then some of them will pass out-of-sample tests by chance alone. Therefore, out-of-sample
testing does not suffice to guarantee that an algo is not random. In addition, it is false to call out-
of-sample testing a “cross-validation” method. This type of testing is only a validation method
that cannot deal with the data-mining bias in the presence of multiple hypothesis testing. Note
that “cross-validation” methods are in general hard or impossible to use in trading algo manual
development or machine learning due to the
nature of the process and data used.
When
machine learning software vendors are confronted by customers who have used out-of-
sample testing without success, they recommend forward testing. However, forward testing is just
another form of out-of-sample testing and essentially its continuation. If one tests many
hypotheses, then there is a high probability of finding a random one that passes all
tests in both
the out-of-sample and forward sample. As the number of hypotheses becomes large, this
probability tends to 1, i.e., finding a random algo that passes all tests is a certain event.
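The claim that this probability approaches 1 follows from elementary arithmetic, assuming each random strategy passes all tests independently with some small probability; the pass probability below is hypothetical:

```python
def prob_at_least_one_fluke(p_pass, n_strategies):
    # P(at least one random strategy passes every test) = 1 - (1 - p)^n,
    # assuming independent strategies each passing with probability p_pass.
    return 1 - (1 - p_pass) ** n_strategies

# p_pass is a hypothetical per-strategy pass probability, for illustration.
for n in (10, 1_000, 100_000):
    print(n, round(prob_at_least_one_fluke(0.0005, n), 4))
```

Even with a tiny per-strategy pass probability, testing enough strategies makes finding a fluke that passes everything practically certain.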
Myth No. 4: Monte Carlo Analysis can deal with data-mining bias
Some
algo developers use Monte Carlo analysis to determine the effect of parameter variations
on performance. If the algo is robust enough when subjected to such variations, then it is
perceived as genuine. However, Monte Carlo analysis becomes part of the data-mining process as
soon as it is applied. Even if Monte Carlo analysis is performed on unseen data,
when a large
number of hypotheses are tested, there is a high probability of finding one that passes all out-of-
sample, forward sample, and Monte Carlo analysis tests. The probability becomes 1 as time
passes and the same data are reused. Any machine learning process, when used repeatedly and extensively, will generate algos that are random but pass all tests, even Monte Carlo analysis. In
addition, note that this type of analysis has many flaws and applies only to a small subset of
possible strategies, as already discussed earlier in this chapter.
Myth No. 5: If you do not use data-mining, then there is no data-mining bias
The
only difference between thinking of a hypothesis and having a computer generate it is the
processing speed. Data-mining bias is always present as long as the hypothesis is tested on
historical data. Unless one can think of a truly unique hypothesis that no one else has,
chances are it was already data-mined by a computer. Researchers started applying machine
learning to the markets in the mid-1980s and since then
they have mined data relentlessly. One of
the most naive things one could do nowadays is to try to combine some indicators with exit rules
using some machine learning algorithm hoping to find gold. Although some
results may work for
some time, inherently they are random because when
one adjusts for the multiple comparisons
the results are not significant.
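One standard adjustment for multiple comparisons is the Bonferroni correction (a choice of illustration here, not a method prescribed by the book); even a seemingly significant raw p-value is wiped out once the number of strategies tried is accounted for. The test counts below are hypothetical:

```python
def bonferroni(p_value, n_tests):
    # Bonferroni-adjusted p-value, capped at 1.
    return min(1.0, p_value * n_tests)

# A raw p-value that looks significant stops being so once the number of
# strategies tried before selection is accounted for (counts are hypothetical).
for n in (1, 100, 10_000):
    print(n, bonferroni(0.0248, n))
```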
In trading strategy development, validation is required because it is easy to over-fit strategies on
historical data during development and get fooled by randomness. Although the best validation
method is the actual performance, it is an expensive method and most developers would like to
assess potential before employing strategies. In most cases, the developer attempts to modify a
failed strategy after a failed validation
test or conceive a new one. This leads to a dangerous
practice of reusing the data until a strategy that passes some validation tests is developed. There are
some misconceptions about multiple comparisons, data-mining bias, and p-hacking, mostly due
to the way these have been presented by academics with good mathematical skills but no trading experience.
Although
most trading strategies developed via unsound methods usually fail at some point in
real trading, some even immediately, high statistical significance, a subject favored by academics,
is not equivalent to a good strategy. In other words, strategies with high statistical significance
may also fail, and often do, because of major changes in market conditions, also known as regime changes. Therefore, statistical significance is not a panacea, as purported in some academic publications; developers should be aware of this and of the fact that market regime changes also play an important role.
1. Out-of-sample tests
This
is the most popular and also abused validation method. Briefly, out-of-sample tests require
setting aside a portion of the data to be used in testing the strategy after it is developed and
obtaining an unbiased estimate of future performance. However, out-of-sample tests provide an unbiased estimate only when a single, unique hypothesis is tested. In
other words, out-of-sample tests are useful in the case of unique hypotheses only. The use of
out-of-sample tests for strategies developed
via data mining shows a lack of understanding of the
process. In this case, the test can be used to reject strategies but not to accept any. In this sense,
the test is still useful but trading strategy developers know that good performance in out-of-
samples for strategies developed via multiple comparisons is in most cases a random result.
A
few methods have been proposed for correcting out-of-sample significance for the presence of
multiple comparisons bias but in almost
all real cases the result is a non-significant strategy.
However, as we show in Ref. 1 with two examples that correspond to two major market regimes, highly significant strategies, even after corrections for bias are applied, can also fail due to changing markets. Therefore, out-of-sample tests are unbiased estimates of future performance
only if future returns are distributed identically to past returns.
In other words, non-
stationarity may invalidate any results of out-of-sample testing.
Out-of-sample
tests apply only to unique hypotheses and assume stationarity. In this case, they
are useful but if these conditions are not met, they can be quite misleading.
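The claim that good out-of-sample performance of a data-mined strategy is usually a random result can be illustrated with a small simulation on purely synthetic data: generate many zero-edge "strategies", select the best one in-sample, and inspect it out-of-sample. The counts and volatility below are arbitrary assumptions for illustration only.

```python
# Sketch (synthetic data): the best of many random strategies in-sample
# is an artifact of selection; its out-of-sample edge is noise.
import random
import statistics

random.seed(7)

N_STRATEGIES = 500
IN_SAMPLE, OUT_SAMPLE = 250, 250  # daily returns in each segment

def random_returns(n):
    """Zero-edge daily returns: pure noise centered on 0."""
    return [random.gauss(0.0, 0.01) for _ in range(n)]

# Each "strategy" is just a random return stream with no real edge.
strategies = [random_returns(IN_SAMPLE + OUT_SAMPLE) for _ in range(N_STRATEGIES)]

# Pick the strategy with the best in-sample mean return.
best = max(strategies, key=lambda r: statistics.mean(r[:IN_SAMPLE]))

is_mean = statistics.mean(best[:IN_SAMPLE])
oos_mean = statistics.mean(best[IN_SAMPLE:])
print(f"in-sample mean: {is_mean:.5f}, out-of-sample mean: {oos_mean:.5f}")
```

By construction the in-sample winner looks strongly positive, while its out-of-sample mean is typically indistinguishable from zero, which is the point the text makes about data-mined strategies.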
2. Robustness tests
Due to the high Type-II error (missed discoveries) of other validation methods, practitioners have long resorted to robustness tests. Robustness tests fall under the more general subject of Stochastic Modeling.
Robustness
tests usually involve small variations in parameters and/or in the entry and exit logic
of a trading strategy. The objective is to determine whether the resulting distribution of some
performance metric,
usually the mean return or maximum drawdown, has low variance. If the
variance is high then it is assumed that the strategy is not robust. Different metrics can be used
along with different methods for evaluating robustness. In reality, robustness tests determine how well a strategy is fitted to the data, and in most cases they indicate the opposite of what is desired, i.e., high robustness may be an indication of an excessive fit to historical data. For this reason, it is appropriate to apply robustness tests to an out-of-sample segment, but as a result the power of the test is low. More importantly, any robustness tests and stochastic modeling, in general, are
subject to data-snooping bias if used repeatedly with the same data. The probability that a random
strategy passes all tests even in the out-of-sample increases as the number of hypotheses
increases. However, for unique hypotheses, the practical tests can reveal certain properties that
are useful but vary on a case-to-case basis.
Robustness
tests and stochastic modeling in general can assess over-fitting conditions but Type-I
error (false discoveries) is high especially in the case of multiple comparisons even when applied
to an out-of-sample.
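A minimal sketch of a parameter-perturbation robustness test follows. The strategy (a long-only moving-average filter), the synthetic random-walk prices, and the perturbation range are all illustrative assumptions, not a recommended procedure; the point is only the mechanics of varying a parameter and examining the spread of a performance metric.

```python
# Sketch: perturb a strategy parameter and measure the spread of the
# resulting mean returns. A large spread suggests a fragile result.
import random
import statistics

random.seed(42)

# Hypothetical daily closes: a noisy random walk.
prices = [100.0]
for _ in range(1000):
    prices.append(prices[-1] * (1 + random.gauss(0.0, 0.01)))

def ma_strategy_mean_return(prices, lookback):
    """Mean next-day return when the close is above its moving average."""
    rets = []
    for i in range(lookback, len(prices) - 1):
        ma = sum(prices[i - lookback:i]) / lookback
        if prices[i] > ma:  # long-only filter
            rets.append(prices[i + 1] / prices[i] - 1)
    return statistics.mean(rets) if rets else 0.0

base = 50  # nominal lookback; perturb it by +/- 5 bars
results = [ma_strategy_mean_return(prices, lb) for lb in range(base - 5, base + 6)]
spread = statistics.pstdev(results)
print(f"mean returns across lookbacks 45..55, spread = {spread:.6f}")
```

As the text cautions, a low spread here does not validate the strategy; on data-mined candidates this procedure is itself subject to data-snooping bias.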
3. Portfolio backtests
The
main idea of this method is to avoid out-of-sample tests and the use of
the whole price
history in developing trading strategies. By using a sufficiently high number of price series in a
portfolio, the power of a test for significance increases. A simple statistical hypothesis test may be used in this case based on the t-statistic, t = sqrt(N) x (mean return) / (standard deviation of returns), where N is the number of trades:
The
null hypothesis is that the strategy returns are drawn from a distribution with a zero mean. If
the null hypothesis is rejected, then there is a low probability of obtaining the strategy
performance given that the null hypothesis is true.
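The test described above can be sketched as follows. The trade returns are hypothetical, and in practice the resulting t-statistic would still need to be judged against a threshold that accounts for any multiple comparisons made during development.

```python
# Sketch: one-sample t-statistic for the null hypothesis that pooled
# strategy trade returns are drawn from a zero-mean distribution.
import math
import statistics

def t_stat(returns):
    """t = mean / (stdev / sqrt(N)) for a one-sample test of zero mean."""
    n = len(returns)
    mean = statistics.mean(returns)
    sd = statistics.stdev(returns)
    return mean / (sd / math.sqrt(n))

# Hypothetical pooled trade returns from a portfolio backtest.
trades = [0.02, -0.01, 0.015, 0.005, -0.02, 0.01, 0.03, -0.005, 0.012, 0.008]
print(f"t-stat: {t_stat(trades):.3f}")
```

With only ten trades the statistic is far from any conventional significance threshold, which reflects the text's point that large trade samples are needed.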
Portfolio
backtests, as well as tests on comparable securities, suffer from at least two problems: the first is that these tests are too strict and the Type-II error (missed discoveries) is high. The second, and maybe more serious, problem is that strategies usually fail when market conditions change despite their high significance. Statistical hypothesis testing has limited application in
trading strategy development despite offering ground for publishing academic papers. In general,
the logic of the strategy and the process used to derive it are
more important than any statistical
test. Statistics cannot find gold where there is none to be found. Note that it is easy to cheat
portfolio
backtests by selecting the securities to use post-hoc. This is often done by some trading
strategy developers, in most cases due to ignorance. The securities to use in the portfolio must be
selected ex-ante. Then, if the test fails, the strategy must be rejected. Any effort to improve the
strategy and repeat the tests will introduce data-snooping bias and p-hacking.
Portfolio
tests and tests on comparable securities are useful under certain conditions and given
that they are not abused with p-hacking as a goal.
4. Monte Carlo simulations
Monte Carlo simulation is part of stochastic modeling, but we list it here separately because of its popularity. These tests are the least robust and effective in trading strategy development and should be avoided except when strategies fulfill certain requirements, as discussed earlier
in this chapter. Monte Carlo simulations are especially inapplicable in the case of data-mined
strategies and multiple comparisons. If these simulations are used with data-mined strategies, it is
a strong indication that the developer lacks experience. In a nutshell, over-fitted strategies usually
generate good Monte Carlo results. When this method is used in a loop of multiple comparisons,
it loses its significance completely.
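For illustration, one common form of Monte Carlo analysis reshuffles the order of a fixed trade list to gauge the distribution of maximum drawdowns. As the text notes, this probes only sequence risk; it says nothing about whether the trades themselves are a product of data mining. The trade list below is hypothetical.

```python
# Sketch: Monte Carlo reshuffling of a fixed trade sequence to estimate
# the distribution of maximum drawdown (sequence risk only).
import random

random.seed(1)

trades = [0.02, -0.01, 0.03, -0.015, 0.01, -0.02, 0.025, 0.005, -0.01, 0.015]

def max_drawdown(trade_returns):
    """Peak-to-trough decline of the compounded equity curve."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in trade_returns:
        equity *= 1 + r
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)
    return mdd

draws = []
for _ in range(1000):
    shuffled = trades[:]
    random.shuffle(shuffled)
    draws.append(max_drawdown(shuffled))

print(f"median max drawdown over 1000 reshuffles: {sorted(draws)[500]:.4f}")
```

If the input trades are over-fitted, every reshuffle inherits that over-fit, which is why good-looking Monte Carlo results from data-mined strategies are meaningless.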
Use
Monte Carlo tests only when it is appropriate to do so. In general, the
choice of validation
method depends on the nature of the strategy, and the application and interpretation of results are
more of an art than a science. Most validation tests, whether performed by practitioners or academics, either suffer from multiple-comparisons bias or fail under changing market conditions. The nature
of markets is such that there is no robust test to assess strategy robustness. This is what makes
trading
strategy development very difficult but also an interesting and challenging task.
Mechanical
trading systems became popular in the 1990s because traders thought that by
removing emotions from trading the chances of success were higher. But not all emotions are
bad, especially if they are based on experience and knowledge of market microstructure and
dynamics. In
addition to the notorious problem of data-mining bias that plagues trading system
development and has no easy solution, more traders are realizing nowadays that removing
emotions also removes human intelligence and a deeper understanding of price action that is
difficult, or even impossible, to express mathematically. For these reasons, there is a return to basics and specifically to discretionary trading, but this time within a quantitative framework.
Mechanical systems still dominate the high and medium frequency spectrum
but for position and
swing trading quantitative discretionary trading is gaining ground. Below is a list of the main
advantages and disadvantages of quantitative discretionary trading.
This
example shows how the disadvantages listed above are being dealt with. Price Action Lab
(PAL) software was used for the identification of price
patterns in daily data. Figure 6-31 shows
the output of the scan function of the program after the close of Monday, February 2, 2015.
Unadjusted daily data since inception were used with the program to scan
12 popular ETFs.
Figure
6-31. Results of a scan of a group of ETFs as of the close of February 2, 2015. Generated by
Price Action Lab software. www.priceactionlab.com
In
Figure 6-31, each line corresponds to an exact price pattern found in the data. P is the win rate,
Trades is the number of historical trades, CL is the maximum number of consecutive losers, and
Target and Stop are the values of the profit target and stop-loss. C indicates the type of target and
stop-loss, in this case, it is a percentage added to the entry price that is shown under Trade On as
the open of the next bar.
Patterns
in three ETFs were found, one pattern long in IWM, two short patterns in DBC, and one
short pattern in QQQ. However, when data-mining for patterns, the following problems arise:
One
must deal with the above three issues before using the results. Unfortunately, most naive
Level III quants that use pattern scanners do not deal with such issues.
The
first step in the evaluation involves increasing the data sample. Normally, hundreds, or even
thousands, of trades are required for statistical significance of trading signals. This is partly because small trade samples may not be representative of the actual population. For example,
30 trades in 1-minute data do not represent a significant sample and thousands of trades are
required instead. To increase data samples we may perform a backtest of the patterns using data
from a large group of comparable securities. The idea is that if a pattern is significant, then this
should be reflected in the results of portfolio backtests. In this example, the S&P 500 component
stocks were used with data since 01/2000, and the results are shown in Figure 6-32.
Figure
6-32. Results of a portfolio backtest applied to the results of an ETF scan as of the close of
February 2, 2015. Generated by Price Action Lab software. www.priceactionlab.com
The
Last and First Date columns in the original results shown in Figure 6-31 were replaced in
Figure 6-32 by the portfolio expectation and success rate (the number of profitable tickers out of
the 500) and the P1 column was replaced by the portfolio profit factor, i.e., the ratio of the sum of
winners to the sum of losers. It may be seen from Figure 6-32 that the pattern in IWM was
profitable in 60.68% of the stocks and the profit factor is 1.11 while the number of trades
increased to 11,844. The results for the other patterns show negative expectation and a profit
factor of less than 1.
The
long signal in IWM hit its profit target of 2% on February 5, 2015, three days after it was
identified, as shown in Figure 6-33.
The
procedure followed in this example for identifying and testing trading patterns is just one out
of several possibilities. It involves algorithmic and deterministic data-mining, i.e., data-mining where the same data and parameter settings always produce the same results. Reproducibility is important; otherwise, no testing can be done that conforms to scientific standards.
There are two basic momentum strategies: trend-following and relative strength.
Trend-following,
a.k.a. absolute or time-series momentum, attempts to generate high risk-
adjusted returns by timing market moves and by avoiding large drawdown levels, such as those
that occurred in the stock market after the dot-com bubble burst and the financial crisis. The
strategies use simple models, such as the long-only 50/200 moving average crossover or price
breakout systems.
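The long-only 50/200 moving average crossover mentioned above can be sketched as follows, on hypothetical prices with no transaction costs or execution details:

```python
# Sketch: long-only 50/200 simple moving average crossover.
# Long (1) when the 50-bar MA is above the 200-bar MA, flat (0) otherwise.
def sma(prices, n, i):
    """Simple moving average of the n closes ending at index i (inclusive)."""
    return sum(prices[i - n + 1:i + 1]) / n

def crossover_positions(prices):
    """Position (1 = long, 0 = flat) for each bar with enough history."""
    positions = []
    for i in range(199, len(prices)):
        positions.append(1 if sma(prices, 50, i) > sma(prices, 200, i) else 0)
    return positions

# Hypothetical steadily rising closes: the fast MA stays above the slow MA.
prices = [100 + 0.1 * k for k in range(400)]
positions = crossover_positions(prices)
print(positions[-1])  # 1 -> long in a steady uptrend
```

The simplicity is the point: such rules have few parameters, which, per Rule 2 later in this chapter, limits the scope for over-fitting.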
Relative strength momentum investors benefit from price crashes and from the premium created by panicked investors who sell at the bottom. Fundamentally, these strategies buy outperforming securities and sell, or even short, underperforming ones. These strategies also use trivial models, such as the rate of price change.
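A relative-strength ranking by rate of change (ROC), the kind of trivial model referred to above, can be sketched as follows; the tickers and prices are hypothetical:

```python
# Sketch: rank a universe of securities by rate of change over a lookback,
# then buy the top performers and avoid (or short) the bottom ones.
def roc(prices, lookback):
    """Rate of change over the lookback period."""
    return prices[-1] / prices[-1 - lookback] - 1

# Hypothetical price histories (oldest to newest).
universe = {
    "AAA": [100, 105, 112, 120],
    "BBB": [100, 101, 100, 99],
    "CCC": [100, 95, 92, 90],
}

lookback = 3
ranked = sorted(universe, key=lambda t: roc(universe[t], lookback), reverse=True)
print(ranked)  # ['AAA', 'BBB', 'CCC']
```

Note that the lookback itself is a free parameter; as the text argues below, published lookbacks were typically optimized through backtests, which is a source of data-mining bias.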
The
historical profitability of momentum strategies is the result of hindsight and data mining.
Allocation in stocks, funds, or ETFs, based on momentum naturally affects overbought/oversold
levels and decisions of value investors, something that is not reflected in backtests. With these
strategies, we are not dealing with trading a few contracts or shares but with substantial
investments by funds, and the dynamics of the relevant markets are affected as a result. Often
momentum strategies involve ranking securities according to past returns, then buying top
performers and selling or even shorting worst performers. Published studies on momentum
strategies are usually based on lookback periods that were optimized through backtests.
Therefore, there is data mining involved and the selection of a certain look-back period ignores
other such periods that did not produce good results.
Momentum
strategies are popular because they are supposed to generate excess alpha at lower
risk as compared to buy and hold and passive allocation. However, as in the case with traditional
trading systems, momentum represents just another method that must be tested for significance
and,
more importantly, for intelligence in capturing alpha.
One
cannot simply appeal to academic studies that demonstrate the momentum anomaly effect in
stock markets and then claim that there is a general method for transforming this effect into
excess alpha. All trading hypotheses must be analyzed on a case-to-case basis. Part of the
analysis could involve testing the following hypothesis regarding backtested and real
performance:
By
“possessing intelligence” it is meant here that a system was not curve-fitted on past data or
was lucky in actual trading but that its returns were due to its ability to adjust to changing market
conditions.
If
the analysis is carefully performed, it will usually show that momentum-based strategies have
no intelligence because of the selection of best performing securities in the past and optimum
look-back periods for calculating the ranking score. In addition, long backtests of such strategies
involve results that were positively impacted by the tech bubble market of the 1990s. It can be
shown that after the 1990s, these strategies do not generate excess alpha [Ref. 11].
Sustainable
momentum in the U.S. stock market disappeared in the late 1990s and was
replaced
by mean-reversion. This is illustrated in the S&P 500 daily chart shown in Figure 6-34 by the
Momersion indicator. This indicator measures momentum and mean-reversion over a period n of
250 daily bars. It is an oscillator that is calculated as follows:

Mc = 0
MRc = 0
For i = 1 to n = 250 do
    If r(i) * r(i-1) > 0 then Mc = Mc + 1
    If r(i) * r(i-1) < 0 then MRc = MRc + 1
End For
Momersion(n) = 100 * Mc / (Mc + MRc)

In Pseudo code 6-7, r(k) is the kth arithmetic return and it is calculated as r(k) = (C(k) - C(k-1)) / C(k-1), where C(k) is the closing price of bar k.
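The Momersion calculation can be sketched in Python as below, counting same-sign versus opposite-sign products of consecutive returns; the neutral value returned when no nonzero products exist is an assumption for the degenerate case.

```python
# Sketch: Momersion(n) oscillator. Momentum days have consecutive returns
# with the same sign; mean-reversion days have a sign flip.
def momersion(closes, n=250):
    """Momersion(n) = 100 * Mc / (Mc + MRc) over the last n daily returns."""
    # arithmetic returns r(k) = (C(k) - C(k-1)) / C(k-1)
    r = [(closes[k] - closes[k - 1]) / closes[k - 1] for k in range(1, len(closes))]
    window = r[-n:]
    mc = mrc = 0
    for i in range(1, len(window)):
        if window[i] * window[i - 1] > 0:
            mc += 1       # momentum: same-sign consecutive returns
        elif window[i] * window[i - 1] < 0:
            mrc += 1      # mean-reversion: sign flip
    return 100.0 * mc / (mc + mrc) if mc + mrc else 50.0  # 50 = neutral (assumed)

# A strictly trending series is pure momentum:
trending = [100 + k for k in range(20)]
print(momersion(trending))  # 100.0
```

Readings above 50 indicate that momentum dominated the window and readings below 50 that mean-reversion did, which matches the interpretation of Figure 6-34.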
It
may be seen from Figure 6-34 that during the 1990s momentum dominated but after the 2000
top and the dot-com bubble burst, mean-reversion dominated until 2011. This is shown by the
Momersion indicator staying below the 50% line after 2000. After 2011, the S&P 500 is in a state
of “momersion”, i.e., a period where price action alternates between momentum and mean-
reversion.
The chances of finding an edge via backtesting with historical data may be maximized when the
following rules are followed:
Rule 1: Avoid frequent backtesting of random ideas
The frequent backtesting of random ideas contributes to an increase in data-mining bias and facilitates data snooping. Traders should backtest only when an idea is worth the dangers involved from a large data-mining bias.
Rule 2: Backtest trading strategies that involve only a few parameters or preferably none at all
The
more parameters the model has, the easier it is to over-fit it to the data and the more
selections are available that can facilitate data-fishing, increasing data-mining bias.
Rule 3: Limit the number of strategies considered by limiting the number of basic rules they
employ
This
may not be a problem with infrequent manual backtesting but with data-mining software
thousands, millions, or even billions of entry and exit rules can be combined to get a final result.
Rule 4: Avoid during the initial development the use of exit rules that naturally curve-fit to the
data
If
exit rules are used that naturally fit the data, such as a trailing stop for example, then the entries
may play no important role. If the same conditions as in the backtest do not repeat in the future,
the system may fail. The entries need to have market timing ability. Exits that maximize profit
potential can be employed after the final result is
obtained and validated.
Rule 5: Use all of the available data to develop systems to increase the range of market
conditions
Unless backtesting will be done only once for a given market, there is no point in using out-of-sample validation. Each time a new or modified strategy is backtested after an out-of-sample
test, data-mining bias increases due to data snooping. Using all the data in backtests exposes a
system to more market conditions during the development phase. Afterward, Rule 6 may be
applied.
Rule 6: Validate on a standard, pre-selected group of securities
It is important to have a standard group of securities that reflects a wide variety of market conditions and correlations for validation. Selecting, after a system is developed, a portfolio of securities where it works while ignoring many others where it fails increases data-mining bias due to the selection of survivors.
Rule 7: In the case of automated data mining accept all systems above a set performance
threshold
Selecting
the best performing system out of a large number of systems introduces selection bias
that in turn increases data-mining bias. All systems above a certain threshold that is set in
advance must be selected and treated as a single system to minimize data-mining bias due to
selection.
Rule 8: Confirm the final result with a forward test
This test is confirmatory and should be made on a forward sample where the system is tested either on paper or in real trading. It can indicate whether the system has an edge in the tested markets.
Note
that even if all the above rules are applied, the final result may fail
in actual trading because
of a major change in market conditions to a regime where positive performance is not possible.
Unfortunately, there is no guarantee in trading that an edge will stay robust and an expert quant
must work constantly towards discovering new edges that will replace old ones when
performance starts deviating from desired levels. This fact about trading systems has discouraged
many to the point that they have started looking for alternatives to technical analysis. Chapter
Eight mentions briefly some of these alternatives and issues associated with their use.
©
2015 Michael Harris. All Rights Reserved. Any unauthorized copy, reproduction, distribution,
publication, display, modification, or transmission of any part of this blog is strictly prohibited
without prior written permission.