Pseudo-Mathematics and Financial Charlatanism
Introduction
A backtest is a historical simulation of an algorithmic investment strategy. Among other things, it computes the series of profits and losses that such a strategy would have generated had that algorithm been run over that time period. Popular performance statistics, such as the Sharpe ratio or the Information ratio, are used to quantify the backtested strategy's return on risk. Investors typically study those backtest statistics and then allocate capital to the best performing scheme.
Regarding the measured performance of a backtested strategy, we have to distinguish between two very different readings: in-sample (IS) and out-of-sample (OOS). The IS performance is the one simulated over the sample used in the design of the strategy (also known as the learning period or training set).
David H. Bailey is retired from Lawrence Berkeley National
Laboratory. He is a Research Fellow at the University of California, Davis, Department of Computer Science. His email
address is [email protected].
Jonathan M. Borwein is Laureate Professor of Mathematics
at the University of Newcastle, Australia, and a Fellow of the
Royal Society of Canada, the Australian Academy of Science,
and the AAAS. His email address is jonathan.borwein@
newcastle.edu.au.
Marcos López de Prado is Senior Managing Director at Guggenheim Partners, New York, and Research Affiliate at Lawrence Berkeley National Laboratory. His email address is [email protected].
Qiji Jim Zhu is Professor of Mathematics at Western Michigan University. His email address is [email protected].
DOI: http://dx.doi.org/10.1090/noti1105
We invite the reader to read specific instances of pseudo-mathematical financial advice at this website: http://www.m-a-f-f-i-a.org/. Also, Edesses (2007) provides numerous examples.
Historically, scientists have led the way in exposing those who utilize pseudoscience to extract
a commercial benefit. As early as the eighteenth
century, physicists exposed the nonsense of astrologers. Yet mathematicians in the twenty-first
century have remained disappointingly silent with
regard to those in the investment community who,
knowingly or not, misuse mathematical techniques
such as probability theory, statistics, and stochastic calculus. Our silence is consent, making us
accomplices in these abuses.
The rest of our study is organized as follows: The section "Backtest Overfitting" introduces the problem in a more formal way. The section "Minimum Backtest Length (MinBTL)" defines that concept. The section "Model Complexity" argues how model complexity leads to backtest overfitting. The section "Overfitting in Absence of Compensation Effects" analyzes overfitting in the absence of compensation effects. The section "Overfitting in Presence of Compensation Effects" studies overfitting in the presence of compensation effects. The section "Is Backtest Overfitting a Fraud?" exposes how backtest overfitting can be used to commit fraud. The section "A Practical Application" presents a typical example of backtest overfitting. The section "Conclusions" lists our conclusions. The mathematical appendices supply proofs of the propositions presented throughout the paper.
Backtest Overfitting
The design of an investment strategy usually
begins with a prior or belief that a certain pattern
may help forecast the future value of a financial
variable. For example, if a researcher recognizes a
lead-lag effect between various tenor bonds in a
yield curve, she could design a strategy that bets on
a reversion towards equilibrium values. This model
might take the form of a cointegration equation,
a vector-error correction model, or a system of
stochastic differential equations, just to name a
few. The number of possible model configurations
(or trials) is enormous, and naturally the researcher
would like to select the one that maximizes the
performance of the strategy. Practitioners often rely
on historical simulations (also called backtests) to
discover the optimal specification of an investment
strategy. The researcher will evaluate, among other variables, the optimal sample sizes, signal update frequency, entry and profit-taking thresholds, risk sizing, stop losses, and maximum holding periods.
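To see how quickly the number of such trials grows, consider the following Python sketch, which enumerates a hypothetical configuration grid; every parameter name and range below is purely illustrative, not taken from any actual strategy.

```python
from itertools import product

# Hypothetical parameter grid for a strategy under development.
# Every name and range below is illustrative only.
grid = {
    "sample_size":     [252, 504, 1260],        # look-back window (days)
    "entry_threshold": [0.5, 1.0, 1.5, 2.0],    # signal z-score to enter
    "profit_taking":   [0.01, 0.02, 0.05],      # take-profit level
    "stop_loss":       [0.01, 0.02, 0.05],      # stop-loss level
    "max_holding":     [5, 20, 60],             # maximum holding period (days)
}

# Each combination is one "trial" in the sense used in this paper.
n_trials = len(list(product(*grid.values())))
print(f"N = {n_trials} model configurations")   # N = 3*4*3*3*3 = 324
```

Even this modest grid yields hundreds of trials; adding a few more parameters pushes N into the thousands.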
The Sharpe ratio is a statistic that evaluates an investment manager's or strategy's performance on the basis of a sample of past returns. It is defined as the ratio between average excess returns (in excess of the rate of return paid by a risk-free asset, such as a Treasury bill) and the standard deviation of those same returns. Suppose that the strategy's excess returns, $r_t$, are IID,

(1) $r_t \sim N(\mu, \sigma^2)$.

If returns are observed with a frequency of $q$ observations per year, the annualized Sharpe ratio is

(2) $SR = \frac{\mu}{\sigma}\sqrt{q}$.

The estimated Sharpe ratio $\widehat{SR}$ converges asymptotically,

(3) $\widehat{SR} \xrightarrow{a} N\!\left(SR,\; \frac{1 + \frac{SR^2}{2q}}{y}\right)$,

where $y$ is the number of years used to estimate $\widehat{SR}$. As $y$ increases without bound, the probability distribution of $\widehat{SR}$ approaches a Normal distribution with mean $SR$ and variance $\frac{1 + \frac{SR^2}{2q}}{y}$.
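The following minimal Python sketch computes the annualized Sharpe ratio of equation (2) and the asymptotic standard deviation implied by equation (3); the function names are our own, and the code is a sketch rather than the authors' implementation.

```python
import numpy as np

def annualized_sharpe(excess_returns: np.ndarray, q: int) -> float:
    """Equation (2): SR = (mu / sigma) * sqrt(q), with q returns per year."""
    return excess_returns.mean() / excess_returns.std(ddof=1) * np.sqrt(q)

def sharpe_std(sr: float, q: int, y: float) -> float:
    """Equation (3): asymptotic std. dev. of SR-hat, sqrt((1 + SR^2/(2q)) / y)."""
    return np.sqrt((1.0 + sr**2 / (2 * q)) / y)

# Example: five years of simulated daily excess returns with zero true mean,
# so the true Sharpe ratio is zero and SR-hat is pure estimation noise.
rng = np.random.default_rng(0)
q, y = 252, 5
r = rng.normal(0.0, 0.01, size=q * y)
sr_hat = annualized_sharpe(r, q)
print(f"SR-hat = {sr_hat:.2f} +/- {sharpe_std(sr_hat, q, y):.2f}")
```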
Figure 3 shows the relation between SR IS (x-axis) and SR OOS (y-axis) for $\mu = 0$, $\sigma = 1$, $N = 1000$, $T = 1000$. Because the process follows a random walk, the scatter plot has a circular shape centered at the point (0, 0). This illustrates the fact that, in the absence of compensation effects, overfitting the IS performance (x-axis) has no bearing on the OOS performance (y-axis), which remains around zero.
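A minimal Monte Carlo sketch of this experiment, using the stated parameters $\mu = 0$, $\sigma = 1$, $N = 1000$, $T = 1000$ (our own illustrative code, not the authors'):

```python
import numpy as np

# N skill-less strategies, each a sequence of T IID N(0,1) returns,
# split into IS (first half) and OOS (second half).
rng = np.random.default_rng(42)
N, T = 1000, 1000
returns = rng.normal(0.0, 1.0, size=(N, T))
is_part, oos_part = returns[:, : T // 2], returns[:, T // 2 :]

def sharpe(x: np.ndarray) -> np.ndarray:
    return x.mean(axis=1) / x.std(axis=1, ddof=1)

sr_is, sr_oos = sharpe(is_part), sharpe(oos_part)
best = sr_is.argmax()   # the configuration a backtest optimizer would select
print(f"best IS SR: {sr_is[best]:.3f}, its OOS SR: {sr_oos[best]:.3f}")
print(f"corr(IS, OOS): {np.corrcoef(sr_is, sr_oos)[0, 1]:.3f}")  # approx. 0
```

The selected strategy looks excellent IS, while its OOS Sharpe ratio is simply a fresh draw centered at zero.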
(let's say zero). MinTRL was developed to evaluate a strategy's track record (a single realized path, N = 1). The question we are asking now is different, because we are interested in the backtest length needed to avoid selecting a skill-less strategy among N alternative specifications. In other words, in this article we are concerned with overfitting prevention when comparing multiple strategies, not with evaluating the statistical significance of a single Sharpe ratio estimate. Next we will derive the analogue of MinTRL in the context of overfitting, which we will call Minimum Backtest Length (MinBTL), since it specifically addresses the problem of backtest overfitting.
From (3), if $SR = 0$ and $y = 1$, then $\widehat{SR} \xrightarrow{a} N(0, 1)$. Note that because $SR = 0$, increasing $q$ does not reduce the variance of the distribution. The proof of the following proposition is left for the appendix.
Proposition 1. Given a sample of IID random variables, $x_n \sim Z$, $n = 1, \ldots, N$, where $Z$ is the CDF of the Standard Normal distribution, the expected maximum of that sample, $E[\max_N] = E[\max\{x_n\}]$, can be approximated for large $N$ as

(4) $E[\max_N] \approx (1-\gamma)\, Z^{-1}\!\left[1 - \frac{1}{N}\right] + \gamma\, Z^{-1}\!\left[1 - \frac{1}{Ne}\right]$,

where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant.
The minimum backtest length needed to keep $E[\max_N]$ at a given threshold is then

$\text{MinBTL} \approx \left(\frac{(1-\gamma)\, Z^{-1}\!\left[1-\frac{1}{N}\right] + \gamma\, Z^{-1}\!\left[1-\frac{1}{Ne}\right]}{E[\max_N]}\right)^{2} < \frac{2\ln[N]}{E[\max_N]^2}.$

Figure 2 shows how many years of backtest length (MinBTL) are needed so that $E[\max_N]$
is fixed at 1. For instance, if only five years of data
are available, no more than forty-five independent
model configurations should be tried or we are
almost guaranteed to produce strategies with an
annualized Sharpe ratio IS of 1 but an expected
Sharpe ratio OOS of zero. Note that Proposition 1
assumed the N trials to be independent, which
leads to a quite conservative estimate. If the trials
performed were not independent, the number of
independent trials N involved could be derived
using a dimension-reduction procedure, such as
Principal Component Analysis.
We will examine this tradeoff between N and T in greater depth later in the paper without requiring such a strong assumption, but MinBTL gives us a first glance at how easy it is to overfit by merely trying alternative model configurations. As an approximation, the reader may find it helpful to remember the upper bound to the minimum backtest length (in years), $\text{MinBTL} < \frac{2\ln[N]}{E[\max_N]^2}$.
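A short Python sketch of Proposition 1 and this bound (our own code, following the formulas above; SciPy's `norm.ppf` plays the role of $Z^{-1}$):

```python
import numpy as np
from scipy.stats import norm

GAMMA = 0.5772156649  # Euler-Mascheroni constant

def expected_max(n: int) -> float:
    """Equation (4): approximate E[max_N] of N IID standard Normals."""
    return (1 - GAMMA) * norm.ppf(1 - 1 / n) + GAMMA * norm.ppf(1 - 1 / (n * np.e))

def min_btl(n: int, target_sr: float = 1.0) -> float:
    """Backtest years needed so that the best of n skill-less trials shows
    an annualized Sharpe ratio no better than target_sr by luck alone."""
    return (expected_max(n) / target_sr) ** 2

for n in (10, 45, 100, 1000):
    print(f"N = {n:5d}  E[max_N] = {expected_max(n):.2f}  "
          f"MinBTL = {min_btl(n):.1f} years")
# N = 45 yields roughly 5 years, the example given in the text.
```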
Model Complexity
How does the previous result relate to model complexity? Consider a one-parameter model that may adopt two possible values (like a switch that generates a random sequence of trades) on a sample of T observations. Overfitting will be difficult, because N = 2. Let's say that we make the model more complex by adding four more parameters, so that the total number of parameters becomes five, i.e., $N = 2^5 = 32$. Having thirty-two independent sequences of random trades greatly increases the possibility of overfitting.
While a greater N makes overfitting easier, it makes perfectly fitting harder. Modern supercomputers can only perform around $2^{50}$ raw computations per second, or fewer than $2^{58}$ raw computations per year. Even if a trial could be reduced to a raw computation, searching $N = 2^{100}$ configurations would take us $2^{42}$ supercomputer-years of computation (assuming a 1 Pflop/s system, capable of $10^{15}$ floating-point operations per second). Hence, a skill-less brute-force search is certainly impossible. While it is hard to perfectly fit a complex skill-less strategy, Proposition 1 shows that there is no need for that. Without perfectly fitting a strategy or making it overcomplex, a researcher can achieve high Sharpe ratios: a relatively simple strategy with just seven binomial independent parameters offers $N = 2^7 = 128$ trials, with an expected maximum Sharpe ratio $E[\max_N] \approx 2.6$ by Proposition 1, even when the true Sharpe ratio is zero.
Overfitting in Presence of Compensation Effects
Consider a performance series generated by

$m_\tau = \mu + \varepsilon_\tau$,

and suppose that we impose a global constraint by recentering each path on its own sample mean,

$\tilde{m}_\tau = m_\tau - \frac{1}{T}\sum_{\tau=1}^{T} m_\tau$.
Figure 6. Adding a single global constraint causes the OOS performance to be negative even though the underlying process was trendless; a strongly negative linear relation between performance IS and OOS arises, indicating that the more we optimize IS, the worse the OOS performance of the strategy.
We may rerun the same Monte Carlo experiment as before, this time on the recentered variables $\tilde{m}_\tau$. Somewhat scarily, adding this single global
constraint causes the OOS performance to be
negative even though the underlying process was
trendless. Moreover, a strongly negative linear
relation between performance IS and OOS arises,
indicating that the more we optimize IS, the
worse the OOS performance. Figure 6 displays this
disturbing pattern. The p-values associated with
the intercept and the IS performance (SR a priori)
are respectively 0.5005 and 0, indicating that the
negative linear relation between IS and OOS Sharpe
ratios is statistically significant.
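A sketch of this experiment (again our own illustrative code): recentering each path on its full-sample mean implements the global constraint and mechanically ties IS and OOS performance together.

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 1000, 1000
m = rng.normal(0.0, 1.0, size=(N, T))
m_tilde = m - m.mean(axis=1, keepdims=True)  # global constraint: zero full-sample mean

is_part, oos_part = m_tilde[:, : T // 2], m_tilde[:, T // 2 :]
sr_is = is_part.mean(axis=1) / is_part.std(axis=1, ddof=1)
sr_oos = oos_part.mean(axis=1) / oos_part.std(axis=1, ddof=1)

best = sr_is.argmax()
print(f"best IS SR: {sr_is[best]:.3f}, its OOS SR: {sr_oos[best]:.3f}")  # negative
print(f"corr(IS, OOS): {np.corrcoef(sr_is, sr_oos)[0, 1]:.3f}")          # approx. -1
```

Because the recentered IS and OOS means are exact negatives of each other, the IS-optimal pick is, by construction, among the worst performers OOS.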
The following proposition is proven in the
appendix.
Proposition 3. Given two alternative configurations (A and B) of the same model, where $\sigma^A_{IS} = \sigma^A_{OOS} = \sigma^B_{IS} = \sigma^B_{OOS}$, imposing a global constraint $\mu^A = \mu^B$ implies that

(9) $SR^A_{IS} > SR^B_{IS} \iff SR^A_{OOS} < SR^B_{OOS}$.
Figure 7 illustrates that a serially correlated performance introduces another form of compensation effects, just as we saw in the case of a global constraint. For example, if $\varphi = 0.995$, it takes about 138 observations to recover half of the deviation from the equilibrium. We have rerun the previous Monte Carlo experiment, this time on an autoregressive process with $\mu = 0$, $\sigma = 1$, $\varphi = 0.995$, and have plotted the pairs of performance IS vs. OOS.
Recentering a series is one way to introduce
memory into a process, because some data points
will now compensate for the extreme outcomes
from other data points. By optimizing a backtest,
the researcher selects a model configuration that
spuriously works well IS and consequently is likely
to generate losses OOS.
Serial Dependence
Imposing a global constraint is not the only
situation in which overfitting actually is detrimental.
To cite another (less restrictive) example, the same effect appears if the performance series is serially correlated, such as in a first-order autoregressive process,
(10) $\Delta m_\tau = (1-\varphi)\mu + (\varphi - 1)\, m_{\tau-1} + \varepsilon_\tau$

or, analogously,

(11) $m_\tau = (1-\varphi)\mu + \varphi\, m_{\tau-1} + \varepsilon_\tau$,
where $\varepsilon_\tau$ is white noise. For $\varphi \in (0, 1)$ the process is mean-reverting, and half of a deviation from the equilibrium $\mu$ is expected to be recovered after $-\frac{\ln[2]}{\ln[\varphi]}$ observations (Proposition 4). An analogue of Proposition 3 then holds (Proposition 5):

$SR^A_{IS} > SR^B_{IS} \iff SR^A_{OOS} < SR^B_{OOS}$.
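For instance, with $\varphi = 0.995$,

$$-\frac{\ln[2]}{\ln[0.995]} = \frac{0.693147}{0.005013} \approx 138.3,$$

which is the figure of about 138 observations quoted above.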
A Practical Application
Institutional asset managers follow certain investment procedures on a regular basis, such as
rebalancing the duration of a fixed income portfolio (PIMCO); rolling holdings on commodities
(Goldman Sachs, AIG, JP Morgan, Morgan Stanley);
investing or divesting as new funds flow at the end
of the month (Fidelity, BlackRock); participating in
the regular U.S. Treasury Auctions (all major investment banks); delevering in anticipation of payroll,
FOMC or GDP releases; tax-driven effects around
the end of the year and mid-April; positioning for
electoral cycles, etc. There are a large number of
instances where asset managers will engage in
somewhat predictable actions on a regular basis.
It should come as no surprise that a very popular
Conclusions
While the literature on regression overfitting is
extensive, we believe that this is the first study
to discuss the issue of overfitting on the subject
of investment simulations (backtests) and its
negative effect on OOS performance. On the
subject of regression overfitting, the great Enrico
Fermi once remarked (Mayer et al. [20]):
I remember my friend Johnny von Neumann
used to say, with four parameters I can fit
an elephant, and with five I can make him
wiggle his trunk.
The same principle applies to backtesting, with
some interesting peculiarities. We have shown that
backtest overfitting is difficult indeed to avoid. Any
perseverant researcher will always be able to find a
backtest with a desired Sharpe ratio regardless of
the sample length requested. Model complexity is
only one way that backtest overfitting is facilitated.
Given that most published backtests do not report
the number of trials attempted, many of them
may be overfitted. In that case, if an investor
allocates capital to them, performance will vary: It
will be around zero if the process has no memory,
but it may be significantly negative if the process
has memory. The standard warning that past performance is not an indicator of future results understates the risks associated with investing on the basis of overfit backtests. When financial advisors do
not control for overfitting, positive backtested
performance will often be followed by negative
investment results.
We have derived the expected maximum Sharpe
ratio as a function of the number of trials (N) and
sample length. This has allowed us to determine
the Minimum Backtest Length (MinBTL) needed to
avoid selecting a strategy with a given IS Sharpe
ratio among N trials with an expected OOS Sharpe
ratio of zero. Our conclusion is that the more trials
a financial analyst executes, the greater should
be the IS Sharpe ratio demanded by the potential
investor.
We strongly suspect that such backtest overfitting is a large part of the reason why so many
algorithmic or systematic hedge funds do not live
up to the elevated expectations generated by their
managers.
Appendices
Proof of Proposition 1
Embrechts et al. [5, pp. 138–147] show that the maximum value (or last order statistic) in a sample of independent random variables following an exponential distribution converges asymptotically to a Gumbel distribution. As a particular case, the Gumbel distribution covers the Maximum Domain of Attraction of the Gaussian distribution, and therefore it can be used to estimate the expected value of the maximum of several independent random Gaussian variables.
To see how, suppose there is a sample of IID random variables, $z_n \sim Z$, $n = 1, \ldots, N$, where $Z$ is the CDF of the Standard Normal distribution. To derive an approximation for the sample maximum, $\max_N = \max\{z_n\}$, we apply the Fisher–Tippett–Gnedenko theorem to the Gaussian distribution and obtain that

(14) $\lim_{N \to \infty} \mathrm{Prob}\left[\frac{\max_N - \alpha}{\beta} \le x\right] = G[x],$

where $G[x] = e^{-e^{-x}}$ is the CDF of the Gumbel distribution and the norming constants are $\alpha = Z^{-1}\!\left[1 - \frac{1}{N}\right]$ and $\beta = Z^{-1}\!\left[1 - \frac{1}{Ne}\right] - \alpha$. The mean of the Gumbel distribution is $\alpha + \gamma\beta$, where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant, and substituting the norming constants yields the approximation (4).
Proof of Proposition 3
Suppose there are two random samples (A and B) of the same process $\{m_\tau\}$, where A and B are of equal size and have means and standard deviations $\mu_A$, $\mu_B$, $\sigma_A$, $\sigma_B$. A fraction $\lambda$ of each sample is called IS, and the remainder is called OOS, where for simplicity we have assumed that $\sigma^A_{IS} = \sigma^A_{OOS} = \sigma^B_{IS} = \sigma^B_{OOS}$. We would like to understand the implications of a global constraint $\mu_A = \mu_B$.
First, we note that $\mu_A = \lambda\mu^A_{IS} + (1-\lambda)\mu^A_{OOS}$ and $\mu_B = \lambda\mu^B_{IS} + (1-\lambda)\mu^B_{OOS}$. Then $\mu^A_{IS} > \mu^A_{OOS} \iff \mu^A_{IS} > \mu_A \iff \mu^A_{OOS} < \mu_A$. Likewise, $\mu^B_{IS} > \mu^B_{OOS} \iff \mu^B_{IS} > \mu_B \iff \mu^B_{OOS} < \mu_B$.
Second, because of the global constraint $\mu_A = \mu_B$, we have $\lambda\mu^A_{IS} + (1-\lambda)\mu^A_{OOS} = \lambda\mu^B_{IS} + (1-\lambda)\mu^B_{OOS}$, and hence $\mu^B_{IS} - \mu^A_{IS} = \frac{1-\lambda}{\lambda}\left(\mu^A_{OOS} - \mu^B_{OOS}\right)$. Then $\mu^A_{IS} > \mu^B_{IS} \iff \mu^A_{OOS} < \mu^B_{OOS}$. We can divide this expression by $\sigma_{IS} > 0$, with the implication that

(17) $SR^A_{IS} > SR^B_{IS} \iff SR^A_{OOS} < SR^B_{OOS},$

where we have denoted $SR^A_{IS} = \frac{\mu^A_{IS}}{\sigma^A_{IS}}$, and similarly for the other Sharpe ratios.
Proof of Proposition 5
Suppose that we draw two samples (A and B) of a first-order autoregressive process and generate two subsamples of each. The first subsample is called IS and is comprised of $\tau = 1, \ldots, \lambda T$, and the second subsample is called OOS and is comprised of $\tau = \lambda T + 1, \ldots, T$, with $\lambda \in (0, 1)$ such that $\lambda T$ is an integer. For simplicity, let us assume that $\sigma^A_{IS} = \sigma^A_{OOS} = \sigma^B_{IS} = \sigma^B_{OOS}$. From Proposition 4, (18), we obtain

(22) $E_{\lambda T}[m_T] - m_{\lambda T} = \left(1 - \varphi^{(1-\lambda)T}\right)\left(\mu - m_{\lambda T}\right).$

Because $1 - \varphi^{(1-\lambda)T} > 0$ and $\sigma^A_{IS} = \sigma^B_{IS}$, we have $SR^A_{IS} > SR^B_{IS} \iff m^A_{\lambda T} > m^B_{\lambda T}$. This means that the OOS of A begins with a seed that is greater than the seed that initializes the OOS of B. Therefore, $m^A_{\lambda T} > m^B_{\lambda T} \iff E_{\lambda T}[m^A_T] - m^A_{\lambda T} < E_{\lambda T}[m^B_T] - m^B_{\lambda T}$. Because $\sigma^B_{IS} = \sigma^B_{OOS}$, we conclude that

(23) $SR^A_{IS} > SR^B_{IS} \iff SR^A_{OOS} < SR^B_{OOS}.$
Python code implementing the experiment described in "A Practical Application" can be found at http://www.quantresearch.info/Software.htm and at http://www.financial-math.org/software/.
Proof of Proposition 4
This proposition computes the half-life of a first-order autoregressive process. Suppose there is a random variable $m_\tau$ that takes values over a sequence of observations $\tau = 1, \ldots, T$, where

(18) $m_\tau = (1-\varphi)\mu + \varphi\, m_{\tau-1} + \varepsilon_\tau.$

Iterating expectations, $E_0[m_\tau] = (1 - \varphi^\tau)\mu + \varphi^\tau m_0$. Half of the expected deviation of $m_0$ from the equilibrium $\mu$ is recovered after $\tau$ observations, where $\varphi^\tau = \frac{1}{2}$, i.e.,

$\tau = -\frac{\ln[2]}{\ln[\varphi]},$

for $\varphi \in (0, 1)$.
Acknowledgments
References
[1] D. Bailey, J. Borwein, M. López de Prado, and J. Zhu, The probability of backtest overfitting, working paper, 2013, available at http://ssrn.com/abstract=2326253.
[2] D. Bailey and M. López de Prado, The Sharpe ratio efficient frontier, Journal of Risk 15(2) (2012), 3–44, available at http://ssrn.com/abstract=1821643.
[5] P. Embrechts, C. Klüppelberg, and T. Mikosch, Modelling Extremal Events, Springer, New York, 1997.
[20] J. Mayer, K. Khairy, and J. Howard, Drawing an elephant with four complex parameters, American Journal of Physics 78(6) (2010), 648–649.