Abstract—We present the expected values from p-value hacking as a choice of the minimum p-value among m independent tests, which can be considerably lower than the "true" p-value, even with a single trial, owing to the extreme skewness of the meta-distribution.

We first present an exact probability distribution (meta-distribution) for p-values across ensembles of statistically identical phenomena. We derive the distribution for small samples 2 < n ≤ n* ≈ 30 as well as the limiting one as the sample size n becomes large. We also look at the properties of the "power" of a test through the distribution of its inverse for a given p-value and parametrization.

The formulas allow the investigation of the stability of the reproduction of results and of "p-hacking" and other aspects of meta-analysis.

P-values are shown to be extremely skewed and volatile, regardless of the sample size n, and to vary greatly across repetitions of exactly the same protocols under identical stochastic copies of the phenomenon; such volatility makes the minimum p-value diverge significantly from the "true" one. Setting the power is shown to offer little remedy unless the sample size is increased markedly or the p-value is lowered by at least one order of magnitude.

Fig. 1. The "p-hacking" value across m trials (expected minimum p-value against the number of trials m) for the "true" median p-value pM = .15 and expected "true" value ps = .22. We can observe how easily one can reach spurious values < .02 with a small number of trials.
Fig. 2. The different values of Eq. (1) (the PDF of p for n = 5, 10, 15, 20, 25) showing convergence to the limiting distribution.

P-VALUE hacking, just like an option or other members of the class of convex payoffs, is a function that benefits from the underlying variance and higher-moment variability. The researcher or group of researchers has an implicit "option" to pick the most favorable result in m trials, without disclosing the number of attempts, so we tend to get a rosier picture of the end result than reality. The distribution of the minimum p-value and the "optionality" can be made explicit, expressed in a parsimonious formula allowing for the understanding of biases in scientific studies, particularly under environments with high publication pressure.

Assume that we know the "true" p-value, ps; what would its realizations look like across various attempts on statistically identical copies of the phenomena? By true value ps, we mean its expected value by the law of large numbers across an m-ensemble of possible samples for the phenomenon under scrutiny, that is $\frac{1}{m}\sum_{i \leq m} p_i \overset{P}{\longrightarrow} p_s$ (where $\overset{P}{\longrightarrow}$ denotes convergence in probability). A similar convergence argument can also be made for the corresponding "true median" pM. The distribution for small sample size n can be made explicit (albeit with special inverse functions), as well as its parsimonious limiting one for large n, with no other parameter than the median value pM. We were unable to get an explicit form for ps, but we go around it with the use of the median.

It turns out, as we can see in Fig. 3, that the distribution is extremely asymmetric (right-skewed), to the point where 75% of the realizations of a "true" p-value of .05 will be <.05 (a borderline situation is 3× as likely to pass as to fail a given protocol), and, what is worse, 60% of the realizations of a true p-value of .12 will be below .05. This implies serious gaming and "p-hacking" by researchers, even under a moderate amount of repetition of experiments.

Although with compact support, the distribution exhibits the attributes of extreme fat-tailedness. For an observed p-value of, say, .02, the "true" p-value is likely to be >.1 (and very possibly close to .2), with a standard deviation >.2 (sic) and a mean deviation of around .35 (sic, sic). Because of the excessive skewness, measures of dispersion in L1 and L2 (and higher norms) vary hardly at all with ps, so the standard deviation is not proportional, meaning an in-sample .01 p-value has a significant probability of having a true value >.3.
Second version, January 2018; first version, March 2015.
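As a quick illustration of the two effects above, the following Monte Carlo sketch (not part of the paper; the parameter choices are illustrative and echo Fig. 1's pM = .15) simulates realized one-tailed p-values under the setup of Proposition 1 below, a Student T test statistic with n degrees of freedom shifted so that the median p-value equals pM, and then the expected minimum p-value over m attempts:

```python
# Monte Carlo sketch (not from the paper): dispersion of realized one-tailed
# p-values across statistically identical copies of an experiment, and the
# expected minimum p-value over m attempts ("p-hacking"). Assumes a Student-T
# test statistic with n degrees of freedom shifted so that the median p is pM.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulate_pvalues(p_median, n, size):
    """Draw one-tailed p-values whose median is p_median, for sample size n."""
    zeta_bar = stats.t.isf(p_median, df=n)           # location of the test statistic
    zeta = zeta_bar + rng.standard_t(df=n, size=size)
    return stats.t.sf(zeta, df=n)                    # p-value = survival function

p = simulate_pvalues(p_median=0.15, n=10, size=1_000_000)
print("P(p < .05):", (p < 0.05).mean())              # spurious "significance" rate
print("quartiles :", np.quantile(p, [0.25, 0.5, 0.75]))

# Expected minimum p-value over m independent attempts (the researcher's "option")
for m in (1, 2, 5, 10, 15):
    mins = simulate_pvalues(0.15, 10, size=100_000 * m).reshape(-1, m).min(axis=1)
    print(f"m = {m:2d}   E[min p] = {mins.mean():.4f}")
```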
So clearly we don't know what we are talking about when we talk about p-values.

Earlier attempts at an explicit meta-distribution in the literature are found in [1] and [2], though for situations of Gaussian subordination and with less parsimonious parametrization. The severity of the problem of significance of the so-called "statistically significant" has been discussed in [3]; a remedy via Bayesian methods is offered in [4], which in fact recommends the same tightening of standards, to p-values ≈ .01. But the gravity of the extreme skewness of the distribution of p-values is only apparent when one looks at the meta-distribution.

For notation, we use n for the sample size of a given study and m for the number of trials leading to a p-value.

I. DERIVATION OF THE METADISTRIBUTION OF P-VALUES

Proposition 1. Let P be a random variable ∈ [0, 1] corresponding to the sample-derived one-tailed p-value from the paired T-test statistic (unknown variance) with median value M(P) = pM ∈ [0, 1], derived from a sample of size n. The distribution across the ensemble of statistically identical copies of the sample has for PDF

$$
\varphi(p; p_M) =
\begin{cases}
\varphi(p; p_M)_L & \text{for } p < \tfrac{1}{2} \\
\varphi(p; p_M)_H & \text{for } p > \tfrac{1}{2}
\end{cases}
$$

$$
\varphi(p; p_M)_L = \lambda_p^{\frac{1}{2}(-n-1)}
\sqrt{-\frac{\lambda_p\,(\lambda_{p_M}-1)}{(\lambda_p-1)\,\lambda_{p_M}-2\sqrt{(1-\lambda_p)\,\lambda_p\,(1-\lambda_{p_M})\,\lambda_{p_M}}+1}}\;
\left(\frac{1}{\lambda_p}-\frac{2\sqrt{1-\lambda_p}\,\sqrt{\lambda_{p_M}}}{\sqrt{\lambda_p}\,\sqrt{1-\lambda_{p_M}}}+\frac{1}{1-\lambda_{p_M}}-1\right)^{-n/2}
$$

$$
\varphi(p; p_M)_H = \left(1-\lambda'_p\right)^{\frac{1}{2}(-n-1)}
\left(\frac{(\lambda'_p-1)(\lambda_{p_M}-1)}{\lambda'_p\,(-\lambda_{p_M})+2\sqrt{(1-\lambda'_p)\,\lambda'_p\,(1-\lambda_{p_M})\,\lambda_{p_M}}+1}\right)^{\frac{n+1}{2}}
\tag{1}
$$

where $\lambda_p = I^{-1}_{2p}\left(\tfrac{n}{2}, \tfrac{1}{2}\right)$, $\lambda_{p_M} = I^{-1}_{1-2p_M}\left(\tfrac{1}{2}, \tfrac{n}{2}\right)$, $\lambda'_p = I^{-1}_{2p-1}\left(\tfrac{1}{2}, \tfrac{n}{2}\right)$, and $I^{-1}_{(\cdot)}(\cdot,\cdot)$ is the inverse of the regularized incomplete beta function.

Remark 1. For p = 1/2 the distribution does not exist in theory, but it does in practice and we can work around it with the sequence $p_{M_k} = \tfrac{1}{2} \pm \tfrac{1}{k}$, as in the graph showing a convergence to the Uniform distribution on [0, 1] in Figure 4. Also note that what is called the "null" hypothesis is effectively a set of measure 0.

Proof. Let Z be a random normalized variable with realizations ζ, from a vector $\vec{v}$ of n realizations, with sample mean $m_v$ and sample standard deviation $s_v$, $\zeta = \frac{m_v - m_h}{s_v/\sqrt{n}}$ (where $m_h$ is the level it is tested against), hence assumed to follow a Student T distribution with n degrees of freedom, and, crucially, supposed to deliver a mean of $\bar{\zeta}$:

$$
f(\zeta; \bar{\zeta}) = \frac{\left(\frac{n}{(\bar{\zeta}-\zeta)^2+n}\right)^{\frac{n+1}{2}}}{\sqrt{n}\; B\!\left(\frac{n}{2}, \frac{1}{2}\right)}
$$

where B(·,·) is the standard beta function. Let g(·) be the one-tailed survival function of the Student T distribution with zero mean and n degrees of freedom:

$$
g(\zeta) = \mathbb{P}(Z > \zeta) =
\begin{cases}
\frac{1}{2}\, I_{\frac{n}{\zeta^2+n}}\!\left(\frac{n}{2}, \frac{1}{2}\right) & \zeta \geq 0 \\[6pt]
\frac{1}{2}\left(I_{\frac{\zeta^2}{\zeta^2+n}}\!\left(\frac{1}{2}, \frac{n}{2}\right)+1\right) & \zeta < 0
\end{cases}
$$

where $I_{(\cdot)}(\cdot,\cdot)$ is the regularized incomplete beta function.

We now look for the distribution of p = g(ζ). Given that g(·) is a legitimate Borel function, and naming p the probability as a random variable, we have by a standard result for the transformation:

$$
\varphi(p, \bar{\zeta}) = \frac{f\!\left(g^{-1}(p)\right)}{\left|g'\!\left(g^{-1}(p)\right)\right|}
$$

We can convert ζ̄ into the corresponding median survival probability because of the symmetry of Z. Since one half of the observations fall on either side of ζ̄, the transformation is median preserving: $g(\bar{\zeta}) = p_M$, hence $\Phi(p_M; \cdot) = \tfrac{1}{2}$. Hence we end up having $\left\{\bar{\zeta} : \tfrac{1}{2}\, I_{\frac{n}{\bar{\zeta}^2+n}}\left(\tfrac{n}{2}, \tfrac{1}{2}\right) = p_M\right\}$ (positive case) and $\left\{\bar{\zeta} : \tfrac{1}{2}\left(I_{\frac{\bar{\zeta}^2}{\bar{\zeta}^2+n}}\left(\tfrac{1}{2}, \tfrac{n}{2}\right)+1\right) = p_M\right\}$ (negative case). Replacing, we get Eq. (1) and Proposition 1 is done.

We note that n does not increase significance, since p-values are computed from normalized variables (hence the universality of the meta-distribution); a high n corresponds to an increased convergence to the Gaussian. For large n, we can prove the following proposition:

Proposition 2. Under the same assumptions as above, the limiting distribution for ϕ(·) is

$$
\lim_{n \to \infty} \varphi(p; p_M) = e^{-\,\mathrm{erfc}^{-1}(2p_M)\left(\mathrm{erfc}^{-1}(2p_M) - 2\,\mathrm{erfc}^{-1}(2p)\right)}
\tag{2}
$$

where erfc(·) is the complementary error function and erfc⁻¹(·) its inverse.

The limiting CDF Φ(·) is

$$
\Phi(k; p_M) = \frac{1}{2}\,\mathrm{erfc}\!\left(\mathrm{erf}^{-1}(1-2k) - \mathrm{erf}^{-1}(1-2p_M)\right)
\tag{3}
$$

Proof. For large n, the distribution of $Z = \frac{m_v}{s_v/\sqrt{n}}$ becomes that of a Gaussian, and the one-tailed survival function becomes $g(\zeta) = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\zeta}{\sqrt{2}}\right)$, so that $\zeta(p) \to \sqrt{2}\,\mathrm{erfc}^{-1}(2p)$.

This limiting distribution applies for paired tests with known or assumed sample variance, since the test becomes a Gaussian variable, equivalent to the convergence of the T-test (Student T) to the Gaussian when n is large.
ACKNOWLEDGMENT
Marco Avellaneda, Pasquale Cirillo, Yaneer Bar-Yam,
friendly people on twitter, less friendly verbagiastic psychologists on twitter, ...
REFERENCES
[1] H. J. Hung, R. T. O'Neill, P. Bauer, and K. Kohne, "The behavior of the p-value when the alternative hypothesis is true," Biometrics, pp. 11–22, 1997.
[2] H. Sackrowitz and E. Samuel-Cahn, "P values as random variables—expected p values," The American Statistician, vol. 53, no. 4, pp. 326–331, 1999.
[3] A. Gelman and H. Stern, "The difference between "significant" and "not significant" is not itself statistically significant," The American Statistician, vol. 60, no. 4, pp. 328–331, 2006.
[4] V. E. Johnson, "Revised standards for statistical evidence," Proceedings of the National Academy of Sciences, vol. 110, no. 48, pp. 19313–19317, 2013.
[5] Open Science Collaboration, "Estimating the reproducibility of psychological science," Science, vol. 349, no. 6251, p. aac4716, 2015.