
ANALYSIS

Power failure: why small sample size undermines the reliability of neuroscience
Katherine S. Button1,2, John P. A. Ioannidis3, Claire Mokrysz1, Brian A. Nosek4,
Jonathan Flint5, Emma S. J. Robinson6 and Marcus R. Munafò1
Abstract | A study with low statistical power has a reduced chance of detecting a true effect,
but it is less well appreciated that low power also reduces the likelihood that a statistically
significant result reflects a true effect. Here, we show that the average statistical power of
studies in the neurosciences is very low. The consequences of this include overestimates of
effect size and low reproducibility of results. There are also ethical dimensions to this
problem, as unreliable research is inefficient and wasteful. Improving reproducibility in
neuroscience is a key priority and requires attention to well-established but often ignored
methodological principles.

1 School of Experimental Psychology, University of Bristol, Bristol, BS8 1TU, UK.
2 School of Social and Community Medicine, University of Bristol, Bristol, BS8 2BN, UK.
3 Stanford University School of Medicine, Stanford, California 94305, USA.
4 Department of Psychology, University of Virginia, Charlottesville, Virginia 22904, USA.
5 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, UK.
6 School of Physiology and Pharmacology, University of Bristol, Bristol, BS8 1TD, UK.
Correspondence to M.R.M. e-mail: marcus.munafo@bristol.ac.uk
doi:10.1038/nrn3475
Published online 10 April 2013; corrected online 15 April 2013

It has been claimed and demonstrated that many (and possibly most) of the conclusions drawn from biomedical research are probably false1. A central cause for this important problem is that researchers must publish in order to succeed, and publishing is a highly competitive enterprise, with certain kinds of findings more likely to be published than others. Research that produces novel results, statistically significant results (that is, typically p < 0.05) and seemingly ‘clean’ results is more likely to be published2,3. As a consequence, researchers have strong incentives to engage in research practices that make their findings publishable quickly, even if those practices reduce the likelihood that the findings reflect a true (that is, non-null) effect4. Such practices include using flexible study designs and flexible statistical analyses and running small studies with low statistical power1,5. A simulation of genetic association studies showed that a typical dataset would generate at least one false positive result almost 97% of the time6, and two efforts to replicate promising findings in biomedicine reveal replication rates of 25% or less7,8. Given that these publishing biases are pervasive across scientific practice, it is possible that false positives heavily contaminate the neuroscience literature as well, and this problem may affect at least as much, if not even more so, the most prominent journals9,10.

Here, we focus on one major aspect of the problem: low statistical power. The relationship between study power and the veracity of the resulting finding is under-appreciated. Low statistical power (because of low sample size of studies, small effects or both) negatively affects the likelihood that a nominally statistically significant finding actually reflects a true effect. We discuss the problems that arise when low-powered research designs are pervasive. In general, these problems can be divided into two categories. The first concerns problems that are mathematically expected to arise even if the research conducted is otherwise perfect: in other words, when there are no biases that tend to create statistically significant (that is, ‘positive’) results that are spurious. The second category concerns problems that reflect biases that tend to co-occur with studies of low power or that become worse in small, underpowered studies. We next empirically show that statistical power is typically low in the field of neuroscience by using evidence from a range of subfields within the neuroscience literature. We illustrate that low statistical power is an endemic problem in neuroscience and discuss the implications of this for interpreting the results of individual studies.

Low power in the absence of other biases
Three main problems contribute to producing unreliable findings in studies with low power, even when all other research practices are ideal. They are: the low probability of finding true effects; the low positive predictive value (PPV; see BOX 1 for definitions of key statistical terms) when an effect is claimed; and an exaggerated estimate of the magnitude of the effect when a true effect is discovered. Here, we discuss these problems in more detail.
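The link between power and PPV can be made concrete with a short calculation. The sketch below uses the standard formulation PPV = (1 − β)R / [(1 − β)R + α], where 1 − β is the statistical power, α is the type I error rate and R is the pre-study odds that a probed effect is truly non-null; the value R = 0.25 is an illustrative assumption, not a figure taken from the article.

    # Positive predictive value as a function of statistical power,
    # holding alpha and the pre-study odds of a true effect fixed.
    def ppv(power: float, alpha: float = 0.05, prior_odds: float = 0.25) -> float:
        """PPV = (1 - beta) * R / ((1 - beta) * R + alpha)."""
        return (power * prior_odds) / (power * prior_odds + alpha)

    for power in (0.8, 0.5, 0.2):
        print(f"power = {power:.1f} -> PPV = {ppv(power):.2f}")
    # power = 0.8 -> PPV = 0.80
    # power = 0.5 -> PPV = 0.71
    # power = 0.2 -> PPV = 0.50

Under these assumptions, dropping power from 0.8 to 0.2 reduces the chance that a nominally significant finding reflects a true effect from 80% to only 50%.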




error creates an odds ratio of 1.60. The winner’s curse means, therefore, that the ‘lucky’ scientist who makes the discovery in a small study is cursed by finding an inflated effect.

The winner’s curse can also affect the design and conclusions of replication studies. If the original estimate of the effect is inflated (for example, an odds ratio of 1.60), then replication studies will tend to show smaller effect sizes (for example, 1.20), as findings converge on the true effect. By performing more replication studies, we should eventually arrive at the more accurate odds ratio of 1.20, but this may take time or may never happen if we only perform small studies. A common misconception is that a replication study will have sufficient power to replicate an initial finding if the sample size is similar to that in the original study14. However, a study that tries to replicate a significant effect that only barely achieved nominal statistical significance (that is, p ~ 0.05) and that uses the same sample size as the original study will only achieve ~50% power, even if the original study accurately estimated the true effect size. This is illustrated in FIG. 1. Many published studies only barely achieve nominal statistical significance15. This means that if researchers in a particular field determine their sample sizes by historical precedent rather than through formal power calculation, this will place an upper limit on average power within that field. As the true effect size is likely to be smaller than that indicated by the initial study — for example, because of the winner’s curse — the actual power is likely to be much lower. Furthermore, even if power calculation is used to estimate the sample size that is necessary in a replication study, these calculations will be overly optimistic if they are based on estimates of the true effect size that are inflated owing to the winner’s curse phenomenon. This will further hamper the replication process.

Low power in the presence of other biases
Low power is associated with several additional biases. First, low-powered studies are more likely to provide a wide range of estimates of the magnitude of an effect (which is known as ‘vibration of effects’ and is described below). Second, publication bias, selective data analysis and selective reporting of outcomes are more likely to affect low-powered studies. Third, small studies may be of lower quality in other aspects of their design as well. These factors can further exacerbate the low reliability of evidence obtained in studies with low statistical power.

Vibration of effects13 refers to the situation in which a study obtains different estimates of the magnitude of the effect depending on the analytical options it implements. These options could include the statistical model, the definition of the variables of interest, the use (or not) of adjustments for certain potential confounders but not others, the use of filters to include or exclude specific observations and so on. For example, a recent analysis of 241 functional MRI (fMRI) studies showed that 223 unique analysis strategies were observed so that almost no strategy occurred more than once16. Results can vary markedly depending on the analysis strategy1. This is more often the case for small studies — here, results can change easily as a result of even minor analytical manipulations. In small studies, the range of results that can be obtained owing to vibration of effects is wider than in larger studies, because the results are more uncertain and therefore fluctuate more in response to analytical changes. Imagine, for example, dropping three observations from the analysis of a study of 12 samples because post hoc they are considered unsatisfactory; this manipulation may not even be mentioned in the published paper, which …

[Figure 1 comprises three panels: a, Rejecting the null hypothesis; b, The sampling distribution; c, Increasing statistical power; in each, the null and observed sampling distributions are separated by 1.96 × sem.]

Figure 1 | Statistical power of a replication study. a | If a study finds evidence for an effect at p = 0.05, then the difference between the mean of the null distribution (indicated by the solid blue curve) and the mean of the observed distribution (dashed blue curve) is 1.96 × sem. b | Studies attempting to replicate an effect using the same sample size as that of the original study would have roughly the same sampling variation (that is, sem) as in the original study. Assuming, as one might in a power calculation, that the initially observed effect we are trying to replicate reflects the true effect, the potential distribution of these replication effect estimates would be similar to the distribution of the original study (dashed green curve). A study attempting to replicate a nominally significant effect (p ~ 0.05), which uses the same sample size as the original study, would therefore have (on average) a 50% chance of rejecting the null hypothesis (indicated by the coloured area under the green curve) and thus only 50% statistical power. c | We can increase the power of the replication study (coloured area under the orange curve) by increasing the sample size so as to reduce the sem. Powering a replication study adequately (that is, achieving a power ≥ 80%) therefore often requires a larger sample size than the original study, and a power calculation will help to decide the required size of the replication sample.
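The ~50% figure in the caption follows directly from the geometry of FIG. 1 and can be checked with a normal approximation. The sketch below assumes that the original effect was observed at exactly p = 0.05 (1.96 sem from the null) and that this observed effect equals the true effect; the helper function and sample-size ratios are illustrative, not taken from the article.

    # Power of a replication study under a normal approximation, assuming the
    # original result sat exactly at p = 0.05 and equals the true effect.
    from scipy import stats

    z_crit = stats.norm.ppf(0.975)  # 1.96 for a two-sided test at alpha = 0.05

    def replication_power(n_ratio: float) -> float:
        # sem shrinks by sqrt(n_ratio), so the expected z-value grows by the same factor
        expected_z = z_crit * n_ratio ** 0.5
        # probability of landing beyond either critical value (two-sided test)
        return stats.norm.sf(z_crit - expected_z) + stats.norm.cdf(-z_crit - expected_z)

    print(f"{replication_power(1.0):.2f}")  # 0.50: same sample size as the original
    print(f"{replication_power(2.0):.2f}")  # 0.79: roughly double the original sample

Roughly doubling the sample size is therefore what it takes to push a replication of a just-significant result to ~80% power, as the caption notes.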

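The exclusion scenario sketched at the end of the vibration-of-effects paragraph (dropping three observations from a study of 12 samples) can also be simulated directly. The snippet below is a hypothetical illustration, not the authors' analysis: it enumerates every possible three-observation exclusion from a simulated two-group study and records how far the effect estimate and p-value move.

    # Illustrating 'vibration of effects': drop 3 of 12 observations in every
    # possible way and track the spread of effect sizes and p-values.
    import itertools
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0.0, 1.0, 6),    # simulated control group
                           rng.normal(0.8, 1.0, 6)])   # simulated treatment group
    labels = np.array([0] * 6 + [1] * 6)

    estimates, p_values = [], []
    for dropped in itertools.combinations(range(12), 3):
        keep = np.setdiff1d(np.arange(12), dropped)
        a, b = data[keep][labels[keep] == 0], data[keep][labels[keep] == 1]
        if min(len(a), len(b)) < 2:                    # need >= 2 per group for a t-test
            continue
        _, p = stats.ttest_ind(b, a)
        pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                             (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
        estimates.append((b.mean() - a.mean()) / pooled_sd)   # Cohen's d
        p_values.append(p)

    print(f"Cohen's d ranges from {min(estimates):.2f} to {max(estimates):.2f}")
    print(f"p-values range from {min(p_values):.3f} to {max(p_values):.3f}")

The same exclusions applied to a study ten times larger would barely move either quantity, which is the sense in which small studies 'vibrate' more.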



Table 1 | Characteristics of included meta-analyses

First author (study) | k | N, median (range) | Summary effect size (Cohen's d or OR) | Random or fixed effects | Power, median (range) | Refs
Babbage | 13 | 48 (24–67) | –1.11 | Random | 0.96 (0.74–0.99) | 24
Bai | 18 | 322 (92–3152) | 1.47 | Random | 0.20 (0.06–1.00) | 25
Bjorkhem-Bergman | 6 | 59 (37–72) | –1.20 | Random | 0.99 (0.94–1.00) | 26
Bucossi | 21 | 85 (19–189) | 0.41 | Random | 0.46 (0.13–0.79) | 27
Chamberlain | 11 | 53 (20–452) | –0.51 | NA | 0.54 (0.33–1.00) | 28
Chang | 56 | 55 (20–309) | –0.19 | Random | 0.10 (0.07–0.38) | 29
Chang | 6 | 616.5 (157–1492) | 0.98 | Fixed | 0.05 (0.05–0.06) | 30
Chen | 12 | 1193 (288–29573) | 0.60 | Random | 0.92 (0.13–1.00) | 31
Chung | 11 | 253 (129–703) | 0.67 | Fixed | 0.09 (0.00–0.15) | 32
Domellof | 14 | 143.5 (42–5795) | 2.12 | Random | 0.47 (0.00–1.00) | 33
Etminan | 14 | 109 (31–753) | 0.80 | Random | 0.08 (0.05–0.23) | 34
Feng | 4 | 450 (370–1715) | 1.20 | Fixed | 0.16 (0.09–0.42) | 35
Green | 17 | 69 (29–687) | –0.59 | Random | 0.65 (0.34–1.00) | 36
Han | 14 | 212 (40–4190) | 1.35 | Random | 0.12 (0.05–0.95) | 37
Hannestad | 13 | 23 (12–100) | –0.13 | Random | 0.09 (0.07–0.25) | 38
Hua | 27 | 468 (114–1522) | 1.13 | Random | 0.09 (0.06–0.22) | 39
Lindson | 8 | 257 (48–1100) | 1.05 | Fixed | 0.05 (0.05–0.06) | 40
Liu | 12 | 563 (148–1956) | 1.04 | Fixed | 0.05 (0.05–0.07) | 41
Lui | 6 | 1678 (1033–9242) | 0.89 | Fixed | 0.15 (0.12–0.60) | 42
MacKillop | 57 | 52 (18–227) | 0.58 | Fixed | 0.51 (0.21–0.99) | 43
Maneeton | 5 | 53 (22–162) | 1.67* | Random | 0.13 (0.08–0.35) | 44
Ohi | 6 | 674 (200–2218) | 1.12 | Fixed | 0.10 (0.07–0.24) | 45
Olabi | 14 | 68.5 (14–209) | –0.40 | Random | 0.34 (0.13–0.83) | 46
Oldershaw | 10 | 65.5 (40–126) | –0.51 | Random | 0.53 (0.35–0.79) | 47
Oliver | 7 | 156 (66–677) | 0.86 | Fixed | 0.07 (0.06–0.17) | 48
Peerbooms | 36 | 229 (26–2913) | 1.26 | Random | 0.11 (0.00–0.36) | 49
Pizzagalli | 22 | 16 (8–44) | 0.92 | Random | 0.44 (0.19–0.90) | 50
Rist | 5 | 150 (99–626) | 2.06 | Random | 0.55 (0.35–0.98) | 51
Sexton | 8 | 35 (20–208) | 0.43 | Fixed | 0.24 (0.15–0.98) | 52
Shum | 11 | 40 (24–129) | 0.89 | Fixed | 0.78 (0.54–0.93) | 53
Sim | 2 | 72 (46–98) | 1.23* | Random | 0.07 (0.07–0.08) | 54
Song | 12 | 85 (32–279) | 0.15 | NA | 0.10 (0.07–0.21) | 55
Sun | 6 | 437.5 (158–712) | 1.93 | Fixed | 0.65 (0.14–0.98) | 56
Tian | 4 | 50 (32–63) | 1.26 | NA | 0.98 (0.93–1.00) | 57
Trzesniak | 11 | 124 (55–279) | 1.98 | Random | 0.27 (0.09–0.64) | 58
Veehof | 8 | 58.5 (19–156) | 0.37 | Fixed | 0.26 (0.12–0.60) | 59
Vergouwen | 24 | 223 (39–1015) | 0.83 | Random | 0.09 (0.06–0.22) | 60
Vieta | 10 | 212 (113–361) | 0.68* | Random | 0.27 (0.16–0.39) | 61
Wisdom | 53 | 137 (20–7895) | –0.14 | NA | 0.12 (0.06–1.00) | 62
Witteman | 26 | 28 (15–80) | –1.41 | Random | 0.94 (0.66–1.00) | 63
Woon | 24 | 30 (8–68) | –0.60 | Random | 0.36 (0.11–0.69) | 64
Xuan | 20 | 348.5 (111–1893) | 1.00 | Random | 0.05 (0.05–0.05) | 65
Yang (cohort) | 14 | 296 (100–1968) | 1.38* | Random | 0.18 (0.11–0.79) | 66
Yang (case control) | 7 | 126 (72–392) | 2.48 | Random | 0.73 (0.43–0.93) | 66
k, number of studies in the meta-analysis; N, study sample size; OR, odds ratio.
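For the rows of Table 1 that report a Cohen's d, the power column can be approximated by asking what power each study had to detect the meta-analytic summary effect in a two-group comparison. The exact calculation used by the authors is not given in this excerpt, so the sketch below, which assumes two equal groups and a two-sided t-test, is only an approximation; odds-ratio entries would need a different computation.

    # Approximate per-study power to detect a meta-analytic effect size
    # (Cohen's d) in a two-group design with equal group sizes.
    from statsmodels.stats.power import TTestIndPower

    def study_power(total_n: int, summary_d: float, alpha: float = 0.05) -> float:
        return TTestIndPower().power(effect_size=abs(summary_d), nobs1=total_n / 2,
                                     ratio=1.0, alpha=alpha, alternative='two-sided')

    # e.g. the Babbage row: median N = 48 and summary d = -1.11 give power of
    # about 0.96, in line with the tabulated median power of 0.96.
    print(round(study_power(48, -1.11), 2))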

