ANALYSIS

may simply report that only nine patients were studied. A manipulation affecting only three observations could change the odds ratio from 1.00 to 1.50 in a small study but might only change it from 1.00 to 1.01 in a very large study. When investigators select the most favourable, interesting, significant or promising results among a wide spectrum of estimates of effect magnitudes, this is inevitably a biased choice.

Publication bias and selective reporting of outcomes and analyses are also more likely to affect smaller, underpowered studies17. Indeed, investigations into publication bias often examine whether small studies yield different results than larger ones18. Smaller studies more readily disappear into a file drawer than very large studies that are widely known and visible, and the results of which are eagerly anticipated (although this correlation is far from perfect). A ‘negative’ result in a high-powered study cannot be explained away as being due to low power19,20, and thus reviewers and editors may be more willing to publish it, whereas they more easily reject a small ‘negative’ study as being inconclusive or uninformative21. The protocols of large studies are also more likely to have been registered or otherwise made publicly available, so that deviations in the analysis plans and choice of outcomes may become obvious more easily. Small studies, conversely, are often subject to a higher level of exploration of their results and selective reporting thereof.

Third, smaller studies may have a worse design quality than larger studies. Several small studies may be opportunistic experiments, or the data collection and analysis may have been conducted with little planning. Conversely, large studies often require more funding and personnel resources. As a consequence, designs are examined more carefully before data collection, and analysis and reporting may be more structured. This relationship is not absolute: small studies are not always of low quality. Indeed, a bias in favour of small studies may occur if the small studies are meticulously designed and collect high-quality data (and therefore are forced to be small) and if large studies ignore or drop quality checks in an effort to include as large a sample as possible.
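To make the odds ratio point concrete, the sketch below computes the odds ratio before and after three outcomes are flipped in one group, once for a small study and once for a large one. The per-group sample sizes (30 and 1,000) and the 50% baseline event rate are illustrative assumptions chosen so that the result matches the magnitudes quoted in the text; they are not taken from any particular study.

```python
def odds_ratio(events_a, n_a, events_b, n_b):
    """Odds ratio comparing group A to group B in a 2x2 table."""
    return (events_a / (n_a - events_a)) / (events_b / (n_b - events_b))

def show_flip(n_per_group, flipped=3):
    """Print the odds ratio before and after flipping `flipped` outcomes in one group."""
    events = n_per_group // 2                      # identical event rates to start, so OR = 1.00
    before = odds_ratio(events, n_per_group, events, n_per_group)
    after = odds_ratio(events + flipped, n_per_group, events, n_per_group)
    print(f"n = {n_per_group:>5} per group: OR {before:.2f} -> {after:.2f}")

show_flip(30)     # small study:  OR 1.00 -> 1.50
show_flip(1000)   # large study:  OR 1.00 -> 1.01
```

Flipping the same three observations moves the odds ratio by 50% in the small study but by only about 1% in the large one, which is why selective handling of a handful of observations matters far more in underpowered samples.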

Empirical evidence from neuroscience
Any attempt to establish the average statistical power in neuroscience is hampered by the problem that the true effect sizes are not known. One solution to this problem is to use data from meta-analyses. Meta-analysis provides the best estimate of the true effect size, albeit with limitations, including the limitation that the individual studies that contribute to a meta-analysis are themselves subject to the problems described above. If anything, summary effects from meta-analyses, including power estimates calculated from meta-analysis results, may also be modestly inflated22.

Acknowledging this caveat, in order to estimate statistical power in neuroscience, we examined neuroscience meta-analyses published in 2011 that were retrieved using ‘neuroscience’ and ‘meta-analysis’ as search terms. Using the reported summary effects of the meta-analyses as the estimate of the true effects, we calculated the power of each individual study to detect the effect indicated by the corresponding meta-analysis.

Methods. Included in our analysis were articles published in 2011 that described at least one meta-analysis of previously published studies in neuroscience with a summary effect estimate (mean difference or odds/risk ratio) as well as study-level data on group sample size and, for odds/risk ratios, the number of events in the control group.

We searched computerized databases on 2 February 2012 via Web of Science for articles published in 2011, using the key words ‘neuroscience’ and ‘meta-analysis’. All of the articles that were identified via this electronic search were screened independently for suitability by two authors (K.S.B. and M.R.M.). Articles were excluded if no abstract was electronically available (for example, conference proceedings and commentaries) or if both authors agreed, on the basis of the abstract, that a meta-analysis had not been conducted. Full texts were obtained for the remaining articles and again independently assessed for eligibility by two authors (K.S.B. and M.R.M.) (FIG. 2).

[Figure 2 flow diagram: records identified through database search (n = 246); additional records identified through other sources (n = 0); records after duplicates removed (n = 246); abstracts screened (n = 246), of which 73 excluded; full-text articles screened (n = 173), of which 82 excluded; full-text articles assessed for eligibility (n = 91), of which 43 excluded; articles included in analysis (n = 48).]

Figure 2 | Flow diagram of articles selected for inclusion. Computerized databases were searched on 2 February 2012 via Web of Science for papers published in 2011, using the key words ‘neuroscience’ and ‘meta-analysis’. Two authors (K.S.B. and M.R.M.) independently screened all of the papers that were identified for suitability (n = 246). Articles were excluded if no abstract was electronically available (for example, conference proceedings and commentaries) or if both authors agreed, on the basis of the abstract, that a meta-analysis had not been conducted. Full texts were obtained for the remaining articles (n = 173) and again independently assessed for eligibility by K.S.B. and M.R.M. Articles were excluded (n = 82) if both authors agreed, on the basis of the full text, that a meta-analysis had not been conducted. The remaining articles (n = 91) were assessed in detail by K.S.B. and M.R.M. or C.M. Articles were excluded at this stage if they could not provide the following data for extraction for at least one meta-analysis: first author and summary effect size estimate of the meta-analysis; and first author, publication year, sample size (by groups) and number of events in the control group (for odds/risk ratios) of the contributing studies. Data extraction was performed independently by K.S.B. and M.R.M. or C.M. and verified collaboratively. In total, n = 48 articles were included in the analysis.
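The per-study power calculation described above can be sketched as follows for the standardized-mean-difference case. This is an illustration of the approach using the statsmodels power routines, not the authors' own analysis code, and the summary effect (d = 0.35) and group sizes (18 and 20) are hypothetical.

```python
from statsmodels.stats.power import TTestIndPower

def study_power(summary_d, n1, n2, alpha=0.05):
    """Power of a single two-group study to detect the meta-analytic
    summary effect `summary_d` (Cohen's d) at significance level alpha."""
    return TTestIndPower().power(effect_size=summary_d,
                                 nobs1=n1,
                                 ratio=n2 / n1,
                                 alpha=alpha)

# Hypothetical example: a meta-analysis reports a summary effect of d = 0.35,
# and one contributing study compared groups of 18 and 20 participants.
print(f"power = {study_power(0.35, 18, 20):.2f}")
```

For summary effects expressed as odds or risk ratios, the analogous calculation also requires the control-group event rate, which is why that quantity was extracted from the contributing studies.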


…are expended87. Similarly, adopting conservative priors can substantially reduce the likelihood of claiming that an effect exists when in fact it does not85. At present, significance testing remains the dominant framework within neuroscience, but the flexibility of alternative (for example, Bayesian) approaches means that they should be taken seriously by the field.

Conclusions and future directions
A consequence of the remarkable growth in neuroscience over the past 50 years has been that the effects we now seek in our experiments are often smaller and more subtle than the mostly easily discernible ‘low-hanging fruit’ that were targeted previously. At the same time, computational analysis of very large datasets is now relatively straightforward, so that an enormous number of tests can be run in a short time on the same dataset. These dramatic advances in the flexibility of research design and analysis have occurred without accompanying changes to other aspects of research design, particularly power. For example, the average sample size has not changed substantially over time88 despite the fact that neuroscientists are likely to be pursuing smaller effects. The increase in research flexibility and the complexity of study designs89, combined with the stability of sample size and the search for increasingly subtle effects, has a disquieting consequence: a dramatic increase in the likelihood that statistically significant findings are spurious. This may be at the root of the recent replication failures in the preclinical literature8 and the correspondingly poor translation of these findings into humans90.

Low power is a problem in practice because of the normative publishing standards for producing novel, significant, clean results and the ubiquity of null hypothesis significance testing as the means of evaluating the truth of research findings. As we have shown, these factors result in biases that are exacerbated by low power. Ultimately, these biases reduce the reproducibility of neuroscience findings and negatively affect the validity of the accumulated findings. Unfortunately, publishing and reporting practices are unlikely to change rapidly. Nonetheless, existing scientific practices can be improved with small changes or additions that approximate key features of the idealized model4,91,92. We provide a summary of recommendations for future research practice in BOX 2.

Increasing disclosure. False positives occur more frequently and go unnoticed when degrees of freedom in data analysis and reporting are undisclosed5. Researchers can improve confidence in published reports by noting in the text: "We report how we determined our sample size, all data exclusions, all data manipulations, and all measures in the study."7 When such a statement is not possible, disclosure of the rationale and justification of deviations from what should be common practice (that is, reporting sample size, data exclusions, manipulations and measures) will improve readers' understanding and interpretation of the reported effects and, therefore, of what level of confidence in the reported effects is appropriate. In clinical trials, there is an increasing requirement to adhere to the Consolidated Standards of Reporting Trials (CONSORT), and the same is true for systematic reviews and meta-analyses, for which the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines are now being adopted. A number of reporting guidelines have been produced for application to diverse study designs and tools, and an updated list is maintained by the EQUATOR Network93. A ten-item checklist of study quality has been developed by the Collaborative Approach to Meta-Analysis and Review of Animal Data in Experimental Stroke (CAMARADES), but to the best of our knowledge, this checklist is not yet widely used in primary studies.

[Figure 5 shows simulated relative bias of research findings (%) plotted against the statistical power of the study (%), with both axes running from 0 to 100.]

Figure 5 | The winner's curse: effect size inflation as a function of statistical power. The winner's curse refers to the phenomenon that studies that find evidence of an effect often provide inflated estimates of the size of that effect. Such inflation is expected when an effect has to pass a certain threshold, such as reaching statistical significance, in order for it to have been ‘discovered’. Effect inflation is worst for small, low-powered studies, which can only detect effects that happen to be large. If, for example, the true effect is medium-sized, only those small studies that, by chance, estimate the effect to be large will pass the threshold for discovery (that is, the threshold for statistical significance, which is typically set at p < 0.05). In practice, this means that research findings of small studies are biased in favour of inflated effects. By contrast, large, high-powered studies can readily detect both small and large effects and so are less biased, as both over- and underestimations of the true effect size will pass the threshold for ‘discovery’. We optimistically estimate the median statistical power of studies in the neuroscience field to be between ~8% and ~31%. The figure shows simulations of the winner's curse (expressed on the y-axis as relative bias of research findings). These simulations suggest that initial effect estimates from studies powered between ~8% and ~31% are likely to be inflated by 25% to 50% (shown by the arrows in the figure). Inflated effect estimates make it difficult to determine an adequate sample size for replication studies, increasing the probability of type II errors. Figure is modified, with permission, from REF. 103 © (2007) Cell Press.
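The simulations behind Figure 5 are not reproduced here, but the mechanism the caption describes can be sketched with a short Monte Carlo experiment: simulate many underpowered two-group studies with a known true effect, keep only those that reach p < 0.05, and compare the average ‘discovered’ effect size with the truth. The true effect (d = 0.5), group size (15 per group) and number of simulations below are illustrative assumptions, not the settings used for the published figure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def winners_curse(true_d=0.5, n_per_group=15, n_sims=20_000, alpha=0.05):
    """Average relative bias of the estimated effect size among
    simulated experiments that reach significance (two-sample t-test)."""
    significant_estimates = []
    for _ in range(n_sims):
        a = rng.normal(true_d, 1.0, n_per_group)   # treatment group
        b = rng.normal(0.0, 1.0, n_per_group)      # control group
        t, p = stats.ttest_ind(a, b)
        if p < alpha and t > 0:                    # only 'discovered' effects get reported
            pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
            significant_estimates.append((a.mean() - b.mean()) / pooled_sd)
    observed = np.mean(significant_estimates)
    return 100 * (observed - true_d) / true_d      # relative bias in %

print(f"relative bias among significant results: {winners_curse():.0f}%")
```

With these particular settings the inflation among significant results comes out larger than the 25% to 50% range quoted in the caption; the exact number depends on the true effect, the sample size and how bias is summarized, but the qualitative pattern is the same: the lower the power, the more the threshold for ‘discovery’ inflates the effects that pass it.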


Box 2 | Recommendations for researchers

Perform an a priori power calculation
Use the existing literature to estimate the size of effect you are looking for and design your study accordingly. If time or financial constraints mean your study is underpowered, make this clear and acknowledge this limitation (or limitations) in the interpretation of your results.

Disclose methods and findings transparently
If the intended analyses produce null findings and you move on to explore your data in other ways, say so. Null findings locked in file drawers bias the literature, whereas exploratory analyses are only useful and valid if you acknowledge the caveats and limitations.

Pre-register your study protocol and analysis plan
Pre-registration clarifies whether analyses are confirmatory or exploratory, encourages well-powered studies and reduces opportunities for non-transparent data mining and selective reporting. Various mechanisms for this exist (for example, the Open Science Framework).

Make study materials and data available
Making research materials available will improve the quality of studies aimed at replicating and extending research findings. Making raw data available will enhance opportunities for data aggregation and meta-analysis, and allow external checking of analyses and results.

Work collaboratively to increase power and replicate findings
Combining data increases the total sample size (and therefore power) while minimizing the labour and resource impact on any one contributor. Large-scale collaborative consortia in fields such as human genetic epidemiology have transformed the reliability of findings in these fields.

Registration of confirmatory analysis plan. Both exploratory and confirmatory research strategies are legitimate and useful. However, presenting the result of an exploratory analysis as if it arose from a confirmatory test inflates the chance that the result is a false positive. In particular, p-values lose their diagnostic value if they are not the result of a pre-specified analysis plan for which all results are reported. Pre-registration, and ultimately full reporting of analysis plans, clarifies the distinction between confirmatory and exploratory analysis, encourages well-powered studies (at least in the case of confirmatory analyses) and reduces the file-drawer effect. These subsequently reduce the likelihood of false positive accumulation. The Open Science Framework (OSF) offers a registration mechanism for scientific research. For observational studies, it would be useful to register datasets in detail, so that one can be aware of how extensive the multiplicity and complexity of analyses can be94.

Improving availability of materials and data. Making research materials available will improve the quality of studies aimed at replicating and extending research findings. Making raw data available will improve data aggregation methods and confidence in reported results. There are multiple repositories for making data more widely available, such as The Dataverse Network Project and Dryad for data in general, and others such as OpenfMRI, INDI and OASIS for neuroimaging data in particular. Also, commercial repositories (for example, figshare) offer means for sharing data and other research materials. Finally, the OSF offers infrastructure for documenting, archiving and sharing data within collaborative teams and also making some or all of those research materials publicly available. Leading journals are increasingly adopting policies for making data, protocols and analytical codes available, at least for some types of studies. However, these policies are uncommonly adhered to95, and thus the ability for independent experts to repeat published analyses remains low96.

Incentivizing replication. Weak incentives for conducting and publishing replications are a threat to identifying false positives and accumulating precise estimates of research findings. There are many ways to alter replication incentives97. For example, journals could offer a submission option for registered replications of important research results (see, for example, a possible new submission format for Cortex98). Groups of researchers can also collaborate on performing one or many replications to increase the total sample size (and therefore the statistical power) achieved while minimizing the labour and resource impact on any one contributor. Adoption of the gold standard of large-scale collaborative consortia and extensive replication in fields such as human genome epidemiology has transformed the reliability of the produced findings. Although previously almost all of the proposed candidate gene associations from small studies were false99 (with some exceptions100), collaborative consortia have substantially improved power, and the replicated results can be considered highly reliable. In another example, in the field of psychology, the Reproducibility Project is a collaboration of more than 100 researchers aiming to estimate the reproducibility of psychological science by replicating a large sample of studies published in 2008 in three psychology journals92. Each individual research study contributes just a small portion of time and effort, but the combined effect is substantial both for accumulating replications and for generating an empirical estimate of reproducibility.

Concluding remarks. Small, low-powered studies are endemic in neuroscience. Nevertheless, there are reasons to be optimistic. Some fields are confronting the problem of the poor reliability of research findings that arises from low-powered studies. For example, in genetic epidemiology, sample sizes increased dramatically with the widespread understanding that the effects being sought are likely to be extremely small. This, together with an increasing requirement for strong statistical evidence and independent replication, has resulted in far more reliable results. Moreover, the pressure for emphasizing significant results is not absolute. For example, the Proteus phenomenon101 suggests that refuting early results can be attractive in fields in which data can be produced rapidly. Nevertheless, we should not assume that science is effectively or efficiently self-correcting102. There is now substantial evidence that a large proportion of the evidence reported in the scientific literature may be unreliable. Acknowledging this challenge is the first step towards addressing the problematic aspects of current scientific practices and identifying effective solutions.
