
COLLOQUIUM OPINION

Is science really facing a reproducibility crisis, and do we need it to?
Daniele Fanelli a,1

Edited by David B. Allison, Indiana University Bloomington, Bloomington, IN, and accepted by Editorial Board Member Susan T. Fiske
November 3, 2017 (received for review June 30, 2017)

Efforts to improve the reproducibility and integrity of science are typically justified by a narrative of crisis,
according to which most published results are unreliable due to growing problems with research and
publication practices. This article provides an overview of recent evidence suggesting that this narrative is
mistaken, and argues that a narrative of epochal changes and empowerment of scientists would be more
accurate, inspiring, and compelling.
reproducible research | crisis | integrity | bias | misconduct

Is there a reproducibility crisis in science? Many seem to believe so. In a recent survey by the journal Nature, for example, around 90% of respondents agreed that there is a "slight" or "significant" crisis, and between 40% and 70% agreed that selective reporting, fraud, and pressures to publish "always" or "often" contribute to irreproducible research (1). Results of this nonrandomized survey may not accurately represent the population of practicing scientists, but they echo beliefs expressed by a rapidly growing scientific literature, which uncritically endorses a new "crisis narrative" about science (an illustrative sample of this literature is shown in Fig. 1 and listed in Dataset S1). Put simply, this new "science in crisis" narrative postulates that a large and growing proportion of studies published across disciplines are unreliable due to the declining quality and integrity of research and publication practices, largely because of growing pressures to publish and other ills affecting the contemporary scientific profession.

I argue that this crisis narrative is at least partially misguided. Recent evidence from metaresearch studies suggests that issues with research integrity and reproducibility, while certainly important phenomena that need to be addressed, are: (i) not distorting the majority of the literature, in science as a whole as well as within any given discipline; (ii) heterogeneously distributed across subfields in any given area, which suggests that generalizations are unjustified; and (iii) not growing, as the crisis narrative would presuppose. Alternative narratives, therefore, might represent a better fit for empirical data as well as for the reproducibility agenda.

How Common Are Fabricated, False, Biased, and Irreproducible Findings?
Scientific misconduct and questionable research practices (QRP) occur at frequencies that, while nonnegligible, are relatively small and therefore unlikely to have a major impact on the literature. In anonymous surveys, on average 1–2% of scientists admit to having fabricated or falsified data at least once (2). Much higher percentages admit to other QRP, such as dropping data points based on a gut feeling or failing to publish a contradictory result. However, the percentage of scientific literature that is actually affected by these practices is unknown, and evidence suggests that it is likely to be smaller, at least five times smaller according to a survey among psychologists (3). Data that directly estimate the prevalence of misconduct are scarce but appear to corroborate this conclusion. Random laboratory audits in cancer clinical trials, for example, found that only 0.28% contained "scientific improprieties" (4), and those conducted among Food and Drug Administration clinical trials between 1977 and 1988 found problems sufficient to initiate "for cause" investigations only in 4% of cases (5).

a Department of Methodology, London School of Economics and Political Science, London WC2A 2AE, United Kingdom
This paper results from the Arthur M. Sackler Colloquium of the National Academy of Sciences, “Reproducibility of Research: Issues and Proposed
Remedies,” held March 8–10, 2017, at the National Academy of Sciences in Washington, DC. The complete program and video recordings of most
presentations are available on the NAS website at www.nasonline.org/Reproducibility.
Author contributions: D.F. wrote the paper.
The author declares no conflict of interest.
This article is a PNAS Direct Submission. D.B.A. is a guest editor invited by the Editorial Board.
Published under the PNAS license.
1 Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1708272114/-/DCSupplemental.

Published online March 12, 2018.



Visual inspections of microbiology papers suggested that between 1% and 2% of papers had been manipulated in ways that suggested intentional fabrication (6, 7).

The occurrence of questionable or flawed research and publication practices may be revealed by a high rate of false-positives and "P-hacked" (8) results. However, while these issues do appear to be more common than outright scientific misconduct, their impact on the reliability of the literature appears to be contained. Analyses based on the distribution of P values reported in the medical literature, for example, suggested a false-discovery rate of only 14% (9). A similar but broader analysis concluded that P-hacking was common in many disciplines and yet had minor effects in distorting conclusions of meta-analyses (10). Moreover, the same analysis found a much stronger "evidential value" in the literature of all disciplines, which suggests that the majority of published studies are measuring true effects, a finding that again contradicts the belief that most published findings are false-positives. Methodological criticisms suggest that these and similar studies may be underestimating the true impact of P-hacking (11, 12). However, to the best of my knowledge, there is no alternative analysis that suggests that P-hacking is severely distorting the scientific literature.

Fig. 1. Number of Web of Science records that in the title, abstract, or keywords contain one of the following phrases: "reproducibility crisis," "scientific crisis," "science in crisis," "crisis in science," "replication crisis," "replicability crisis." Records were classified by the author according to whether, based on title and abstracts, they implicitly or explicitly endorsed the crisis narrative described in the text (red), or alternatively questioned the existence of such a crisis (blue), or discussed "scientific crises" of other kinds or could not be classified due to insufficient information (gray). The complete dataset, which includes all titles and abstracts and dates back to the year 1933, is available in Dataset S1. This sample is merely illustrative, and does not include the numerous recent research articles and opinion articles that discuss the "science is in crisis" narrative without including any of the above sentences in the title, abstract, or keywords.

Low statistical power might increase the risk of false-positives (as well as false-negatives). In several disciplines, the average statistical power of studies was found to be significantly below the recommended 80% level (13–16). However, such studies showed that power varies widely between subfields or methodologies, which should warn against making simplistic generalizations to entire disciplines (13–16). Moreover, the extent to which low power generates false-positives depends on assumptions about the magnitude of the true underlying effect size, level of research bias, and prior probabilities that the tested hypothesis is in fact correct (17). These assumptions, just like statistical power itself, are likely to vary substantially across subfields and are very difficult to measure in practice. For most published research findings to be false in psychology and neuroscience, for example, one must assume that the hypotheses tested in these disciplines are correct much less than 50% of the time (14, 18). This assumption is, in my opinion, unrealistic. It might reflect the condition of early exploratory studies that are conducted in a theoretical and empirical vacuum, but not that of most ordinary research, which builds upon previous theory and evidence and therefore aims at relatively predictable findings. (A worked example of this power-and-prior argument is sketched at the end of this section.)

It may be counter-argued that the background literature that produces theory and evidence on which new studies are based is distorted by publication and other reporting biases. However, the extent to which this is the case is, again, likely to vary by research subfield. Indeed, in a meta-assessment of bias across all disciplines, small-study effects and gray-literature bias (both possible symptoms of reporting biases) were highly heterogeneously distributed (19). This finding was consistent with evidence that studies on publication bias may themselves be subject to a publication bias (20), which entails that fields that do not suffer from bias are underrepresented in the metaresearch literature.

The case that most publications are nonreproducible would be supported by meta-meta-analyses, if these had shown that on average there is a strong "decline effect," in which initially strong "promising" results are contradicted by later studies. While a decline effect was measurable across many meta-analyses, it is far from ubiquitous (19). This suggests that in many meta-analyses, initial findings are refuted, whereas in others they are confirmed. Isn't this what should happen when science is functional?

Ultimately, the debate over the existence of a reproducibility crisis should have been closed by recent large-scale assessments of reproducibility. Their results, however, are either reassuring or inconclusive. A "Many labs" project reported that 10 of 13 studies taken from the psychological literature had been consistently replicated multiple times across different settings (21), whereas an analysis in experimental economics suggested that, of 18 studies, at least 11 had been successfully replicated (22). The largest reproducibility initiative to date suggested that in psychological science, reproducibility was below 50% (23). This latter estimate, however, is likely to be too pessimistic for at least two reasons. First because, once again, such a low level of reproducibility was not ubiquitous but varied depending on subfield, methodology, and expertise of the authors conducting the replication (23–25). Second, and more importantly, because how reproducibility ought to be measured is the subject of a growing methodological and philosophical debate, and reanalyses of the data suggest that reproducibility in psychological science might be higher than originally claimed (23, 26, 27). Indeed, the very notion of "reproducible research" can be confusing, because its meaning and implications depend on what aspect of research is being examined: the reproducibility of research methods can in principle be expected to be 100%; but the reproducibility of results and inferences is likely to be lower and to vary across subfields and methodologies, for reasons that have nothing to do with questionable research and publication practices (28).
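
As a worked illustration of the power-and-prior argument above, consider a minimal sketch based on the positive predictive value (PPV) formula of ref. 17, here written in terms of the prior probability π that a tested hypothesis is true and ignoring the bias term; the numerical values (α = 0.05, power 1 − β = 0.2) are chosen purely for illustration and are not taken from the article:

\[
\mathrm{PPV} \;=\; \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}
\]

Most published findings are false only if PPV < 0.5, which requires (1 − β)π < α(1 − π), that is, π/(1 − π) < α/(1 − β) = 0.05/0.2 = 0.25, or π < 0.2. Even at such low power, the claim that most findings are false therefore presupposes that fewer than roughly one in five tested hypotheses are true, which is the "much less than 50%" assumption questioned in the text; at higher power the required prior would be lower still.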



Are These Problems Getting Worse?
In light of multiple recent studies, there is no evidence that scientific misconduct and QRPs have increased. The number of yearly findings of scientific misconduct by the US Office of Research Integrity (ORI) has not increased, nor has the proportion, of all ORI investigations, that resulted in a finding of misconduct (29). Retractions have risen sharply in absolute terms, but the number of retractions per retracting journal has not, suggesting that the trend is due to the diffusion and improvement of journal retraction policies and practices (29). Errata and corrections have also not increased, nor has the rate of statistical errors made in mainstream psychological journals (29, 30).

The questionable practice known as "salami-slicing," in which results are fractionalized to increase publication output, is widely believed to be on the rise. However, there is no evidence that scientists are publishing more papers today than in the 1950s, once coauthorship is adjusted for (31). Indeed, assessments in various disciplines suggest that, far from becoming increasingly short and trivial, published studies are getting longer, more complex, and richer in data (e.g., refs. 32–34).

Biases in research and reporting were suggested to be on the rise by multiple independent studies, which had found that the relative proportion of "positive" and "statistically significant" results reported in article abstracts has increased over the years (35–37). However, the aforementioned evidence that papers in many (and maybe most) disciplines are becoming longer and more complex suggests that negative results may not be disappearing from the literature, as originally suggested, but perhaps only from abstracts. Negative results, in other words, may be increasingly embedded in longer publications that contain multiple results, and they therefore remain accessible to any researcher interested in finding them.

Finally, pressures to publish have not been convincingly linked to evidence of bias or misconduct. Earlier studies that compared the scientific productivity of countries offered some support for such a link (38, 39). However, later, finer-grained analyses offered contrary evidence, by showing that researchers that publish at higher frequency, in journals with higher impact factor, and in countries where pressures to publish are high, are equally or more likely to correct their work, less likely to publish papers that are retracted, less likely to author papers that contain duplicated images, and less likely to author papers reporting overestimated effects (19, 40, 41). The risk of misconduct and QRPs appears to be higher among researchers in countries that are increasingly represented in the global scientific literature, like China or India (7, 40). Global demographic changes, therefore, might contribute to a rise in the proportion of papers affected by scientific misconduct, but such a trend would have little to do with rising pressures to publish in Western countries.

Do We Need the "Science in Crisis" Narrative to Promote Better Science?
To summarize, an expanding metaresearch literature suggests that science—while undoubtedly facing old and new challenges—cannot be said to be undergoing a "reproducibility crisis," at least not in the sense that it is no longer reliable due to a pervasive and growing problem with findings that are fabricated, falsified, biased, underpowered, selected, and irreproducible. While these problems certainly exist and need to be tackled, evidence does not suggest that they undermine the scientific enterprise as a whole. Science always was and always will be a struggle to produce knowledge for the benefit of all of humanity against the cognitive and moral limitations of individual human beings, including the limitations of scientists themselves.

The new "science is in crisis" narrative is not only empirically unsupported, but also quite obviously counterproductive. Instead of inspiring younger generations to do more and better science, it might foster in them cynicism and indifference. Instead of inviting greater respect for and investment in research, it risks discrediting the value of evidence and feeding antiscientific agendas.

Furthermore, this narrative is not actually new. Complaints about a decline in the quality of research recur throughout the history of science, right from its beginnings (42, 43). Only two elements of novelty characterize the current "crisis." The first is that the validity of these concerns is being assessed scientifically by a global metaresearch program, with results that have been briefly overviewed above (44). The second element of historical novelty is the rising power of information and communication technologies, which are transforming scientific practices in all fields, just as they are transforming all other aspects of human life. These technologies promise to make research more accurate, powerful, open, democratic, transparent, and self-correcting than ever before. At the same time, this technological revolution creates new expectations and new challenges that metaresearchers are striving to address.

Therefore, contemporary science could be more accurately portrayed as facing "new opportunities and challenges" or even a "revolution" (45). Efforts to promote transparency and reproducibility would find complete justification in such a narrative of transformation and empowerment, a narrative that is not only more compelling and inspiring than that of a crisis, but also better supported by evidence.



1 Baker M (2016) Is there a reproducibility crisis? Nature 533:452–454.
2 Fanelli D (2009) How many scientists fabricate and falsify research? A systematic review and meta-analysis of survey data. PLoS One
4:e5738.
3 Fiedler K, Schwarz N (2016) Questionable research practices revisited. Soc Psychol Personal Sci 7:45–52.
4 Weiss RB, et al. (1993) A successful system of scientific data audits for clinical trials. A report from the Cancer and Leukemia Group B.
JAMA 270:459–464.
5 Shapiro MF, Charrow RP (1989) The role of data audits in detecting scientific misconduct. Results of the FDA program. JAMA
261:2505–2511.
6 Steneck NH (2006) Fostering integrity in research: Definitions, current knowledge, and future directions. Sci Eng Ethics 12:53–74.
7 Bik EM, Casadevall A, Fang FC (2016) The prevalence of inappropriate image duplication in biomedical research publications. MBio
7:e00809–e00816.
8 Wicherts JM, et al. (2016) Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to
avoid p-hacking. Front Psychol 7:12.
9 Jager LR, Leek JT (2014) An estimate of the science-wise false discovery rate and application to the top medical literature. Biostatistics
15:1–12.
10 Head ML, Holman L, Lanfear R, Kahn AT, Jennions MD (2015) The extent and consequences of P-hacking in science. PLoS Biol
13:e1002106.
11 Bruns SB, Ioannidis JPA (2016) p-curve and p-hacking in observational research. PLoS One 11:e0149144.
12 Bishop DVM, Thompson PA (2016) Problems in using p-curve analysis and text-mining to detect rate of p-hacking and evidential
value. PeerJ 4:e1715.
13 Dumas-Mallet E, Button KS, Boraud T, Gonon F, Munafo MR (2017) Low statistical power in biomedical science: A review of three
human research domains. R Soc Open Sci 4:160254.
14 Szucs D, Ioannidis JP (2017) Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and
psychology literature. PLoS Biol 15:e2000797.
15 Jennions MD, Moller AP (2003) A survey of the statistical power of research in behavioral ecology and animal behavior. Behav Ecol
14:438–445.
16 Fraley RC, Vazire S (2014) The N-pact factor: Evaluating the quality of empirical journals with respect to sample size and statistical
power. PLoS One 9:e109019.
17 Ioannidis JP (2005) Why most published research findings are false. PLoS Med 2:e124.
18 Button KS, et al. (2013) Power failure: Why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci
14:365–376.
19 Fanelli D, Costas R, Ioannidis JPA (2017) Meta-assessment of bias in science. Proc Natl Acad Sci USA 114:3714–3719.
20 Dubben HH, Beck-Bornholdt HP (2005) Systematic review of publication bias in studies on publication bias. BMJ 331:433–434.
21 Klein RA, et al. (2014) Investigating variation in replicability. A “Many labs” replication project. Soc Psychol 45:142–152.
22 Camerer CF, et al. (2016) Evaluating replicability of laboratory experiments in economics. Science 351:1433–1436.
23 Open Science Collaboration (2015) Psychology. Estimating the reproducibility of psychological science. Science 349:aac4716.
24 Van Bavel JJ, Mende-Siedlecki P, Brady WJ, Reinero DA (2016) Contextual sensitivity in scientific reproducibility. Proc Natl Acad Sci
USA 113:6454–6459.
25 Bench SW, Rivera GN, Schlegel RJ, Hicks JA, Lench HC (2017) Does expertise matter in replication? An examination of the
reproducibility project: Psychology. J Exp Soc Psychol 68:181–184.
26 Patil P, Peng RD, Leek JT (2016) What should researchers expect when they replicate studies? A statistical view of replicability in
psychological science. Perspect Psychol Sci 11:539–544.
27 Etz A, Vandekerckhove J (2016) A Bayesian perspective on the reproducibility project: Psychology. PLoS One 11:e0149794.
28 Goodman SN, Fanelli D, Ioannidis JPA (2016) What does research reproducibility mean? Sci Transl Med 8:341ps312.
29 Fanelli D (2013) Why growing retractions are (mostly) a good sign. PLoS Med 10:e1001563.
30 Nuijten MB, Hartgerink CHJ, van Assen MALM, Epskamp S, Wicherts JM (2016) The prevalence of statistical reporting errors in
psychology (1985-2013). Behav Res Methods 48:1205–1226.
31 Fanelli D, Larivière V (2016) Researchers’ individual publication rate has not increased in a century. PLoS One 11:e0149504.
32 Rodriguez-Esteban R, Loging WT (2013) Quantifying the complexity of medical research. Bioinformatics 29:2918–2924.
33 Vale RD (2015) Accelerating scientific publication in biology. Proc Natl Acad Sci USA 112:13439–13446.
34 Low-Décarie E, Chivers C, Granados M (2014) Rising complexity and falling explanatory power in ecology. Front Ecol Environ
12:412–418.
35 Pautasso M (2010) Worsening file-drawer problem in the abstracts of natural, medical and social science databases. Scientometrics
85:193–202.
36 Fanelli D (2012) Negative results are disappearing from most disciplines and countries. Scientometrics 90:891–904.
37 de Winter JCF, Dodou D (2015) A surge of p-values between 0.041 and 0.049 in recent decades (but negative results are increasing
rapidly too). PeerJ 3:e733.
38 Munafò MR, Attwood AS, Flint J (2008) Bias in genetic association studies: Effects of research location and resources. Psychol Med
38:1213–1214.
39 Fanelli D (2010) Do pressures to publish increase scientists’ bias? An empirical support from US States data. PLoS One 5:e10271.
40 Fanelli D, Costas R, Fang FC, Casadevall A, Bik EM (2017) Why do scientists fabricate and falsify data? A matched-control analysis of
papers containing problematic image duplications. bioRxiv, 10.1101/126805.
41 Fanelli D, Costas R, Larivière V (2015) Misconduct policies, academic culture and career stage, not gender or pressures to publish,
affect scientific integrity. PLoS One 10:e0127556.
42 Mullane K, Williams M (2017) Enhancing reproducibility: Failures from reproducibility initiatives underline core challenges. Biochem
Pharmacol 138:7–18.
43 Babbage C (1830) Reflections on the decline of science in England, and on some of its causes. Available at https://archive.org/details/reflectionsonde00mollgoog. Accessed November 29, 2017.
44 Ioannidis JPA, Fanelli D, Dunne DD, Goodman SN (2015) Meta-research: Evaluation and improvement of research methods and
practices. PLoS Biol 13:e1002264.
45 Spellman BA (2015) A short (personal) future history of revolution 2.0. Perspect Psychol Sci 10:886–899.