There has been in the recent statistical literature a vigorous debate about the role of statistical methods in ensuring reproducibility and replicability of scientific studies. While this discussion has been part of our discipline for many decades, it has gained new urgency with the rapid increase in the size and scope of data relevant to nearly every discipline of academic study, as well as to government, non-governmental organizations, and industry. One aspect of this is the emphasis in the relatively new field of data science on the development of principles, strategies and software to enable reproducibility of published studies. Several statistical and scientific journals now insist on code and data being made available to reviewers, for example.

In this editorial we follow the National Academies' consensus study report [5] (NASEM, 2019) and use 'reproducibility' to mean obtaining consistent results using the same data and methods, and 'replicability' to mean obtaining consistent results across studies in similar or closely related settings.

What might be called the 'recent' literature on statistical aspects of reproducibility and replicability often takes as its starting point the highly cited article of Ioannidis (2005) [2], arguing that most published research results are neither reproducible nor replicable. Leek and Jager (2017) [4] study this quantitatively and come to a somewhat different conclusion. There have also been many calls for changes in statistical methods in order to enhance reproducibility or replicability, including suggestions to change the method for determining a declaration of statistical significance, to abandon declarations of statistical significance, to develop formal statistical guidelines for authors (Harrington et al., 2019 [1]; JASA, 2020 [3]), and more. In fall 2020, the Harvard Data Science Review published a special issue on reproducibility and replicability (Stodden, 2020 [6]), which included a summary of NASEM (2019) [5].

Taking a very broad view, issues of reproducibility and replicability touch on almost all areas of science, and can only be addressed through concerted efforts at an institutional level. Taking a very narrow view of just the statistical aspects of reproducibility and replicability still leaves a great deal of scope for discussion and disagreement. Recognizing the importance of the debates, the editor of Statistical Science asked us to consider what our flagship review journal could contribute to the discussion. We strove to address the issues through a mix of papers on theory and on applications, highlighting particular application areas where the problems struck us as especially interesting to our readership, and focussing the theoretical work on multiple testing, post-selection inference, and methods that provide statistical guarantees without detailed model assumptions.

The papers in this volume cover various aspects of both theory and application. They are generally concerned with replicability, which is arguably more relevant for study of the theory and methods of statistical science. It should be noted that several authors did indeed provide code and data to ensure their computational results are reproducible.

Rothenhäusler and Bühlmann discuss a general approach to both stability and generalizability of inferences. From a theoretical point of view, internal stability to perturbations of the data distribution helps to ensure replicability of findings. External validity is discussed in the context of both point estimation and uncertainty quantification. Parmigiani explores replicability for predictions, an important concern in machine learning. He characterizes this as results that are consistent across studies suitable to address the same prediction question. He proposes a multi-agent framework for defining replicability and shows that some of the common practical approaches are special cases.
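To fix ideas, the following minimal sketch (ours, not the authors'; the synthetic sample and the Dirichlet-weight perturbation scheme are illustrative assumptions) probes the internal stability of a simple estimate by reweighting the empirical distribution and watching how far the estimate moves.

    # Minimal sketch (not from the paper): probe the stability of an
    # estimate under random reweightings of the data distribution.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=500)   # synthetic sample

    def weighted_mean(x, w):
        return np.sum(w * x) / np.sum(w)

    # Dirichlet weights mimic small perturbations of the empirical
    # distribution (the Bayesian bootstrap); a stable estimator should
    # move little across perturbations.
    estimates = []
    for _ in range(1000):
        w = rng.dirichlet(np.ones(len(x)))
        estimates.append(weighted_mean(x, w))

    print("estimate:", x.mean())
    print("spread under perturbation:", np.std(estimates))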
Robertson, Wason and Ramdas focus their attention on large-scale hypothesis testing in online settings, which gives rise to issues of multiplicity, and thus affects replicability if these issues are not addressed. Examples treated include A/B testing, platform trials in which several treatments use the same control group, and the use by many groups of researchers of the same online database. They describe and illustrate several algorithms for control of error rates that explicitly depend on time, such as the family-wise error rate FWER(t) and the false-discovery rate FDR(t).
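As a concrete illustration of time-indexed error control (a deliberately simple alpha-spending scheme of our own, not one of the algorithms reviewed in the paper), the following sketch keeps FWER(t) below alpha at every point in an endless stream of tests.

    # Minimal sketch (our assumption, not the paper's algorithms):
    # control FWER(t) at level alpha for a never-ending stream of
    # hypotheses by spending alpha across tests via a convergent series.
    import math

    alpha = 0.05

    def spend(i):
        # alpha * 6 / (pi^2 * i^2) sums to alpha over i = 1, 2, ...
        return alpha * 6.0 / (math.pi ** 2 * i ** 2)

    def online_bonferroni(p_values):
        """Yield (index, reject?) so that FWER(t) <= alpha at every t."""
        for i, p in enumerate(p_values, start=1):
            yield i, p < spend(i)

    stream = [0.0001, 0.2, 0.003, 0.04, 0.0005]
    for i, rejected in online_bonferroni(stream):
        print(f"test {i}: reject = {rejected}")

Sharper procedures, such as the online FDR algorithms the authors discuss, reallocate the error budget adaptively; the series-based spending above is merely the simplest scheme with a time-uniform guarantee.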
Ramdas, Grünwald, Vovk and Shafer highlight recent work on methods of inference that have universal guarantees, irrespective of the model. They emphasize the links to betting and game theory, and describe how their theory of E-values can be used at arbitrary stopping times. They discuss recent theoretical advances related to universal inference that extend the application of E-values and E-processes to complex settings.
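A toy version of the betting idea (our own illustration; the coin example and the fixed bet fraction are assumptions, not the authors' development): under the null the gambler's wealth is an e-process with expectation one at every step, so Ville's inequality licenses stopping, and rejecting, the moment wealth reaches 1/alpha.

    # Minimal sketch: test H0 "the coin is fair" by betting. Under H0
    # the wealth process has expectation 1 at every step, so
    # P(wealth ever >= 1/alpha) <= alpha (Ville's inequality), and we
    # may stop and reject at any data-dependent time.
    import numpy as np

    rng = np.random.default_rng(1)
    alpha, lam = 0.05, 0.2          # significance level, fixed bet fraction

    wealth = 1.0
    for t in range(1, 10001):
        flip = rng.random() < 0.6   # truth: biased coin, P(heads) = 0.6
        # bet on heads: factor 1 + lam on heads, 1 - lam on tails;
        # a fair coin gives expected factor 1, so wealth is an e-process
        wealth *= (1 + lam) if flip else (1 - lam)
        if wealth >= 1 / alpha:
            print(f"reject H0 at time {t}, wealth = {wealth:.1f}")
            break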
Bogomolov and Heller discuss the problem of findings from meta-analysis being completely driven by a single study and thus being non-replicable. They provide an overview of analyses, with the appropriate theory, that can be used to establish replicability in the context of a single outcome in multiple studies and of multiple outcomes in multiple studies.
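One standard construction behind such analyses, shown here as a minimal sketch (our illustration, using a Bonferroni-type partial conjunction p-value rather than the authors' full methodology), tests the null that fewer than u of the n studies carry a true signal; rejecting it claims the finding replicates in at least u studies, so no single study can drive the conclusion.

    # Minimal sketch (a standard Bonferroni-type construction): a
    # p-value for the partial conjunction null "fewer than u of the n
    # studies have a true signal".
    import numpy as np

    def partial_conjunction_p(p_values, u):
        p_sorted = np.sort(np.asarray(p_values))
        n = len(p_sorted)
        # Bonferroni on the n - u + 1 largest p-values; u = n gives max(p)
        return min(1.0, (n - u + 1) * p_sorted[u - 1])

    p_studies = [0.001, 0.004, 0.03, 0.60]        # one study is null-ish
    print(partial_conjunction_p(p_studies, u=2))  # signal in >= 2 studies?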
Freuli, Held and Heyard present a detailed simulation study to consider various metrics of replication success, in the presence of what they call "questionable research practices". The metrics for success include conventional guides, such as the two-trials rule and meta-analytic approaches, two metrics based on the sceptical p-value, and a replication metric based on a Bayes factor. The questionable research practices include interim and subgroup analyses, selection of significant results, and selection of covariates.
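Two of the conventional metrics are simple enough to sketch directly (our illustration with made-up study summaries; the sceptical p-value and Bayes factor metrics are developed in the paper itself): the two-trials rule demands that each study be significant on its own, while the meta-analytic rule pools the estimates by inverse-variance weighting, and the two can disagree on the same pair of studies.

    # Minimal sketch (illustrative numbers, not the paper's simulation):
    # two conventional replication-success metrics.
    import numpy as np
    from scipy.stats import norm

    # (estimate, standard error) for original and replication study
    studies = [(0.42, 0.15), (0.23, 0.12)]

    # two-trials rule: both one-sided p-values below 0.025
    p_one_sided = [norm.sf(est / se) for est, se in studies]
    two_trials = all(p < 0.025 for p in p_one_sided)

    # fixed-effect meta-analysis: inverse-variance pooled estimate
    w = np.array([1 / se**2 for _, se in studies])
    est = np.array([e for e, _ in studies])
    pooled = np.sum(w * est) / np.sum(w)
    pooled_se = np.sqrt(1 / np.sum(w))
    p_meta = norm.sf(pooled / pooled_se)

    print(f"two-trials rule met: {two_trials}")
    print(f"pooled: {pooled:.3f} (se {pooled_se:.3f}), one-sided p = {p_meta:.4f}")

In this example the replication narrowly misses the per-study threshold while the pooled evidence remains strong, exactly the kind of disagreement among metrics that the simulation study quantifies.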
Brantner, Chang, Nguyen, Hong, Di Stefano and Stuart provide a comprehensive review of methods for integrating randomized trials and observational studies in three different data scenarios: aggregate-level data, federated learning, and individual participant-level data. They emphasize the importance of understanding how the original data were collected, analyzed, and presented, to help ensure replicability of the treatment effect heterogeneity findings.

Possolo focuses on the important topic of measurement, and the often-overlooked impact of inter-laboratory heterogeneity on reproducibility of studies. Through a collection of examples, Possolo discusses some of the statistical challenges that arise when comparing and synthesizing results obtained by individual investigators, and highlights situations where the same data, subject to slightly different modeling assumptions, lead to substantively different conclusions.
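A minimal sketch of the phenomenon (synthetic laboratory summaries of our own, not Possolo's examples): the same five laboratory means analyzed under a fixed-effect assumption and under DerSimonian-Laird random effects, where allowing for between-laboratory heterogeneity widens the uncertainty interval considerably.

    # Minimal sketch (synthetic numbers): the same inter-laboratory
    # measurements analyzed under two sets of modeling assumptions.
    import numpy as np

    y = np.array([10.1, 10.4, 9.8, 11.2, 10.9])   # lab means
    se = np.array([0.10, 0.15, 0.12, 0.20, 0.15]) # lab standard errors

    # fixed effect: labs measure one common value, weights 1/se^2
    w = 1 / se**2
    mu_fe = np.sum(w * y) / np.sum(w)
    se_fe = np.sqrt(1 / np.sum(w))

    # random effects (DerSimonian-Laird): labs differ by heterogeneity tau^2
    k = len(y)
    Q = np.sum(w * (y - mu_fe) ** 2)
    tau2 = max(0.0, (Q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (se**2 + tau2)
    mu_re = np.sum(w_re * y) / np.sum(w_re)
    se_re = np.sqrt(1 / np.sum(w_re))

    print(f"fixed effect  : {mu_fe:.2f} +/- {1.96 * se_fe:.2f}")
    print(f"random effects: {mu_re:.2f} +/- {1.96 * se_re:.2f} (tau^2 = {tau2:.3f})")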
There is a great deal of outstanding research relevant to reproducibility and replicability that we have omitted: we could provide just a snapshot of the field in this special issue. We tried to use a wide, albeit subjective, lens in the hopes of providing a broad overview of a very important set of problems.

Alicia Carriquiry is Distinguished Professor and President's Chair and Director of CSAFE, Department of Statistics, Iowa State University, Ames, Iowa 50011, USA (e-mail: [email protected]). Mike Daniels is Professor and Chair, Andrew Banks Family Endowed Chair, Department of Statistics, University of Florida, Gainesville, Florida 32603, USA (e-mail: [email protected]). Nancy Reid is University Professor, Department of Statistical Sciences, University of Toronto, Toronto, Ontario M5G 1X6, Canada (e-mail: [email protected]).

REFERENCES

[1] HARRINGTON, D., D'AGOSTINO, R. B., GATSONIS, C., HOGAN, J. W., HUNTER, D. J., NORMAND, S.-L. T., DRAZEN, J. M. and HAMEL, M. B. (2019). New guidelines for statistical reporting in the journal. N. Engl. J. Med. 381 285–286. https://doi.org/10.1056/NEJMe1906559
[2] IOANNIDIS, J. P. A. (2005). Why most published research findings are false. PLoS Med. 2 e124. https://doi.org/10.1371/journal.pmed.0020124
[3] JASA Reproducibility Guide (2020). https://jasa-acs.github.io/repro-guide/.
[4] LEEK, J. T. and JAGER, L. R. (2017). Is most published research really false? Annu. Rev. Stat. Appl. 4 109–122. https://doi.org/10.1146/annurev-statistics-060116-054104
[5] National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. The National Academies Press, Washington, DC. https://doi.org/10.17226/25303
[6] STODDEN, V. (2020). Theme editor's introduction to reproducibility and replicability in science. Harv. Data Sci. Rev. 2 4.