
Sample Size Justification

Daniël Lakens
Eindhoven University of Technology

This unpublished manuscript is submitted for peer review


An important step when designing an empirical study is to justify the sample size that
will be collected. The key aim of a sample size justification for such studies is to explain
how the collected data is expected to provide valuable information given the inferential
goals of the researcher. In this overview article six approaches are discussed to justify
the sample size in a quantitative empirical study: 1) collecting data from (almost)
the entire population, 2) choosing a sample size based on resource constraints, 3)
performing an a-priori power analysis, 4) planning for a desired accuracy, 5) using
heuristics, or 6) explicitly acknowledging the absence of a justification. An important
question to consider when justifying sample sizes is which effect sizes are deemed
interesting, and the extent to which the data that is collected informs inferences about
these effect sizes. Depending on the sample size justification chosen, researchers
could consider 1) what the smallest effect size of interest is, 2) which minimal effect
size will be statistically significant, 3) which effect sizes they expect (and what they base
these expectations on), 4) which effect sizes would be rejected based on a confidence
interval around the effect size, 5) which ranges of effects a study has sufficient power
to detect based on a sensitivity power analysis, and 6) which effect sizes are plausible
in a specific research area. Researchers can use the guidelines presented in this article
to improve their sample size justification, and hopefully, align the informational value
of a study with their inferential goals.

Keywords: sample size justification, study design, power analysis, value of information
Word count: 16642

Scientists perform empirical studies to collect data that helps to answer a research question. The more data that is collected, the more informative the study will be with respect to its inferential goals. A sample size justification should consider how informative the data will be given an inferential goal, such as estimating an effect size, or testing a hypothesis. Even though a sample size justification is sometimes requested in manuscript submission guidelines, when submitting a grant to a funder, or submitting a proposal to an ethical review board, the number of observations is often simply stated, but not justified. This makes it difficult to evaluate how informative a study will be. To prevent such concerns from emerging when it is too late (e.g., after a non-significant hypothesis test has been observed), researchers should carefully justify their sample size before data is collected.

Six Approaches to Justify Sample Sizes

Researchers often find it difficult to justify their sample size (i.e., a number of participants, observations, or any combination thereof). In this review article six possible approaches are discussed that can be used to justify the sample size in a quantitative study (see Table 1). This is not an exhaustive overview, but it includes the most common and applicable approaches for single studies.1 The first justification is that data from (almost) the entire population has been collected. The second justification centers on resource constraints, which are almost always present, but rarely explicitly evaluated. The third and fourth justifications are based on a desired statistical power or a desired accuracy.

Author note: This work was funded by VIDI Grant 452-17-013 from the Netherlands Organisation for Scientific Research. I would like to thank Shilaan Alzahawi, José Biurrun, Aaron Caldwell, Gordon Feld, Yaov Kessler, Robin Kok, Maximilian Maier, Matan Mazor, Toni Saari, Andy Siddall, and Jesper Wulff for feedback on an earlier draft. A computationally reproducible version of this manuscript is available at https://github.com/Lakens/sample_size_justification. Correspondence concerning this article should be addressed to Daniël Lakens, Den Dolech 1, 5600MB Eindhoven, The Netherlands. E-mail: [email protected]

1 The topic of power analysis for meta-analyses is outside the scope of this manuscript, but see Hedges and Pigott (2001) and Valentine, Pigott, and Rothstein (2010).

Table 1
Overview of possible justifications for the sample size in a study.

Type of justification: When is this justification applicable?

Measure entire population: A researcher can specify the entire population, it is finite, and it is possible to measure (almost) every entity in the population.

Resource constraints: Limited resources are the primary reason for the choice of the sample size a researcher can collect.

Accuracy: The research question focusses on the size of a parameter, and a researcher collects sufficient data to have an estimate with a desired level of accuracy.

A-priori power analysis: The research question has the aim to test whether certain effect sizes can be statistically rejected with a desired statistical power.

Heuristics: A researcher decides upon the sample size based on a heuristic, general rule or norm that is described in the literature, or communicated orally.

No justification: A researcher has no reason to choose a specific sample size, or does not have a clearly specified inferential goal and wants to communicate this honestly.

The fifth justification relies on heuristics, and finally, researchers can choose a sample size without any justification. Each of these justifications can be stronger or weaker depending on which conclusions researchers want to draw from the data they plan to collect.

All of these approaches to the justification of sample sizes, even the ‘no justification’ approach, are valid justifications in the sense that they give others insight into the reasons that led to the decision for a sample size in a study. It should not be surprising that the ‘heuristics’ and ‘no justification’ approaches are often unlikely to impress peers. However, it is important to note that the value of the information that is collected depends on the extent to which the final sample size allows a researcher to achieve their inferential goals, and not on the sample size justification that is chosen.

The extent to which these approaches make other researchers judge the data that is collected as informative depends on the details of the question a researcher aimed to answer and the parameters they chose when determining the sample size for their study. For example, a badly performed a-priori power analysis can quickly lead to a study with very low informational value. These six justifications are not mutually exclusive, and multiple approaches can be considered when designing a study.

Six Ways to Evaluate Which Effect Sizes are Interesting

The informativeness of the data that is collected depends on the inferential goals a researcher has, or in some cases, the inferential goals scientific peers will have. A shared feature of the different inferential goals considered in this review article is the question which effect sizes a researcher considers meaningful to distinguish. This implies that researchers need to evaluate which effect sizes they consider interesting. These evaluations rely on a combination of statistical properties and domain knowledge. In Table 2 six possibly useful considerations are provided. This is not intended to be an exhaustive overview, but it presents common and useful approaches that can be applied in practice. Not all evaluations are equally relevant for all types of sample size justifications. These considerations often rely on the same information (e.g., effect sizes, the number of observations, the standard deviation, etc.) so these six considerations should be seen as a set of complementary approaches that can be used to evaluate which effect sizes are of interest.

To start, researchers should consider what their smallest effect size of interest is. Second, although only relevant when performing a hypothesis test, researchers should consider which effect sizes could be statistically significant given a choice of an alpha level and sample size. Third, it is important to consider the (range of) effect sizes that are expected. This requires a careful consideration of the source of this expectation and the presence of possible biases in these expectations. Fourth, it is useful to consider the width of the confidence interval around possible values of the effect size in the population, and whether we can expect this confidence interval to reject effects we considered a-priori plausible. Fifth, it is worth evaluating the power of the test across a wide range of possible effect sizes in a sensitivity power analysis. Sixth, a researcher can consider the effect size distribution of related studies in the literature.

The Value of Information

Since all scientists are faced with resource limitations, they need to balance the cost of collecting each additional datapoint against the increase in information that datapoint provides. This is referred to as the value of information (Eckermann, Karnon, & Willan, 2010). Calculating the value of information is notoriously difficult (Detsky, 1990). Researchers need to specify the cost of collecting data, and weigh the costs of data collection against the increase in utility that having access to the data provides. From a value of information perspective not every data point that can be collected is equally valuable (Halpern, Brown Jr, & Hornberger, 2001; Wilson, 2015). Whenever additional observations do not change inferences in a meaningful way, the costs of data collection can outweigh the benefits.

The value of additional information can be a non-monotonic function when it depends on multiple inferential goals (see Figure 1). A researcher might be interested in comparing an effect against a previously observed large effect in the literature, a theoretically predicted medium effect, and the smallest effect that would be practically relevant. In such a situation the expected value of sampling information will lead to different optimal sample sizes for each inferential goal. It could be valuable to collect informative data about a large effect, with additional data having less marginal utility, up to a point where the data becomes increasingly informative about a medium effect size, with the value of sampling additional information decreasing once more until the study becomes increasingly informative about the presence or absence of a smallest effect of interest.

Figure 1. Example of a non-monotonically increasing value of information as a function of the sample size.

Because of the difficulty of quantifying the value of information, scientists typically use less formal approaches to justify the amount of data they set out to collect in a study. Even though the cost-benefit analysis is not always made explicit in reported sample size justifications, the value of information perspective is almost always implicitly the underlying framework that sample size justifications are based on. Throughout the subsequent discussion of sample size justifications, the importance of considering the value of information given inferential goals will repeatedly be highlighted.

Measuring (Almost) the Entire Population

In some instances, it might be possible to collect data from (almost) the entire population under investigation. For example, researchers might use census data, are able to collect data from all employees at a firm or study a small population of top athletes. Whenever it is possible to measure the entire population, the sample size justification becomes straightforward: the researcher used all the data that is available.

When the entire population is measured there is no need to perform a hypothesis test. After all, there is no population to generalize to.2 When data from the entire population has been collected the population effect size is known, and there is no confidence interval to compute. If the total population size is known, but not measured completely, then the confidence interval width should shrink to zero the closer a study gets to measuring the entire population. This is known as the finite population correction factor for the variance of the estimator (Kish, 1965). The variance of a sample mean is σ²/n, which for finite populations is multiplied by the finite population correction factor of the standard error:

FPC = \sqrt{\frac{N - n}{N - 1}}

where N is the size of the population, and n is the size of the sample. When N is much larger than n, the correction factor will be close to 1 (and therefore this correction is typically ignored when populations are very large, even when populations are finite), and will not have a noticeable effect on the variance. When the total population is measured the correction factor is 0, such that the variance becomes 0 as well.

2 It is possible to argue we are still making an inference, even when the entire population is observed, because we have observed a metaphorical population from one of many possible worlds, see Spiegelhalter (2019).

Table 2
Overview of possible ways to evaluate which effect sizes are interesting.

Type of evaluation: Which question should a researcher ask?

Smallest effect size of interest: What is the smallest effect size that is considered theoretically or practically interesting?

The minimal statistically detectable effect: Given the test and sample size, what is the critical effect size that can be statistically significant?

Expected effect size: Which effect size is expected based on theoretical predictions or previous research?

Width of confidence interval: Which effect sizes are excluded based on the expected width of the confidence interval around the effect size?

Sensitivity power analysis: Across a range of possible effect sizes, which effects does a design have sufficient power to detect when performing a hypothesis test?

Distribution of effect sizes in a research area: What is the empirical range of effect sizes in a specific research area, and which effects are a priori unlikely to be observed?

For example, when the total population consists of 100 top athletes, and data is collected from a sample of 35 athletes, the finite population correction is \sqrt{(100 - 35)/(100 - 1)} = 0.81. The superb R package can compute population corrected confidence intervals (Cousineau & Chiasson, 2019).
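As a minimal illustration (this code block is an addition, not part of the original manuscript), the correction in the example above can be computed in base R. The standard deviation of 15 used below is an arbitrary assumption, included only to show how a corrected standard error of the mean would be obtained; the superb package mentioned above offers a more complete implementation.

# Finite population correction for a sample of n = 35 from a population of N = 100
N <- 100                          # population size
n <- 35                           # sample size
fpc <- sqrt((N - n) / (N - 1))    # approximately 0.81
fpc

# The standard error of the mean shrinks by this factor
sd_x <- 15                        # assumed standard deviation of the measure
se_uncorrected <- sd_x / sqrt(n)
se_corrected <- se_uncorrected * fpc
c(se_uncorrected, se_corrected)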
Resource Constraints

A common reason for the number of observations in a study is that resource constraints limit the amount of data that can be collected at a reasonable cost (Lenth, 2001). In practice, sample sizes are always limited by the resources that are available. Researchers practically always have resource limitations, and therefore even when resource constraints are not the primary justification for the sample size in a study, it is always a secondary justification.

Despite the omnipresence of resource limitations, the topic often receives little attention in texts on experimental design. This might make it feel like acknowledging resource constraints is not appropriate, but the opposite is true: Because resource limitations always play a role, a responsible scientist carefully evaluates resource constraints when designing a study.

Resource constraint justifications are based on a trade-off between the costs of data collection, and the value of having access to the information the data provides. Even if researchers do not explicitly quantify this trade-off, it is revealed in their actions. For example, researchers rarely spend all the resources they have on a single study. Given resource constraints, researchers are confronted with an optimization problem of how to spend resources across multiple research questions.

Time and money are two resource limitations all scientists face. A PhD student has a certain time to complete a PhD thesis, and is typically expected to complete multiple research lines in this time. In addition to time limitations, researchers have limited financial resources that often directly influence how much data can be collected. A third limitation in some research lines is that there might simply be a very small number of individuals from whom data can be collected, such as when studying patients with a rare disease. A resource constraint justification puts limited resources at the center of the justification for the sample size that will be collected, and starts with the resources a scientist has available. These resources are translated into an expected number of observations (N) that a researcher expects they will be able to collect with an amount of money in a given time. The challenge is to evaluate whether collecting N observations is worthwhile. How do we decide if a study will be informative, and when should we conclude that data collection is not worthwhile?

When evaluating whether resource constraints make data collection uninformative, researchers need to explicitly consider which inferential goals they have when collecting data (Parker & Berman, 2003). Having data always provides more knowledge about the research question than not having data, so in an absolute sense, all data that is collected has value. However, it is possible that the benefits of collecting the data are outweighed by the costs of data collection.

It is most straightforward to evaluate whether data collection has value when we know for certain that someone will make a decision, with or without data.

In such situations any additional data will reduce the error rates of a well-calibrated decision process, even if only ever so slightly. For example, without data we will not perform better than a coin flip if we guess which of two conditions has a higher true mean score on a measure. With some data, we can perform better than a coin flip by picking the condition that has the highest mean. With a small amount of data we would still very likely make a mistake, but the error rate is smaller than without any data. In these cases, the value of information might be positive, as long as the reduction in error rates is more beneficial than the cost of data collection.

Another way in which a small dataset can be valuable is if its existence eventually makes it possible to perform a meta-analysis (Maxwell & Kelley, 2011). This argument in favor of collecting a small dataset requires 1) that researchers share the data in a way that a future meta-analyst can find it, and 2) that there is a decent probability that someone will perform a high-quality meta-analysis that will include this data in the future (Halpern, Karlawish, & Berlin, 2002). The uncertainty about whether there will ever be such a meta-analysis should be weighed against the costs of data collection.

One way to increase the probability of a future meta-analysis is if researchers commit to performing this meta-analysis themselves, by combining several studies they have performed into a small-scale meta-analysis (Cumming, 2014). For example, a researcher might plan to repeat a study for the next 12 years in a class they teach, with the expectation that after 12 years a meta-analysis of 12 studies would be sufficient to draw informative inferences (but see ter Schure and Grünwald (2019)). If it is not plausible that a researcher will collect all the required data by themselves, they can attempt to set up a collaboration where fellow researchers in their field commit to collecting similar data with identical measures. If it is not likely that sufficient data will emerge over time to reach the inferential goals, there might be no value in collecting the data.

Even if a researcher believes it is worth collecting data because a future meta-analysis will be performed, they will most likely perform a statistical test on the data. To make sure their expectations about the results of such a test are well-calibrated, it is important to consider which effect sizes are of interest. From the six ways to evaluate which effect sizes are interesting that will be discussed in the second part of this review, it is useful to consider the smallest effect size that can be statistically significant, the expected width of the confidence interval around the effect size, and to perform a sensitivity power analysis. If a decision or claim is made, a compromise power analysis is worthwhile to consider when deciding upon the error rates while planning the study. When reporting a resource constraints sample size justification it is recommended to address the five considerations in Table 3. Addressing these points explicitly facilitates evaluating if the data is worthwhile to collect.

A-priori Power Analysis

When designing a study where the goal is to test whether a statistically significant effect is present, researchers often want to make sure their sample size is large enough to prevent erroneous conclusions for a range of effect sizes they care about. In this approach to justifying a sample size, the value of information is to collect observations up to the point that the probability of an erroneous inference is, in the long run, not larger than a desired value. If a researcher performs a hypothesis test, there are four possible outcomes:

1. A false positive (or Type I error), determined by the α level. A test yields a significant result, even though the null hypothesis is true.
2. A false negative (or Type II error), determined by β, or 1 - power. A test yields a non-significant result, even though the alternative hypothesis is true.
3. A true negative, determined by 1-α. A test yields a non-significant result when the null hypothesis is true.
4. A true positive, determined by 1-β. A test yields a significant result when the alternative hypothesis is true.

Given a specified effect size, alpha level, and power, an a-priori power analysis can be used to calculate the number of observations required to achieve the desired error rates, given the effect size.3 Figure 2 illustrates how the statistical power increases as the number of observations (per group) increases in an independent t test with a two-sided alpha level of 0.05. If we are interested in detecting an effect of d = 0.5 a sample size of 90 per condition would give us more than 90% power. Statistical power can be computed to determine the number of participants, or the number of items (Westfall, Kenny, & Judd, 2014) but can also be performed for single case studies (Ferron & Onghena, 1996; McIntosh & Rittmo, 2020).

3 Power analyses can be performed based on standardized effect sizes or effect sizes expressed on the original scale. It is important to know the standard deviation of the effect (see the ‘Know Your Measure’ section) but I find it slightly more convenient to talk about standardized effects in the context of sample size justifications.
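As a sketch of this kind of calculation (an illustrative addition, not part of the original article), the base R function power.t.test reproduces the d = 0.5 example above: with 90 observations per group and a two-sided alpha level of 0.05, the power of an independent t test exceeds 90%, and the same function can solve for the sample size needed to reach a desired power.

# Power of an independent t test with n = 90 per group, d = 0.5, two-sided alpha = 0.05
power.t.test(n = 90, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")

# Solving for n instead: sample size per group needed for 90% power
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample", alternative = "two.sided")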

Table 3
Overview of recommendations when reporting a sample size justification based on resource constraints.

What to address: How to address it?

Will a future meta-analysis be performed? Consider the plausibility that sufficient highly similar studies will be performed in the future to make a meta-analysis possible.

Will a decision or claim be made regardless of the amount of data that is available? If a decision is made then any data that is collected will reduce error rates. Consider using a compromise power analysis to determine Type I and Type II error rates. Are the costs worth the reduction in errors?

What is the critical effect size? Report and interpret the critical effect size, with a focus on whether a hypothesis test would be significant for expected effect sizes. If not, indicate the interpretation of the data will not be based on p values.

What is the width of the confidence interval? Report and interpret the width of the confidence interval. What will an estimate with this much uncertainty be useful for? If the null hypothesis is true, would rejecting effects outside of the confidence interval be worthwhile (ignoring how a design might have low power to actually test against these values)?

Which effect sizes will a design have decent power to detect? Report a sensitivity power analysis, and report the effect sizes that can be detected across a range of desired power levels (e.g., 80%, 90%, and 95%) or plot a sensitivity analysis.
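To illustrate the sensitivity power analysis recommended in the last row of Table 3, the following sketch (an addition for illustration; the sample size of 50 per group is an arbitrary assumption standing in for a sample size fixed by resource constraints) computes which effect sizes an independent t test can detect with 80%, 90%, and 95% power, and plots power across a range of true effect sizes.

# Sensitivity power analysis: which effects can be detected with a fixed sample size?
n_per_group <- 50   # assumed sample size, e.g., fixed by resource constraints
sapply(c(0.80, 0.90, 0.95), function(pwr) {
  power.t.test(n = n_per_group, power = pwr, sig.level = 0.05,
               type = "two.sample")$delta
})

# Power curve across a range of true effect sizes for the same design
d <- seq(0.1, 1, 0.01)
power <- sapply(d, function(x) {
  power.t.test(n = n_per_group, delta = x, sig.level = 0.05,
               type = "two.sample")$power
})
plot(d, power, type = "l", xlab = "True effect size (Cohen's d)", ylab = "Power")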

Although it is common to set the Type I error rate to 5% and aim for 80% power, error rates should be justified (Lakens, Adolfi, et al., 2018). As explained in the section on compromise power analysis, the default recommendation to aim for 80% power lacks a solid justification. In general, the lower the error rates (and thus the higher the power), the more informative a study will be, but the more resources are required. Researchers should carefully weigh the costs of increasing the sample size against the benefits of lower error rates, which would probably make studies designed to achieve a power of 90% or 95% more common for articles reporting a single study. An additional consideration is whether the researcher plans to publish an article consisting of a set of replication and extension studies, in which case the probability of observing multiple Type I errors will be very low, but the probability of observing mixed results even when there is a true effect increases (Lakens & Etz, 2017), which would also be a reason to aim for studies with low Type II error rates, perhaps even by slightly increasing the alpha level for each individual study.

Figure 2. Power curve for an independent t test with a true effect size of δ = 0.5 and an alpha of 0.05 as a function of the sample size (per group).

Figure 3 visualizes two distributions. The left distribution (dashed line) is centered at 0. This is a model for the null hypothesis. If the null hypothesis is true a statistically significant result will be observed if the effect size is extreme enough (in a two-sided test either in the positive or negative direction), but any significant result would be a Type I error (the dark grey areas under the curve). If there is no true effect, formally statistical power for a null hypothesis significance test is undefined. Any significant effects observed if the null hypothesis is true are Type I errors, or false positives, which occur at the chosen alpha level.

The right distribution (solid line) is centered on an effect of d = 0.5. This is the specified model for the alternative hypothesis in this study, illustrating the expectation of an effect of d = 0.5 if the alternative hypothesis is true. Even though there is a true effect, studies will not always find a statistically significant result. This happens when, due to random variation, the observed effect size is too close to 0 to be statistically significant. Such results are false negatives (the light grey area under the curve on the right). To increase power, we can collect a larger sample size. As the sample size increases, the distributions become more narrow, reducing the probability of a Type II error.4

Figure 3. Null (δ = 0, grey dashed line) and alternative (δ = 0.5, solid black line) hypothesis, with α = 0.05 and n = 80 per group.

4 These figures can be reproduced and adapted in an online shiny app: http://shiny.ieis.tue.nl/d_p_power/.

It is important to highlight that the goal of an a-priori power analysis is not to achieve sufficient power for the true effect size. The true effect size is unknown. The goal of an a-priori power analysis is to achieve sufficient power, given a specific assumption of the effect size a researcher wants to detect. Just like a Type I error rate is the maximum probability of making a Type I error conditional on the assumption that the null hypothesis is true, an a-priori power analysis is computed under the assumption of a specific effect size. It is unknown if this assumption is correct. All a researcher can do is to make sure their assumptions are well justified. Statistical inferences based on a test where the Type II error is controlled are conditional on the assumption of a specific effect size. They allow the inference that, assuming the true effect size is at least as large as that used in the a-priori power analysis, the maximum Type II error rate in a study is not larger than a desired value.

This point is perhaps best illustrated if we consider a study where an a-priori power analysis is performed both for a test of the presence of an effect and for a test of the absence of an effect. When designing a study, it is essential to consider the possibility that there is no effect (e.g., a mean difference of zero). An a-priori power analysis can be performed both for a null hypothesis significance test and for a test of the absence of a meaningful effect, such as an equivalence test that can statistically provide support for the null hypothesis by rejecting the presence of effects that are large enough to matter (see Meyners, 2012; Lakens, 2017; Rogers, Howard, & Vessey, 1993). When multiple primary tests will be performed based on the same sample, each analysis requires a dedicated sample size justification. If possible, a sample size is collected that guarantees that all tests are informative, which means that the collected sample size is based on the largest sample size returned by any of the a-priori power analyses.

For example, if the goal of a study is to detect or reject an effect size of d = 0.4 with 90% power, and the alpha level is set to 0.05 for a two-sided independent t test, a researcher would need to collect 133 participants in each condition for an informative null hypothesis test, and 136 participants in each condition for an informative equivalence test. Therefore, the researcher should aim to collect 272 participants for an informative result for both tests that are planned. This does not guarantee a study has sufficient power for the true effect size (which can never be known), but it guarantees the study has sufficient power given an assumption of the effect a researcher is interested in detecting or rejecting. Therefore, an a-priori power analysis is useful, as long as a researcher can justify the effect sizes they are interested in.

If researchers are willing to reject a single hypothesis if any of multiple tests yield a significant result, then the alpha level in the a-priori power analysis should be corrected for multiple comparisons. For example, if four tests are performed, an overall Type I error rate of 5% is desired, and a Bonferroni correction is used, the a-priori power analysis should be based on a corrected alpha level of .0125.
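The sample size for the null hypothesis test in the example above can be checked with base R (this snippet is an illustrative addition; the equivalence test requires dedicated software, such as the TOSTER package, and is not computed here). The second call shows how a Bonferroni-corrected alpha level enters an a-priori power analysis.

# Sample size per group to detect d = 0.4 with 90% power, two-sided alpha = 0.05
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05, power = 0.90,
             type = "two.sample", alternative = "two.sided")   # about 133 per group

# The same analysis with a Bonferroni-corrected alpha level for four tests
power.t.test(delta = 0.4, sd = 1, sig.level = 0.05 / 4, power = 0.90,
             type = "two.sample", alternative = "two.sided")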

An a-priori power analysis can be performed analytically, or by performing computer simulations. Analytic solutions are faster but less flexible. A common challenge researchers face when attempting to perform power analyses for more complex or uncommon tests is that available software does not offer analytic solutions. In these cases simulations can provide a flexible solution to perform power analyses for any test (Morris, White, & Crowther, 2019). The following code is an example of a power analysis in R based on 10000 simulations for a one-sample t test against zero for a sample size of 20, assuming a true effect of d = 0.5. All simulations consist of first randomly generating data based on assumptions of the data generating mechanism (e.g., a normal distribution with a mean of 0.5 and a standard deviation of 1), followed by a test performed on the data. By computing the percentage of significant results, power can be computed for any design.
# Simulation-based power analysis for a one-sample t test (n = 20, true d = 0.5)
p <- numeric(10000)                       # store the p value of each simulated study
for (i in 1:10000) {
  x <- rnorm(n = 20, mean = 0.5, sd = 1)  # simulate data under the alternative
  p[i] <- t.test(x)$p.value               # one-sample t test against zero
}
sum(p < 0.05) / 10000                     # proportion of significant results = power
There is a wide range of tools available to perform power analyses. Whichever tool a researcher decides to use, it will take time to learn how to use the software correctly to perform a meaningful a-priori power analysis. Resources to educate psychologists about power analysis consist of book-length treatments (Aberson, 2019; Cohen, 1988; Julious, 2004; Murphy, Myors, & Wolach, 2014), general introductions (Baguley, 2004; Brysbaert, 2019; Faul, Erdfelder, Lang, & Buchner, 2007; Maxwell, Kelley, & Rausch, 2008; Perugini, Gallucci, & Costantini, 2018), and an increasing number of applied tutorials for specific tests (for example, DeBruine & Barr, 2019; Brysbaert & Stevens, 2018; Green & MacLeod, 2016; Kruschke, 2013; Lakens & Caldwell, 2019; Schoemann, Boulton, & Short, 2017; Westfall et al., 2014). It is important to be trained in the basics of power analysis, and it can be extremely beneficial to learn how to perform simulation-based power analyses. At the same time, it is often recommended to enlist the help of an expert, especially when a researcher lacks experience with a power analysis for a specific test.

When reporting an a-priori power analysis, make sure that the power analysis is completely reproducible. If power analyses are performed in R it is possible to share the analysis script and information about the version of the package. In many software packages it is possible to export the power analysis that is performed as a PDF file. For example, in G*Power analyses can be exported under the ‘protocol of power analysis’ tab. If the software package provides no way to export the analysis, add a screenshot of the power analysis to the supplementary files.

Figure 4. All details about the power analysis that is performed can be exported in G*Power.

The reproducible report needs to be accompanied by justifications for the choices that were made with respect to the values used in the power analysis. If the effect size used in the power analysis is based on previous research the factors presented in Table 5 (if the effect size is based on a meta-analysis) or Table 6 (if the effect size is based on a single study) should be discussed. If an effect size estimate is based on the existing literature, provide a full citation, and preferably a direct quote from the article where the effect size estimate is reported. If the effect size is based on a smallest effect size of interest, this value should not just be stated, but justified (e.g., based on theoretical predictions or practical implications, see Lakens et al. (2018)). For an overview of all aspects that should be reported when describing an a-priori power analysis, see Table 4.

Planning for Precision

Some researchers have suggested to justify sample sizes based on a desired level of precision of the estimate (Cumming & Calin-Jageman, 2016; Kruschke, 2018; Maxwell et al., 2008). The goal when justifying a sample size based on precision is to collect data to achieve a desired width of the confidence interval around a parameter estimate. The width of the confidence interval around the parameter estimate depends on the standard deviation and the number of observations. The only aspect a researcher needs to justify for a sample size justification based on accuracy is the desired width of the confidence interval with respect to their inferential goal, and their assumption about the population standard deviation of the measure.

If a researcher has determined the desired accuracy, and has a good estimate of the true standard deviation of the measure, it is straightforward to calculate the sample size needed for a desired level of accuracy. For example, when measuring the IQ of a group of individuals a researcher might desire to estimate the IQ score within an error range of 2 IQ points for 95% of the observed means, in the long run.

Table 4
Overview of recommendations when reporting an a-priori power analysis.

What to take into account: How to take it into account?

List all primary analyses that are planned: Specify all planned primary analyses that test hypotheses for which Type I and Type II error rates should be controlled.

Specify the alpha level for each analysis: List and justify the Type I error rate for each analysis. Make sure to correct for multiple comparisons where needed.

What is the desired power? List and justify the desired power (or Type II error rate) for each analysis.

For each power analysis, specify the effect size metric, the effect size, and the justification for powering for this effect size: Report the effect size metric (e.g., Cohen’s d, Cohen’s f), the effect size (e.g., 0.3), and the justification for the effect size, and whether it is based on a smallest effect size of interest, a meta-analytic effect size estimate, the estimate of a single previous study, or some other source.

Consider the possibility that the null hypothesis is true: Perform a power analysis for the test that is planned to examine the absence of a meaningful effect (e.g., power for an equivalence test).

Make sure the power analysis is reproducible: Include the code used to run the power analysis, or print a report containing the details about the power analyses that has been performed.

The required sample size to achieve this desired level of accuracy (assuming normally distributed data) can be computed by:

N = \left( \frac{z \cdot sd}{error} \right)^2

where N is the number of observations, z is the critical value related to the desired confidence interval, sd is the standard deviation of IQ scores in the population, and error is the width of the confidence interval within which the mean should fall, with the desired error rate. In this example, (1.96 × 15 / 2)^2 = 216.1 observations. If a researcher desires 95% of the means to fall within a 2 IQ point range around the true population mean, 217 observations should be collected. If a desired accuracy for a non-zero mean difference is computed, accuracy is based on a non-central t-distribution. For these calculations an expected effect size estimate needs to be provided, but it has relatively little influence on the required sample size (Maxwell et al., 2008). It is also possible to incorporate uncertainty about the observed effect size in the sample size calculation, known as assurance (Kelley & Rausch, 2006). The MBESS package in R provides functions to compute sample sizes for a wide range of tests (Kelley, 2007).
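As a minimal check of this calculation (added for illustration, not part of the original article), the required sample size for the IQ example can be computed directly in base R:

# Sample size to estimate a mean IQ within +/- 2 points for 95% of observed means
z <- qnorm(0.975)    # critical value for a 95% confidence interval
sd_iq <- 15          # population standard deviation of IQ scores, as in the example
error <- 2           # desired half-width of the confidence interval
ceiling((z * sd_iq / error)^2)   # 217 observations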
What is less straightforward is to justify how a desired level of accuracy is related to inferential goals. There is no literature that helps researchers to choose a desired width of the confidence interval. Morey (2020) convincingly argues that most practical use-cases of planning for precision involve an inferential goal of distinguishing an observed effect from other effect sizes (for a Bayesian perspective, see Kruschke (2018)). For example, a researcher might expect an effect size of r = 0.4 and would treat observed correlations that differ more than 0.2 (i.e., 0.2 < r < 0.6) differently, in that effects of r = 0.6 or larger are considered too large to be caused by the assumed underlying mechanism (Hilgard, 2021), while effects smaller than r = 0.2 are considered too small to support the theoretical prediction. If the goal is indeed to get an effect size estimate that is precise enough so that two effects can be differentiated with high probability, the inferential goal is actually a hypothesis test, which requires designing a study with sufficient power to reject effects (e.g., testing a range prediction of correlations between 0.2 and 0.6).

If researchers do not want to test a hypothesis, for example because they prefer an estimation approach over a testing approach, then in the absence of clear guidelines that help researchers to justify a desired level of precision, one solution might be to rely on a generally accepted norm of precision to aim for. Just as researchers normatively use an alpha level of 0.05, they could plan studies to achieve a desired confidence interval width around the observed effect that is determined by a norm. Future work is needed to help researchers choose a confidence interval width when planning for accuracy.

Heuristics

When a researcher uses a heuristic, they are not able to justify their sample size themselves, but they trust in a sample size recommended by some authority. When I started as a PhD student in 2005 it was common to collect 15 participants in each between subject condition.

When asked why this was a common practice, no one was really sure, but people trusted there was a justification somewhere in the literature. Now, I realize there was no justification for the heuristics we used. As Berkeley (1735) already observed: “Men learn the elements of science from others: And every learner hath a deference more or less to authority, especially the young learners, few of that kind caring to dwell long upon principles, but inclining rather to take them upon trust: And things early admitted by repetition become familiar: And this familiarity at length passeth for evidence.”

Some papers provide researchers with simple rules of thumb about the sample size that should be collected. Such papers clearly fill a need, and are cited a lot, even when the advice in these articles is flawed. For example, Wilson VanVoorhis and Morgan (2007) translate an absolute minimum of 50+8 observations for regression analyses suggested by a rule of thumb examined in Green (1991) into the recommendation to collect ~50 observations. Green actually concludes in his article that “In summary, no specific minimum number of subjects or minimum ratio of subjects-to-predictors was supported”. He does discuss how a general rule of thumb of N = 50 + 8 provided an accurate minimum number of observations for the ‘typical’ study in the social sciences because these have a ‘medium’ effect size, as Green claims by citing Cohen (1988). Cohen actually didn’t claim that the typical study in the social sciences has a ‘medium’ effect size, and instead said (1988, p. 13): “Many effects sought in personality, social, and clinical-psychological research are likely to be small effects as here defined”. We see how a string of mis-citations eventually leads to a misleading rule of thumb.

Rules of thumb seem to primarily emerge due to mis-citations and/or overly simplistic recommendations. Simonsohn, Nelson, and Simmons (2011) recommended that “Authors must collect at least 20 observations per cell”. A later recommendation by the same authors presented at a conference suggested to use n > 50, unless you study large effects (Simmons, Nelson, & Simonsohn, 2013). Regrettably, this advice is now often mis-cited as a justification to collect no more than 50 observations per condition without considering the expected effect size. Schönbrodt and Perugini (2013) examined at which sample size correlations stabilize and suggest sample sizes ranging from 20 to 470 depending on expectations about the true effect size, the desired maximum deviation from the true effect, and the desired stability of the estimate. And yet, many researchers simply follow the summary recommendation in the abstract that “Results indicate that in typical scenarios the sample size should approach 250 for stable estimates”.

Viechtbauer et al. (2015) proposed an approach to compute the sample size for pilot studies where one is interested in identifying any issues that exist in a small percentage of participants. In their abstract, they mentioned as an example the sample size of 59 to detect a problem that occurs with a 5% prevalence with 95% confidence. Most researchers directly use this number, instead of computing a sample size based on the assumed prevalence and desired confidence level. If authors justify a specific sample size (e.g., n = 50) based on a general recommendation in another paper, either they are mis-citing the paper, or the paper they are citing is flawed.
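A direct way to do the calculation behind this number (shown here as an illustrative addition) is to find the smallest sample size for which the probability of observing the problem at least once reaches the desired confidence level; this reproduces the 59 participants mentioned above.

# Pilot study sample size: observe a problem with 5% prevalence at least once,
# with 95% confidence
prevalence <- 0.05
confidence <- 0.95
ceiling(log(1 - confidence) / log(1 - prevalence))   # 59 participants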
Another common heuristic is to collect the same number of observations as were collected in a previous study. This strategy is not recommended in scientific disciplines with widespread publication bias, and/or where novel and surprising findings from largely exploratory single studies are published. Using the same sample size as a previous study is only a valid approach if the sample size justification in the previous study also applies to the current study. Instead of stating that you intend to collect the same sample size as an earlier study, repeat the sample size justification, and update it in light of any new information (such as the effect size in the earlier study, see Table 6).

Peer reviewers and editors should be extremely skeptical of rules of thumb sample size justifications. Whenever one encounters a sample size justification based on a heuristic, ask yourself: ‘Why is this heuristic used?’ It is important to know what the logic behind a heuristic is to determine whether the heuristic is valid for a specific situation. In most cases, heuristics are based on weak logic, and not widely applicable. It might be possible that fields develop valid heuristics for sample size justifications. For example, it is possible that a research area reaches widespread agreement that effects smaller than d = 0.3 are too small to be of interest, and all studies in a field use sequential designs (see below) that have 90% power to detect a d = 0.3. Alternatively, it is possible that a field agrees that data should be collected with a desired level of accuracy, irrespective of the true effect size. In these cases, valid heuristics would exist based on generally agreed goals of data collection. For example, Simonsohn (2015) suggests to design replication studies that have 2.5 times as large sample sizes as the original study, as this provides 80% power for an equivalence test against an equivalence bound set to the effect the original study had 33% power to detect, assuming the true effect size is 0.

As original authors typically do not specify which effect size would falsify their hypothesis, the heuristic underlying this ‘small telescopes’ approach is a good starting point for a replication study with the inferential goal to reject the presence of an effect as large as was described in an earlier publication. It is the responsibility of researchers to gain the knowledge to distinguish valid heuristics from mindless heuristics.

No Justification

It might sound like a contradictio in terminis, but it is useful to distinguish a final category where researchers explicitly state they do not have a justification for their sample size. Perhaps the resources were available to collect more data, but they were not used. A researcher could have performed a power analysis, or planned for precision, but they did not. In those cases, instead of pretending there was a justification for the sample size, honesty requires you to state there is no sample size justification. This is not necessarily bad. It is still possible to discuss the smallest effect size of interest, the minimal statistically detectable effect, the width of the confidence interval around the effect size, and to plot a sensitivity power analysis, in relation to the sample size that was collected. If a researcher truly had no specific inferential goals when collecting the data, such an evaluation can perhaps be performed based on reasonable inferential goals peers would have when they learn about the existence of the collected data.

Do not try to spin a story where it looks like a study was highly informative when it was not. Instead, transparently evaluate how informative the study was given effect sizes that were of interest, and make sure that the conclusions follow from the data. The lack of a sample size justification might not be problematic, but it might mean that a study was not informative for most effect sizes of interest, which makes it especially difficult to interpret non-significant effects, or estimates with large uncertainty.

What is Your Inferential Goal?

The inferential goal of data collection is often in some way related to the size of an effect. Therefore, to design an informative study, researchers will want to think about which effect sizes are interesting. First, it is useful to consider three effect sizes when determining the sample size. The first is the smallest effect size a researcher is interested in, the second is the smallest effect size that can be statistically significant (only in studies where a significance test will be performed), and the third is the effect size that is expected. Beyond considering these three effect sizes, it can be useful to evaluate ranges of effect sizes. This can be done by computing the width of the expected confidence interval around an effect size of interest (for example, an effect size of zero), and examine which effects could be rejected. Similarly, it can be useful to plot a sensitivity curve and evaluate the range of effect sizes the design has decent power to detect, as well as to consider the range of effects for which the design has low power. Finally, there are situations where it is useful to consider the range of effects that are likely to be observed in a specific research area.

What is the Smallest Effect Size of Interest?

The strongest possible sample size justification is based on an explicit statement of the smallest effect size that is considered interesting. A smallest effect size of interest can be based on theoretical predictions or practical considerations. For a review of approaches that can be used to determine a smallest effect size of interest in randomized controlled trials, see Cook et al. (2014) and Keefe et al. (2013), for reviews of different methods to determine a smallest effect size of interest, see King (2011) and Copay, Subach, Glassman, Polly, and Schuler (2007), and for a discussion focused on psychological research, see Lakens et al. (2018).

It can be challenging to determine the smallest effect size of interest whenever theories are not very developed, or when the research question is far removed from practical applications, but it is still worth thinking about which effects would be too small to matter. A first step forward is to discuss which effect sizes are considered meaningful in a specific research line with your peers. Researchers will differ in the effect sizes they consider large enough to be worthwhile (Murphy et al., 2014). Just as not every scientist will find every research question interesting enough to study, not every scientist will consider the same effect sizes interesting enough to study, and different stakeholders will differ in which effect sizes are considered meaningful (Kelley & Preacher, 2012).

Even though it might be challenging, there are important benefits of being able to specify a smallest effect size of interest. The population effect size is always uncertain (indeed, estimating this is typically one of the goals of the study), and therefore whenever a study is powered for an expected effect size, there is considerable uncertainty about whether the statistical power is high enough to detect the true effect in the population. However, if the smallest effect size of interest can be specified and agreed upon after careful deliberation, it becomes possible to design a study that has sufficient power (given the inferential goal to detect or reject the smallest effect size of interest with a certain error rate).

A smallest effect of interest may be subjective (one researcher might find effect sizes smaller than d = 0.3 meaningless, while another researcher might still be interested in effects larger than d = 0.1), and there might be uncertainty about the parameters required to specify the smallest effect size of interest (e.g., when performing a cost-benefit analysis), but after a smallest effect size of interest has been determined, a study can be designed with a known Type 2 error rate to detect or reject this value. For this reason an a-priori power analysis based on a smallest effect size of interest is generally preferred, whenever researchers are able to specify one (Aberson, 2019; Albers & Lakens, 2018; Brown, 1983; Cascio & Zedeck, 1983; Dienes, 2014; Lenth, 2001).

The Minimal Statistically Detectable Effect

The minimal statistically detectable effect, or the critical effect size, provides information about the smallest effect size that, if observed, would be statistically significant given a specified alpha level and sample size (Cook et al., 2014). For any critical t value (e.g., t = 1.96 for α = 0.05, for large sample sizes) we can compute a critical mean difference (Phillips et al., 2001), or a critical standardized effect size. For a two-sided independent t test the critical mean difference is:

M_{crit} = t_{crit} \sqrt{\frac{sd_1^2}{n_1} + \frac{sd_2^2}{n_2}}

and the critical standardized mean difference is:

d_{crit} = t_{crit} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}

Figure 5. Critical effect size for an independent t test with n = 15 per group and α = 0.05.

In Figure 5 the distribution of Cohen’s d is plotted for 15 participants per group when the true effect size is either d = 0 or d = 0.5. This figure is similar to Figure 3, with the addition that the critical d is indicated. We see that with such a small number of observations in each group only observed effects larger than d = 0.75 will be statistically significant. Whether such effect sizes are interesting, and can realistically be expected, should be carefully considered and justified.
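Both critical values can be computed directly in R (an illustrative addition, not part of the original manuscript): the critical standardized mean difference for 15 participants per group, and the critical correlation for the total sample size of 30 that is discussed in the next paragraph.

# Critical standardized mean difference for an independent t test, n = 15 per group
n1 <- n2 <- 15
t_crit <- qt(0.975, df = n1 + n2 - 2)         # two-sided alpha = 0.05
d_crit <- t_crit * sqrt(1 / n1 + 1 / n2)
d_crit                                         # approximately 0.75

# Critical correlation for a two-sided test with a total sample size of N = 30
N <- 30
t_crit_r <- qt(0.975, df = N - 2)
r_crit <- t_crit_r / sqrt(t_crit_r^2 + (N - 2))
r_crit                                         # approximately 0.361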
G*Power provides the critical test statistic (such as the critical t value) when performing a power analysis. For example, Figure 6 shows that for a correlation based on a two-sided test, with α = 0.05, and N = 30, only effects larger than r = 0.361 or smaller than r = -0.361 can be statistically significant. This reveals that when the sample size is relatively small, the observed effect needs to be quite substantial to be statistically significant.

Figure 6. The critical correlation of a test based on a total sample size of 30 and α = 0.05 calculated in G*Power.

It is important to realize that due to random variation each study has a probability to yield effects larger than the critical effect size, even if the true effect size is small (or even when the true effect size is 0, in which case each significant effect is a Type I error). Computing a minimal statistically detectable effect is useful for a study where no a-priori power analysis is performed, both for studies in the published literature that do not report a sample size justification (Lakens et al., 2018), and for researchers who rely on heuristics for their sample size justification.

size justification. is present or absent).


It can be informative to ask yourself whether the critical effect size for a study design is within the range of effect sizes that can realistically be expected. If not, then whenever a significant effect is observed in a published study, either the effect size is surprisingly larger than expected, or more likely, it is an upwardly biased effect size estimate. In the latter case, given publication bias, published studies will lead to biased effect size estimates. If it is still possible to increase the sample size, for example by ignoring rules of thumb and instead performing an a-priori power analysis, then do so. If it is not possible to increase the sample size, for example due to resource constraints, then reflecting on the minimal statistically detectable effect should make it clear that an analysis of the data should not focus on p values, but on the effect size and the confidence interval (see Table 3).

It is also useful to compute the minimal statistically detectable effect if an 'optimistic' power analysis is performed. For example, if you believe a best case scenario for the true effect size is d = 0.57 and use this optimistic expectation in an a-priori power analysis, effects smaller than d = 0.4 will not be statistically significant when you collect 50 observations in a two independent group design. If your worst case scenario for the alternative hypothesis is a true effect size of d = 0.35 your design would not allow you to declare a significant effect if effect size estimates close to the worst case scenario are observed. Taking into account the minimal statistically detectable effect size should make you reflect on whether a hypothesis test will yield an informative answer, and whether your current approach to sample size justification (e.g., the use of rules of thumb, or letting resource constraints determine the sample size) leads to an informative study, or not.
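
As an illustration of this 'optimistic' scenario, the sketch below uses base R's power.t.test and assumes a two-sided independent t test with α = 0.05 and the commonly used 80% power (the desired power is not stated above and is an assumption of this sketch).

```r
# Optimistic a-priori power analysis: expecting d = 0.57 with 80% power
# suggests roughly 50 participants per group.
power.t.test(delta = 0.57, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")

# The minimal statistically detectable effect for that design is still
# around d = 0.4, so smaller (but possibly real) effects would not be
# statistically significant.
n <- 50
qt(0.975, df = 2 * n - 2) * sqrt(2 / n)  # approximately 0.40
```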
What is the Expected Effect Size?

Although the true population effect size is always unknown, there are situations where researchers have a reasonable expectation of the effect size in a study, and want to use this expected effect size in an a-priori power analysis. Even if expectations for the observed effect size are largely a guess, it is always useful to explicitly consider which effect sizes are expected. A researcher can justify a sample size based on the effect size they expect, even if such a study would not be very informative with respect to the smallest effect size of interest. In such cases a study is informative for one inferential goal (testing whether the expected effect size is present or absent), but not highly informative for the second goal (testing whether the smallest effect size of interest is present or absent).

There are typically three sources for expectations about the population effect size: a meta-analysis, a previous study, or a theoretical model. It is tempting for researchers to be overly optimistic about the expected effect size in an a-priori power analysis, as higher effect size estimates yield lower sample sizes, but being too optimistic increases the probability of observing a false negative result. When reviewing a sample size justification based on an a-priori power analysis, it is important to critically evaluate the justification for the expected effect size used in power analyses.

Using an Estimate from a Meta-Analysis

In a perfect world effect size estimates from a meta-analysis would provide researchers with the most accurate information about which effect size they could expect. Due to widespread publication bias in science, effect size estimates from meta-analyses are regrettably not always accurate. They can be biased, sometimes substantially so. Furthermore, meta-analyses typically have considerable heterogeneity, which means that the meta-analytic effect size estimate differs for subsets of studies that make up the meta-analysis. So, although it might seem useful to use a meta-analytic effect size estimate of the effect you are studying in your power analysis, you need to take great care before doing so.

If a researcher wants to enter a meta-analytic effect size estimate in an a-priori power analysis, they need to consider three things (see Table 5). First, the studies included in the meta-analysis should be similar enough to the study they are performing that it is reasonable to expect a similar effect size. In essence, this requires evaluating the generalizability of the effect size estimate to the new study. It is important to carefully consider differences between the meta-analyzed studies and the planned study, with respect to the manipulation, the measure, the population, and any other relevant variables.

Second, researchers should check whether the effect sizes reported in the meta-analysis are homogeneous. If not, and there is considerable heterogeneity in the meta-analysis, it means not all included studies can be expected to have the same true effect size estimate. A meta-analytic effect size estimate should be based on the subset of studies that most closely represent the planned study. Note that heterogeneity remains a possibility (even direct replication studies can show heterogeneity when unmeasured variables moderate the effect size in each sample (Kenny & Judd, 2019; Olsson-Collentine, Wicherts, & van Assen, 2020)), so the main goal of selecting similar studies is to use existing data to increase the probability that your expectation is accurate, without guaranteeing it will be.

Third, the meta-analytic effect size estimate should not be biased. Check if the bias detection tests that are reported in the meta-analysis are state-of-the-art, or perform multiple bias detection tests yourself (Carter, Schönbrodt, Gervais, & Hilgard, 2019), and consider bias corrected effect size estimates (even though these estimates might still be biased, and do not necessarily reflect the true population effect size).

Using an Estimate from a Previous Study

If a meta-analysis is not available, researchers often rely on an effect size from a previous study in an a-priori power analysis. The first issue that requires careful attention is whether the two studies are sufficiently similar. Just as when using an effect size estimate from a meta-analysis, researchers should consider if there are differences between the studies in terms of the population, the design, the manipulations, the measures, or other factors that should lead one to expect a different effect size. For example, intra-individual reaction time variability increases with age, and therefore a study performed on older participants should expect a smaller standardized effect size than a study performed on younger participants. If an earlier study used a very strong manipulation, and you plan to use a more subtle manipulation, a smaller effect size should be expected. Finally, effect sizes do not generalize to studies with different designs. For example, the effect size for a comparison between two groups is most often not similar to the effect size for an interaction in a follow-up study where a second factor is added to the original design (Lakens & Caldwell, 2019).

Even if a study is sufficiently similar, statisticians have warned against using effect size estimates from small pilot studies in power analyses. Leon, Davis, and Kraemer (2011) write:

    Contrary to tradition, a pilot study does not provide a meaningful effect size estimate for planning subsequent studies due to the imprecision inherent in data from small samples.

The two main reasons researchers should be careful when using effect sizes from studies in the published literature in power analyses are that effect size estimates from studies can differ from the true population effect size due to random variation, and that publication bias inflates effect sizes. Figure 7 shows the distribution of ηp² for a study with three conditions with 25 participants in each condition when the null hypothesis is true and when there is a 'medium' true effect of ηp² = 0.0588 (Richardson, 2011). As in Figure 5 the critical effect size is indicated, which shows observed effects smaller than ηp² = 0.08 will not be significant with the given sample size. If the null hypothesis is true, effects larger than ηp² = 0.08 will be a Type I error (the dark grey area), and when the alternative hypothesis is true, effects smaller than ηp² = 0.08 will be a Type II error (light grey area). It is clear all significant effects are larger than the true effect size (ηp² = 0.0588), so power analyses based on a significant finding (e.g., because only significant results are published in the literature) will be based on an overestimate of the true effect size, introducing bias.

But even if we had access to all effect sizes (e.g., from pilot studies you have performed yourself) due to random variation the observed effect size will sometimes be quite small. Figure 7 shows it is quite likely to observe an effect of ηp² = 0.01 in a small pilot study, even when the true effect size is 0.0588. Entering an effect size estimate of ηp² = 0.01 in an a-priori power analysis would suggest a total sample size of 957 observations to achieve 80% power in a follow-up study. If researchers only follow up on pilot studies when they observe an effect size in the pilot study that, when entered into a power analysis, yields a sample size that is feasible to collect for the follow-up study, these effect size estimates will be upwardly biased, and power in the follow-up study will be systematically lower than desired (Albers & Lakens, 2018).
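
A rough sketch of how such a sample size follows from a pilot effect size, using the noncentral F distribution in base R. It assumes a fixed-effects one-way ANOVA with three groups, α = 0.05, 80% power, and the convention (used by G*Power) that the noncentrality parameter equals f² times the total sample size; the function names are ours.

```r
# Sketch: required total sample size for a one-way ANOVA (3 groups,
# alpha = .05, 80% power), given a partial eta squared estimate.
anova_power <- function(N, eta2, k = 3, alpha = 0.05) {
  f2 <- eta2 / (1 - eta2)       # convert partial eta squared to Cohen's f^2
  df1 <- k - 1
  df2 <- N - k
  1 - pf(qf(1 - alpha, df1, df2), df1, df2, ncp = f2 * N)
}

# Smallest total N that reaches the desired power.
find_n <- function(eta2, power = 0.80) {
  N <- 10
  while (anova_power(N, eta2) < power) N <- N + 1
  N
}

find_n(0.0588)  # the 'medium' true effect: roughly 150 to 160 in total
find_n(0.01)    # a pilot estimate of 0.01: close to the 957 mentioned above
```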
Figure 7. Distribution of partial eta squared under the null hypothesis (dotted grey curve) and a medium true effect of 0.0588 (solid black curve) for 3 groups with 25 observations.

Table 5
Overview of recommendations when justifying the use of a meta-analytic effect size estimate for a power analysis.

What to take into account: Are the studies in the meta-analysis similar?
How to take it into account: Are the studies in the meta-analysis very similar in design, measures, and the population to the study you are planning? Evaluate the generalizability of the effect size estimate to your study.

What to take into account: Are the studies in the meta-analysis homogeneous?
How to take it into account: Is there heterogeneity in the meta-analysis? If so, use the meta-analytic effect size estimate of the most relevant homogeneous subsample.

What to take into account: Is the effect size estimate unbiased?
How to take it into account: Did the original study report bias detection tests, and was there bias? If so, it might be wise to use a more conservative effect size estimate, based on bias correction techniques, while acknowledging these corrected effect size estimates might not represent the true meta-analytic effect size estimate.

In essence, the problem with using small studies to estimate the effect size that will be entered into an a-priori power analysis is that due to publication bias or follow-up bias the effect sizes researchers end up using for their power analysis do not come from a full F distribution, but from what is known as a truncated F distribution (Taylor & Muller, 1996). For example, imagine there would be extreme publication bias in the situation illustrated in Figure 7. The only studies that would be accessible to researchers would come from the part of the distribution where ηp² > 0.08, and the test result would be statistically significant. It is possible to compute an effect size estimate that, based on certain assumptions, corrects for bias. For example, imagine we observe a result in the literature for a One-Way ANOVA with 3 conditions, reported as F(2, 42) = 4.5, p = 0.017, ηp² = 0.176. If we would take this effect size at face value and enter it as our effect size estimate in an a-priori power analysis, the suggested sample size would be 17 observations in each condition.

However, if we assume bias is present, we can use the BUCSS R package (Anderson, Kelley, & Maxwell, 2017) to perform a power analysis that attempts to correct for bias. A power analysis that takes bias into account (under a specific model of publication bias, based on a truncated F distribution where only significant results are published) suggests collecting 73 participants in each condition. It is possible that the bias corrected estimate of the non-centrality parameter used to compute power is zero, in which case it is not possible to correct for bias using this method. As an alternative to formally modeling a correction for publication bias whenever researchers assume an effect size estimate is biased, researchers can simply use a more conservative effect size estimate, for example by computing power based on the lower limit of a 60% two-sided confidence interval around the effect size estimate, which Perugini, Gallucci, and Costantini (2014) refer to as safeguard power. Both these approaches lead to a more conservative power analysis, but not necessarily a more accurate power analysis. It is simply not possible to perform an accurate power analysis on the basis of an effect size estimate from a study that might be biased and/or had a small sample size (Teare et al., 2014). If it is not possible to specify a smallest effect size of interest, and there is great uncertainty about which effect size to expect, it might be more efficient to perform a study with a sequential design (discussed below).
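
One way to implement this more conservative approach is sketched below. The observed effect size and group sizes are purely illustrative, and the standard error of d is computed with a common large-sample approximation; this is a sketch of the safeguard idea, not the exact procedure described by Perugini, Gallucci, and Costantini (2014).

```r
# Sketch of 'safeguard power': power the new study on the lower limit of a
# 60% two-sided CI around the effect size from the earlier study, rather
# than on the point estimate. The observed d and group sizes are made up.
d_obs <- 0.55; n1 <- 20; n2 <- 20

# Large-sample approximation of the standard error of Cohen's d.
se_d <- sqrt((n1 + n2) / (n1 * n2) + d_obs^2 / (2 * (n1 + n2)))

# Lower limit of the 60% two-sided confidence interval.
d_safeguard <- d_obs - qnorm(0.80) * se_d

# Sample size per group suggested by the point estimate vs. the safeguard
# estimate (two-sided independent t test, 80% power).
power.t.test(delta = d_obs, power = 0.80)$n
power.t.test(delta = d_safeguard, power = 0.80)$n
```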
To summarize, an effect size from a previous study in an a-priori power analysis can be used if three conditions are met (see Table 6). First, the previous study is sufficiently similar to the planned study. Second, there was a low risk of bias (e.g., the effect size estimate comes from a Registered Report, or from an analysis for which results would not have impacted the likelihood of publication). Third, the sample size is large enough to yield a relatively accurate effect size estimate, based on the width of a 95% CI around the observed effect size estimate. There is always uncertainty around the effect size estimate, and entering the upper and lower limit of the 95% CI around the effect size estimate might be informative about the consequences of the uncertainty in the effect size estimate for an a-priori power analysis.

Using an Estimate from a Theoretical Model

When your theoretical model is sufficiently specific such that you can build a computational model, and you have knowledge about key parameters in your model that are relevant for the data you plan to collect, it is possible to estimate an effect size based on the effect size estimate derived from a computational model. For example, if one had strong ideas about the weights for each feature stimuli share and differ on, it could be possible to compute predicted similarity judgments for pairs of stimuli based on Tversky's contrast model

Table 6
Overview of recommendations when justifying the use of an effect size estimate from a single study.

What to take into account: Is the study sufficiently similar?
How to take it into account: Consider if there are differences between the studies in terms of the population, the design, the manipulations, the measures, or other factors that should lead one to expect a different effect size.

What to take into account: Is there a risk of bias?
How to take it into account: Evaluate the possibility that if the effect size estimate had been smaller, you would not have used it (or it would not have been published). Examine the difference when entering the reported, and a bias corrected, effect size estimate in a power analysis.

What to take into account: How large is the uncertainty?
How to take it into account: Studies with a small number of observations have large uncertainty. Consider the possibility of using a more conservative effect size estimate to reduce the possibility of an underpowered study for the true effect size (such as a safeguard power analysis).

(Tversky, 1977), and estimate the predicted effect size for differences between experimental conditions. Although computational models that make point predictions are relatively rare, whenever they are available, they provide a strong justification of the effect size a researcher expects.

Compute the Width of the Confidence Interval around the Effect Size

If a researcher can estimate the standard deviation of the observations that will be collected, it is possible to compute an a-priori estimate of the width of the 95% confidence interval around an effect size (Kelley, 2007). Confidence intervals represent a range around an estimate that is wide enough so that in the long run the true population parameter will fall inside the confidence intervals 100 - α percent of the time. In any single study the true population effect either falls in the confidence interval, or it doesn't, but in the long run one can act as if the confidence interval includes the true population effect size (while keeping the error rate in mind). Cumming (2013) calls the difference between the observed effect size and the upper 95% confidence interval (or the lower 95% confidence interval) the margin of error.

If we compute the 95% CI for an effect size of d = 0 based on the t statistic and sample size (Smithson, 2003), we see that with 15 observations in each condition of an independent t test the 95% CI ranges from d = -0.72 to d = 0.72.⁵ The margin of error is half the width of the 95% CI, 0.72. A Bayesian estimator who uses an uninformative prior would compute a credible interval with the same (or a very similar) upper and lower bound (Albers et al., 2018; Kruschke, 2011), and might conclude that after collecting the data they would be left with a range of plausible values for the population effect that is too large to be informative. Regardless of the statistical philosophy you plan to rely on when analyzing the data, the evaluation of what we can conclude based on the width of our interval tells us that with 15 observations per group we will not learn a lot.

One useful way of interpreting the width of the confidence interval is based on the effects you would be able to reject if the true effect size is 0. In other words, if there is no effect, which effects would you have been able to reject given the collected data, and which effect sizes would not be rejected, if there was no effect? Effect sizes in the range of d = 0.7 are findings such as "People become aggressive when they are provoked", "People prefer their own group to other groups", and "Romantic partners resemble one another in physical attractiveness" (Richard, Bond, & Stokes-Zoota, 2003). The width of the confidence interval tells you that you can only reject the presence of effects that are so large that, if they existed, you would probably already have noticed them. If it is true that most effects that you study are realistically much smaller than d = 0.7, there is a good possibility that we do not learn anything we didn't already know by performing a study with n = 15. Even without data, in most research lines we would not consider certain large effects plausible (although the effect sizes that are plausible differ between fields, as discussed below). On the other hand, in large samples where researchers can for example reject the presence of effects larger than d = 0.2, if the null hypothesis was true, this analysis of the width of the confidence interval would suggest that peers in many research lines would likely consider the study to be informative.

⁵ Confidence intervals around effect sizes can be computed using the MOTE Shiny app: https://www.aggieerin.com/shiny-server/

We see that the margin of error is almost, but not exactly, the same as the minimal statistically detectable effect (d = 0.75). The small variation is due to the fact that the 95% confidence interval is calculated based on the t distribution. If the true effect size is not zero, the confidence interval is calculated based on the non-central t distribution, and the 95% CI is asymmetric. Figure 8 visualizes three t distributions, one symmetric at 0, and two asymmetric distributions with a noncentrality parameter (the normalized difference between the means) of 2 and 3. The asymmetry is most clearly visible in very small samples (the distributions in the plot have 5 degrees of freedom) but remains noticeable in larger samples when calculating confidence intervals and statistical power. For example, a true effect size of d = 0.5 observed with 15 observations per group would yield ds = 0.50, 95% CI [-0.23, 1.22]. If we compute the 95% CI around the critical effect size, we would get ds = 0.75, 95% CI [0.00, 1.48]. We see the 95% CI ranges from exactly 0.00 to 1.48, in line with the relation between a confidence interval and a p value, where the 95% CI excludes zero if the test is statistically significant. As noted before, the different approaches recommended here to evaluate how informative a study is are often based on the same information and should be seen as different ways to approach the same question.
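
The intervals mentioned here can be reproduced with a short sketch based on the noncentral t distribution (in the spirit of Smithson, 2003); the function below is ours and assumes an independent t test with equal group sizes.

```r
# Sketch: 95% CI around Cohen's d for an independent t test, based on the
# noncentral t distribution.
ci_d <- function(d, n1, n2, level = 0.95) {
  conv <- sqrt(1 / n1 + 1 / n2)   # converts d to the implied t value
  t_obs <- d / conv
  df <- n1 + n2 - 2
  a <- 1 - level
  # Noncentrality parameters that place the observed t at the tail
  # probabilities of the noncentral t distribution.
  ncp_lo <- uniroot(function(ncp) pt(t_obs, df, ncp) - (1 - a / 2),
                    interval = c(-10, 10))$root
  ncp_hi <- uniroot(function(ncp) pt(t_obs, df, ncp) - a / 2,
                    interval = c(-10, 10))$root
  c(lower = ncp_lo * conv, upper = ncp_hi * conv)
}

ci_d(0.00, 15, 15)  # roughly [-0.72, 0.72]
ci_d(0.50, 15, 15)  # roughly [-0.23, 1.22]
ci_d(0.75, 15, 15)  # roughly [ 0.00, 1.48]
```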

Figure 8. Central (black) and 2 non-central (dark grey and light grey) t distributions.

Plot a Sensitivity Power Analysis

A sensitivity power analysis fixes the sample size, desired power, and alpha level, and answers the question which effect size a study could detect with a desired power. A sensitivity power analysis is therefore performed when the sample size is already known. Sometimes data has already been collected to answer a different research question, or the data is retrieved from an existing database, and you want to perform a sensitivity power analysis for a new statistical analysis. Other times, you might not have carefully considered the sample size when you initially collected the data, and want to reflect on the statistical power of the study for (ranges of) effect sizes of interest when analyzing the results. Finally, it is possible that the sample size will be collected in the future, but you know that due to resource constraints the maximum sample size you can collect is limited, and you want to reflect on whether the study has sufficient power for effects that you consider plausible and interesting (such as the smallest effect size of interest, or the effect size that is expected).

Assume a researcher plans to perform a study where 30 observations will be collected in total, 15 in each between participant condition. Figure 9 shows how to perform a sensitivity power analysis in G*Power for a study where we have decided to use an alpha level of 5%, and desire 90% power. The sensitivity power analysis reveals the designed study has 90% power to detect effects of at least d = 1.23. Perhaps a researcher believes that a desired power of 90% is quite high, and is of the opinion that it would still be interesting to perform a study if the statistical power was lower. It can then be useful to plot a sensitivity curve across a range of smaller effect sizes.

Figure 9. Sensitivity power analysis in G*Power software.

The two dimensions of interest in a sensitivity power analysis are the effect sizes, and the power to observe a significant effect assuming a specific effect size. These two dimensions can be plotted against each other to create a sensitivity curve. For example, a sensitivity curve can be plotted in G*Power by clicking the 'X-Y plot for a range of values' button, as illustrated in Figure 10. Researchers can examine which power they would have for an a-priori plausible range of effect sizes, or they can examine which effect sizes would provide reasonable levels of power. In simulation-based approaches to power analysis, sensitivity curves can be created by performing the power analysis for a range of possible effect sizes. Even if 50% power is deemed acceptable (in which case deciding to act as if the null hypothesis is true after a non-significant result is a relatively noisy decision procedure), Figure 10 shows a study design where power is extremely low for a large range of effect sizes that are reasonable to expect in most fields. Thus, a sensitivity power analysis provides an additional approach to evaluate how informative the planned study is, and can inform researchers that a specific design is unlikely to yield a significant effect for a range of effects that one might realistically expect.
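
A sensitivity power analysis of this kind can also be sketched in base R; the example below assumes a two-sided independent t test with n = 15 per group and α = 0.05, and approximately reproduces the d of 1.23 reported above before plotting a simple sensitivity curve.

```r
# Sketch of a sensitivity power analysis for an independent t test with
# n = 15 per group and alpha = .05.

# Which effect size can be detected with 90% power? (cf. Figure 9)
power.t.test(n = 15, power = 0.90, sig.level = 0.05)$delta  # about d = 1.23

# A simple sensitivity curve: power across a range of effect sizes
# (cf. Figure 10).
d_range <- seq(0.1, 1.2, by = 0.1)
power_curve <- sapply(d_range, function(d)
  power.t.test(n = 15, delta = d, sig.level = 0.05)$power)
plot(d_range, power_curve, type = "b",
     xlab = "True effect size (Cohen's d)", ylab = "Power")
```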
Figure 10. Plot of the effect size against the desired power when n = 15 per group and alpha = 0.05.

If the number of observations per group had been larger, the evaluation might have been more positive. We might not have had any specific effect size in mind, but if we had collected 150 observations per group, a sensitivity analysis could have shown that power was sufficient for a range of effects we believe is most interesting to examine, and we would still have approximately 50% power for quite small effects. For a sensitivity analysis to be meaningful, the sensitivity curve should be compared against a smallest effect size of interest, or a range of effect sizes that are expected. A sensitivity power analysis has no clear cut-offs to examine (Bacchetti, 2010). Instead, the idea is to make a holistic trade-off between different effect sizes one might observe or care about, and their associated statistical power.

The Distribution of Effect Sizes in a Research Area

In my personal experience the most commonly entered effect size estimate in an a-priori power analysis for an independent t test is Cohen's benchmark for a 'medium' effect size, because of what is known as the default effect. When you open G*Power, a 'medium' effect is the default option for an a-priori power analysis. Cohen's benchmarks for small, medium, and large effects should not be used in an a-priori power analysis (Cook et al., 2014; Correll, Mellinger, McClelland, & Judd, 2020), and Cohen regretted having proposed these benchmarks (Funder & Ozer, 2019). The large variety in research topics means that any 'default' or 'heuristic' that is used to compute statistical power is not just unlikely to correspond to your actual situation, but it is also likely to lead to a sample size that is substantially misaligned with the question you are trying to answer with the collected data.

Some researchers have wondered what a better default would be, if researchers have no other basis to decide upon an effect size for an a-priori power analysis. Brysbaert (2019) recommends d = 0.4 as a default in psychology, which is the average observed in replication projects and several meta-analyses. It is impossible to know if this average effect size is realistic, but it is clear there is huge heterogeneity across fields and research questions. Any average effect size will often deviate substantially from the effect size that should be expected in a planned study. Some researchers have suggested changing Cohen's benchmarks based on the distribution of effect sizes in a specific field (Bosco, Aguinis, Singh, Field, & Pierce, 2015; Funder & Ozer, 2019; Hill, Bloom, Black, & Lipsey, 2008; Kraft, 2020; Lovakov & Agadullina, 2017). As always, when effect size estimates are based on the published literature, one needs to evaluate the possibility that the effect size estimates are inflated due to publication bias. Due to the large variation in effect sizes within a specific research area, there is little use in choosing a large, medium, or small effect size benchmark based on the empirical distribution of effect sizes in a field to perform a power analysis.

Having some knowledge about the distribution of effect sizes in the literature can be useful when interpreting the confidence interval around an effect size. If in a specific research area almost no effects are larger than the value you could reject in an equivalence test (e.g., if the observed effect size is 0, the design would only reject effects larger than for example d = 0.7), then it is a-priori unlikely that collecting the data would tell you something you didn't already know.

It is more difficult to defend the use of a specific effect size derived from an empirical distribution of effect sizes as a justification for the effect size used in an a-priori power analysis. One might argue that the use of an effect size benchmark based on the distribution of effects in the literature will outperform a wild guess, but this is not a strong enough argument to form the basis of a sample size justification. There is a point where researchers need to admit they are not ready to perform an a-priori power analysis due to a lack of clear expectations (Scheel, Tiokhin, Isager, & Lakens, 2020). Alternative sample size justifications, such as a justification of the sample size based on resource constraints, perhaps in combination with a sequential study design, might be more in line with the actual inferential goals of a study.

Additional Considerations When Designing an Informative Study

So far, the focus has been on justifying the sample size for quantitative studies. There are a number of related topics that can be useful to design an informative study. First, in addition to a-priori power analysis and sensitivity power analysis, it is important to discuss compromise power analysis (which is useful) and post-hoc power analysis (which is not useful). When sample sizes are justified based on an a-priori power analysis it can be very efficient to collect data in sequential designs where data collection is continued or terminated based on interim analyses of the data. Furthermore, it is worthwhile to consider ways to increase the power of a test without increasing the sample size. An additional point of attention is to have a good understanding of your dependent variable, especially its standard deviation. Finally, sample size justification is just as important in qualitative studies, and although there has been much less work on sample size justification in this domain, some proposals exist that researchers can use to design an informative study. Each of these topics is discussed in turn.

Compromise Power Analysis

In a compromise power analysis the sample size and the effect are fixed, and the error rates of the test are calculated, based on a desired ratio between the Type I and Type II error rate. A compromise power analysis is useful both when a very large number of observations will be collected, and when only a small number of observations can be collected.

In the first situation a researcher might be fortunate enough to be able to collect so many observations that the statistical power for a test is very high for all effect sizes that are deemed interesting. For example, imagine a researcher has access to 2000 employees who are all required to answer questions during a yearly evaluation in a company where they are testing an intervention that should reduce subjectively reported stress levels. You are quite confident that an effect smaller than d = 0.2 is not large enough to be subjectively noticeable for individuals (Jaeschke, Singer, & Guyatt, 1989). With an alpha level of 0.05 the researcher would have a statistical power of 0.994, or a Type II error rate of 0.006. This means that for a smallest effect size of interest of d = 0.2 the researcher is 8.30 times more likely to make a Type I error than a Type II error.

Although the original idea of designing studies that control Type I and Type II error rates was that researchers would need to justify their error rates (Neyman & Pearson, 1933), a common heuristic is to set the Type I error rate to 0.05 and the Type II error rate to 0.20, meaning that a Type I error is four times less likely than a Type II error. The default use of 80% power (or a 20% Type II or β error) is based on a personal preference of Cohen (1988), who writes:

    It is proposed here as a convention that, when the investigator has no other basis for setting the desired power value, the value .80 be used. This means that β is set at .20. This arbitrary but reasonable value is offered for several reasons (Cohen, 1965, pp. 98-99). The chief among them takes into consideration the implicit convention for α of .05. The β of .20 is chosen with the idea that the general relative seriousness of these two kinds of errors is of the order of .20/.05, i.e., that Type I errors are of the order of four times as serious as Type II errors. This .80 desired power convention is offered with the hope that it will be ignored whenever an investigator can find a basis in his substantive concerns in his specific research investigation to choose a value ad hoc.

We see that conventions are built on conventions: the norm to aim for 80% power is built on the norm to set the alpha level at 5%. What we should take away from Cohen is not that we should aim for 80% power, but that we should justify our error rates based on the relative seriousness of each error. This is where compromise power analysis comes in.

If you share Cohen's belief that a Type I error is 4 times as serious as a Type II error, and building on our earlier study on 2000 employees, it makes sense to adjust the Type I error rate when the Type II error rate is low for all effect sizes of interest (Cascio & Zedeck, 1983). Indeed, Erdfelder, Faul, and Buchner (1996) created the G*Power software in part to give researchers a tool to perform compromise power analysis.

Figure 11. Compromise power analysis in G*Power.

Figure 11 illustrates how a compromise power analysis is performed in G*Power when a Type I error is deemed to be equally costly as a Type II error, which for a study with 1000 observations per condition would lead to a Type I error and a Type II error of 0.0179. As Faul, Erdfelder, Lang, and Buchner (2007) write:

    Of course, compromise power analyses can easily result in unconventional significance levels greater than α = .05 (in the case of small samples or effect sizes) or less than α = .001 (in the case of large samples or effect sizes). However, we believe that the benefit of balanced Type I and Type II error risks often offsets the costs of violating significance level conventions.

This brings us to the second situation where a compromise power analysis can be useful, which is when we know the statistical power in our study is low. Although it is highly undesirable to make decisions when error rates are high, if one finds oneself in a situation where a decision must be made based on little information, Winer (1962) writes:

    The frequent use of the .05 and .01 levels of significance is a matter of convention having little scientific or logical basis. When the power of tests is likely to be low under these levels of significance, and when Type I and Type II errors are of approximately equal importance, the .30 and .20 levels of significance may be more appropriate than the .05 and .01 levels.

For example, if we plan to perform a two-sided t test, can feasibly collect at most 50 observations in each independent group, and expect a population effect size of d = 0.5, we would have 70% power if we set our alpha level to 0.05. We can choose to weigh both types of error equally, and set the alpha level to 0.149, to end up with a statistical power for an effect of d = 0.5 of 0.851 (given a 0.149 Type II error rate). The choice of α and β in a compromise power analysis can be extended to take prior probabilities of the null and alternative hypothesis into account (Miller & Ulrich, 2019; Murphy et al., 2014).
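
The alpha level in this example can be found numerically. The sketch below assumes a two-sided independent t test with 50 observations per group, an expected effect of d = 0.5, and equally weighted errors, and approximately reproduces the values reported above.

```r
# Sketch: compromise power analysis with a desired beta / alpha ratio of 1.
beta_given_alpha <- function(alpha, n = 50, d = 0.5) {
  1 - power.t.test(n = n, delta = d, sig.level = alpha)$power
}

# Find the alpha level at which alpha equals beta.
alpha_star <- uniroot(function(a) beta_given_alpha(a) - a,
                      interval = c(0.001, 0.5))$root
alpha_star                        # roughly 0.15, as in the example above
1 - beta_given_alpha(alpha_star)  # corresponding power, roughly 0.85
```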
A compromise power analysis requires a researcher to specify the sample size. This sample size itself requires a justification, so a compromise power analysis will typically be performed together with a resource constraint justification for a sample size. It is especially important to perform a compromise power analysis if your resource constraint justification is strongly based on the need to make a decision, in which case a researcher should think carefully about the Type I and Type II error rates stakeholders are willing to accept. However, a compromise power analysis also makes sense if the sample size is very large, but a researcher did not have the freedom to set the sample size. This might happen if, for example, data collection is part of a larger international study and the sample size is based on other research questions. In designs where the Type II error rate is very small (and power is very high) some statisticians have also recommended to lower the alpha level to prevent Lindley's paradox, a situation where a significant effect (p < α) is evidence for the null hypothesis (Good, 1992; Jeffreys, 1939). Lowering the alpha level as a function of the statistical power of the test can prevent this paradox, providing another argument for a compromise power analysis when sample sizes are large. Finally, a compromise power analysis needs a justification for the effect size, either based on a smallest effect size of interest or an effect size that is expected. Table 7 lists three aspects that should be discussed alongside a reported compromise power analysis.

What to do if Your Editor Asks for Post-hoc Power?

Post-hoc (or observed) power is the statistical power of the test, assuming the effect size estimated from the data is the true effect size. Post-hoc power is therefore not performed before looking at the data based on effect sizes that are deemed interesting, as an a-priori power analysis is, and it is unlike a sensitivity power analysis where a range of effect sizes is evaluated. Because post-hoc power analysis is based on the observed effect size, it does not add any information beyond the reported p value, but it presents the same information in a different way. Despite this fact, editors and reviewers often ask authors to perform post-hoc power analysis to interpret non-significant results. This is not a sensible request, and whenever it is made, you should not comply with it. Instead, you should perform a sensitivity power analysis, and discuss the power for the smallest effect size of interest and a realistic range of expected effect sizes.

Post-hoc power is directly related to the p value of the statistical test (Hoenig & Heisey, 2001). For a z test where the p value is exactly 0.05, post-hoc power is always 50%. The reason for this relationship is that when a p value is observed that equals the alpha level of the test (e.g., 0.05), the observed z score of the test is exactly equal to the critical value of the test (e.g., z = 1.96 in a two-sided test with a 5% alpha level). Whenever the alternative hypothesis is centered on the critical value, half the values we expect to observe if this alternative hypothesis is true fall below the critical value, and half fall above the critical value. Therefore, a test where we observed a p value identical to the alpha level will have exactly 50% power in a post-hoc power analysis, as the analysis assumes the observed effect size is true.

For other statistical tests, where the alternative distribution is not symmetric (such as for the t test, where the alternative hypothesis follows a non-central t distribution, see Figure 8), a p = 0.05 does not directly translate to an observed power of 50%, but by plotting post-hoc power against the observed p value we see that the two statistics are always directly related. As Figure 12 shows, if the p value is non-significant (i.e., larger than 0.05) the observed power will be less than approximately 50% in a t test. Lenth (2007) explains how observed power is also completely determined by the observed p value for F tests, although the statement that a non-significant p value implies a power less than 50% no longer holds.

Figure 12. Relationship between p values and power for an independent t test with α = 0.05 and n = 10.
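
The relationship shown in Figure 12 can be sketched as follows; the function is ours and assumes a two-sided independent t test with n = 10 per group and α = 0.05.

```r
# Sketch: post-hoc ('observed') power as a function of the p value.
observed_power <- function(p, n = 10, alpha = 0.05) {
  df <- 2 * n - 2
  t_obs <- qt(1 - p / 2, df)       # t value implied by the observed p value
  t_crit <- qt(1 - alpha / 2, df)
  # Power assuming the observed effect (noncentrality = observed t) is true.
  1 - pt(t_crit, df, ncp = t_obs) + pt(-t_crit, df, ncp = t_obs)
}

p_values <- seq(0.001, 0.999, by = 0.001)
plot(p_values, observed_power(p_values), type = "l",
     xlab = "p-value", ylab = "Observed power")
observed_power(0.05)  # close to, but not exactly, 50% for a t test
```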
When editors or reviewers ask researchers to report post-hoc power analyses they would like to be able to distinguish between true negatives (concluding there is no effect, when there is no effect) and false negatives (a Type II error, concluding there is no effect, when there actually is an effect). Since reporting post-hoc power is just a different way of reporting the p value, reporting the post-hoc power will not provide an answer to the question editors are asking (Hoenig & Heisey, 2001; Lenth, 2007; Schulz & Grimes, 2005; Yuan & Maxwell, 2005). To be able to draw conclusions about the absence of a meaningful effect, one should perform an equivalence test, and design a study with high power to reject the smallest effect size of interest (Lakens et al., 2018). Alternatively, if no smallest effect size of interest was specified when designing the study, researchers can report a sensitivity power analysis.

Sequential Analyses

Whenever the sample size is justified based on an a-priori power analysis it can be very efficient to collect data in a sequential design. Sequential designs control error rates across multiple looks at the data (e.g., after 50, 100, and 150 observations have been collected) and can reduce the average expected sample size that is collected compared to a fixed design where data is only analyzed after the maximum sample size is collected (Proschan, Lan, & Wittes, 2006; Wassmer & Brannath, 2016). Sequential designs have a long history (Dodge & Romig, 1929), and exist in many variations, such as the Sequential Probability Ratio Test (Wald, 1945), combining independent statistical tests (Westberg, 1985), group sequential designs (Jennison & Turnbull, 2000), sequential Bayes factors (Schönbrodt, Wagenmakers, Zehetleitner, & Perugini, 2017), and safe testing (Grünwald, de Heide, & Koolen, 2019).

Table 7
Overview of recommendations when justifying error rates based on a compromise power analysis.

What to take into account: What is the justification for the sample size?
How to take it into account: Specify why a specific sample size is collected (e.g., based on resource constraints or other factors that determined the sample size).

What to take into account: What is the justification for the effect size?
How to take it into account: Is the effect size based on a smallest effect size of interest or an expected effect size?

What to take into account: What is the desired ratio of Type I vs Type II error rates?
How to take it into account: Weigh the relative costs of a Type I and a Type II error by carefully evaluating the consequences of each type of error.

Of these approaches, the Sequential Probability Ratio Test is most efficient if data can be analyzed after every observation (Schnuerch & Erdfelder, 2020). Group sequential designs, where data is collected in batches, provide more flexibility in data collection, error control, and corrections for effect size estimates (Wassmer & Brannath, 2016). Safe tests provide optimal flexibility if there are dependencies between observations (ter Schure & Grünwald, 2019).

Sequential designs are especially useful when there is considerable uncertainty about the effect size, or when it is plausible that the true effect size is larger than the smallest effect size of interest the study is designed to detect (Lakens, 2014). In such situations data collection has the possibility to terminate early if the effect size is larger than the smallest effect size of interest, but data collection can continue to the maximum sample size if needed. Sequential designs can prevent waste when testing hypotheses, both by stopping early when the null hypothesis can be rejected, and by stopping early if the presence of a smallest effect size of interest can be rejected (i.e., stopping for futility). Group sequential designs are currently the most widely used approach to sequential analyses, and can be planned and analyzed using rpact (Wassmer & Pahlke, 2019) or gsDesign (Anderson, 2014).⁶

⁶ Shiny apps are available for both rpact: https://rpact.shinyapps.io/public/ and gsDesign: https://gsdesign.shinyapps.io/prod/

Increasing Power Without Increasing the Sample Size

The most straightforward approach to increase the informational value of studies is to increase the sample size. Because resources are often limited, it is also worthwhile to explore different approaches to increasing the power of a test without increasing the sample size. The first option is to use directional tests where relevant. Researchers often make directional predictions, such as 'we predict X is larger than Y'. The statistical test that logically follows from this prediction is a directional (or one-sided) t test. A directional test moves the Type I error rate to one side of the tail of the distribution, which lowers the critical value, and therefore requires fewer observations to achieve the same statistical power.
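
As a simple illustration of this difference, the sketch below compares the per-group sample sizes of a two-sided and a one-sided independent t test for the same (assumed) effect size and power.

```r
# Sketch: a directional test needs fewer observations than a two-sided test
# for the same effect size (here d = 0.5) and power (80%).
power.t.test(delta = 0.5, power = 0.80, alternative = "two.sided")$n
power.t.test(delta = 0.5, power = 0.80, alternative = "one.sided")$n
# Roughly 64 versus 51 observations per group.
```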
of the prediction. However, there might be situations
Sequential designs are especially useful when there is where you do not want to ask a directional question.
considerable uncertainty about the effect size, or when Sometimes, especially in research with applied con-
it is plausible that the true effect size is larger than the sequences, it might be important to examine if a null
smallest effect size of interest the study is designed to effect can be rejected, even if the effect is in the oppo-
detect (Lakens, 2014). In such situations data collection site direction as predicted. For example, when you are
has the possibility to terminate early if the effect size is evaluating a recently introduced educational interven-
larger than the smallest effect size of interest, but data tion, and you predict the intervention will increase the
collection can continue to the maximum sample size performance of students, you might want to explore
if needed. Sequential designs can prevent waste when the possibility that students perform worse, to be able
testing hypotheses, both by stopping early when the to recommend abandoning the new intervention. In
null hypothesis can be rejected, as by stopping early if such cases it is also possible to distribute the error rate
the presence of a smallest effect size of interest can be in a ‘lop-sided’ manner, for example assigning a stricter
rejected (i.e., stopping for futility). Group sequential error rate to effects in the negative than in the positive
designs are currently the most widely used approach direction (Rice & Gaines, 1994).
to sequential analyses, and can be planned and ana-
lyzed using rpact (Wassmer & Pahlke, 2019) or gsDesign Another approach to increase the power without in-
(Anderson, 2014).6 creasing the sample size, is to increase the alpha level
of the test, as explained in the section on compromise
Increasing Power Without Increasing the Sample power analysis. Obviously, this comes at an increased
Size probability of making a Type I error. The risk of mak-
ing either type of error should be carefully weighed,
The most straightforward approach to increase the in- which typically requires taking into account the prior
formational value of studies is to increase the sample
size. Because resources are often limited, it is also 6
Shiny apps are available for both rpact: https://rpact.sh
worthwhile to explore different approaches to increas- inyapps.io/public/ and gsDesign: https://gsdesign.shinyap
ing the power of a test without increasing the sample ps.io/prod/

The risk of making either type of error should be carefully weighed, which typically requires taking into account the prior probability that the null hypothesis is true (Cascio & Zedeck, 1983; Miller & Ulrich, 2019; Mudge, Baker, Edge, & Houlahan, 2012; Murphy et al., 2014). If you have to make a decision, or want to make a claim, and the data you can feasibly collect is limited, increasing the alpha level is justified, either based on a compromise power analysis, or based on a cost-benefit analysis (Baguley, 2004; Field, Tyre, Jonzén, Rhodes, & Possingham, 2004).

Another widely recommended approach to increase the power of a study is to use a within participant design where possible. In almost all cases where a researcher is interested in detecting a difference between groups, a within participant design will require collecting fewer participants than a between participant design. The reason for the decrease in the sample size is explained by the equation below from Maxwell, Delaney, and Kelley (2017). The number of participants needed in a two group within-participants design (NW) relative to the number of participants needed in a two group between-participants design (NB), assuming normal distributions, is:

$$N_W = \frac{N_B (1 - \rho)}{2}$$

The required number of participants is divided by two because in a within-participants design with two conditions every participant provides two data points. The extent to which this reduces the sample size compared to a between-participants design also depends on the correlation between the dependent variables (e.g., the correlation between the measure collected in a control task and an experimental task), as indicated by the (1 - ρ) part of the equation. If the correlation is 0, a within-participants design simply needs half as many participants as a between participant design (e.g., 64 instead of 128 participants). The higher the correlation, the larger the relative benefit of within-participants designs, and whenever the correlation is negative (up to -1) the relative benefit disappears. Especially when dependent variables in within-participants designs are positively correlated, within-participants designs will greatly increase the power you can achieve given the sample size you have available. Use within-participants designs when possible, but weigh the benefits of higher power against the downsides of order effects or carryover effects that might be problematic in a within-participants design (Maxwell et al., 2017).⁷ For designs with multiple factors with multiple levels it can be difficult to specify the full correlation matrix that specifies the expected population correlation for each pair of measurements (Lakens & Caldwell, 2019). In these cases sequential analyses might provide a solution.

⁷ You can compare within- and between-participants designs in this Shiny app: http://shiny.ieis.tue.nl/within_between.
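
The equation above is easily evaluated for a few illustrative correlations; the values of ρ and NB below are assumptions for illustration only.

```r
# Sketch: participants needed in a two-condition within-participants design,
# relative to a between-participants design that needs NB participants,
# for different correlations between the two measurements.
NB <- 128
rho <- c(-0.5, 0, 0.5, 0.7)
NW <- NB * (1 - rho) / 2
data.frame(rho = rho, NW = NW)  # rho = 0 gives 64, i.e., half of 128
```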
In general, the smaller the variation, the larger the standardized effect size (because we are dividing the raw effect by a smaller standard deviation), and thus the higher the power given the same number of observations. Some additional recommendations are provided in the literature (Allison, Allison, Faith, Paultre, & Pi-Sunyer, 1997; Bausell & Li, 2002; Hallahan & Rosenthal, 1996), such as:

1. Use better ways to screen participants for studies where participants need to be screened before participation.
2. Assign participants unequally to conditions (if data in the control condition is much cheaper to collect than data in the experimental condition, for example).
3. Use reliable measures that have low error variance (Williams, Zimmerman, & Zumbo, 1995).
4. Smart use of preregistered covariates (Meyvis & Van Osselaer, 2018).

It is important to consider if these ways to reduce the variation in the data do not come at too large a cost for external validity. For example, in an intention-to-treat analysis in randomized controlled trials participants who do not comply with the protocol are maintained in the analysis such that the effect size from the study accurately represents the effect of implementing the intervention in the population, and not the effect of the intervention only on those people who perfectly follow the protocol (Gupta, 2011). Similar trade-offs between reducing the variance and external validity exist in other research areas.

Know Your Measure

Although it is convenient to talk about standardized effect sizes, it is generally preferable if researchers can interpret effects in the raw (unstandardized) scores, and have knowledge about the standard deviation of their measures (Baguley, 2009; Lenth, 2001). To make it possible for a research community to have realistic expectations about the standard deviation of measures they collect, it is beneficial if researchers within a research area use the same validated measures. This provides a reliable knowledge base that makes it easier to plan for a desired accuracy, and to use a smallest effect size of interest on the unstandardized scale in an a-priori power analysis.
analyses might provide a solution. en.

In addition to knowledge about the standard deviation it is important to have knowledge about the correlations between dependent variables (for example because Cohen's dz for a dependent t test relies on the correlation between means). The more complex the model, the more aspects of the data-generating process need to be known to make predictions. For example, in hierarchical models researchers need knowledge about variance components to be able to perform a power analysis (DeBruine & Barr, 2019; Westfall et al., 2014). Finally, it is important to know the reliability of your measure (Parsons, Kruijt, & Fox, 2019), especially when relying on an effect size from a published study that used a measure with different reliability, or when the same measure is used in different populations, in which case it is possible that measurement reliability differs between populations. With the increasing availability of open data, it will hopefully become easier to estimate these parameters using data from earlier studies.

If we calculate a standard deviation from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off, while due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a confidence interval around the estimate (Smithson, 2003), and design pilot studies that will yield a sufficiently reliable estimate of the standard deviation. The confidence interval for the variance σ² is provided in the following formula, and the confidence interval for the standard deviation is the square root of these limits:

$$\left( \frac{(N - 1)s^2}{\chi^2_{N-1:\alpha/2}},\ \frac{(N - 1)s^2}{\chi^2_{N-1:1-\alpha/2}} \right)$$
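
The formula can be applied directly with base R's chi-squared quantile function; the sample size and observed standard deviation below are illustrative.

```r
# Sketch: 95% confidence interval around a sample standard deviation,
# using the chi-squared based interval for the variance given above.
N <- 30; s <- 1.2; alpha <- 0.05

var_limits <- c((N - 1) * s^2 / qchisq(1 - alpha / 2, N - 1),
                (N - 1) * s^2 / qchisq(alpha / 2, N - 1))
sqrt(var_limits)  # lower and upper limit for the standard deviation
```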
Whenever there is uncertainty about parameters, researchers can use sequential designs to perform an internal pilot study (Wittes & Brittain, 1990). The idea behind an internal pilot study is that researchers specify a tentative sample size for the study, perform an interim analysis, use the data from the internal pilot study to update parameters such as the variance of the measure, and finally update the final sample size that will be collected. As long as interim looks at the data are blinded (e.g., information about the conditions is not taken into account) the sample size can be adjusted based on an updated estimate of the variance without any practical consequences for the Type I error rate (Friede & Kieser, 2006; Proschan, 2005). Therefore, if researchers are interested in designing an informative study where the Type I and Type II error rates are controlled, but they lack information about the standard deviation, an internal pilot study might be an attractive approach to consider (Chang, 2016).

Sample Size Justification in Qualitative Research

A value of information perspective to sample size justification also applies to qualitative research. A sample size justification in qualitative research should be based on the consideration that the cost of collecting data from additional participants does not yield new information that is valuable enough given the inferential goals. One widely used application of this idea is known as saturation and is indicated by the observation that new data replicates earlier observations, without adding new information (Morse, 1995). For example, let's imagine we ask people why they have a pet. Interviews might reveal reasons that are grouped into categories, but after interviewing 20 people, no new categories emerge, at which point saturation has been reached. Alternative philosophies to qualitative research exist, and not all value planning for saturation. Regrettably, principled approaches to justify sample sizes have not been developed for these alternative philosophies (Marshall, Cardon, Poddar, & Fontenot, 2013).

When sampling, the goal is often not to pick a representative sample, but a sample that contains a sufficiently diverse number of subjects such that saturation is reached efficiently. Fugard and Potts (2015) show how to move towards a more informed justification for the sample size in qualitative research based on 1) the number of codes that exist in the population (e.g., the number of reasons people have pets), 2) the probability a code can be observed in a single information source (e.g., the probability that someone you interview will mention each possible reason for having a pet), and 3) the number of times you want to observe each code. They provide an R formula based on binomial probabilities to compute a required sample size to reach a desired probability of observing codes.
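
A sketch in the spirit of this idea, using binomial probabilities in base R; this is not the exact formula provided by Fugard and Potts (2015), and the probabilities and thresholds below are illustrative assumptions.

```r
# Sketch: how many information sources are needed to observe a code at
# least k times with high probability, when the code appears in a single
# source with probability p?
n_sources <- function(p, k = 1, prob = 0.95) {
  n <- k
  # P(observe the code at least k times in n sources)
  while (1 - pbinom(k - 1, n, p) < prob) n <- n + 1
  n
}

n_sources(p = 0.20, k = 1)  # a code mentioned by 1 in 5 interviewees
n_sources(p = 0.05, k = 2)  # a rarer code that we want to observe twice
```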
A more advanced approach is used in Rijnsoever (2017), which also explores the importance of different sampling strategies. In general, purposefully sampling information from sources you expect will yield novel information is much more efficient than random sampling, but this also requires a good overview of the expected codes, and the sub-populations in which each code can be observed. Sometimes, it is possible to identify information sources that, when interviewed, would at least yield one new code (e.g., based on informal communication before an interview). A good sample size justification in qualitative research is based on 1) an identification of the populations, including any sub-populations, 2) an estimate of the number of codes in the (sub-)population, 3) the probability a code is encountered in an information source, and 4) the sampling strategy that is used.

Discussion

Providing a coherent sample size justification is an essential step in designing an informative study. There are multiple approaches to justifying the sample size in a study, depending on the goal of the data collection, the resources that are available, and the statistical approach that is used to analyze the data. An overarching principle in all these approaches is that researchers consider the value of the information they collect in relation to their inferential goals.

The process of justifying a sample size when designing a study should sometimes lead to the conclusion that it is not worthwhile to collect the data, because the study does not have sufficient informational value to justify the costs. There will be cases where it is unlikely there will ever be enough data to perform a meta-analysis (for example because of a lack of general interest in the topic), the information will not be used to make a decision or claim, and the statistical tests do not allow you to test a hypothesis with reasonable error rates or to estimate an effect size with sufficient accuracy. If there is no good justification to collect the maximum number of observations that one can feasibly collect, performing the study anyway is a waste of time and/or money (Brown, 1983; Button et al., 2013; Halpern et al., 2002).

The awareness that sample sizes in past studies were often too small to meet any realistic inferential goals is growing among psychologists (Button et al., 2013; Fraley & Vazire, 2014; Lindsay, 2015; Sedlmeier & Gigerenzer, 1989). As an increasing number of journals start to require sample size justifications, some researchers will realize they need to collect larger samples than they were used to. This means researchers will need to request more money for participant payment in grant proposals, or that researchers will need to increasingly collaborate (Moshontz et al., 2018). If you believe your research question is important enough to be answered, but you are not able to answer the question with your current resources, one approach to consider is to organize a research collaboration with peers, and pursue an answer to this question collectively.

A sample size justification should not be seen as a hurdle that researchers need to pass before they can submit a grant, ethical review board proposal, or manuscript for publication. When a sample size is simply stated, instead of carefully justified, it can be difficult to evaluate whether the value of the information a researcher aims to collect outweighs the costs of data collection. Being able to report a solid sample size justification means a researcher knows what they want to learn from a study, and makes it possible to design a study that can provide an informative answer to a scientific question.

References

Aberson, C. L. (2019). Applied Power Analysis for the Behavioral Sciences (Second). New York: Routledge.

Albers, C. J., Kiers, H. A. L., & Ravenzwaaij, D. van. (2018). Credible Confidence: A Pragmatic View on the Frequentist vs Bayesian Debate. Collabra: Psychology, 4(1), 31. https://doi.org/10.1525/collabra.149

Albers, C. J., & Lakens, D. (2018). When power analyses based on pilot data are biased: Inaccurate effect size estimators and follow-up bias. Journal of Experimental Social Psychology, 74, 187–195. https://doi.org/10.1016/j.jesp.2017.09.004

Allison, D. B., Allison, R. L., Faith, M. S., Paultre, F., & Pi-Sunyer, F. X. (1997). Power and money: Designing statistically powerful studies while minimizing financial costs. Psychological Methods, 2(1), 20–33. https://doi.org/10.1037/1082-989X.2.1.20

Anderson, K. M. (2014). Group Sequential Design in R. In Clinical Trial Biostatistics and Biopharmaceutical Applications (pp. 179–209). New York: CRC Press.

Anderson, S. F., Kelley, K., & Maxwell, S. E. (2017). Sample-size planning for more accurate statistical power: A method adjusting sample effect sizes for publication bias and uncertainty. Psychological Science, 28(11), 1547–1562. https://doi.org/10.1177/0956797617723724

Bacchetti, P. (2010). Current sample size conventions: Flaws, harms, and alternatives. BMC Medicine, 8(1), 17. https://doi.org/10.1186/1741-7015-8-17

Baguley, T. (2004). Understanding statistical power in the context of applied research. Applied Ergonomics, 35(2), 73–80. https://doi.org/10.1016/j.apergo.2004.01.002

Baguley, T. (2009). Standardized or simple effect size: What should be reported? British Journal of Psychology, 100(3), 603–617. https://doi.org/10.1348/000712608X377117

Bausell, R. B., & Li, Y.-F. (2002). Power analysis for experimental research: A practical guide for the biological, medical and social sciences. Cambridge University Press.

Berkeley, G. (1735). A defence of free-thinking in mathematics, in answer to a pamphlet of Philalethes Cantabrigiensis entitled Geometry No Friend to Infidelity. Also an appendix concerning mr. Walton's Vindication of the principles of fluxions against the objections contained in The analyst. By the author of The minute philosopher (Vol. 3).
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. The Journal of Applied Psychology, 100(2), 431–449. https://doi.org/10.1037/a0038047
Brown, G. W. (1983). Errors, Types I and II. American Journal of Diseases of Children, 137(6), 586–591. https://doi.org/10.1001/archpedi.1983.02140320062014
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 16. https://doi.org/10.5334/joc.72
Brysbaert, M., & Stevens, M. (2018). Power Analysis and Effect Size in Mixed Effects Models: A Tutorial. Journal of Cognition, 1(1). https://doi.org/10.5334/joc.10
Button, K. S., Ioannidis, J. P. A., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S. J., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365–376. https://doi.org/10.1038/nrn3475
Carter, E. C., Schönbrodt, F. D., Gervais, W. M., & Hilgard, J. (2019). Correcting for Bias in Psychology: A Comparison of Meta-Analytic Methods. Advances in Methods and Practices in Psychological Science, 2(2), 115–144. https://doi.org/10.1177/2515245919847196
Cascio, W. F., & Zedeck, S. (1983). Open a New Window in Rational Research Planning: Adjust Alpha to Maximize Statistical Power. Personnel Psychology, 36(3), 517–526. https://doi.org/10.1111/j.1744-6570.1983.tb02233.x
Chang, M. (2016). Adaptive Design Theory and Implementation Using SAS and R (2nd edition). Chapman and Hall/CRC.
Cho, H.-C., & Abe, S. (2013). Is two-tailed testing for directional research hypotheses tests legitimate? Journal of Business Research, 66(9), 1261–1266. https://doi.org/10.1016/j.jbusres.2012.02.023
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed). Hillsdale, N.J: L. Erlbaum Associates.
Cook, J., Hislop, J., Adewuyi, T., Harrild, K., Altman, D., Ramsay, C., . . . Vale, L. (2014). Assessing methods to specify the target difference for a randomised controlled trial: DELTA (Difference ELicitation in TriAls) review. Health Technology Assessment, 18(28). https://doi.org/10.3310/hta18280
Copay, A. G., Subach, B. R., Glassman, S. D., Polly, D. W., & Schuler, T. C. (2007). Understanding the minimum clinically important difference: A review of concepts and methods. The Spine Journal, 7(5), 541–546. https://doi.org/10.1016/j.spinee.2007.01.008
Correll, J., Mellinger, C., McClelland, G. H., & Judd, C. M. (2020). Avoid Cohen's "Small", "Medium", and "Large" for Power Analysis. Trends in Cognitive Sciences, 24(3), 200–207. https://doi.org/10.1016/j.tics.2019.12.009
Cousineau, D., & Chiasson, F. (2019). Superb: Computes standard error and confidence interval of means under various designs and sampling schemes [Manual].
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
Cumming, G. (2014). The New Statistics: Why and How. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Cumming, G., & Calin-Jageman, R. (2016). Introduction to the New Statistics: Estimation, Open Science, and Beyond. Routledge.
DeBruine, L. M., & Barr, D. J. (2019). Understanding mixed effects models through data simulation [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/xp5cy
Detsky, A. S. (1990). Using cost-effectiveness analysis to improve the efficiency of allocating funds to clinical trials. Statistics in Medicine, 9(1-2), 173–184. https://doi.org/10.1002/sim.4780090124
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5. https://doi.org/10.3389/fpsyg.2014.00781
Dodge, H. F., & Romig, H. G. (1929). A Method of Sampling Inspection. Bell System Technical Journal, 8(4), 613–631. https://doi.org/10.1002/j.1538-7305.1929.tb01240.x

Eckermann, S., Karnon, J., & Willan, A. R. (2010). The Value of Value of Information. PharmacoEconomics, 28(9), 699–709. https://doi.org/10.2165/11537370-000000000-00000
Erdfelder, E., Faul, F., & Buchner, A. (1996). GPOWER: A general power analysis program. Behavior Research Methods, Instruments, & Computers, 28(1), 1–11. https://doi.org/10.3758/BF03203630
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). GPower 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/BF03193146
Ferron, J., & Onghena, P. (1996). The Power of Randomization Tests for Single-Case Phase Designs. The Journal of Experimental Education, 64(3), 231–239. https://doi.org/10.1080/00220973.1996.9943805
Field, S. A., Tyre, A. J., Jonzén, N., Rhodes, J. R., & Possingham, H. P. (2004). Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecology Letters, 7(8), 669–675. https://doi.org/10.1111/j.1461-0248.2004.00625.x
Fraley, R. C., & Vazire, S. (2014). The N-Pact Factor: Evaluating the Quality of Empirical Journals with Respect to Sample Size and Statistical Power. PLOS ONE, 9(10), e109019. https://doi.org/10.1371/journal.pone.0109019
Friede, T., & Kieser, M. (2006). Sample size recalculation in internal pilot study designs: A review. Biometrical Journal: Journal of Mathematical Methods in Biosciences, 48(4), 537–555. https://doi.org/10.1002/bimj.200510238
Fugard, A. J. B., & Potts, H. W. W. (2015). Supporting thinking on sample sizes for thematic analyses: A quantitative tool. International Journal of Social Research Methodology, 18(6), 669–684. https://doi.org/10.1080/13645579.2015.1005453
Funder, D. C., & Ozer, D. J. (2019). Evaluating effect size in psychological research: Sense and nonsense. Advances in Methods and Practices in Psychological Science, 2(2), 156–168. https://doi.org/10.1177/2515245919847202
Good, I. J. (1992). The Bayes/Non-Bayes Compromise: A Brief Review. Journal of the American Statistical Association, 87(419), 597–606. https://doi.org/10.2307/2290192
Green, P., & MacLeod, C. J. (2016). SIMR: An R package for power analysis of generalized linear mixed models by simulation. Methods in Ecology and Evolution, 7(4), 493–498. https://doi.org/10.1111/2041-210X.12504
Green, S. B. (1991). How Many Subjects Does It Take To Do A Regression Analysis. Multivariate Behavioral Research, 26(3), 499–510. https://doi.org/10.1207/s15327906mbr2603_7
Grünwald, P., de Heide, R., & Koolen, W. (2019). Safe Testing. arXiv:1906.07801 [Cs, Math, Stat]. Retrieved from http://arxiv.org/abs/1906.07801
Gupta, S. K. (2011). Intention-to-treat concept: A review. Perspectives in Clinical Research, 2(3), 109–112. https://doi.org/10.4103/2229-3485.83221
Hallahan, M., & Rosenthal, R. (1996). Statistical power: Concepts, procedures, and applications. Behaviour Research and Therapy, 34(5), 489–499. https://doi.org/10.1016/0005-7967(95)00082-8
Halpern, J., Brown Jr, B. W., & Hornberger, J. (2001). The sample size for a clinical trial: A Bayesian decision theoretic approach. Statistics in Medicine, 20(6), 841–858. https://doi.org/10.1002/sim.703
Halpern, S. D., Karlawish, J. H., & Berlin, J. A. (2002). The continuing unethical conduct of underpowered clinical trials. Jama, 288(3), 358–362. https://doi.org/10.1001/jama.288.3.358
Hedges, L. V., & Pigott, T. D. (2001). The power of statistical tests in meta-analysis. Psychological Methods, 6(3), 203–217. https://doi.org/10.1037/1082-989X.6.3.203
Hilgard, J. (2021). Maximal positive controls: A method for estimating the largest plausible effect size. Journal of Experimental Social Psychology, 93. https://doi.org/10.1016/j.jesp.2020.104082
Hill, C. J., Bloom, H. S., Black, A. R., & Lipsey, M. W. (2008). Empirical Benchmarks for Interpreting Effect Sizes in Research. Child Development Perspectives, 2(3), 172–177. https://doi.org/10.1111/j.1750-8606.2008.00061.x
Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. The American Statistician, 55(1), 19–24. https://doi.org/10.1198/000313001300339897

Jaeschke, R., Singer, J., & Guyatt, G. H. (1989). Measurement of health status: Ascertaining the minimal clinically important difference. Controlled Clinical Trials, 10(4), 407–415. https://doi.org/10.1016/0197-2456(89)90005-6
Jeffreys, H. (1939). Theory of probability (1st ed). Oxford [Oxfordshire]: New York: Oxford University Press.
Jennison, C., & Turnbull, B. W. (2000). Group sequential methods with applications to clinical trials. Boca Raton: Chapman & Hall/CRC.
Julious, S. A. (2004). Sample sizes for clinical trials with normal data. Statistics in Medicine, 23(12), 1921–1986. https://doi.org/10.1002/sim.1783
Keefe, R. S. E., Kraemer, H. C., Epstein, R. S., Frank, E., Haynes, G., Laughren, T. P., . . . Leon, A. C. (2013). Defining a Clinically Meaningful Effect for the Design and Interpretation of Randomized Controlled Trials. Innovations in Clinical Neuroscience, 10(5-6 Suppl A), 4S–19S.
Kelley, K. (2007). Confidence Intervals for Standardized Effect Sizes: Theory, Application, and Implementation. Journal of Statistical Software, 20(8). https://doi.org/10.18637/JSS.V020.I08
Kelley, K., & Preacher, K. J. (2012). On effect size. Psychological Methods, 17(2), 137–152. https://doi.org/10.1037/a0028086
Kelley, K., & Rausch, J. R. (2006). Sample size planning for the standardized mean difference: Accuracy in parameter estimation via narrow confidence intervals. Psychological Methods, 11(4), 363–385. https://doi.org/10.1037
Kenny, D. A., & Judd, C. M. (2019). The unappreciated heterogeneity of effect sizes: Implications for power, precision, planning of research, and replication. Psychological Methods, 24(5), 578–589. https://doi.org/10.1037/met0000209
King, M. T. (2011). A point of minimal important difference (MID): A critique of terminology and methods. Expert Review of Pharmacoeconomics & Outcomes Research, 11(2), 171–184. https://doi.org/10.1586/erp.11.9
Kish, L. (1965). Survey Sampling. New York: Wiley.
Kraft, M. A. (2020). Interpreting effect sizes of education interventions. Educational Researcher, 49(4), 241–253. https://doi.org/10.3102/0013189X20912798
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.
Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146
Kruschke, J. K. (2018). Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1(2), 270–280. https://doi.org/10.1177/2515245918771304
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710. https://doi.org/10.1002/ejsp.2023
Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177
Lakens, D., Adolfi, F. G., Albers, C. J., Anvari, F., Apps, M. A. J., Argamon, S. E., . . . Zwaan, R. A. (2018). Justify your alpha. Nature Human Behaviour, 2, 168–171. https://doi.org/10.1038/s41562-018-0311-x
Lakens, D., & Caldwell, A. R. (2019). Simulation-Based Power-Analysis for Factorial ANOVA Designs. PsyArXiv. https://doi.org/10.31234/osf.io/baxsf
Lakens, D., & Etz, A. J. (2017). Too True to be Bad: When Sets of Studies With Significant and Nonsignificant Findings Are Probably True. Social Psychological and Personality Science, 8(8), 875–881. https://doi.org/10.1177/1948550617693058
Lakens, D., Scheel, A. M., & Isager, P. M. (2018). Equivalence testing for psychological research: A tutorial. Advances in Methods and Practices in Psychological Science, 1(2), 259–269. https://doi.org/10.1177/2515245918770963
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. The American Statistician, 55(3), 187–193. https://doi.org/10.1198/000313001317098149
Lenth, R. V. (2007). Post hoc power: Tables and commentary. Iowa City: Department of Statistics and Actuarial Science, University of Iowa.

Leon, A. C., Davis, L. L., & Kraemer, H. C. (2011). The Role and Interpretation of Pilot Studies in Clinical Research. Journal of Psychiatric Research, 45(5), 626–629. https://doi.org/10.1016/j.jpsychires.2010.10.008
Lindsay, D. S. (2015). Replication in Psychological Science. Psychological Science, 26(12), 1827–1832. https://doi.org/10.1177/0956797615616374
Lovakov, A., & Agadullina, E. (2017). Empirically Derived Guidelines for Interpreting Effect Size in Social Psychology. PsyArXiv. https://doi.org/10.17605/OSF.IO/2EPC4
Marshall, B., Cardon, P., Poddar, A., & Fontenot, R. (2013). Does Sample Size Matter in Qualitative Research?: A Review of Qualitative Interviews in IS Research. Journal of Computer Information Systems, 54(1), 11–22. https://doi.org/10.1080/08874417.2013.11645667
Maxwell, S. E., Delaney, H. D., & Kelley, K. (2017). Designing Experiments and Analyzing Data: A Model Comparison Perspective, Third Edition (3 edition). New York, NY: Routledge.
Maxwell, S. E., & Kelley, K. (2011). Ethics and sample size planning. In Handbook of ethics in quantitative methodology (pp. 179–204). Routledge.
Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample Size Planning for Statistical Power and Accuracy in Parameter Estimation. Annual Review of Psychology, 59(1), 537–563. https://doi.org/10.1146/annurev.psych.59.103006.093735
McIntosh, R. D., & Rittmo, J. Ö. (2020). Power calculations in single case neuropsychology. https://doi.org/10.31234/osf.io/fxz49
Meyners, M. (2012). Equivalence tests – A review. Food Quality and Preference, 26(2), 231–245. https://doi.org/10.1016/j.foodqual.2012.05.003
Meyvis, T., & Van Osselaer, S. M. J. (2018). Increasing the Power of Your Study by Increasing the Effect Size. Journal of Consumer Research, 44(5), 1157–1173. https://doi.org/10.1093/jcr/ucx110
Miller, J., & Ulrich, R. (2019). The quest for an optimal alpha. PLOS ONE, 14(1), e0208631. https://doi.org/10.1371/journal.pone.0208631
Morey, R. D. (2020). Power and precision [Blog]. https://medium.com/@richarddmorey/power-and-precision-47f644ddea5e.
Morris, T. P., White, I. R., & Crowther, M. J. (2019). Using simulation studies to evaluate statistical methods. Statistics in Medicine, 38(11), 2074–2102. https://doi.org/10.1002/sim.8086
Morse, J. M. (1995). The Significance of Saturation. Qualitative Health Research, 5(2), 147–149. https://doi.org/10.1177/104973239500500201
Moshontz, H., Campbell, L., Ebersole, C. R., IJzerman, H., Urry, H. L., Forscher, P. S., . . . Antfolk, J. (2018). The Psychological Science Accelerator: Advancing psychology through a distributed collaborative network. Advances in Methods and Practices in Psychological Science, 1(4), 501–515. https://doi.org/10.1177/2515245918797607
Mudge, J. F., Baker, L. F., Edge, C. B., & Houlahan, J. E. (2012). Setting an Optimal α That Minimizes Errors in Null Hypothesis Significance Tests. PLOS ONE, 7(2), e32734. https://doi.org/10.1371/journal.pone.0032734
Murphy, K. R., Myors, B., & Wolach, A. H. (2014). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (Fourth edition). New York: Routledge, Taylor & Francis Group.
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 231(694-706), 289–337. https://doi.org/10.1098/rsta.1933.0009
Olsson-Collentine, A., Wicherts, J. M., & van Assen, M. A. L. M. (2020). Heterogeneity in direct replications in psychology and its association with effect size. Psychological Bulletin, 146(10), 922–940. https://doi.org/10.1037/bul0000294
Parker, R. A., & Berman, N. G. (2003). Sample Size. The American Statistician, 57(3), 166–170. https://doi.org/10.1198/0003130031919
Parsons, S., Kruijt, A.-W., & Fox, E. (2019). Psychological Science Needs a Standard Practice of Reporting the Reliability of Cognitive-Behavioral Measurements. Advances in Methods and Practices in Psychological Science, 2(4), 378–395. https://doi.org/10.1177/2515245919879695
Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard power as a protection against imprecise power estimates. Perspectives on Psychological Science, 9(3), 319–332. https://doi.org/10.1177/1745691614528519

Perugini, M., Gallucci, M., & Costantini, G. (2018). A Practical Primer To Power Analysis for Simple Experimental Designs. International Review of Social Psychology, 31(1), 20. https://doi.org/10.5334/irsp.181
Phillips, B. M., Hunt, J. W., Anderson, B. S., Puckett, H. M., Fairey, R., Wilson, C. J., & Tjeerdema, R. (2001). Statistical significance of sediment toxicity test results: Threshold values derived by the detectable significance approach. Environmental Toxicology and Chemistry, 20(2), 371–373. https://doi.org/10.1002/etc.5620200218
Proschan, M. A. (2005). Two-Stage Sample Size Re-Estimation Based on a Nuisance Parameter: A Review. Journal of Biopharmaceutical Statistics, 15(4), 559–574. https://doi.org/10.1081/BIP-200062852
Proschan, M. A., Lan, K. K. G., & Wittes, J. T. (2006). Statistical monitoring of clinical trials: A unified approach. New York, NY: Springer.
Rice, W. R., & Gaines, S. D. (1994). 'Heads I win, tails you lose': Testing directional alternative hypotheses in ecological and evolutionary research. Trends in Ecology & Evolution, 9(6), 235–237. https://doi.org/10.1016/0169-5347(94)90258-5
Richard, F. D., Bond, C. F., & Stokes-Zoota, J. J. (2003). One Hundred Years of Social Psychology Quantitatively Described. Review of General Psychology, 7(4), 331–363. https://doi.org/10.1037/1089-2680.7.4.331
Richardson, J. T. E. (2011). Eta squared and partial eta squared as measures of effect size in educational research. Educational Research Review, 6(2), 135–147. https://doi.org/10.1016/j.edurev.2010.12.001
Rijnsoever, F. J. van. (2017). (I Can't Get No) Saturation: A simulation and guidelines for sample sizes in qualitative research. PLOS ONE, 12(7), e0181689. https://doi.org/10.1371/journal.pone.0181689
Rogers, J. L., Howard, K. I., & Vessey, J. T. (1993). Using significance tests to evaluate equivalence between two experimental groups. Psychological Bulletin, 113(3), 553–565. https://doi.org/10.1037/0033-2909.113.3.553
Scheel, A. M., Tiokhin, L., Isager, P. M., & Lakens, D. (2020). Why Hypothesis Testers Should Spend Less Time Testing Hypotheses. Perspectives on Psychological Science, 1745691620966795. https://doi.org/10.1177/1745691620966795
Schnuerch, M., & Erdfelder, E. (2020). Controlling decision errors with minimal costs: The sequential probability ratio t test. Psychological Methods, 25(2), 206–226. https://doi.org/10.1037/met0000234
Schoemann, A. M., Boulton, A. J., & Short, S. D. (2017). Determining Power and Sample Size for Simple and Complex Mediation Models. Social Psychological and Personality Science, 8(4), 379–386. https://doi.org/10.1177/1948550617715068
Schönbrodt, F. D., & Perugini, M. (2013). At what sample size do correlations stabilize? Journal of Research in Personality, 47(5), 609–612. https://doi.org/10.1016/j.jrp.2013.05.009
Schönbrodt, F. D., Wagenmakers, E.-J., Zehetleitner, M., & Perugini, M. (2017). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods, 22(2), 322–339. https://doi.org/10.1037/MET0000061
Schulz, K. F., & Grimes, D. A. (2005). Sample size calculations in randomised trials: Mandatory and mystical. The Lancet, 365(9467), 1348–1353. https://doi.org/10.1016/S0140-6736(05)61034-3
Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309–316. https://doi.org/10.1037/0033-2909.105.2.309
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2013). Life after P-Hacking. New Orleans, LA.
Simonsohn, U. (2015). Small Telescopes: Detectability and the Evaluation of Replication Results. Psychological Science, 26(5), 559–569. https://doi.org/10.1177/0956797614567341
Smithson, M. (2003). Confidence intervals. Thousand Oaks, Calif: Sage Publications.
Spiegelhalter, D. (2019). The Art of Statistics: How to Learn from Data (Illustrated edition). New York: Basic Books.
Taylor, D. J., & Muller, K. E. (1996). Bias in linear model power and sample size calculation due to estimating noncentrality. Communications in Statistics-Theory and Methods, 25(7), 1595–1610. https://doi.org/10.1080/03610929608831787

Teare, M. D., Dimairo, M., Shephard, N., Hayman, A., Whitehead, A., & Walters, S. J. (2014). Sample size requirements to estimate key design parameters from external pilot randomised controlled trials: A simulation study. Trials, 15(1), 264. https://doi.org/10.1186/1745-6215-15-264
ter Schure, J., & Grünwald, P. D. (2019). Accumulation Bias in Meta-Analysis: The Need to Consider Time in Error Control. arXiv:1905.13494 [Math, Stat]. Retrieved from http://arxiv.org/abs/1905.13494
Tversky, A. (1977). Features of similarity. Psychological Review, 84(4), 327–352. https://doi.org/10.1037/0033-295X.84.4.327
Valentine, J. C., Pigott, T. D., & Rothstein, H. R. (2010). How Many Studies Do You Need?: A Primer on Statistical Power for Meta-Analysis. Journal of Educational and Behavioral Statistics, 35(2), 215–247. https://doi.org/10.3102/1076998609346961
Viechtbauer, W., Smits, L., Kotz, D., Budé, L., Spigt, M., Serroyen, J., & Crutzen, R. (2015). A simple formula for the calculation of sample size in pilot studies. Journal of Clinical Epidemiology, 68(11), 1375–1379. https://doi.org/10.1016/j.jclinepi.2015.04.014
Wald, A. (1945). Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2), 117–186. https://www.jstor.org/stable/2240273
Wassmer, G., & Brannath, W. (2016). Group Sequential and Confirmatory Adaptive Designs in Clinical Trials. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-32562-0
Wassmer, G., & Pahlke, F. (2019). Rpact: Confirmatory adaptive clinical trial design and analysis.
Westberg, M. (1985). Combining Independent Statistical Tests. Journal of the Royal Statistical Society. Series D (the Statistician), 34(3), 287–296. https://doi.org/10.2307/2987655
Westfall, J., Kenny, D. A., & Judd, C. M. (2014). Statistical power and optimal design in experiments in which samples of participants respond to samples of stimuli. Journal of Experimental Psychology: General, 143(5), 2020–2045. https://doi.org/10.1037/xge0000014
Williams, R. H., Zimmerman, D. W., & Zumbo, B. D. (1995). Impact of Measurement Error on Statistical Power: Review of an Old Paradox. The Journal of Experimental Education, 63(4), 363–370. https://doi.org/10.1080/00220973.1995.9943470
Wilson, E. C. F. (2015). A Practical Guide to Value of Information Analysis. PharmacoEconomics, 33(2), 105–121. https://doi.org/10.1007/s40273-014-0219-x
Wilson VanVoorhis, C. R., & Morgan, B. L. (2007). Understanding power and rules of thumb for determining sample sizes. Tutorials in Quantitative Methods for Psychology, 3(2), 43–50. https://doi.org/10.20982/tqmp.03.2.p043
Winer, B. J. (1962). Statistical principles in experimental design. New York: McGraw-Hill.
Wittes, J., & Brittain, E. (1990). The role of internal pilot studies in increasing the efficiency of clinical trials. Statistics in Medicine, 9(1-2), 65–72. https://doi.org/10.1002/sim.4780090113
Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 30(2), 141–167. https://doi.org/10.3102/10769986030002141
