Saturation in Qualitative Research


Social Science & Medicine 292 (2022) 114523

Contents lists available at ScienceDirect

Social Science & Medicine


journal homepage: www.elsevier.com/locate/socscimed

Sample sizes for saturation in qualitative research: A systematic review of empirical tests
Monique Hennink a, *, Bonnie N. Kaiser b
a Hubert Department of Global Health, Rollins School of Public Health, Emory University, 1518 Clifton Rd, Atlanta, GA, 30322, USA
b Department of Anthropology and Global Health Program, University of California San Diego, 9500 Gilman Drive 0532, La Jolla, CA, 92093, USA

ARTICLE INFO

Keywords: Sample size; Saturation; Qualitative research; Interviews; Focus group discussions

ABSTRACT

Objective: To review empirical studies that assess saturation in qualitative research in order to identify sample sizes for saturation, strategies used to assess saturation, and guidance we can draw from these studies.
Methods: We conducted a systematic review of four databases to identify studies empirically assessing sample sizes for saturation in qualitative research, supplemented by searching citing articles and reference lists.
Results: We identified 23 articles that used empirical data (n = 17) or statistical modeling (n = 6) to assess saturation. Studies using empirical data reached saturation within a narrow range of interviews (9–17) or focus group discussions (4–8), particularly those with relatively homogenous study populations and narrowly defined objectives. Most studies had a relatively homogenous study population and assessed code saturation; the few outliers (e.g., multi-country research, meta-themes, “code meaning” saturation) needed larger samples for saturation.
Conclusions: Despite varied research topics and approaches to assessing saturation, studies converged on a relatively consistent sample size for saturation for commonly used qualitative research methods. However, these findings apply to certain types of studies (e.g., those with homogenous study populations). These results provide strong empirical guidance on effective sample sizes for qualitative research, which can be used in conjunction with the characteristics of individual studies to estimate an appropriate sample size prior to data collection. This synthesis also provides an important resource for researchers, academic journals, journal reviewers, ethical review boards, and funding agencies to facilitate greater transparency in justifying and reporting sample sizes in qualitative research. Future empirical research is needed to explore how various parameters affect sample sizes for saturation.

1. Introduction

Saturation is the most common guiding principle for assessing the adequacy of purposive samples in qualitative research (Morse, 1995, 2015; Sandelowski, 1995). However, guidance on assessing saturation and the sample sizes needed to reach saturation have been vague. Until recently, saturation had not been empirically assessed with different types of qualitative data. A growing interest in empirical assessment of saturation has now generated a body of research on the topic, making it an opportune time to synthesize it and identify what we can learn from it. This systematic review sought to identify studies that empirically assess saturation in qualitative research, to identify sample sizes needed for saturation, strategies used to assess saturation, and guidance we can draw from these studies.

The concept of saturation was developed by Glaser and Strauss (1967) as ‘theoretical saturation’ and was part of their influential grounded theory approach to qualitative research. Grounded theory focuses on developing sociological theory from textual data to explain social phenomena. Within this approach, theoretical saturation refers to “the point at which gathering more data about a theoretical construct reveals no new properties, nor yields any further theoretical insights about the emerging grounded theory” (Bryant and Charmaz, 2007, p. 611). Thus, it is the point in data collection when all important issues or insights are exhausted from data, which signifies that the conceptual categories that comprise the theory are ‘saturated’, so that the emerging theory is comprehensive and well-grounded in data. Theoretical saturation is also embedded in an iterative process of concurrently sampling, collecting data, and analyzing data (Sandelowski, 1995), whereby data

* Corresponding author.
E-mail addresses: [email protected] (M. Hennink), [email protected] (B.N. Kaiser).

https://doi.org/10.1016/j.socscimed.2021.114523
Received 19 July 2021; Received in revised form 22 October 2021; Accepted 31 October 2021
Available online 2 November 2021
0277-9536/© 2021 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

continuously inform sampling until saturation.

Although most qualitative research does not follow a grounded theory approach, the concept of saturation is widely used in other approaches to qualitative research, where it is typically called ‘data saturation’ or ‘thematic saturation’ (Hennink et al., 2017). This broader application of saturation focuses more on assessing sample size rather than the adequacy of data to develop theory (as in theoretical saturation). When used in the broader context, saturation refers to the point in data collection when no additional issues or insights are identified and data begin to repeat so that further data collection is redundant, signifying that an adequate sample size is reached. Saturation is an important indicator that a sample is adequate for the phenomenon studied – that data collected have captured the diversity, depth, and nuances of the issues studied – and thereby demonstrates content validity (Francis et al., 2010). Reaching saturation has become a critical component of qualitative research that helps make data collection robust and valid (O’Reilly and Parker, 2013). Moreover, saturation is “the most frequently touted guarantee of qualitative rigor offered by authors to reviewers and readers” (Morse, 2015, p. 587). In this review, we focus on saturation in the broader context, since less is known about adequate sample sizes for saturation when used outside of the parameters of grounded theory.

Despite the importance of saturation to support the rigor of qualitative samples, there is a consistent lack of transparency in how sample sizes are justified in published qualitative research (Morse, 1995; Guest et al., 2006; Kerr et al., 2010; Carlsen and Glenton, 2011; Hennink et al., 2017). Although saturation is the most commonly cited justification for an adequate sample size (Morse, 1995, 2015), details of how saturation was assessed and the grounds on which it was determined are largely absent in qualitative studies. Vasileiou et al. (2018) conducted a systematic review of qualitative studies using in-depth interviews in health-related journals over a 15-year period and found the vast majority of articles provided no justification for their sample size. Where justifications were given, saturation was cited in 55% of articles; however, claims of saturation were “never substantiated in relation to procedures conducted in the study itself” (p. 12); only further citations of other literature were given that moved away from the study at hand. Similarly, a systematic review of 220 studies using focus group discussions (Carlsen and Glenton, 2011) found that 83% used saturation to justify their sample size; however, they provided only superficial reporting of how it was achieved, including unsubstantiated claims of saturation and references to achieving saturation while still using a predetermined sample size. Another study (Francis et al., 2010) reviewed articles in the journal Social Science and Medicine over 16 months and found most articles claimed they had reached saturation but provided no clarity on how saturation was defined, achieved, or justified. Marshall et al. (2013) also reviewed 83 qualitative studies and found saturation was not explained in any study. There are increasing concerns over claims of saturation without study-based explanations of how it was assessed or determined. Unsubstantiated claims of reaching saturation undermine the value of the concept. In part, this lack of transparency may reflect the absence of published guidance on how to assess saturation (Morse, 1995; Guest et al., 2006). In this review, we seek to identify the strategies used to assess saturation in empirical research, which may encourage greater transparency in reporting saturation in qualitative studies.

In addition, guidance on specific sample sizes needed to reach saturation in different qualitative methods has been absent or vague in the methodological literature, providing only general “rules of thumb” that are rarely evidence-based (Morse, 1995; Guest et al., 2006; Kerr et al., 2010; Bryman, 2012; Hennink et al., 2019). As research empirically assessing saturation begins to fill this gap, it allows us to provide much-needed empirically based guidance on sample sizes for saturation in qualitative research.

In this systematic review, we aim to synthesize empirical studies that assess saturation in qualitative data. In particular, we aim to document strategies used to assess saturation, identify sample sizes needed to reach saturation using different qualitative methods, and suggest guidance on sample sizes for qualitative research. To our knowledge, this is the first systematic review on empirical studies of saturation and therefore provides a valuable resource for researchers, academic journals, journal reviewers, ethical review boards, and funding agencies that review qualitative research. Researchers can refer to our results when estimating an appropriate sample size in research proposals and protocols, which may lead to more efficient use of research resources and clearer justifications for proposed sample sizes. Similarly, our results may provide evidence-based expectations regarding adequate sample sizes for qualitative research to guide those who review and fund research.

2. Methods

We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines in conducting and reporting our systematic review (Moher et al., 2009). Fig. 1 shows the number of articles identified, screened, and included. We used a two-stage search process, including database searches and citation searches.

First, we used four databases – PubMed, Embase, Sociological Abstracts, and CINAHL – to search for articles or book chapters that included “saturation” and one of the following terms in the title, abstract, or key words/index: “interview,” “focus group,” “qualitative,” or “thematic” (see Supplemental Table for full search terms). Search results were limited to English-language and human studies. Database searches were conducted on January 31 – February 1, 2019 and updated July 10, 2020. Both authors independently screened all article titles, abstracts, and, where needed, full texts to determine eligibility. Discrepancies were discussed and resolved by consensus. To be eligible for inclusion, studies needed to: a) use empirical data to assess saturation in qualitative research or use a statistical model to determine saturation using hypothetical data, b) focus on saturation outside of grounded theory, c) be published in journal articles or book chapters, and d) be available in English. Sixteen articles were included from database searches.

Second, we conducted citation searches by reviewing the reference lists of included articles and using the “cited by” search option in Google Scholar to identify further records meeting the inclusion criteria. For studies with more than 250 citing articles on Google Scholar, we searched within citing articles for “saturation” and reviewed the first 250 results (which are ordered by relevance). An additional seven articles were included during this step.

We extracted the following information from the 23 eligible articles: a) meta-data about the article (author, journal, year), b) information about data used (hypothetical vs. empirical; interviews, focus group discussions, etc.), research objective, sample size, study population (homogenous, heterogenous), and whether data collection was iterative, c) information about saturation, including: definition, goal, data randomization, strategy to assess saturation, sample size for saturation, and level of saturation achieved (e.g., 90% of codes), and d) additional information (limitations, any parameters of saturation suggested). Both authors independently extracted data from 6 articles and discussed results. This was done to identify any issues with the data extraction categories, such as lack of clarity or redundancy, as well as to establish reliability between the two authors. Each remaining article then underwent data extraction by one of the two authors.

We analyzed results separately for studies using empirical data to assess saturation versus those using statistical models. We analyzed sample sizes for saturation by qualitative method: interviews or focus group discussions. We conducted comparisons of saturation by homogeneity of the study population and randomization of data to identify any patterns.

3. Results

Our systematic review identified 23 articles assessing saturation for


Fig. 1. PRISMA Flow Diagram of Systematic Review Search Procedures.

qualitative research. All articles were published between 2006 and 2020, with the majority (87%, 20/23) published since 2014. Many articles were published in research methodology journals (43%, 10/23) and others in social science (6/23) or topical journals (7/23) (e.g., engineering, computing, natural resources). We categorized the articles into those assessing saturation using empirical data (Table 1, 17 articles) and those using statistical modeling to predict saturation (Table 2, 6 articles). Since these approaches and results are not comparable, we report each separately below.

3.1. Approaches to assessing saturation

3.1.1. Empirically based tests

Table 1 summarizes 17 articles that assess saturation using empirical data. Some articles used multiple datasets to assess saturation and report the results of each separately; therefore, Table 1 shows 23 tests from 17 articles (NB: while these studies were not conducting experimental tests, we use the term ‘test’ for brevity to refer to individual studies using empirical data, as opposed to statistical modeling, to assess saturation). Most articles used data from in-depth interviews (10/17) or focus group discussions (4/17); two articles used both types of data, and one article (Weller et al., 2018) used free list data. We excluded the article by Weller et al. in our analysis because free list data are not comparable to free-flowing narrative data from interviews and focus group discussions. We therefore use the denominator of 16 when describing all articles and 22 when describing the datasets and results of all tests with empirical data.

The original research objective for each dataset used in the tests varied, but most studies (14/16) focused on health issues, such as experiences of a specific health condition (e.g., sickle cell disease, multiple sclerosis, Paget’s disease), health service, or intervention (e.g., genetic screening, violence prevention, lifestyle interventions, patient retention). These research objectives are typical of much qualitative health research. The sample size of the datasets used varied from 14 to 132 interviews and 1 to 40 focus groups. All datasets except one (Francis et al., 2010) had a sample that was much larger than the sample ultimately needed for saturation, making them effective for assessing saturation. Francis et al. (2010) report saturation was reached at exactly the sample size of the study for both datasets used. Most datasets (18/22) had a homogenous study population, such as patients with a specific disease (e.g., HIV, rheumatoid arthritis, sickle cell) or from a specific demographic group (e.g., male nurses, medical students, South Asian adults, African American men). The remaining datasets had more heterogeneous samples, such as men aged 20–72 across the US or youths aged 14–18.

Authors described their goal of saturation in two ways, either as saturation of individual codes or categories. Although terminology varied across articles, codes were typically described as individual issues, topics, or items in data, while categories represented higher-order groupings of issues (e.g., broader themes, meta-themes, concepts). Forty-four percent (7/16) of articles sought saturation of codes, 31% (5/16) saturation of categories, and 25% stated both.

Where saturation was defined, authors used similar definitions. Overall, saturation was described as the point at which little or no relevant new codes and/or categories were found in data, when issues begin to be repeated with no further understanding or contribution to the study phenomenon, its dimensions, nuances, or variability. Some articles further specified that saturation should be confirmed only after no new issues were found in two or three consecutive interviews or focus groups (Coenen et al., 2012; Francis et al., 2010; Morse et al., 2014) or that it was determined by two researchers (Morse et al., 2014). Over half of articles (56%, 9/16) randomized the order of data for analysis to account for interview order, which might influence saturation. Some compared saturation between the randomized order of interviews and the actual order in which interviews were conducted, while others calculated saturation across multiple randomized orderings of data to identify an average.

Various strategies were used to assess saturation. These are categorized in Table 1 and the categories described in Table 3. Most articles (75%, 12/16) used a single strategy to assess saturation. All articles used some form of code frequency counts to assess saturation (including code frequency counts, comparative method, stopping criterion, higher-order groupings), and four articles used another approach in addition to code frequency counts and compared saturation for each (Hennink et al., 2017, 2019; Constantinou et al., 2017; Hagaman and Wutich, 2017). Many articles (37%, 6/16) used only code frequency counts to assess saturation, which involved counting codes in successive transcripts or sets of transcripts until the frequency of new codes diminishes, signaling saturation is reached. Three articles (18%, 3/16) added specific additional elements to code frequency counts, such as batch comparisons, a stopping criterion, or counting higher-order groupings of codes, such as meta-themes or categories of codes rather than individual codes. Three articles (Hennink et al., 2017, 2019; Nascimento et al., 2018) used ‘code meaning’ to assess saturation, an entirely different approach from code frequency counts. This approach focused on reaching a full understanding of issues in data as the indicator that saturation is reached, by assessing whether the issue, its dimensions, and nuances are fully identified and understood. Two articles (Hennink et al., 2017, 2019) then compared


Table 1
Articles assessing saturation using empirical data.

Columns: Author, Date, Journal | Research Objective | Sample Size | Study Population | Saturation Goal [b] | Data Randomized | Strategy to Assess Saturation [c] | Sample Size for Saturation

Type of Data Used: In-Depth Interviews
Ando et al. (2014), Comprehensive Psychology | Influences on quality of life for people with neurological conditions | 39 | Homogenous | Codes & Categories | No | Code Freq. Counts | 12 interviews for 92% of codes
Coenen et al. (2012) [a], Qual. Life Research | Daily life challenges for patients with rheumatoid arthritis | 21 | Homogenous | Categories | No | Higher Order Groupings | 9 interviews (inductive); 12 interviews (deductive)
Constantinou et al. (2017), Qualitative Research | Medical students’ beliefs and attitudes towards psychotherapy | 12 | Homogenous | Categories | Yes | Code Freq. Counts & Higher Order Groupings | 5 interviews (consecutive); 8 interviews (random)
Francis et al. (2010) [a], Psychology and Health | How do doctors manage sore throat without antibiotics | 14 | Homogenous | Categories | No | Stopping Criterion | 14 interviews
Francis et al. (2010) [a], Psychology and Health | Acceptability of genetic screening for relatives of patients with Paget’s disease | 17 | Homogenous | Categories | No | Stopping Criterion | 17 interviews
Guest et al. (2006), Field Methods | Perceptions of social desirability bias in self-reported reproductive health behavior | 60 (2 countries) | Homogenous | Codes | No | Code Freq. Counts | 12 interviews for 88% of codes
Guest et al. (2020) [a], PLOS ONE | Health seeking behaviors of African American men in southeast US | 40 | Homogenous | Codes | Yes | Stopping Criterion | Depends on parameters; 11–14 interviews for ~90% of themes at 0% threshold
Guest et al. (2020) [a], PLOS ONE | Medical risks during pregnancy amongst mothers in southeast US | 48 | Homogenous | Codes | Yes | Stopping Criterion | Depends on parameters; 11–14 interviews for ~90% of themes at 0% threshold
Guest et al. (2020) [a], PLOS ONE | Women at high risk of HIV in Kenya and South Africa | 60 | Heterogeneous | Codes | Yes | Stopping Criterion | 16 interviews for ~80% of themes
Hagaman and Wutich (2017), Field Methods | Cultural understandings of water problems and solutions | 132 (4 countries) | Heterogeneous | Codes & Categories | Yes | Higher Order Groupings & Stopping Criterion | 16 interviews (top 3 themes within country); 20–40 interviews (meta-themes across countries)
Hennink et al. (2017), Qualitative Health Research | Influences on patient retention in HIV care | 25 | Homogenous | Codes | Yes | Code Freq. Counts & Code Meaning | 9 interviews for 91% of codes; 16–24 interviews for meaning saturation
Namey et al. (2016), Am. J. Evaluation | Health seeking behaviors of African American men in Durham, NC | 40 | Homogenous | Codes | Yes | Code Freq. Counts | 16 interviews for 90% of codes
Nascimento et al. (2018), Revista Brasileira de Enfermagem | Daily functions of school children with sickle cell disease | 15 | Homogenous | Categories | No | Code Meaning | 11 interviews
Turner-Bowker et al. (2018), Value in Health | Patient experiences of symptoms in acute or chronic health conditions | 26 | Homogenous | Categories | No | Comparative Method | 15 interviews for 92% of concepts
Young and Casey (2019) [a], Social Work Research | Challenges and strategies for engaging men in prevention of gender-based violence | 27 | Heterogeneous | Codes & Categories | Yes | Code Freq. Counts | 9 interviews for 90% of codes
Young and Casey (2019) [a], Social Work Research | Social workers’ perspectives of success in working with the US justice system | 15 | Homogenous | Codes & Categories | Yes | Code Freq. Counts | 9 interviews for ~90% of codes

Type of Data Used: Focus Group Discussions
Coenen et al. (2012) [a], Qual. Life Research | Daily life challenges for patients with rheumatoid arthritis | NS | Homogenous | Categories | No | Higher Order Groupings | 5 groups (inductive & deductive)
Guest et al. (2016), Field Methods | Health seeking behaviors of African American men in Durham, NC | 40 groups | Homogenous | Codes | Yes | Code Freq. Counts | 4.3 groups (mean) or 3–6 groups for 90% of codes
Hancock et al. (2016), The Qualitative Report | Experiences of male registered nurses seeking employment | 1 group (asynchronous) | Homogenous | Codes & Categories | No | Stopping Criterion | 1 asynchronous online group
Hennink et al. (2019), Qualitative Health Research | Design a lifestyle intervention for diabetes in South Asian Americans | 10 groups | Homogenous | Codes | Yes | Code Freq. Counts & Code Meaning | 4 groups for 94% of codes; 2 per strata for meaning saturation
Morse et al. (2014), Society & Natural Resources | Important places for recreation/livelihoods | 19 groups | Heterogeneous | Codes | Yes | Code Freq. Counts | 16 groups for 90% saturation at all hotspots
Young and Casey (2019) [a], Social Work Research | Influences on adolescent bystander’s responses to dating violence and bullying | 12 groups | Heterogeneous | Codes & Categories | Yes | Code Freq. Counts | 7 groups for complete themes

Type of Data Used: Free Lists
Weller et al. (2018), PLoS ONE | Free listing of items in topical domains | Free-Lists in 28 datasets | Varies by study | Items | No | Most Salient Items | 10 interviews for 95% of salient items (named by 20% of participants)

Note: where different levels of saturation are identified, saturation closest to 90% is presented here. Where percentage saturation is not specified, this is due to authors not indicating this or using another measure to determine saturation (e.g. stopping criterion, specific number of repetitions of a code).
NS - not stated.
[a] These studies report results of multiple data sets separately and are included in separate rows.
[b] Codes refers to single-level issues in data; categories refers to any higher order groupings of codes.
[c] See Table 3 for description of these categories.

Table 2
Articles estimating saturation through statistical modeling.

Fofana et al. (2020), PLOS ONE
Data Application: Statistical model tested on empirical dataset of interviews (n = 12)
Strategy to Assess Saturation: Uses set theory and partial least squares regression to estimate saturation
Parameters and Assumptions: $X_j$ is the vector of the number of times each theme is coded in the j-th interview; $B_{PLS}$ is the vector of regression coefficients; $E$ is the matrix of residuals
Suggested formula for saturation: $(X_{j+1} \ldots X_n) = (X_1 \ldots X_j)\, B_{PLS} + E$

Fugard and Potts (2015), Int. J. Soc. Res. Methodology
Data Application: Hypothetical model based on interviews but not tested on empirical data
Strategy to Assess Saturation: Uses negative binomial probability distribution to estimate sample needed to reach a certain power (e.g., 80% probability to identify a theme) based on several parameters
Parameters and Assumptions: Assumes random sample. Estimates sample size based on population theme prevalence (known probability of issue/theme in the population of interest) of least prevalent theme, desired number of instances in the data, and desired power.
Suggested formula for saturation: Various outcomes are provided based on a range of values for model inputs

Galvin (2015), J. Building Engineering
Data Application: Hypothetical model based on interviews but not tested on empirical data
Strategy to Assess Saturation: Uses binomial distribution to answer 5 research questions; the most relevant is RQ3: How many interviews to have 95% probability of theme emerging?
Parameters and Assumptions: Assumes random sample. P = probability theme arising in interview; R = proportion of theme in population; n = # interviews
Suggested formula for saturation: $n = \frac{\ln(1 - P)}{\ln(1 - R)}$

Lowe et al. (2018), Field Methods
Data Application: Statistical model tested on empirical datasets including literature surveys (n = 25), focus groups (n = 3), and interviews (n = 11)
Strategy to Assess Saturation: Develops saturation index using generalized estimating equations
Parameters and Assumptions: R = prevalence of a theme in population; P = particular saturation; n = # observations. Accounts for statistical dependency between observations and likelihood of researcher identifying theme. Assumes order of observations does not influence themes identified. Assumes random sample
Suggested formula for saturation: $n = \frac{P(R - 1)}{R(P - 1)}$

Rowlands et al. (2016), J. Computer Inf. Systems
Data Application: Statistical model tested on empirical data of interviews (3 studies: n = 30, 30, 24)
Strategy to Assess Saturation: Calculate thematic saturation using lognormal distribution with chosen confidence level
Parameters and Assumptions: Based on concept analysis using Leximancer program. $\bar{x}^{*}$ is the geometric mean from the lognormal fit; $s^{*}$ is the multiplicative standard deviation from the lognormal fit
Suggested formula for saturation: For 95% confidence, lognormal expression $= \bar{x}^{*} \cdot (s^{*})^{2}$

Van Rijnsoever (2017), PLOS ONE
Data Application: Hypothetical model based on various data types (e.g., interviews, focus groups, documents) but not tested on empirical data
Strategy to Assess Saturation: Uses simulations based on lognormal distribution and 11 parameters
Parameters and Assumptions: Accounts for random and purposive samples, as well as minimal and maximal information from observations.
Suggested formula for saturation: Various outcomes are provided based on a range of values for model inputs
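
As a rough illustration of the binomial reasoning in the Galvin (2015) row of Table 2, the sketch below reads that row's formula as n = ln(1 − P) / ln(1 − R), which follows from requiring that a theme held by a proportion R of a randomly sampled population appears at least once in n interviews with probability P, i.e. 1 − (1 − R)^n ≥ P. The function name `interviews_needed` and the example values are assumptions made for illustration; they are not taken from the reviewed article.

```python
import math


def interviews_needed(p_detect: float, prevalence: float) -> int:
    """Smallest n with 1 - (1 - prevalence)**n >= p_detect.

    p_detect   -- desired probability that the theme arises at least once (P)
    prevalence -- assumed proportion of the population holding the theme (R)
    Assumes independent interviews drawn at random, as the table row states.
    """
    if not (0 < p_detect < 1 and 0 < prevalence < 1):
        raise ValueError("p_detect and prevalence must lie strictly between 0 and 1")
    # n = ln(1 - P) / ln(1 - R), rounded up to a whole interview
    return math.ceil(math.log(1 - p_detect) / math.log(1 - prevalence))


# Hypothetical inputs: a theme held by 20% of the population, 95% detection probability
print(interviews_needed(0.95, 0.20))  # 14
```

Under these assumptions, rarer themes drive the estimate up quickly, which is consistent with the observation that such formulas depend on the prevalence of the least common theme of interest.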

saturation using this approach with the code frequency approach.

3.1.2. Statistical models

Table 2 summarizes six articles that used statistical modeling to estimate saturation. These articles used a different approach than those summarized above: they developed a formula to estimate the sample size needed for saturation, which may be used prior to data collection to inform study design. Several formulas were based on similar parameters, such as prevalence of a theme in a population or the desired instances of a theme in data (Fugard and Potts, 2015; Galvin, 2015; Lowe et al., 2018), while others used a lognormal distribution (Rowlands et al., 2016; Van Rijnsoever, 2017) or set theory (Fofana et al., 2020). Many of these studies assumed a random sample, while one accounts for both random and purposive samples (Van Rijnsoever, 2017). Most formulas were developed for interview data, while two articles discussed estimating saturation for various forms of data, including interviews, focus groups, documents, and literature surveys. Half of the formulas were then applied to empirical datasets.

3.2. Sample size for saturation

Fig. 2 shows sample sizes for saturation from empirically based tests using in-depth interview data. The results for each dataset used in the tests (n = 16) are shown as separate data points. Where results are reported at different sample sizes, this is depicted with a line from the lowest to highest sample size reported, and the parameters influencing this range are noted. Where authors report different levels of saturation, saturation closest to 90% is shown for comparability across studies.


Table 3
Strategies to assess saturation in empirical tests.

Type of Approach: Code Frequency Counts
Description of Approach: This approach involves reviewing each interview or focus group transcript and counting the number of new codes in each successive transcript or set of transcripts, until the frequency of new codes diminishes with few or no more codes identified. Several articles additionally randomized the order of data to assess the influence of sequential bias on saturation. Some articles added additional elements to the code frequency counts, such as batch comparison, a stopping criterion or saturation of higher order groupings of data, as outlined below.
Articles Reporting Strategy: Ando et al. (2014); Guest et al. (2006, 2016); Morse et al. (2014); Namey et al. (2016); Hennink et al. (2017 [a], 2019 [a]); Constantinou et al. (2017) [a]; Young and Casey (2019)

Type of Approach: Comparative Method
Description of Approach: This approach adds a more structured comparison to the code frequency count approach above. It involves reviewing data in pre-determined batches, such as quartiles of data (instead of reviewing each interview separately) and listing all new codes in a saturation table for each batch of data. The subsequent quartile of data is then reviewed and compared to the first quartile to determine any new codes; this comparison of data batches continues until few or no new codes are identified, whereby saturation is achieved.
Articles Reporting Strategy: Turner-Bowker et al. (2018)

Type of Approach: Stopping Criterion
Description of Approach: This approach adds a stopping criterion to the code frequency count approach above. It involves reviewing an initial sample of interviews (e.g. 6 interviews) or focus groups to identify new codes, and using a pre-determined stopping criterion, which is usually the number of consecutive interviews/groups after the initial sample where no new codes are identified (e.g. 2 or 3 interviews with no new codes). Saturation is reached when no new codes are identified after the stopping criterion of x interviews after the initial sample, or the number of new codes is under a predetermined threshold (e.g. <5%). In other studies, the stopping criterion was based on repetitions of a code, such as identifying 3 or 5 instances of a particular code or theme.
Articles Reporting Strategy: Francis et al. (2010); Guest et al. (2020); Hagaman and Wutich (2017) [a]; Hancock et al. (2016)

Type of Approach: High-Order Groupings
Description of Approach: This approach uses a higher order grouping of codes in the code frequency count approach above. It involves counting higher-order groupings of codes such as meta-themes, salient themes or categories. For example, Coenen et al. (2012) counted conceptual categories. Hagaman and Wutich (2017) counted codes to determine the most prevalent codes in the data set, then randomized the interview order via bootstrapping to determine the average number of interviews needed to identify the most prevalent codes in data. Weller et al. (2018) focused on identifying saturation for the most salient items in data.
Articles Reporting Strategy: Coenen et al. (2012); Hagaman and Wutich (2017) [a]; Constantinou et al. (2017) [a]

Type of Approach: Code Meaning
Description of Approach: This approach does not focus on counting codes as the basis for determining saturation (as used in the approaches above); instead, achieving a full understanding of codes is the indicator of saturation. It involves reviewing an interview and noting each issue (or code) identified, then in subsequent interviews identifying whether any new aspects, dimensions, or nuances of that code are identified, until nothing new is identified and the code has reached saturation. Codes may reach saturation at a different point in the data set.
Articles Reporting Strategy: Hennink et al. (2017 [a], 2019 [a]); Nascimento et al. (2018)

[a] These articles used multiple approaches and are therefore listed twice.
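
To make the code frequency count and stopping criterion strategies in Table 3 concrete, the following is a minimal sketch rather than the procedure of any reviewed study. It assumes each interview has already been coded into a set of code labels. The function names, the default initial sample of 6 and run of 3, and the toy data are all hypothetical choices made for illustration. The last function shuffles the interview order repeatedly, loosely in the spirit of the randomization some articles used to check for sequential bias.

```python
import random
from statistics import mean


def new_codes_per_interview(coded_interviews):
    """Count the previously unseen codes in each interview, in the order given."""
    seen = set()
    counts = []
    for codes in coded_interviews:
        fresh = set(codes) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def saturation_point(coded_interviews, initial_sample=6, run_length=3):
    """Return the 1-based interview number at which saturation is declared.

    Assumed rule (hypothetical defaults): after the initial sample, stop at
    the end of the first run of `run_length` consecutive interviews that add
    no new codes. Returns None if that never happens.
    """
    run = 0
    counts = new_codes_per_interview(coded_interviews)
    for number, n_new in enumerate(counts, start=1):
        if number <= initial_sample:
            continue  # the initial sample only builds up the code list
        run = run + 1 if n_new == 0 else 0
        if run == run_length:
            return number
    return None


def mean_saturation_point(coded_interviews, n_shuffles=200, seed=0, **rule):
    """Average the saturation point over random interview orders."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_shuffles):
        order = list(coded_interviews)
        rng.shuffle(order)
        point = saturation_point(order, **rule)
        if point is not None:  # orders that never saturate are skipped
            points.append(point)
    return mean(points) if points else None


# Toy data invented for illustration: ten interviews, five distinct codes.
interviews = [
    {"cost", "access"}, {"access", "stigma"}, {"trust"}, {"cost", "trust"},
    {"stigma", "family"}, {"access"}, {"cost"}, {"family"},
    {"trust", "access"}, {"stigma"},
]

print(new_codes_per_interview(interviews))  # [2, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(saturation_point(interviews, initial_sample=4, run_length=3))  # 8
print(mean_saturation_point(interviews, initial_sample=4, run_length=3))
```

A threshold variant (for example, new codes below 5% of all codes identified so far) would replace the `n_new == 0` test. Code meaning saturation cannot be reduced to counts like these, because it depends on the analyst's judgment about whether new dimensions of a code have appeared.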

Results show that across 16 tests using various approaches to saturation, the sample size for saturation ranges between 5 and 24 interviews. The lowest sample size for saturation was 5 interviews (Constantinou et al., 2017), in a study with a homogenous study population that was intended to support survey findings and where saturation was sought in broad categories. Together, these study characteristics may explain reaching saturation at 5 interviews. The highest sample sizes for saturation were 20–40 (Hagaman and Wutich, 2017), where saturation of meta-themes across four countries was sought, and 24 (Hennink et al., 2017), where saturation was sought in the meaning of codes, including codes less central to the research question. These saturation goals require more data, which may support the higher sample sizes found for saturation. Excluding these outliers, most datasets reached saturation between 9 and 17 interviews, with a mean of 12–13 interviews, despite using different approaches to assess saturation. Most of these studies had a relatively homogenous study population and varied in their saturation goal of codes, categories, or a combination. Only three studies used a heterogeneous sample. Two of these studies reached saturation at a larger sample size than the mean (at 16 interviews), and one reached saturation at a smaller sample size (at 9 interviews). Therefore, we found no pattern in saturation by this characteristic. Similarly, it was difficult to identify any pattern of saturation by the order of data, since most tests did not compare saturation when analyzing data in the actual interview order with the randomized order. Those that did make a comparison found no difference or a slightly higher sample size for saturation in the random versus actual order of interviews. Both studies that used randomization and those that did not cover the full spectrum of sample sizes seen in our review.

Fig. 3 shows the sample size for saturation from six empirical tests using data from focus group discussions. For comparability, where various levels of saturation are reported, those closest to 90% are shown in the figure. Across all six tests, saturation was reached between 1 and 16 focus groups. Two tests are outliers and thus not comparable to others. At the lower end, Hancock et al. (2016) report on saturation in a single asynchronous, online focus group, and saturation is reported by day and participant. At the higher end, Morse et al. (2014) report reaching saturation at 16 groups; however, they focus on spatial locations rather than codes or themes, which may account for the higher sample size for saturation. The remaining four tests used similar definitions of saturation and reached saturation between 4 and 8 focus groups, with a mean of 5–6 groups. Most tests (4/6) had a homogenous study population but varied in their approach to assessing saturation and the saturation goal of codes or categories. In the two tests using heterogeneous samples, both reached saturation at sample sizes above the mean number of groups (at 7 and 17 groups).

In studies that developed statistical models for saturation that were applied to empirical data, the sample sizes for saturation were similar to those above (Table 2). For example, Rowlands et al. (2016) used the lognormal distribution to estimate saturation in three datasets of interviews, and results found the sample sizes for saturation at 95% confidence to be 10, 10, and 13 interviews. Fofana et al. (2020) used set theory and partial least squares regression to estimate saturation at 12 interviews when applied to an empirical dataset.

Fig. 2. Sample size for saturation in empirical tests with interview data.

4. Discussion

This systematic review sought to identify empirical studies that assess saturation, to identify sample sizes needed for saturation, strategies used to assess saturation, and guidelines we can draw from these studies. We identified 23 studies that empirically assessed saturation, with 80% published since 2014. We identified two different approaches to assess saturation: studies that used empirical data and those that used statistical models.

One approach to assessing saturation focused on developing statistical models to estimate sample sizes for saturation prior to data collection. While we applaud efforts to estimate saturation a priori, many of the formulas developed are based on implicit assumptions that do not align with the conduct of qualitative research, thereby significantly limiting their utility. Many of these studies use probability-based assumptions, such as having a random sample and knowing the prevalence of a theme in the broader population or the desired instances of a theme in data. Moreover, researchers often do not know these parameters prior to conducting a study, nor is prevalence of items an important focus of qualitative research. Since a statistical formula may be seen as akin to a power calculation familiar to quantitative researchers, we feel that this may provide a misleading veil of scientific authenticity that ultimately cannot be achieved given the misalignment of assumptions with qualitative research. Given our concerns about these approaches, we do not consider them further.

A second approach to assess saturation used empirical data. In all 16 tests of saturation with data from in-depth interviews, saturation was reached in under 25 interviews, more specifically between 9 and 17 interviews excluding outliers. Despite using different approaches to assess saturation, different datasets, varying saturation goals (codes vs categories), and homogenous and heterogeneous study populations, studies still reached saturation within a narrow range of interviews. This demonstrates strong external reliability across the different approaches. Across all tests, an average of 12–13 interviews reached saturation, which is remarkably similar to findings from Guest et al. (2006), one of the first studies to empirically assess saturation, which reported saturation at 12 interviews. We found no clear pattern in saturation by study characteristics, such as homogeneity of the study population, use of randomization, or saturation goal, largely because few studies actually assessed these parameters in their approach. In six tests using data from focus group discussions, saturation was reached by 4–8 groups, a similarly narrow range. Studies using demographic stratification, heterogeneous samples, and broader saturation goals (e.g., code meaning, all themes vs main themes) needed more groups to reach saturation. However, we are cautious about drawing conclusions regarding the influences of these characteristics without more studies with focus group data to compare. Overall, these findings provide much-needed empirical evidence of sample sizes for saturation for different qualitative methods. Despite convergence of saturation within a specific range of interviews or focus groups, we caution not to use these findings as generic sample sizes for any qualitative study using these methods, or to justify poorly designed or executed qualitative studies, as we discuss below. Instead, we recommend using these results as guidance to consider alongside the specific study characteristics when estimating the sample size for a qualitative study.


Fig. 3. Sample size for saturation in empirical tests with focus group discussion data.

4.1. Implications for research

The results of our systematic review have several important implications. We focus here only on implications of empirically based studies. These results provide empirical guidance regarding adequate sample sizes for saturation when using interviews and focus group discussions, which can be useful when developing qualitative research proposals. The majority of empirically based studies in our review had a homogenous study population and focused research objectives, so these results cannot be confidently extrapolated to studies using different types of samples or broader goals. Therefore, we recommend using these results as a starting point to identify a potential range of interviews or focus groups, then refining the sample size by considering the study characteristics (e.g., study goals, nature and complexity of phenomenon studied, instrument structure, sampling strategy, stratification of sample, researcher’s experience in qualitative research, saturation goal, and degree of saturation sought) (Baker and Edwards, 2012; Galvin, 2015; Morse, 1995; see Hennink et al., 2017 for fuller discussion on using study parameters to estimate saturation). These considerations will not only lead to a more tailored sample size for each particular study but also provide clearer justification for the proposed sample size, thereby adding rigor.

Our results also provide researchers with strong empirical evidence to refute the common critique that qualitative sample sizes are ‘too small’, implying that they are ineffective, although no evidence is usually given for these claims. Our results can be used to demonstrate that ‘small’ sample sizes are effective for qualitative research and to show why they are effective – because they are able to reach saturation, the long-held benchmark for an adequate sample size in qualitative research. Furthermore, our results show what a ‘small’ sample actually is, by providing a range of sample sizes for saturation in different qualitative methods (e.g., 9–17 interviews or 4–8 focus groups). This is important because general advice on sample sizes for qualitative research usually suggests higher sample sizes than this. Reviews of textbooks on qualitative research methodology found that sample size recommendations vary widely, for example 5–60 interviews (Guest et al., 2006; Constantinou et al., 2017; Hagaman and Wutich, 2017) and 2 to 40 focus groups (Guest et al., 2016). More importantly, none of these recommendations is empirically based. Providing evidence-based sample size recommendations, with appropriate caveats, is important. Qualitative samples that are larger than needed raise ethical issues, such as wasting research funds, overburdening study participants, and leading to wasted data (Carlsen and Glenton, 2011; Francis et al., 2010), while samples that are too small to reach saturation reduce the validity of study findings (Hennink et al., 2017). Our results thus provide empirically based sample sizes for saturation that could be included as part of the guidelines in instructional textbooks on qualitative research.

Furthermore, Vasileiou et al. (2018) found that even some qualitative researchers characterized their own sample size as ‘small’, but this was “construed as a limitation couched in a discourse of regret or apology” (p. 12). Although these authors may be writing to the concerns of more positivist-oriented readers, few defended their ‘small’ sample on qualitative grounds. We encourage researchers to reflect on our results to more confidently justify their sample sizes using the principles of qualitative research rather than responding to the (mostly inappropriate) concerns of a more dominant positivist paradigm and their numerical expectations. Sample sizes in qualitative research are guided by data adequacy, so an effective sample size is less about numbers (n’s) and more about the ability of data to provide a rich and nuanced account of the phenomenon studied. Ultimately, determining and justifying sample sizes for qualitative research cannot be detached from the study characteristics that influence saturation. Our results echo others, that “rigorously collected qualitative data from small samples can substantially represent the full dimensionality of people’s experiences” (Young and Casey, 2019, p. 12) and therefore should not be viewed or presented as a limitation when evaluating the rigor of qualitative research.

Our results also provide empirical guidance on effective sample sizes for saturation for reviewers of qualitative research. This may help to refocus the routine practice of criticizing qualitative research for ‘small’ sample sizes so that reviewers may instead ask researchers to provide more explicit justifications for their sample size by asking, for example: “why do you have a sample of 40 interviews, when saturation can typically be reached in less than 25 with a homogenous study population such as yours?” Although we generally do not support using only numerical guidance in determining an effective sample size for qualitative research, these types of questions reflect a more informed critique that uses available empirical evidence on saturation to challenge researchers to be more transparent in justifying their sample sizes and using the characteristics of each individual study to do so. We therefore encourage qualitative researchers to provide fuller justifications of their sample sizes and urge reviewers of qualitative studies to apply these findings to provide more effective critiques of sample sizes for qualitative research. This may improve the quality of reporting and critiquing qualitative research and move away from often unsubstantiated critiques of ‘small’ sample sizes.

Our results also synthesize five distinct approaches to assess saturation, including several variations of code frequency counts and assessing code meaning. Qualitative researchers now have an array of strategies to assess saturation during data collection. Numerous reviews of qualitative studies have found that saturation is often used to justify a sample size, but there was an overwhelming lack of transparency in how it was assessed or determined (Carlsen and Glenton, 2011; Francis et al., 2010; Marshall et al., 2013; Vasileiou et al., 2018). This lack of transparency is concerning, particularly given that saturation is hailed as an indicator of quality in qualitative research. It suggests that saturation is being used as a “mantle of rigor” (Constantinou et al., 2017, p. 2) to provide the appearance of rigor that is largely unsubstantiated by researchers and left unchallenged by reviewers of qualitative studies. To some extent, this lack of transparency may reflect the absence of guidance on assessing saturation. Our review has synthesized a range of strategies that can be used by qualitative researchers to become more transparent in reporting how saturation was assessed, whether it was reached, or the extent to which it was achieved in a study. Researchers can now specify a strategy for assessing saturation and the criteria on which it was determined (e.g., a stopping criterion, cumulative frequency graphs, percentage of codes, code meaning). Such greater transparency has clear benefits for the rigor of individual studies but also for the quality of qualitative research as a whole. Greater transparency regarding saturation improves reproducibility of the research and raises expectations on how to report saturation, all of which move away from using generic and unsupported statements such as ‘data were collected until saturation’. In addition, journals publishing qualitative research play a critical role in encouraging transparent reporting of saturation. Vasileiou et al. (2018) found that reporting of sample size justifications aligned with particular academic journals, suggesting that journal requirements may encourage norms of greater transparency in reporting saturation. Journal reviewers may also encourage transparency by asking researchers, for example: ‘how did you assess saturation?’, ‘how do you know you reached saturation?’, or ‘to what extent was saturation reached – in core codes, categories, meaning etc.?’. Such requests signal that more transparent, nuanced, and rigorous reporting of saturation is expected. This should go beyond simple check-list requirements, which may simply perpetuate vague reporting that saturation was reached without study-specific details on how it was determined.

Our study has some potential limitations. We included only studies that were published in English and were outside the epistemological approach of grounded theory, and we used limited search terms for specific qualitative methods but included common methods. While these criteria may have excluded other published tests of saturation, we believe our search criteria were broad enough to capture a significant body of research on the topic. Articles identified in our review focus overwhelmingly on health research and have similar conceptualizations of saturation. While this makes the studies more comparable, these results may not be applicable to other disciplines that may conceptualize saturation differently.

5. Conclusion

Saturation is considered the cornerstone of rigor in determining sample sizes in qualitative research, yet there is little guidance on its operationalization outside of grounded theory. In this systematic review, we identified studies that empirically assessed saturation in qualitative research, documented approaches to assess saturation, and identified sample sizes for saturation. We describe an array of approaches to assess saturation that demonstrate saturation can be achieved in a narrow range of interviews (9–17) or focus group discussions (4–8), particularly in studies with relatively homogenous study populations and narrowly defined objectives. Although our systematic review identified sample sizes for saturation, we found little empirically based research to determine how specific parameters influence saturation. Further research is needed on how specific parameters influence saturation, such as the study goal, nature of the study population, sampling strategy used (i.e. inductive vs fixed sampling), type of data, saturation goal, and other influences.

Credit author statement

Monique Hennink: Conceptualization, Methodology, Formal analysis, Investigation, Data curation, Writing – original draft, Writing – review & editing, Visualization.
Bonnie Kaiser: Conceptualization, Methodology, Formal analysis, Investigation, Data curation, Writing – original draft, Review and Editing, Visualization.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.socscimed.2021.114523.

References

Ando, H., Cousins, R., Young, C., 2014. Achieving saturation in thematic analysis: development and refinement of a codebook. Compr. Psychol. 3, 4.
Baker, S., Edwards, R., 2012. How many qualitative interviews is enough? Expert voices and early career reflections on sampling and cases in qualitative research. National Centre for Research Methods, Economic and Social Research Council (ESRC), United Kingdom.
Bryant, A., Charmaz, K. (Eds.), 2007. The SAGE Handbook of Grounded Theory. Sage, London.
Bryman, A., 2012. Social Research Methods, fourth ed. Oxford University Press, Oxford, UK.
Carlsen, B., Glenton, C., 2011. What about N? A methodological study of sample-size reporting in focus group studies. BMC Med. Res. Methodol. 11. Article 26.
Coenen, M., Coenen, T., Stamm, A., Stucki, G., Cieza, A., 2012. Individual interviews and focus groups with patients with rheumatoid arthritis: a comparison of two qualitative methods. Qual. Life Res. 21, 359–370. https://doi.org/10.1007/s11136-011-9943-2.
Constantinou, C., Georgiou, M., Perdikogianni, M., 2017. A comparative method for themes saturation (CoMeTS) in qualitative interviews. Qual. Res. 1–18.
Fofana, F., Bazeley, P., Regnault, A., 2020. Applying a mixed methods design to test saturation for qualitative data in health outcomes research. PLoS One 15 (6), e0234898.
Francis, J., Johnson, M., Robertson, C., Glidewell, L., Entwistle, V., Eccles, M., Grimshaw, J., 2010. What is an adequate sample size? Operationalising data saturation for theory-based interview studies. Psychol. Health 25, 1229–1245.
Fugard, A.J., Potts, H.W., 2015. Supporting thinking on sample sizes for thematic analyses: a quantitative tool. Int. J. Soc. Res. Methodol. 18, 669–684.


Galvin, R., 2015. How many interviews are enough? Do qualitative interviews in building energy consumption research produce reliable knowledge? J of Building Engineering 1, 2–12.
Glaser, B., Strauss, A., 1967. The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine, Chicago.
Guest, G., Bunce, A., Johnson, L., 2006. How many interviews are enough? An experiment with data saturation and variability. Field Methods 18, 59–82.
Guest, G., Namey, E., Chen, M., 2020. A simple method to assess and report thematic saturation in qualitative research. PLoS One 15 (5), e0232076.
Guest, G., Namey, E., McKenna, K., 2016. How many focus groups are enough? Building an Evidence Base for Non-Probability sample sizes. Field Methods 29 (1), 3–22.
Hagaman, A.K., Wutich, A., 2017. How many interviews are enough to identify metathemes in multisited and cross-cultural research? Another perspective on Guest, Bunce, and Johnson’s (2006) landmark study. Field Methods 29, 23–41.
Hancock, M., Amankwaa, L., Revell, M., Mueller, D., 2016. Focus group data saturation: a new approach to data analysis. Qual. Rep. 21, 11.
Hennink, M., Kaiser, B., Marconi, V., 2017. Code saturation versus meaning saturation: how many interviews are enough? Qual. Health Res. 27 (4).
Hennink, M., Kaiser, B., Weber, M.B., 2019. What influences saturation? Estimating sample sizes in focus group research. Qual. Health Res. 29 (10), 1483–1496.
Kerr, C., Nixon, A., Wild, D., 2010. Assessing and demonstrating data saturation in qualitative inquiry supporting patient-reported outcomes research. Expert Rev. Pharmacoecon. Outcomes Res. 10, 269–281. https://doi.org/10.1586/erp.10.30.
Lowe, A., Norris, A.C., Farris, A.J., Babbage, D.R., 2018. Quantifying thematic saturation in qualitative data analysis. Field Methods 30 (3), 191–207.
Marshall, B., Cardon, P., Poddar, A., Pontenot, R., 2013. Does sample size matter in qualitative research? A review of literature in IS research. J. Comput. Inf. Syst. 11–22.
Moher, D., Liberati, A., Tetzlaff, J., Altman, D.G., PRISMA Group, 2009. Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement. PLoS Med. 6 (7), e1000097.
Morse, J., 1995. The significance of saturation [Editorial]. Qual. Health Res. 5, 147–149. https://doi.org/10.1177/104973239500500201.
Morse, J., 2015. Data were saturated . . . [Editorial]. Qual. Health Res. 25, 587–588. https://doi.org/10.1177/1049732315576699.
Morse, W.C., et al., 2014. Exploring saturation of themes and spatial locations in qualitative public participation geographic information systems research. Soc. Nat. Resour. 27 (5), 557–571.
Namey, E., Guest, G., McKenna, K., Chen, M., 2016. Evaluating bang for the buck: a cost effectiveness comparison between individual interviews and focus groups based on thematic saturation levels. Am. J. Eval. 37 (3), 425–440.
Nascimento, L.C.N., et al., 2018. Theoretical saturation in qualitative research: an experience report in interview with schoolchildren. Rev. Bras. Enferm. 71 (1), 228–233.
O’Reilly, M., Parker, N., 2013. ‘Unsatisfactory saturation’: a critical exploration of the notion of saturated sample sizes in qualitative research. Qual. Res. 13 (2), 190–197.
Rowlands, T., Waddell, N., McKenna, B., 2016. Are we there yet? A technique to determine theoretical saturation. J. Comput. Inf. Syst. 56 (1), 40–47.
Sandelowski, M., 1995. Sample size in qualitative research. Res. Nurs. Health 18, 179–183.
Turner-Bowker, D.M., Lamoureux, R.E., Stokes, J., Litcher-Kelly, L., Galipeau, N., Yaworsky, A., Shields, A.L., 2018. Informing a priori sample size estimation in qualitative concept elicitation interview studies for clinical outcome assessment (COA) instrument development. Value Health 21 (7), 839–842.
Van Rijnsoever, F.J., 2017. (I Can’t Get No) Saturation: a simulation and guidelines for sample sizes in qualitative research. PLoS One 12, e0181689.
Vasileiou, K., Barnett, J., Thorpe, S., Young, T., 2018. Characterizing and justifying sample size sufficiency in interview-based studies: systematic analysis of qualitative health research over a 15-year period. BMC Med. Res. Methodol. 18, 148.
Weller, S., Vickers, B., Bernard, R., Blackburn, V., Borgatti, S., Gravlee, C., Johnson, J., 2018. Open-ended interview questions and saturation. PLoS One 13 (6), e0198606.
Young, D.S., Casey, E.A., 2019. An examination of the sufficiency of small qualitative samples. Soc. Work. Res. 43 (1), 53–58.
