YanSchuetzeEglington 2020 Preprint
Veronica X. Yan
Brendan A. Schuetze
Luke G. Eglington
University of Memphis
Author note: Portions of this research were presented at the 60th annual meeting of the
Psychonomic Society in Montreal, Canada. We thank Matthan Moy and Kimberly Nguyen for
Educational Psychology, The University of Texas at Austin, 1912 Speedway STE 504, Austin
Abstract
Numerous studies have shown that an interleaved study sequence of examples (e.g.,
often yields superior category learning. Some explanations for sequencing effects center on
attentional processes, others focus on memory processes, and the two are often pitted against
each other. We propose a new integrative two-stage framework for sequencing effects in
category-learning and support this framework using a meta-analytic approach. We show, using a
significantly more variance in sequencing effects than attentional factors alone. This approach
also allowed us to examine the nature of the existing evidence, which revealed inferential
limitations due to how researchers typically design experiments. We provide suggestions for
future research on sequencing effects in category learning that would both test the two-stage
A Review of the Interleaving Effect: Theories and Lessons for Future Research
In our everyday lives, the need to make rapid decisions is supported by our fundamental
capacity to categorize the world around us (e.g., objects, people, experiences, actions). While
there are many different approaches researchers have taken to examine what processes are
involved in category learning, an emerging body of literature has focused on how the study and
practice of multiple categories should be sequenced. This research has revealed a strikingly
‘blocking’), it can be better to mix up and switch between different categories (referred to as
‘interleaving’). In the present paper, we conduct both a critical review and a quantitative review
of the interleaving effect in category learning. First, we review the existing theories, finding that
they largely fall into one of two categories: those relating to attention and those relating to
memory. Next, we propose a new two-stage framework (attention, then memory) that integrates
these existing theories. Using a meta-analytic approach, we examine the extent to which the
existing literature is able to test this two-stage framework. We show that including both
attentional and memory factors explained variance in the interleaving effect better than
attentional factors alone. However, as we also discovered, large gaps and confounded
experimental designs in the category literature preclude strong tests of interactions that would be
predicted from our new integrative two-stage model. We conclude with suggestions for future
on one thing at a time. In education, this assumption is manifest in different ways. For example,
TWO-STAGE FRAMEWORK OF SEQUENCING EFFECTS 3
the concepts in our textbooks are blocked chapter by chapter, homework or practice worksheets
often focus on one problem type at a time, and accelerated learning programs where students
take just one short, intensive course at a time are gaining in popularity (e.g., intensive coding
“bootcamps” like HackReactor, or Colorado College’s “Block Plan” in which students only take
one class at a time, five days per week for three and a half weeks). One of the more
counterintuitive findings in the cognitive psychology literature, however, has been that
interleaving study and practice of different concepts can benefit long-term learning.¹ The
interleaving effect refers to the finding that alternating
between different concepts can benefit learning more than focusing on one concept at a time.
One important initial step is to establish whether a particular effect exists and to
understand whether the effect is generalizable beyond the initially studied context. A recent
meta-analysis of the interleaving effect (Brunmair & Richter, 2019) found that there was a
significant overall effect of interleaving (Hedges' g = .42). They also found, however, significant
heterogeneity of the effect size. Separating the existing literature by stimuli sets (paintings,
naturalistic photos, artificial images, math-related tasks, expository texts, words, and tastes), the
effect appeared reliable except in the case of expository texts and words. As an instructional
practice, interleaving promises huge potential for transforming learning. However, it is important
important goal is to be able to understand the processes that give rise to an effect and how they
¹ While recently popular in the educationally-relevant cognitive literature as it relates to concept
and category learning, the idea that intermixing training confers benefits is not new. See
contextual interference from early verbal learning research (Battig, 1966, 1978) and from motor
skills training research (Lee, 1992; Shea & Morgan, 1979) and catastrophic interference from
connectionist network research (McClelland, McNaughton, & O’Reilly, 1995; McCloskey &
Cohen, 1989).
relate to the heterogeneity of an effect. Understanding the underlying mechanisms not only allows
researchers to better predict the boundaries of generalizability, but can also reveal more
While there are a surprising number of effects that generalize across both verbal and
motor skills learning tasks, the extant literature has been careful to separate discussion of motor
skills and verbal tasks (e.g., Soderstrom & Bjork, 2015; Seger & Miller, 2010). Less care,
however, has been taken with separating out the processes that underlie educationally-relevant
cognitive (i.e., non-motor) tasks. Different educationally-relevant tasks also involve
different types of learning processes. For example, differentiating cancerous cells from non-
cancerous cells requires the ability to recognize and categorize different types of cells. A failure
contrast, being able to apply the appropriate mathematics formula to a particular mathematics
problem requires that learners are not only able to recognize and categorize the type of problem
that is presented, but also to be able to retrieve the correct procedure or formula for that type of
problem, and to be able to correctly implement the procedural steps in the appropriate order. A
failure to correctly solve a mathematics problem could be due to a problem of discrimination and
Brunmair and Richter (2019) chose not to include motor learning studies (correctly, in
our view), as the evidence for dissociations between inductive conceptual learning and procedural
learning is strong enough to warrant separate analyses. Nevertheless, Brunmair and Richter also
included several tasks outside of the focus of our present review. In particular, their inclusion of
tasks involving taste memory is not mirrored in the inclusion standards of the present meta-
analysis, as taste memory most likely evolved to meet a set of specialized evolutionary
constraints — for example maximizing enjoyment and nutrition, while maintaining aversion to
toxins — distinct from the functions served and brain regions instantiated by other memorial
processes (Lin, Arthurs, & Reilly, 2017; Núñez-Jaramillo et al., 2010; Palmerino, Rusiniak, &
Garcia, 1980). They also grouped stimuli by surface features rather than by underlying processes.
For example, under mathematical tasks, they combined both studies in which category
discrimination is emphasized (e.g., being able to recognize when different statistical tests are
appropriate, Sana et al., 2017) with studies which also require procedural skills such as problem
solving (Rohrer et al., 2014). Finally, they included some studies that were about memory rather
than about category learning (e.g., Hausman & Kornell, 2014), or in which blocked and
interleaved schedules were not matched (e.g., studies that used adaptive algorithms, Rau et al.,
2010).
In Brunmair and Richter (2019), the authors examined potential moderators of attention-
based accounts of the interleaving effect: namely, the discrimination hypothesis (that interleaving
enhances attention to features that discriminate between different categories, Kang & Pashler,
2012; Kornell & Bjork, 2008) and its extension, the sequential attention hypothesis (that
interleaving draws attention to between-category differences and that blocking draws attention to
within-category similarities; Carvalho & Goldstone, 2015, 2017). The discrimination hypothesis
argues that the benefit of interleaving examples from different categories is that the juxtaposition
of these different categories highlights the differences between them. Interleaving enhances
discriminative contrasts; blocking does not. The sequential attention theory (Carvalho &
Goldstone, 2017) extends the discrimination hypothesis, positing that the sequence in which
examples are studied influences the features on which a learner focuses their attention: Under
interleaved sequences, where examples from different categories are juxtaposed, learners'
attention is drawn to the features that help discriminate between the different categories (as in
discriminative contrast theory). Under blocked sequences, where different examples from the
same category are juxtaposed, learners' attention is drawn to the common features that define the
category.
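As a minimal sketch of the two study sequences described above, the Python snippet below builds a blocked and an interleaved schedule and counts how often adjacent examples come from the same versus different categories. The function names, the four category labels, and the strict round-robin form of interleaving are illustrative assumptions, not the procedure of any particular study.

```python
def blocked(categories, n_per_category):
    """All examples of one category, then all examples of the next."""
    return [c for c in categories for _ in range(n_per_category)]


def interleaved(categories, n_per_category):
    """Round-robin: cycle through the categories one example at a time."""
    return [c for _ in range(n_per_category) for c in categories]


def juxtapositions(schedule):
    """Count adjacent pairs drawn from the same vs. different categories."""
    same = sum(a == b for a, b in zip(schedule, schedule[1:]))
    return {"same": same, "different": len(schedule) - 1 - same}


def mean_lag(schedule):
    """Average number of intervening items between successive examples
    of the same category."""
    last_seen, lags = {}, []
    for position, category in enumerate(schedule):
        if category in last_seen:
            lags.append(position - last_seen[category] - 1)
        last_seen[category] = position
    return sum(lags) / len(lags)


# Hypothetical design: four categories, three examples each.
cats = ["A", "B", "C", "D"]
blk = blocked(cats, 3)
itl = interleaved(cats, 3)
print(juxtapositions(blk), mean_lag(blk))
print(juxtapositions(itl), mean_lag(itl))
```

Under these assumptions, the blocked schedule yields mostly same-category juxtapositions and no lag between repetitions, whereas the round-robin interleaved schedule yields only different-category juxtapositions and a lag equal to the number of categories minus one.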
These attention-based theories hold implications for the types of moderators that would
be important. Namely, the moderators that have been identified include category similarity,
temporal spacing, temporal juxtaposition, and active (vs. passive) learning. Category similarity is
thought to matter because the more similar the to-be-learned categories are to each other, the
greater the interleaving benefit should be; in cases where the need for between-category
discrimination is low or trivial, a blocking benefit might be obtained. Indeed, this pattern of
results is demonstrated in Carvalho and Goldstone (2014), who showed that high-similarity
categories were better learned when interleaved while low-similarity categories (where within-
category commonalities were also more difficult to notice) were better learned when blocked.
because these manipulations either make it easier or more difficult for learners to attend to the
important features (Birnbaum et al., 2013; Carvalho & Goldstone, 2014; Kang & Pashler, 2012;
Sana et al., 2017; Zulkiply & Burt, 2013): simultaneous presentations are more likely to facilitate
compare and contrast processes than sequential presentations, which in turn are more likely to
facilitate these processes than spaced presentations. Finally, another corollary of the attentional
bias hypothesis is that sequence effects should be larger under conditions of active learning than
under conditions of passive learning, as under active learning, participants may be more likely to
actively search for the key distinguishing features that discriminate between categories. On the
other hand, during a passive learning task, participants might try to create a “positive
characterization of each category” (Carvalho & Goldstone, 2015, p. 282), and hence benefit from
blocked sequencing.
Importantly, however, there are also memory-based accounts of the interleaving effect
that are missing from Brunmair and Richter’s analyses. In fact, the last decade of interest in the
interleaving effect arose as an extension of the spacing effect. The spacing effect (Carpenter,
2017; Cepeda et al., 2006) is one of the most robust effects in the cognitive psychology
literature, and is the finding that distributing repetitions over time leads to better learning than
massing those repetitions. Kornell and Bjork (2008) initially designed their experiments to be an
illustration of a situation in which spacing learning out over time should not be beneficial for
learning, hypothesizing that when category exemplars were spaced, it would make it difficult to
identify commonalities and abstract the categorical information. Instead—to their surprise—the
results revealed that spacing (or, what the literature now refers to as interleaving) category
a given category are inherently distributed from each other. Spacing theories of interleaving
extend the well-known benefits of spacing repetitions for long-term memory to the learning of
categories, and predict that there should be an optimal level of spacing or difficulty (Cepeda et
al., 2008). The study-phase retrieval hypothesis (Appleton-Knapp, Bjork, & Wickens, 2005;
Birnbaum, Kornell, Bjork, & Bjork, 2013; Thios & D’Agostino, 1976), for example, assumes
that prior presentations are retrieved and elaborated on at the time of subsequent presentations
(see also ‘reminding theory’ of spacing, Benjamin & Ross, 2011). In other words, spacing out
examples from a given category promotes retrieval or reminding of prior examples, which in turn
strengthens learning; in contrast, blocking presentation of examples from the same category does
not afford the forgetting that engages subsequent retrieval processes. Similarly, the retrieval
effort hypothesis (Pyc & Rawson, 2009) posits that the more effortful (but still successful)
sequences, the lack of spacing between presentations means that there is no need for retrieval or
elaboration, and hence, no memorial benefit. When intervals are too long between examples
from the same category, retrieval processes may fail, resulting in no additional learning benefit
(Appleton-Knapp, Bjork, & Wickens, 2005). What constitutes “too long,” however, can be
extends the study-phase retrieval theory for category learning: intervals between presentations of
examples from the same category allow the learners to forget the category-irrelevant details that
are specific to individual examples and to consolidate only the category-relevant details that are
present across category examples (Vlach, 2014; Vlach & Kalish, 2014).
These memory-based theories also hold implications for the types of moderators that
would be important. Namely, the effect may be moderated by the number of categories,
the number of study examples, and the retention interval between the end of study and the final test. Spacing
accounts predict that increasing the spacing between examples from a given category should
improve learning. Birnbaum et al. (2013) tested this theory directly, by varying the average
number of examples that intervened between two examples from the same category, while
holding discrimination variability constant. They found that large spacing (15 intervening
examples) led to better learning than did small spacing (three intervening examples). Most
studies have not manipulated spacing in this way, but rather tend to interleave all categories
together, which means that the number of categories can be used as a proxy for the spacing
between stimuli from the same category. Memory-based theories would also imply that other
important moderators include the number of presentations (i.e., examples or repetitions — the
more repetitions, the larger the interleaving effect; Pavlik & Anderson, 2005; Shaughnessy,
Zimmerman & Underwood, 1972; Underwood, 1970) and the retention interval (spacing effects
should become stronger at a delayed test compared to an immediate test; Cepeda et al.,
hypotheses have often been pitted against each other in the literature (e.g., Foster et al., 2019;
Kang & Pashler, 2012), the likely reality is that they are not mutually exclusive. We argue that
the two types of processes are not competing, but rather sequential. That is, drawing from
classical models of memory (e.g., multi-store model, Atkinson & Shiffrin, 1968), it is likely that
sequencing effects influence category learning in two stages: First, an attention-based stage of
category discovery in which attention must be drawn to the relevant features. In this stage,
sequencing effects are likely determined by category structure and similarity. Second, a memory-
based stage in which learners need to memorize the cluster of features and the association
between features and category labels: sometimes interleaving may be more beneficial; other
times blocking may be more beneficial. In this stage, sequencing effects are likely to follow the
same patterns as spaced learning effects in memory. Here, interleaving should be more
beneficial. Finally, attention and memory processes are likely cyclical throughout the process of
learning: The features that we attend to are more likely to be encoded into our memories, and our
memories of past experiences likely shape how we attend to future stimuli (Kim & Rehder,
2011).
Figure 1 is a schematic that illustrates some of the ways in which blocking or interleaving
benefits may arise from this two-stage framework. What this schematic also illustrates is why the
two-stage framework would make different predictions and how it situates different findings in the
literature. In particular, the schematic illustrates the problem with pitting attention-based theories
against memory-based theories. For example, the effect of inserting fillers in between to-be-
studied category examples is often cited as evidence in favor of the attention-based theories
(discriminative contrast or sequential attention theory) and as evidence against the memory-
based theories. These studies show that if spacing disrupts juxtaposition and hence, contrastive
processes, between examples from different categories (e.g., inserting trivia question fillers
between study examples), then the interleaving benefit is eliminated (Birnbaum et al., 2013;
Sana, Yan, & Kim, 2017). Situated within the two-stage framework, however, these results
support the role of attention-based processes but do not constitute evidence against memory-
based theories. Rather, if learning is gated at the attentional stage, then we would not expect
spacing benefits to emerge at the subsequent memory stage either. In fact, in a different study,
Birnbaum et al. (2013) found that when learning is not gated at the attentional stage (i.e., fillers
were not inserted to disrupt discriminative contrasts between category examples), larger lags
between examples of a given category lead to greater learning. This benefit cannot be attributed
the learner were the same between the short-lag condition and the long-lag condition; rather the
Conversely, if the to-be-learned categories are each distinct enough and directing
learners’ attention to the shared and discriminating features is unnecessary, then we might expect
to find only benefits of spaced (interleaved) practice on memory. Foster and colleagues (2019), for
blocked sequence. They chose a “critical” problem type and then varied whether the other three
problem types were similar or dissimilar to the critical problem type. They found that the
interleaving benefit was the same regardless of category set and argued that this supported the
results differently in our two-stage framework: For the similar set in which the categories were
required discrimination, we propose that interleaving benefited learning at the attentional stage
(as well as potentially the memory stage); for the dissimilar set in which categories were distinct,
we propose that the attentional stage was unnecessary and hence, we see only benefits of spacing
at the memory stage. That is, the same outcome—better performance following interleaved
The implication of this two-stage framework is that researchers should pay more attention
to the specific processes that might be required of any given type of learning content. For the
for memory-related experimental features and possible interactions. With a more complex model
guiding predictions of sequencing effects, this raises another challenge for conducting a meta-
analysis: It raises the need for a greater sampling of the experimental space. The goals of this
present paper are therefore two-fold: (a) to examine both attention-based and memory-based
moderators of the interleaving effect and (b) to identify the nature of existing studies and the
gaps that currently exist in the literature. We begin by conducting a quantitative review.
Method
parameters differently to those of Brunmair and Richter (2019) to focus more closely on the
underlying processes and limiting our analysis only to studies that tested category learning. Most
importantly, while Brunmair and Richter (2019) coded only moderators that tested attention-
based accounts, we also coded moderators that test memory-based accounts (see Tables S1 and
S2 in Supplemental Online Materials for a summary of the moderators, the number of effect
sizes, the sample size and the average effect sizes for each level of the moderators).
To keep the focus on our theoretical and methodological arguments about future
directions for research, below, we support these arguments through the reporting of key meta-
analytic findings, particularly in the form of reporting overall effects and multiple regression
model comparison. Our method was pre-registered (https://osf.io/gsvu6/) and the full details of
our meta-analysis method and the detailed results are reported in the Supplemental Online
Materials. In other words, the main contribution of this review is not the specific outcome of the
systematic quantitative review per se, but rather how the gaps can inform future research.
Results
A random effects model computed with the metafor R package (Viechtbauer, 2010; R
Core Team, 2018) found that the overall effect size of the interleaving effect, nesting effects
(k = 205) within publications (k = 61), was significantly larger than zero, g = 0.44, 95% CI =
[.34, .53], p < .001. This effect size is very similar to that found in Brunmair and Richter (2019),
g = 0.42. It is important to remember that the overall effect size is computed with all studies,
including those that would be predicted to find a benefit of blocking. Unsurprisingly, we also
found a significant heterogeneity of effect sizes, Q(204) = 1213.03, p < .001. From our
univariate random effects model, we calculated I² = 86.70, indicating 86.70 percent of the
variance in outcomes was due to heterogeneity (Viechtbauer, 2019). The forest plot is presented in
Figure S1 in the Supplemental Online Materials, and the funnel plots are presented in Figure S2.
We found no evidence of publication bias (see Supplemental Online Materials for full details).
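To make the pooled estimate and the heterogeneity statistics concrete, the Python sketch below computes a random-effects summary, Cochran's Q, and I² for a handful of effect sizes. It uses the simple DerSimonian-Laird estimator and the Q-based I², which is a swapped-in simplification: the analysis reported here was run in metafor with effects nested within publications, and this sketch ignores that nesting, so it should not be expected to reproduce the reported values. The effect sizes and sampling variances are hypothetical.

```python
import math


def random_effects(effects, variances):
    """DerSimonian-Laird random-effects pooling with Cochran's Q and I^2.

    effects   : list of effect sizes (e.g., Hedges' g), one per study
    variances : list of their sampling variances
    """
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at 0
    w_re = [1.0 / (v + tau2) for v in variances]  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return {
        "g": pooled,
        "ci": (pooled - 1.96 * se, pooled + 1.96 * se),  # normal-theory 95% CI
        "Q": q,
        "df": df,
        "tau2": tau2,
        "I2": i2,
    }


# Hypothetical effect sizes and sampling variances, for illustration only.
toy_g = [0.10, 0.35, 0.60, 0.80, -0.05]
toy_v = [0.02, 0.03, 0.02, 0.05, 0.04]
print(random_effects(toy_g, toy_v))
```

The Q-based I² in this sketch expresses the share of total variability that exceeds what sampling error alone would produce; a multilevel model partitions that variability differently, which is one reason the two approaches can give different percentages.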
There were substantial correlations between the different moderators that we coded
for our meta-analysis. The Cramer’s V correlations are reported in the correlation matrix depicted
in Figure 2. Due to concerns with the confounded nature of moderators assessed in this meta-
analysis, it would be inappropriate to rely on single moderator regression models, as these could
interleaving relative to blocking using the R (R Core Team, 2018) package metafor (Viechtbauer,
experiments nested within paper. We report the relative fits of these models using likelihood
tests, AIC, and BIC, when appropriate. These multiple regression analyses are informed by the
² Although separate tests of individual moderators may be significant, significance does not
necessarily mean that the moderators, themselves, explain unique variance. Given the correlated
nature of many of the moderators of interest, any single moderator model could actually reflect
the contributions of multiple, correlated moderators.
predominant theories of the effects of interleaving; however, they were not pre-registered, as we
had not anticipated the high amount of confounding seen in the present dataset, which precluded
single-moderator analysis. The results of each model and their fit indices are presented in Table
1.
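The likelihood tests used to compare nested models in this section yield a chi-square statistic whose degrees of freedom equal the number of added parameters. As a minimal sketch, the Python function below converts such a statistic into a p value using the closed-form upper tail that holds when the degrees of freedom are even. The function name is hypothetical, and the two statistics passed in are the ones reported in this section (32.95 on 4 degrees of freedom for the comparison of Models 1 and 2, and 5.63 on 2 degrees of freedom for the interaction model); nothing here refits the models themselves.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for positive even degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Statistics as reported in the text; p values are recomputed, not quoted.
p_model_2_vs_1 = chi2_sf_even_df(32.95, 4)
p_interaction = chi2_sf_even_df(5.63, 2)
print(p_model_2_vs_1 < 0.001)
print(round(p_interaction, 2))
```

For odd degrees of freedom the closed form does not apply, and a general chi-square distribution routine from a statistics library would be needed instead.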
Comparing the fit between Model 1 and Model 2 allows us to examine the contribution of
the attention-based factors. Consistent with the findings from Brunmair and Richter (2019),
incorporating attention-based factors (Model 2) led to a significantly better fit than incorporating
only experimental factors (Model 1), χ² = 32.95, df = 4, p < .001. Comparing the fit between
Model 2 and Model 3 allows us to examine whether adding memory-related factors adds to
variance explained, over and above the attentional factors. Overall, the best fitting model of the
present meta-analytic data was Model 3, representing the combined contributions of memory-
and attention-implicated moderators. Likelihood tests found that Model 3 fit better than Model 2,
Memory-Attention Interaction
We pre-registered that we would explore two interactions. One was the interaction between the
number of categories and supervised/unsupervised learning — however, there were not enough
studies in which unsupervised learning was used. The other interaction we pre-registered was an
In our meta-analysis, we treat the number of categories as a proxy for spacing between
examples from the same category. The two-stage framework would predict that benefits of
increased number of categories (a process that benefits memory) would be most apparent for
stimuli sets that are dissimilar, as those would not benefit from attention processes. We included
this one interaction term in addition to the predictors of Model 3. We found that this interaction
term was significant (b = 0.03, SE = .01, p = .017), indicating that the slope representing the
benefit of the number of categories for the interleaving effect was steeper for dissimilar items.³
This interaction is depicted in Figure 3. In other words, in support of dual processes of memory
and attention, when the number of categories being studied increases, we see that the interleaving
effect grows at a faster rate when the items are dissimilar. We did not explore any other potential
interactions, however, due to limitations of the existing studies, which we describe in further
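As a worked illustration of what the interaction slope implies, the Python sketch below evaluates a linear moderator model with a single interaction term. Only the interaction coefficient (b = 0.03) comes from the text. The intercept and main-effect coefficients are hypothetical placeholders, the 0/1 coding of similarity is an assumption, and the other predictors of Model 3 are omitted.

```python
def predicted_g(n_categories, dissimilar, b0, b_n, b_dis, b_int):
    """Linear moderator model with one interaction term.

    dissimilar is coded 1 for dissimilar categories and 0 for similar ones
    (an assumed coding; the text reports only the interaction slope).
    """
    return (b0 + b_n * n_categories + b_dis * dissimilar
            + b_int * n_categories * dissimilar)


B_INT = 0.03  # interaction slope reported in the text
# Hypothetical placeholders, NOT estimates from the meta-analysis:
B0, B_N, B_DIS = 0.20, 0.01, -0.10


def gain(dissimilar, n_from=6, n_to=12):
    """Change in the predicted effect when moving from n_from to n_to categories."""
    return (predicted_g(n_to, dissimilar, B0, B_N, B_DIS, B_INT)
            - predicted_g(n_from, dissimilar, B0, B_N, B_DIS, B_INT))


# The difference in gains depends only on the interaction slope.
extra = gain(1) - gain(0)
print(round(extra, 2))
```

Whatever values the placeholder coefficients take, the additional growth for dissimilar items is the interaction slope multiplied by the change in the number of categories, here 0.03 × 6.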
Discussion
Our quantitative review of the existing literature provided evidence that sequencing has a
but that average is highly misleading. This basic finding is aligned with the findings of Brunmair
and Richter (2019). What we additionally found, in contrast to prior reviews, is that a
model in which memory factors are added on top of attentional factors better accounts for the
variance than attentional factors alone. In the midst of the present quantitative review,
particularly as the result of analyzing our meta-analytic dataset, we realized that certain troubling
³ This model with the interaction term, however, was not significantly different from Model 3,
χ² = 5.63, df = 2, p = .059.
lower level of generalizability of our findings than we had initially anticipated. We detail three
core problems. First, a large proportion of the empirical literature uses the same or very similar
2003). Third, we find evidence of restricted ranges of moderators. That is, apart from the strong
correlations between moderators of interest, many moderators have not been experimentally
examined across their entire possible range of values. Each of these issues limits our ability to
disambiguate the true effects of these moderators with implications for furthering theoretical
understanding and practical generalization. In the following sections, we detail each of these
issues.
One particular paradigm represented a large proportion of the effect sizes. Specifically,
36 out of 100 of the published effect sizes and 89 out of 209 of the total (published and
unpublished) effect sizes used painting stimuli and passive study. Moreover, when painting
stimuli were used, the number of categories was usually 12 (k = 50). In the interest of full
disclosure, many of these were associated with the first author of this present paper (k = 37). In
experimental variable values. Similar to how many participants are sampled from a convenient
experimental designs. We do not believe that this tendency is unique to a particular researcher,
nor to the interleaving literature; rather this is likely a typical feature of most empirical
psychology research. Indeed, it is arguably a rigorous way of proceeding with research: When an
possible (Pashler & Harris, 2012). Retaining the same paradigm is also then important for testing
underlying processes: researchers can systematically change one experimental feature at a time,
testing whether that change yields differences in the experimental outcome. If research were not
paradigmatic in this way, it would be difficult to isolate why an outcome has been amplified or
attenuated. Yet, such a paradigmatic focus becomes problematic for generalization and situating
important moderators may become confounded within the literature space. Although the two-
stage framework makes predictions about interactions, we were limited in our ability to explore
Confounded moderators limit the conclusions that can be drawn broadly concerning the
association based upon the chi-squared statistic, which measures the association between
nominal variables and always results in a value between 0 and 1 (there are no negative Cramer’s
better at dealing with categorical variables (as most of our moderators are) and does not assume
linear relationships. Generally, V values between 0.1 and 0.3 are considered small effect sizes,
between 0.3 and 0.5 considered medium, and greater than 0.5 considered large (Osteen & Bright,
2010; Kotrlik, Williams, & Jabor, 2011). As this matrix shows, the associations between these
variables and others are extremely high, precluding clear interpretation in several cases. Of the
66 correlations reported in Figure 2, 42% were of medium size (Vs between 0.3 and 0.5) and
15% were large (Vs greater than or equal to 0.5). This high correlation between moderators is a
known problem in meta-analysis (e.g., Lipsey, 2003), and we implemented one of the proposed
solutions (multivariate analysis). However, the broader problem is that the data available to go
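For readers unfamiliar with the statistic, the Python sketch below shows one way Cramer's V can be computed from a contingency table of two categorical moderators, together with the size labels quoted above. The function names and the example counts are hypothetical, the sketch applies no bias correction, and it assumes every row and column of the table contains at least one observation.

```python
import math


def cramers_v(table):
    """Cramer's V for an r x c contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_tot), len(col_tot)) - 1  # smaller dimension minus one
    return math.sqrt(chi2 / (n * k))


def size_label(v):
    """Conventional labels quoted in the text; 'below small' is a hypothetical
    name for values under 0.1."""
    if v >= 0.5:
        return "large"
    if v >= 0.3:
        return "medium"
    if v >= 0.1:
        return "small"
    return "below small"


# Hypothetical counts of effect sizes cross-classified by two moderators.
toy_table = [[30, 5],
             [10, 25]]
v = cramers_v(toy_table)
print(round(v, 2), size_label(v))
```

Because V is built from the chi-square statistic, it is bounded by 0 (no association) and 1 (each level of one moderator occurs with only one level of the other), which is the situation that makes two moderators impossible to separate statistically.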
In Figure 4, we provide four examples that further illustrate the issue with confounded
moderators. Each panel in Figure 4 displays a combination of three moderators, with each point
in the plot representing an effect size within our meta-analysis dataset. These four panels are not
a randomly selected set of moderators; rather, we selected combinations of moderators that are
relevant to the two-stage framework. Panel A plots a key moderator within attention-based
hypotheses (similarity of the categories) against a key moderator within the memory-based
interaction that we tested in our meta-analysis. Although we did find that this interaction was
significant, what is immediately apparent from Panel A is that (a) very few studies have used
more than 8 or 12 categories; and (b) experiments using cognitive stimuli are particularly ill-
represented across the range of number of categories—rather, most of the studies using cognitive
The gaps in the evidence base are more apparent in the other three panels: Panel B
reveals that no interleaving studies using cognitive stimuli have presented examples
simultaneous versus sequential study (e.g., Birnbaum et al., 2013; Carvalho & Goldstone, 2014;
Kang & Pashler, 2012). Panels C and D of Figure 4 both point to the insufficiency of existing
studies to test certain predicted interactions of the two-stage framework: the interaction between
category similarity and presence of intertrial fillers, and the interaction between category
similarity and presentation simultaneity. If to-be-learned categories are similar and difficult to
discriminate, then learners’ attention to the critical distinguishing features will be hindered if an
intertrial filler is inserted. Hence, we would expect an intertrial filler to attenuate the interleaving
effect when categories are similar. On the other hand, if the to-be-learned categories are
dissimilar, then we would expect intertrial fillers to amplify the interleaving effect via memory
processes. Similarly, when categories are similar, category learning will benefit primarily from
attentional processes. Interleaving under both simultaneous and sequential presentation allows
learners to attend to the features that distinguish categories, and hence an interleaving benefit
should emerge in both presentation modes. When categories are dissimilar, the need for
attentional processes is low and hence, category learning will benefit primarily from memory
processes. Interleaving under simultaneous presentation is much less likely to engage retrieval or
memory processes and hence, an interleaving benefit should emerge only under sequential
presentation. As Panels C and D show, however, there were almost no studies in which
experimenters tested the effect of sequencing on the learning of dissimilar categories with inter-
trial fillers and there was not a single study in which categories were dissimilar and
simultaneously presented.
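The kind of gap described above can be checked mechanically before an interaction is tested. The following is a minimal sketch, assuming each coded effect size is stored as a dictionary of moderator values; the function name, the field names, and the example records are hypothetical and do not reproduce the meta-analysis dataset.

```python
from collections import Counter
from itertools import product


def empty_cells(records, moderators):
    """Return moderator-level combinations that no record occupies.

    Levels are taken from the values observed in `records`, so a level
    that never appears at all cannot be reported as missing.
    """
    levels = {m: sorted({r[m] for r in records}) for m in moderators}
    counts = Counter(tuple(r[m] for m in moderators) for r in records)
    all_combos = product(*(levels[m] for m in moderators))
    return [combo for combo in all_combos if counts[combo] == 0]


# Hypothetical coded effect sizes (illustration only).
records = [
    {"similarity": "similar", "presentation": "sequential"},
    {"similarity": "similar", "presentation": "simultaneous"},
    {"similarity": "similar", "presentation": "sequential"},
    {"similarity": "dissimilar", "presentation": "sequential"},
]

print(empty_cells(records, ["similarity", "presentation"]))
```

An interaction between two moderators cannot be estimated from the data when one of the cells it depends on is empty, which is the situation Panels C and D illustrate.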
During the course of this research, it became apparent that the extant range of values for
moderator. However, this issue of restricted range is also relevant to other moderators, including
those that are typically treated categorically (e.g., similarity) but should perhaps be
measured on a continuous scale. Moderator values were also frequently set to one of a
few values (e.g., 6 or 12 for the number of categories). Restricting the range of values can be
convenient for executing standard experimental designs (e.g., making the number of categories
divisible by 4 to accommodate a 2x2 design), but it also carries consequences. Restricting the range
of values for a moderator impedes theoretical development and practical application. In essence,
restricting the range of potentially important moderators restricts the range of potentially
important conclusions as well. This is not to say that moderators must be manipulated in all
experiments, but rather that if very few researchers (or none) manipulate a moderator, the
theory-building, failing to explore the effects of more or fewer categories across or within
experiments makes it challenging to know when and how attentional mechanisms that may
underlie interleaving effects actually operate. Participants may compare across exemplars, but
are these benefits influenced by the variety and number of recent examples from which to
compare? To what extent and under what conditions do a participant’s own individual abilities
(e.g., operation span, Kane & Engle, 2003) become relevant? These are important questions that
need very specific answers in order to broadly benefit education. In short, restricting the range of
values impedes the theoretical development that would help educators and
developers of educational technology understand whether, and to what extent, students should
interleave.
dynamically over time (Chun & Turk-Browne, 2007). This framework is in progress, and future research
is needed to fill in blank spots in the relations between attention and memory to elucidate how
best to schedule practice for students to learn categories. Our meta-analysis confirmed some
findings, but also revealed shortcomings in experimental design. Specifically, attention and
memory are frequently studied separately (attention more so than memory), but should be
studied together in order to give a more complete picture of category learning. Below we
describe several possibilities for future research that could account for the issues described above
substantial amount of interleaving research on category learning has used a small number of
stimuli, especially paintings, birds, and butterflies. This is likely attributable to convenience and
a desire to relate findings to prior work (e.g., that also used paintings). However, if the goal is to
generate prescriptions for how to best teach categories (even visual categories), constraining
experimental findings. Such constraints also may limit the ability of researchers to properly
manipulate important variables (e.g., similarity, number of categories). Even if, for example, the
similarity among paintings were quantified, and a large number of painters (and paintings) were
included, generalization would be difficult. In short, a broad range of stimuli needs to be used
to evaluate whether broad claims can be made about how to implement interleaving.
manipulate levels of the existing variables. Under parametric manipulation, variables of interest
are systematically tested across multiple levels of intensity (Brand et al., 2019), as opposed to
interleaving studies, this might look like implementing experimental designs where the rate of
Critically, this will allow testing of potentially important interactions between variables.
Parametric manipulation has been used to great effect in neighboring areas of cognitive
psychology, most notably in Cepeda et al.’s (2008) investigation of the spacing effect
in which the researchers systematically varied multiple levels of spacing intervals with multiple
approach would allow for greater exploration of the experiment-space that may be relevant for
understanding the effect of interleaving. The modal interleaving study in our review used
paintings as stimuli, had six categories per condition, and treated interleaving as a binary factor
(interleaved or blocked). However, extant theories aiming to explain interleaving imply that all
of these attributes (stimuli, number of categories being learned, rate of interleaving) may
interact. For example, if attention across exemplars facilitates comparison and possibly category
learning when exemplars are interleaved, the number of categories will naturally influence the
relevant depending on the similarity and number of categories. Thus, the similarity, number of
implementation may entail collaborative multi-site projects with large sample sizes (e.g., Klein
optimal for determining when and the extent to which interleaving should be applied. Given
existing experimental preferences (e.g., choosing a particular rate of interleaving and number of
even with the use of parametric experimental manipulations—could require gigantic sample
sizes and decades (or longer) of careful research. However, if the goal is to understand the
functional relationship between, for example, the number of categories and (rate of) interleaving,
there may be more efficient methods for experimental design. For example, Lindsey, Mozer,
Huggins, and Pashler (2013) sought to optimize the balance between the rate of interleaving and
amount of fading (e.g., starting with easy examples and increasing difficulty). Their aim was to
devise a method to avoid running large A/B style tests comparing different values of each
variable against each other, and instead assume there was a functional relationship that could be
traversed to find an optimum. For example, instead of manipulating just two levels of each
variable (with a large number of participants at each level), multiple levels of each variable are
chosen (with a smaller number of participants at each level) to allow researchers to model
learning. Such a method might preclude pairwise comparison of points on that functional line
depending on statistical power, but overall would give stronger evidence about the relationship
among the variables of interest. Other recent research has shown how adaptive experimental
design can be more efficient when using multi-armed bandit algorithms (Rafferty, Ying, &
Williams, 2019). Adopting alternative approaches may help avoid some of the pitfalls that have
Footnote 4. An example coordinate in this experiment-space could be high similarity among categories, ten
to-be-learned categories, and an intermediate rate of interleaving (e.g., alternate categories 50% of the time), etc.
become common in many areas of social science (Watts, 2017) and get to the heart of answering
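As an illustration of treating the rate of interleaving and the number of categories as parametric factors rather than binary ones, the following minimal sketch generates study sequences across several levels of each and reports the mean number of items intervening between repetitions of the same category. The generator, its parameter names (e.g., switch_rate), and the chosen levels are assumptions made for illustration; this is not the procedure used in any of the cited studies.

```python
import random


def make_sequence(n_categories, items_per_category, switch_rate, seed=0):
    """Build a study sequence of category labels.

    After each item, the next item comes from a different category with
    probability `switch_rate` (0.0 = blocked, 1.0 = always switch), as
    long as both options still have items left.
    """
    rng = random.Random(seed)
    remaining = {c: items_per_category for c in range(n_categories)}
    current = rng.randrange(n_categories)
    sequence = []
    while True:
        sequence.append(current)
        remaining[current] -= 1
        others = [c for c, left in remaining.items() if left > 0 and c != current]
        can_stay = remaining[current] > 0
        if not others and not can_stay:
            break  # every item has been presented
        if not others:
            continue  # only the current category has items left
        if not can_stay or rng.random() < switch_rate:
            current = rng.choice(others)
    return sequence


def mean_same_category_lag(sequence):
    """Mean number of intervening items between repetitions of a category."""
    last_seen = {}
    lags = []
    for position, category in enumerate(sequence):
        if category in last_seen:
            lags.append(position - last_seen[category] - 1)
        last_seen[category] = position
    return sum(lags) / len(lags) if lags else 0.0


# Hypothetical grid of levels; four items per category throughout.
for n_categories in (2, 6, 12):
    for switch_rate in (0.0, 0.5, 1.0):
        seq = make_sequence(n_categories, 4, switch_rate)
        print(n_categories, switch_rate, round(mean_same_category_lag(seq), 2))
```

In a sketch like this, spacing depends jointly on both factors, which is one reason a design that fixes each factor at a single value cannot separate their contributions.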
Testing Memory Factors. While our first three recommendations are those that apply
more broadly to the field, our review has also indicated several ways forward for testing theory
specific to the interleaving effect. Our quantitative analysis indicated that more categories
predicted a larger effect of interleaving. This finding supports a spacing effect (more categories
being interleaved leads to more spacing between repetitions of the same category), and by
extension, the contribution of memory processes. But under what circumstances would more
such theories concern comparisons a participant could make between the current exemplar they
are viewing and the immediately preceding exemplar (within or between categories depending
on rate of interleaving). In fact, there is evidence that disrupting comparison with the
immediately preceding exemplar (by inserting fillers) can eliminate the benefit of interleaving in
some cases (Birnbaum et al., 2013). However, if attentional explanations are mostly concerned with
attention towards immediately preceding examples, it is not clear how a purely attentional
explanation of interleaving could explain why increasing the number of categories (e.g., from 6
to 12) increases the benefit of interleaving. Of course, a participant may be reminded of prior
(nonadjacent) exemplars, and compare them to the present exemplar under consideration. That
process would involve recall, and thus spacing would become an additional relevant factor.
However, to our knowledge, there are no experiments directly testing how spacing and
experiment could insert inter-trial fillers while manipulating the number of categories and
retention interval. Most interleaving experiments seem to have been tailored for studying
attentional rather than memory effects (e.g., 90% of the experiments in our meta-analysis have a
relevant for both how attention is devoted and what is retained. This is clear in educational
research—expertise can change what instructional format is most beneficial for learning
(Cronbach & Snow, 1977; Kalyuga, Ayres, Chandler & Sweller, 2003). Experts generally
process information relevant to their domain differently, such as preferentially directing attention
to more relevant features (Kim & Rehder, 2011) and chunking information more effectively
(Chase & Simon, 1973). This interaction between expertise (prior learning), attentional
allocation, and subsequent learning is likely to be highly relevant for determining optimal
differently and thus scheduling that most prior research has shown to be effective for naive
learners may become inefficient. For example, blocking may be initially preferable due to
dissimilar category structure, but once the participant learns relevant category features enabling
across-category comparison, interleaving may be introduced to further improve learning via
spacing and across-category comparisons. Future educational research should explore the
efficacy of this approach by manipulating category similarity as well as altering the schedule
according to student performance. More precise adaptive scheduling could be achieved with a
computational model. However, although there are recent models that account for category
similarity (Carvalho & Goldstone, 2019) and others that track memory as a function of spacing
(Walsh et al., 2018; Pavlik, Eglington, & Harrell-Williams, 2020; Eglington & Pavlik, 2020), to
our knowledge there are no models that explicitly account for both simultaneously.
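In the absence of such a model, the simpler idea described above (begin with blocking, then introduce interleaving once performance suggests the learner has acquired the relevant features) can be sketched as a performance-gated rule. The class name, the accuracy window, the threshold, and the block length below are all hypothetical choices made for illustration; none of them is an empirically derived value.

```python
import random
from collections import deque


class AdaptiveScheduler:
    """Blocked presentation until recent accuracy reaches a threshold,
    then interleaved presentation. Requires at least two categories."""

    def __init__(self, categories, window=5, threshold=0.8, block_length=4, seed=0):
        self.categories = list(categories)
        self.recent = deque(maxlen=window)  # most recent correct/incorrect outcomes
        self.threshold = threshold
        self.block_length = block_length
        self.rng = random.Random(seed)
        self.current = self.categories[0]
        self.trials_in_block = 0

    def record(self, correct):
        """Store the outcome of the latest trial."""
        self.recent.append(bool(correct))

    def mode(self):
        """'interleaved' only once a full window meets the threshold."""
        if len(self.recent) < self.recent.maxlen:
            return "blocked"
        accuracy = sum(self.recent) / len(self.recent)
        return "interleaved" if accuracy >= self.threshold else "blocked"

    def next_category(self):
        """Choose the category for the next study trial."""
        if self.mode() == "interleaved":
            # Always switch to a different, randomly chosen category.
            others = [c for c in self.categories if c != self.current]
            self.current = self.rng.choice(others)
            self.trials_in_block = 1
        elif self.trials_in_block >= self.block_length:
            # Block finished: move on to the next category in order.
            index = self.categories.index(self.current)
            self.current = self.categories[(index + 1) % len(self.categories)]
            self.trials_in_block = 1
        else:
            self.trials_in_block += 1
        return self.current


scheduler = AdaptiveScheduler(["A", "B", "C"], window=3, block_length=2)
print([scheduler.next_category() for _ in range(4)], scheduler.mode())
```

Note that this rule returns to blocking whenever recent accuracy falls below the threshold; whether such reversion helps or hurts learning is itself an open empirical question.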
Conclusion
For the present review, we initially set out to meta-analyze the interleaving literature. We
found a medium beneficial effect of interleaving, moderated by several important factors. Our
analysis also revealed that multiple processes may contribute to explaining the interleaving
effect. In contrast to prior work, we coded for moderators that are likely to be relevant for both
memory-based accounts (e.g., number of categories) and those more relevant to attention-based
accounts (e.g., similarity, inter-trial fillers). In line with prior research, our analyses indicated
that interleaving enabled attentional mechanisms that promoted learning, as long as the
categories themselves had the proper similarity structure. However, interleaving also naturally
increases spacing, and, to add to existing literature, we found evidence that this spacing also
contributes to the interleaving effect. In other words, we found evidence that both attentional and
spacing accounts are partially supported, which should not be entirely surprising, given the
However, coding this literature also revealed significant issues concerning how
interleaving research is carried out, especially the importance of more systematic and
elucidate how to effectively implement interleaved practice. For example, despite the known
relationship between the similarity structure of learning materials and the effect of interleaving
(Carvalho & Goldstone, 2014), the similarity among categories has rarely been quantified (or
manipulated directly), and a significant plurality of research has been conducted with specific sets of
stimuli. Our two-stage framework lays out a roadmap for possible interactions to explore, but
these were interactions that we could not examine in the present review given gaps in the
existing evidence base. The non-systematic nature of experimental design within the interleaving
literature hence makes it more challenging to broadly interpret overall findings and develop
specific theories and models that can guide future research and pedagogy. On a more positive
note, these issues highlight new areas of research and motivate new experiments that may help
further refine the two-stage framework (depicted in Figure 1), and lead to improved pedagogical
References
Abushanab, B., & Bishara, A. J. (2013). Memory and metacognition for piano melodies: Illusory
advantages of fixed- over random-order practice. Memory & Cognition, 41(6), 928–937.
https://doi.org/10.3758/s13421-013-0311-z
Ashby, F. G., Alfonso-Reese, L. A., Turken, U., & Waldron, E. M. (1998). A neuropsychological
Ashby, F. G., & Maddox, W. T. (2005). Human Category Learning. Annual Review of
Birnbaum, M. S., Kornell, N., Bjork, E. L., & Bjork, R. A. (2013). Why interleaving enhances
inductive learning: The roles of discrimination and retrieval. Memory & Cognition, 41(3),
392–402. https://doi.org/10.3758/s13421-012-0272-7
Carvalho, P. F., & Goldstone, R. L. (2014). Putting category learning in order: Category
structure and temporal arrangement affect the benefit of interleaved over blocked study.
Carvalho, P. F., & Goldstone, R. L. (2015). The benefits of interleaved and blocked study:
Different tasks benefit from different schedules of study. Psychonomic Bulletin &
Carvalho, P. F., & Goldstone, R. L. (2017). The sequence of study changes what information is
attended to, encoded, and remembered during category learning. Journal of Experimental
https://doi.org/10.1037/xlm0000406
Carvalho, P. F., & Goldstone, R. (2019, September 13). A computational model of context-
Cepeda, N. J., Vul, E., Rohrer, D., Wixted, J. T., & Pashler, H. (2008). Spacing Effects in
1095–1102. https://doi.org/10.1111/j.1467-9280.2008.02209.x
Chase, W. G., & Simon, H. A. (1973). Perception in chess. Cognitive Psychology, 4(1), 55–81.
Chun, M. M., & Turk-Browne, N. B. (2007). Interactions between attention and memory.
Cronbach, L., & Snow, R. (1977). Aptitudes and instructional methods: A handbook for research
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013).
Eglington, L. G., & Kang, S. H. K. (2017). Interleaved Presentation Benefits Science Category
https://doi.org/10.1016/j.jarmac.2017.07.005
Eglington, L. G., & Pavlik Jr, P. I. (2020). Optimizing practice scheduling requires quantitative
https://doi.org/10.1038/s41539-020-00074-4
Hall, K. G., Domingues, D. A., & Cavazos, R. (1994). Contextual interference with skilled
Jolly, E., & Chang, L. J. (2019). The Flatland Fallacy: Moving Beyond Low–Dimensional
Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect.
Kane, M. J., & Engle, R. W. (2003). Working-memory capacity and the control of attention: the
contributions of goal neglect, response competition, and task set to Stroop interference.
https://doi.org/10.1037/0096-3445.132.1.47
Kang, S. H. K., & Pashler, H. (2012). Learning painting styles: Spacing is advantageous when it
https://doi.org/10.1002/acp.1801
Kim, S., & Rehder, B. (2011). How prior knowledge affects selective attention during category
Klein, R. A., Ratliff, K. A., Vianello, M., Adams, R. B., Jr., Bahník, Š., Bernstein, M. J., . . .
Kornell, N., & Bjork, R. A. (2008). Learning Concepts and Categories: Is Spacing the “Enemy of
9280.2008.02127.x
Kotrlik, J. W., Williams, H. A., & Jabor, M. K. (2011). Reporting and Interpreting Effect Size in
132-142.
Mayfield, K. H., & Chase, P. N. (2002). The effects of cumulative practice on mathematics
https://doi.org/10.1901/jaba.2002.35-105
Lindsey, R. V., Mozer, M. C., Huggins, W. J., & Pashler, H. (2013). Optimizing instructional
Noh, S. M., Yan, V. X., Bjork, R. A., & Maddox, W. T. (2016). Optimal sequencing during
https://doi.org/10.1016/j.cognition.2016.06.007
Ollis, S., Button, C., & Fairweather, M. (2005). The influence of professional expertise and task
complexity upon the potency of the contextual interference effect. Acta Psychologica,
Osteen, P., & Bright, C. Effect Sizes and Intervention Research. Society for Social Work and
https://archive.hshsl.umaryland.edu/handle/10713/3582
Ostrow, K., Heffernan, N., Heffernan, C., & Peterson, Z. (2015). Blocking Vs. Interleaving:
Pashler, H., & Harris, C. R. (2012). Is the replicability crisis overblown? Three arguments
Pavlik Jr, P. I., Eglington, L. G., & Harrell-Williams, L. M. (2020, May 2). Generalized
arXiv:2005.00869.
Rafferty, A., Ying, H., & Williams, J. (2019). Statistical Consequences of using Multi-armed
Rau, M. A., Aleven, V., & Rummel, N. (2013). Interleaved practice in multi-dimensional
learning tasks: Which dimension should we interleave? Learning and Instruction, 23, 98–
114. https://doi.org/10.1016/j.learninstruc.2012.07.003
R Core Team (2018). R: A Language and Environment for Statistical Computing. R Foundation
Rohrer, D., Dedrick, R. F., Hartwig, M. K., & Cheung, C.-N. (2019). A randomized controlled
https://doi.org/10.1037/edu0000367
Rohrer, D., & Taylor, K. (2007). The shuffling of mathematics problems improves learning.
Rohatgi, A. (2018). WebPlotDigitizer - Web Based Plot Digitizer. Austin, Texas, USA.
Sana, F., Yan, V. X., & Kim, J. A. (2017). Study sequence matters for the inductive learning of
https://doi.org/10.1037/edu0000119
Sana, F., Yan, V. X., Kim, J. A., Bjork, E. L., & Bjork, R. A. (2018). Does Working Memory
Capacity Moderate the Interleaving Benefit? Journal of Applied Research in Memory and
Tauber, S. K., Dunlosky, J., Rawson, K. A., Wahlheim, C. N., & Jacoby, L. L. (2013). Self-
012-0319-6
Watts, D. J. (2017). Should social science be more solution-oriented? Nature Human Behaviour,
1(1), 0015.
Wolfe, J. M. (2019). Visual Attention: The Multiple Ways in which History Shapes Selection.
Yan, V. X., Bjork, E. L., & Bjork, R. A. (2016). On the difficulty of mending metacognitive
illusions: A priori theories, fluency effects, and misattributions of the interleaving benefit.
https://doi.org/10.1037/xge0000177
Yan, V. X., Soderstrom, N. C., Seneviratna, G. S., Bjork, E. L., & Bjork, R. A. (2017). How
https://doi.org/10.1037/xap0000139
Viechtbauer, W. (2019). I^2 for Multilevel and Multivariate Models. The Metafor package:
project.org/doku.php/tips:i2_multilevel_multivariate
Zulkiply, N., & Burt, J. S. (2013). The exemplar interleaving effect in inductive learning:
Zulkiply, N., McLean, J., Burt, J. S., & Bath, D. (2012). Spacing and induction: Application to
exemplars presented as auditory and visual text. Learning and Instruction, 22(3), 215–
221. https://doi.org/10.1016/j.learninstruc.2011.11.002
Table 1
Comparison of Multiple Regression Results
Intercept 0.45 (0.05) *** 0.81 (0.16) *** 0.83 (0.17) *** 0.26 (0.13) *
Schedule Manipulation (Within) -0.12 (0.09) -0.11 (0.09) -0.14 (0.08)
Publication Status (Published) -0.10 (0.09) -0.09 (0.10) -0.07 (0.09)
Stimuli Type (Cognitive) -0.45 (0.12) *** -0.49 (0.12) *** -0.25 (0.12) *
Population Pool (MTurk) -0.26 (0.09) ** -0.28 (0.10) ** -0.18 (0.09) *
Sequential or Simultaneous Presentation (Simultaneous) 0.07 (0.09) 0.09 (0.09)
Intertrial Filler (Yes) -0.26 (0.12) * -0.26 (0.11) *
Similarity (Dissimilar) -0.32 (0.06) *** -0.33 (0.06) ***
Study Type (Active) 0.07 (0.08) 0.10 (0.08)
Retention Interval (>= 1 Day) 0.01 (0.10)
Number of Categories 0.04 (0.01) ***
Note. Model fit statistics, such as negative log likelihood, AIC, and BIC, can be used to compare the relative fits of each model, with
Figure 1
Figure 2
Figure 3
Figure 4
Note. Data points are jittered due to the categorical nature of the moderators.