Timing in Turn-Taking and Its Implications For Processing Models of Language
The core niche for language use is in verbal interaction, involving the rapid exchange of turns at talking. This paper reviews the extensive literature about this system, adding new statistical analyses of behavioral data where they have been missing, demonstrating that turn-taking has the systematic properties originally noted by Sacks et al. (1974; hereafter SSJ). This system poses some significant puzzles for current theories of language processing: the gaps between turns are short (of the order of 200 ms), but the latencies involved in language production are much longer (over 600 ms). This seems to imply that participants in conversation must predict (or ‘project’ as SSJ have it) the end of the current speaker’s turn in order to prepare their response in advance. This in turn implies some overlap between production and comprehension despite their use of common processing resources. Collecting together what is known behaviorally and experimentally about the system, the space for systematic explanations of language processing for conversation can be significantly narrowed, and we sketch a first model of the mental processes involved for the participant preparing to speak next.

Edited by: Manuel Carreiras, Basque Center on Cognition, Brain and Language, Spain
Reviewed by: Brian MacWhinney, Carnegie Mellon University, USA; Martin John Pickering, The University of Edinburgh, UK
Keywords: turn-taking, conversation, language processing, language production, language comprehension
*Correspondence: Stephen C. Levinson, Language and Cognition Department, Max Planck Institute for Psycholinguistics, Wundtlaan 1, 6525 XD Nijmegen, Netherlands
Specialty section: This article was submitted to Language Sciences, a section of the journal Frontiers in Psychology
Received: 28 January 2015
Accepted: 16 May 2015
Published: 12 June 2015
Citation: Levinson SC and Torreira F (2015) Timing in turn-taking and its implications for processing models of language. Front. Psychol. 6:731. doi: 10.3389/fpsyg.2015.00731

1. Introduction: Why Turn-Taking in Conversation is Important for the Psychology of Language

One of the most distinctive ethological properties of humans is that they spend considerable hours in the day in a close (often face-to-face) position with others, exchanging short bursts of sound in a human-specific communication pattern: extrapolating from Mehl et al. (2007), we may each produce about 1200 of these bursts a day, for a total of 2–3 h of speech. The bursts tend to involve a phrasal or clausal unit, but can be longer or shorter. At the end of such bursts, a speaker stops, and another takes a turn. This is the prime ecological niche for language, the context in which language is learned (see Section 6.1 below), in which the cultural forms of language have evolved, and where the bulk of language usage happens.

This core form of language use poses a central puzzle for psycholinguistics (see Section 6), which has largely ignored this context, instead examining details of the processes of language production or comprehension separately in laboratory contexts. Yet this prime use of language involves rapid switching between comprehension and production at a rate implying that these processes must sometimes overlap. Decades of experimentation have shown that the language production system has latencies of around 600 ms and up for encoding a new word (reviewed in Section 6.3) but the gaps between turns average around 200 ms (see Section 5). This would seem to imply that participants planning to respond are already encoding their responses while the incoming turn from the other speaker is still unfinished. This in turn implies potentially long-range prediction in comprehension. A sketch model of the interleaving of comprehension and production processes is presented in Section 7.

To appreciate the full nature of this puzzle, it is essential to review what we know about the turn-taking system and its temporal properties. In Section 2, we review the foundational Sacks et al. (1974; henceforth SSJ) model of turn-taking, considering alternative proposals in Sections 3 and 4. The model proposes extensive prediction (or ‘projection’) of turn-ends, and an expectation of swift response. The systematicity of turn-taking and its temporal patterning are borne out by extensive corpus analyses (Section 5). We then turn to the psycholinguistic literature (Section 6), noting that sensitivity to turn-end cues is already shown early in child development. We point out that there is considerable evidence for predictive language comprehension, and for long latencies in language production, so that the central psycholinguistic puzzle (Section 6.5) posed by turn-taking seems to be resolved by predicting what the other interlocutor is going to say. Some direct recent investigations seem to bear this out (Section 6.4), although experimentation in this field is in its infancy. In Section 7 we take stock of the recent findings, and sketch a processing model addressing some of the processing puzzles involved.

2. Turn-Taking as a System: Research from Conversation Analysis

Sacks et al. (1974; SSJ) initiated the modern literature on conversational turn-taking by outlining how this behavior constitutes a system of social interaction with specific properties. It is not organized in advance (by say an order of speaking, or set units to be uttered), but is highly flexible, allowing for longer units when so mutually arranged, and organizing an indeterminate number of participants into a single conversation. The authors note that “overwhelmingly one speaker talks at a time. Occurrences of more than one speaker at a time are common but brief [. . .] Transitions (from one turn to the next) with no gap and no overlap are common, and together with slight gaps and slight overlaps make up the majority of transitions” (Sacks et al., 1974, p. 700). Obviously, such turn-taking behavior contrasts with the absence of turn-taking in cheering, heckling, laughing, etc. That things could be otherwise in the speech domain is shown by the contrasting speech exchange systems we also use, as in lectures where questions come at the end, or in a press conference where questions come from many parties but are answered by one, contrasting with a classroom where questions may come from the teacher alone, and may be answered by many volunteers. The importance of the conversational system is that, unlike the others, it appears to be the default mode of language use, as shown by its operation in the context of language learning, and among friends and family. As far as we know, it operates in a strongly universal way (cf. Stivers et al., 2009, 2010), while the other speech exchange systems are mostly culture-specific.

Sacks et al. (1974) argued that conversation is an elemental piece of social organization, regulated by social norms that prescribe one speaker at a time but allow open participation. The model they suggested consists of turn units and rules that operate over those units. The units they suggested are variable sizes of syntactic units, whose functions as full turns can be indicated prosodically. The end of such a unit constitutes a ‘transition relevance place’ or TRP. The rules specify:

(1) If the current speaker C selects the next speaker N, then C must stop, and N should start. (‘Selection’ could involve address terms, gaze, or in the case of dyadic conversation defaults to the other.)
(2) If C does not select N, then any participant can self-select, first starter gaining rights to that next unit.
(3) If no other party self-selects, C may continue.

These rules then recursively apply at each TRP.

These rules predict that intra-speaker silent gaps (generated by rule 3) will be longer than inter-speaker ones, a fact shown to be correct on large samples of conversation [ten Bosch et al., 2005 report gaps between continuations by the same speaker to be about 140 ms (c. 25%) longer than the average gap in turn transitions between different speakers]. It has also been suggested that on this basis a turn-taking ‘beat’ or ‘clock’ (with a period between 80 and 180 ms) can be discerned, suggesting a model of coupled oscillators that allow participants to synch (Wilson and Zimmerman, 1986; Wilson and Wilson, 2005).

It was evident to Sacks et al. (1974) that the model had consequences for language processing. They noted that, given that interlocutors may be addressed at any point, the system enforces obligate listening. More importantly, they noted that the speed of speaker transition would require ‘projection’ (prediction) of the end of the incoming turn, and production processes would have to begin before the end of the incoming turn, in part because turn beginnings have to be designed to facilitate that very projection (Sacks et al., 1974, 719; Levinson, 2013). Later corpus studies have established, as we shall see (see Section 4), that the great proportion of turn transitions fall between −100 and 500 ms, that is, between a short stretch of overlap to a gap with a duration equivalent to one to three syllables.

There is a great deal of later work in conversation analysis (CA) that has contributed to our understanding of this system (see Clayman, 2013; Drew, 2013; Hayashi, 2013 for overviews). It is important to appreciate that not all overlapping of turns can be understood as behavior that violates the rules above – some authors (see Section 4) have seen the frequency of overlap as undercutting the Sacks et al. (1974) model. Sacks et al. (1974) claimed that overlaps are common, but usually very short, and often accounted for by little additions to the first turn like address
forms or tags [as in (1)], or by misanalyses of when the turn is coming to an end [as in (2) where ‘biscuits’ was projected as the turn-end but it was followed by ‘and cheese’; overlap indicated with square brackets]:

(1) Sacks et al. (1974, p. 707)
(9) A: Uh you been down here before havenche.
B: Yeh. [NB: III:3:5]

(2) Jefferson (1984, p. 15)
1. Vera: they muucked intuh biscuits. They had (.) quite a lotta
2. –> biscuit [s’n ch] e e | : : : s e. ]
3. Jenny: –> [Oh : :] well thaht’s it th]en [ye[s.

Note especially that some overlaps – namely competing (more or less simultaneous) first starts – are expectable by the rules above (as when two people start simultaneously by rule 2, or a participant operating rule 2 is a bit slow and overlaps with the current speaker continuing by rule 3). In these cases one or the other of the speakers normally drops out (impressionistic gap duration in seconds between brackets):

(3) Hayashi, 2013, p. 176 (from Auto Discussion)
(1) Curt: Mmm I’d like t’get a, high one if I cou:ld.
(2) (0.7)
(3) Gary: –> [I know uh-]
(4) Mike: –> [Lemme ask ] a guy at work. He’s gotta bunch a’ old clu[nkers.

When there is competition to maintain the floor in these and other cases, this is often negotiated on a syllable by syllable basis, with e.g., deceleration, increase of intensity, and repeated syllables or words, until one speaker drops out (Schegloff, 2000).

Just as different kinds of overlap can be discerned, so can different kinds of absence of speech, differentiating between pauses (e.g., between units by the same speaker), gaps (between speakers), silences (meaningful absence of speech, e.g., after a question), and lapses (where no-one has self-selected to speak). It has been suggested (citations below) that participants are very sensitive to timing, so that an excessively long gap after a question, for instance, may be taken to indicate that the recipient has some kind of problem with it, for example finding it difficult to answer in the affirmative, or has uncertainty about the response. In the following a telephone caller takes a gap of around 2 s to indicate the answer ‘no,’ which he himself then pre-emptively provides:

(4) Levinson, 1983, p. 320
C: So I was wondering would you be in your office on Monday (.) by any chance?
(2.0 s)
C: Probably not.

A considerable body of work has gone into understanding the role of extended gaps or silences in ‘dispreferred’ responses (responses not in line with the suggested action in the prior turn; see Pomerantz and Heritage, 2013 for review). Corpus analysis shows that gaps of 700 ms or more are associated with dispreferred actions, and that gaps longer than the norm (>300 ms) decrease the likelihood of an unqualified acceptance, and increase the likelihood that a response, be it acceptance or rejection, will have a dispreferred turn format (e.g., Yes, but. . . in the cases of acceptances; Kendrick and Torreira, 2015). Experimental work also shows that gaps of 600 ms or longer generate inferences of this unwelcome kind (Roberts et al., 2011).

The CA approach to turn-taking raises two major issues. The first is what exactly counts as a turn, and how participants can recognize such a unit as complete. The problem is that just about any word or phrase may in context constitute a turn, while syntactic units can be nested or conjoined indefinitely. Regarding this issue, Sacks et al. (1974, p. 721) note that “some understanding of sound production (i.e., phonology, intonation, etc.) is also very important to turn-taking organization.” Thus in the following (drawn from the discussion in Clayman, 2013, p. 155), the terminal intonation contours do not occur till the end of the turns, and two turns each composed of three possibly complete syntactic units (divided by §) occur uninterrupted (note the whole is recognized by the recipient as a story under way, hence the continuers, which are themselves possibly elicited by rising intonation marked with ‘?’):

(5) Ford and Thompson (1996, p. 151)
K: Vera (.) was talking §on the phone §to her mom?
(6) C: mm hm
K: And uh she got off §the phone §and she was incredibly upset?
C: Mm hm.

In addition to syntactic and prosodic completeness, pragmatic completeness may be required to terminate a turn (Ford and Thompson, 1996; Levinson, 2013). Clearly a responsive action following the first part of a pair of actions like questions and answers, offers and acceptances, requests and compliances can be inspected for pragmatic efficacy; elsewhere the larger role in a sequence of speech acts may need to be satisfied.

The second major issue is ‘projection’ or predictive language understanding. Sacks et al. (1974) thought it clear that the turn-taking system can only work if there is extensive prediction in comprehension, so that recipients can use the unfolding turn to project an overall syntactic and prosodic envelope which would allow them to foresee when and how a turn would come to an end (see Clayman, 2013 for a review). It is not at all clear how this works, given the flexibility and extendibility of most syntactic units. Still, interesting insights are provided by such phenomena as turn-completion by the other, studied in depth by Lerner (1991, 2002; see also Hayashi, 2013). A typical example is where a bi-clausal structure is begun by speaker A, and the second clause completed by speaker B as below. Clearly an If..then.. or Whenever. . ., X. . . structure projects a second downstream clause.
(7) Lerner (1991, p. 445)
1. Rich: if you bring it intuh them
2. Carol: –> ih don’t cost yuh nothing.

Such cases do not alone show that recipients accurately predict the content of the second clause (indeed sometimes a jokey exploitation of the structure may appear). But sometimes exactly the same words do occur in overlap:

a rule that sometime during the course of a turn a speaker should glance at the addressee, expecting to find a gazing addressee whenever he or she looks. The idea that speaker gaze when returning to addressee could function as a turn-yielding cue is, however, not easy to substantiate. More recently, Rossano (2013) has suggested this is because gaze is actually oriented to larger units of conversation (sequences), which it may serve to open and close.
further aspects of the standard model: “Thus, the no-gap–no-overlap principle (Sacks et al., 1974) can neither be used as a part of an argument in favor of projection nor against reaction simply because the no-gap–no-overlap cases hardly ever occur in real speaker change data. Importantly, this means that a principal motivation for projection in turn-taking is invalid.” This attack on projection as a central element of the model will prove misplaced when we turn to consider the psycholinguistic evidence below (in fact Heldner and Edlund, 2010, p. 566 later concede that projection of content may be responsible for overlaps and short gaps).

The central plank of the dismissal of projection is that turn-taking is often not as rapid as has been claimed. Heldner and Edlund (2010, p. 563) note:

“The cumulative distribution above the 200 ms threshold was also of interest, as it represented the cases where reaction to cessation of speech might be relevant given published minimal reaction times for spoken utterances (Fry, 1975; Izdebski and Shipp, 1978; Shipp et al., 1984). The distribution above this threshold represented 41–45% of all between-speaker intervals. These cases were thus potentially long enough to be reactions to the cessation of speech, or even more so to some prosodic information just before the silence.”

There are two separate proposals here. The first is that for gaps longer than 200 ms, participants might simply react to silence. This threshold is implausible. First, silence will only become recognizable as silence after c. 200 ms (after all the duration of voiceless stop consonants ranges up to 180 ms; cf. Heldner and Edlund, 2010), at which point it will still take a further minimally 200 ms to react (so 400 ms in total). That minimal reaction is for a prepared vowel (Fry, 1975), and any more complex response will increase according to Hick’s Law (see below); a choice between one of two prepared responses takes 350 ms for example. We now have, say, 550 ms from actual cessation of speech till beginning of a minimal response, and as Heldner and Edlund (2010) note 70–82% of responses are within 500 ms. Thus reaction to silence, although certainly possible in a minority of cases, would not seem to play a major role in the organization of turn-taking (see Riest et al., 2015).

The second proposal is that there is the possibility of reaction to “some prosodic information just before the silence.” Here there is less room for disagreement; CA practitioners and associated phoneticians have themselves emphasized the role of turn-final intonational and segmental cues (see Walker, 2013 for a review). Duncan drew attention to turn-keeping intonation cues and lengthened (‘drawled’) syllables. Critical here are two factors: (a) it must be shown not only (as Duncan did) that there are available prosodic/phonetic features of turn-ends, but also that participants actually use them, (b) the location of the features with respect to the turn end is important (e.g., sentence accents in English sometimes occur well before turn ends, in which case talk of projection suits better than talk of reaction to terminal cues, cf. Wells and Macfarlane, 1998). Bögels and Torreira (in press) provide experimental evidence that listeners do use turn-final prosodic information (located in the last syllable of the utterance) to identify turn ends in Dutch questions with final rising intonation. Further research should investigate other linguistic contexts.

Another notion that has some currency is that turn-taking could be driven by coupled oscillators (Wilson and Wilson, 2005). Coupled oscillators have been shown to play a role in coordination in the animal world, e.g., in the synchronization of fire-fly flashing where an individual’s flashes reset the neighboring fireflies’ oscillators, so gradually converging on a single beat. However, it is well known that human synchronization does not primarily work in this way, but rather by means of temporal estimation, which is easily shown by demonstrating that humans can tap together without waiting to hear the others’ taps (Buck and Buck, 1976). Moreover, given the highly variable lengths of turns, nothing like the firefly mechanism can work in conversation. Indeed, human coordination in general relies on simulating the other’s task, thus on high-level cognition (Sebanz and Knoblich, 2008). There is, however, room for a low level metronome, as it were, and Wilson and Wilson (2005) suggest that readiness to speak is governed by the syllable, so that participant A’s beginning of a syllable tends to coincide with B’s least readiness to speak, while the end of the syllable coincides with B’s increased readiness. There is indeed some evidence for entrainment or accommodation of the gap size between specific dyads, but there is no such effect on intra-turn pauses (ten Bosch et al., 2005) suggesting that turn-transition timing is rather unconnected to other temporal properties of speaking, although more research is required here.

Careful observers have convinced themselves that such a ‘beat’ is set up in English conversation by stress-timing, such that interlocutors producing unmarked actions with their turns tend to come in ‘on the beat’ (Couper-Kuhlen, 2009). However, the perceived rhythm of speech does not appear to have direct acoustic correlates, and to date we are unable to objectively confirm these observations (note too that languages differ in their rhythmic properties). Interestingly, recent corpus measurements show that, rather than the entrainment of a conversational beat, there is a reverse correlation of speaker A’s speech rate and speaker B’s response timing, perhaps because B has less time to plan her message as A’s speech rate increases, and vice versa (Roberts et al., 2015).

5. Statistical Studies of Corpora

The statistical study of turn-taking began early, prompted by developments in telephony, with a special interest in the speed of turn-transition (e.g., Norwine and Murphy, 1938). It has become standard to represent overlaps and gaps on a single time scale [sometimes called ‘the floor transfer offset’ (FTO)] in which positive values correspond to gaps, and negative values represent overlap. Table 1 summarizes average values of FTOs in ten languages as reported in four studies (caveat: codings and methods differ somewhat in these studies). Note that although mean values vary, they do so in a narrow window, roughly a quarter of a second either side of the cross-linguistic mean, and that
the factors affecting response times are uniform across cultures (Stivers et al., 2009). In the following two sections, we look in more detail at the distribution of gaps and overlaps.

TABLE 1 | Average floor transfer offsets (FTOs) in ten different languages as reported by four different studies.

Language         Average FTO (ms)   Source
English          410                Norwine and Murphy (1938)∗
English          480                Sellen (1995)∗
English          460                Sellen (1995)
Dutch            −78                De Ruiter et al. (2006)∗
Japanese         7                  Stivers et al. (2009)
Tzeltal          67                 Stivers et al. (2009)
Yélî-Dnye        71                 Stivers et al. (2009)
Dutch            108                Stivers et al. (2009)
Korean           182                Stivers et al. (2009)
English          236                Stivers et al. (2009)
Italian          309                Stivers et al. (2009)
Lao              419                Stivers et al. (2009)
Danish           468                Stivers et al. (2009)
Ākhoe Hai||om    423                Stivers et al. (2009)

∗ No eye-contact between conversation participants.

5.1. Distribution of Gaps
About half a century ago, Brady (1968) reported average gap durations of 345–456 ms and medians from 264 to 347 ms (depending on the threshold used in the automatic detection of speech) in a corpus of sixteen telephone calls between friends in the USA. Task-oriented interaction shows surprisingly similar patterns [e.g., Verbmobil – a travel scheduling task by telephone, has geometric means of 380 ms (English), 363 ms (German), 389 ms (Japanese); Weilhammer and Rabold, 2003]. In a wide review, Heldner and Edlund (2010) looked at three different corpora, automatically processing two of them for speaker transitions: a Dutch dialog corpus, and English and Swedish Map Tasks (where interlocutors must adjust their positions on slightly mismatching maps). The first two corpora included both face-to-face and non-face-to-face interaction. Heldner and Edlund (2010) found closely matching patterns across corpora, with combined scale (FTO) modes for speaker transition at c. 200 ms (i.e., a short gap) and c. 60% of transitions being gaps, 40% overlaps (including any overlap of greater than 10 ms; the modal overlap is less than 50 ms in the Spoken Dutch Corpus). Around 41–45% of gaps were longer than 200 ms, and between 70 and 82% of all transitions were shorter than 500 ms.

These quantitative approaches generalize over all kinds of speech acts and responses. But there is also growing work focused specifically on question–answer timings. Question–answer sequences are an interesting context to examine, because questions make a floor transfer relevant, whereas in other contexts a floor transfer between speakers is often optional. Stivers et al. (2009) looked at 10 languages from around the world, including smaller, unwritten languages, and found rather fast transitions in polar question contexts, with means between 7 and 468 ms, and modes from 0 to 200 ms. The coding of this sample was from videotape and included early visual responses (e.g., nods) and audible pre-utterance inbreaths. The general finding was that although languages differ, e.g., in their degree of use of visual modality or mean response times, the factors that speeded or slowed response times (e.g., gaze, agreement) were shared. Heldner (2011) shows that estimates of the percentage of perceived overlaps and gaps in this sample match closely other quantitative samples.

The intensive study of turn-taking under different conditions is still in its infancy. We know that responses to Wh-questions are slower than polar (yes–no) questions cross-linguistically (unpublished data from the Stivers et al., 2009 study), presumably because of the greater cognitive complexity of response involved. Longer answers can also be shown to take more preparation, reflected in both reaction times, and breathing preparation (Torreira et al., 2015). Complexity of response has also been shown to influence timings in children’s responses (Casillas, 2014). We also know that individuals tend to accommodate to the gap-length of others, so that when changing conversational partners, individuals’ response times change to match their new interlocutors (ten Bosch et al., 2004, 2005). And intriguingly, transition speeds are higher on the phone than face-to-face (Levinson, 1983; ten Bosch et al., 2005).

5.2. Overlap
In contrast to gaps, the study of overlap in corpora has provided only gross facts. As mentioned, Heldner and Edlund (2010) report c. 40% of speaker-transitions involving overlaps (including any overlap of greater than 10 ms). Their histogram makes clear that the modal overlap is less than 50 ms in the Spoken Dutch Corpus, with a mean −610 ms, and median −470 ms. ten Bosch et al. (2005) report that the proportion of overlaps increases from 44% in face-to-face conversation to 52% in telephone conversation, with males more likely to overlap their interlocutor than females, but looking just at the transition from speaker A to speaker B, 80% of transitions are gaps and 20% partial overlaps in face-to-face conversation (the corresponding figures for telephony are 73 and 27%).

Because of the lack of detailed statistical analysis of overlaps in corpora, we have undertaken a new analysis of overlaps in the Switchboard Corpus of English telephone conversations (Godfrey et al., 1992). We address the following questions:

(1) In running speech, how common is overlap (i.e., simultaneous talk by more than one party at a time) compared to talk by one party alone?
(2) In floor transfers, how common are overlaps compared to gaps?
(3) What is the distribution of overlap duration, and where do overlaps tend to start relative to the interlocutor’s turn?
(4) What is the distribution of different overlap types (cf. Jefferson, 1986)?

5.2.1. Method
We analyzed a subset of 348 conversations (totaling around 38 h of dyadic conversation) that were free of timing errors, and with annotations included in the NXT-Switchboard Corpus
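
The distinction between gaps, between-overlaps, and within-overlaps used in this kind of analysis can be illustrated with a minimal Python sketch. It is not the authors' procedure; it assumes that utterances are available as (speaker, start, end) intervals in milliseconds, that a within-overlap is an utterance that both starts and ends while the other party is still talking, and that an offset of exactly 0 ms is counted as a gap. The names `Utterance`, `floor_transfer_offset`, and `classify_onset` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One stretch of talk by one speaker (hypothetical record type)."""
    speaker: str
    start_ms: int
    end_ms: int


def floor_transfer_offset(prev: Utterance, nxt: Utterance) -> int:
    """FTO in ms: positive values are gaps, negative values are overlaps."""
    return nxt.start_ms - prev.end_ms


def classify_onset(prev: Utterance, nxt: Utterance) -> str:
    """Label how `nxt` begins relative to `prev`, spoken by the other party."""
    if prev.speaker == nxt.speaker:
        raise ValueError("expected utterances from two different speakers")
    if floor_transfer_offset(prev, nxt) >= 0:
        # Assumption: an offset of exactly 0 ms is counted as a gap.
        return "gap"
    if nxt.end_ms <= prev.end_ms:
        # Assumption: `nxt` is fully contained in `prev`, so the floor
        # does not change hands.
        return "within-overlap"
    return "between-overlap"


if __name__ == "__main__":
    a = Utterance("A", 0, 1000)
    for b in (Utterance("B", 1200, 2000),   # starts 200 ms after A ends
              Utterance("B", 900, 2000),    # starts 100 ms before A ends
              Utterance("B", 300, 600)):    # entirely inside A's talk
        print(classify_onset(a, b), floor_transfer_offset(a, b))
```

Under these assumptions, the sign of the FTO separates gaps from between-overlaps, while within-overlaps are set apart because they do not result in a floor transfer.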
5.2.2. Findings
The recordings were divided as follows: 77% of the signal corresponded to speech by one speaker only, 19.2% to silence (i.e., either pauses within a speaker’s turn or gaps as defined above), and only 3.8% to simultaneous speech by both speakers (either between-overlaps or within-overlaps). If we exclude silent parts, 95.3% of the speech signal corresponded to speech by one speaker. This seems to fit well with Sacks and colleagues’ observation that “overwhelmingly, one party speaks at a time” (Sacks et al., 1974, p. 700).

With regard to how common overlaps are in terms of proportion of turn-transitions, Figure 2 shows the distribution of the duration of gaps and between-overlaps combined together as FTOs (i.e., with positive values for gaps and negative values for between-overlaps). Between-overlaps (negative FTOs) represented 30.1% of all floor transfers. As for the duration of overlaps, and their location within the interlocutor’s turn, we observed that between-overlaps exhibited a distribution highly skewed to the left, with an estimated modal duration of 96 ms, a median of 205 ms, a mean of 275 ms, and with 75% of the data with values below 374 ms. Within-overlaps tended to start close to the beginning of the utterances that they overlapped, with a modal offset of 350 ms, a median of 389 ms, a mean of 447 ms, and 75% of the data exhibiting offsets below 532 ms. Their duration exhibited a distribution highly skewed to the right, with an estimated modal duration of 350 ms, a median of 389 ms, a mean of 447 ms, and 75% of the data with values below 532 ms. The duration of within-overlaps is thus usually shorter than that of two syllables. This appears to fit well with Sacks et al.’s (1974) observation that “occurrences of more than one speaker at a time are common, but brief.”

FIGURE 2 | Histogram of floor transfer offsets (FTOs) in the Switchboard Corpus (Godfrey et al., 1992; Calhoun et al., 2010, see Section 5.2.1 for details). Each bin has a size of 100 ms.

We now examine the distribution of different types of overlaps. A prediction made by the Sacks et al. (1974) model is that most overlaps should be occasioned by a number of circumstances emerging from the application of its rules. For instance: (i) Overlaps often arise from unforeseen additions to the first speaker’s turn after a transition relevance place (e.g., during increments or tags); (ii) They may occur after a silence when two speakers may self-select and launch articulation without realizing that another party is doing the same thing (cf. ‘blind spot’ cases, Jefferson, 1986); (iii) They may frequently arise in cases involving backchannels signaling feedback to the
FIGURE 1 | Illustration of gaps, within-overlaps, and between-overlaps for two speakers (SPK1 and SPK2 ) in our classification scheme following
Heldner and Edlund (2010).
main speaker (e.g., yeah, right) and other minimal utterances that do not constitute an attempt to take the floor. The Sacks et al. (1974) model also predicts signs of overlap avoidance when overlap occurs, for instance by speakers’ abandoning their turns without reaching a point of turn completion. Another sign of speakers’ special orientation to overlapping talk is that they may engage in competition for the floor, for instance by repeating syllables or words, often with increased intensity and pitch levels (Schegloff, 2000).

To estimate the prevalence of such possible causal contexts for overlap, in a separate analysis we randomly sampled 100 between-overlaps and 100 within-overlaps from our data, and annotated them for a number of relevant features, including (a) the presence of a backchannel or brief token of agreement (e.g., yeah, right) in either the overlapped or overlapping utterance, (b) the presence of a period of silence within 200 ms from the beginning of the overlap period, (c) the presence of a transition relevance place (a point of syntactic, prosodic and pragmatic completion) in the overlapped turn within the 500 ms leading to the overlap, (d) an abandoned (i.e., syntactically and prosodically incomplete) utterance by either of the two speakers during or immediately following the overlap interval, and (e) the presence of repeated syllables or words in either speaker’s utterances during or immediately following the overlap interval. Other recurrent features observed during or close to the overlap interval, such as laughter and disfluencies, were also annotated.

Table 2 shows the most frequent features observed in the data (note that the features are not mutually exclusive). Interestingly, the majority of overlap cases (73%) involved a backchannel. Backchannels, especially continuers like “mm hm” or “uh huh,” are not construed as full turns, but rather pass up the opportunity to take a turn, and are thus principled intrusions into the other’s speech (Schegloff, 2000). It should be noted that, in half of the between-overlaps, it was not the backchannel that incurred the overlap, but rather the main speaker who produced an utterance in overlap with the backchannel. We also noted that overlapping backchannels often occurred after a TRP or a period of silence, suggesting that their timing is sensitive to specific cues in the main speaker’s turn (cf. Gravano and Hirschberg, 2009).

The second most common feature (37%) was the presence of a possible transition-relevance place (i.e., a point of syntactic, intonational, and pragmatic turn completion) in the overlapped turn within a time window of 500 ms before the start of the overlap. Another common feature was a period of silence (29%). In cases with this feature, one of the two speakers produced an utterance shortly after her interlocutor. These cases often involved a backchannel (n = 35, or 60%), or resulted in one of the two speakers abandoning their turn prematurely before reaching a point of syntactic and prosodic completion (n = 14, or 24.1%).

The presence of a disfluency in the utterance of the overlapped speaker before the start of the overlap (i.e., short silent pauses, repeated syllables or words, or noticeable decreases in speech rate) was also common. In these cases, it seems that the recipient produced a backchannel in response to the disfluency at a point when the interlocutor had already resumed her turn, causing overlap. In total, cases exhibiting one or more of these six features accounted for 95% of the data.

The remaining 10 cases involved three terminal between-overlaps affecting the last syllable of the previous turn, two cases exhibiting laughter by one of the speakers, two cases involving a turn-initial particle (i.e., uhm and well) produced in overlap with the last syllable of the preceding turn, one case with a speaker talking to someone else in the room, and one case of overlap due to a clear phonetic segmentation error in the annotation.

Our analysis thus confirms that overlaps, though reasonably common (30% of transitions), are of short duration (i.e., less than 5% of the speech signal; between-overlaps have a modal duration of 96 ms), occur largely in principled places (e.g., in between-overlaps, after possible completions, in simultaneous turn-starts), and mostly involve backchannels (which do not constitute full turns). In light of these observations, we conclude that the vast majority of instances of overlap in our dyadic conversations are consistent with the turn-taking system proposed by Sacks et al. (1974).
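
The percentages in Table 2, and the sub-percentages quoted for the overlaps that follow a silence, can be recomputed from the raw counts. The short Python sketch below is illustrative only and was not part of the original analysis. The variable and function names are hypothetical, and it assumes (as the reported figures imply) that the 35 backchannel cases and the 14 abandoned-turn cases are fractions of the 58 sampled overlaps that follow a silence.

```python
# Counts from Table 2 as (between-overlaps, within-overlaps); each sample has n = 100.
counts = {
    "Backchannel or agreement present": (74, 72),
    "Follows TRP (<500 ms)": (23, 51),
    "Follows silence (simultaneous start)": (21, 37),
    "Abandoned turn": (21, 18),
    "Follows disfluency in interlocutor's turn": (4, 18),
    "Repeated syllables or words": (4, 12),
    "Any of the six features above": (93, 97),
}
SAMPLE_SIZE = 200  # 100 between-overlaps + 100 within-overlaps


def total_percentage(between: int, within: int, n: int = SAMPLE_SIZE) -> float:
    """Share of the pooled sample showing a feature, in percent."""
    return 100 * (between + within) / n


percentages = {name: total_percentage(b, w) for name, (b, w) in counts.items()}
for name, pct in percentages.items():
    print(f"{name}: {pct:g}%")

# Assumption: the sub-counts quoted in the text refer to the silence cases only.
silence_cases = sum(counts["Follows silence (simultaneous start)"])
backchannel_share = round(100 * 35 / silence_cases, 1)
abandoned_share = round(100 * 14 / silence_cases, 1)
print(silence_cases, backchannel_share, abandoned_share)
```

Under this reading, the 58 silence cases reproduce the reported figures of roughly 60% and 24.1%.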
TABLE 2 | Frequency of seven features in a subset of 200 cases of overlap (100 between-overlaps, and 100 within-overlaps) extracted from our Switchboard data.

Feature                                     Between-overlaps   Within-overlaps   Percentage in total
                                            (n = 100)          (n = 100)         (n = 200)
Backchannel or agreement present            74                 72                73%
Follows TRP (<500 ms)                       23                 51                37%
Follows silence (simultaneous start)        21                 37                29%
Abandoned turn                              21                 18                19.5%
Follows disfluency in interlocutor’s turn   4                  18                11%
Repeated syllables or words                 4                  12                8%
Any of the six features above               93                 97                95%

Note that observations can exhibit more than one feature at the same time (e.g., cases of overlap after a period of silence involving a backchannel).

6. Psycholinguistics

Psycholinguistic processing puts tight constraints on any psychologically real model of turn-taking. Here we first draw attention to the early sensitivity to turn-taking in child development. Then we consider three main psycholinguistic aspects: predictive theories of language comprehension, studies of language production (from conceptual planning to speech articulation), and ideas about the relation between these two processes. Finally we turn to a small number of experimental studies aimed at understanding the relationship between comprehension and production processes in turn-taking.

6.1. ‘Proto-Conversation’ and Turn Taking in Human Development

Parallel to Sacks et al. (1974), in the 1970s there was an interest in children’s acquisition of turn-taking abilities. Trevarthen
(1977) and Bruner (1983) coined the term “protoconversation” for the rhythmic alternation of vocalizations between caregiver and infant in the early months of life, and its systematic properties were demonstrated by Bateson (1975), with average turn transitions of about 1.5 s at 3 months. Subsequent work showed that this gap reduced in the following pre-linguistic months to around 800 ms (Jasnow and Feldstein, 1986; Beebe et al., 1988). Such early onset suggests that turn-taking may have an instinctive basis. Garvey and Berninger (1981) showed that the gap duration increased toward a second and a half in toddlers, presumably because of the cognitive difficulties of language production, and remained at around a second even for 5-year-olds [this slow convergence with adult norms has recently been confirmed for a larger sample by Stivers et al. (under review)].

After a long pause, there is now renewed interest in the development of turn-taking and its timing in children, and we now have better data, methods and concepts. Using audiovisual corpus techniques, Hilbrink et al. (submitted) have confirmed the general pattern earlier reported, namely relatively fast transitions in the prelinguistic period, with a slowing down as language starts to be comprehended at 9 months. Using eye-tracking of infants watching dyadic interaction, several studies have shown that 3-year-old observers of dyadic conversations between two adults can anticipate speaker transitions (Tice and Henetz, 2011; Casillas and Frank, 2013, submitted; Keitel et al., 2013). Although the gaze shifts tend to occur in the gap (i.e., not in overlap with the turn preceding the floor transition), known saccade latencies for infants are c. 300 ms (Fernald et al., 2008), showing that they have often systematically detected the end of the turn before the gap. Researchers have also been able to show that by 3 years of age, children are using intonation to do this projection of turn-ends (Keitel et al., 2013). Casillas and Frank (submitted) found that 3-year-olds were just as good at anticipating speaker change as adults, and did so more after questions than statements. They then looked at younger infants and filtered the speech, so they could distinguish whether prosody or lexico-syntax was enabling this anticipation. They found that 1- and 2-year-olds were better than chance at anticipating transitions, and that anticipation improves with age. Children under 3 were better in the prosody-only condition (with words filtered out) than they were in the words-only condition (with prosody filtered), indicating an early advantage for prosody (adults only showed an advantage for words + prosody). Clearly these studies confirm that projection is a real phenomenon, that it is learnt early, and that prosody plays an important role in this ability. They also indicate that turn-taking is established before language, that it forms a framework for language acquisition, and that the complexities of language slow down the framework through middle childhood.

6.2. Predictive Language Comprehension

Early in the history of psycholinguistics, Chomsky (1969, p. 57) insisted that probability and prediction had no possible role to play in a scientific theory of language: “It must be recognized that the notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.” He reasoned that a grammar bounds a discrete infinity, and hence there was no core role for prediction in language understanding. The spell lasted decades, but meanwhile both engineering and psycholinguistic experiments have demonstrated a core role for statistical learning and estimation in language comprehension. For example, eye-movement studies in the visual world paradigm show that listeners predict upcoming entities from likely collocations (e.g., hearing “the boy is eating” participants look at the cake and not the ball in the picture). Determiners (e.g., French un vs. une), adjectives (“freshly baked”) and verbs (“eat”) can predict nouns by their selectional restrictions; in languages that have verbs at the end of the sentence, like Japanese, participants can use the nouns to predict the verbs (Altmann and Kamide, 1999; Kamide et al., 2003). Another source of insight comes from EEG, where it can be shown that the syntactic frame is used to predict upcoming material. For example, when the sentential context leads one to expect a specific noun (‘she carried the eggs in a . . .’) but the gender of an incoming article is incongruous, an N400 may be evoked before the noun itself is encountered (e.g., in Spanish una canasta ‘a basket’ vs. un costal ‘a sack’). These studies use the inverse correlation between the cloze probability and the amplitude of an N400 to demonstrate prediction (it is hard in fact to distinguish prediction from integration difficulties; see Kutas et al., 2011 for review). Predictive language comprehension is not only achieved on the basis of semantic and morphosyntactic regularities. In an experiment involving visual searches under the directions of a confederate, Ito and Speer (2008) showed that participants could anticipate referents on a screen (e.g., a “drum” vs. a “ball”) on the basis of the location of contrastive pitch accents in the vocal instructions being given to them (e.g., “now take the GREEN ball” vs. “now take the green BALL”). Listeners therefore appear to be able to use different sorts of linguistic information (i.e., semantic, morphosyntactic, prosodic) in order to predict the content of an incoming utterance. For an overview of recent work on predictive language understanding see Pickering and Garrod (2013).

Recent investigations have also shown direct connections of these predictive inferences to projection in conversation. Gisladottir et al. (2015) conducted an EEG experiment in which participants listened to mini-dialogs of two turns. The second turn (e.g., “I have a credit card”) could be invariant over three conditions: a question like “How are you going to pay?,” an offer like “I can lend you the money,” or a trouble announcement like “I don’t have any money.” In each of the three contexts, the same second turn performs a different speech act (i.e., an answer, a declination, or an offer). The EEG signal, averaged over many such adjacency pairs, showed that very early (often in the first 400 ms) the different speech act forces of the response were predicted. Speech act detection is the precondition to response preparation, and it seems to be an early predictive process. A second relevant study (Magyari et al., 2014) looked at the EEG signal of participants listening to turns extracted from genuine conversations whose turn-endings they had to predict by pressing a button. These turns had already been sorted into unpredictable vs. predictable by a cloze test, where participants had to guess the missing words of items cut off at various points. The predictable turns (compared to the unpredictable ones) showed a very early
EEG signature of preparation to respond about half way through the turn (c. 1200 ms before the end). Recently Riest et al. (2015) showed experimentally that responses based on prediction are not significantly different from those based on pre-knowledge. They also incidentally attempt to estimate stochastic tendencies for possible reactive responses (although these stimuli are non-linguistic and do not have the uncertainty associated, e.g., with voiceless stops). These studies together suggest that quite long-range prediction is normally involved in understanding language in a conversational mode.

6.3. Latencies in Language Production

There are striking differences between language comprehension and production despite the fact that the processes must be intimately related. One of the clearest differences is in processing speed. Speech production is a bottleneck on the whole language system: at about an average of seven syllables per second, speech can be estimated to have a bit-rate of under 100 bps (Levinson, 2000, p. 28). Studies of language production show that pre-articulation processes run three or four times faster than actual articulation (Wheeldon and Levelt, 1995). Studies of language comprehension under compression show that people can parse and comprehend speech at three or four times the speed of speech production (Calvert, 1986, p. 178; Mehler et al., 1993). Speech encoding is one part of the process that has to be strictly serial. Articulation is thus a severe bottleneck on communication, and the system compensates by utilizing pragmatic heuristics in production that augment the coded message (Levinson, 2000).

Happily, there have been extensive studies of language production that allow us to quantify the latency in each part of the production process, using picture naming as a task (Levelt, 1989). The average reaction time from seeing a picture to beginning to name it has been estimated at 600 ms (Indefrey and Levelt, 2004, p. 106). The literature unfortunately gives no ranges or standard deviations, with the exception of a study by Bates et al. (2003), which provides cross-linguistic averages that are much longer at over 1000 ms, with all minimums over 650 ms. Indefrey and Levelt (2004, p. 108), on the basis of a meta-study of available experiments, propose approximate figures for each stage of the process, which we show in Table 3.

TABLE 3 | Estimated average time windows for successive operations in spoken word encoding (Indefrey and Levelt, 2004, p. 108).

Operation                                                                      Duration (ms)
Conceptual preparation (from picture onset to selecting the target concept)   175
Lemma retrieval                                                                75
Form encoding:
  Phonological code retrieval                                                  80
  Syllabification                                                              125
  Phonetic encoding (till initiation of articulation)                          145
Total                                                                          600

For multiword utterances, the effect is not linear. Naming two nouns takes 740–800 ms before output begins, with evidence that the processing of the second noun has begun but not finished by this time, while 900 ms is required for three-word utterances (Schnur et al., 2006). Most of these studies incidentally (but not Bates et al., 2003) involve pre-familiarization of the words and pictures, so these response times are effectively after some amount of priming.

There is also good information on the planning required for sentence production from eye-movement studies. When participants are shown pictures of simple transitive or intransitive scenes (e.g., boy kicking ball, girl running), it takes about 1500 ms before speech output begins (Griffin and Bock, 2000; Gleitman et al., 2007). Interestingly, what happens within this 1500 ms is language-dependent – for example verb-first languages show rather different visual scanning of the pictures than verb-medial languages (Norcliffe et al., 2015), but the latencies remain similar.

During this period of planning for language production, output processes involve the synergies between multiple speech organs. For example, breathing for speaking may need to be initiated. Earlier studies have shown that such breathing activity involves a number of latencies: first, c. 140–320 ms must be allowed for from the time the decision to inhale is made till the time the signal reaches the intercostal muscles (Draper et al., 1960); second, the inhalation time in spontaneous dialog is typically over 500 ms long (McFarland, 2001, p. 136). Together, these numbers suggest a latency of at least 500–800 ms prior to speech. In a recent study of breathing in conversation (Torreira et al., 2015, this volume), we have shown that short responses to questions are often made on residual lung air, whereas longer responses are likely to require a planned inhalation. The actual inhalation most typically starts shortly (i.e., 15 ms) after the end of the interlocutor’s question, and it is probably triggered just before the phonological retrieval process for the first word of the planned response. Thus the breathing data suggest that whether or not inhalation is required is a decision made during conceptual planning of the response, and that the trigger for inhalation, most typically produced during the last few hundred ms of the interlocutor’s turn, is often based on a prediction that the current speaker will imminently end her turn.

Recent studies of vocal preparation using ultrasound techniques show that tongue movements preceding speech production start considerably before the acoustic signal, with clear preparation between 120 and 180 ms prior to the acoustic release (Schaeffler et al., 2014) and with some effects detectable as early as 480 ms (Drake et al., 2014). Although not yet studied in a conversational context (although see de Vos et al., 2015, this volume, for the parallel in signed conversation), these measurements provide further estimates of the latencies involved in language production. These latencies are perhaps not surprising given the complexity of language encoding and the need for the processes to be funneled into a single, serial sequence of operations. Donders (1869) showed that reaction time varies with the number of choices that need to be made, and Hick’s Law (Hick, 1952) suggests this relation is generally logarithmic (reaction time will increase with decision time, where decision time T = log₂(n) and n is the number of equally
probable choices). When one considers that in production single words have to be plucked from a word lexicon consisting of over 20,000 entries, one can see immediately the processing problems involved. Combined with the relatively slow nature of nerve conduction (known since Helmholtz, 1850), and the complexity of the coordination of c. 100 muscles involved in articulation (Levelt, 1989), slow reaction times can be expected.

To summarize, language production involves latencies of well over half a second, and a multi-word utterance is likely to involve a second or more of processing before articulation begins. Although the conversational context may expedite some of these processes, the bulk of this latency is attributed to the phonological and phonetic encoding processes (as are frequency effects, Jescheniak and Levelt, 1994) which are probably not compressible.

6.4. Experimental Studies of Turn-Taking

There have been as yet relatively few experimental studies of turn-taking, due to the difficulties involved in gaining sufficient experimental control in free interaction. However, indirect light has been thrown on the mechanisms by extracting turns from conversation and experimentally testing when and how participants detect turn ends. De Ruiter et al. (2006) extracted turns from a corpus of conversations in Dutch, and got participants to press a button in anticipation of turn endings. They manipulated the turns so that there were versions where pitch information was filtered out (No Pitch), where the words were masked but the pitch preserved (No Words), where both were filtered (No Pitch, No Words) and finally where amplitude variation was also removed (Noise condition). They found that accuracy of turn-end anticipation was preserved under No Pitch, but significantly lost under No Words, and hugely affected under the other conditions, and they claim that “The conclusion is clear: lexicosyntactic structure is necessary (and possibly sufficient) for accurate end-of-turn projection, while intonational structure, perhaps surprisingly, is neither necessary nor sufficient” (De Ruiter et al., 2006, p. 531).

This study suggested then that lexicon and syntax are the key guide to turn-structure and completion. But there are aspects of prosody and articulation that may be critical, and in the normal case intonation may also be an important signal. To test this, Bögels and Torreira (in press) used turns taken from multiple scripted interviews, with questions like “So you’re a student at Radboud University?” (long version) vs. “So you’re a student?” (short version). The short versions exhibited a higher maximum pitch and greater duration on the last syllable of the word ‘student’ than the long versions, due to the presence of an intonational phrase boundary at the end of this word in the short questions, but not in the long questions. They cross-spliced their materials in different ways, and did the same button-press experiment as De Ruiter et al. (2006). Participants often false alarmed (pressed the button) at ‘student’ when a phrase-final word was cross-spliced into the middle of the long version – they were clearly using the prosodic information to anticipate turn closure. Participants were also presented with truncated long sentences ending in a syntactic point of completion, but lacking a final intonation phrase boundary: now participants only reacted on average around 400 ms after the end of the stimulus, suggesting that in this case participants’ button presses were produced in reaction to silence. On the other hand, in another condition consisting of similar words, but featuring a final intonational boundary, RTs were around 100 ms on average, suggesting reaction to or local prediction of an intonationally well-formed question end. It should be noted that while pitch had been filtered in the De Ruiter et al. (2006) study, duration and other phonetic cues to prosodic structure were still present in their filtered No Pitch condition. This new study shows that participants do use prosodic cues to judge turn-ending. What the De Ruiter et al. (2006) study does establish is that they need to be integrated with the lexical/syntactic information to carry turn-ending indications.

There are other experimental techniques that can be used to explore turn-taking. One is to use confederates (Bavelas and Gerwing, 2011), another to use the visual world paradigm with eye-tracking (Sjerps and Meyer, 2015). The latter study, using a dual task paradigm, found that maximal interference in the non-linguistic task occurred 500 ms before the end of the incoming turn (see also Boiteau et al., 2014); however, the linguistic task involved visual monitoring and was non-contingent with the incoming turn, so it was far removed from conversation.

A method that combines control with live interaction involves alternating live and pre-recorded responses in such a way that participants are unaware of the manipulation (Bögels et al., 2014). In a recent study, we exploited this technique in a quiz-game (Bögels et al., submitted). Participants were recorded for EEG in a shielded room, and could not see the quiz master – this allowed some of the interaction to be live, some pre-recorded. The quiz questions were designed so that in some the answer was available early, and that in others the answer was available only toward the very end of the question, as in:

    Which character, also called 007, appears in the famous movies? (Early)
    Which character from the famous movies, is also called 007? (Late).

In a second experiment, participants heard the same questions but did not have to answer them. Instead, they only had to remember them, as prompted by later probes. The neural patterns in the first experiment, where participants had to respond verbally, were then compared with those in the second, where they only had to comprehend and memorize. The results revealed a clear neural signature associated with production, localized in the appropriate areas, occurring within 500 ms of the point at which a plausible answer to the question became available. Bögels and colleagues interpreted this as showing that participants begin planning their response as soon as they can, up to a second or more before the incoming turn ends.

6.5. The Core Psycholinguistic Puzzle

From a psycholinguistic point of view, turn-taking presents the following puzzle: in spite of the long latencies involved in language production (600–1500 ms or more), participants often manage to achieve smooth turn transitions (with the most typical gaps as little as 100–300 ms). As a solution to this puzzle, we suggest
that comprehension is predictive, even more so than is currently thought. As soon as possible, participants extract the speech act of the incoming utterance, which is the sine qua non for planning their appropriate response. In order to overcome the production latencies, they must also start the planning and encoding of the response as soon as possible.

This suggests that there is a significant overlap of comprehension and production processes. Given an average turn (approximated as an interpausal unit in our Switchboard Corpus data) of 1680 ms, somewhere in the middle response preparation may already be underway. This provides a second central puzzle: conversation involves constant double tasking, and this double tasking uses the same language system. The difficulty of the puzzle is increased when one takes into account the findings that both comprehension and production use much of the same neural circuitry (Segaert et al., 2011). It is plausible that the difficulty here is overcome through rapid task switching, and the gradual switch of resources from comprehension of the incoming turn toward production of the response.

Pickering and Garrod (2013) outline a general model of psycholinguistic processing, suggesting that production and comprehension are intimately intermeshed. Just as generally in action control, forward prediction of one’s own action is performed to correct deviations, so in interaction forward prediction of the other’s actions is used to check perception, and aid preparation of response. This is a nice account, but the complexities rapidly multiply. Listeners, on this account, are both using their full comprehension system, and running a fast simulation of the other’s production in order to predict the outcome. Now, given the turn-taking facts established above, we must add to this computational burden the need to simultaneously prepare one’s own turn in advance, involving both the full production system and a hypothesized fast forward predictor. So the poor listener who is about to respond has not only the full comprehension and production processes running simultaneously, but also two fast prediction systems (one for self, one for other). This quadruple tasking looks unlikely, especially as similar tasks are hard to multitask. Additional problems are that unlike physical action prediction, which can be estimated by a few heuristics, it is not clear how a fast approximate language prediction system would be feasible especially in production – producers have to grind through the syntax to find, e.g., what order to put words in. More likely the real production system may be involved minus the phonological and phonetic encoding, which account for the bulk of the production latency.

In any case, regardless of how this is achieved, the experimental and corpus studies reviewed in this section converge in showing that participants in conversation often anticipate the content of the others’ turns well in advance, and that they use that information to prepare their response early.

observations and constraints, as originally noted by Sacks et al. (1974, p. 700). We are now, however, able to add both additional constraints and a certain amount of temporal precision to those early observations:

(1) Turns are mostly short (mean 1680 ms, median 1227 ms; see Section 5.2.1), consisting of one or more interjections, phrases or clauses at the syntactic level, and one or more intonational units at the prosodic level. Turn ends typically co-occur with points of both syntactic and prosodic completion.
(2) Intra-speaker gaps are longer by c. 150 ms than inter-speaker gaps (ten Bosch et al., 2005), suggesting ordered rules (the rights to the next turn unit belong first to the next speaker, and only if not exercised, to the current speaker).
(3) Inter-speaker gaps are most typically short, with modal values for FTOs falling between 100 and 200 ms (cf. Figure 2). Medium gaps and short overlaps are also common, although less so than short gaps.
(4) Lengthy gaps (over 700 ms) may carry semiotic significance (mostly, of an undesired or unexpected response; Kendrick and Torreira, 2015), thus helping to propel fast timing.
(5) Overlaps, though common, are brief (with a mean of 275 ms at turn-transitions, and occupying less than 5% of the spoken signal in our telephone-call data). Overlaps are more common at turn transitions than within turns, and mostly involve back-channels, simultaneous first-starts, disfluencies, and other features predicted by Sacks et al. (1974).
(6) Turn-taking is established early in infancy, long before full linguistic competence, which actually appears to slow down response times; adult conversation timing is not achieved till late in middle childhood.
(7) Given the latencies of speech production (over 600 ms), incoming turns have to be predicted if accurate timing is to be achieved. EEG recordings suggest the production process in responsive turns starts as soon as the gist of the incoming turn can be detected.
(8) Turn-final cues seem to be used to recognize that a turn is definitely coming to an end. These cues are typically prosodic (e.g., phrase-final syllable lengthening and specific melodic patterns in many intonational languages) but also syntactic (e.g., syntactic closure), and in principle could be of other types too (e.g., gestural). In the appropriate pragmatic context, these turn-final cues can trigger the decision of the next speaker to articulate. From the point of view of social interaction, it is effective articulation that constitutes a point of no return (as opposed to other preparatory events preceding speech, such as pre-utterance inhalations and mouth noises).
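
The arithmetic behind the timing puzzle of Section 6.5 can be made explicit. The following Python sketch is illustrative only and not part of the original analyses: it combines the production latencies (600–1500 ms), the typical gaps (100–300 ms), and the mean turn duration (1680 ms) cited in the text to estimate how long before the end of the incoming turn response planning would have to begin. The function name `planning_lead_ms` is hypothetical, and the sketch makes the simplifying assumption that planning runs uninterrupted and takes exactly the quoted latency.

```python
# Illustrative arithmetic only; the figures are the approximate values quoted in the text.
MEAN_TURN_MS = 1680  # mean turn (interpausal unit) duration


def planning_lead_ms(production_latency_ms: float, gap_ms: float) -> float:
    """Time before the end of the incoming turn at which planning must start
    so that speech begins `gap_ms` after that turn ends.

    Simplifying assumption: planning is uninterrupted and lasts exactly
    `production_latency_ms`.
    """
    return production_latency_ms - gap_ms


for latency in (600, 1500):
    for gap in (100, 300):
        lead = planning_lead_ms(latency, gap)
        share = lead / MEAN_TURN_MS
        print(
            f"latency {latency} ms, gap {gap} ms -> "
            f"start {lead} ms before turn end ({share:.0%} of a mean turn)"
        )
```

Under these assumptions the required lead ranges from 300 ms to 1400 ms, that is, from under a fifth to most of a mean-length turn, which is why response preparation must overlap with comprehension of the incoming turn.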
turns are minimal of course, but in this case a bid must be made for an extended turn, as in:

(9) Terasaki, 1976, p. 53
D: I forgot to tell you the two best things that happen’ to me today.
R: Oh super=What were they?
D: I got a B+ on my math test ((material omitted)) and I got an athletic award.

An alternative model is the turn-end signaling system proposed by Duncan (1972), also mentioned above, under which the system is wholly in the control of the current speaker, who has exclusive rights and signals transfer at the end of the turn. In contrast, Sacks et al. (1974) held that “It is misconceived to treat turns as units characterized by a division of labor in which the speaker determines units and boundaries,” instead, “the turn as a unit is interactively determined.”

Duncan (1972, p. 286) proposed a simple rule of the sort “The auditor may take his speaking turn when the speaker gives a turn-yielding signal.” Such a system would be in effect like the “over and out” cuing at the end of turns on a two-way (half-duplex) radio, which permits hearing or talking but not both at once by a single party. Such a system predicts that overlap can only occur when “over” cues are mistakenly given or overridden; the large incidence of overlaps in corpora, and their clustering at principled locations (like overlapped tags or address forms), is then hard to reconcile with such a model.

As mentioned, the model presumed that these turn-yielding signals such as intonation are context-independent, but in fact we know they are not – e.g., in English final rising intonation in a question may signal finality but in a statement continuation; thus their interpretation would have to be embedded in complex comprehension processes. The model is in any case very partial: it tells us nothing about how or why people should initiate a turn, why turns are generally short, how multiple participants can be integrated into a single conversation, how overlap is resolved, and so forth. But it may add a component to a more complex overall model.

7.2. Toward an Adequate Psycholinguistic Model of Turn Taking – Cognitive Processes in the Responder¹

We believe that the property list in Section 7 above puts fairly narrow constraints on a possible model of turn-taking. One area of particular interest is the temporal constraints that turn-taking imposes on language processing, given that conversational interchange is the core form of language use. These constraints are funneled into one crucial link in the system, namely, the current addressee preparing to respond. Here we consider the cognitive processes that must be involved.

¹ The ideas presented in this section were developed in collaboration with Mathias Barthel, Sara Bögels, and the other members of the INTERACT project at the Max Planck Institute for Psycholinguistics. See also Section 5.3 in Heldner and Edlund (2010) for a parallel proposal.

The crucial questions concern what factors govern the decision-making process that lies behind the initiation and timing of response. While turn-final cues in the incoming turn seem likely to play a role, they cannot be sufficient given the long latencies in language planning and production. To overcome these long latencies, predictive comprehension must be involved, together with a strategy of early beginnings to production. Bögels et al. (submitted) suggest that production begins as soon as it can – that is, as soon as the speech act content of the incoming turn is clear. This implies of course dual-tasking, perhaps by rapid alternation (‘time sharing’). A new study using a dual-task paradigm and eye-tracking suggests that the heaviest interference is rather late (Sjerps and Meyer, 2015), and tied to looking-for-speaking, which was postponed in this task toward the end of the incoming turn. Both early and late processes are almost certainly involved, but what exactly is happening, and when, during natural conversation remains to be determined.

The flowchart diagram in Figure 3 sketches the cognitive processes that must minimally be at work in the recipient of a typical turn at talk during conversation. Predictive comprehension is underway early, and by halfway through, more predictable turns will already suggest a temporal envelope for completion (Magyari et al., 2014). If so, morphosyntax may provide most of the early clues to the overall structural envelope (e.g., turns beginning with if, either, or whenever project a two-clause structure), so offering some long-distance projection. Within the last half second or so, the actual words will often be predicted (Magyari and de Ruiter, 2012), and, within that same late time-frame, cues to imminent turn closure, usually prosodic and phonetic, are likely to appear (Local and Walker, 2012; Bögels and Torreira, in press), indicating a likely turn end.

A recipient’s first task is to identify or predict the speech act or action being carried out – both the illocutionary force and the likely propositional content. In cases in which the illocutionary force of the incoming utterance makes a floor exchange relevant or due, production planning may begin as soon as it is recognized, as suggested by the results in Bögels et al. (submitted). Production is, at least in the latter stages, serial, and proceeds through conceptualization, lemma retrieval, phonological retrieval, and phonetic encoding, following a time course that seems well understood (Indefrey, 2011), extending 600–1200 ms or more before articulation depending on the ease of retrieval and the length of the turn. In this model, early preparation is assumed, but actual articulation is held until turn-final cues (e.g., upcoming syntactic closure, a non-turn-keeping intonational phrase boundary) are detected, whereupon actual articulation is launched. Assuming these cues fall in the last half-second of the incoming turn, reaction to those will be sufficient to launch pre-prepared material so that it appears soon after the other’s turn is completed.

FIGURE 3 | Sketch of the interleaving of comprehension and production in the recipient of an incoming turn.

Figure 3 sketches the kind of interaction between comprehension and production processes that must be involved in a typical turn transition (i.e., involving an FTO of c. 200 ms). There is an early gist comprehension with speech act apprehension sent as soon as possible to the production conceptualizer (see Levinson, 2013; Gisladottir et al., 2015). The production system may automatically begin to formulate right down to the phonology (Bögels et al., submitted), but with the actual articulation held in a buffer until the comprehension
system signals an imminent completion of the incoming turn. Prior to that signal, it is likely that pre-articulation preparation (requiring c. 200 ms) of the vocal apparatus would be underway – this would include readying the vocal tract for the gestures to be made (see Drake et al., 2014; Schaeffler et al., 2014), and the decision to inhale prior to delivery of longer responses (Torreira et al., 2015, this volume).

Meanwhile the comprehension system continues to check the incoming signal for possible closure at both the syntactic and prosodic level. As soon as there are consistent signals of linguistic completion, a go-signal is sent to production, and any buffered articulation released. It is likely that visual monitoring of gesture can also be utilized for the go-signal (Duncan, 1974), but this awaits experimental confirmation.

This model is responsive to all the constraints listed in Section 7. What this model crucially adds is:

(a) an account of how responders can often respond with short latencies despite the long latencies of the production system;
(b) why the corpus statistical results reliably show a modal response with positive offsets of around 100–300 ms, reflecting the reaction time to the turn-final prosodic cues in the incoming turn (i.e., reaction to the go-signal, as hypothesized by Heldner and Edlund, 2010).

The model sketch in Figure 3 is based on average, modal, and minimal temporal latencies reported in the literature. We would like to propose that this model is generally valid in the most frequent scenarios. If speakers launched their responses as early as they could without waiting for turn-final cues, we should expect overlapping or no-gap–no-overlap transitions to be the most common, rather than a short gap. And, if speakers typically launched language planning only after identifying turn-final cues, we should expect the most frequent transition times to involve at least half a second or more rather than short gaps of 100–300 ms. The model therefore captures the most typical turn transition values observed in conversational corpora.

What, however, accounts for the significant number of overlap and long gap cases observable in any conversation? A reviewer suggests that human factors such as lack of attention, pre-formulated agendas, and apparent involvement with actual minimal responsiveness may all be involved, and notes that apparent good timing may be achieved with buffers like particles. However, the evidence is that conversation is generally more demanding than that – for example 95% of questions get answers (Stivers, 2010), and particles like well and uhm in English are semiotically loaded and thus not empty buffers (Kendrick and Torreira, 2015), while Roberts et al. (2015) failed to find statistical differences in the timing of turns with and without such particles. In addition, it is likely that speakers sometimes use turn-taking strategies other than the one sketched in Figure 3. For example, under competition for the floor, or when responding to highly predictable utterances, speakers may decide to launch articulation without waiting to identify turn-final cues. In cases of long transition latencies, speakers may not have been able to plan the initial stages of their turn early enough to launch articulation when the interlocutor’s turn-final cues become available. This may indeed be due to a low attentional level on the part of the speaker, or to the interlocutor’s turn being unclear in purpose until its end, or simply to the complexity of the response required (Torreira et al., 2015, this volume).

8. Conclusion

This overview of work on turn-taking behavior over the last half century shows that turn-taking is a remarkable phenomenon, for it combines high temporal coordination between participants with the remarkable complexity and open-endedness of the language that fills the turns. The tension between these two properties is reflected in the development
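
The timing argument of Section 7.2 can be made concrete with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions, not the authors' model: it computes the floor transfer offset (FTO) that each of the three response strategies contrasted in the text would produce. The names `Timing`, `fto_launch_early`, `fto_buffer_until_cue` and `fto_plan_after_cue` are hypothetical. The example values for when the speech act is recognized (1000 ms before turn end) and when turn-final cues become detectable (50 ms before turn end) are assumptions chosen for illustration; the 600 ms planning latency and the 200 ms reaction time are picked from the ranges cited in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    """All values in ms; time points are relative to the end of the incoming turn (t = 0)."""
    recognized_at: float  # when the speech act becomes clear (assumed value; negative = before turn end)
    cue_at: float         # when turn-final cues become detectable (assumed value)
    planning: float       # production planning latency (text cites 600-1200 ms or more)
    reaction: float       # reaction time to the go-signal (text cites offsets of 100-300 ms)


def fto_launch_early(t: Timing) -> float:
    """Plan early and articulate as soon as planning is done, ignoring turn-final cues."""
    return t.recognized_at + t.planning


def fto_buffer_until_cue(t: Timing) -> float:
    """Plan early, hold articulation in a buffer, release it on the go-signal."""
    ready = t.recognized_at + t.planning   # moment the buffered response is available
    released = t.cue_at + t.reaction       # moment the go-signal takes effect
    return max(ready, released)            # cannot speak before both have happened


def fto_plan_after_cue(t: Timing) -> float:
    """Start planning only once turn-final cues have been identified."""
    return t.cue_at + t.planning


# Illustrative values only (see the assumptions stated above).
EXAMPLE = Timing(recognized_at=-1000, cue_at=-50, planning=600, reaction=200)

if __name__ == "__main__":
    for name, strategy in [
        ("launch early", fto_launch_early),
        ("buffer until cue", fto_buffer_until_cue),
        ("plan after cue", fto_plan_after_cue),
    ]:
        print(f"{name:>16}: FTO = {strategy(EXAMPLE):+.0f} ms")
```

With these values the three strategies give offsets of -400 ms (overlap), +150 ms (a short gap) and +550 ms (a long gap), which mirrors the reasoning in the text: only the buffered strategy lands in the 100–300 ms range. Moving `recognized_at` to -300 makes the buffered strategy return +300 ms, the case in which planning could not start early enough to be ready when the cues arrive.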
References

Altmann, G., and Kamide, Y. (1999). Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition 73, 247–264. doi: 10.1016/S0010-0277(99)00059-1
Bates, E., D’Amico, S., Jacobsen, T., Székely, A., Andonova, E., Devescovi, A., et al. (2003). Timed picture naming in seven languages. Psychon. Bull. Rev. 10, 344–380. doi: 10.3758/BF03196494
Bateson, M. C. (1975). Mother-infant exchanges: the epigenesis of conversational interaction. Ann. N. Y. Acad. Sci. 263, 101–113. doi: 10.1111/j.1749-6632.1975.tb41575.x
Bavelas, J. B., and Gerwing, J. (2011). The listener as addressee in face-to-face dialogue. Int. J. Listening 25, 178–198. doi: 10.1080/10904018.2010.508675
Beebe, B., Alson, D., Jaffe, J., Feldstein, S., and Crown, C. (1988). Vocal congruence in mother-infant play. J. Psychol. Res. 17, 245–259. doi: 10.1007/BF01686358
Bögels, S., Barr, D., Garrod, S., and Kessler, K. (2014). Conversational interaction in the scanner: mentalizing during language processing as revealed by MEG. Cereb. Cortex doi: 10.1093/cercor/bhu116 [Epub ahead of print].
Bögels, S., and Torreira, F. (in press). Listeners use intonational phrase boundaries to project turn ends in spoken interaction. J. Phonet.
Boiteau, T. W., Malone, P. S., Peters, S. A., and Almor, A. (2014). Interference between conversation and a concurrent visuomotor task. J. Exp. Psychol. Gen. 143, 295–311. doi: 10.1037/a0031858
Brady, P. T. (1968). A statistical analysis of on-off patterns in 16 conversations. Bell Sys. Tech. J. 47, 73–91. doi: 10.1002/j.1538-7305.1968.tb00031.x
Bruner, J. (1983). Child’s Talk. New York, NY: Norton.
Buck, J., and Buck, E. (1976). ‘Synchronous fireflies’. Sci. Am. 234, 74–85. doi: 10.1038/scientificamerican0576-74
Byrd, D. (1993). 54,000 American stops. UCLA Work. Papers Phon. 83, 97–116.
Calhoun, S., Carletta, J., and Brenier, J. M. (2010). The NXT-format switchboard corpus: a rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Lang. Resour. Eval. 44, 387–419. doi: 10.1007/s10579-010-9120-1
Calvert, D. R. (1986). Descriptive Phonetics, 2nd Edn. New York, NY: Thieme Medical Publisher.
Casillas, M. (2014). “Taking the floor on time: delay and deferral in children’s turn taking,” in Language in Interaction: Studies in Honor of Eve V. Clark, eds I. Arnon, M. Casillas, C. Kurumada, and B. Estigarribia (Amsterdam: Benjamins), 101–114. doi: 10.1075/tilar.12.09cas
Casillas, M., and Frank, M. C. (2013). “The development of predictive processes in children’s discourse understanding,” in Proceedings of the 35th Annual Meeting of the Cognitive Science Society, eds M. Knauff, M. Pauen, N. Sebanz, and I. Wachsmuth (Austin, TX: Cognitive Society), 299–304.
Clayman, S. (2013). “Turn-constructional units and the transition-relevance place,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 151–166.
Chomsky, N. (1969). “Quine’s Empirical Assumptions,” in Words and Objections, eds D. Davidson and J. Hintikka (Dordrecht: Reidel), 53–68. doi: 10.1007/978-94-010-1709-1_5
Couper-Kuhlen, E. (2009). “Relatedness and timing in talk-in-interaction,” in Where Prosody Meets Pragmatics, eds D. Barth-Weingarten, N. Dehé, and A. Wichmann (Leiden: Brill), 257–276. doi: 10.1163/9789004253223_012
Crystal, T., and House, A. (1988). Segmental durations in connected-speech signals: current results. J. Acoust. Soc. Am. 83, 1553–1573. doi: 10.1121/1.395911
De Ruiter, J. P., Mitterer, H., and Enfield, N. J. (2006). Projecting the end of a speaker’s turn: a cognitive cornerstone of conversation. Language 82, 515–535. doi: 10.1353/lan.2006.0130
de Vos, C., Torreira, F., and Levinson, S. C. (2015). Turn-timing in signed conversations: coordinating stroke-to-stroke turn boundaries. Front. Psychol. 6:268. doi: 10.3389/fpsyg.2015.00268
Donders, F. C. (1869). “On the speed of mental processes,” in Attention & Performance II, ed. and trans. W. G. Koster (Amsterdam: North-Holland), 412–431.
Drake, E., Schaeffler, S., and Corley, M. (2014). “Articulatory effects of prediction during comprehension: an ultrasound tongue imaging approach,” in Proceedings of the 10th International Seminar on Speech Production, Cologne.
Draper, M. H., Ladefoged, P., and Whitteridge, D. (1960). Expiratory pressures and air flow during speech. Br. Med. J. 1, 1837–1843. doi: 10.1136/bmj.1.5189.1837
Drew, P. (2013). “Turn Design,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 131–149.
Duncan, S. D. (1972). Some signals and rules for taking speaking turns in conversation. J. Pers. Soc. Psychol. 23, 283–292. doi: 10.1037/h0033031
Duncan, S. D. (1974). On the structure of speaker-auditor interaction during speaking turns. Lang. Soc. 2, 161–180. doi: 10.1017/S0047404500004322
Fernald, A., Zangl, R., Portillo, A. L., and Marchman, V. A. (2008). “Looking while listening: using eye movements to monitor spoken language comprehension by infants and young children,” in Developmental Psycholinguistics: On-line Methods in Children’s Language Processing, eds I. A. Sekerina, E. M. Fernandez, and H. Clahsen (Amsterdam: Benjamins), 97–135. doi: 10.1075/lald.44.06fer
Ford, C. E., and Thompson, S. A. (1996). “Interactional units in conversation: syntactic, intonational, and pragmatic resources for the projection of turn completion,” in Interaction and Grammar, eds E. Ochs, E. A. Schegloff, and S. A. Thompson (Cambridge: Cambridge University Press), 135–184.
Fry, D. B. (1975). Simple reaction-times to speech and non-speech stimuli. Cortex 11, 355–360. doi: 10.1016/S0010-9452(75)80027-X
Garvey, C., and Berninger, G. (1981). Timing and turn-taking in children’s conversations. Discourse Process. 4, 27–57. doi: 10.1080/01638538109544505
Gisladottir, R., Chwilla, D., and Levinson, S. C. (2015). Conversation electrified: ERP correlates of speech act recognition in underspecified utterances. PLoS ONE 10:e0120068. doi: 10.1371/journal.pone.0120068
Gleitman, L. R., January, D., Nappa, R., and Trueswell, J. C. (2007). On the give and take between event apprehension and utterance formulation. J. Mem. Lang. 57, 544–596. doi: 10.1016/j.jml.2007.01.007
Godfrey, J., Holliman, E., and McDaniel, J. (1992). “SWITCHBOARD: telephone speech corpus for research and development,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (San Francisco, CA: IEEE), 517–520. doi: 10.1109/icassp.1992.225858
Goodwin, C. (1980). Restarts, pauses, and the achievement of mutual gaze at turn-beginning. Soc. Inq. 50, 272–302. doi: 10.1111/j.1475-682X.1980.tb00023.x
Gravano, A., and Hirschberg, J. (2009). “Backchannel-inviting cues in task-oriented dialogue,” in Proceedings of SigDial 2009, London, 253–261.
Griffin, Z. M., and Bock, K. (2000). What the eyes say about speaking. Psychol. Sci. 4, 274–279. doi: 10.1111/1467-9280.00255
Hayashi, M. (2013). “Turn allocation and turn sharing,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 167–190.
Heldner, M. (2011). Detection thresholds for gaps, overlaps and no-gap-no-overlaps. J. Acoust. Soc. Am. 130, 508–513. doi: 10.1121/1.3598457
Heldner, M., and Edlund, J. (2010). Pauses, gaps and overlaps in conversations. J. Phon. 38, 555–568. doi: 10.1016/j.wocn.2010.08.002
Helmholtz, H. (1850). “Vorläufiger Bericht Über die Fortpflanzungs-Geschwindigkeit der Nervenreizung,” in Archiv für Anatomie, Physiologie und wissenschaftliche Medicin (Berlin: Veit & Comp.), 71–73.
Hick, W. E. (1952). On the rate of gain of information. Q. J. Exp. Psychol. 4, 11–26. doi: 10.1080/17470215208416600
Indefrey, P. (2011). The spatial and temporal signatures of word production components: a critical update. Front. Psychol. 2:255. doi: 10.3389/fpsyg.2011.00255
Indefrey, P., and Levelt, W. J. M. (2004). The spatial and temporal signatures of word production components. Cognition 92, 101–144. doi: 10.1016/j.cognition.2002.06.001
Ito, K., and Speer, S. R. (2008). Anticipatory effects of intonation: eye movements during instructed visual search. J. Mem. Lang. 58, 541–573. doi: 10.1016/j.jml.2007.06.013
Izdebski, K., and Shipp, T. (1978). Minimal reaction times for phonatory initiation. J. Speech Hear. Res. 21, 638–651. doi: 10.1044/jshr.2104.638
Jasnow, M., and Feldstein, S. (1986). Adult-like temporal characteristics of mother-infant vocal interactions. Child Dev. 57, 754–761. doi: 10.2307/1130352
Jefferson, G. (1984). “Notes on some orderliness of overlap onset,” in Discourse Analysis and Natural Rhetoric, eds V. D’Urso and P. Leonardi (Padua: Cleup Editore), 11–38.
Jefferson, G. (1986). Notes on ‘latency’ in overlap onset. Hum. Stud. 9, 153–183. doi: 10.1007/BF00148125
Jescheniak, J. D., and Levelt, W. J. M. (1994). Word frequency effects in speech production: retrieval of syntactic information and of phonological form. J. Exp. Psychol. Learn. Mem. Cogn. 20, 824–843. doi: 10.1037/0278-7393.20.4.824
Kamide, Y., Altmann, G. T. M., and Haywood, S. L. (2003). The time-course of prediction in incremental sentence processing: evidence from anticipatory eye movements. J. Mem. Lang. 49, 133–156. doi: 10.1016/S0749-596X(03)00023-8
Keitel, A., Prinz, W., Friederici, A. D., von Hofsten, C., and Daum, M. M. (2013). Perception of conversations: the importance of semantics and intonation in children’s development. J. Exp. Child Psychol. 116, 264–277. doi: 10.1016/j.jecp.2013.06.005
Kendon, A. (1967). Some functions of gaze-direction in social interaction. Acta Psychol. 26, 22–63. doi: 10.1016/0001-6918(67)90005-4
Kendrick, K., and Torreira, F. (2015). The timing and construction of preference: a quantitative study. Discourse Process. 52, 255–289. doi: 10.1080/0163853X.2014.955997
Kutas, M., DeLong, K. A., and Smith, N. J. (2011). “A look around at what lies ahead: prediction and predictability in language processing,” in Predictions in the Brain: Using our Past to Generate a Future, ed. M. Bar (Oxford: Oxford University Press), 190–207.
Lerner, G. H. (1991). On the syntax of sentences in progress. Lang. Soc. 20, 441–458. doi: 10.1017/S0047404500016572
Lerner, G. H. (2002). “Turn-sharing: the choral co-production of talk-in-interaction,” in The Language of Turn and Sequence, eds C. Ford, B. Fox, and S. Thompson (Oxford: Oxford University Press), 225–256.
Levelt, W. J. M. (1989). Speaking: From Intention to Articulation. Cambridge, MA: MIT Press.
Levinson, S. (1983). Pragmatics. Cambridge: Cambridge University Press.
Levinson, S. (2000). Presumptive Meanings. Cambridge, MA: MIT Press.
Levinson, S. (2013). Recursion in pragmatics. Language 89, 149–162. doi: 10.1353/lan.2013.0005
Local, J., and Walker, G. (2012). How phonetic features project more talk. J. Int. Phon. Assoc. 42, 255–280. doi: 10.1017/S0025100312000187
Magyari, L., Bastiaansen, M. C. M., De Ruiter, J. P., and Levinson, S. C. (2014). Early anticipation lies behind the speed of response in conversation. J. Cogn. Neurosci. 26, 2530–2539. doi: 10.1162/jocn_a_00673
Magyari, L., and de Ruiter, J. P. (2012). Prediction of turn-ends based on anticipation of upcoming words. Front. Psychol. 3:376. doi: 10.3389/fpsyg.2012.00376
McFarland, D. H. (2001). Respiratory markers of conversational interaction. J. Speech Lang. Hear Res. 44, 128–143. doi: 10.1044/1092-4388(2001/012)
Mehl, M. R., Vazire, S., Ramírez-Esparza, N., Slatcher, R. B., and Pennebaker, J. W. (2007). Are women really more talkative than men? Science 317, 82. doi: 10.1126/science.1139940
Mehler, J., Sebastian, N., Altmann, G., Christophe, A., and Pallier, C. (1993). Understanding compressed sentences: the role of rhythm and meaning. Paper presented at the Temporal information processing in the nervous system. Ann. N. Y. Acad. Sci. 682, 272–282. doi: 10.1111/j.1749-6632.1993.tb22975.x
Munhall, K., Gribble, P., Sacco, L., and Ward, M. (1996). Temporal constraints on the McGurk effect. Percept. Psychophys. 58, 351–362. doi: 10.3758/BF03206811
Norcliffe, E., Konopka, A., Brown, P., and Levinson, S. C. (2015). Word order affects the time-course of sentence formulation in Tzeltal. Lang. Cogn. Neurosci. doi: 10.1080/23273798.2015.1006238
Norwine, A. C., and Murphy, O. J. (1938). Characteristic time intervals in telephonic conversation. Bell Syst. Tech. J. 17, 281–291. doi: 10.1002/j.1538-7305.1938.tb00432.x
Pickering, M. J., and Garrod, S. (2013). An integrated theory of language production and comprehension. Behav. Brain Sci. 36, 329–347. doi: 10.1017/S0140525X12001495
Pomerantz, A., and Heritage, J. (2013). “Preference,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 210–228.
Riest, C., Jorschick, A. B., and De Ruiter, J. P. (2015). Anticipation in turn-taking: mechanisms and information sources. Front. Psychol. 6:89. doi: 10.3389/fpsyg.2015.00089
Roberts, F., Margutti, P., and Takano, S. (2011). Judgments concerning the valence of inter-turn silence across speakers of American English, Italian, and Japanese. Discourse Process. 48, 331–354. doi: 10.1080/0163853X.2011.558002
Roberts, S. G., Torreira, F., and Levinson, S. C. (2015). The effects of processing and sequence organization on the timing of turn taking: a corpus study. Front. Psychol. 6:509. doi: 10.3389/fpsyg.2015.00509
Rossano, F. (2013). “Gaze in conversation,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 308–329.
Sacks, H., Schegloff, E., and Jefferson, G. (1974). A simplest systematics for the organization of turn-taking in conversation. Language 50, 696–735. doi: 10.1353/lan.1974.0010
Schaeffler, S., Scobbie, J. M., and Schaeffler, F. (2014). “Measuring reaction times: vocalisation vs. articulation,” in Proceedings of the 10th International Seminar on Speech Production, Cologne.
Schegloff, E. A. (2000). Overlapping talk and the organization of turn-taking for conversation. Lang. Soc. 29, 1–63. doi: 10.1017/S0047404500001019
Schnur, T. T., Costa, A., and Caramazza, A. (2006). Planning at the phonological level during sentence production. J. Psycholinguist. Res. 35, 189–213. doi: 10.1007/s10936-005-9011-6
Sebanz, N., and Knoblich, G. K. (2008). “From mirroring to joint action,” in Embodied Communication, eds I. Wachsmuth, M. Lenzen, and G. K. Knoblich (Oxford: Oxford University Press), 129–150.
Segaert, K., Menenti, L., Weber, K., and Hagoort, P. (2011). A paradox of syntactic priming: why response tendencies show priming for passives, and response latencies show priming for actives. PLoS ONE 6:e24209. doi: 10.1371/journal.pone.0024209
Sellen, A. J. (1995). Remote conversations: the effects of mediating talk with technology. Hum. Comput. Interact. 10, 401–444. doi: 10.1207/s15327051hci1004_2
Shipp, T., Izdebski, K., and Morrissey, P. (1984). Physiologic stages of vocal reaction time. J. Speech Hear. Res. 27, 173–178. doi: 10.1044/jshr.2702.173
Sjerps, M., and Meyer, A. (2015). Variation in dual-task performance reveals late initiation of speech planning in turn-taking. Cognition 136, 304–324. doi: 10.1016/j.cognition.2014.10.008
Stivers, T. (2010). An overview of the question-response system in American English conversation. J. Pragmatics 42, 2772–2781. doi: 10.1016/j.pragma.2010.04.011
Stivers, T., Enfield, N. J., Brown, P., Englert, C., Hayashi, M., Heinemann, T., et al. (2009). Universals and cultural variation in turn-taking in conversation. Proc. Natl. Acad. Sci. U.S.A. 106, 10587–10592. doi: 10.1073/pnas.0903616106
Stivers, T., Enfield, N. J., and Levinson, S. C. (2010). Question-response sequences in conversation across ten languages: an introduction. J. Pragmatics 42, 2615–2619. doi: 10.1016/j.pragma.2010.04.001
ten Bosch, L., Oostdijk, N., and Boves, L. (2005). On temporal aspects of turn-taking in conversational dialogues. Speech Commun. 47, 80–86. doi: 10.1016/j.specom.2005.05.009
ten Bosch, L., Oostdijk, N., and de Ruiter, J. P. (2004). “Turn-taking in social talk dialogues: temporal, formal, and functional aspects,” in Proceedings of the Ninth Conference on Speech and Computer (SPECOM 2004), Saint-Petersburg: St.Petersburg, 454–461.
Terasaki, A. (1976). Pre-announcement Sequences in Conversation (No. 99). Irvine, CA: University of Irvine, Social Sciences.
Tice, M., and Henetz, T. (2011). “Turn-boundary projection: looking ahead,” in Proceedings of the 33rd Annual Conference of the Cognitive Science Society, eds L. Carlson, C. Hölscher, and T. Shipley (Austin, TX: Cognitive Science Society), 838–843.
Torreira, F., Bögels, S., and Levinson, S. C. (2015). Breathing for answering: the time course of response planning in conversation. Front. Psychol. 6:284. doi: 10.3389/fpsyg.2015.00284
Trevarthen, C. (1977). “Descriptive analyses of infant communicative behaviour,” in Studies in Mother-Infant Interaction, ed. H. R. Schaffer (London: Academic Press), 89–117.
Walker, G. (2013). “Phonetics and prosody in conversation,” in Handbook of Conversation Analysis, eds T. Stivers and J. Sidnell (Chichester: Wiley-Blackwell), 455–474.
Weilhammer, K., and Rabold, S. (2003). “Durational aspects in turn taking,” in Proceedings of the International Conference of Phonetic Sciences, Barcelona.
Wells, B., and Macfarlane, S. (1998). Prosody as an interactional resource: turn-projection and overlap. Lang. Speech 41, 265–294. doi: 10.1177/002383099804100403
Wheeldon, L. R., and Levelt, W. J. M. (1995). Monitoring the time-course of phonological encoding. J. Mem. Lang. 34, 311–334. doi: 10.1006/jmla.1995.1014
Wilson, M., and Wilson, T. P. (2005). An oscillator model of the timing of turn-taking. Psychon. Bull. Rev. 12, 957–968. doi: 10.3758/BF03206432
Wilson, T. P., and Zimmerman, D. H. (1986). The structure of silence between turns in two-party conversation. Discourse Process. 9, 375–390. doi: 10.1080/01638538609544649
Yngve, V. H. (1970). On getting a word in edgewise. Papers from the Sixth Regional Meeting of the Chicago Linguistic Society. Chicago: Chicago Linguistic Society.

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Levinson and Torreira. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.