Traditional studies of human categorization often treat the processes of encoding features and cues as peripheral to the question of how stimuli are categorized. However, in domains where the features and cues are less transparent, how information is encoded prior to categorization may constrain our understanding of the architecture of categorization. This is particularly true in speech perception, where acoustic cues to phonological categories are ambiguous and influenced by multiple factors. Here, it is crucial to consider the joint contributions of the information in the input and the categorization architecture. We contrasted accounts that argue for raw acoustic information encoding with accounts that posit that cues are encoded relative to expectations, and investigated how two categorization architectures (exemplar models and back-propagation parallel distributed processing models) deal with each kind of information. Relative encoding, akin to predictive coding, is a form of noi...
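To make the contrast between raw and relative encoding concrete, here is a minimal Python sketch (not the model reported in the paper; the cue values, categories, and talkers are hypothetical) in which a GCM-style exemplar model categorizes a token either from raw cue values or from cues expressed relative to a talker-specific expectation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: one cue (e.g., VOT in ms) for /b/ vs /p/ tokens
# produced by two talkers whose overall VOTs differ by 15 ms.
n_per_cat = 40
labels = np.repeat(np.array(["b", "p"]), n_per_cat)
talkers = np.tile(np.repeat(np.array([0, 1]), n_per_cat // 2), 2)
category_mean = np.where(labels == "b", 10.0, 50.0)
raw_cues = category_mean + 15.0 * talkers + rng.normal(0, 8, labels.size)

# Relative encoding: express each cue as a deviation from the value expected
# for the current talker (here, simply that talker's grand mean).
talker_means = np.array([raw_cues[talkers == t].mean() for t in (0, 1)])
rel_cues = raw_cues - talker_means[talkers]

def exemplar_evidence(probe, exemplars, labels, c=0.2):
    """GCM-style exemplar model: evidence for each category is the summed
    exponential similarity between the probe and that category's exemplars."""
    sims = np.exp(-c * np.abs(exemplars - probe))
    return {lab: float(sims[labels == lab].sum()) for lab in np.unique(labels)}

# An ambiguous 30 ms token from talker 0, categorized under each encoding.
print("raw encoding:     ", exemplar_evidence(30.0, raw_cues, labels))
print("relative encoding:", exemplar_evidence(30.0 - talker_means[0], rel_cues, labels))
```

Under this sketch, the relative encoding removes the talker-driven shift before the exemplar comparison, which is the sense in which encoding assumptions and categorization architecture jointly determine performance.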
Previous research on speech perception in both adults and infants has supported the view that consonants are perceived categorically; that is, listeners are relatively insensitive to variation below the level of the phoneme. More recent work, on the other hand, has shown adults to be systematically sensitive to within-category variation (McMurray, Tanenhaus & Aslin, 2002). Additionally, recent evidence suggests that infants are capable of using within-category variation to segment speech and to learn phonetic categories. Here we report two studies of 8-month-old infants, using the head-turn preference procedure, that examine more directly infants' sensitivity to within-category variation. Infants were exposed to 80 repetitions of words beginning with either /b/ or /p/. After exposure, listening times to tokens of the same category with small variations in VOT differed significantly from listening times both to the originally exposed tokens and to the cross-category-boundary competitors. Thu...
Understanding spoken language requires analysis of the rapidly unfolding speech signal at multiple levels: acoustic, phonological, and semantic. However, there is not yet a comprehensive picture of how these levels relate. We recorded electroencephalography (EEG) while listeners (N = 31) heard sentences in which we manipulated acoustic ambiguity (e.g., a bees/peas continuum) and sentential expectations (e.g., "Honey is made by bees"). EEG was analyzed with a mixed-effects model over time to quantify how language processing cascades proceed on a millisecond-by-millisecond basis. Our results indicate that (1) perceptual processing and memory for fine-grained acoustics are preserved in brain activity for up to 900 msec; (2) contextual analysis begins early and is graded with respect to the acoustic signal; and (3) top-down predictions influence perceptual processing in some cases, but these predictions are available simultaneously with the veridical signal. These mechanistic insights provi...
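The time-resolved mixed-effects analysis described above might be sketched as follows; this is an illustrative reconstruction rather than the authors' analysis code, and the column names (time_ms, amplitude, vot_step, context, subject) are assumptions about a long-format data frame:

```python
import pandas as pd
import statsmodels.formula.api as smf

def timewise_mixed_models(df: pd.DataFrame) -> pd.DataFrame:
    """Fit a mixed-effects model at each time sample: EEG amplitude as a
    function of continuum step and sentence context, with a random intercept
    per subject. Returns one row of fixed-effect estimates per time point."""
    rows = []
    for t, chunk in df.groupby("time_ms"):
        fit = smf.mixedlm("amplitude ~ vot_step * context",
                          data=chunk, groups=chunk["subject"]).fit(reml=True)
        rows.append({"time_ms": t, **fit.params.to_dict()})
    return pd.DataFrame(rows)

# Assumed usage on long-format EEG data: effects = timewise_mixed_models(eeg_long)
```

Tracking the fixed-effect estimates over time points is one way to quantify when acoustic and contextual predictors begin to influence the neural signal.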
Speech unfolds rapidly over time, and the information necessary to recognize even a single phoneme may not be available simultaneously. Consequently, listeners must both integrate prior acoustic cues and anticipate future segments. Prior work on stop consonants and vowels suggests that listeners integrate asynchronous cues by partially activating lexical entries as soon as any information is available and then updating them when later cues arrive. However, a recent study suggests that for the voiceless sibilant fricatives (/s/ and /ʃ/), listeners wait to initiate lexical access until all cues have arrived at the onset of the vowel. Sibilants also contain coarticulatory cues that could be used to anticipate the upcoming vowel. However, given these results, it is unclear whether listeners can use them fast enough to speed vowel recognition. The current study examines anticipation by asking when listeners use coarticulatory information in the frication to predict the upcoming vowel. A visual world paradigm experiment found that listeners do not wait: they anticipate the vowel immediately from the onset of the frication, even as they wait several hundred milliseconds to identify the fricative. This finding suggests that listeners do not strictly process phonemes in the order in which they appear; rather, the dynamics of language processing may be largely internal and only loosely coupled to the dynamics of the input.
Speech unfolds over time, and the cues for even a single phoneme are rarely available simultaneously. Consequently, to recognize a single phoneme, listeners must integrate material over several hundred milliseconds. Prior work contrasts two accounts: (1) a memory buffer account, in which listeners accumulate auditory information in memory and only access higher-level representations (i.e., lexical representations) when sufficient information has arrived; and (2) an immediate integration scheme, in which lexical representations can be partially activated on the basis of early cues and then updated when more information arrives. These studies have uniformly shown evidence for immediate integration for a variety of phonetic distinctions. We attempted to extend this to fricatives, a class of speech sounds that requires not only temporal integration of asynchronous cues (the frication, followed by the formant transitions 150-350 msec later), but also integration across different frequency bands and compensation for contextual factors like coarticulation. Eye movements in the visual world paradigm showed clear evidence for a memory buffer. Results were replicated in five experiments, ruling out methodological factors and tying the release of the buffer to the onset of the vowel. These findings support a general auditory account for speech by suggesting that the acoustic nature of particular speech sounds may have large effects on how they are processed. They also have major implications for theories of auditory and speech perception by raising the possibility of encapsulated memory buffers in early auditory processing.
Journal of Experimental Psychology: Human Perception and Performance, 2017
During spoken language comprehension, listeners transform continuous acoustic cues into categories (e.g., /b/ and /p/). While long-standing research suggests that phonetic categories are activated in a gradient way, there are also clear individual differences, in that more gradient categorization has been linked to various communication impairments such as dyslexia and specific language impairment.
Language learning is generally described as a problem of acquiring new information (e.g., new words). However, equally important are changes in how the system processes known information. For example, a wealth of studies has suggested dramatic changes over development in how efficiently children recognize familiar words, but it is unknown what kind of experience-dependent mechanisms of plasticity give rise to such changes in real-time processing. We examined the plasticity of the language processing system by testing whether a fundamental aspect of spoken word recognition, lexical interference, can be altered by experience. Adult participants were trained on a set of familiar words over a series of four tasks. In the high-competition (HC) condition, tasks were designed to encourage coactivation of similar words (e.g., net and neck) and to require listeners to resolve this competition. Tasks were similar in the low-competition (LC) condition but did not enhance this competition. Immediately after training, interlexical interference was tested using a visual world paradigm task. Participants in the HC group resolved interference to a fuller degree than those in the LC group, demonstrating that experience can shape the way competition between words is resolved. TRACE simulations showed that the observed late differences in the pattern of interference resolution can be attributed to differences in the strength of lexical inhibition. These findings inform cognitive models in many domains that involve competition/interference processes, and suggest an experience-dependent mechanism of plasticity that may underlie longer-term changes in processing efficiency associated with both typical and atypical development.
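As a rough illustration of how lexical inhibition produces the competition dynamics that the TRACE simulations appeal to, here is a minimal interactive-activation sketch in Python (not TRACE itself; the parameter values and word pair are hypothetical):

```python
import numpy as np

def lexical_competition(input_support, inhibition=0.3, decay=1.0,
                        rate=0.05, n_steps=400):
    """Minimal interactive-activation sketch (not TRACE itself): each word's
    activation grows with its bottom-up support and is suppressed by the
    summed activation of its competitors, scaled by `inhibition`."""
    act = np.zeros_like(input_support, dtype=float)
    for _ in range(n_steps):
        competition = inhibition * (act.sum() - act)  # inhibition from other words
        act = np.clip(act + rate * (input_support - decay * act - competition),
                      0.0, 1.0)
    return act

# Two similar words (e.g., "net" vs. "neck") with slightly different support:
# stronger lexical inhibition lets the leading word suppress its competitor
# more fully, mirroring the fuller interference resolution described above.
for inh in (0.1, 0.6):
    final = lexical_competition(np.array([1.0, 0.8]), inhibition=inh)
    print(f"inhibition={inh}: final activations {final.round(2)}")
```

With weak inhibition both words settle at similar activations; with strong inhibition the better-supported word suppresses its competitor, which is the kind of parameter difference invoked to explain the HC/LC training effect.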
Objectives: While outcomes with cochlear implants (CIs) are generally good, performance can be fragile. The authors examined two factors that are crucial for good CI performance. First, while there is a clear benefit of adding residual acoustic hearing to CI stimulation (typically in low frequencies), it is unclear whether this contributes directly to phonetic categorization. Thus, the authors examined perception of voicing (which uses low-frequency acoustic cues) and fricative place of articulation (s/ʃ, which does not) in CI users with and without residual acoustic hearing. Second, in speech categorization experiments, CI users typically show shallower identification functions. These are typically interpreted as deriving from noisy encoding of the signal. However, psycholinguistic work suggests shallow slopes may also be a useful way to adapt to uncertainty. The authors thus employed an eye-tracking paradigm to examine this in CI users. Design: Participants were 30 CI users (with a variety of configurations) and 22 age-matched normal-hearing (NH) controls. Participants heard tokens from six b/p and six s/ʃ continua (eight steps) spanning real words (e.g., beach/peach, sip/ship). Participants selected the picture corresponding to the word they heard from a screen containing four items (a b-, p-, s-, and ʃ-initial item). Eye movements to each object were monitored as a measure of how strongly they were considering each interpretation in the moments leading up to their final percept. Results: Mouse-click results (analogous to phoneme identification) for voicing showed a shallower slope for CI users than NH listeners, but no differences between CI users with and without residual acoustic hearing. For fricatives, CI users also showed a shallower slope, but unexpectedly, acoustic + electric listeners showed an even shallower slope. Eye movements showed a gradient response to fine-grained acoustic differences for all listeners. Even considering only trials in which a participant clicked "b" (for example), and accounting for variation in the...
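The identification slopes discussed here are typically estimated by fitting a logistic function to response proportions along the continuum; a minimal sketch, with made-up data for a steep (NH-like) and a shallow (CI-like) listener, might look like this:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(step, slope, boundary):
    """Psychometric function: probability of a 'p' response along a b/p continuum."""
    return 1.0 / (1.0 + np.exp(-slope * (step - boundary)))

# Made-up identification data: proportion of 'p' responses at each of eight
# continuum steps, for a steep (NH-like) and a shallow (CI-like) listener.
steps = np.arange(1, 9, dtype=float)
steep = np.array([0.02, 0.03, 0.05, 0.20, 0.85, 0.95, 0.97, 0.99])
shallow = np.array([0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85])

for name, responses in (("steep", steep), ("shallow", shallow)):
    (slope, boundary), _ = curve_fit(logistic, steps, responses, p0=[1.0, 4.5])
    print(f"{name}: slope = {slope:.2f}, category boundary = {boundary:.2f}")
```

The fitted slope parameter is what "shallower identification functions" refers to: a smaller slope means responses change more gradually across the continuum.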
Many sources of context information in speech (such as speaking rate) occur either before or after the phonetic cues they influence, yet there is little work examining the time course of these effects. Here, we investigate how listeners compensate for preceding sentence rate and subsequent vowel length (a secondary cue that has been used as a proxy for speaking rate) when categorizing words varying in voice-onset time (VOT). Participants selected visual objects in a display while their eye movements were recorded, allowing us to examine when each source of information had an effect on lexical processing. We found that the effect of VOT preceded that of vowel length, suggesting that each cue is used as it becomes available. In a second experiment, we found that, in contrast, the effect of preceding sentence rate occurred simultaneously with VOT, suggesting that listeners interpret VOT relative to preceding rate.
Research in speech perception has been dominated by a search for invariant properties of the signal that correlate with lexical and sublexical categories. We argue that this search for invariance has led researchers to ignore the perceptual consequences of systematic variation within such categories and that sensitivity to this variation may provide an important source of information for integrating information...
The product of speech perception is contrast between discrete units of meaning (words), which are distinguished by features. While traditional approaches argued that discreteness is imposed by mechanisms like categorical perception that discard within-category detail, recent research suggests that fine-grained detail is preserved throughout processing. We develop an alternative account that argues that discreteness emerges from processes that parse overlapping sources of variance from the signal. These processes need not discard acoustic detail and may make it more useful to listeners. We develop a computational implementation (Computing Cues Relative to Expectations, C-CuRE) and test it on a corpus of vowel productions. It shows how C-CuRE reveals underlying vowel features despite contextual variance, and simultaneously uses that variance to better predict upcoming vowels. (Footnote 1: Leaving aside F0 as a feature encoding pragmatic meaning.)
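A minimal sketch of the C-CuRE idea (an illustrative reconstruction rather than the published implementation; the corpus, cue values, and factor names are hypothetical) regresses a cue on known contextual factors and treats the residual as the compensated cue:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def c_cure(df, cue, known_factors):
    """Sketch of C-CuRE: re-express a cue as its deviation from the value
    predicted by what the listener already knows (e.g., the talker). The
    residual preserves fine-grained detail but removes explained variance."""
    fit = smf.ols(f"{cue} ~ " + " + ".join(known_factors), data=df).fit()
    return df[cue] - fit.predict(df)

# Hypothetical vowel corpus: F1 depends on the vowel but also on the talker.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "talker": np.repeat(["t1", "t2"], 100),
    "vowel": np.tile(np.repeat(["i", "a"], 50), 2),
})
df["F1"] = (np.where(df.vowel == "a", 700.0, 300.0)
            + np.where(df.talker == "t2", 80.0, 0.0)
            + rng.normal(0, 40, len(df)))

# Once the talker is known, F1 relative to that talker's expectation separates
# the vowel categories with less talker-driven spread than raw F1.
df["F1_rel"] = c_cure(df, "F1", ["talker"])
print(df.groupby(["vowel", "talker"])[["F1", "F1_rel"]].mean().round(1))
```

The same regression machinery can run in the other direction: the variance explained by context is not thrown away but can be used to predict what comes next, which is the paper's point about variance becoming useful.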
Most research on infant speech categories has relied on measures of discrimination. Such work often employs categorical perception as a linking hypothesis to enable inferences about categorization on the basis of discrimination measures. However, a large number of studies with adults challenge the utility of categorical perception in describing adult speech perception, and this in turn calls into question how to interpret measures of infant speech discrimination. We propose here a parallel channels model of discrimination (built on Pisoni & Tash, Perception & Psychophysics, 15(2), 285-290, 1974), which posits that both a noncategorical or veridical encoding of speech cues and category representations can simultaneously contribute to discrimination. This model can thus produce categorical perception effects without positing any warping of the acoustic signal, but it also reframes how we think about infant discrimination and development. We test this model by conducting a quantitative review of 20 studies examining infants' discrimination of voice onset time contrasts. This review suggests that within-category discrimination is surprisingly prevalent even in classic studies and that, averaging across studies, discrimination is related to continuous acoustic distance. It also identifies several methodological factors that may mask our ability to see this. Finally, it suggests that infant discrimination may improve over development, contrary to the commonly held notion of perceptual narrowing. These results are discussed in terms of theories of speech development that may require such continuous sensitivity.
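A toy version of the parallel channels idea, in which discrimination draws on both a veridical acoustic channel and a graded category channel, could be sketched as follows (the weights, slope, and boundary are illustrative assumptions, not values fitted in the review):

```python
import numpy as np

def discriminability(vot_a, vot_b, boundary=25.0, slope=0.5,
                     w_continuous=0.02, w_categorical=1.0):
    """Parallel-channels sketch (after Pisoni & Tash, 1974): discrimination
    reflects both raw acoustic distance (veridical channel) and the difference
    in graded category responses (category channel), with no warping of the
    underlying signal."""
    p_a = 1.0 / (1.0 + np.exp(-slope * (vot_a - boundary)))
    p_b = 1.0 / (1.0 + np.exp(-slope * (vot_b - boundary)))
    continuous = w_continuous * abs(vot_a - vot_b)
    categorical = w_categorical * abs(p_a - p_b)
    return continuous + categorical

# Three pairs with the same 20 ms VOT difference: within-category pairs remain
# discriminable (continuous channel), while the boundary-straddling pair gets
# an extra boost from the category channel.
for pair in [(0.0, 20.0), (15.0, 35.0), (40.0, 60.0)]:
    print(pair, round(discriminability(*pair), 2))
```

This is how the model can reproduce classic categorical-perception peaks at the boundary while still predicting above-chance within-category discrimination that scales with acoustic distance.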
Speech sounds are highly variable, yet listeners readily extract information from them and transform continuous acoustic signals into meaningful categories during language comprehension. A central question is whether perceptual encoding captures acoustic detail in a one-to-one fashion or whether it is affected by phonological categories. We addressed this question in an event-related potential (ERP) experiment in which listeners categorized spoken words that varied along a continuous acoustic dimension (voice-onset time, or VOT) in an auditory oddball task. We found that VOT effects were present through a late stage of perceptual processing (N1 component, ~100 ms poststimulus) and were independent of categorization. In addition, effects of within-category differences in VOT were present at a postperceptual categorization stage (P3 component, ~450 ms poststimulus). Thus, at perceptual levels, acoustic information is encoded continuously, independently of phonological information. Fur...
Most theories of categorization emphasize how continuous perceptual information is mapped to categories. However, equally important are the informational assumptions of a model: the type of information subserving this mapping. This is crucial in speech perception, where the signal is variable and context dependent. This study assessed the informational assumptions of several models of speech categorization, in particular the number of cues that form the basis of categorization and whether these cues represent the input veridically or have undergone compensation. We collected a corpus of 2,880 fricative productions (Jongman, Wayland, & Wong, 2000) spanning many talker and vowel contexts and measured 24 cues for each. A subset was also presented to listeners in an 8AFC phoneme categorization task. We then trained a common classification model based on logistic regression to categorize the fricative from the cue values and manipulated the information in the training set to contrast (a) models based on a small number of invariant cues, (b) models using all cues without compensation, and (c) models in which cues underwent compensation for contextual factors. Compensation was modeled by computing cues relative to expectations (C-CuRE), a new approach to compensation that preserves fine-grained detail in the signal. Only the compensation model achieved accuracy similar to listeners and showed the same effects of context. Thus, even simple categorization metrics can overcome the variability in speech when sufficient information is available and compensation schemes like C-CuRE are employed.
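A hedged sketch of the modeling contrast described here, using off-the-shelf logistic regression and the same residual-based compensation idea, might look like the following (the corpus object and column names are placeholders, not the original data format):

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def compensate(df, cue_cols, context_cols):
    """C-CuRE-style compensation: replace each cue with its residual after
    regressing it on contextual factors (e.g., talker and vowel)."""
    out = df.copy()
    for cue in cue_cols:
        fit = smf.ols(f"{cue} ~ " + " + ".join(context_cols), data=df).fit()
        out[cue] = df[cue] - fit.predict(df)
    return out

def fricative_accuracy(df, cue_cols, label_col="fricative"):
    """Cross-validated accuracy of a multinomial logistic regression over cues."""
    clf = LogisticRegression(max_iter=2000)
    return cross_val_score(clf, df[cue_cols], df[label_col], cv=5).mean()

# Placeholder usage (corpus and column names are assumptions):
# raw_acc  = fricative_accuracy(corpus, cue_cols=CUE_COLUMNS)
# comp_acc = fricative_accuracy(compensate(corpus, CUE_COLUMNS, ["talker", "vowel"]),
#                               cue_cols=CUE_COLUMNS)
```

Comparing accuracy on raw versus compensated cues is the kind of contrast the study draws between models (b) and (c); the classifier itself is deliberately simple so that differences reflect the informational assumptions, not the categorization machinery.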
The speech signal is notoriously variable, with the same phoneme realized differently depending on factors like talker and phonetic context. Variance in the speech signal has led to a proliferation of theories of how listeners recognize speech. A promising approach, supported by computational modeling studies, is contingent categorization, wherein incoming acoustic cues are computed relative to expectations. We tested contingent encoding empirically. Listeners were asked to categorize fricatives in CV syllables constructed by splicing the fricative from one CV syllable with the vowel from another CV syllable. The two spliced syllables always contained the same fricative, providing consistent bottom-up cues; however, on some trials, the vowel and/or talker mismatched between these syllables, giving conflicting contextual information. Listeners were less accurate and slower at identifying the fricatives in mismatching splices. This suggests that listeners rely on context information beyond bottom-up acoustic cues during speech perception, providing support for contingent categorization.