Papers by Steven Greenberg
The Journal of the Acoustical Society of America, 1978
… seems to disconfirm a phonological hypothesis to explain the poorer identification of isolated vowels [T. R. Edman, W. Strange, and J. J. Jenkins, J. Acoust. Soc. Am. 59, S25(A) (1976)]. Possible reasons for the ineffectiveness of /g/ contexts to aid vowel identification are discussed. [Supported by NIMH, NICHD.]

8th European Conference on Speech Communication and Technology (Eurospeech 2003)
Temporal dynamics provide a fruitful framework with which to examine the relation between information and spoken language. This paper serves as an introduction to the special Eurospeech session on "Time is of the Essence – Dynamic Approaches to Spoken Language," providing historical and conceptual background germane to timing, as well as a discussion of its scientific and technological prospects. Dynamics is examined from the perspectives of perception, production, neurology, synthesis, recognition and coding, in an effort to define a prospective course for speech technology and research. Speech is inherently dynamic, reflecting the motion of the tongue and other articulators during the course of vocal production. Such articulatory dynamics are reflected in rapid spectral changes, known as formant transitions, characteristic of the acoustic signal. Although such dynamic properties have long been of interest to speech scientists, their fundamental importance for spoken language has only recently received broad recognition. The special Eurospeech session on "Time is of the Essence – Dynamic Approaches to Spoken Language" is designed to acquaint the speech community with current research representative of this new emphasis on dynamics from a broad range of scientific and technical perspectives. The current paper serves as a brief introduction, providing historical and conceptual background for the session as a whole. Traditionally, articulatory mechanisms have been examined principally from a biomechanical perspective. Given the structural constraints imposed through phylogenetic descent, speech production has generally been viewed as nature's way of solving an exceedingly complicated problem with limited biomechanical means.
The jaw, tongue, lips and other articulators can move only so fast, their rates of motion limited by their anatomical and physiological characteristics. Such properties reflect an evolutionary process long antedating the origins of human vocal communication. From this purely articulatory perspective, speech's spectro-temporal properties are primarily the consequence of biomechanical constraints imposed through the course of human (and mammalian) evolution. If the fine details of spoken language are governed by vocal production, how does the brain decode speech given the acoustic nature of the input to the auditory system? One prominent model, known as "Motor Theory," posits that the brain back-computes the articulatory gestures directly from the acoustic signal [17]. In essence, this framework likens the auditory system to a delivery service that transmits packages containing articulatory gestures decoded at some higher level of the brain. The process of perceiving (and ultimately understanding) speech thus reduces to …
3rd International Conference on Spoken Language Processing (ICSLP 1994)
We have developed a statistical model of speech that incorporates certain temporal properties of human speech perception. The primary goal of this work is to avoid a number of current constraining assumptions for statistical speech recognition systems, particularly the model of speech as a sequence of stationary segments consisting of uncorrelated acoustic vectors. A focus on perceptual models may in principle allow for statistical modeling of speech components that are more relevant for discrimination between candidate utterances during speech recognition. In particular, we hope to develop systems that have some of the robust properties of human audition for speech collected under adverse conditions. The outline of this new research direction is given here, along with some preliminary theoretical work.

Due to the incompletely understood nature of prosodic stress, the implementation of an automatic transcriber is very difficult on the basis of the currently available knowledge. For this reason, a number of data-driven approaches are applied to a manually annotated set of files from the OGI English Stories Corpus. The goal of this analysis is twofold. First, it aims to implement an automatic detector of prosodic stress with sufficiently reliable performance. Second, the effectiveness of the acoustic features most commonly proposed in the literature is assessed. That is, the role played by duration, amplitude and fundamental frequency of syllabic nuclei is investigated. Several data-driven algorithms, such as artificial neural networks (ANN), statistical decision trees and fuzzy classification techniques, and a knowledge-based heuristic algorithm are implemented for the automatic transcription of prosodic stress. As reference, two different subsets from the OGI English Stories database …

The spectro-temporal coding of Danish consonants was investigated using an information-theoretic approach. Listeners were asked to identify eleven different consonants spoken in a CV[l] syllable context (where C refers to the initial consonant, V refers to one of three vowels, [i, a, u], and [l] refers to the syllable-final liquid segment). Each syllable was processed so that only a portion of the original audio spectrum was present. Narrow (three-quarter-octave) bands of speech, with center frequencies of 750 Hz, 1500 Hz and 3000 Hz, were presented individually and in combination with each other. The modulation spectrum of each band was low-pass filtered at 24, 12, 6 and 3 Hz. Confusion matrices of the consonant-identification data were computed, and from these the amount of information transmitted for each of three phonetic feature dimensions – voicing, manner and place of articulation – was calculated for each condition. This form of analysis provides a simple means of determining whether information associated with each phonetic feature dimension combines linearly across the audio spectrum, and, if not, delineates a method for characterizing the (non-linear) nature of information integration. In addition, the analysis provides a means to associate specific portions of the modulation spectrum with phonetic feature properties. Such analyses indicate that: (1) accurate, robust decoding of place-of-articulation information requires broadband, cross-spectral integration; (2) place-of-articulation information is associated most closely with the modulation spectrum above 6 Hz, with the most significant contribution coming from the region above 12 Hz; and (3) place-of-articulation information is crucial for accurate consonant recognition.
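The information-transmission analysis described in the abstract above can be sketched in a few lines: given a confusion matrix, the mutual information (in bits) between stimulus and response estimates how much of a feature dimension is transmitted. The function below is a minimal, hypothetical illustration of that computation, not the paper's actual analysis code.

```python
import math

def transmitted_info(confusions):
    """Mutual information (bits) between stimulus and response,
    estimated from a confusion matrix of raw counts
    (rows = stimuli, columns = responses)."""
    total = sum(sum(row) for row in confusions)
    p_stim = [sum(row) / total for row in confusions]
    p_resp = [sum(confusions[i][j] for i in range(len(confusions))) / total
              for j in range(len(confusions[0]))]
    t = 0.0
    for i, row in enumerate(confusions):
        for j, count in enumerate(row):
            if count:  # 0 * log(0) contributes nothing
                p = count / total
                t += p * math.log2(p / (p_stim[i] * p_resp[j]))
    return t

# Perfect 2x2 identification transmits exactly 1 bit;
# chance-level responding transmits 0 bits.
print(transmitted_info([[10, 0], [0, 10]]))  # 1.0
print(transmitted_info([[5, 5], [5, 5]]))    # 0.0
```

Applied per feature dimension (voicing, manner, place), comparing single-band and multi-band conditions shows whether the transmitted information combines linearly across the spectrum.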
In a previous study (14) we concluded that amplitude and duration are the most important acoustic parameters underlying the patterning of prosodic stress in casually spoken American English, and that fundamental frequency (f0) plays only a minor role in the assignment of stress. The current study re-examines this conclusion (using both the range and average level of …

The Journal of the Acoustical Society of America, 1986
The physiological basis of auditory frequency selectivity was investigated by recording the temporal response patterns of single cochlear-nerve fibers in the cat. The characteristic frequency and sharpness of tuning were determined for low-frequency cochlear-nerve fibers with two-tone signals whose frequency components were of equal amplitude and starting phase. The measures were compared with those obtained with sinusoidal signals. The two-tone characteristic frequency (2TCF) is defined as the arithmetic-center frequency at which the fiber is synchronized to both signal frequencies in equal measure. The 2TCF closely corresponds to the characteristic frequency as determined by the frequency threshold curve. Moreover, the 2TCF changes relatively little (2%-12%) over a 60-dB intensity range. The 2TCF generally shifts upward with increasing intensity for cochlear-nerve fibers tuned to frequencies below 1 kHz and shifts downward as a function of intensity for units with characteristic frequencies (CFs) above 1 kHz. The shifts in the 2TCF are considerably smaller than those observed with sinusoidal signals. Filter functions were derived from the synchronization pattern to the two-tone signal by varying the frequency of one of the components over the fiber's response area while maintaining the other component at the 2TCF. The frequency selectivity of the two-tone filter function was determined by dividing the vector strength to the variable-frequency signal by the vector strength to the CF tone. The filter function was measured 10 dB down from the peak (2T Q10dB) and compared with the Q10dB of the frequency threshold curve. The correlation between the two measures of frequency selectivity was 0.72. The 2T Q10dB does change as a function of intensity.
The magnitude and direction of the change is dependent on the sharpness of tuning at low and moderate sound-pressure levels (SPLs). The selectivity of the more sharply tuned fibers (2T Q10dB > 3) diminishes at intensities above 60 dB SPL. However, the broadening of selectivity is relatively small in comparison to discharge-rate-based measures of selectivity. The selectivity of the more broadly tuned units remains unchanged or improves slightly at similar intensity levels. The present data indicate that the frequency selectivity and tuning of low-frequency cochlear-nerve fibers are relatively stable over a 60-dB range of SPLs when measured in terms of their temporal discharge properties.

The Journal of the Acoustical Society of America, 1984
The implementation of engineering noise controls on existing, rebuilt, and new equipment must take into account both technological and economic feasibility. Retrofitting engineering controls to existing noisy equipment on the production floor, while desirable and in some instances necessary, can have serious negative economic consequences. Costs resulting from decreased productivity, increased maintenance, frequency of replacement, and possible increased floor-space requirements are all factors which must be considered. General Motors Corporation (GM), while pursuing the investigation and implementation of feasible engineering noise controls on existing equipment, places special emphasis on the purchase of new equipment. The "General Motors Corporation Sound Level Specification for Machinery and Equipment," Revision February 1979, specifies a maximum time-weighted average sound level of 80 dB(A). Measurements are taken at the operator's ear location and on the measurement envelope as specified in "NMTBA Noise Measurement Techniques," Second Edition, dated January 1976. GM, working in conjunction with its suppliers, has experienced considerable success in achieving feasible engineering controls through the application of the GM Purchase Specification.

The Journal of the Acoustical Society of America, 1999
It is generally assumed that the auditory representation of speech is derived primarily from the spatial distribution of activity across the tonotopically organized neuronal axis. Such temporal parameters as spike synchronization, interspike interval, and neuronal oscillations are also regarded as potential complements to the classical tonotopic representation of the speech spectrum. The present study used magnetoencephalography (MEG) to investigate the temporal behavior of large neuronal ensembles in human auditory cortex. Experimental stimuli were sinusoidal and square-wave signals, sinusoidally amplitude-modulated tones, and synthetic vowels. The results indicate the presence of a systematic representation of the signal spectrum and pitch derived from the latency pattern of the major auditory-evoked neuromagnetic field. The magnitude of the latency differential to signals of variable frequency and pitch can be as large as 25 ms. This "tono-chronic" analysis may offer an alternative encoding mechanism for…

Hearing Research, 1987
Neural temporal coding of low pitch. I. Human frequency-following responses to complex tones. The neural basis of low pitch was investigated in the present study by recording a brainstem potential from the scalp of human subjects during presentation of complex tones which evoke a variable sensation of pitch. The potential recorded, the frequency-following response (FFR), reflects the temporal discharge activity of auditory neurons in the upper brainstem pathway. It was used as an index of neural periodicity in order to determine the extent to which the low pitch of complex tones is encoded in the temporal discharge activity of auditory brainstem neurons. A tone composed of harmonics of a common fundamental produces a sensation of pitch equal to that of the 'missing' fundamental. Such signals generate brainstem potentials which are spectrally similar to FFRs recorded in response to sinusoidal signals equal in frequency to the missing fundamental. Both types of signals generate FFRs which are periodic, with a frequency similar to the perceived pitch of the stimuli. It is shown that the FFR to the missing fundamental is not the result of a distortion product, by recording the FFR to a complex signal in the presence of low-frequency bandpass noise. Neither is the FFR the result of neural synchronization to the waveform envelope modulation pattern. This was determined by recording FFRs to inharmonic and quasi-frequency-modulated signals. It was also determined that the 'existence region' for the FFR to the missing fundamental lies below 2 kHz and that the most favorable spectral region for FFRs to complex tones is between 0.5 and 1.0 kHz.
These results are consistent with the hypothesis that the far-field-recorded FFR does reflect neural activity germane to the processing of low pitch and that such pitch-relevant activity is based on the temporal discharge patterns of neurons in the upper auditory brainstem pathway.
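The missing-fundamental phenomenon the study probes can be illustrated with a toy autocorrelation analysis: a complex containing only harmonics 3-5 of 200 Hz still has its strongest waveform periodicity at 5 ms (1/200 s), even though no 200-Hz component is present. The sampling rate, duration, and lag search range below are arbitrary choices for the sketch; this is not the FFR recording analysis itself.

```python
import math

def missing_f0_period(freqs_hz, fs=8000, dur=0.1):
    """Estimate the dominant periodicity (in seconds) of a harmonic
    complex by picking the autocorrelation peak over lags
    corresponding to pitches between 50 Hz and 1 kHz."""
    n = int(fs * dur)
    x = [sum(math.sin(2 * math.pi * f * t / fs) for f in freqs_hz)
         for t in range(n)]
    best_lag, best_r = 0, -1.0
    for lag in range(int(fs / 1000), int(fs / 50)):
        r = sum(x[t] * x[t + lag] for t in range(n - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag / fs

# Harmonics 3-5 of 200 Hz: strongest periodicity at 1/200 s = 5 ms,
# matching the pitch of the absent fundamental.
print(missing_f0_period([600, 800, 1000]))  # 0.005
```

The same period emerges for a 200-Hz pure tone, which parallels the abstract's observation that complex tones and their missing fundamental evoke spectrally similar, periodic FFRs.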
1996 LVCSR Summer Workshop Technical …, 1996
… cut across this phonetic feature. But we will monitor the transcriptions to ascertain if the partitioning of the vocalic space for unstressed syllables is sufficient to capture the key phonetic information. Syllabic codas will be coded …
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
… due to cost and labor constraints, nor does every site have the resources to develop a MAUS … ISCA Workshop on Automatic Speech Recognition: Challenges for the New Millennium, pp. … [33] Linguistic Data Consortium www.ldc.upenn.edu [34] Liu, SA Landmark detection for …
Journal of Phonetics, 1988
A table carrying a workpiece being machined is mounted on a base and has at least two degrees of freedom with respect thereto, for optimum positioning of the workpiece with respect to a cutting tool by way of at least two table-orientation mechanisms.

Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
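As a toy illustration of where a modulation spectrogram concentrates its energy, the sketch below measures a band envelope's spectrum at low modulation frequencies: a 4-Hz (roughly syllabic-rate) envelope puts its energy at 4 Hz. The function, sampling rate, and test signal are hypothetical and much simpler than the representation proposed in the paper, which operates on critical-band-like channels of real speech.

```python
import math

def modulation_spectrum(envelope, fs, max_mod_hz=16):
    """DFT magnitude of a (band) envelope at integer modulation
    frequencies 1..max_mod_hz -- a toy modulation spectrum."""
    n = len(envelope)
    mean = sum(envelope) / n
    e = [v - mean for v in envelope]  # remove the DC component
    mags = {}
    for f in range(1, max_mod_hz + 1):
        re = sum(e[t] * math.cos(2 * math.pi * f * t / fs) for t in range(n))
        im = sum(e[t] * math.sin(2 * math.pi * f * t / fs) for t in range(n))
        mags[f] = math.hypot(re, im) / n
    return mags

# One second of a 4-Hz modulated envelope (hypothetical band output):
fs = 200
env = [1 + math.cos(2 * math.pi * 4 * t / fs) for t in range(fs)]
ms = modulation_spectrum(env, fs)
print(max(ms, key=ms.get))  # 4
```

Keeping only this low-frequency modulation structure, as the abstract describes, is what discards spectro-temporal detail while retaining the slowly varying envelope information.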

The role of duration, amplitude and fundamental frequency of syllabic vocalic nuclei is investigated for marking prosodic stress in spontaneous American English discourse. Local maxima of different evidence variables, implemented as combinations of the three basic parameters (duration, amplitude and pitch), are supposed to be related to prosodic stress. As reference, two different subsets from the OGI English Stories database were manually marked in terms of prosodic stress by two different trained linguists. The ROC curves, built on the training examples, show that both transcribers grant a major role to the amplitude and duration rather than to the pitch of the vocalic nuclei. More complex evidence variables, involving a product of the three basic parameters, allow around 80% of primary stressed and 77% of unstressed syllables to be correctly recognized in the test files of both transcribers' datasets. The agreement between the two transcribers on a set of common files supplies only sl…
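The ROC analysis referred to above can be sketched as follows: sweep a threshold over an evidence variable's values and record a (false-positive rate, true-positive rate) pair at each threshold, with stressed nuclei as positives and unstressed as negatives. The scores below are made-up numbers for illustration, not data from the corpus.

```python
def roc_points(scores_pos, scores_neg):
    """(FPR, TPR) points for a threshold swept over all observed
    scores, highest first -- the kind of ROC curve used to compare
    stress-evidence variables against a reference transcription."""
    thresholds = sorted(set(scores_pos + scores_neg), reverse=True)
    pts = []
    for th in thresholds:
        tpr = sum(s >= th for s in scores_pos) / len(scores_pos)
        fpr = sum(s >= th for s in scores_neg) / len(scores_neg)
        pts.append((fpr, tpr))
    return pts

# Hypothetical duration*amplitude evidence scores for stressed vs.
# unstressed syllabic nuclei:
stressed = [0.9, 0.8, 0.7, 0.4]
unstressed = [0.5, 0.3, 0.2, 0.1]
print(roc_points(stressed, unstressed)[2])  # (0.0, 0.75)
```

An evidence variable whose curve rises toward the upper-left corner (high TPR at low FPR) separates stressed from unstressed nuclei well, which is how the abstract's comparison of duration, amplitude, and pitch is read off the ROC curves.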