Papers by Jana Dankovicova
In Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge University Press, Cambridge, 1999
ABSTRACT In forensic speaker identification, expert witnesses frequently use acoustic characteristics of speech to estimate whether two (or more) voice samples are from the same or from different speakers. Previous research demonstrated that speaker-idiosyncratic temporal features in speech may be helpful in such a decision-making process (Dellwo & Koreman, 2008). Such temporal features are, for example, durational characteristics of consonantal (C) and vocalic (V) intervals that have previously been widely used for the analysis of language-specific rhythm (e.g. %V and deltaC, see Ramus et al., 1999, or the PVI, see Grabe & Low, 2002). The present research investigated the influence of voice disguise on within-speaker variability of rhythmic acoustic parameters such as %V (percentage over which speech is vocalic), deltaC (standard deviation of C-interval durations), the PVI (Pairwise Variability Index) and some variants of these measures. In a single-subject pilot study we recorded a male native English speaker reading 29 sentences (average number of syllables/sentence: 19.2; standard deviation: 7.1) in his native accent (standard Northern British English) and also using a disguised voice (Liverpudlian English). In the disguised condition the speaker raised his voice in each of the sentences (average f0: normal: 130 Hz, disguised: 160 Hz) and had higher intonational variability (coefficient of variation of f0 standard deviation: normal: 3%; disguised: 20%). Further, under disguise the speaker applied a more breathy voice quality. Informal auditory tests revealed that expert phoneticians could not judge reliably whether the disguised and the normal speech samples were from one or two different speakers.
Results for temporal characteristics of C- and V-intervals such as %V, deltaC, the PVI, and a number of variants of these measures, revealed no differences between the normal and disguised speech conditions (ranges, inter-quartile ranges and medians highly overlapped; independent samples t-tests with voice disguise as a factor and each of the rhythm measures as dependent variable: t between 0.05 and 0.95, p always >0.4). The results provide further evidence for the view that temporal characteristics of C- and V-intervals may be highly speaker idiosyncratic, and it is possible that speakers lack strategies to control these parameters. As such, they may be powerful parameters for forensic speaker identification, in particular in situations involving voice disguise, when other parameters frequently used in forensic speech analysis are typically of little or no help (for our speaker: mean f0, f0 variability, voice quality, speaker accent) or are more affected by signal degradations typical for forensic speech recordings. We are currently carrying out further experiments on the influence of voice disguise on speaker-idiosyncratic temporal characteristics using a larger number of speakers and voice disguise techniques and using a wider range of measurements. We are particularly testing measurements based on timing of voiced and voiceless intervals, which can be processed fully automatically for large corpora (Dellwo, Fourcin & Abberton, 2007).
References
Dellwo, V. and Koreman, J. (2008) How speaker idiosyncratic is measurable speech rhythm? Durational variability in speech and the rhythm class hypothesis. In: C. Gussenhoven and N. Warner (eds.): Papers in Laboratory Phonology 7 (Berlin, a.o.: Mouton de Gruyter).
Ramus, F., Nespor, M., and Mehler, J. (1999) Correlates of linguistic rhythm in the speech signal. Cognition 73: 265-292.
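The rhythm measures named in this abstract can be illustrated with a minimal Python sketch. It assumes the interval durations have already been segmented into lists of C- and V-interval durations in seconds. The function names and the example durations are hypothetical, the population standard deviation is an assumption (the abstract does not say which form was used), and the PVI formulas follow the commonly used raw and normalised definitions rather than anything stated in the abstract.

```python
from statistics import pstdev


def percent_v(v_durs, c_durs):
    """%V: percentage of total duration that is vocalic."""
    total = sum(v_durs) + sum(c_durs)
    return 100.0 * sum(v_durs) / total


def delta(durs):
    """deltaC (or deltaV): standard deviation of interval durations.

    Assumption: population standard deviation.
    """
    return pstdev(durs)


def rpvi(durs):
    """Raw PVI: mean absolute difference between successive intervals."""
    diffs = [abs(a - b) for a, b in zip(durs, durs[1:])]
    return sum(diffs) / len(diffs)


def npvi(durs):
    """Normalised PVI: each difference is divided by the pair's mean duration."""
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(durs, durs[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical interval durations (seconds) for one sentence.
v_example = [0.08, 0.12, 0.06, 0.15]
c_example = [0.07, 0.11, 0.05, 0.09]

print("%V:", percent_v(v_example, c_example))
print("deltaC:", delta(c_example))
print("rPVI (C):", rpvi(c_example))
print("nPVI (V):", npvi(v_example))
```

Computing each measure per sentence in the normal and the disguised condition would give the two samples that a comparison such as the reported t-tests operates on.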
Journal of the International Phonetic Association, 2004
It is often thought that the ability to use prosodic features accurately is mastered in early childhood. However, research to date has produced conflicting evidence, notably about the development of children's ability to mark prosodic boundaries. This paper investigates (i) whether, by the age of eight, children use temporal boundary features in their speech in a systematic way, and (ii) to what extent adult listeners are able to interpret their production accurately and unambiguously. The material consists of minimal pairs of utterances: one utterance includes a compound noun, in which there is no prosodic boundary after the first noun, e.g. ‘coffee-cake and tea’, while the other utterance includes simple nouns, separated by a prosodic boundary, e.g. ‘coffee, cake and tea’. Ten eight-year-old children took part, and their productions were rated by 23 adult listeners. Two phonetic exponents of prosodic boundaries were analysed: pause duration and phrase-final lengthening. The re...
phonetiklabor.de
Work is currently being carried out on a speech database constructed in order to study speech rhythm and speech rate. The database, BonnTempo-Corpus (BTC), and the Praat based analysis tools, BonnTempo-Tools (BTT), are a powerful instrument for ...
Clinical Linguistics & Phonetics, 2011
Foreign accent syndrome (FAS) is an acquired neurogenic disorder characterized by altered speech that sounds foreign-accented. This study presents a British subject perceived to speak with an Italian (or Greek) accent after a brainstem (pontine) stroke. Native English listeners rated the strength of foreign accent and impairment they perceived in the speech of the FAS subject, alongside that of two native English speakers and Italian, Greek, and French L2 speakers acting as controls. The FAS subject was perceived to be as foreign-sounding as the L2 control speakers, but was also perceived as mildly impaired. The FAS subject's own perception of accents was also explored, and it was found that his ability to distinguish presence and absence of accent does not seem to be affected. The relationship between listeners' perceptions and features of the FAS speech is explored via correlational statistics and qualitative analysis. Impressionistic phonetic analysis, supplemented by acoustic analysis, confirmed a number of features consistent with a typical Italian (and also Greek) accent and with the Italian and Greek L2 speakers. A pre-stroke and a post-stroke sample from the FAS subject were compared and the nature of post-stroke changes in segmental realizations is discussed.
Work is currently being carried out on a speech database constructed in order to study speech rhythm in connection with speech rate. The database, BonnTempo-Corpus, and the Praat based analysis tools, BonnTempo-Tools, are a powerful instrument for examining various aspects of recently proposed rhythm measures (e.g. %V, ΔC, nPVI, rPVI) in relation to speech rate among a wide range of languages and speakers. First observations raise new questions about languages that have traditionally been hard to classify, such as Czech.
5th International Conference on Spoken Language Processing (ICSLP 1998)
Some preliminary investigations of within-speaker variations due to voluntary and induced speaking manners have been performed. The ultimate aim of the investigations was to suggest methods for handling within-speaker variations in automatic speaker verification. Special software was developed to systematically elicit different types of voluntary and involuntary speech variations that might realistically occur in everyday situations. A database containing speech from 50 Swedish male speakers was collected using this software. Acoustic analyses have been performed and the results compared between voluntary and involuntary speech variations. The acoustic parameters studied included segment durations, formant frequencies at vowel midpoints, fundamental frequency, overall amplitude, and amplitude in frequency bands.
In Proc. International Congress of Phonetic Sciences, San Francisco, USA, 1999
Speech Communication, 2000
Some experiments on handling within-speaker variations in speaker verification have been performed. To obtain speaker variation, speaking behaviour elicitation software has been developed. It was found that if an ASV system was trained on varied speech, speaker verification on even more varied speech improved significantly. SUMMARY We describe the experiments carried out to take within-speaker variation into account in an automatic speaker verification (ASV) system. To obtain representative variations in a speaker's speech (different speaking rates, particular emotional states, etc.), we developed dedicated software, which we describe. Our work shows that training an ASV system on speech samples recorded under different speaking conditions significantly improves the performance of the system, even when the system has to cope with types of variation other than those seen during training.
Journal of the International Phonetic Association, 1999
This paper reports the results of an experiment on the effects of six speaking styles on some of the acoustic properties of speech. The experiment was part of an exploration of within-speaker variation in connection with automatic speaker verification (ASV), pursuing the hypothesis that the elicitation of style variation in the training phase of an ASV system (‘structured training’) would enhance the performance of the system. Swedish-speaking subjects produced a digit sequence at varying speaking rates and loudness levels, and also with simulated denasality (pinched nose) and under cognitive stress. Duration of vowels and consonants, and formant frequencies of vowels, were measured. A number of consistent patterns of variation emerged for duration and vowel quality and are reported here. The discussion explores the relation between the patterns observed and the success, or in the case of speech under stress the failure, of structured training in reducing the error rates in ASV.
While a number of languages have been classified as either syllable- or stress-timed, the case of Czech remains unclear. In this paper we make predictions about Czech rhythm on the basis of our analysis of syllable complexity in recorded samples of Czech. The results on syllable complexity show mixed features. This is reflected in the classification of Czech rhythm using rhythm measures based on durational variability of consonantal and vocalic intervals.
Computer Speech & Language, 2000
Conversational speech exhibits considerable pronunciation variability, which has been shown to have a detrimental effect on the accuracy of automatic speech recognition. There have been many attempts to model pronunciation variation, including the use of decision-trees to generate alternate word pronunciations from phonemic baseforms. Use of such pronunciation models during recognition is known to improve accuracy. This paper describes the use of such pronunciation models during acoustic model training. Subtle difficulties in the straightforward use of alternatives to canonical pronunciations are first illustrated: it is shown that simply improving the accuracy of the phonetic transcription used for acoustic model training is of little benefit. Analysis of this paradox leads to a new method of accommodating nonstandard pronunciations: rather than allowing a phoneme in the canonical pronunciation to be realized as one of a few distinct alternate phones predicted by the pronunciation model, the HMM states of the phoneme's model are instead allowed to share Gaussian mixture components with the HMM states of the model of the alternate realization. Qualitatively, this amounts to making a soft decision about which surface-form is realized. Quantitative experiments on the Switchboard corpus show that this method improves accuracy by 1.7% (absolute).
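The idea of letting a state share Gaussian mixture components with the state of an alternate realization can be illustrated with a toy one-dimensional sketch in Python. This is not the paper's implementation: the function names, the fixed `share_weight`, and the example parameters are all hypothetical, and in a real system the mixture weights would be re-estimated during training rather than set by hand.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def mixture_likelihood(x, components):
    """Likelihood of x under a mixture given as (weight, mean, var) tuples."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in components)


def share_components(canonical, alternate, share_weight):
    """Build an emission mixture for the canonical state that also
    contains the alternate state's components.

    Assumption: a fixed share_weight in [0, 1] is moved to the borrowed
    components, so the weights still sum to one.
    """
    kept = [((1 - share_weight) * w, m, v) for w, m, v in canonical]
    borrowed = [(share_weight * w, m, v) for w, m, v in alternate]
    return kept + borrowed


# Hypothetical single-component states for a canonical phone and its
# alternate surface realization.
canonical_state = [(1.0, 0.0, 1.0)]
alternate_state = [(1.0, 3.0, 1.0)]
shared_state = share_components(canonical_state, alternate_state, 0.3)

# A frame that looks like the alternate phone is no longer strongly
# penalised when scored against the canonical phoneme's state.
frame = 3.0
print(mixture_likelihood(frame, canonical_state))
print(mixture_likelihood(frame, shared_state))
```

Because the frame is scored against one mixture that covers both realizations, no hard choice between surface forms is needed, which is the "soft decision" the abstract describes.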
Language and Speech, 2007
Few attempts have been made to look systematically at the relationship between musical and intonation analysis skills, a relationship that has been to date suggested only by informal observations. Following Mackenzie Beck (2003), who showed that musical ability was a useful predictor of general phonetic skills, we report on two studies investigating the relationship between musical skills, musical training, and intonation analysis skills in English. The specially designed music tasks targeted pitch direction judgments and tonal memory. The intonation tasks involved locating the nucleus, identifying the nuclear tone in stimuli of different length and complexity, and same/different contour judgments. The subjects were university students with basic training in intonation analysis. Both studies revealed an overall significant relationship between musical training and intonation task scores, and between the music test scores and intonation test scores. A more detailed analysis, focusing on the relationship between the individual music and intonation tests, yielded a more complicated picture. The results are discussed with respect to differences and similarities between music and intonation, and with respect to form and function of intonation. Implications of musical training on development of intonation analysis skills are considered. We argue that it would be beneficial to investigate the differences between musically trained and untrained subjects in their analysis of both musical stimuli and intonational form from a cognitive point of view.
ProSynth uses a hierarchical prosodic structure (implemented in XML) as its core linguistic representation. To model intonation we map template representations of F0 contours onto this structure. The template for a particular pitch pattern is derived from analysis of a labelled speech database. For a falling nuclear pitch accent this template has three turning points: two define the F0 peak and one marks the end of the F0 fall. Statistical analysis confirmed that the alignment and shape of the template are sensitive to the properties of the structure and also provided quantitative values for F0 synthesis. Our results suggest that phonetic interpretation of the nuclear pitch accent is best related to the accented Foot rather than to the accented syllable. In determining parameter values for synthesis, we conclude that F0 information should be integrated with temporal and segmental information.
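A three-turning-point template of the kind described for the falling nuclear accent can be sketched in Python as a list of (time, F0) pairs with an F0 value read off between them. The abstract does not say how F0 is interpolated between turning points, so the linear interpolation here is an assumption, and the function name and the example times and frequencies are hypothetical.

```python
from bisect import bisect_right


def f0_at(turning_points, t):
    """Return F0 (Hz) at time t (s) from a list of (time, F0) turning points.

    Assumption: straight-line interpolation between neighbouring points,
    with F0 held constant before the first and after the last point.
    """
    times = [p[0] for p in turning_points]
    if t <= times[0]:
        return turning_points[0][1]
    if t >= times[-1]:
        return turning_points[-1][1]
    i = bisect_right(times, t)
    (t0, f0), (t1, f1) = turning_points[i - 1], turning_points[i]
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


# Hypothetical falling-accent template: two points delimit the F0 peak,
# the third marks the end of the fall.
falling_template = [(0.10, 180.0), (0.15, 180.0), (0.35, 100.0)]

# Sample the contour every 50 ms across the template.
for step in range(9):
    t = step * 0.05
    print(round(t, 2), f0_at(falling_template, t))
```

In a synthesis setting the times of the three points would be set relative to the prosodic structure (for example the accented Foot) rather than fixed in absolute seconds as they are in this toy example.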