Rhythmic patterns and literary genres in synthesized speech

2016, Speech Prosody 2016

In this paper, the rhythmic patterns observed in natural and synthesized speech are compared for three literary forms (rhymes, poems, and fairy tales). The aim of the comparison is to evaluate how rhythm could be improved in synthesized speech, which could allow adapting it to specific styles or genres. The study is based on the analysis of a corpus of six rhymes, four poems and two extracts from fairy tales. All texts were recorded by three speakers and were generated with two distinct synthesized voices. Rhythmic patterns are compared by analyzing duration in relation to prosodic structure in the various data sets. This approach shows that rhythmic differences between synthesized and natural speech are mostly due to the marking of prosodic structure.

Elisabeth Delais-Roussarie (1), Damien Lolive (2), Hiyon Yoo (1) and David Guennec (2)
(1) UMR 7110-LLF & Université Paris-Diderot, France
(2) IRISA, ENSSAT, Université Rennes 1, France
[email protected], [email protected], {damien.lolive/david.guennec}@irisa.fr

To cite this version: Elisabeth Delais-Roussarie, Damien Lolive, Hiyon Yoo, David Guennec. Rhythmic Patterns and Literary Genres in Synthesized Speech. Speech Prosody, 2016, Boston, United States. ⟨hal-01338873⟩
HAL Id: hal-01338873, https://hal.inria.fr/hal-01338873. Submitted on 1 Jul 2016.

Index Terms: Rhythmic patterns, phono-genre, speech synthesis, prosodic structure.

1. Introduction

In the last twenty years, the overall quality of synthesized speech has greatly improved with the emergence of new TTS techniques, including corpus-based concatenative speech synthesis systems ([1] and [2]). Nevertheless, generating natural-sounding prosody remains a challenge (see [3] among others). More specifically, the rhythmic component of these systems often sounds odd and unnatural, and needs to be improved before synthesis can be used in a wide range of applications (games, educational software, etc.). In a research project aiming at using speech synthesis to teach writing skills to primary school pupils, it appeared important to improve the TTS system so that it reads different types of data more accurately: fairy tales, poetry and rhymes.

To achieve this goal, we compared the rhythmic patterns obtained in natural and synthesized speech in the different genres. Regarding synthesized speech, one of our hypotheses was that more accurate rhythmic patterns would be observed in fairy tales, since the corpora used to select the speech units for the TTS system are mostly composed of read sentences extracted from audiobooks. Since our findings do not really confirm this hypothesis, it seemed important to understand why the rhythmic patterns were more accurate for poems and rhymes than for fairy tales.

This paper is organized as follows. Section 2 describes the data and methods used for the study. Section 3 presents the results obtained from the prosodic analyses by comparing duration patterns and speech rate along at least two dimensions (synthesized vs. natural speech, differences among the literary genres). The main findings are discussed in section 4, with a focus on what is crucial to improve TTS systems.

2. Corpus and Methodology

2.1. Corpus

The corpus used to study the rhythmic patterns obtained in natural and synthesized speech consisted of three distinct types of texts that could all be addressed to children: six rhymes, four poems and two extracts from fairy tales. Table 1 summarizes the exact composition of the corpus according to literary genres. Differences among speakers and synthesized voices result mostly from schwa insertion or deletion, and from word omission (in the pronunciation of titles for poems and rhymes, for instance). The effective number of syllables obtained for the three speakers and the two synthesis systems is given in the last column.

Table 1. Corpus composition

Speaking style   Number of words   Number of syll.   Effective number of syll. (natural / synth.)
Rhymes           158               228               683 / 454
Poems            290               422               1347 / 808
Tales            522               777               2323 / 1538
Total            970               1427              4353 / 2800

The set of texts was recorded by three speakers (two males and one female) in a sound-proof room. The participants were given time to read and rehearse the texts before recording. Two of the three speakers read the texts as parents would read a story to their children, whereas the third, a trained actor, read them with great expressivity.

The synthesized stimuli were produced by the corpus-based TTS system presented in [4], in which pre-selection filters are used instead of a target cost. For the purpose of this study, the ordered set of filters is the following:
1. Unit label (cannot be relaxed).
2. Is the unit a Non Speech Sound? (cannot be relaxed)
3. Is the phone in the last syllable of its sentence?
4. Is the phone in the last syllable of its major prosodic group (IP)?
5. Is the current syllable in word end?
6. Is the current syllable with a rising intonation?

During the search for the best unit sequence, if the number of units corresponding to a given set of filters is too low, the last filter of the set is relaxed. Reducing the number of applied constraints widens the search space. In any case, the first two filters are kept. Furthermore, a penalty is applied to phoneme classes for which concatenation seems to be risky (see [5]). Indeed, we consider that joining two units on a vowel is more likely to produce an artefact than joining them on the silent part of a plosive or even on a fricative.

Concerning prosody, no specific treatment is applied, and the only constraints that may improve the rhythm of the generated speech are the pre-selection filters, as they impose positional constraints on the selected units. Finally, pauses are placed at designated places by the system: a pause is, for instance, inserted after each punctuation mark. Note also that pause duration remains fixed and is not related to the length of the preceding speech stretch.

For this study, two distinct synthesized voices were used. They differ in the way they were produced:
• voice SY-P, a male voice, is based on a corpus of 10 hours extracted from an audiobook, i.e. a novel read by an actor;
• voice SY-A, a female voice, consists of 7 hours of read speech, the read items being specifically designed to build a speech synthesis system.

The differences in the content and the size of the corpora lead us to consider voice SY-P as more expressive than voice SY-A, which is more neutral.

To generate the synthesized stimuli, the division of poems and rhymes into stanzas and lines was represented by punctuation marks such as commas. The three stanzas in (1), which are extracted from a poem (La fourmi, R. Desnos), were typed as shown in (2) to obtain the synthesized version.

(1) Une fourmi traînant un char
    plein de pingouins et de canards
    ça n'existe pas, ça n'existe pas

    Une fourmi parlant français
    parlant latin et javanais
    ça n'existe pas, ça n'existe pas

    eh ! et pourquoi pas !

(2) Une fourmi traînant un char, plein de pingouins et de canards, ça n'existe pas, ça n'existe pas. Une fourmi parlant français, parlant latin et javanais, ça n'existe pas, ça n'existe pas. Eh ! Et pourquoi pas !

As shown above, the end of a stanza is encoded by a full stop when no punctuation mark was used in the original text. Line ends were encoded by a comma, except when the line already ends with a punctuation mark in the text. The other parts of the text remain unchanged.

2.2. Methodology

The data were first orthographically transcribed and segmented into utterances using PRAAT [6]. The orthographic transcription was then phonetized, and the audio signal automatically segmented into phones, syllables, and graphemic words by means of the speech processing script EASYALIGN [7]. The resulting phonetic transcriptions and acoustic segmentations were checked and corrected when necessary. The entire annotated data set was then used to carry out the rhythmic and prosodic analysis.

To generate the duration patterns and to analyze and compare pause durations and speech rates according to speakers and genres, vowels were chosen as the base unit instead of syllables. This choice results from the fact that syllable structures vary a lot in French, so that syllabic duration cannot be a robust indicator for evaluating the lengthening rate. As the number of vowels located in the different prosodic positions was limited because of the size of the corpus, it was difficult to normalize duration. We thus decided to make a distinction between long and short vowels, even if such a distinction does not exist in the French phonological system. Nasal vowels ([ɛ̃], [œ̃], [ɑ̃] and [ɔ̃]) and sequences composed of a semi-vowel and a vowel in nuclear position (for instance, [jɛ̃] in tiens [tjɛ̃], [wa] in noir [nwaʁ]) were thus encoded as long vowels, whereas the remaining oral vowels were considered as short.

Since previous studies on French prosody showed that phrasing, intonation and accentuation are highly intertwined in this language (see [8] among others), all sentences from the different texts were segmented into prosodic phrases, a distinction being made between three levels of phrasing (prosodic word PWD, phonological phrase PP and intonational phrase IP). Rules were used to derive the prosodic phrases from the text, i.e. from the morpho-syntactic structure and the number of syllables (see, among others, [9], [10] and [11]). Such an approach has the advantage of avoiding a certain circularity.

Since the last syllable of prosodic phrases is considered as accented in French and is usually lengthened (see [12]), we distinguish three categories of accented syllables in order to compare the lengthening rate of the accented syllables in relation to their prosodic position:
• AC-PWD, which corresponds to the last metrical syllable of a prosodic word, i.e. a word from a lexical category such as Verb, Noun, Adjective and Adverb (see [13] and [14] among others);
• AC-PP, which coincides with the last metrical syllable of a minor phrase, i.e. of the lexical head of a syntactic projection (see [9], [15] and [16] among others);
• AC-IP, which corresponds to the last metrical syllable of any IP, IP boundaries being located at the end of a clause, a detached syntactic constituent, or a line (in poems and rhymes); see [14], [17] and [18] among others.

3. Results

The duration patterns obtained for natural and synthesized speech allow us to analyze and compare speech rates, pause duration and distribution, and prosodic structure marking. The results are presented in the following two sub-sections.

3.1. Speech rate and pausing

The total duration of the various readings was used to calculate, for each speaker and each genre, the speech and articulation rates as well as pause durations. Articulation rate differs from speech rate in that pauses are not taken into account in its calculation (see [19]). Table 2 summarizes the results obtained for each speaker in the three distinct genres. For each genre, the first two rows give the speech and articulation rates in phones per second, and the last two rows give the duration and proportion of pauses.

Table 2. Speech and articulation rates (in phones/sec), pause duration, and percentage of pauses (relative to the total duration of the readings)

                                          LOD       DRE       GOR       SY-A      SY-P
Rhymes
  Average speech rate (ph./sec)           9.9       7.35      7.08      7.63      9.09
  Average articulation rate (ph./sec)     12.09     7.83      8.53      9.79      12.61
  Total pause duration (ms)               2178.92   1449.22   2573.76   3025      3000
  Average % of pauses                     25.27     13.60     24.15     29.06     33.80
Poems
  Average speech rate (ph./sec)           10.6      8.16      6.28      8.26      9.45
  Average articulation rate (ph./sec)     13.60     9.32      8.72      10.70     12.85
  Total pause duration (ms)               1534      1373.36   2590.52   2000      2000
  Average % of pauses                     27.38     18.10     33.85     28.17     31.29
Tales
  Average speech rate (ph./sec)           10.58     8.74      8.18      9.31      10.79
  Average articulation rate (ph./sec)     14.99     10.08     11.09     11.36     13.68
  Total pause duration (ms)               1331.33   763.40    1482.06   992       992.14
  Average % of pauses                     32.82     18.06     29.96     21.96     24.79

The articulation and speech rates observed for each genre vary a lot, but one cannot say that synthesized voices differ from natural ones. LOD and SY-P speak faster than the other speakers in all genres, whereas GOR and DRE obtain the lowest rates. Within a given literary genre, the speech and articulation rates obtained by synthesized voices fall within the range of variation defined by the three natural voices.
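The distinction between the two rates can be made concrete with a minimal sketch. The function name `reading_rates`, its parameters and the numbers in the usage line are hypothetical and are not taken from Table 2. The sketch only assumes that speech rate divides the number of phones by the total reading time, whereas articulation rate divides it by the time that remains once pauses are removed.

```python
def reading_rates(n_phones, total_s, pause_s):
    """Return (speech_rate, articulation_rate, pause_pct) for one reading.

    n_phones: number of phones produced in the reading
    total_s:  total duration of the reading in seconds, pauses included
    pause_s:  summed duration of all pauses in seconds
    """
    if total_s <= 0 or not 0 <= pause_s < total_s:
        raise ValueError("expected total_s > 0 and 0 <= pause_s < total_s")
    # Speech rate: pauses count as part of the reading time.
    speech_rate = n_phones / total_s
    # Articulation rate: pauses are excluded from the reading time.
    articulation_rate = n_phones / (total_s - pause_s)
    # Share of the reading that is spent pausing, in percent.
    pause_pct = 100.0 * pause_s / total_s
    return speech_rate, articulation_rate, pause_pct


# Hypothetical reading: 300 phones in 40 s, 10 s of which are pauses.
speech, articulation, pct = reading_rates(300, 40.0, 10.0)
print(speech, articulation, pct)
```

Under this definition the articulation rate can never be lower than the speech rate, and the gap between the two grows with the proportion of pauses.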
A comparison across genres shows that human speakers adapt their speech and articulation rates to genres, slower rates being used for reading rhymes and poetry, whereas this adaptation is less clear for synthesized speech. This derives from the fact that the same corpus and the same unit selection procedure are used in all genres by the two synthesized voices. Differences are however minor.

Concerning pausing, there is an important difference between natural and synthesized speech, across genres as well as in general. The proportion of pauses is lower in rhymes than in tales in the productions of all three speakers. By contrast, there is a higher proportion of pauses in rhymes than in tales for the two synthesized voices. In addition, pause duration appears to be related to articulation rate in natural voices, longer pauses being observed in rhymes and poetry.

By and large, no great difference is observed between natural and synthesized speech concerning speech and articulation rates. Indeed, rates vary a lot between speakers, but synthesized voices vary along the same lines. By contrast, pause duration and proportion differ between synthesized and natural voices. Since pauses may also be used to encode prosodic structure, a careful analysis of duration patterns with respect to prosodic structure is provided in the following section.

Finally, the main difference between the two synthesized voices comes from their nature, as SY-A is read speech while SY-P is more expressive. For instance, the average speech and articulation rates differ markedly between the two voices.

3.2. Prosodic structure and duration patterns

In French, syllabic and vocalic lengthening mostly indicates phrasing and accentuation. Indeed, accented syllables, which correspond to the last full syllable at any level of prosodic structure, are lengthened, the lengthening rate being generally related to the level of phrasing (e.g. [15], [18]).
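As a rough illustration, a lengthening rate of this kind can be expressed as the relative increase of the mean duration of accented vowels over the mean duration of unaccented vowels. The formula, the function name `lengthening_rates`, the `UNACC` label and the toy durations below are assumptions made for the sketch; they are not the procedure or the data of the study.

```python
from statistics import mean


def lengthening_rates(vowels):
    """Compute a lengthening rate (in %) per accented position.

    vowels: iterable of (duration_ms, position) pairs, where position is
    "UNACC" for vowels in unaccented syllables, or one of the accented
    categories "AC-PWD", "AC-PP", "AC-IP".
    """
    by_position = {}
    for duration_ms, position in vowels:
        by_position.setdefault(position, []).append(duration_ms)
    if "UNACC" not in by_position:
        raise ValueError("at least one unaccented vowel is required")
    # Reference value: mean duration of vowels in unaccented syllables.
    base = mean(by_position["UNACC"])
    # Assumed formula: relative increase over the unaccented mean.
    return {
        position: 100.0 * (mean(durations) - base) / base
        for position, durations in by_position.items()
        if position != "UNACC"
    }


# Toy data (hypothetical durations in ms).
toy = [(60, "UNACC"), (80, "UNACC"),
       (84, "AC-PWD"), (105, "AC-PP"), (140, "AC-IP")]
print(lengthening_rates(toy))
```

With such a measure, a rate of 0% means that accented vowels are no longer than unaccented ones, and a rate of 100% means that they are twice as long.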
Lengthening rates were thus computed by comparing the duration of vowels in unaccented syllables with the duration of the nucleus of the last metrical syllable (i.e. the accented syllable) at the level of the prosodic word (PWD), the phonological phrase (PP) and the intonational phrase (IP). Table 3 summarizes the results obtained per genre. The mean duration of vowels in unaccented syllables is given in the first row for each genre, the lengthening rates being given in the following rows.

As shown in Table 3, there is relatively important variation in the duration of vowels in unaccented positions between the different genres, especially for human speakers. In general, vowels in unaccented positions are longer in rhymes and poems than in tales. By contrast, no clear variation is observed between genres for synthesized voices. This result confirms that human speakers adapt their speaking rate to genres, unlike synthesized voices.

As far as edge marking is concerned, lengthening always occurs at the end of the three distinct levels of phrasing, i.e. prosodic word, phonological phrase and intonational phrase, in synthesized as well as in natural speech. Across all genres, the lengthening rate is from 10 to 20% at the PWD level, 30 to 60% at the PP level, and 80 to 180% at the IP level. These rates correspond to what is often said about French durational patterns. In rhymes and, to a lesser extent, in poetry, lengthening rates do not clearly distinguish the three levels of phrasing (e.g. differences between PWD and PP for LOD, DRE and SY-P in rhymes, and differences between PP and IP for LOD and GOR in poetry). Note also that the lengthening rates marking IP boundaries are higher in all genres for SY-A than for human speakers; for SY-P, they are proportionally higher in tales.

Table 3. Mean duration of vowels in unaccented syllables (in ms) and lengthening rate (in %) at three levels of phrasing (PWD, PP and IP)

                                LOD     DRE     GOR     SY-A    SY-P
Rhymes
  Mean unacc. duration (ms)     66      130     93      81      68
  Length. rate AC-PWD           30%     20%     20%     20%     30%
  Length. rate AC-PP            20%     20%     50%     30%     10%
  Length. rate AC-IP            90%     70%     70%     150%    60%
Poems
  Mean unacc. duration (ms)     67      110     95      78      69
  Length. rate AC-PWD           10%     20%     40%     20%     20%
  Length. rate AC-PP            60%     40%     100%    50%     40%
  Length. rate AC-IP            60%     70%     80%     190%    80%
Tales
  Mean unacc. duration (ms)     59      99      78      77      65
  Length. rate AC-PWD           10%     20%     10%     20%     20%
  Length. rate AC-PP            20%     40%     50%     40%     30%
  Length. rate AC-IP            80%     80%     100%    190%    100%

On the whole, the duration patterns obtained for synthesized speech in all genres are relatively comparable to what is observed in natural speech: the different levels of phrasing are encoded by a lengthening whose relative rate varies in relation to boundary strength (see [15], [17] and [20] among others).

4. Discussion

The comparison between synthesized speech and natural speech does not show strong differences. The variation that occurs in speech and articulation rates does not distinguish natural speech from synthesized speech. Concerning final lengthening and edge marking, the prosodic analysis clearly showed that final lengthening occurs in natural and in synthesized speech, despite some differences in the lengthening rate observed at the level of the IP for SY-A (in all genres) and for SY-P (in tales, to a lesser extent). It is doubtful, however, that these differences explain the lack of naturalness in rhythm. When listening to the synthesized stimuli, we were surprised by the quality of the rhythmic patterns observed in rhymes, especially for SY-A. Indeed, they sounded very natural in comparison to those obtained for tales. The problems encountered in rhythm therefore cannot be attributed to extra lengthening at the IP level.
Since speech and articulation rates, on the one hand, and durational marking of the prosodic structure, on the other, cannot be invoked to account for the lack of naturalness in the rhythmic patterns, other explanations have to be found. In fact, two lines of research are worth exploring. Firstly, no correlation between speech rates, boundary strength and pause duration is observed in synthesized speech, whereas such a correlation exists in natural speech. Indeed, prosodic phrases such as PPs and IPs tend to have the same number of syllables or the same duration in French (see, among others, [9], [10], [11] and [21]), and pause durations may be of importance to obtain isochrony. In synthesized speech, pause durations remain constant.

In addition, tonal patterns probably play a role in the development of rhythmic patterns. By inserting a comma at the end of each line, the realization of a non-final melodic contour (i.e. a continuation rise) was forced in rhymes and poetry. Since this contour was repeated regularly, it reinforced the impression of rhythm. By contrast, the form and the occurrence of tonal movements are less controlled in tales. As a consequence, the recurrence of prosodic patterns, which is crucial for rhythm, was not obtained.

5. Conclusion and perspectives

The analysis of the duration patterns observed in natural and synthesized speech for the three literary genres showed clearly that duration cannot by itself explain the lack of naturalness of the rhythmic patterns in speech synthesis. Values obtained for segmental duration and edge marking are indeed comparable in all cases. Further research on a larger corpus is necessary.
In addition, three points are worth investigating to improve the unit selection procedure in the speech synthesis system and, hence, rhythmic patterns:
• Clearly distinguishing the various levels of phrasing: at present, the lengthening rates observed at the end of the three levels of phrasing may lead to treating PWD and PP together, on the one hand, and IP on the other. In natural speech, rates are located along a continuum, in all genres and for all speakers;
• Taking into account the form of the tonal movements realized on accented syllables: the procedure used to generate the synthesized stimuli forced the insertion of a specific tonal contour at the end of each line, i.e. at a reasonable distance in terms of number of syllables;
• Adapting lengthening rates, articulation rates and pause duration to genres, but also to satisfy some kinds of correlation.

6. Acknowledgements

The research presented here is supported by the French Investissements d’Avenir - Labex EFL program (ANR-10-LABX-0083). It was also partly carried out within the ANR research project SynPaFlex (ANR-15-CE23-0015).

7. References

[1] Y. Sagisaka, “Speech synthesis by rule using an optimal selection of non-uniform synthesis units”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, pp. 679-682, 1988.
[2] A. Hunt and A. Black, “Unit selection in a concatenative speech synthesis system using a large speech database”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 373-376, 1996.
[3] M. Schröder, “Expressive Speech Synthesis: Past, Present, and Possible Futures”, in Affective Information Processing, pp. 111-126, London: Springer, 2009.
[4] D. Guennec and D. Lolive, “Unit Selection Cost Function Exploration Using an A* Based Text-to-Speech System”, in P. Sojka, A. Horák, I. Kopeček and K. Pala (eds), Text, Speech and Dialogue 2014, LNCS, Springer, Heidelberg, vol. 8655, pp. 432-440, 2014.
[5] P. Alain, J. Chevelu, D. Guennec, G. Lecorvé and D. Lolive, “The IRISA Text-To-Speech System for the Blizzard Challenge 2015”, Blizzard Challenge 2015 Workshop, 2015.
[6] P. Boersma and D. Weenink, “Praat: doing phonetics by computer (Version 5.5)”, www.praat.org, 2014.
[7] J.-Ph. Goldman, “EasyAlign: an automatic phonetic alignment tool under Praat”, Proceedings of Interspeech 2011, pp. 3233-3236, 2011.
[8] B. Post, “The multi-faceted relation between phrasing and intonation in French”, in C. Gabriel and C. Lleó (eds), Intonational Phrasing at the Interfaces: Cross-Linguistic and Bilingual Studies in Romance and Germanic, pp. 44-74, Amsterdam: Benjamins, 2011.
[9] E. Delais-Roussarie, “Phonological phrasing and accentuation in French”, in M. Nespor and N. Smith (eds), Dam phonology: HIL phonology papers II, Den Haag: Holland Academic Graphics, pp. 1-38, 1996.
[10] P. Martin, “Prosodic and rhythmic structures in French”, Linguistics 25, pp. 925-949, 1987.
[11] V. Pasdeloup, “A prosodic Model for French Text-to-speech synthesis: A psycholinguistic approach”, in G. Bailly, C. Benoit and T.R. Sawallis (eds), Talking Machines: Theories, Models, and Designs, Elsevier Science Publishers, pp. 335-48, 1992.
[12] J. Fletcher, “Rhythm and final lengthening in French”, Journal of Phonetics 19, pp. 193-212, 1991.
[13] P. Mertens, J.-P. Goldman, E. Wehrli and A. Gaudinat, “La synthèse de l’intonation à partir de structures syntaxiques riches”, Traitement Automatique des Langues 42(1), pp. 142-195, 2001.
[14] M. Nespor and I. Vogel, Prosodic phonology, Dordrecht: Foris, 1986.
[15] B. Post, Tonal and phrasal structures in French intonation, Den Haag: Holland Academic Graphics, 2000.
[16] E. Selkirk, “On derived domains in sentence phonology”, Phonology Yearbook 3, pp. 371-405, 1986.
[17] E. Delais-Roussarie, B. Post, M. Avanzi, C. Buthke, A. Di Cristo, I. Feldhausen, S-A. Jun, P. Martin, T. Meisenburg, A. Rialland, R. Sichel-Bazin and H. Yoo, “Intonational Phonology of French. Developing a ToBi system for French”, in S. Frota and P. Prieto (eds), Intonation in Romance, Oxford University Press, pp. 63-100, 2015.
[18] C. Portes and R. Bertrand, “Permanence et variation des unités prosodiques dans le discours et l’interaction”, Journal of French Language Studies 21, pp. 97-110, 2011.
[19] A-C. Simon, A. Auchlin, M. Avanzi and J.-Ph. Goldman, “Les phonostyles: une description prosodique des styles de parole en français”, in M. Abécassis and G. Ledegen (eds), Les voix des Français : en parlant, en écrivant, Bern: Lang, pp. 71-88, 2010.
[20] E. Delais-Roussarie and I. Feldhausen, “Variation in Prosodic Boundary Strength: a study on dislocated XPs in French”, in N. Campbell, D. Gibbon and D. Hirst (eds), Proceedings of Speech Prosody 2014, Dublin, May 2014, pp. 1052-1056, 2014.
[21] F. Wioland, Prononcer les mots du français. Des sons et des rythmes, Paris: Hachette, 1991.