9th International Conference on Speech Prosody 2018
13-16 June 2018, Poznań, Poland
Investigating prosody in music and speech
Yundu Wang ¹, Elinor Payne ²
¹ Guildhall School of Music & Drama, London, UK
² Phonetics Laboratory, University of Oxford, and St Hilda’s College, Oxford, UK
[email protected],
[email protected]
Abstract
We investigated the speech and musical performances of six classical pianists, of native Mandarin Chinese and English language backgrounds, comparing the prosodic properties of their speech with temporal expressivity in their piano performances. We expected intra-language consistency. Results, while mixed, suggest both intra-language and intra-speaker consistency, which implies that individual expressive (performative) ability affects both speech and music.
Index Terms: musical performance, prosody, English, Mandarin Chinese, rhythm, production, speech

1. Introduction
Studies on the cognitive parallels between speech and music have primarily focused on perception. Research suggests that the two domains share ‘basic processing mechanisms’, including the ability to absorb and learn sound categories, to perceive regularities from rhythmic and melodic sequences, to integrate elements such as words and musical tones into syntactic structures, and to extract meaning and emotional responses from sound (cf. [1]).

Cross-domain research on performance is limited, but studies suggest that classical musicians exhibit expressive nuances in their performances that are similar to prosodic elements in speech. Examples include phrase-final lengthening to mark boundaries and changes in intensity and duration to mark prominence ([2][3]).

Combinative studies that are both cross-domain and cross-linguistic are rare, but promising. A speech rhythm index (nPVI) ([4]), based on capturing the degree of variability in vocalic intervals in speech, was used to examine the musical themes of composers from language backgrounds claimed to belong to differing linguistic rhythm groups ([5][6]). It was determined that the nPVI scores for music corresponded with associated language scores: music for which a higher nPVI was obtained (e.g. American, English, and Swedish) was written by composers whose native language was also associated with a high linguistic nPVI, whereas French, Italian, and Spanish music had significantly lower nPVI. While the neat typological rhythmic classification of languages has met with some criticism, not least because rhythm metric scores have been shown to vary as a function of speech task and individual performance as much as between languages, there are broad distributional differences between languages that suggest linguistic properties do play some role in shaping, or constraining, the rhythmic properties of speech. Here we investigate whether linguistic background may also influence musical expressivity. While some empirical work on cross-linguistic influences upon musical performance exists (e.g. [7][8][9]), none has involved both ecological validity (use of repertoire from the Western classical canon in a performance setting) and analyses of the participants’ speech.

It has been proposed that certain linguistic properties, such as syllable structure, quantitative vowel reduction, and stress-induced lengthening, contribute to differing rhythmic impressions in speech, giving rise to a distinction between so-called ‘stress-timing’ and ‘syllable-timing’ ([10]), terms which have persisted even though it is now understood that the basis of these perceptual differences is not isochrony of units such as the syllable or prosodic foot. Research suggests that read Mandarin Chinese speech has a lower nPVI_V and a higher %V than read English speech, and that native Mandarin speakers of English have a lower VarcoV and higher %V than native English speakers ([11]). While the tone system of Mandarin and stress system of English are fundamentally different, it has been claimed that Mandarin has some form of prominence alternation, with ‘neutral’, toneless syllables having reduced duration and schwa-like qualities ([12][13]). In terms of boundaries, research suggests that English phrase-final lengthening occurs only on the syllable that actually abuts the boundary, while lengthening in Mandarin is reported to begin earlier (i.e. during the pre-final syllable) and the boundary syllable itself is shorter than in English (in spontaneous speech; [15]).

This study investigates whether a performer’s native language influences or informs the ‘musical prosody’ ([3]) of his or her performance. Specifically, it seeks to answer the question, ‘Is language influence reflected through the expressive execution of rhythm and phrasing during performance?’ This preliminary investigation has the following hypothesis: within musical performance, we predict differences in the degree of timing variability (above and beyond what is prescribed by the composed notation) between language groups.

2. Methodology

2.1. Participants
The participants were six classical pianists residing, at the time of the experiment, in London, UK. Three participants were native speakers of Mandarin Chinese (Putonghua) with L2 English. The remaining three were monolingual English speakers: two American and one British. All three Mandarin speakers were from mainland China and had lived in the UK for between 1 and 9 years. All participants completed a language proficiency questionnaire to build fluency profiles, which included time spent in an English-speaking country as well as self-ratings of language fluency in English and Mandarin.

DOI: 10.21437/SpeechProsody.2018-111

2.2. Materials and elicitation
The English speech data, read by all six participants, consisted of productions of an excerpt from ‘A Serious Case’ ([16]). The Mandarin speech data, read only by the three Mandarin speakers, consisted of read productions of an excerpt from ‘Outside the Window’ (窗外) ([17]). The musical data consisted of performances of an excerpt from ‘Rosemary’, a work for piano solo composed by the English composer Frank Bridge (1879-1941). Each participant was asked to play the excerpt in two styles: i) mechanically (without expression) and ii) with expression. Studies have shown that such directions alter the amount of a performer’s expressive nuances ([18][19]). Both versions were recorded for each participant. Auditory impressions of the expressive performances of the six participants were made by both the author (a professional classical pianist) and co-author, prior to analyses. A further professional classical pianist was recruited to conduct a blind listening test of the music recordings.

Both speech and music recordings were made in the recording studio of the Guildhall School of Music & Drama in London. The speech recordings were made in a vocal booth equipped with a Neumann U87 microphone. The pianists performed on a Steinway Model B grand piano, recorded with a pair of DPA 4011 (cardioid) microphones. All recordings were made using ProTools software.

Figure 1: The piano score of ‘Rosemary’ from 3 Sketches, H. 68 (1906)

2.3. Labelling
Consonantal and vocalic intervals were segmented from the waveform and spectrogram, and start-points and end-points were labelled on a syllabic tier in Praat. Vocalic and consonantal segmentation was carried out with reference to standard criteria: placement of boundaries between vocalic and consonantal intervals was guided primarily by the presence of a sudden, significant drop in amplitude and a break in the formant structure, particularly F2. Marking of the consonant onsets was facilitated by various cues, according to the manner of the consonant. In addition to the marking of syllable boundaries, syllables were labelled according to segmental content and according to level of prominence. For prosodic prominence, syllables were labelled as belonging to one of three levels: i) unstressed; ii) stressed; iii) nuclear stressed. Both intonational and intermediate phrase boundaries were identified and marked. Syllables, as well as vocalic and consonantal intervals, were categorized according to phrase position (initial, medial, or final).

The music recordings were visualised through spectrograms, and note onsets (time instances) were labelled manually using Sonic Visualiser ([20]), a free software system designed for the analysis of musical sound files. As with the speech analysis, labelling was aided by spectrographic and audio information. Timings of onsets were labelled on three separate beat levels (time instance layers): i) semiquaver beat; ii) quaver beat; iii) crotchet beat. Bar timings were labelled on a separate layer. Time instances of the melody in the right hand (notes in the treble clef of Figure 1) were also labelled. The differences between the time points of successive time instances yielded the inter-onset intervals (IOIs). IOIs for each beat level were extracted from both the mechanical and expressive versions and labelled according to bar and beat numbers. IOIs for the melody layer were labelled according to note durations (DQ for dotted quaver, SQ for semiquaver, C for crotchet, and M for minim).

2.4. Analysis
Durations of syllabic, consonantal, and vocalic intervals were extracted using a Praat script. The following rhythmic measures were calculated for four English sentences by each speaker, and mean values calculated for each speaker: %V, ∆V, VarcoV, and nPVI_V (this paper focuses on English speech; only the syllabic intervals of the Mandarin Chinese sentences were extracted, for speech rate analysis). For the purposes of this study (see [5] and [6]), only vocalic durations were analysed, since these are comparable with note durations. Music durations of each performance were analysed and the mean, standard deviation, nPVI_V, and VarcoV were calculated. For this preliminary study, only durations at the quaver beat level were targeted (the quaver pulse allows for detailed durational analysis without resorting to note-by-note durational extraction).

In both speech and music analyses, duration values for phrase-final units were excluded from the calculations due to their possible distortive effect on the measures. In speech, pre-final and final phrase syllables were excluded. This is to control for any pre-final syllable lengthening that may occur in the Mandarin speakers. In music, penultimate and ultimate bars were excluded for the same purpose. While degree of final lengthening is of interest in itself, it can be analysed separately from measures of more holistic timing variability.

Speech rate and overall tempo were calculated for speech (both English and Mandarin Chinese sentences) and music, respectively. Speech rate (syllables per second or sps) was calculated manually by dividing the total number of syllables by the total length of the sentence. Again, the final and pre-final syllables were excluded to control for phrase-final lengthening. Large pauses (often triggered in speech by the presence of commas and semicolons, and demarcating phrase boundaries) were excluded from the total length. Interestingly, all three Mandarin speakers of English had a number of smaller pauses between syllables, and it was decided that these should be included in the total length sum, as such pauses are interpreted as a characteristic of the three Mandarin speakers (when speaking L2 English). The scores of four sentences were averaged for each speaker. In music, the average quaver-note duration was calculated by dividing the total duration by the total number of quaver beats (16 quaver beats were analysed for each performance). It was necessary to exclude beats that were involved in phrase-final lengthening and prominence cues (conspicuous slowing down, lengthening and/or delaying of notes to highlight importance) (see [21] for reference). From the average quaver-note duration, the average metronome speed was determined (quavers per minute or qpm).
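
As a rough illustration of the measures described in this section, the following Python sketch computes IOIs from onset times, %V, ∆V, Varco, nPVI, speech rate, and metronome speed. It is not the Praat script used in the study. All function names are hypothetical, the nPVI follows the standard pairwise definition associated with [4], and the use of the sample standard deviation for ∆V and Varco is an assumption, since the paper does not specify which variant was used. Phrase-final units and large pauses are assumed to have been removed from the input beforehand.

```python
from statistics import mean, stdev


def inter_onset_intervals(onsets):
    """Differences between successive onset times (in seconds)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def percent_v(vocalic, consonantal):
    """%V: percentage of total duration made up of vocalic intervals."""
    return 100 * sum(vocalic) / (sum(vocalic) + sum(consonantal))


def delta(durations):
    """Delta measure: standard deviation of durations (sample SD assumed)."""
    return stdev(durations)


def varco(durations):
    """Varco: standard deviation as a percentage of the mean duration."""
    return 100 * stdev(durations) / mean(durations)


def npvi(durations):
    """nPVI: mean normalised difference between successive durations, x100."""
    pairs = zip(durations, durations[1:])
    return 100 * mean([abs(a - b) / ((a + b) / 2) for a, b in pairs])


def speech_rate(n_syllables, total_seconds):
    """Syllables per second (sps) over the retained stretch of speech."""
    return n_syllables / total_seconds


def quavers_per_minute(total_seconds, n_quaver_beats):
    """Metronome speed (qpm) derived from the average quaver duration."""
    average_quaver = total_seconds / n_quaver_beats
    return 60 / average_quaver


if __name__ == "__main__":
    # Made-up quaver-beat onsets (seconds), for illustration only.
    onsets = [0.0, 0.60, 1.25, 1.85, 2.50]
    iois = inter_onset_intervals(onsets)
    print(round(npvi(iois), 2), round(varco(iois), 2))
    print(quavers_per_minute(10.0, 16))
```

The same `npvi` and `varco` functions can be applied to vocalic durations from speech and to quaver-beat IOIs from music, which is what makes the two domains comparable in this kind of analysis.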

2.5. Results

2.5.1. Speech rate and overall tempo

Table 1: Speech rate (sps) and Metronome speed

Participant    English (sps)    Chinese (sps)    qpm
Mandarin 1     4.17             5.05             96
Mandarin 2     4.61             4.33             77
Mandarin 3     4.87             5.49             100
English 1      6.34             -                117
English 2      6.50             -                85
English 3      5.91             -                80

Analysis of speech rate of read English sentences showed a language-based grouping of results, with native Mandarin speakers having a slower speech rate than native English speakers. This is a common phenomenon of second-language speakers ([11]), although interestingly the Mandarin sentences were also read with slower speech rates.

Analysis of overall tempo showed weak language-based grouping: the Mandarin group scored an overall average of 91 qpm while the English group scored 94 qpm. Between language groups, English 2 and 3 had slower tempi than Mandarin 1 and 3.

A detailed cross-comparison of speech rate with overall tempo did not show a consistent speaker-based grouping: English 1 had the fastest tempo, while English 2 had the fastest speech rate (n.b., English 3 had the slowest speech rate and overall tempo in the English group). Interestingly, some consistency remained within the Mandarin group: Mandarin 3 had the fastest speech rate and overall tempo. Mandarin 2 had the slowest overall tempo as well as speech rate, but only in Mandarin speech. In English speech, Mandarin 1 had the slowest speech rate, which may be attributable to a low self-rated fluency in English (more in the Discussion).

2.5.2. nPVI_V and VarcoV

Table 2: nPVI_V and VarcoV of speech, with overall averages.

Participant    nPVI_V    VarcoV
Mandarin 1     61.41     50.58
Mandarin 2     43.35     42.84
Mandarin 3     35.09     37.05
M. average     46.62     43.49
English 1      44.63     34.04
English 2      53.41     47.6
English 3      48.15     44.48
E. average     48.73     42.04

Table 3: nPVI_V and VarcoV of music.

Participant    nPVI_V    VarcoV
Mandarin 1     10.43     13.25
Mandarin 2     11.76     12.57
Mandarin 3     8.24      7.96
English 1      14.21     22.07
English 2      11.51     14.02
English 3      11.22     10.39

Analysis of nPVI_V and VarcoV in English speech showed weak language-based grouping. Mandarin 1 had the highest nPVI_V of both language groups, which suggests that there may be significant variation across speakers within any one language.

Analysis of nPVI and Varco scores for IOIs showed moderate consistency of language-based grouping. The average nPVI score was 10.14 in the Mandarin group, compared to 12.31 in the English group. The average Varco score was 11.26 in the Mandarin group, compared to 15.49 in the English group. Only one participant (English 3) had a score lower than those of the Mandarin speakers (further discussion below).

Auditory impressions of the expressive performances of the six participants were made by both the author (a professional classical pianist) and co-author, prior to analyses. A further set of impressions was made by a professional classical pianist in a blind listening session. A score ranging from 1 to 5 was given to each participant, based on the amount of durational variation heard in the recording, with 1 being the least and 5 being the most. The Varco scores of individuals represented well the impressions of both the author and co-author, as well as the scores of the third listener.

Table 4: Scores of blind test and VarcoV

Participant    Blind    VarcoV
Mandarin 1     3        13.25
Mandarin 2     2        12.57
Mandarin 3     2        7.96
English 1      5        22.07
English 2      4        14.02
English 3      2        10.39
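
The group averages reported for the music data (10.14 vs 12.31 for nPVI, 11.26 vs 15.49 for Varco, and 91 vs 94 qpm for tempo) can be checked against the per-participant scores in Tables 1 and 3. The short Python sketch below does this arithmetic; the dictionary layout and the `group_mean` helper are hypothetical names introduced only for illustration.

```python
from statistics import mean

# Per-participant music scores: (nPVI, Varco, qpm), taken from Tables 1 and 3.
music_scores = {
    "Mandarin 1": (10.43, 13.25, 96),
    "Mandarin 2": (11.76, 12.57, 77),
    "Mandarin 3": (8.24, 7.96, 100),
    "English 1": (14.21, 22.07, 117),
    "English 2": (11.51, 14.02, 85),
    "English 3": (11.22, 10.39, 80),
}


def group_mean(group, column):
    """Mean of one column (0 = nPVI, 1 = Varco, 2 = qpm) for one language group."""
    values = [scores[column] for name, scores in music_scores.items()
              if name.startswith(group)]
    return round(mean(values), 2)


if __name__ == "__main__":
    for group in ("Mandarin", "English"):
        print(group, [group_mean(group, column) for column in range(3)])
```

With only three participants per group, these means are descriptive summaries and are sensitive to a single performer's score.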

2.5.3. Spectrograms

Figure 2: Mandarin 3 expressive version
Figure 3: English 1 expressive version

Spectrograms made in Sonic Visualiser allowed for visual comparisons of the durational variability between performances. Line graphs with a curved plot type show variation between successive quaver-beat durations. Curves rise as durations lengthen and fall as they shorten. Comparisons between the performances of Mandarin 3 (lowest Varco score) and English 1 (highest Varco score) show a clear contrast. Mechanical versions of each participant show the extent of durational contrast between mechanical and expressive performance. The mechanical versions of Mandarin 3 and English 1 are quite similar; this is true for all six participants.

3. Discussion
This preliminary investigation focused on two properties of production in both speech and music: timing variability and rate. Results of a comparison between speech rate and tempo were inconclusive for either language-based grouping or intra-speaker consistency. Overall tempo scores were higher for two Mandarin speakers than for two English speakers. Although English 1 had the fastest overall tempo in performance, this was not reflected in speech rate. English 2 had the fastest speech rate, but not a notably fast tempo. It is interesting to note that English 1 was British, while English 2 and English 3 were from the eastern and western United States, respectively. These results bring to attention the extent of prosodic variability between dialects of a language, including regional dialects.

The nPVI_V and VarcoV measures in speech revealed inconclusive language-based grouping. Mandarin 1 had the highest scores of both measures out of all the participants. English 1 had the second lowest nPVI_V and the lowest VarcoV score out of all the participants. This is inconsistent with the results of both the durational and impressionistic analyses of English 1’s musical performance. It has been suggested in studies of nonnative stress and rhythm production that Mandarin speakers of English would lengthen stressed syllables more, rather than shorten unstressed syllables (as would be expected for native English speakers). Also, discrepancies between listeners’ impressions of nonnative speech and acoustic measures may be a result of slower speech rate and ‘selective lengthening’ (see [11]). As for the results of English 1, subsequent analysis of consonantal durations, stress and boundaries will be conducted.

Results of nPVI and Varco measures in music are more suggestive of the possibility of language-based grouping. The overall average of nPVI music scores for Mandarin speakers was lower than that of English speakers. Varco scores were even more supportive of language-based grouping within performances. Additionally, intra-speaker consistency was suggested by both the nPVI and Varco scores. Scores were largely consistent with listeners’ impressions and spectrographic comparison. The nPVI and Varco scores of Mandarin 2 were the only discrepancy. Mandarin 2 had the highest nPVI score and the median Varco score within the Mandarin group. It is possible that the tempo or speed of performance affected the nPVI score, since Mandarin 2 had the slowest tempo (and the measure does not control for rate). This phenomenon could also explain the highest nPVI_V score (and slowest speech rate) for Mandarin 1’s English speech.

4. Conclusions
This study has conducted preliminary acoustic measurement on the speech and musical performance of classical pianists with differing native language backgrounds, with the purpose of investigating the diversity of performative aspects in both speech and music performance, and possible correlations with degree of expressivity. Results suggest that while speech rate and overall tempo are not consistent with language-based or speaker-based grouping, they may affect respective nPVI_V measures for both speech and music. This study also suggests, based on nPVI and Varco measures in music, the possibility of both intra-language and intra-speaker consistency in speech and musical performance.

Next steps will include analysis of further data gathered on different kinds of speech (neutral sentence reading, spontaneous speech), as well as different degrees of expressivity in performance (mechanical and expressive), and analysis of intensity and duration variation between stressed and unstressed syllables.

5. Acknowledgements
We would like to thank the Guildhall School of Music & Drama for providing support and resources, as well as the participants for letting us analyse their voices and performances.

6. References
[1] Patel, A.D. (2008) Music, Language, and the Brain. New York: Oxford University Press.
[2] Palmer, C., Jungers, M., and Jusczyk, P.W. (2001) Episodic memory for musical prosody. Journal of Memory and Language. 45. pp.526-545.
[3] Palmer, C. and Hutchins, S. (2006) What is musical prosody? In Ross, B.H. (ed.). Psychology of Learning and Motivation. 46 (1). Amsterdam, The Netherlands: Elsevier Press. pp.245-278.
[4] Grabe, E. and Low, E.L. (2002) Durational variability in speech and the rhythm class hypothesis. In Gussenhoven, C. and Warner, N. (eds.). Laboratory Phonology 7. Berlin: Mouton de Gruyter. pp.515–546.
[5] Patel, A.D. and Daniele, J.R. (2003) An empirical comparison of rhythm in language and music. Cognition. 87. pp.B35-B45.
[6] Huron, D. and Ollen, J. (2003) Agogic contrast in French and English themes: Further support for Patel and Daniele (2003). Music Perception. 21. pp.267–271.
[7] Ohgushi, K. (2002) Comparison of Dotted Rhythm Expression between Japanese and Western Pianists. 7th International Conference on Music Perception and Cognition, Sydney.
[8] Sadakata, M., Ohgushi, K., and Desain, P. (2004) A cross-cultural comparison study of the production of simple rhythmic patterns. Psychology of Music. 32. pp.389–403.
[9] Slobodian, L.N. (2008) Perception and production of linguistic and musical rhythm by Korean and English middle school students. Empirical Musicology Review. 3 (4).
[10] Dauer, R.M. (1983) Stress-timing and syllable-timing reanalyzed. Journal of Phonetics. 11. pp.51-62.
[11] Mok, P. and Dellwo, V. (2008) Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. In Speech Prosody 2008, Campinas, Brazil. pp.423–426.
[12] Chao, Y.R. (1968) A grammar of spoken Chinese. Berkeley and Los Angeles: University of California Press.
[13] Chen, Y. and Xu, Y. (2006) Production of Weak Elements in Speech – Evidence from F₀ Patterns of Neutral Tone in Standard Chinese. Phonetica. 63. pp.47–75.
[14] Zhang, Y.H., Nissen, S.L., and Francis, A.L. (2008) Acoustic characteristics of English lexical stress produced by native Mandarin speakers. The Journal of the Acoustical Society of America. 123. pp.4498-4513.
[15] Fon, J. (2002) A Cross-Linguistic Study on Syntactic and Discourse Boundary Cues in Spontaneous Speech. Columbus, OH: The Ohio State University dissertation.
[16] Rose, C. (n.d.) A Serious Case. [Online] Learn English | British Council. Available from: https://learnenglish.britishcouncil.org/en/stories-poems/seriouscase [Accessed: 14 June 2017].
[17] Learn Mandarin Chinese. (2015) Elementary Level Chinese Readings: Story 窗外 (Outside of the Window). [Online] Available at: http://tcfl.tingroom.com/2015/01/6421.html. [Accessed: 14 June 2017].
[18] Gabrielsson, A. (1987) Action and Perception in Rhythm and Music. Papers Given at a Symposium in the Third International Conference on Event Perception and Action. Royal Swedish Academy of Music.
[19] Seashore, C.E. (1936) New Vantage Grounds in the Psychology of Music. Science. 84. pp.517–522.
[20] Cannam, C., Landone, C., and Sandler, M. (2010) Sonic Visualiser: An Open Source Application for Viewing, Analysing, and Annotating Music Audio Files. Proceedings of the ACM Multimedia 2010 International Conference.
[21] Repp, B.H. (1988) Patterns of expressive timing in performances of a Beethoven Minuet by nineteen famous pianists and one computer. The Journal of the Acoustical Society of America. 83. S120–S120.