

Spectral- and Cepstral-Based Measures During Continuous Speech: Capacity to Distinguish Dysphonia and Consistency Within a Speaker
*Soren Y. Lowell, *Raymond H. Colton, †Richard T. Kelley, and *Youngmee C. Hahn, *†Syracuse, New York

Summary: Spectral- and cepstral-based acoustic measures are preferable to time-based measures for accurately representing dysphonic voices during continuous speech. Although these measures show promising relationships to perceptual voice quality ratings, less is known regarding their ability to differentiate normal from dysphonic voice during continuous speech and the consistency of these measures across multiple utterances by the same speaker. The purpose of this study was to determine whether spectral moments of the long-term average spectrum (LTAS) (spectral mean, standard deviation, skewness, and kurtosis) and cepstral peak prominence measures were significantly different for speakers with and without voice disorders when assessed during continuous speech. The consistency of these measures within a speaker across utterances was also addressed. Continuous speech samples from 27 subjects without voice disorders and 27 subjects with mixed voice disorders were acoustically analyzed. In addition, voice samples were perceptually rated for overall severity. Acoustic analyses were performed on three continuous speech stimuli from a reading passage: two full sentences and one constituent phrase. Significant between-group differences were found for both cepstral measures and three LTAS measures (P < 0.001): spectral mean, skewness, and kurtosis. These five measures also showed moderate to strong correlations to overall voice severity. Furthermore, high degrees of within-speaker consistency (correlation coefficients ≥ 0.89) across utterances with varying length and phonemic content were evidenced for both subject groups.
Key Words: Voice–Dysphonia–Acoustic–Cepstral–Spectral–Spectral moments–Long-term average spectrum.

Accepted for publication June 18, 2010.
From the *Department of Communication Sciences and Disorders, Syracuse University, Syracuse, New York; and the †Department of Otolaryngology and Communication Sciences, Upstate Medical University, Syracuse, New York.
Address correspondence and reprint requests to Soren Y. Lowell, Department of Communication Sciences and Disorders, Syracuse University, 805 S. Crouse Avenue, Hoople Building, Syracuse, NY 13210. E-mail: [email protected]
Journal of Voice, Vol. -, No. -, pp. 1-10
0892-1997/$36.00
© 2010 The Voice Foundation
doi:10.1016/j.jvoice.2010.06.007

INTRODUCTION

Diagnosis and assessment of treatment outcomes in voice disorders are dependent on accurate and reliable measures of voice. Comprehensive voice evaluation includes auditory-perceptual measures and more objective, quantifiable measures such as those obtained through acoustic analyses. Although auditory-perceptual characteristics will always be a necessary component in defining a voice disorder, they cannot provide a complete picture of the underlying physiological deficits and the reliability of judgments can be less than optimal.1–4 Acoustic measures can provide increased objectivity, and certain parameters may reflect physiological components related to vocal fold vibratory behavior. Current practices are moving away from time-based measures to more spectral-based measures because of accuracy issues when applying these measures to dysphonic speakers. Many time-based acoustic measures, such as jitter and shimmer, rely on exact demarcation of the cycle-to-cycle boundaries in the acoustic waveform. Although this boundary detection can be reliable and valid for voices that are reasonably periodic, reliability breaks down for less periodic voices that characterize dysphonic individuals.5–7 Spectral-based measures such as those derived from the cepstrum of the acoustic signal and the long-term average spectrum (LTAS) provide viable alternatives to time-based measures and can be more accurately applied to dysphonic voices as these measures do not require cycle boundary detection. The cepstrum is derived by applying a Fourier transformation to the spectrum. Cepstral peak prominence (CPP) provides a measure of the difference between the periodic fundamental frequency energy and the average energy of the signal derived from linear regression.8,9 By averaging constituent cepstra within defined frames, a smoothed CPP (CPPS) measure can be produced,10 providing a less noisy representation of the relative periodicity in the acoustic signal. CPP and related measures show high correlation to breathiness8–12 and additional discriminative potential for hoarseness13 but less of a relationship to roughness.12–14 Additionally, CPP-related measures show strong correlations to dysphonia severity.12–15

Whereas cepstral measures highlight relative strength of harmonic energy in the acoustic signal, the LTAS provides information on the distribution of energy throughout the frequency range. The LTAS is derived from the mean of many constituent spectra for a longer speaking sample. Measures such as spectral slope, spectral tilt, or skewness can show relative distribution of high- versus low-frequency energy in the acoustic signal and degree of spread of high-frequency energy from the central tendency of the spectrum. In normal speakers, the frequency distribution of energy derived from the LTAS can differentiate good versus poor voice quality.16 Spectral tilt has been related to breathiness by some researchers,17 whereas other researchers have found minimal relationships of spectral tilt to breathiness in sustained vowels.8 The distribution and shape of the LTAS can be characterized by spectral moments, with the first four moments representing the central tendency (spectral mean), deviation from the central tendency (spectral standard deviation [SD]), tilt of the spectral distribution (skewness), and peakedness of the distribution (kurtosis). Spectral mean and spectral

SD can reflect changes in voice associated with treatment.18 Spectral moments can differentiate speakers with neurological voice disorders from normal speakers,19 but less is known about the potential for multiple spectral moments to differentiate laryngeal-based disorders from normal speakers.

Another advantage of cepstral and LTAS measures is that they can be applied to continuous speech, which may provide a more representative sample of dysphonic speaking patterns than a sustained vowel. Correlations of CPP to dysphonia severity during continuous speech are equal to or higher20 than those derived from sustained vowels only.15 Halberstam21 found stronger correlations of CPP and CPPS to hoarseness during continuous speech compared with vowels. CPPS can also show a greater relationship to the dimension of roughness during continuous speech than during sustained vowels.12 In addition to cepstral measures, the prediction of breathiness was greater for the LTAS-derived measure of spectral tilt when assessed during continuous speech rather than during a vowel.10

Issues related to the consistency of acoustic measures across speaking contexts and optimal methods for deriving acoustic measures are important to their clinical application. The consistency of both cepstral- and spectral-based measures within a speaker across continuous speaking contexts for normal and dysphonic speakers has been minimally addressed to date. Comparisons of sustained vowels to connected speech within a speaker show that correlations of certain acoustic measures to dimensions of dysphonia can vary greatly.10,21 Whether the variation of phonemic content from one continuous speech sample to another can produce differences in these acoustic measures is unknown. Furthermore, it is important to determine the length of a continuous speech segment that may be needed to allow phonemic variations to stabilize over time for combined cepstral and spectral measures.

Methods for extracting both cepstral- and spectral-based measures also vary across studies. If methodological differences influence overall means and patterns of group differences, these differences will need to be considered when interpreting and applying the results of multiple studies to the voice-disordered population. Hillenbrand and colleagues8,10 designed a computer analysis program that could accept either sustained vowels or continuous speech as input. These and several other researchers have used the full continuous speech signal as input when deriving cepstral- or LTAS-based measures, without extracting pauses and unvoiced segments from the signal.8,10,18,20,21 Other researchers addressing LTAS measures have removed unvoiced segments from the continuous speech sample before computing acoustic measures.16,22,23 In regard to the CSL 4500 (KayPENTAX, Lincoln Park, NJ) automated computer algorithms for deriving LTAS measures, it is recommended that unvoiced portions of the signal be removed as their inclusion in the analysis may substantially impact LTAS computations (S. Crump, KayPENTAX, personal communication, September 23, 2009). However, whether the inclusion of these unvoiced portions in the continuous speech signal substantially affects either cepstral or LTAS measures has not been previously addressed.

The purpose of this study was to determine whether cepstral- and LTAS-derived acoustic measures could differentiate a group of mixed, laryngeal-based dysphonic speakers from a group of normal speakers during continuous speech and whether these acoustic measures were correlated to auditory-perceptual voice quality ratings. It was hypothesized that significant differences between groups would be evidenced for the cepstral and spectral measures based on previous studies that showed significant differentiation of speaker groups.15,19,22 An additional study purpose was to determine the relationships of cepstral and LTAS measures between utterances produced by the same speaker; comparisons of one sentence to a second sentence and one sentence to a constituent phrase of that sentence were made to examine consistency. Finally, a methodological question addressed whether the retention or removal of unvoiced portions of the speech signal affected cepstral or LTAS parameters.

MATERIALS AND METHODS

Speaker samples
Twenty-seven dysphonic voice samples and 27 normal voice samples were selected from a large published database recorded by the Massachusetts Eye and Ear Infirmary (MEEI) and distributed in digitized audio files on CD-ROM by KayPENTAX.24 This database contains ~700 voice samples representing both normal speakers and speakers with a wide range of voice disorders. For this study, criteria for inclusion of samples in both groups were as follows: (a) English as the primary native language, (b) nonsmoking status, and (c) sample included first and second sentences from the Rainbow passage. Dysphonic samples were included if they showed (a) presence of laryngeal-based voice disorder (neurological diagnoses such as spasmodic dysphonia, multiple sclerosis, myasthenia gravis, and Parkinson disease were excluded) and (b) dysphonia versus aphonia (consistently aphonic samples were excluded because of lack of a distinguishable waveform). A range of voice disorder diagnoses were sought so that breathy, strained, and rough voice qualities would be represented in the dysphonic group.

Age range for the dysphonic speaker samples was 19–86 years, with a mean age of 41 years. Age range for the normal speaker samples was 26–55 years, with a mean age of 39 years. There were 14 women and 13 men in the dysphonic speaker group and 11 women and 16 men in the normal speaker group. Primary disorders in the dysphonic speaker group included mass lesions of the vocal folds (8), paresis/paralysis (13), keratosis/leukoplakia (3), vocal fold edema (1), presbyphonia (1), and laryngeal web (1).

Standardized recording procedures and instruments were used for all voice samples from the database recorded at MEEI. Voice samples were recorded in a sound-treated booth, with a speaker-to-microphone distance of 15 cm, using a condenser microphone and digital recording device. Recordings for the Rainbow passage were digitized at 25 kHz for both the dysphonic and normal voice samples. Voice samples were edited to produce three comparison stimuli: first sentence of the Rainbow passage (17 words), second sentence of the

Rainbow passage (12 words), and a constituent phrase within the second sentence (first six words).

Acoustic analyses
The CSL 4500 (KayPENTAX) system was used for analysis of the LTAS. The CSL program does not distinguish voiced and unvoiced portions of the signal and includes both in its computations of the LTAS. Measures derived from the LTAS using these algorithms are, therefore, likely to be affected by unvoiced portions of the signal. Thus, before conducting the cepstral and spectral analyses of the acoustic signal, pauses and unvoiced segments were edited out of each sample to produce a concatenated signal representing voiced portions only of the continuous speech samples. This methodology is in line with several previous studies addressing LTAS and cepstral measures.16,22,23

Piloting of automated voice detection algorithms to determine voiced segments with the CSL 4500 (KayPENTAX) system revealed inconsistent identification of voiced segments. Therefore, sentences and phrases were manually edited using the acoustic waveform as the primary determinant of voiced segments and the auditory signal as a secondary determinant. Manual procedures have been reliably implemented by other researchers to detect the presence of unvoiced segments in dysphonic voice samples that may be less amenable to automated voicing detection.25 Criteria for identification of voiced segments were implemented to assure consistent editing procedures across all voice samples. The presence of three or more periodic or quasiperiodic waveforms after the start point or before the end point was required for the start and end of each sentence/phrase and to define the beginning and end of each edited segment. Edits were done where the waveform crossed the zero-amplitude line to minimize any artifacts. Segments that were judged as fully aperiodic, including silent pauses, inspiratory pauses, and noise associated with unvoiced consonants, were deleted. Identical criteria were used for editing the normal and dysphonic voice samples, and samples from each group were alternated to ensure consistency. After initial development of criteria for editing, the same researcher (Y.C.H.) edited all samples.

Intrarater reliability for the manual editing process was assessed with Pearson r correlation coefficients for ~20% of the dysphonic speaker samples (six subjects) and 20% of the normal speaker samples (six subjects). To determine whether reliability of manual editing differed between the dysphonic and normal speaker groups, separate correlation coefficients were calculated for each group, with coefficients assessed for each of the acoustic variables across two speaking contexts (sentences 1 and 2). For the dysphonic group, Pearson r values ranged from 0.96 to 0.99, indicating that the intrarater reliability was excellent for the manual editing procedures. For the normal speaker group, Pearson r values ranged from 0.97 to 0.99, indicating similar high levels of intrarater reliability for this group.

Spectral and cepstral acoustic measures were derived from the edited, continuous speech samples. All measures were computed for each of the three speaking tasks. Using the KayPENTAX CSL 4500 system, an LTAS was computed by dividing the speech sample into 1024-point frames (Blackman window) for Fourier transform computations, producing a window length of ~40 milliseconds. Four moments of the LTAS were derived: spectral mean, spectral SD, skewness, and kurtosis. Cepstral analyses for this study were implemented in the SpeechTool program (J. Hillenbrand, Western Michigan University, Kalamazoo, MI). CPP and CPPS were computed according to the procedures outlined by these and other authors.8–10 The procedural approach of Hillenbrand and colleagues uses linear regression analysis to determine average amplitude of energy in the cepstrum, which is then compared with the cepstral peak to derive CPP, and an additional smoothing algorithm produces the CPPS measure by averaging the cepstrum data over multiple frames. The use of smoothing can, therefore, reduce artifacts, and cepstral analysis programs that use linear regression may be more accurate at determining average energy in the cepstrum than methods that do not use linear regression.26

To address a secondary question of the effects of unvoiced segments on cepstral and LTAS measures, a comparison sample of unedited versions of sentences 1 and 2 was also subject to the above-noted cepstral and LTAS measures. These speech samples contained pauses and unvoiced segments, with editing only to isolate the first and second sentences of the Rainbow reading passage. Edits were again done at crosspoints of the waveform with the zero-amplitude line to minimize any artifacts.

Auditory-perceptual severity ratings
Three judges with extensive experience in voice disorders conducted ratings of overall dysphonia severity. Before performing the ratings, a 1.5-hour training session was conducted using a different group of voice samples. Ratings were discussed, and consensus was reached regarding typical examples of mild, moderate, and severe as rated on the Consensus Auditory-Perceptual Evaluation of Voice (CAPE-V) visual analog scale (VAS).27 Anchor examples were established through this consensus discussion from a set of training stimuli that did not include any of the speaker samples for this study. Subsequent ratings of the study samples were performed independently by each rater using a 100-mm VAS with definition of overall severity as provided under the CAPE-V. Ratings were performed in one session, with the use of headphones and files presented at a comfortable volume for the rater. Voice quality dimensions were not rated individually, but the presence of three voice quality features as defined on the CAPE-V (roughness, breathiness, and strain) was noted for each sample by each judge to verify that the samples represented a range of voice quality dimensions. The anchor examples that were established in the training session were presented at the start of the rating session and every 10 samples thereafter to maximize consistency of rating standards. Raters were able to relisten to the anchors at any additional times during their ratings and could also repeat the sample being rated as often as desired.

To determine intrarater reliability of auditory-perceptual ratings, 20% of the dysphonic speaker samples (six subjects) and 20% of the normal speaker samples (six subjects) were

randomly presented twice in the .wav files that were rated by each researcher. Spearman r correlation coefficients were used to determine intrarater reliability because of the nonnormal distribution of the CAPE-V data. Correlation coefficients ranged from 0.88 to 0.97, indicating acceptable levels of intrarater reliability. Interrater reliability between the three judges was also assessed with intraclass correlation coefficients (ICCs) for all ratings. The ICC for average measures was 0.92, whereas the ICC for single measures was 0.79, indicating acceptable levels of interrater reliability.

Statistics
Repeated-measures analyses of variance were implemented to test for group differences on four of the acoustic measures: spectral mean, spectral SD, CPP, and CPPS, which were all normally distributed across tasks. The acoustic measures of skewness and kurtosis were not normally distributed across all tasks, and group and task differences were, therefore, assessed with the Mann-Whitney U test and the Wilcoxon test. Significance level was defined as <0.008 (0.05/6) to correct for multiple comparisons. When overall group differences occurred, follow-up comparisons were made to determine which tasks were significantly different.

To determine within-speaker consistency, correlation coefficients were used to assess relationships of sentence 1 to sentence 2 and sentence 2 to the phrase that comprised half of sentence 2. Pearson r correlation coefficients determined the relationships for spectral mean, spectral SD, CPP, and CPPS, whereas Spearman r coefficients determined the relationships for skewness and kurtosis.

RESULTS

Differentiation of dysphonic speakers from normal speakers
Three of the four spectral measures showed significant differences between the dysphonic and normal speaker groups. The means and SDs for all the acoustic measures across speaking tasks are presented in Table 1. A representative graph of the LTAS for each speaker group is presented in Figure 1. As seen in the dysphonic speaker graph, acoustic energy shows a sharp peak in the low frequencies and then a steep drop-off to the low energy levels that are represented out to 10 kHz in a long high-frequency tail. In contrast, the graph for a normal speaker is less peaked, with high to moderate energy extending through the mid-frequencies but dropping to the zero line at 5 kHz. These distribution differences likely contribute to the significantly lower central tendency (spectral mean) in the dysphonic group, F(1,52) = 45.2, P < 0.001. The low-amplitude, high-frequency energy represented in the dysphonic LTAS up to 10 kHz compared with the normal speaker group in which energy drops off at ~5 kHz may explain the significantly greater skewness for the dysphonic group, with Mann-Whitney U ranging from 16.5 to 45.0, P < 0.001, across tasks. The sharp peak of the dysphonic LTAS graph in the low frequencies, with a steep subsequent drop-off in energy thereafter, likely produces the significantly higher kurtosis values for the dysphonic group, with Mann-Whitney U ranging from 29.0 to 62.0, P < 0.001, across tasks. Although spectral SD was also lower in the dysphonic group, it did not meet the corrected alpha level, with F(1,52) = 7.06, P = 0.01.

A representative graph of the smoothed cepstrum for each speaker group is shown in Figure 2. As can be seen in these graphs, the difference between the cepstral peak and the average energy level of the cepstrum in the dysphonic group is smaller than that in the normal speaker group. These differences in relative energy of the fundamental frequency resulted in significantly lower CPP and CPPS values for the dysphonic group (see Table 1 for means and SDs), F(1,52) = 118.7, P < 0.001 for CPP; F(1,52) = 67.6, P < 0.001 for CPPS.

Auditory-perceptual ratings of voice and correlation to acoustic measures
Mean CAPE-V VAS ratings for overall severity were calculated across the three judges to determine the auditory-perceptual characteristics of the speaker group. The group mean for the normal speakers was 1.67 mm (SD = 2.12), whereas the group mean for the dysphonic speakers was 36.94 mm (SD = 15.24). The Mann-Whitney U test indicated that the overall severity of voice quality was significantly worse (higher rating) for the

TABLE 1.
Normal and Dysphonic Speaker Groups' Means and SDs for Spectral Mean, Spectral SD, Skewness, Kurtosis, CPP, and CPPS, Derived From Edited Samples Containing Voiced Segments Only
Values are mean (SD).

Task/Group            Spectral Mean     Spectral SD       Skewness       Kurtosis          CPP            CPPS
Sentence 1
  Normal speaker      495.66 (102.51)   620.47 (158.96)   5.01 (1.85)    44.99 (39.46)     18.19 (1.18)   8.58 (0.52)
  Dysphonic speaker   314.98 (67.93)    483.49 (169.66)   12.88 (4.75)   267.34 (261.47)   13.29 (1.50)   6.14 (1.12)
Sentence 2
  Normal speaker      396.47 (67.24)    507.62 (121.56)   6.62 (2.38)    73.70 (68.50)     16.69 (1.35)   7.81 (0.77)
  Dysphonic speaker   298.56 (55.35)    408.10 (160.99)   14.64 (4.28)   333.39 (219.44)   12.99 (1.33)   5.91 (1.09)
Phrase
  Normal speaker      392.72 (74.34)    472.62 (141.00)   7.25 (3.01)    96.78 (112.29)    16.82 (1.51)   7.90 (0.75)
  Dysphonic speaker   302.44 (55.25)    390.63 (173.40)   15.03 (4.77)   368.42 (258.65)   13.44 (1.55)   6.30 (1.17)
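
The corrected significance level described under Statistics (0.05 divided by the six acoustic measures) can be checked with a few lines of Python. This is only an illustrative sketch: the P values are the between-group results quoted in the text, with each "P < 0.001" entered as its upper bound of 0.001, and all variable names are hypothetical.

```python
# Bonferroni-style correction as described in the Statistics section.
ALPHA = 0.05
N_MEASURES = 6
corrected_alpha = ALPHA / N_MEASURES  # 0.00833..., reported in the text as 0.008

# Between-group P values quoted in the text; "<0.001" is entered as the
# bound 0.001 (an assumption made only so the comparison can be run).
reported_p = {
    "spectral mean": 0.001,
    "spectral SD": 0.01,
    "skewness": 0.001,
    "kurtosis": 0.001,
    "CPP": 0.001,
    "CPPS": 0.001,
}

# A measure counts as significant only if its P value is below the corrected level.
significant = {name: p < corrected_alpha for name, p in reported_p.items()}

for name, flag in significant.items():
    print(f"{name:14s} significant at {corrected_alpha:.4f}: {flag}")
```

Under these inputs only spectral SD (P = 0.01) fails the corrected threshold, which matches the five-of-six pattern reported for the edited samples.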

FIGURE 1. Representative graphs of the LTAS for a normal and dysphonic speaker during sentence 1 of the Rainbow passage.
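
The LTAS and its four spectral moments, as described under Acoustic analyses, can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the CSL 4500 algorithm: non-overlapping 1024-point Blackman-windowed frames, a power spectrum averaged across frames, and moments computed by treating the normalized LTAS as a distribution over frequency. The function names `ltas` and `spectral_moments`, the use of excess kurtosis, and the synthetic test signals are all hypothetical choices.

```python
import numpy as np

def ltas(signal, fs, frame_len=1024):
    """Average the power spectra of consecutive Blackman-windowed frames."""
    window = np.blackman(frame_len)
    n_frames = len(signal) // frame_len  # non-overlapping frames (assumption)
    frames = np.reshape(signal[: n_frames * frame_len], (n_frames, frame_len))
    power = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    return freqs, power.mean(axis=0)

def spectral_moments(freqs, power):
    """Mean, SD, skewness, and kurtosis of the normalized spectrum."""
    p = power / power.sum()
    mean = np.sum(freqs * p)
    sd = np.sqrt(np.sum((freqs - mean) ** 2 * p))
    skewness = np.sum((freqs - mean) ** 3 * p) / sd ** 3
    # Excess kurtosis (normal distribution = 0); the CSL convention is not
    # stated in the text, so this choice is an assumption.
    kurtosis = np.sum((freqs - mean) ** 4 * p) / sd ** 4 - 3.0
    return mean, sd, skewness, kurtosis

fs = 25_000                      # sampling rate reported for the recordings
frame_ms = 1000 * 1024 / fs      # frame duration in milliseconds
t = np.arange(fs) / fs           # one second of synthetic signal

# Synthetic signals only: a single 1 kHz tone, and a strong low-frequency
# tone with a weak high-frequency component.
tone = np.sin(2 * np.pi * 1000 * t)
tailed = np.sin(2 * np.pi * 500 * t) + 0.1 * np.sin(2 * np.pi * 5000 * t)

tone_moments = spectral_moments(*ltas(tone, fs))
tailed_moments = spectral_moments(*ltas(tailed, fs))
print(f"frame length: {frame_ms:.2f} ms")
print("tone   (mean, SD, skew, kurt):", np.round(tone_moments, 2))
print("tailed (mean, SD, skew, kurt):", np.round(tailed_moments, 2))
```

At 25 kHz a 1024-point frame lasts about 41 ms, consistent with the window length of roughly 40 milliseconds given in the text. The second synthetic signal mimics a low-frequency peak with a weak high-frequency tail, which yields positive skewness, the direction reported for the dysphonic group.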

dysphonic speakers, P < 0.001. To determine the representation of several voice quality dimensions, the total numbers of samples that were rated with presence of roughness, breathiness, and strain by at least two of the three judges were assessed. Roughness was most frequently represented (25 of 27 samples) followed by breathiness (19 of 27 samples) and then by strain (11 of 27 samples).

To determine the relationships between the six acoustic measures and overall voice severity, correlations were assessed between these measures for the dysphonic speakers during sentence 1. Spearman r correlation coefficients were computed because of the non-normal distribution of CAPE-V ratings. Among the LTAS measures, moderate or greater correlations were evidenced for spectral mean (r = −0.64, P < 0.001), skewness (r = 0.71, P < 0.001), and kurtosis (r = 0.67, P < 0.001); spectral mean was negatively correlated to voice severity ratings, and skewness and kurtosis were positively correlated to voice severity ratings. Spectral SD was minimally

FIGURE 2. Representative graphs of the smoothed cepstrum for a normal and dysphonic speaker during sentence 1 of the Rainbow passage.
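
The CPP idea described under Acoustic analyses (a cepstral peak compared against a linear regression line through the cepstrum, with smoothing across frames for CPPS) can be sketched as follows. This is an illustrative approximation, not the SpeechTool implementation: the cepstrum is taken here as the inverse Fourier transform of the log-magnitude spectrum, the regression is fit from 1 ms upward, the peak is searched in a 60 to 300 Hz range, and smoothing is a simple moving average over five frames. All of these parameters, and the function names, are assumptions.

```python
import numpy as np

def cepstrum_db(frame, eps=1e-12):
    """Real cepstrum magnitude in dB (first half of the quefrency axis)."""
    windowed = frame * np.hamming(len(frame))
    log_mag = 20.0 * np.log10(np.abs(np.fft.rfft(windowed)) + eps)
    cep = np.fft.irfft(log_mag, n=len(frame))
    return 20.0 * np.log10(np.abs(cep[: len(frame) // 2]) + eps)

def peak_prominence(cep_db, fs, f0_min=60.0, f0_max=300.0, fit_from_s=0.001):
    """Height of the cepstral peak above a regression line through the cepstrum."""
    quef = np.arange(len(cep_db)) / fs
    fit = quef >= fit_from_s                     # skip the lowest quefrencies
    slope, intercept = np.polyfit(quef[fit], cep_db[fit], 1)
    search = np.flatnonzero((quef >= 1.0 / f0_max) & (quef <= 1.0 / f0_min))
    peak = search[np.argmax(cep_db[search])]
    return cep_db[peak] - (slope * quef[peak] + intercept)

def cpp_and_cpps(signal, fs, frame_len=1024, hop=256, smooth_frames=5):
    """Mean CPP over frames, and mean CPP after averaging adjacent cepstra."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    ceps = np.array([cepstrum_db(signal[s:s + frame_len]) for s in starts])
    cpp = np.mean([peak_prominence(c, fs) for c in ceps])
    # Moving average of the cepstra over `smooth_frames` adjacent frames.
    csum = np.cumsum(np.vstack([np.zeros(ceps.shape[1]), ceps]), axis=0)
    smoothed = (csum[smooth_frames:] - csum[:-smooth_frames]) / smooth_frames
    cpps = np.mean([peak_prominence(c, fs) for c in smoothed])
    return cpp, cpps

# Synthetic signals only: a 250 Hz pulse train with a little noise stands in
# for a periodic voice; white noise stands in for a fully aperiodic signal.
fs = 25_000
n = fs // 2
rng = np.random.default_rng(0)
pulses = np.zeros(n)
pulses[::100] = 1.0
voiced_like = pulses + 0.01 * rng.standard_normal(n)
noise = rng.standard_normal(n)

voiced_cpp, voiced_cpps = cpp_and_cpps(voiced_like, fs)
noise_cpp, noise_cpps = cpp_and_cpps(noise, fs)
print(f"periodic: CPP={voiced_cpp:.1f} dB, CPPS={voiced_cpps:.1f} dB")
print(f"noise:    CPP={noise_cpp:.1f} dB, CPPS={noise_cpps:.1f} dB")
```

The absolute values from this sketch are not comparable to those in Table 1, because scaling and smoothing conventions differ between implementations. Only the direction carries over: the more periodic signal produces the larger peak prominence.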

correlated to voice severity (r = 0.26, P < 0.056). Moderate to high negative correlations to voice severity were evidenced for CPP (r = −0.78, P < 0.001) and CPPS (r = −0.72, P < 0.001).

Similarities and differences between tasks for dysphonic and normal speakers
To address whether mean values for acoustic measures were consistent across tasks, within-group comparisons were examined. Significant within-group differences were evidenced for spectral mean, F(1.6,84) = 57.5, P < 0.001 (Greenhouse-Geisser), with significant group-by-task interaction effects, F(1.6,84) = 32.5, P < 0.001 (Greenhouse-Geisser). Follow-up contrasts with t tests showed significant task differences between sentences 1 and 2 for the normal speaker group only, F(1,52) = 36.2, P < 0.001, whereas comparisons between sentence 2 and the constituent phrase were not significantly different, F(1,52) = 0.7, P = 0.394. Significant within-group differences were also evidenced for spectral SD, F(1.6,82) = 56.1, P < 0.001 (Greenhouse-Geisser), with no significant interaction effects, F(1.6,82) = 2.8, P = 0.081. Follow-up contrasts with t tests showed significant task differences between sentences 1 and 2, F(1,52) = 75.3, P < 0.001, but not between sentence 2 and the constituent phrase (greater than the corrected significance level, F(1,52) = 7.2, P = 0.010). For skewness, significant within-group differences were evidenced for comparisons of sentence 1 to sentence 2 in both groups (P < 0.001) but not for comparisons of sentence 2 to the constituent phrase (P ≥ 0.028). For kurtosis, significant within-group differences were evidenced for comparisons of sentence 1 to sentence 2 in both groups (P < 0.001) but not for comparisons of sentence 2 to the constituent phrase (P ≥ 0.021).

Significant within-group differences were also evidenced for cepstral measures. For the CPP measure, significant main effects for task, F(1.5,80) = 44.3, P < 0.001 (Greenhouse-Geisser), and significant group-by-task interaction effects, F(1.5,80) = 33.6, P < 0.001 (Greenhouse-Geisser), occurred. Follow-up contrasts with t tests showed significant task differences between sentences 1 and 2 for the normal speaker group only, F(1,52) = 29.9, P < 0.001, whereas comparisons between sentence 2 and the constituent phrase were not significantly different (greater than the corrected level of significance, F(1,52) = 5.9, P = 0.018).

For the CPPS measure, significant main effects for task, F(1.5,75) = 23.5, P < 0.001 (Greenhouse-Geisser), and significant group-by-task interaction effects, F(1.5,75) = 17.2, P < 0.001 (Greenhouse-Geisser), occurred. However, follow-up contrasts with t tests did not meet the corrected levels of significance for comparisons between sentences 1 and 2, F(1,52) = 10.1, P = 0.002, or between sentence 2 and the constituent phrase, F(1,52) = 10.4, P = 0.002.

Relationships between tasks for all speakers
To determine consistency of LTAS and cepstral measures across speaking contexts with varying phonetic content, correlations were determined for all speakers between sentences 1 and 2. Consistency was high between these two utterances, with correlation coefficients ranging from 0.889 to 0.973 (Table 2). To determine the consistency of LTAS and cepstral measures across a longer versus shorter constituent utterance, correlations were determined for all speakers between sentence 2 and the constituent phrase. Consistency was also high between these two utterances, with correlation coefficients ranging from 0.898 to 0.962 (Table 2). Example graphs depicting the correlations for spectral mean in the two task comparisons (Figure 3) and for CPP in the two task comparisons (Figure 4) demonstrate that consistent relationships were evidenced for both the dysphonic and normal speakers.

Comparison of unedited speech samples containing unvoiced segments
An additional exploratory analysis was conducted to address the methodological issue of whether the removal of unvoiced portions of the acoustic signal affected cepstral or LTAS measures. For this analysis, the same six acoustic measures were derived from the unedited versions of sentences 1 and 2 that contained both voiced and unvoiced segments. Means and SDs for each group are presented in Table 3. Comparison to Table 1 shows that overall mean values for all six measures were often substantially different for the unedited analyses compared with the edited analyses. Group differences were minimized in the spectral measures derived from the unedited samples, and patterns of differences were sometimes reversed. Spectral mean (P = 0.361), skewness (P ≥ 0.009), and kurtosis (P ≥ 0.079) were not significantly different between groups in the unedited samples. For spectral SD, the pattern reversed so that dysphonic speakers had significantly higher spectral SD values than normal speakers (P < 0.001). Although the overall means for CPP and CPPS were different when comparing values from the analysis of samples with and without unvoiced segments,

TABLE 2.
Consistency Across Tasks for All Speakers Represented Through Correlations Between Sentences 1 and 2 and Between Sentence 2 and the Constituent Phrase
Values are correlation coefficient (P value).

Task comparison              Spectral Mean    Spectral SD      Skewness         Kurtosis         CPP              CPPS
Sentences 1 and 2            0.896 (<0.001)   0.889 (<0.001)   0.969 (<0.001)   0.973 (<0.001)   0.943 (<0.001)   0.891 (<0.001)
Sentence 2 and the phrase    0.915 (<0.001)   0.898 (<0.001)   0.959 (<0.001)   0.952 (<0.001)   0.976 (<0.001)   0.962 (<0.001)

Pearson r correlation coefficients (P value) for spectral mean, spectral SD, CPP, and CPPS and Spearman r correlation coefficients (P value) for skewness and kurtosis.
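
The two coefficient types named in the Table 2 note can be sketched in plain Python: Pearson r on the raw values, and Spearman r as Pearson r computed on ranks, with tied values sharing their average rank. The per-speaker numbers below are hypothetical values made up for illustration and are not data from the study; the function names are likewise illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman rank correlation: Pearson r of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical per-speaker values for one measure in two utterances
# (illustration only, NOT data from the study).
sentence1_cpp = [18.4, 17.9, 13.1, 12.5, 16.8, 14.2]
sentence2_cpp = [17.0, 16.5, 12.8, 12.1, 15.9, 13.6]
sentence1_skew = [4.8, 5.3, 12.9, 15.1, 6.0, 11.2]
sentence2_skew = [6.1, 6.9, 14.0, 17.5, 7.4, 13.8]

r_cpp = pearson_r(sentence1_cpp, sentence2_cpp)
rho_skew = spearman_r(sentence1_skew, sentence2_skew)
print(f"Pearson r (CPP, sentence 1 vs 2):       {r_cpp:.3f}")
print(f"Spearman r (skewness, sentence 1 vs 2): {rho_skew:.3f}")
```

Using the rank-based coefficient for skewness and kurtosis follows the reasoning given in the Statistics section, where those two measures were not normally distributed.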

FIGURE 3. Relationships for spectral mean between sentences 1 and 2 and between sentence 2 and the constituent phrase in a normal and dysphonic speaker.

FIGURE 4. Relationships for CPP between sentences 1 and 2 and between sentence 2 and the constituent phrase in a normal and dysphonic speaker.

the pattern between groups remained the same. Dysphonic speakers showed significantly lower CPP (P < 0.001) and CPPS (P < 0.001) values than normal speakers in the analysis of unedited versions, as they had for the edited samples (unvoiced segments removed).

DISCUSSION

Measures derived from the cepstrum and LTAS of the acoustic signal have shown promise for either differentiating select dysphonic groups or predicting overall dysphonia, but the combination of these measures in the connected speech of a mixed dysphonic group has been minimally studied. This study addressed the potential for cepstral- and spectral-based measures to differentiate normal speakers from a mixed dysphonic group and whether correlations between acoustic measures and auditory-perceptual ratings of voice quality were evidenced. Of further interest was whether consistency across utterances with differing length and phonemic content would be evidenced for cepstral and spectral moment measures. The dysphonic speakers represented a range of severity and voice quality dimensions.

Group differences of dysphonic and normal speakers
Spectral moments from the LTAS and cepstral-based measures of edited voice samples strongly differentiated the dysphonic from normal speaker groups, supporting the primary hypothesis of this study. For the dysphonic group, spectral mean, spectral SD, CPP, and CPPS were lower, whereas skewness and kurtosis were higher. Five of these six acoustic measures showed statistically significant differences; spectral SD showed less disparity
8 Journal of Voice, Vol. -, No. -, 2010

TABLE 3.
Normal and Dysphonic Speaker Groups’ Means and SDs for Spectral Mean, Spectral SD, Skewness, Kurtosis, CPP, and CPPS, Derived From Unedited Samples Containing Voiced and Unvoiced Segments

Task/Group            Spectral Mean     Spectral SD        Skewness      Kurtosis         CPP            CPPS
Sentence 1
  Normal speaker      577.88 (112.20)   895.47 (162.92)    4.68 (1.33)   31.42 (18.88)    15.36 (1.08)   6.42 (0.58)
  Dysphonic speaker   642.49 (244.77)   1319.41 (452.58)   5.35 (2.13)   37.75 (38.01)    11.44 (0.93)   3.93 (0.84)
Sentence 2
  Normal speaker      383.83 (57.99)    577.58 (107.77)    7.00 (2.04)   72.54 (49.66)    14.82 (1.09)   6.35 (0.69)
  Dysphonic speaker   384.74 (97.77)    764.13 (289.86)    9.22 (3.30)   117.92 (90.86)   11.66 (1.09)   4.31 (1.08)
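
For readers unfamiliar with the four spectral moments listed in Table 3, the sketch below shows one common way to compute them, treating a long-term average spectrum as a distribution of power over frequency. It is an illustration under stated assumptions, not the analysis software used in the study. The function name and example spectra are hypothetical, power (not dB level) is used as the weight, and kurtosis is reported as the standardized fourth moment without subtracting 3 (some tools report excess kurtosis instead).

```python
import math


def spectral_moments(freqs_hz, power):
    """Return (mean, SD, skewness, kurtosis) of a power spectrum.

    Each frequency bin is weighted by its share of the total power.
    """
    total = sum(power)
    weights = [p / total for p in power]
    mean = sum(f * w for f, w in zip(freqs_hz, weights))
    variance = sum(w * (f - mean) ** 2 for f, w in zip(freqs_hz, weights))
    sd = math.sqrt(variance)
    skewness = sum(w * (f - mean) ** 3 for f, w in zip(freqs_hz, weights)) / sd ** 3
    kurtosis = sum(w * (f - mean) ** 4 for f, w in zip(freqs_hz, weights)) / sd ** 4
    return mean, sd, skewness, kurtosis


# Hypothetical spectra (NOT data from the study).
# A symmetric spectrum: skewness should be zero.
symmetric = spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0])
# A strong low-frequency peak with a weak high-frequency tail:
# skewness should be positive.
low_peak_with_tail = spectral_moments(
    [100.0, 200.0, 300.0, 400.0], [8.0, 1.0, 1.0, 1.0]
)
print(symmetric)
print(low_peak_with_tail)
```

Under these definitions, a sharp low-frequency peak followed by low-level energy spread across higher frequencies produces positive skewness, which is the general shape the Discussion attributes to the dysphonic LTAS.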

between groups and did not meet the corrected level of significance (P = 0.01).
Differences in LTAS measures reflect relative degree and dispersion of high- and low-frequency energy in the acoustic signal. LTAS distributions for dysphonic speakers in this study frequently showed low-level energy continuing out to 10 kHz, whereas this pattern was not evidenced in the normal speakers. The presence of increased high-frequency energy has been associated with dysphonic voices,28 and posttreatment voice improvement may be associated with a reduction in high-frequency energy.18 Spectral tilt has also been related to voice quality severity.10,22 In the present study, the sharp peak of energy in the low-frequency range, steep drop-off of energy thereafter, and spread of energy at low levels throughout the high-frequency range likely account for the substantially higher kurtosis and skewness values for the dysphonic group and their lower spectral mean.
Few studies have reported on group differences of LTAS spectral moments between a mixed dysphonic group and normal speakers. Dromey19 addressed spectral moment differences between patients with hypokinetic dysarthria secondary to Parkinson disease and normal speakers in connected speech. He found similar patterns of group means to those evidenced in this study; spectral mean and spectral SD were significantly lower for the speakers with hypokinetic dysarthria relative to the normal speakers, whereas skewness and kurtosis were significantly higher. The similarity in patterns to our results occurred despite the methodological differences between that study and ours; Dromey19 did not exclude unvoiced segments in his LTAS analyses, used a much longer speech sample (nine sentences of the Rainbow passage) to derive the LTAS, and included a dysphonic group with a uniform diagnosis (Parkinson disease).
Recently, Houtz et al29 found that spectral moments differentiated female speakers with adductor spasmodic dysphonia, a neurological disorder, from females with muscle tension dysphonia, whereas males did not show differences between speaker groups. Normal speakers were not included in that study, but patterns indicated that speakers with MTD had higher values for all four spectral moments as compared with speakers with spasmodic dysphonia.
Tanner et al18 determined that spectral mean and spectral SD significantly differentiated pre- to posttreatment changes in voice for a group of individuals with functional dysphonia, whereas skewness and kurtosis did not reflect change after treatment. Interestingly, in their dysphonic speakers, a lowering of the spectral mean and spectral SD was associated with an improvement in voice quality after treatment, an opposite pattern than that of our present study and the study by Dromey.19 In the Tanner et al study, unvoiced portions were not removed from the analyzed signal; thus, the inclusion of unvoiced segments could contribute to the differing patterns of spectral mean and spectral SD between their study and ours. However, Dromey also included unvoiced segments in his LTAS analysis; yet, spectral mean and spectral SD were higher in the normal voice samples than in the disordered samples. Differences in voice quality dimensions in the constituent groups may also be a factor associated with the LTAS patterns evidenced in the Tanner et al study. Although only voice and articulatory severity were reported in the Dromey study, it is likely that his hypokinetic dysarthria group was high on the breathiness dimension of voice quality abnormality. In the present study, most of our dysphonic speakers (70%) also showed breathiness, whereas 41% showed strain. In the Tanner et al study, overall voice severity was reported but individual voice dimensions were not. It is possible that in the Tanner et al functional dysphonia group, strain would be more frequently represented, with less predominance of the breathiness feature. Thus, differences in individual dimensions of voice quality might account for the association of a lower spectral mean and spectral SD with better voice quality in that study. To address the issue of whether the patterns of group differences for spectral moments between dysphonic and normal speakers are different for various individual voice quality dimensions, future studies with individual ratings of each voice quality dimension are needed.
The pattern of differences between cepstral measures (CPP and CPPS) for the dysphonic speakers in this study was similar to that evidenced in previous studies that have included connected speech analysis10,20,21; both CPP and CPPS were lower for the dysphonic speakers relative to the normal speakers. Graphs of the smoothed cepstrum demonstrated that the prominence of peaks relative to the overall level of background noise was minimal for the dysphonic speakers, accounting for their significantly lower CPP and CPPS measures, in line with what Hillenbrand and Houde10 described as a generally flat cepstral distribution for their breathy dysphonic speakers.
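
To make the idea of peak prominence over a flat cepstral background concrete, the sketch below computes a simplified cepstral peak prominence for a single frame: the cepstrum is taken as the spectrum of the log power spectrum, a regression line is fitted to the cepstrum, and CPP is the height of the largest cepstral peak above that line. This is an illustration only and is not the cepstral analysis software used in the study. The Hann window, the 80 to 300 Hz search range, the regression range (1 ms to half the frame), the naive DFT, and the synthetic test signals are all assumptions made for the example, and no smoothing is applied, so the value corresponds to CPP rather than CPPS.

```python
import cmath
import math
import random


def dft(x):
    """Naive O(n^2) discrete Fourier transform; adequate for one short frame."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * m / n) for m in range(n)]
    return [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]


def cepstral_peak_prominence(frame, fs, f0_min=80.0, f0_max=300.0):
    """Return (CPP in dB, quefrency of the cepstral peak in seconds)."""
    n = len(frame)
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = dft([s * w for s, w in zip(frame, hann)])
    # Log power spectrum in dB; the small constant avoids log(0).
    log_spectrum = [10 * math.log10(abs(c) ** 2 + 1e-12) for c in spectrum]
    # Cepstrum: magnitude spectrum of the log spectrum, expressed in dB.
    cepstrum = [20 * math.log10(abs(c) + 1e-12) for c in dft(log_spectrum)]

    # Look for the peak only at quefrencies of plausible fundamental periods.
    q_low = int(fs / f0_max)
    q_high = int(fs / f0_min)
    peak = max(range(q_low, q_high + 1), key=lambda q: cepstrum[q])

    # Least-squares line through the cepstrum from 1 ms up to half the frame.
    xs = list(range(int(0.001 * fs), n // 2))
    ys = [cepstrum[q] for q in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    baseline = mean_y + (sxy / sxx) * (peak - mean_x)
    return cepstrum[peak] - baseline, peak / fs


# Synthetic signals (NOT speech data): a 125 Hz pulse train and white noise.
fs = 8000
n = 512
pulse_train = [1.0 if i % 64 == 0 else 0.0 for i in range(n)]
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(n)]

cpp_pulse, q_pulse = cepstral_peak_prominence(pulse_train, fs)
cpp_noise, q_noise = cepstral_peak_prominence(noise, fs)
print(cpp_pulse, q_pulse)
print(cpp_noise, q_noise)
```

The strictly periodic pulse train yields a cepstral peak at its period (64 samples, or 8 ms) that stands far above the regression line, whereas the noise frame yields a much smaller prominence. This contrast is the property the text above describes for normal versus breathy dysphonic voices, although real speech lies between these two synthetic extremes.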

Relationship of voice severity ratings and acoustic measures
Multiple spectral and cepstral measures were related to voice severity in this study. Of the four spectral moments assessed in this study, spectral mean, kurtosis, and skewness showed moderate or greater relationships (r = 0.64, 0.71, and 0.67, respectively) to ratings of overall voice severity. The directions of the relationships for each measure were similar to those evidenced by Dromey in patients with Parkinson disease. However, in that study, no significant correlations were found between voice severity and spectral moments. Limited data are available on relationships of spectral moments to voice quality in groups that include mixed dysphonic and normal speakers such as in our present study. Measures related to spectral tilt or skewness have shown similar positive correlations with breathiness severity.10 For the cepstral measures, both CPP and CPPS showed moderate to strong relationships with overall voice severity (r = 0.78 and 0.72, respectively). These results are consistent with several other studies in which a strong, negative correlation of overall voice severity and CPP or CPPS has been demonstrated.12,14,20,21,23,30 Objective acoustic measures that show high correlation to subjective voice rating measures may have strong clinical utility in diagnosing voice disorders, characterizing severity of dysphonia, and demonstrating change associated with treatment.

Consistency of acoustic measures across tasks
To our knowledge, consistency of LTAS and cepstral measures across varying connected speech utterances has not been previously addressed. Group means and correlations were compared for sentence 1 versus sentence 2 and sentence 2 versus the constituent phrase to determine within-speaker consistency for the six acoustic measures. Each comparison included differences in both phonetic content and length, with 17 versus 12 words in sentences 1 and 2 and 12 versus 6 words in the comparison of sentence 2 and the constituent phrase. Overall means for most LTAS and cepstral measures were significantly different between sentences 1 and 2 but not between sentence 2 and the constituent phrase. Differences across the sentence 1 and 2 tasks occurred in both groups but were more frequent for the normal speakers (all measures for the normal speakers vs four of six measures for the dysphonic speakers). The normal speakers may show greater flexibility in their vocal vibratory patterns. This flexibility may be manifested as a greater range in acoustic measures both across tasks and within tasks, as evidenced in the trend toward higher spectral SD in the normal speaker group. The similarity between overall means of sentence 2 and the constituent phrase indicates that a relatively short utterance of six words may be sufficient to reveal group spectral patterns.
In spite of the differences in means across some tasks, correlations between sentences 1 and 2 and between sentence 2 and the constituent phrase were high (r ≥ 0.89) for all LTAS and cepstral measures. Thus, neither the length of the connected speech utterance (beyond six words) nor the differing phonemic content affected the pattern of the derived measures; those speakers who were high or low in one speech context were correspondingly high or low in another context. However, when comparing absolute means between groups or from pre- to posttreatment, it is clear that consistent utterances must be used because of the likely differences in mean values across contexts. Future studies should determine whether dysphonic or normal speakers show change in absolute values for spectral moments, CPP, and CPPS when assessed in the same utterance over differing time points and without intervention.

Effects of unvoiced segments on acoustic measures
Studies addressing LTAS and cepstral measures have varied substantially in their analysis approaches, with some incorporating both voiced and unvoiced segments into the analyzed signal10,20,21 and others removing the unvoiced portions of the signal before analysis.16,23 In the present study, we removed pauses and unvoiced segments before computing the LTAS and cepstral measures. Manual editing provided the most reliable detection of voicing in our samples but added substantial time to the analysis process. To determine whether the extraction of these unvoiced segments impacted the acoustic measures computed in this study, we compared mean values derived from the entirety of samples (including unvoiced segments) to values computed from the edited samples (excluding unvoiced segments).
Both cepstral and spectral analyses were impacted by the inclusion of unvoiced segments. Absolute means for all measures were substantially different for the analyses performed on the unedited versus edited sentences. The potential for LTAS measures of spectral mean, skewness, and kurtosis to differentiate dysphonic from normal speakers appeared considerably weakened by the inclusion of unvoiced segments, as group means were not significantly different for three of the four LTAS measures (spectral mean, kurtosis, and skewness) with this analysis method. Spectral SD group differences were significant and showed a reversal in direction from that demonstrated in the edited sentences with unvoiced segments removed. In the analysis of edited samples, spectral SD was the one LTAS measure that did not show significant differences between groups. Cepstral measures, in contrast, continued to reflect significant group differences even when unvoiced segments were retained in the analysis. The direction of differences also remained the same, with dysphonic speakers showing lower CPP and CPPS group means than normal speakers. Hillenbrand and colleagues designed their algorithms to accommodate continuous speech, which perhaps contributes to the lesser impact of unvoiced segments with that analysis approach. Future studies are needed to compare the effects of automated and nonautomated voicing detection methods on a range of spectral- and cepstral-based acoustic measures. When designing and interpreting studies that incorporate acoustic measures, however, consideration of the effects of unvoiced segments may be warranted.

CONCLUSIONS
Both cepstral measures and spectral moments derived from the LTAS strongly differentiated a group of mixed dysphonic speakers from a normal speaker group. Spectral mean, CPP,

and CPPS were significantly lower in the connected speech of dysphonic speakers, whereas kurtosis and skewness measures were significantly higher. A greater degree and dispersion of high-frequency energy contributed to the dysphonic patterns in the LTAS measures, and lesser energy in the fundamental frequency relative to the total background energy of the signal contributed to the dysphonic patterns in the cepstral measures. Consistency of measures across utterances with varying phonemic content is a critical issue in evaluating both the clinical and research utility of these measures. We found that although absolute means for LTAS and cepstral measures differed in comparisons between utterances of differing length and phonemic content, each measure was strongly correlated across utterances. Therefore, individual and group patterns across utterances were similar, supporting the use of these measures at the group level. Future studies are needed to determine whether individuals will show consistency over time with cepstral- and LTAS-based measures when intervention or other functional change has not occurred. The present study addressed relationships between overall dysphonia severity and acoustic measures and found that spectral mean, kurtosis, skewness, and cepstral measures were correlated with voice severity. Previous research has addressed the relationships of cepstral-derived measures to various individual voice dimensions, but further study of these relationships for spectral-based measures is needed.

Acknowledgments
We thank Dr James Hillenbrand for the kind sharing of his cepstral analysis software and Ramani Voleti for her contribution to the perceptual ratings in this study.

REFERENCES
1. Kent R. Hearing and believing: some limits to the auditory-perceptual assessment of speech and voice disorders. Am J Speech Lang Pathol. 1996;5:7–23.
2. Kreiman J, Gerratt BR, Precoda K, Berke GS. Individual differences in voice quality perception. J Speech Hear Res. 1992;35:512–520.
3. Kreiman J, Gerratt BR, Precoda K. Listener experience and perception of voice quality. J Speech Hear Res. 1990;33:103–115.
4. Kreiman J, Gerratt BR. Sources of listener disagreement in voice quality assessment. J Acoust Soc Am. 2000;108:1867–1876.
5. Titze IR, Liang H. Comparison of F0 extraction methods for high-precision voice perturbation measurements. J Speech Hear Res. 1993;36:1120–1133.
6. Rabinov CR, Kreiman J, Gerratt BR, Bielamowicz S. Comparing reliability of perceptual ratings of roughness and acoustic measure of jitter. J Speech Hear Res. 1995;38:26–32.
7. Bielamowicz S, Kreiman J, Gerratt BR, Dauer MS, Berke GS. Comparison of voice analysis systems for perturbation measurement. J Speech Hear Res. 1996;39:126–134.
8. Hillenbrand J, Cleveland RA, Erickson RL. Acoustic correlates of breathy vocal quality. J Speech Hear Res. 1994;37:769–778.
9. Heman-Ackah YD, Heuer RJ, Michael DD, et al. Cepstral peak prominence: a more reliable measure of dysphonia. Ann Otol Rhinol Laryngol. 2003;112:324–333.
10. Hillenbrand J, Houde RA. Acoustic correlates of breathy vocal quality: dysphonic voices and continuous speech. J Speech Hear Res. 1996;39:311–321.
11. Hartl DA, Hans S, Vaissiere J, Brasnu DA. Objective acoustic and aerodynamic measures of breathiness in paralytic dysphonia. Eur Arch Otorhinolaryngol. 2003;260:175–182.
12. Eadie TL, Baylor CR. The effect of perceptual training on inexperienced listeners’ judgments of dysphonic voice. J Voice. 2006;20:527–544.
13. Awan SN, Roy N. Acoustic prediction of voice type in women with functional dysphonia. J Voice. 2005;19:268–282.
14. Heman-Ackah YD, Michael DD, Goding GS Jr. The relationship between cepstral peak prominence and selected parameters of dysphonia. J Voice. 2002;16:20–27.
15. Awan SN, Roy N. Toward the development of an objective index of dysphonia severity: a four-factor acoustic model. Clin Linguist Phon. 2006;20:35–49.
16. Leino T. Long-term average spectrum in screening of voice quality in speech: untrained male university students. J Voice. 2009;23:671–676.
17. Mendoza E, Valencia N, Munoz J, Trujillo H. Differences in voice quality between men and women: use of the long-term average spectrum (LTAS). J Voice. 1996;10:59–66.
18. Tanner K, Roy N, Ash A, Buder EH. Spectral moments of the long-term average spectrum: sensitive indices of voice change after therapy? J Voice. 2005;19:211–222.
19. Dromey C. Spectral measures and perceptual ratings of hypokinetic dysarthria. J Med Speech Lang Pathol. 2003;11:85–94.
20. Awan SN, Roy N, Dromey C. Estimating dysphonia severity in continuous speech: application of a multi-parameter spectral/cepstral model. Clin Linguist Phon. 2009;23:825–841.
21. Halberstam B. Acoustic and perceptual parameters relating to connected speech are more reliable measures of hoarseness than parameters relating to sustained vowels. ORL J Otorhinolaryngol Relat Spec. 2004;66:70–73.
22. Eadie TL, Doyle PC. Classification of dysphonic voice: acoustic and auditory-perceptual measures. J Voice. 2005;19:1–14.
23. Maryn Y, Dick C, Vandenbruaene C, Vauterin T, Jacobs T. Spectral, cepstral, and multivariate exploration of tracheoesophageal voice quality in continuous speech and sustained vowels. Laryngoscope. 2009;119:2384–2394.
24. Voice Disorders Database Model 4337 [computer program]. Lincoln Park, NJ: KayPENTAX; 1994.
25. Roy N, Whitchurch M, Merrill RM, Houtz D, Smith ME. Differential diagnosis of adductor spasmodic dysphonia and muscle tension dysphonia using phonatory break analysis. Laryngoscope. 2008;118:2245–2253.
26. Heman-Ackah YD. Reliability of calculating the cepstral peak without linear regression analysis. J Voice. 2004;18:203–208.
27. Kempster GB, Gerratt BR, Verdolini Abbott K, Barkmeier-Kraemer J, Hillman RE. Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol. Am J Speech Lang Pathol. 2009;18:124–132.
28. Klich RJ. Relationships of vowel characteristics to listener ratings of breathiness. J Speech Hear Res. 1982;25:574–580.
29. Houtz DR, Roy N, Merrill RM, Smith ME. Differential diagnosis of muscle tension dysphonia and adductor spasmodic dysphonia using spectral moments of the long-term average spectrum. Laryngoscope. 2010;120:749–757.
30. Awan SN, Roy N. Outcomes measurement in voice disorders: application of an acoustic index of dysphonia severity. J Speech Lang Hear Res. 2009;52:482–499.
