IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-35, NO. 10, OCTOBER 1987
A Weighted Cepstral Distance Measure for Speech
Recognition
Abstract-A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (dynamic time warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases. The recognition error rate obtained using the weighted cepstral distance measure was about 1 percent for digit recognition. This result was less than one-fourth of that obtained using the simple Euclidean cepstral distance measure and about one-third of the results using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.
I. INTRODUCTION
IN speech recognition based upon template matching, the spectral distance measure is one of the most important problems. A number of distance measures have been tried and still more have been proposed [1]-[5]. Among them, the LPC-based log likelihood ratio distance measure proposed by Itakura [2] has been one of the most successful distance measures, as shown in a recent comparative study of several distortion measures [6]. The Euclidean distance between cepstral coefficients is another important spectral distance measure. This distance measure is widely used with LPC-derived cepstral coefficients since these coefficients may be recursively computed from the linear predictor coefficients [7].
The Euclidean cepstral distance measure has a number of variants due not only to its simple form (i.e., a Euclidean distance between two sets of coefficients) but also to its property that the resulting distance is an approximation to the distance between the two log spectra represented by the cepstral coefficient sets [3], [6], [8].
One variant of the cepstral distance measure is a weighted cepstral distance. Furui [9] used such a weighted cepstral distance measure for automatic speaker verification, where the weight for each cepstral coefficient was the inverse of its intratalker variance. Paliwal [10] studied the performance of a weighted cepstral distance measure for vowel recognition and showed a 1.3 percent recognition rate average improvement from 91.4 to 92.7 percent by using weighting. In his experiment, the cepstral distance measure was the statistically weighted Euclidean distance measure with vowel class specific weights. Nakatsu et al. [11] investigated the effectiveness of weighting the cepstral coefficients by the inverse standard deviation in their speaker-independent isolated word recognition system. According to their experimental results, improvement of recognition accuracy by weighting was about 0.6 percent, from 97.3 to 97.9 percent. Recently, an attempt to use a weighted cepstral measure has been made by the author [12] in his speaker-independent digit recognition, and substantial improvement of the recognition rate has been achieved by using weighting.
Based on the results obtained in previous studies on weighted distance measures, it appears that weighting works; however, there is no clear explanation of why and how it works, or of how to choose an optimal set of weights. It is the purpose of this paper to show that substantial performance improvement in speech recognition can be obtained if the cepstral coefficients are weighted appropriately and to discuss the characteristics and performance of the weighted cepstral distance measure.
The organization of this paper is as follows. In Section II we give statistical characteristics of the LPC-derived cepstral coefficients and discuss definitions of the weighted cepstral distance measure. In Section III we describe experiments to evaluate the performance of the weighted cepstral distance measure. All the recognition results we show in this paper are from speaker-independent, isolated word recognition experiments. We also give experimental results obtained from databases of two vastly different word vocabularies. In Section IV we discuss the results. Finally, in Section V our findings are summarized.
II. PRELIMINARIES
A. The LPC Euclidean Cepstral Distance Measure
The LPC cepstral coefficients c(i) are computed recursively from the pth-order linear predictor coefficients a(i) by the following relations [7]:

$$c(1) = -a(1)$$
$$c(i) = -a(i) - \sum_{k=1}^{i-1} \left(1 - \frac{k}{i}\right) a(k)\, c(i-k), \qquad 1 < i \le p. \tag{1}$$

Manuscript received October 22, 1985; revised April 16, 1987.
The author is with ATR Auditory & Visual Perception Research Laboratories, Twin 21-MID-Tower, 2-1-61 Shiromi, Osaka 540, Japan.
IEEE Log Number 8716005.
0096-3518/87/1000-1414$01.00 © 1987 IEEE
TOHKURA: WEIGHTED CEPSTRAL DISTANCE MEASURE FOR SPEECH RECOGNITION
The cepstral distance $d_{CEP}$, i.e., the Euclidean distance between the LPC cepstral coefficients, is defined as follows:

$$d_{CEP} = \sum_{i=1}^{p} \left(c_t(i) - c_r(i)\right)^2 \tag{2}$$

where $c_t(i)$ and $c_r(i)$ are the ith cepstral coefficients of a test frame and those of a reference frame, respectively.
B. Statistical Characteristics of the Cepstral Coefficients
Fig. 1 shows statistical distributions of the cepstral coefficients obtained from a digit database uttered by 100 talkers over dialed-up telephone lines. Individual distributions of the first cepstral coefficient, for each digit, are shown in Fig. 2.
From both figures we observe that the distributions of each cepstral coefficient across all the digits are nearly Gaussian, although the distributions within individual digits differ somewhat from one another. Also, it can be seen
in Fig. 1 that there is a large difference in variance among
the cepstral coefficients. That is, the variance of the higher
cepstral coefficients is much smaller than the variance of
the lower cepstral coefficients (e.g., compare C8 and C1
in Fig. 1). The variances of the individual cepstral coefficients are plotted in Fig. 3 as a function of the cepstral coefficient index. The variance of the eighth cepstral coefficient is about one-twentieth of the variance of the first cepstral coefficient.
Fig. 1. Statistical distributions of the cepstral coefficients. (Horizontal axis: parameter range, -2.0 to 2.0.)
Fig. 2. Statistical distributions of the first cepstral coefficients for each of ten digits. (Horizontal axis: parameter range; one panel per digit.)
Fig. 3. Cepstral coefficient variances. (Horizontal axis: cepstral coefficient index.)
C. The Weighted Cepstral Distance Measure
A distance measure which has been widely used in speech recognition is the Mahalanobis distance, $d_{MCEP}$, defined as follows:

$$d_{MCEP} = (c_t - c_r)\, V^{-1} (c_t - c_r)^{T} \tag{3}$$

where $c_t$ and $c_r$ are feature row vectors which are composed of the cepstral coefficients obtained from a test utterance and a reference template, respectively, and V is the covariance matrix of the feature vector. The measure $d_{MCEP}$ is a (covariance) weighted distance measure which can be used for clustering feature vectors or for recognition purposes.
There are some difficulties in applying the measure $d_{MCEP}$ to speech recognition. They are estimation accuracy of off-diagonal terms and sensitivity introduced by the matrix inversion in computing $d_{MCEP}$, since the off-diagonal terms are relatively small compared to the diagonal ones. In addition to these, it is computationally expensive to compute the measure $d_{MCEP}$.
One solution to these problems is to use the diagonal part of the covariance matrix V. Then the measure $d_{MCEP}$ is redefined as the weighted cepstral distance measure $d_{WCEP}$ which is described by the following equation:

$$d_{WCEP} = \sum_{i=1}^{p} w(i) \left(c_t(i) - c_r(i)\right)^2 \tag{4}$$

where w(i) is the inverse of the ith diagonal element $v_{ii}$ of the covariance matrix V. The measure $d_{WCEP}$ is a weighted Euclidean distance measure where each individual cepstral component c(i) is variance-equalized by the weight w(i).
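The relationship between (2), (3), and (4) can be illustrated with a short numerical sketch. The function names and the three-coefficient example values below are hypothetical and are not taken from the paper; the sketch only checks that the Mahalanobis distance with a diagonal covariance matrix reduces to the inverse-variance weighted distance.

```python
import numpy as np

def d_cep(c_t, c_r):
    """Euclidean cepstral distance, as in (2)."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    return float(np.sum(diff ** 2))

def d_mcep(c_t, c_r, V):
    """Mahalanobis distance, as in (3), for a covariance matrix V."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    # Solve V x = diff instead of forming the explicit inverse of V.
    return float(diff @ np.linalg.solve(V, diff))

def d_wcep(c_t, c_r, w):
    """Weighted cepstral distance, as in (4), with weights w(i)."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    return float(np.sum(np.asarray(w, float) * diff ** 2))

# Hypothetical example values (not measured data).
variances = np.array([0.40, 0.10, 0.02])   # assumed diagonal of V
c_test = np.array([0.50, -0.20, 0.05])
c_ref = np.array([0.10, -0.10, 0.01])
w = 1.0 / variances                        # inverse-variance weights

print(d_cep(c_test, c_ref))
print(d_wcep(c_test, c_ref, w))
print(d_mcep(c_test, c_ref, np.diag(variances)))
```

With a diagonal V the last two printed values coincide, and with unit weights the weighted distance falls back to the Euclidean one.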
III. EXPERIMENTS
Several experiments were carried out to study the performance of the weighted cepstral distance measure. In Section III-A we describe the experimental isolated word speech recognition system used in the experiments. Configurations of the experimental database are given in Section III-B. Finally, in Section III-C, we give results of five recognition experiments which show the general performance and some characteristics of the weighted cepstral distance measure.
A. The Isolated Word Speech Recognition System
The speech recognition system used in the experiments
is a conventional LPC-based isolated word recognizer.
Fig. 4 shows a block diagram of the recognition system.
In this system, isolated word speech input, recorded off a local telephone line, is band-pass filtered from 100 to 3200 Hz, and digitized to 16 bits at a sampling frequency of 6.67 kHz. Preemphasis of the digitized speech is accomplished by a first-order digital filter whose transfer function is $H(z) = 1 - 0.95 z^{-1}$. A 45 ms Hamming window spaced every 15 ms is used in an 8th-order autocorrelation analysis. Following autocorrelation analysis, endpoints of the input speech are detected.
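The front-end steps described above (preemphasis, Hamming windowing, and autocorrelation analysis) might be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the frame sizes are obtained by rounding 45 ms and 15 ms at 6.67 kHz to whole samples, and band-pass filtering and endpoint detection are omitted.

```python
import numpy as np

FS = 6670                         # sampling frequency in Hz (6.67 kHz)
FRAME_LEN = round(0.045 * FS)     # 45 ms window, rounded to 300 samples
FRAME_SHIFT = round(0.015 * FS)   # 15 ms spacing, rounded to 100 samples
ORDER = 8                         # order of the autocorrelation analysis

def preemphasize(x, mu=0.95):
    """Apply H(z) = 1 - mu z^-1; the first sample is passed through unchanged."""
    x = np.asarray(x, float)
    return np.concatenate(([x[0]], x[1:] - mu * x[:-1]))

def frame_autocorrelations(x, order=ORDER):
    """Return autocorrelation lags 0..order for each Hamming-windowed frame."""
    x = np.asarray(x, float)
    window = np.hamming(FRAME_LEN)
    n_frames = max(0, 1 + (len(x) - FRAME_LEN) // FRAME_SHIFT)
    out = np.empty((n_frames, order + 1))
    for t in range(n_frames):
        start = t * FRAME_SHIFT
        seg = x[start:start + FRAME_LEN] * window
        for lag in range(order + 1):
            out[t, lag] = np.dot(seg[:FRAME_LEN - lag], seg[lag:])
    return out
```

Each row of the returned array holds the nine autocorrelation values that an 8th-order LPC analysis would consume for one frame.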
The next step in the recognizer is LPC analysis. By using Durbin's recursive procedure [13], the linear predictor coefficients a(i) are calculated from the autocorrelation coefficients. The LPC-derived cepstral coefficients c(i) are obtained from the linear predictor coefficients a(i) using (1). Thus, the input speech is represented by a set of feature vectors (a test pattern) each of whose elements is an LPC-derived parameter.
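The two steps of this paragraph can be sketched in a few lines. The code below is an illustrative implementation of the standard Levinson-Durbin recursion and of recursion (1), not the software used in the experiments; the function names are hypothetical, and the predictor polynomial is assumed to be A(z) = 1 + a(1)z^-1 + ... + a(p)z^-p, which is the sign convention under which c(1) = -a(1).

```python
import numpy as np

def durbin(r, p):
    """Linear predictor coefficients a(1..p) from autocorrelation lags r[0..p].

    Assumed convention: A(z) = 1 + sum_k a(k) z^-k.
    """
    r = np.asarray(r, float)
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]                                # prediction error energy
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                        # reflection coefficient
        prev = a.copy()
        a[i] = k
        a[1:i] = prev[1:i] + k * prev[i - 1:0:-1]
        err *= 1.0 - k * k
    return a[1:]

def lpc_to_cepstrum(a):
    """Recursion (1); a[0] holds a(1) and the result c[0] holds c(1)."""
    p = len(a)
    c = np.zeros(p)
    c[0] = -a[0]
    for i in range(2, p + 1):
        acc = sum((1 - k / i) * a[k - 1] * c[i - k - 1] for k in range(1, i))
        c[i - 1] = -a[i - 1] - acc
    return c
```

A simple sanity check is a first-order autoregressive model with pole at 0.5, whose autocorrelation is 0.5^n and whose cepstrum is 0.5^n / n: calling `lpc_to_cepstrum(durbin(0.5 ** np.arange(4), 3))` reproduces those values for n = 1, 2, 3.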
A dynamic time warping (DTW) algorithm [14] is used
to compare a test pattern to each of the reference patterns
which have been prepared as templates in the recognition
system. In the algorithm, the local spectral distance between the test and the reference pattern is calculated by either the Euclidean distance measure between the LPC cepstral coefficients, the weighted cepstral distance measure, or the LPC-based log likelihood ratio distance measure [2].
For the final recognition decision, the word whose reference pattern gives the smallest distance is chosen.
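The template matching and decision rule can be sketched as below. This is a simplified illustration only: it uses a basic symmetric step pattern and length normalization, which are assumptions and not necessarily the path constraints of the DTW algorithm in [14], and the function names are hypothetical. The local distance is the weighted cepstral distance; passing unit weights gives the Euclidean cepstral distance.

```python
import numpy as np

def dtw_distance(test, ref, w):
    """Accumulated weighted cepstral distance along the best warping path.

    test, ref: arrays of shape (n_frames, p); w: weights of shape (p,).
    """
    test = np.asarray(test, float)
    ref = np.asarray(ref, float)
    n, m = len(test), len(ref)
    # Local distance between every test frame and every reference frame.
    local = np.sum(w * (test[:, None, :] - ref[None, :, :]) ** 2, axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
            acc[i, j] = local[i - 1, j - 1] + best
    return acc[n, m] / (n + m)                # simple length normalization

def recognize(test, templates, w):
    """templates: list of (word, pattern) pairs, possibly several per word.

    Returns the word whose reference pattern gives the smallest distance.
    """
    return min(templates, key=lambda wp: dtw_distance(test, wp[1], w))[0]
```

Because `templates` is a flat list of pairs, several reference patterns per word can be supplied and the minimum is taken over all of them.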
Fig. 4. Block diagram of an isolated word recognition system.
B. Database
To evaluate the performance of the weighted cepstral distance measure, two word vocabularies were used. The first was a small size vocabulary consisting of the ten English digits (0-9). The second was a medium size vocabulary consisting of the 129 airline reservation terms [15]. For the digit vocabulary, three databases (described below) were used in the experiments. These databases had the following characteristics.
Database I (DB-1): 1000 isolated digit utterances spoken by 100 talkers (50 male and 50 female). Each talker uttered each digit once.
Database II (DB-2): 500 isolated digit utterances spoken by 10 talkers (5 male and 5 female). The talkers are a subset of the 100 talkers in the database DB-1 but the utterances are different from those in DB-1.
Database III (DB-3): 1000 isolated digit utterances spoken by the same 100 talkers as used in DB-1. Each digit utterance is randomly sampled from 20 recordings across five different recording sessions of each digit by each talker.
The database of the 129-word airline vocabulary is as follows.
Database IV (DB-4): 1290 isolated word utterances spoken by 10 male talkers. Each talker uttered each word in the 129 airline reservation terms once.
All the databases were recorded over local dialed-up telephone lines in a sound booth.
The set of speaker-independent templates used as the reference patterns in the DTW process was obtained from a statistical clustering analysis of a large database consisting of 100 replications of each word (i.e., once by each of 100 talkers) [16]. The database used to create the templates was different in talkers and recording conditions from the test databases described above. The number of templates per word was 12 in all the experiments.
C. Recognition Experiments and Results
Experiment I-Performance of the Weighted Cepstral Distance Measure: As discussed previously in Section II, the weighted cepstral distance measure, $d_{WCEP}$, with weights equal to the inverse variance of the cepstral coefficients, is reasonable from a statistical viewpoint. Before discussing experimental results, it is important to explain exactly how the variance is calculated.
In this study, the intraword, intertalker variances are computed first for each individual word of a database. All the individual intraword variances are then averaged together to obtain the final intraword variances. When calculating the variance from a digits database, intertalker variances for each digit are calculated and then they are averaged across the ten digits.
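The averaging just described might be sketched as follows. The data layout is an assumption made for illustration: a dictionary mapping each word to an array of cepstral vectors pooled over talkers. How frames are selected or aligned before pooling is not specified here, and the function name is hypothetical.

```python
import numpy as np

def inverse_variance_weights(cepstra_by_word):
    """Average per-word (intraword, intertalker) variances, then invert them.

    cepstra_by_word: dict mapping word -> array of shape (n_vectors, p).
    Returns (mean_variance, weights), each of shape (p,).
    """
    per_word = [np.var(np.asarray(v, float), axis=0)
                for v in cepstra_by_word.values()]
    mean_var = np.mean(per_word, axis=0)      # average across the words
    return mean_var, 1.0 / mean_var

# Hypothetical toy data: two words, two cepstral coefficients.
toy = {
    "one": np.array([[0.0, 0.0], [2.0, 1.0]]),
    "two": np.array([[1.0, 0.0], [1.0, 2.0]]),
}
mean_var, weights = inverse_variance_weights(toy)
print(mean_var, weights)
```

Computing the variance within each word first, and only then averaging, keeps differences between words from inflating the variance estimate.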
In order to learn how much the variances depend upon database and talkers, individual variances were obtained from subsets of the databases DB-1 and DB-4. One subset of DB-1 is composed of 500 male utterances and another is composed of 500 female utterances. As seen in Fig. 5, which gives plots of variance versus cepstral coefficient index, we can observe some differences in variances between male and female utterances; however, these are not significant. Variances calculated from DB-4, the airline vocabulary, were essentially the same as those calculated from the digits vocabulary. Thus, it is assumed that we can use any variance if it is computed from a database which has some amount of variability with respect to vocabulary words and talkers. In the following experiments in this section, we use the weighting function shown in Fig. 6, which is the inverse variance obtained from the data of DB-1.
A digit recognition experiment was performed, as described in Section III-A, in order to determine the overall performance of the recognition system using the weighted cepstral distance measure. Databases used for testing were DB-2 and DB-3. Results of this experiment, in the form of recognition rates for different distance measures, are shown in Table I. Recognition rates using the weighted cepstral distance measure were 99 and 98.8 percent for DB-2 and DB-3, respectively. For comparison purposes, experimental results using the Euclidean cepstral distance and the log likelihood ratio $d_{LLR}$ are also shown in the table. The weighted cepstral distance measure provides substantially better results than the Euclidean cepstral distance measure and significantly better results than the log likelihood ratio distance measure. A confusion matrix of digit errors, for the data of DB-3, is shown in Table II.
In order to understand the reason why the weighted cepstral distance measure works so well for isolated word recognition, a series of recognition experiments was performed as described in the following sections.
Experiment II-Sensitivity of Weighting: In the previous experiment, we have seen that the cepstral coefficients variance is relatively data independent if the database used to compute the variances includes some amount of variability with respect to talkers and words in the vocabulary. An important question is whether the difference of the variances (i.e., the difference of weighting) actually produces any significant difference in final recognition rate. To answer this question we need to know how sensitive the final experimental results are to the weighting used in the weighted cepstral distance measure.
Fig. 5. Cepstral coefficient variance difference between male and female utterances for the English ten digits (0-9) and that between word vocabularies (i.e., digits and airline reservation terms). (Horizontal axis: cepstral coefficient index, 1-8.)
Fig. 6. Inverse variance weighting in the weighted cepstral distance measure. (Horizontal axis: cepstral coefficient index, 1-8.)
TABLE I
RECOGNITION RATES (PERCENT) FOR DIFFERENT SPECTRAL DISTANCE MEASURES, AND FOR THE TWO DIGITS DATABASES DB-2 AND DB-3

          d_CEP    d_WCEP    d_LLR
DB-2      95.0     99.0      96.6
DB-3      96.5     98.8      97.4
TABLE II
CONFUSION MATRIX OF DIGIT ERRORS FOR THE DATA OF DB-3
Fig. 7. Influence of weighting coefficient perturbation by $w(i)^{\alpha}$ on recognition error rate.
In an experiment to study sensitivity of weighting, the weighted cepstral distance measure $d_{WCEP}$ in (4) was modified into $d'_{WCEP}$ to produce a perturbation of the weights by the parameter $\alpha$ as follows:

$$d'_{WCEP} = \sum_{i=1}^{p} w(i)^{\alpha} \left(c_t(i) - c_r(i)\right)^2. \tag{5}$$

That is, $d'_{WCEP} = d_{CEP}$ if $\alpha = 0$, and $d'_{WCEP} = d_{WCEP}$ if $\alpha = 1$. Experimental results of recognition tests using the
modified distance of (5) are shown in Fig. 7, which gives the recognition error rate as a function of $\alpha$, where $\alpha$ varies from 0 to 2. The database used in the experiment was DB-2.
The curve of error rate versus parameter $\alpha$, shown in Fig. 7, has a minimum at $\alpha \approx 1$, which means that weighting by the inverse variance of the cepstral coefficients is an optimal one among the kinds of weighting functions defined by $w(i)^{\alpha}$. We also observe that the error rate is close to the minimum for a range of $\alpha$ from 0.9 through 1.1. By comparing the weighting perturbation produced by varying $\alpha$ from 0.9 to 1.1 with the cepstral coefficient variance perturbation from the different databases, it is noted that the latter is smaller than the former. A conclusion, here, is that a small perturbation in weighting, due to the databases difference shown in Fig. 5, makes no significant difference in speech recognition results.
Fig. 8. Recognition error rate versus cepstral coefficient index k above which weighting coefficients are saturated.
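The weight perturbation of (5) is simple to express in code. The sketch below uses hypothetical weights and cepstral vectors and a hypothetical function name; it only confirms the two limiting cases stated in the text, namely that an exponent of 0 gives the Euclidean cepstral distance and an exponent of 1 gives the inverse-variance weighted distance.

```python
import numpy as np

def d_wcep_alpha(c_t, c_r, w, alpha):
    """Distance of (5): weights w(i) raised to the power alpha."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    return float(np.sum(np.asarray(w, float) ** alpha * diff ** 2))

# Hypothetical example values (not measured data).
w = np.array([2.5, 10.0, 50.0])
c_test = np.array([0.50, -0.20, 0.05])
c_ref = np.array([0.10, -0.10, 0.01])

for alpha in (0.0, 0.9, 1.0, 1.1, 2.0):
    print(alpha, d_wcep_alpha(c_test, c_ref, w, alpha))
```

Sweeping the exponent in this way corresponds to moving along the horizontal axis of the error rate curve described for Fig. 7.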
Experiment III-Role of the Lower Order Cepstral Coefficients: It is important to understand whether the gain in recognizer performance using the weighted cepstral distance measure is due to the deweighting of the lower order cepstral coefficients, or the weighting of the higher order coefficients. For this purpose, the distance measure was modified as follows:

$$d''_{WCEP} = \sum_{i=1}^{k} w(i) \left(c_t(i) - c_r(i)\right)^2 + \sum_{i=k+1}^{p} w(k+1) \left(c_t(i) - c_r(i)\right)^2. \tag{6}$$

This measure is intended to weight the lower k cepstral coefficients by the respective inverse variances, and to weight the higher p - k coefficients by a constant value which is equal to the inverse variance of the (k + 1)th cepstral coefficient. That is, the weighting in the measure is changed to a fixed value, w(k + 1), when the cepstral coefficient order is larger than k.
Using a value of p = 8, recognition experiments were performed for various values of k from 1 through 7. Fig. 8 shows the resulting recognition error rate as a function of the cepstral coefficient index k. The database DB-2 was used for this experiment.
From Fig. 8 we see that weighting clamping for cepstral coefficients higher than fourth-order does not affect the recognition results. On the other hand, weighting clamping for the lower order cepstral coefficients is very important and can significantly raise the error rate. As fewer low-order cepstral coefficients are deweighted (with respect to the higher order coefficients), the recognition error rate rises. However, even if the first-order cepstral coefficient is the only one which is deweighted, the error rate decreases by more than 2 percent. Since the first cepstral coefficient represents speech spectrum tilt [11], it is highly sensitive to talker differences and telephone channel characteristics. The error rate decrease achieved by deweighting the first cepstral coefficient suggests that undesirable variation of the cepstral distance due to the variability of talkers and/or transmission can be reduced.
Experiment IV-Recognition Using the Airlines Vocabulary: In Experiment I we achieved about 99 percent correct recognition rate for speaker-independent isolated word recognition by using the weighted cepstral distance measure. In that case, however, the vocabulary size was
small (i.e., 10 digits) and the weighting used in the experiment was calculated from a different database of the same vocabulary. Although we have shown that the cepstral coefficient variances stay relatively invariant, and that the sensitivity of the inverse variance weighting is not high, it is still necessary to have an experiment using a vocabulary which is different in both size and content from the digits in order to confirm the robustness and wide applicability of the weighted cepstral distance measure.
For this purpose we chose a 129-word airline vocabulary which was described as the database DB-4 in Section III-B. The cepstral distance weighting used here was exactly the same one (Fig. 6) used in the digits recognition. Templates were generated from a database which was different in talkers and recording conditions from that used in DB-4. The number of speaker-independent templates per word was again 12.
Recognition results (i.e., number of errors in 129 words) as a function of talker are shown in Fig. 9. Results obtained using the log likelihood ratio distance measure and the Euclidean cepstral distance measure are given for comparison purposes in the figure. Average recognition error rates using the weighted cepstral distance, the Euclidean cepstral distance, and the log likelihood ratio distance measure are 7.5, 11.9, and 9.3 percent, respectively. We see from the figure that the improvement obtained using the weighted cepstral distance is remarkable, particularly for two of the talkers who have a large number of errors using either of the other two distance measures. Thus, we see that the weighted cepstral distance measure is quite effective in equalizing the performance of the recognizer across different talkers.
Experiment V-Relationship Between the Number of Cepstral Coefficients and Recognition Rate: It was shown in the previous experiments that the weighted cepstral distance measure works very well for two vastly different sets of word vocabularies. In the weighted cepstral distance measure, the weighting by inverse variance of the cepstral coefficients was a statistically optimal one. The number of the cepstral coefficients was 8 (i.e., p = 8) throughout all of our experiments. An important question is whether the inverse variance weighting would still be an optimal one if p becomes larger than 8.
In the case of the Euclidean cepstral distance, the larger the number of cepstral coefficients used in the measure, the higher the recognition rate; and the recognition rate usually saturates when p reaches somewhere between 12 and 16 (i.e., 1.5-2 times the size of the LPC polynomial). However, one would not expect to observe similar characteristics in the case of the weighted cepstral distance, since values of the inverse variance weighting for the higher order cepstral coefficients continue to increase.
Fig. 10 shows experimental results, in the form of error rate as a function of the number of the cepstral coefficients, in comparing the weighted cepstral distance and the Euclidean cepstral distance. The two curves show very different characteristics. The error rates go down, for both distance measures, as the number of cepstral coefficients increases up to 8, but the error rate increases for the weighted cepstral distance measure as the number of coefficients increases above 8, whereas it remains fairly constant for the Euclidean cepstral distance. The database used in this experiment was DB-2, and cepstral coefficients higher than 8 were maximum entropy extensions [3] of the 8 lower order cepstral coefficients.
Fig. 9. Number of word errors as function of talkers in recognition experiment using airline vocabulary.
Fig. 10. Recognition error rate versus the number of cepstral coefficients, comparing the weighted cepstral distance and the Euclidean cepstral distance, and the original (p ≤ 8) and additional extended (p > 8) cepstral coefficients variances for comparisons.
These results suggest that we have two effects in the inverse variance weighting. One is to weight (actually, deweight) the lower order cepstral coefficients, and it is a positive effect. The other is to weight the higher order cepstral coefficients, and it can be a negative effect if they are weighted too much. The negative effect increases when p goes beyond 8.
Conclusions that can be drawn from the experimental results are that the best performance is given when the number of cepstral coefficients is equal to the original LPC analysis order (i.e., p = 8) in the case of the inverse variance weighting, and that another weighting may be necessary to achieve an effective use of the cepstral coefficients extended beyond 8.
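As an illustration of how cepstral coefficients beyond the analysis order might be generated, the sketch below continues recursion (1) past i = p with a(i) taken as zero, which yields the cepstrum of the all-pole model. Whether this matches the maximum entropy extension procedure of [3] in every detail is an assumption, and the function name `extend_cepstrum` is hypothetical.

```python
import numpy as np

def extend_cepstrum(a, n_coeffs):
    """Cepstrum c(1..n_coeffs) of 1/A(z), with A(z) = 1 + sum_k a(k) z^-k.

    For i <= p this is recursion (1); for i > p the same recursion is
    continued with a(i) = 0, so only the sum over k = 1..p remains.
    """
    a = np.asarray(a, float)
    p = len(a)
    c = np.zeros(n_coeffs)
    for i in range(1, n_coeffs + 1):
        acc = sum((1 - k / i) * a[k - 1] * c[i - k - 1]
                  for k in range(1, min(i - 1, p) + 1))
        a_i = a[i - 1] if i <= p else 0.0
        c[i - 1] = -a_i - acc
    return c

# First-order example: one pole at 0.5, so c(n) should equal 0.5**n / n.
print(extend_cepstrum([-0.5], 4))
```

The extended coefficients carry no information beyond the original p predictor coefficients, which is consistent with the observation that weighting them heavily does not help.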
IV. DISCUSSION
A. Quefrency Weighted Cepstral Distance Measure
The quefrency weighted cepstral distance measure, which is another form of weighted cepstral distance measure, has previously been applied to vowel recognition experiments [10]. The measure has the following form:

$$d_{QCEP} = \sum_{i=1}^{p} \left(i\, c_t(i) - i\, c_r(i)\right)^2 = \sum_{i=1}^{p} i^2 \left(c_t(i) - c_r(i)\right)^2. \tag{7}$$

That is, each cepstral coefficient is multiplied by its respective quefrency. Comparing the weighting coefficient $i^2$ in $d_{QCEP}$ to the weighting w(i) from Fig. 6, which is the inverse variance of the cepstral coefficients, we can find some similarity between these two, although the weighting by $i^2$ weights the higher order cepstral coefficients much more than $w(i)^{\alpha}$ in (5) does for the case that $\alpha$ is slightly greater than 1.
A recognition experiment, whose experimental conditions were the same as those used in Experiments II and III in the previous section, was performed in order to measure the performance of the quefrency weighted cepstral distance measure. The minimum digit recognition error rate was 1.6 percent, which is almost the same as the error rate given by using $\alpha = 1.4$ in the $w(i)^{\alpha}$ weighting (see Fig. 7).
Consequently, the quefrency weighted cepstral distance measure works quite well. Although the error rate using this measure is slightly larger than that obtained by using the inverse variance weighted cepstral distance measure, it is far smaller than that obtained by using the Euclidean cepstral distance measure.
In order to understand why these weighted cepstral distance measures work well, let us go back to the relationship between the speech spectrum and its cepstral coefficients. Let $S(\omega)$ be the power spectrum of the speech signal; then its respective cepstral coefficient c(i) is represented as follows:

$$c(i) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, e^{ji\omega}\, d\omega \tag{8}$$

or, equivalently,

$$\log S(\omega) = \sum_{i=-\infty}^{\infty} c(i)\, e^{-ji\omega}. \tag{9}$$

Taking the derivative of (9) with respect to the frequency $\omega$, we get

$$\frac{\partial \log S(\omega)}{\partial \omega} = \sum_{i=-\infty}^{\infty} (-ji)\, c(i)\, e^{-ji\omega}. \tag{10}$$

Using (10), we get the following equation:

$$\sum_{i=-\infty}^{\infty} i^2 \left(c_t(i) - c_r(i)\right)^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{\partial \log S_t(\omega)}{\partial \omega} - \frac{\partial \log S_r(\omega)}{\partial \omega} \right|^2 d\omega \tag{11}$$

where $S_t(\omega)$ and $S_r(\omega)$ are power spectra of the test and the reference frames. Since the quefrency weighted cepstral distance is a truncated version of the left term in (11), it means that the quefrency weighted cepstral distance is a measure of the integrated distance between local spectrum slopes. Thus, the quefrency weighted cepstral distance has some similarity to the weighted slope metric proposed by Klatt [4], where critical band based spectral slopes were used.
We can define a combination form of the quefrency weighted and the Euclidean cepstral distance measure as follows:

$$a\, d_{CEP} + d_{QCEP} = a \sum_{i=1}^{p} \left(c_t(i) - c_r(i)\right)^2 + \sum_{i=1}^{p} i^2 \left(c_t(i) - c_r(i)\right)^2 = \sum_{i=1}^{p} (a + i^2) \left(c_t(i) - c_r(i)\right)^2. \tag{12}$$
This new measure can be made quite similar to the inverse variance weighted cepstral distance by appropriate choice of the value of a.
One conclusion that can be drawn from the discussion above is that the weighted cepstral distance measure includes information about the distance between the local spectrum slopes and it is effective for measuring the difference between two speech spectra.
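The quefrency weighting of (7) and the combination form of (12) can be checked numerically with a short sketch. The function names and the example vectors are hypothetical; the constant a is a free parameter whose value is not specified here.

```python
import numpy as np

def d_qcep(c_t, c_r):
    """Quefrency weighted cepstral distance of (7): weights i**2."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    i = np.arange(1, len(diff) + 1)
    return float(np.sum(i ** 2 * diff ** 2))

def d_combined(c_t, c_r, a):
    """Combination form of (12): weights a + i**2."""
    diff = np.asarray(c_t, float) - np.asarray(c_r, float)
    i = np.arange(1, len(diff) + 1)
    return float(np.sum((a + i ** 2) * diff ** 2))

# Hypothetical example values (not measured data).
c_test = np.array([0.50, -0.20, 0.05, 0.02])
c_ref = np.array([0.10, -0.10, 0.01, 0.03])
a = 3.0

euclid = float(np.sum((c_test - c_ref) ** 2))
print(d_qcep(c_test, c_ref))
print(d_combined(c_test, c_ref, a), a * euclid + d_qcep(c_test, c_ref))
```

The last line prints the same number twice, which reflects that (12) is a weighted sum of the Euclidean distance of (2) and the quefrency weighted distance of (7).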
B. Windowing in the Cepstral Domain
We have shown in Section III-C that as the number of
cepstral coefficients used in the weighted cepstral distance
measure increases, the recognition error rate decreases at
first, and then increases as thenumber of the cepstral coefficients becomes larger than 8 (the order of the LPC anal-
1421
TOHKURA: WEIGHTED CEPSTRAL DISTANCE MEASURE FOR SPEECH RECOGNITION
d
OUEFRENCY
(a)
zy
higher quefrency components, thereby reducing undesirable variability in the speech spectrum. In other words,
we can select the most useful quefrency components. This
is especially useful when we want to use a number of the
cepstral coefficients larger than 8. It is noted that the saturated weighting used in Experiment I11 (Section 111-C) is
quitesimilartothe
trapezoid lifterinFig.11(c).
v.
SUMMARY
zyz
In this paper the weighted cepstral distance measure,
with weighting coefficients set equal to the inverse variance of the cepstral coefficients, has been studied.
4
4
Through several experiments using a speaker-independent, isolated word, recognition system, based upon template matching, the characteristics and the performance
of
the weighted cepstral distance measure have been studied.
We summarize our experimental results and findings as
follows.
1) The weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance
and the log likelihood ratio distance measure across two
different databases, namely, 10 digits and 129-word airline vocabulary databases.
2) The most significant performance characteristic of the weighted cepstral distance is that it tended to significantly reduce the error rate variance across talkers.
3) The inverse variance weighting coefficients are data independent, since the cepstral coefficients variances, calculated from various kinds of database, do not show any significant difference which affects final speech recognition results.
4) The most important feature of the weighting is that it deweights the lower order cepstral coefficients, rather than weighting the higher order cepstral coefficients.
5) The statistically weighted cepstral distance measure with the inverse variance weight is characteristically quite similar to the quefrency weighted cepstral distance measure which is an expression of the spectral slope distance measure arrived at from perceptual considerations.
6) With respect to the relationship between the number of cepstral coefficients and recognition rate, the recognition performance is best when the number of cepstral coefficients is equal to the LPC analysis order (i.e., p = 8). It is suggested that some of the band-pass lifters (proposed in Section IV) can provide improved performance in achieving an effective use of the cepstral coefficients extended beyond 8 [18].
ACKNOWLEDGMENT
The author wishes to thank L. R. Rabiner, F. K. Soong, and B. H. Juang for their valuable suggestions and stimulating discussions throughout this work. Thanks are also due to J. G. Wilpon for providing speech recognition software. Additionally, the author would like to thank J. L. Flanagan for inviting him to AT&T Bell Laboratories to do this work.
Fig. 11. Lifters in the weighted cepstral distance measures. (a) Rectangular, (b) asymmetric triangular, (c) trapezoid, (d) triangular, (e) raised sine.
ysis). It has been concluded that this result is due to weighting the higher order cepstral coefficients too much.
One practical and simple solution to avoid the undesirable effects caused by overly weighting the higher order coefficients is to use only the lower order cepstral coefficients (below 8th order) or to saturate the weights for the higher order cepstral coefficients as was done in Experiment III.
Another solution is to find an optimal weighting from a viewpoint which is different from that used for the inverse variance or the quefrency weighting. One reasonable and elegant view is to explain these weightings by looking at the windowing which occurs in the cepstral domain, i.e., so-called liftering [17].
In Fig. 11 several types of lifters (windows in the cepstral domain) are illustrated. A rectangular lifter is equivalent to using the first p cepstral coefficients without weighting. An asymmetric triangular lifter, shown in Fig. 11(b), is equivalent to the quefrency weighting. Since this lifter has the effect of looking like a differentiator in the log frequency domain, it produces noise in the higher quefrency components when the number of the cepstral coefficients p becomes large. Therefore, it cannot be a desirable lifter to obtain good speech spectrum features for recognition when p > 8.
Several symmetric lifters, such as the trapezoid, triangular, and raised sine lifter, are shown in Fig. 11(c)-(e). These lifters work as band-pass lifters with respect to quefrency. By using them, we eliminate both the lower and
For further work on the weighted cepstral distance measures, the use of band-pass liftering in speech recognition
has been discussed by Juang et al. [18], and application
of the weighted cepstral distance measures for recognition
of noisy speech has been studied by Hanson and Wakita
[19].
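As an illustration of the liftering view described in Section IV, the following Python sketch treats each lifter as a set of per-coefficient weights in a weighted squared cepstral distance. It is a minimal sketch under stated assumptions, not the software used in the experiments. The function names are hypothetical. The distance is assumed to have the form sum_i w_i (c_i - c'_i)^2, the asymmetric triangular lifter is assumed to scale coefficient i by i, and the raised sine shape 1 + (p/2) sin(pi i / p) is one commonly used form, assumed here rather than read off Fig. 11. The cepstral vectors and variances in the usage lines are made-up numbers.

```python
import math


def rectangular_lifter(p):
    """Unit weight on each of the first p cepstral coefficients (no weighting)."""
    return [1.0] * p


def quefrency_lifter(p):
    """Asymmetric triangular lifter: coefficient i is scaled by i (assumed form)."""
    return [float(i) for i in range(1, p + 1)]


def raised_sine_lifter(p):
    """Band-pass style lifter, assumed form 1 + (p/2) * sin(pi * i / p)."""
    return [1.0 + 0.5 * p * math.sin(math.pi * i / p) for i in range(1, p + 1)]


def inverse_variance_weights(variances):
    """Statistical weights w_i = 1 / variance_i of each cepstral coefficient."""
    return [1.0 / v for v in variances]


def weighted_cepstral_distance(c_test, c_ref, weights):
    """Weighted squared distance: sum_i w_i * (c_test_i - c_ref_i)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, c_test, c_ref))


def liftered_distance(c_test, c_ref, lifter):
    """Euclidean distance between liftered cepstra.

    Scaling coefficient i by lifter_i before taking the squared Euclidean
    distance is the same as using the weight w_i = lifter_i ** 2.
    """
    return weighted_cepstral_distance(c_test, c_ref, [l * l for l in lifter])


if __name__ == "__main__":
    # Illustrative (made-up) 4th-order cepstral vectors and variances.
    test = [0.9, -0.3, 0.15, -0.05]
    ref = [1.0, -0.2, 0.10, -0.10]
    variances = [0.5, 0.2, 0.05, 0.02]

    print("Euclidean:        ", liftered_distance(test, ref, rectangular_lifter(4)))
    print("Quefrency lifter: ", liftered_distance(test, ref, quefrency_lifter(4)))
    print("Raised sine:      ", liftered_distance(test, ref, raised_sine_lifter(4)))
    print("Inverse variance: ",
          weighted_cepstral_distance(test, ref, inverse_variance_weights(variances)))
```

Expressing the lifters and the inverse variance weighting through the same distance function makes the comparison in the text direct: the measures differ only in how strongly each cepstral index contributes to the sum.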
REFERENCES
[1] F. Itakura and S. Saito, “An analysis-synthesis telephony based on
maximum likelihood method,” in Proc. Int. Congr. Acoust., C-5-5,
1968.
[2] F. Itakura, “Minimum prediction residual principle applied to speech
recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-23, pp. 67-72, Feb. 1975.
[3] A. Gray and J. Markel, “Distance measures for speech processing,”
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp.
380-391, Oct. 1976.
[4] D. H. Klatt, “Prediction of perceived phonetic distance from critical
band spectra: A first step,” in Proc. ICASSP 1982, vol. 2, May 1982,
pp. 1278-1281.
[5] M. Sugiyama and K. Shikano, “Frequency weighted LPC spectral
matching measures,” IECE Trans., vol. J65-A, pp. 965-972, Sept.
1982 (in Japanese).
[6] N. Nocerino, F. K. Soong, L. R. Rabiner, and D. H. Klatt, “Comparative study of several distortion measures for speech recognition,”
in Proc. ICASSP 1985, vol. 1, Mar. 1985, pp. 25-28.
[7] B. S. Atal, “Effectiveness of linear prediction characteristics of the
speech wave for automatic speaker identification and verification,”
J. Acoust. Soc. Amer., vol. 55, no. 6, pp. 1304-1312, June 1974.
[8] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing.
Englewood Cliffs, NJ: Prentice-Hall, 1975.
[9] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 254-272, Apr. 1981.
[10] K. K. Paliwal, “On the performance of the quefrency-weighted cepstral coefficients in vowel recognition,” Speech Commun., vol. 1, pp.
151-154, May 1982.
[11] R. Nakatsu, H. Nagashima, J. Kojima, and N. Ishii, “A speech recognition method for telephone voice,” IECE Trans., vol. J66-D, pp.
377-384, Apr. 1983 (in Japanese).
[12] Y. Tohkura, “Speaker-independent recognition of isolated digits using
a weighted cepstral distance,” J. Acoust. Soc. Amer., supp. 1, vol.
77, E13, Spring 1985.
[13] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol.
63, pp. 562-580, Apr. 1975.
[14] L. R. Rabiner and S. E. Levinson, “Isolated and connected word
recognition-Theory and selected application,” IEEE Trans. Commun.,
vol. COM-29, pp. 621-659, May 1981.
[15] J. G. Wilpon, L. R. Rabiner, and A. Bergh, “Speaker-independent
isolated word recognition using a 129-word airline vocabulary,” J.
Acoust. Soc. Amer., vol. 72, no. 2, pp. 390-396, Aug. 1982.
[16] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon,
“Speaker-independent recognition of isolated words using clustering
techniques,” IEEE Trans. Acoust., Speech, Signal Processing, vol.
ASSP-27, pp. 336-349, Aug. 1979.
[17] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency
analysis of time series for echoes: Cepstrum, pseudo-autocovariance,
cross-cepstrum and saphe cracking,” in Proc. Symp. Time Series
Anal., M. Rosenblatt, Ed. New York: Wiley, 1963, pp. 209-243.
[18] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of bandpass liftering in speech recognition,” in Proc. ICASSP 1986, vol. 1,
Apr. 1986, pp. 765-768.
[19] B. A. Hanson and H. Wakita, “Spectral slope based distortion measures for all-pole models of speech,” in Proc. ICASSP 1986, vol. 1,
Apr. 1986, pp. 757-760.