
A weighted cepstral distance measure for speech recognition

1986, IEEE Transactions on Acoustics, Speech, and Signal Processing

A weighted cepstral distance measure is proposed and is tested in a speaker-independent isolated word recognition system using standard DTW (dynamic time warping) techniques. The measure is a statistically weighted distance measure with weights equal to the inverse variance of the cepstral coefficients. The experimental results show that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measures across two different databases. The recognition error rate obtained using the weighted cepstral distance measure was about 1 percent for digit recognition. This result was less than one-fourth of that obtained using the simple Euclidean cepstral distance measure and about one-third of that obtained using the log likelihood ratio distance measure. The most significant performance characteristic of the weighted cepstral distance was that it tended to equalize the performance of the recognizer across different talkers.

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-35, NO. 10, OCTOBER 1987

I. INTRODUCTION

In speech recognition based upon template matching, the spectral distance measure is one of the most important problems. A number of distance measures have been tried and still more have been proposed [1]-[5]. Among them, the LPC-based log likelihood ratio distance measure proposed by Itakura [2] has been one of the most successful distance measures, as shown in a recent comparative study of several distortion measures [6].

The Euclidean distance between cepstral coefficients is another important spectral distance measure. This distance measure is widely used with LPC-derived cepstral coefficients since these coefficients may be recursively computed from the linear predictor coefficients [7]. The Euclidean cepstral distance measure has a number of variants due not only to its simple form (i.e., a Euclidean distance between two sets of coefficients) but also to its property that the resulting distance is an approximation to the distance between the two log spectra represented by the cepstral coefficient sets [3], [6], [8].

One variant of the cepstral distance measure is a weighted cepstral distance. Furui [9] used such a weighted cepstral distance measure for automatic speaker verification, where the weight for each cepstral coefficient was the inverse of its intratalker variance. Paliwal [10] studied the performance of a weighted cepstral distance measure for vowel recognition and showed an average recognition rate improvement of 1.3 percent, from 91.4 to 92.7 percent, by using weighting. In his experiment, the cepstral distance measure was the statistically weighted Euclidean distance measure with vowel class specific weights. Nakatsu et al. [11] investigated the effectiveness of weighting the cepstral coefficients by the inverse standard deviation in their speaker-independent isolated word recognition system. According to their experimental results, the improvement of recognition accuracy by weighting was about 0.6 percent, from 97.3 to 97.9 percent. Recently, an attempt to use a weighted cepstral measure has been made by the author [12] in his speaker-independent digit recognition, and a substantial improvement of the recognition rate has been achieved by using weighting.

Based on the results obtained in previous studies on weighted distance measures, it appears that weighting works; however, there is no clear explanation of why and how it works, or of how to choose an optimal set of weights.
It is the purpose of this paper to show that substantial performance improvement in speech recognition can be obtained if the cepstral coefficients are weighted appropriately, and to discuss the characteristics and performance of the weighted cepstral distance measure.

The organization of this paper is as follows. In Section II we give statistical characteristics of the LPC-derived cepstral coefficients and discuss definitions of the weighted cepstral distance measure. In Section III we describe experiments to evaluate the performance of the weighted cepstral distance measure. All the recognition results we show in this paper are from speaker-independent, isolated word recognition experiments. We also give experimental results obtained from databases of two vastly different word vocabularies. In Section IV we discuss the results. Finally, in Section V our findings are summarized.

II. PRELIMINARIES

A. The LPC Euclidean Cepstral Distance Measure

The LPC cepstral coefficients c(i) are computed recursively from the pth-order linear predictor coefficients a(i) by the following relations [7]:

$$c(1) = -a(1)$$
$$c(i) = -a(i) - \sum_{k=1}^{i-1} \left(1 - \frac{k}{i}\right) a(k)\, c(i-k), \qquad 1 < i \le p. \tag{1}$$

Manuscript received October 22, 1985; revised April 16, 1987. The author is with ATR Auditory & Visual Perception Research Laboratories, Twin 21-MID-Tower, 2-1-61 Shiromi, Osaka 540, Japan. IEEE Log Number 8716005. 0096-3518/87/1000-1414$01.00 © 1987 IEEE

TOHKURA: WEIGHTED CEPSTRAL DISTANCE MEASURE FOR SPEECH RECOGNITION

The cepstral distance $d_{CEP}$, i.e., the Euclidean distance between the LPC cepstral coefficients, is defined as follows:

$$d_{CEP} = \sum_{i=1}^{p} \left(c_t(i) - c_r(i)\right)^2 \tag{2}$$

where $c_t(i)$ and $c_r(i)$ are the ith cepstral coefficients of a test frame and those of a reference frame, respectively.

B. Statistical Characteristics of the Cepstral Coefficients

Fig. 1 shows statistical distributions of the cepstral coefficients obtained from a digit database uttered by 100 talkers over dialed-up telephone lines. Individual distributions of the first cepstral coefficient, for each digit, are shown in Fig. 2. From both figures we observe that the distributions of each cepstral coefficient across all the digits are nearly Gaussian, although the distributions within each digit differ somewhat from one another. Also, it can be seen in Fig. 1 that there is a large difference in variance among the cepstral coefficients. That is, the variance of the higher cepstral coefficients is much smaller than the variance of the lower cepstral coefficients (e.g., compare C8 and C1 in Fig. 1). The variances of the individual cepstral coefficients are plotted in Fig. 3 as a function of the cepstral coefficient index. The variance of the eighth cepstral coefficient is about one-twentieth of the variance of the first cepstral coefficient.

[Fig. 1. Statistical distributions of the cepstral coefficients. Horizontal axis: parameter range, -2.0 to 2.0.]

C. The Weighted Cepstral Distance Measure

A distance measure which has been widely used in speech recognition is the Mahalanobis distance, $d_{MCEP}$, defined as follows:

$$d_{MCEP} = (c_t - c_r)\, V^{-1} (c_t - c_r)^{T} \tag{3}$$

where $c_t$ and $c_r$ are feature row vectors which are composed of the cepstral coefficients obtained from a test utterance and a reference template, respectively, and V is the covariance matrix of the feature vector. The measure $d_{MCEP}$ is a (covariance) weighted distance measure which can be used for clustering feature vectors or for recognition purposes.

There are some difficulties in applying the measure $d_{MCEP}$ to speech recognition. They are the estimation accuracy of the off-diagonal terms and the sensitivity introduced by the matrix inversion in computing $d_{MCEP}$, since the off-diagonal terms are relatively small compared to the diagonal ones.
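As a rough illustration of the recursion (1) and the Euclidean cepstral distance (2) of Section II-A, the following Python sketch may help. The function names `lpc_to_cepstrum` and `d_cep` are hypothetical, and the sign convention (an inverse filter of the form A(z) = 1 + a(1)z^-1 + ... + a(p)z^-p) is an assumption inferred from c(1) = -a(1); it is not stated explicitly in the text.

```python
def lpc_to_cepstrum(a):
    """Convert predictor coefficients [a(1), ..., a(p)] into cepstral
    coefficients [c(1), ..., c(p)] with the recursion of (1).

    Assumed convention (not stated in the paper): A(z) = 1 + sum_k a(k) z^-k.
    """
    p = len(a)
    c = [0.0] * (p + 1)            # c[0] is unused so indices match the text
    for i in range(1, p + 1):
        acc = -a[i - 1]            # the -a(i) term
        for k in range(1, i):      # k = 1, ..., i-1
            acc -= (1.0 - k / i) * a[k - 1] * c[i - k]
        c[i] = acc
    return c[1:]


def d_cep(c_t, c_r):
    """Euclidean cepstral distance of (2): sum of squared differences."""
    if len(c_t) != len(c_r):
        raise ValueError("cepstral vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(c_t, c_r))


# Illustrative single-pole filters; for A(z) = 1 - r z^-1 the cepstrum of
# 1/A(z) is r**n / n, which gives a simple check of the recursion.
test_frame = lpc_to_cepstrum([-0.5, 0.0, 0.0])
reference_frame = lpc_to_cepstrum([-0.4, 0.0, 0.0])
print(test_frame, reference_frame, d_cep(test_frame, reference_frame))
```

The nested loop mirrors (1) term by term rather than aiming for speed; with p = 8, as used in the experiments, the cost is negligible.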
In addition to these, it is computationally expensive to compute the measure $d_{MCEP}$. One solution to these problems is to use only the diagonal part of the covariance matrix V. Then the measure $d_{MCEP}$ is redefined as the weighted cepstral distance measure $d_{WCEP}$, which is described by the following equation:

$$d_{WCEP} = \sum_{i=1}^{p} w(i) \left(c_t(i) - c_r(i)\right)^2 \tag{4}$$

where w(i) is the inverse of the ith diagonal element $v_{ii}$ of the covariance matrix V. The measure $d_{WCEP}$ is a weighted Euclidean distance measure where each individual cepstral component c(i) is variance-equalized by the weight w(i).

[Fig. 2. Statistical distributions of the first cepstral coefficients for each of ten digits. Horizontal axis: parameter range.]

[Fig. 3. Cepstral coefficient variances. Horizontal axis: cepstral coefficient index.]

III. EXPERIMENTS

Several experiments were carried out to study the performance of the weighted cepstral distance measure. In Section III-A we describe the experimental isolated word speech recognition system used in the experiments. Configurations of the experimental database are given in Section III-B. Finally, in Section III-C, we give results of five recognition experiments which show the general performance and some characteristics of the weighted cepstral distance measure.

A. The Isolated Word Speech Recognition System

The speech recognition system used in the experiments is a conventional LPC-based isolated word recognizer. Fig. 4 shows a block diagram of the recognition system. In this system, isolated word speech input, recorded off a local telephone line, is band-pass filtered from 100 to 3200 Hz, and digitized to 16 bits at a sampling frequency of 6.67 kHz. Preemphasis of the digitized speech is accomplished by a first-order digital filter whose transfer function is $H(z) = 1 - 0.95 z^{-1}$. A 45 ms Hamming window spaced every 15 ms is used in an 8th-order autocorrelation analysis. Following autocorrelation analysis, endpoints of the input speech are detected.

The next step in the recognizer is LPC analysis. By using Durbin's recursive procedure [13], the linear predictor coefficients a(i) are calculated from the autocorrelation coefficients. The LPC-derived cepstral coefficients c(i) are obtained from the linear predictor coefficients a(i) using (1). Thus, the input speech is represented by a set of feature vectors (a test pattern), each of whose elements is an LPC-derived parameter.

A dynamic time warping (DTW) algorithm [14] is used to compare a test pattern to each of the reference patterns which have been prepared as templates in the recognition system. In the algorithm, the local spectral distance between the test and the reference pattern is calculated by either the Euclidean distance measure between the LPC cepstral coefficients, the weighted cepstral distance measure, or the LPC-based log likelihood ratio distance measure [2]. For the final recognition decision, the word whose reference pattern gives the smallest distance is chosen.

[Fig. 4. Block diagram of an isolated word recognition system.]

B. Database

To evaluate the performance of the weighted cepstral distance measure, two word vocabularies were used. The first was a small size vocabulary consisting of the ten English digits (0-9). The second was a medium size vocabulary consisting of the 129 airline reservation terms [15]. For the digit vocabulary, three databases (described below) were used in the experiments. These databases had the following characteristics.

Database I (DB-1): 1000 isolated digit utterances spoken by 100 talkers (50 male and 50 female). Each talker uttered each digit once.

Database II (DB-2): 500 isolated digit utterances spoken by 10 talkers (5 male and 5 female). The talkers are a subset of the 100 talkers in the database DB-1, but the utterances are different from those in DB-1.

Database III (DB-3): 1000 isolated digit utterances spoken by the same 100 talkers as used in DB-1. Each digit utterance is randomly sampled from 20 recordings across five different recording sessions of each digit by each talker.

The database of the 129-word airline vocabulary is as follows.

Database IV (DB-4): 1290 isolated word utterances spoken by 10 male talkers. Each talker uttered each word in the 129 airline reservation terms once.

All the databases were recorded over local dialed-up telephone lines in a sound booth. The set of speaker-independent templates used as the reference patterns in the DTW process was obtained from a statistical clustering analysis of a large database consisting of 100 replications of each word (i.e., once by each of 100 talkers) [16]. The database used to create the templates was different in talkers and recording conditions from the test databases described above. The number of templates per word was 12 in all the experiments.

C. Recognition Experiments and Results

Experiment I-Performance of the Weighted Cepstral Distance Measure: As discussed previously in Section II, the weighted cepstral distance measure $d_{WCEP}$, with weights equal to the inverse variance of the cepstral coefficients, is reasonable from a statistical viewpoint. Before discussing experimental results, it is important to explain exactly how the variance is calculated. In this study, the intraword, intertalker variances are computed first for each individual word of a database. All the individual intraword variances are then averaged together to obtain the final intraword variances. When calculating the variance from a digits database, intertalker variances for each digit are calculated and then they are averaged across the ten digits.
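The variance averaging just described, together with the weighted distance of (4), can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a dictionary mapping each word to a list of cepstral vectors pooled over talkers), the use of the population variance, and all function names are hypothetical choices, since the text does not specify how frames are pooled or which variance estimator was used.

```python
from statistics import pvariance


def averaged_intraword_variances(vectors_by_word):
    """Compute per-coefficient variances within each word, then average
    them across words. `vectors_by_word` maps word -> list of cepstral
    vectors (assumed layout; one vector per talker in this toy example)."""
    words = list(vectors_by_word)
    p = len(vectors_by_word[words[0]][0])
    totals = [0.0] * p
    for word in words:
        vectors = vectors_by_word[word]
        for i in range(p):
            # Population variance is an assumption; the paper does not say.
            totals[i] += pvariance([v[i] for v in vectors])
    return [t / len(words) for t in totals]


def inverse_variance_weights(variances):
    """w(i) = 1 / variance of the ith cepstral coefficient."""
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be positive")
    return [1.0 / v for v in variances]


def d_wcep(c_t, c_r, w):
    """Weighted cepstral distance of (4)."""
    return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, c_t, c_r))


# Toy data: two words, two talkers each, two cepstral coefficients.
toy = {
    "one": [[0.0, 0.0], [2.0, 1.0]],
    "two": [[1.0, 0.0], [1.0, 1.0]],
}
variances = averaged_intraword_variances(toy)
weights = inverse_variance_weights(variances)
print(variances, weights, d_wcep([1.0, 1.0], [0.0, 0.0], weights))
```

Averaging the per-word variances, rather than pooling all words before computing a variance, keeps differences between word means from inflating the estimate, which matches the intraword emphasis in the text.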
In order to learn how much the variances depend upon database and talkers, individual variances were obtained from subsets of the databases DB-1 and DB-4. One subset of DB-1 is composed of 500 male utterances and another is composed of 500 female utterances. As seen in Fig. 5, which gives plots of variance versus cepstral coefficient index, we can observe some differences in variances between male and female utterances; however, these are not significant. Variances calculated from DB-4, the airline vocabulary, were essentially the same as those calculated from the digits vocabulary. Thus, it is assumed that we can use any variance if it is computed from a database which has some amount of variability with respect to vocabulary words and talkers. In the following experiments in this section, we use the weighting function shown in Fig. 6, which is the inverse variance obtained from the data of DB-1.

A digit recognition experiment was performed, as described in Section III-A, in order to determine the overall performance of the recognition system using the weighted cepstral distance measure. Databases used for testing were DB-2 and DB-3. Results of this experiment, in the form of recognition rates for different distance measures, are shown in Table I. Recognition rates using the weighted cepstral distance measure were 99.0 and 98.8 percent for DB-2 and DB-3, respectively. For comparison purposes, experimental results using the Euclidean cepstral distance $d_{CEP}$ and the log likelihood ratio $d_{LLR}$ are also shown in the table. The weighted cepstral distance measure provides substantially better results than the Euclidean cepstral distance measure and significantly better results than the log likelihood ratio distance measure. A confusion matrix of digit errors, for the data of DB-3, is shown in Table II.
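To make the comparison concrete, error rates can be derived from the recognition rates as 100 minus the rate. The short sketch below does this for the DB-2 figures listed in Table I (95.0, 99.0, and 96.6 percent); the variable and function names are illustrative only.

```python
# DB-2 recognition rates (percent) as listed in Table I.
rates_db2 = {"d_CEP": 95.0, "d_WCEP": 99.0, "d_LLR": 96.6}


def error_rate(recognition_rate_percent):
    """Error rate in percent, taken as 100 minus the recognition rate."""
    return 100.0 - recognition_rate_percent


errors = {name: error_rate(rate) for name, rate in rates_db2.items()}

# Ratio of the weighted-distance error rate to each baseline error rate.
ratio_vs_cep = errors["d_WCEP"] / errors["d_CEP"]
ratio_vs_llr = errors["d_WCEP"] / errors["d_LLR"]
print(errors, ratio_vs_cep, ratio_vs_llr)
```

On these figures the weighted measure's error rate is one-fifth of the Euclidean one and a little under three-tenths of the log likelihood ratio one, in line with the "less than one-fourth" and "about one-third" statements in the abstract.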
In order to understand the reason why the weighted cepstral distance measure works so well for isolated word recognition, a series of recognition experiments was performed as described in the following sections.

[Fig. 5. Cepstral coefficient variance difference between male and female utterances for the English ten digits (0-9) and that between word vocabularies (i.e., digits and airline reservation terms). Horizontal axis: cepstral coefficient index.]

[Fig. 6. Inverse variance weighting in the weighted cepstral distance measure. Horizontal axis: cepstral coefficient index.]

TABLE I
RECOGNITION RATES (PERCENT) FOR DIFFERENT SPECTRAL DISTANCE MEASURES, AND FOR THE TWO DIGITS DATABASES DB-2 AND DB-3

          d_CEP    d_WCEP    d_LLR
DB-2      95.0     99.0      96.6
DB-3      96.5     98.8      97.4

[TABLE II. Confusion matrix of digit errors for DB-3. The individual entries are not recoverable from the source text.]

Experiment II-Sensitivity of Weighting: In the previous experiment, we have seen that the cepstral coefficient variance is relatively data independent if the database used to compute the variances includes some amount of variability with respect to talkers and words in the vocabulary. An important question is whether the difference of the variances (i.e., the difference of weighting) actually produces any significant difference in final recognition rate. To answer this question we need to know how sensitive the final experimental results are to the weighting used in the weighted cepstral distance measure.

In an experiment to study sensitivity of weighting, the weighted cepstral distance measure $d_{WCEP}$ in (4) was modified into $d'_{WCEP}$ to produce a perturbation of the weights by the parameter $\alpha$ as follows:

$$d'_{WCEP} = \sum_{i=1}^{p} w(i)^{\alpha} \left(c_t(i) - c_r(i)\right)^2. \tag{5}$$

That is, $d'_{WCEP} = d_{CEP}$ if $\alpha = 0$, and $d'_{WCEP} = d_{WCEP}$ if $\alpha = 1$. Experimental results of recognition tests using the modified distance of (5) are shown in Fig. 7, which gives the recognition error rate as a function of $\alpha$, where $\alpha$ varies from 0 to 2. The database used in the experiment was DB-2.

[Fig. 7. Influence of weighting coefficient perturbation by $w(i)^{\alpha}$ on recognition error rate.]

The curve of error rate versus parameter $\alpha$, shown in Fig. 7, has a minimum at $\alpha \approx 1$, which means that weighting by the inverse variance of the cepstral coefficients is an optimal one among the kinds of weighting functions defined by $w(i)^{\alpha}$. We also observe that the error rate is close to the minimum for a range of $\alpha$ from 0.9 through 1.1. By comparing the weighting perturbation produced by varying $\alpha$ from 0.9 to 1.1 with the cepstral coefficient variance perturbation from the different databases, it is noted that the latter is smaller than the former. A conclusion, here, is that a small perturbation in weighting, due to the database differences shown in Fig. 5, makes no significant difference in speech recognition results.

Experiment III-Role of the Lower Order Cepstral Coefficients: It is important to understand whether the gain in recognizer performance using the weighted cepstral distance measure is due to the deweighting of the lower order cepstral coefficients, or the weighting of the higher order coefficients. For this purpose, the distance measure was modified as follows:

$$d = \sum_{i=1}^{k} w(i) \left(c_t(i) - c_r(i)\right)^2 + \sum_{i=k+1}^{p} w(k+1) \left(c_t(i) - c_r(i)\right)^2. \tag{6}$$

This measure is intended to weight the lower k cepstral coefficients by the respective inverse variances, and to weight the higher p - k coefficients by a constant value which is equal to the inverse variance of the (k + 1)th cepstral coefficient. That is, the weighting in the measure is changed to a fixed value, w(k + 1), when the cepstral coefficient order is larger than k.

Using a value of p = 8, recognition experiments were performed for various values of k from 1 through 7. Fig. 8 shows the resulting recognition error rate as a function of the cepstral coefficient index k. The database DB-2 was used for this experiment.

[Fig. 8. Recognition error rate versus cepstral coefficient index k above which weighting coefficients are saturated.]

From Fig. 8 we see that weighting clamping for cepstral coefficients higher than fourth-order does not affect the recognition results. On the other hand, weighting clamping for the lower order cepstral coefficients is very important and can significantly raise the error rate. As fewer low-order cepstral coefficients are deweighted (with respect to the higher order coefficients), the recognition error rate rises. However, even if the first-order cepstral coefficient is the only one which is deweighted, the error rate decreases by more than 2 percent. Since the first cepstral coefficient represents speech spectrum tilt [11], it is highly sensitive to talker differences and telephone channel characteristics. The error rate decrease achieved by deweighting the first cepstral coefficient suggests that undesirable variation of the cepstral distance due to the variability of talkers and/or transmission can be reduced.

Experiment IV-Recognition Using the Airlines Vocabulary: In Experiment I we achieved about 99 percent correct recognition rate for speaker-independent isolated word recognition by using the weighted cepstral distance measure. In that case, however, the vocabulary size was small (i.e., 10 digits) and the weighting used in the experiment was calculated from a different database of the same vocabulary.
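Looking back at the two modified measures of Experiments II and III, both can be viewed as transformations of the weight vector followed by the same weighted sum. The sketch below illustrates this view; the function names and the example weights are hypothetical and are not taken from the paper's data.

```python
def perturbed_weights(w, alpha):
    """Weights w(i)**alpha as in (5): alpha = 0 gives the unweighted
    Euclidean case, alpha = 1 gives inverse-variance weighting."""
    return [wi ** alpha for wi in w]


def saturated_weights(w, k):
    """Weights as in (6): keep w(1..k), then hold the constant value
    w(k+1) for every coefficient of order k+1..p."""
    if not 1 <= k < len(w):
        raise ValueError("k must satisfy 1 <= k < p")
    return w[:k] + [w[k]] * (len(w) - k)   # w[k] is w(k+1) in 1-based terms


def weighted_distance(c_t, c_r, w):
    """Weighted sum of squared cepstral differences."""
    return sum(wi * (x - y) ** 2 for wi, x, y in zip(w, c_t, c_r))


# Hypothetical weights that grow with coefficient index (illustration only).
w = [1.0, 2.0, 4.0, 8.0]
c_t = [0.3, 0.1, -0.2, 0.05]
c_r = [0.1, 0.0, -0.1, 0.00]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, weighted_distance(c_t, c_r, perturbed_weights(w, alpha)))
for k in (1, 2, 3):
    print(k, weighted_distance(c_t, c_r, saturated_weights(w, k)))
```

Expressing both experiments as weight transformations means a single distance routine can be reused inside the DTW local-distance computation, with only the weight vector changing between conditions.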
Although we have shown that the cepstral coefficient variances stay relatively invariant, and that the sensitivity of the inverse variance weighting is not high, it is still necessary to have an experiment using a vocabulary which is different in both size and content from the digits in order to confirm the robustness and wide applicability of the weighted cepstral distance measure. For this purpose we chose a 129-word airline vocabulary which was described as the database DB-4 in Section III-B. The cepstral distance weighting used here was exactly the same one (Fig. 6) used in the digits recognition. Templates were generated from a database which was different in talkers and recording conditions from that used in DB-4. The number of speaker-independent templates per word was again 12.

Recognition results (i.e., number of errors in 129 words) as a function of talker are shown in Fig. 9. Results obtained using the log likelihood ratio distance measure and the Euclidean cepstral distance measure are given for comparison purposes in the figure. Average recognition error rates using the weighted cepstral distance, the Euclidean cepstral distance, and the log likelihood ratio distance measure are 7.5, 11.9, and 9.3 percent, respectively. We see from the figure that the improvement obtained using the weighted cepstral distance is remarkable, particularly for two of the talkers who have a large number of errors using either of the other two distance measures. Thus, we see that the weighted cepstral distance measure is quite effective in equalizing the performance of the recognizer across different talkers.

Experiment V-Relationship Between the Number of Cepstral Coefficients and Recognition Rate: It was shown in the previous experiments that the weighted cepstral distance measure works very well for two vastly different sets of word vocabularies.
In the weighted cepstral distance measure, the weighting by inverse variance of the cepstral coefficients was a statistically optimal one. The number of the cepstral coefficients was 8 (i.e., p = 8) throughout all of our experiments. An important question is whether the inverse variance weighting would still be an optimal one if p becomes larger than 8. In the case of the Euclidean cepstral distance, the larger the number of cepstral coefficients used in the measure, the higher the recognition rate; and the recognition rate usually saturates when p reaches somewhere between 12 and 16 (i.e., 1.5-2 times the size of the LPC polynomial). However, one would not expect to observe similar characteristics in the case of the weighted cepstral distance, since values of the inverse variance weighting for the higher order cepstral coefficients continue to increase.

[Fig. 9. Number of word errors as function of talkers in recognition experiment using airline vocabulary.]

[Fig. 10. Recognition error rate versus the number of cepstral coefficients, comparing the weighted cepstral distance and the Euclidean cepstral distance, and the original (p ≤ 8) and additional extended (p > 8) cepstral coefficients variances for comparisons.]

Fig. 10 shows experimental results, in the form of error rate as a function of the number of the cepstral coefficients, in comparing the weighted cepstral distance and the Euclidean cepstral distance. The two curves show very different characteristics. The error rates go down, for both distance measures, as the number of cepstral coefficients
increases up to 8, but the error rate increases for the weighted cepstral distance measure as the number of coefficients increases above 8, whereas it remains fairly constant for the Euclidean cepstral distance. The database used in this experiment was DB-2, and cepstral coefficients higher than 8 were maximum entropy extensions [3] of the 8 lower order cepstral coefficients.

These results suggest that we have two effects in the inverse variance weighting. One is to weight (actually, deweight) the lower order cepstral coefficients, and it is a positive effect. The other is to weight the higher order cepstral coefficients, and it can be a negative effect if they are weighted too much. The negative effect increases when p goes beyond 8. Conclusions that can be drawn from the experimental result are that the best performance is given when the number of cepstral coefficients is equal to the original LPC analysis order (i.e., p = 8) in the case of the inverse variance weighting, and that another weighting may be necessary to achieve an effective use of the cepstral coefficients extended beyond 8.

IV. DISCUSSION

A. Quefrency Weighted Cepstral Distance Measure

The quefrency weighted cepstral distance measure, which is another form of weighted cepstral distance measure, has previously been applied to vowel recognition experiments [10]. The measure has the following form:

$$d_{QCEP} = \sum_{i=1}^{p} \left(i\,c_t(i) - i\,c_r(i)\right)^2 = \sum_{i=1}^{p} i^2 \left(c_t(i) - c_r(i)\right)^2. \tag{7}$$

That is, each cepstral coefficient is multiplied by its respective quefrency. Comparing the weighting coefficient $i^2$ in $d_{QCEP}$ to the weighting w(i) from Fig. 6, which is the inverse variance of the cepstral coefficients, we can find some similarity between these two, in particular, although the weighting by $i^2$ weights the higher order cepstral coefficients much more than $w(i)^{\alpha}$ in (5) for the case that $\alpha$ is slightly greater than 1.

A recognition experiment, whose experimental conditions were the same as those used in Experiments II and III in the previous section, was performed in order to measure the performance of the quefrency weighted cepstral distance measure. The minimum digit recognition error rate was 1.6 percent, which is almost the same as the error rate given by using $\alpha = 1.4$ in the $w(i)^{\alpha}$ weighting (see Fig. 7). Consequently, the quefrency weighted cepstral distance measure works quite well. Although the error rate using this measure is slightly larger than that obtained by using the inverse variance weighted cepstral distance measure, it is far smaller than that obtained by using the Euclidean cepstral distance measure.

In order to understand why these weighted cepstral distance measures work well, let us go back to the relationship between the speech spectrum and its cepstral coefficients. Let $S(\omega)$ be the power spectrum of the speech signal; then its respective cepstral coefficient c(i) is represented as follows:

$$c(i) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log S(\omega)\, e^{ji\omega}\, d\omega \tag{8}$$

or, equivalently,

$$\log S(\omega) = \sum_{i=-\infty}^{\infty} c(i)\, e^{-ji\omega}. \tag{9}$$

Taking the derivative of (9) with respect to the frequency $\omega$, we get

$$\frac{\partial}{\partial \omega} \log S(\omega) = \sum_{i=-\infty}^{\infty} (-ji)\, c(i)\, e^{-ji\omega}. \tag{10}$$

Using (10), we get the following equation:

$$\sum_{i=-\infty}^{\infty} i^2 \left(c_t(i) - c_r(i)\right)^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{\partial}{\partial \omega} \log S_t(\omega) - \frac{\partial}{\partial \omega} \log S_r(\omega) \right|^2 d\omega \tag{11}$$

where $S_t(\omega)$ and $S_r(\omega)$ are power spectra of the test and the reference frames. Since the quefrency weighted cepstral distance is a truncated version of the left term in (11), it means that the quefrency weighted cepstral distance is a measure of the integrated distance between local spectrum slopes. Thus, the quefrency weighted cepstral distance has some similarity to the weighted slope metric proposed by Klatt [4], where critical band based spectral slopes were used.

We can define a combination form of the quefrency weighted and the Euclidean cepstral distance measure as follows:

$$d = a \sum_{i=1}^{p} \left(c_t(i) - c_r(i)\right)^2 + \sum_{i=1}^{p} i^2 \left(c_t(i) - c_r(i)\right)^2 = \sum_{i=1}^{p} \left(a + i^2\right) \left(c_t(i) - c_r(i)\right)^2. \tag{12}$$

This new measure can be made quite similar to the inverse variance weighted cepstral distance by appropriate choice of the value of a. One conclusion that can be drawn from the discussion above is that the weighted cepstral distance measure includes information about the distance between the local spectrum slopes, and it is effective for measuring the difference between two speech spectra.

B. Windowing in the Cepstral Domain

We have shown in Section III-C that as the number of cepstral coefficients used in the weighted cepstral distance measure increases, the recognition error rate decreases at first, and then increases as the number of the cepstral coefficients becomes larger than 8 (the order of the LPC analysis). It has been concluded that this result is due to weighting the higher order cepstral coefficients too much.

One practical and simple solution to avoid the undesirable effects caused by overly weighting the higher order coefficients is to use only the lower order cepstral coefficients (below 8th order) or to saturate the weights for the higher order cepstral coefficients as was done in Experiment III.

Another solution is to find an optimal weighting from a viewpoint which is different from that used for the inverse variance or the quefrency weighting. One reasonable and elegant view is to explain these weightings by looking at the windowing which occurs in the cepstral domain, i.e., so-called liftering [17].

In Fig. 11 several types of lifters (windows in the cepstral domain) are illustrated. A rectangular lifter is equivalent to using the first p cepstral coefficients without weighting. An asymmetric triangular lifter, shown in Fig. 11(b), is equivalent to the quefrency weighting. Since this lifter has the effect of looking like a differentiator in the log frequency domain, it produces noise in the higher quefrency components when the number of the cepstral coefficients p becomes large. Therefore, it cannot be a desirable lifter to obtain good speech spectrum features for recognition when p > 8.

Several symmetric lifters, such as the trapezoid, triangular, and raised sine lifter, are shown in Fig. 11(c)-(e). These lifters work as band-pass lifters with respect to quefrency. By using them, we eliminate both the lower and higher quefrency components, thereby reducing undesirable variability in the speech spectrum. In other words, we can select the most useful quefrency components. This is especially useful when we want to use a number of the cepstral coefficients larger than 8. It is noted that the saturated weighting used in Experiment III (Section III-C) is quite similar to the trapezoid lifter in Fig. 11(c).

[Fig. 11. Lifters in the weighted cepstral distance measures. (a) Rectangular, (b) asymmetric triangular, (c) trapezoid, (d) triangular, (e) raised sine. Horizontal axis: quefrency.]

V. SUMMARY

In this paper the weighted cepstral distance measure, with weighting coefficients set equal to the inverse variance of the cepstral coefficients, has been studied. Through several experiments using a speaker-independent, isolated word recognition system based upon template matching, the characteristics and the performance of the weighted cepstral distance measure have been studied. We summarize our experimental results and findings as follows.

1) The weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and the log likelihood ratio distance measure across two different databases, namely, 10 digits and 129-word airline vocabulary databases.

2) The most significant performance characteristic of the weighted cepstral distance is that it tended to significantly reduce the error rate variance across talkers.

3) The inverse variance weighting coefficients are data independent, since the cepstral coefficient variances, calculated from various kinds of database, do not show any significant difference which affects final speech recognition results.

4) The most important feature of the weighting is that it deweights the lower order cepstral coefficients, rather than weighting the higher order cepstral coefficients.

5) The statistically weighted cepstral distance measure with the inverse variance weight is characteristically quite similar to the quefrency weighted cepstral distance measure, which is an expression of the spectral slope distance measure arrived at from perceptual considerations.

6) With respect to the relationship between the number of cepstral coefficients and recognition rate, the recognition performance is best when the number of cepstral coefficients is equal to the LPC analysis order (i.e., p = 8). It is suggested that some of the band-pass lifters (proposed in Section IV) can provide improved performance in achieving an effective use of the cepstral coefficients extended beyond 8 [18].

ACKNOWLEDGMENT

The author wishes to thank L. R. Rabiner, F. K. Soong, and B. H. Juang for their valuable suggestions and stimulating discussions throughout this work. Thanks are also due to J. G. Wilpon for providing speech recognition software. Additionally, the author would like to thank J. L. Flanagan for inviting him to AT&T Bell Laboratories to do this work.

For further work on the weighted cepstral distance measures, the use of band-pass liftering in speech recognition has been discussed by Juang et al. [18], and application of the weighted cepstral distance measures for recognition of noisy speech has been studied by Hanson and Wakita [19].

REFERENCES

[1] F. Itakura and S. Saito, “An analysis-synthesis telephony based on maximum likelihood method,” in Proc. Int. Congr. Acoust., C-5-5, 1968.
[2] F.
Itakura, “Minimum prediction residual principle applied to speech recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-23, pp. 67-72, Feb. 1975.
[3] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, pp. 380-391, Oct. 1976.
[4] D. H. Klatt, “Prediction of perceived phonetic distance from critical band spectra: A first step,” in Proc. ICASSP 1982, vol. 2, May 1982, pp. 1278-1281.
[5] M. Sugiyama and K. Shikano, “Frequency weighted LPC spectral matching measures,” IECE Trans., vol. J65-A, pp. 965-972, Sept. 1982 (in Japanese).
[6] N. Nocerino, F. K. Soong, L. R. Rabiner, and D. H. Klatt, “Comparative study of several distortion measures for speech recognition,” in Proc. ICASSP 1985, vol. 1, Mar. 1985, pp. 25-28.
[7] B. S. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” J. Acoust. Soc. Amer., vol. 55, no. 6, pp. 1304-1312, June 1974.
[8] A. V. Oppenheim and R. W. Schafer, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[9] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, pp. 254-272, Apr. 1981.
[10] K. K. Paliwal, “On the performance of the quefrency-weighted cepstral coefficients in vowel recognition,” Speech Commun., vol. 1, pp. 151-154, May 1982.
[11] R. Nakatsu, H. Nagashima, J. Kojima, and N. Ishii, “A speech recognition method for telephone voice,” IECE Trans., vol. J66-D, pp. 377-384, Apr. 1983 (in Japanese).
[12] Y. Tohkura, “Speaker-independent recognition of isolated digits using a weighted cepstral distance,” J. Acoust. Soc. Amer., supp. 1, vol. 77, E13, Spring 1985.
[13] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, pp. 562-580, Apr. 1975.
[14] L. R. Rabiner and S. E. Levinson, “Isolated and connected word recognition-Theory and selected application,” IEEE Trans. Commun., vol. COM-29, pp. 621-659, May 1981.
[15] J. G. Wilpon, L. R. Rabiner, and A. Bergh, “Speaker-independent isolated word recognition using a 129-word airline vocabulary,” J. Acoust. Soc. Amer., vol. 72, no. 2, pp. 390-396, Aug. 1982.
[16] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon, “Speaker-independent recognition of isolated words using clustering techniques,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 336-349, Aug. 1979.
[17] B. P. Bogert, M. J. R. Healy, and J. W. Tukey, “The quefrency analysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking,” in Proc. Symp. Time Series Anal., M. Rosenblatt, Ed. New York: Wiley, 1963, pp. 209-243.
[18] B. H. Juang, L. R. Rabiner, and J. G. Wilpon, “On the use of bandpass liftering in speech recognition,” in Proc. ICASSP 1986, vol. 1, Apr. 1986, pp. 765-768.
[19] B. A. Hanson and H. Wakita, “Spectral slope based distortion measures for all-pole models of speech,” in Proc. ICASSP 1986, vol. 1, Apr. 1986, pp. 757-760.
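
ILLUSTRATIVE NOTE

The liftering view of Section IV can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: the function names and the symbols c_t and c_r (test and reference cepstral vectors holding coefficients 1 to p) are chosen here for illustration, and the raised sine form w(i) = 1 + (p/2) sin(pi i / p) is an assumed shape, since the exact lifter of Fig. 11(e) is not specified in the text. The rectangular and asymmetric triangular lifters follow the descriptions in Section IV-B, and the last function evaluates the right-hand side of (12).

```python
import math


def rectangular(p):
    """Fig. 11(a): unit weight on coefficients 1..p (no weighting)."""
    return [1.0] * p


def asymmetric_triangular(p):
    """Fig. 11(b): w(i) = i, equivalent to the quefrency weighting."""
    return [float(i) for i in range(1, p + 1)]


def raised_sine(p):
    """Assumed band-pass shape: w(i) = 1 + (p/2) * sin(pi * i / p)."""
    return [1.0 + 0.5 * p * math.sin(math.pi * i / p) for i in range(1, p + 1)]


def liftered_distance(c_t, c_r, w):
    """Squared distance between liftered cepstra: sum of (w(i) * diff(i))**2."""
    if not (len(c_t) == len(c_r) == len(w)):
        raise ValueError("c_t, c_r and w must have the same length")
    return sum((wi * (x - y)) ** 2 for wi, x, y in zip(w, c_t, c_r))


def eq12_distance(c_t, c_r, a):
    """Right-hand side of (12): sum of (a + i**2) * (c_t(i) - c_r(i))**2."""
    if len(c_t) != len(c_r):
        raise ValueError("c_t and c_r must have the same length")
    return sum((a + i * i) * (x - y) ** 2
               for i, (x, y) in enumerate(zip(c_t, c_r), start=1))
```

With the rectangular lifter, liftered_distance reduces to the Euclidean cepstral distance; with the asymmetric triangular lifter it equals eq12_distance with a = 0. For any a, eq12_distance is a times the former plus the latter, which is how the choice of a trades off the two measures.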