State mixture modelling applied to speech recognition

Dat Tran a,*, Michael Wagner a, Tongtao Zheng b

a Human-Computer Communication Laboratory, School of Computing, University of Canberra, Canberra, Australia
b School of Asian Languages & Studies, University of Tasmania, Launceston, TAS 7250, Australia

Pattern Recognition Letters 20 (1999) 1449-1456

Abstract

In state mixture modelling (SMM), the temporal structure of the observation sequences is represented by the state-joint probability distribution, where mixtures of states are considered. The technique is developed in an iterative scheme via maximum likelihood estimation. A fuzzy estimation approach is also introduced to work with the SMM. The new approach not only reduces the number of calculations from $2N^T T$ (direct calculation in the HMM) and $N^2 T$ (forward-backward algorithm) to only $2NT$, but also achieves a better recognition result. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Maximum likelihood estimation; Fuzzy estimation; Speech recognition

* Corresponding author. E-mail addresses: [email protected] (D. Tran), [email protected] (M. Wagner), Tongtao.Zheng@utas.edu.au (T. Zheng).

1. Introduction

Let $O = o_1, o_2, \ldots, o_T$ be an observation sequence of a spoken word and let $\lambda$ denote a model parameter set. The basic problem in speech modelling is how to compute $P(O \mid \lambda)$, the probability of the observation sequence $O$ given the model $\lambda$, efficiently. The simplest solution is to use a statistical independence assumption between observations only: $P(O \mid \lambda)$ is computed as the product of the probabilities of the individual observations. The computations are simple and, when the observations are continuous vectors, probability density functions can be used to model the speech data. The disadvantage is that the temporal structure of the observation sequence is not taken into account. An application is Gaussian mixture modelling for speaker recognition (Reynolds, 1992).

To overcome this disadvantage, a better solution, used in the hidden Markov model (HMM), is to introduce hidden state variables modelled as a Markov chain. Observations are statistically independent of one another but dependent on the states. The temporal structure of the observation sequence is represented by the Markov chain, in which the state variables take a finite number of values and the state-transition probabilities are assumed to be time invariant (Rabiner and Juang, 1986). The complete parameter set of the HMM is $\lambda = \{\pi, A, B\}$, where $\pi$ is the initial state distribution, $A$ the state-transition probability distribution and $B$ the observation symbol probability distribution. Although the application of the HMM to speech recognition has been a success, computations in the HMM are relatively complicated. Consider the cost of evaluating $P(O \mid \lambda)$: if an utterance consists of $T$ acoustic vectors, an $N$-state HMM requires on the order of $2N^T T$ calculations. A more efficient algorithm, the forward-backward algorithm, is therefore used to reduce the number of calculations. Even so, a scaling procedure is still required, since in the re-estimation procedure of the HMM, for sufficiently large $T$, the dynamic range of the computations exceeds the precision range of essentially any machine (Rabiner and Juang, 1993, p. 365).
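To give a sense of scale for these counts, the short calculation below compares the three operation counts for a 6-state model (the number of states used in the experiments of Section 5) and an assumed utterance length of T = 100 frames; both values are illustrative only and are not taken from the paper.

```python
# Rough operation counts for evaluating P(O | lambda); N and T are illustrative.
N, T = 6, 100

direct = 2 * N**T * T          # direct summation over all N**T state sequences
forward_backward = N**2 * T    # forward-backward algorithm
smm = 2 * N * T                # state mixture model evaluation (Section 3)

print(f"direct sum:       ~{float(direct):.2e} operations")
print(f"forward-backward: {forward_backward} operations")
print(f"SMM:              {smm} operations")
```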
The Markov chain used in the general HMM is called ergodic. A property obtained from Markov chain theory is the existence of a steady-state probability distribution: given an initial probability distribution, the state probability distribution of such a Markov chain tends to the steady-state distribution after a while, that is, it no longer changes in time (Kulkarni, 1995). The product of the state-transition probability and the state probability is the state-joint probability; therefore, under the time-invariance assumption on the state-transition probabilities, the state-joint probabilities are also assumed to be time invariant. An alternative way of applying Markov chain theory to speech modelling is to use the state-joint probability distribution, instead of the state-transition probability distribution, to represent the temporal structure of the observation sequence. In this approach, states should be considered as mixtures of states rather than as state sequences. In general, a state transition is made from a mixture of states at the previous time to a new mixture of states at the current time. The complete parameter set of a state mixture model (SMM) is $\lambda = \{\pi, U, B\}$, where $U$ is the state-joint probability distribution. Computations in the SMM are simple; the forward-backward algorithm and the scaling procedure are not required. The parameter re-estimation methods of the HMM are directly applicable to the SMM and, moreover, a fuzzy re-estimation method is also proposed.

The paper is organised as follows. The next section reviews HMM theory for comparison with the SMM theory in Section 3. A fuzzy approach to the SMM is proposed in Section 4. Section 5 reports experimental results of applying HMMs and SMMs to speech recognition.

2. Hidden Markov models

2.1. The evaluation problem

Consider an HMM for discrete symbol observations. Such a model is characterised by a set of $N$ states, a sequence of observations $O = \{o_1, o_2, \ldots, o_T\}$ and a set of observation symbols $V = \{v_1, v_2, \ldots, v_M\}$. The probability $P(O \mid \lambda)$ is computed as

$$P(O \mid \lambda) = \sum_{\text{all } S} P(O, S \mid \lambda) = \sum_{\text{all } S} P(O \mid S, \lambda)\, P(S \mid \lambda), \qquad (1)$$

where $S = \{s_1, s_2, \ldots, s_T\}$ denotes a state sequence. Applying the statistical independence assumption to $P(O \mid S, \lambda)$ and the Markov assumption to $P(S \mid \lambda)$, we obtain

$$P(O \mid S, \lambda) = \prod_{t=1}^{T} P(o_t \mid s_t, s_{t-1}, \ldots, s_1, \lambda) = \prod_{t=1}^{T} P(o_t \mid s_t, \lambda), \qquad (2)$$

$$P(S \mid \lambda) = P(s_1) \prod_{t=1}^{T-1} P(s_{t+1} \mid s_t, \ldots, s_2, s_1) = P(s_1) \prod_{t=1}^{T-1} P(s_{t+1} \mid s_t). \qquad (3)$$

Denoting $\pi_{s_1} = P(s_1 \mid \lambda)$, $a_{s_t s_{t+1}} = P(s_{t+1} \mid s_t, \lambda)$ and $b_{s_t}(o_t) = P(o_t \mid s_t, \lambda)$, from (1)-(3) we have

$$P(O \mid \lambda) = \sum_{\text{all } S} \prod_{t=0}^{T-1} a_{s_t s_{t+1}} b_{s_{t+1}}(o_{t+1}), \qquad (4)$$

where $a_{s_0 s_1}$ denotes $\pi_{s_1}$ for simplicity (Huang et al., 1990). A compact notation $\lambda = \{\pi, A, B\}$ is proposed to indicate the complete parameter set of this model, where

1. $\pi = \{\pi_i\}$, $\pi_i = P(s_1 = i \mid \lambda)$, $1 \le i \le N$: the initial state distribution;
2. $A = \{a_{ij}\}$, $a_{ij} = P(s_{t+1} = j \mid s_t = i, \lambda)$, $1 \le i, j \le N$, $1 \le t \le T-1$: the state-transition probability distribution, $a_{ij}$ denoting the probability of a transition from state $i$ at time $t$ to state $j$ at time $t+1$; and
3. $B = \{b_j(k)\}$, $b_j(k) = P(o_t = v_k \mid s_t = j, \lambda)$, $1 \le j \le N$, $1 \le k \le M$, $1 \le t \le T$: the observation symbol probability distribution, $b_j(k)$ denoting the probability of generating the symbol $o_t = v_k$ in state $j$.
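As a concrete illustration of Eq. (4), the sketch below evaluates $P(O \mid \lambda)$ for a tiny discrete HMM by brute-force summation over every state sequence. The toy parameter values and the Python formulation are assumptions made for this example rather than anything taken from the paper; the purpose is only to show that the direct sum enumerates all $N^T$ sequences, which motivates the forward-backward algorithm discussed next.

```python
# Brute-force evaluation of P(O | lambda) for a discrete HMM, following Eq. (4):
# sum over every possible state sequence. Exponential in T (order 2 N^T T),
# so only usable for tiny toy models. All parameter values here are assumed.
from itertools import product
import numpy as np

pi = np.array([0.6, 0.4])                      # initial state distribution
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])                    # state-transition probabilities a_ij
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])               # symbol probabilities b_j(k)
O  = [0, 2, 1, 2]                              # observation sequence (symbol indices)

N, T = len(pi), len(O)
p_direct = 0.0
for S in product(range(N), repeat=T):          # all N**T state sequences
    p = pi[S[0]] * B[S[0], O[0]]
    for t in range(1, T):
        p *= A[S[t - 1], S[t]] * B[S[t], O[t]]
    p_direct += p
print("P(O | lambda) by direct summation:", p_direct)
```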
To interpret the computation, (4) can be rewritten as

$$P(O \mid \lambda) = \sum_{s_1 \ldots s_T} \pi_{s_1} b_{s_1}(o_1)\, a_{s_1 s_2} b_{s_2}(o_2) \cdots a_{s_{T-1} s_T} b_{s_T}(o_T). \qquad (5)$$

At time $t = 1$ we are in state $s_1$ with probability $\pi_{s_1}$, generating the symbol $o_1$ with probability $b_{s_1}(o_1)$. A transition is then made from the initial state $s_1$ to state $s_2$ at time $t = 2$ with transition probability $a_{s_1 s_2}$, generating the symbol $o_2$ with probability $b_{s_2}(o_2)$. This process continues until the last transition is made from state $s_{T-1}$ to state $s_T$ at time $t = T$ with transition probability $a_{s_{T-1} s_T}$, generating the symbol $o_T$ with probability $b_{s_T}(o_T)$ (Rabiner and Juang, 1993, p. 335). According to its direct definition, (5) involves on the order of $2N^T T$ calculations, so a more efficient algorithm, the forward-backward algorithm, is required. The forward variable is defined as

$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, s_t = i \mid \lambda), \qquad (6)$$

where

$$\alpha_1(i) = \pi_i b_i(o_1), \qquad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(o_{t+1}), \quad 1 \le t \le T-1. \qquad (7)$$

The backward variable is defined as

$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid s_t = i, \lambda), \qquad (8)$$

where

$$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij} b_j(o_{t+1})\, \beta_{t+1}(j), \quad 1 \le t \le T-1. \qquad (9)$$

Using the forward and backward variables, (5) is written as

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) = \sum_{i=1}^{N} \pi_i b_i(o_1)\, \beta_1(i) = \sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i). \qquad (10)$$

The computation of these variables requires on the order of $N^2 T$ calculations, rather than the $2N^T T$ required by the direct calculation.

2.2. The estimation problem

The most difficult problem in the HMM is how to adjust the model parameters $\lambda$ to maximise $P(O \mid \lambda)$. The iterative algorithm known as the Baum-Welch algorithm is used to solve this problem. We first define $\xi_t(i, j)$, the probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$, given the model and the observation sequence,

$$\xi_t(i, j) = P(s_t = i, s_{t+1} = j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}, \qquad (11)$$

and $\gamma_t(i)$, the probability of being in state $i$ at time $t$, given the model and the observation sequence,

$$\gamma_t(i) = P(s_t = i \mid O, \lambda) = \sum_{j=1}^{N} \xi_t(i, j), \quad 1 \le t \le T-1. \qquad (12)$$

The model parameters $\bar{\lambda} = \{\bar{\pi}, \bar{A}, \bar{B}\}$ are determined as follows:

$$\bar{\pi}_i = \gamma_1(i), \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad \bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}. \qquad (13)$$

Proofs of the Baum-Welch re-estimation algorithm can be found in the literature (e.g., Rabiner and Juang, 1993; Huang et al., 1990).
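The sketch below implements the forward-backward recursions and one Baum-Welch re-estimation pass, Eqs. (6)-(13), for the discrete case. It is a minimal illustration of the standard formulation cited above rather than the authors' code; the scaling procedure mentioned in Section 1 is omitted, so it is only suitable for short sequences. The toy parameters from the previous sketch can be used as input.

```python
# Forward-backward evaluation, Eqs. (6)-(10), and one Baum-Welch step, Eqs. (11)-(13).
# A minimal sketch of the standard formulation; scaling is deliberately omitted.
import numpy as np

def forward_backward(pi, A, B, O):
    N, T = len(pi), len(O)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                          # Eq. (7), initialisation
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]  # Eq. (7), induction
    beta[T - 1] = 1.0                                   # Eq. (9), initialisation
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])    # Eq. (9), induction
    return alpha, beta, alpha[T - 1].sum()              # P(O | lambda), Eq. (10)

def baum_welch_step(pi, A, B, O):
    N, T = len(pi), len(O)
    alpha, beta, pO = forward_backward(pi, A, B, O)
    xi = np.zeros((T - 1, N, N))                        # Eq. (11)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, O[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    gamma = np.vstack([xi.sum(axis=2),                  # Eq. (12)
                       alpha[-1:] * beta[-1:] / pO])    # last frame from alpha*beta
    new_pi = gamma[0]                                   # Eq. (13)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(O) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```

Repeated calls to baum_welch_step do not decrease P(O | lambda), which is the property the SMM re-estimation in Section 3.2 mirrors.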
3. State mixture models

3.1. The evaluation problem

Consider an SMM for discrete symbol observations, where the states, the observation sequence and the observation symbols are characterised as in the HMM. Let $P(s_t \mid \lambda)$ denote the probability of being in state $s_t$ at time $t$. Applying the statistical independence assumption between observations, we obtain

$$P(O \mid \lambda) = P(o_1 \mid \lambda) \prod_{t=1}^{T-1} P(o_{t+1} \mid \lambda), \qquad (14)$$

since

$$P(o_1 \mid \lambda) = \sum_{\text{all } s_1} P(o_1, s_1 \mid \lambda) = \sum_{\text{all } s_1} P(s_1 \mid \lambda)\, P(o_1 \mid s_1, \lambda), \qquad (15)$$

$$P(o_{t+1} \mid \lambda) = \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} P(o_{t+1}, s_t, s_{t+1} \mid \lambda) = \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} P(s_t, s_{t+1} \mid \lambda)\, P(o_{t+1} \mid s_{t+1}, \lambda). \qquad (16)$$

Denoting $\pi_{s_1} = P(s_1 \mid \lambda)$, $u_{s_t s_{t+1}} = P(s_t, s_{t+1} \mid \lambda)$ and $b_{s_t}(o_t) = P(o_t \mid s_t, \lambda)$, we obtain

$$P(O \mid \lambda) = \prod_{t=0}^{T-1} \Big( \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} u_{s_t s_{t+1}} b_{s_{t+1}}(o_{t+1}) \Big), \qquad (17)$$

where $\sum_{\text{all } s_0} u_{s_0 s_1}$ denotes $\pi_{s_1}$ for simplicity. Since the summations are over all values of $s_t$ and $s_{t+1}$, $1 \le s_t, s_{t+1} \le N$, and assuming the state-joint probability is independent of time, after taking the logarithm we can write

$$\log P(O \mid \lambda) = \sum_{t=0}^{T-1} \log \Big( \sum_{i=1}^{N} \sum_{j=1}^{N} u_{ij} b_j(o_{t+1}) \Big). \qquad (18)$$

A compact notation $\lambda = \{\pi, U, B\}$ is proposed to indicate the complete parameter set of this model, where $\pi$ and $B$ are the same as in the HMM and $U = \{u_{ij}\}$ is the state-joint probability distribution, denoting the joint probability of state $i$ at time $t$ and state $j$ at time $t+1$. We have $u_{ij} = P(s_t = i, s_{t+1} = j \mid \lambda)$, $1 \le i, j \le N$, $1 \le t \le T-1$.

To interpret the computation, since $P(s_t, s_{t+1} \mid \lambda) = P(s_{t+1} \mid s_t, \lambda)\, P(s_t \mid \lambda)$ and denoting $P(s_t \mid \lambda) = \gamma_{s_t}$ as the probability of being in state $s_t$, (17) can be rewritten as

$$P(O \mid \lambda) = \sum_{s_1 \ldots s_T} \pi_{s_1} b_{s_1}(o_1) \Big[ \sum_{i_1} \gamma_{i_1} a_{i_1 s_2} b_{s_2}(o_2) \Big] \cdots \Big[ \sum_{i_{T-1}} \gamma_{i_{T-1}} a_{i_{T-1} s_T} b_{s_T}(o_T) \Big]. \qquad (19)$$

From (19), at time $t = 1$ we are in state $s_1$ with probability $\pi_{s_1}$, generating the symbol $o_1$ with probability $b_{s_1}(o_1)$. A transition is then made to state $s_2$ at time $t = 2$ from a mixture of states $i_1$, where $i_1 = 1, \ldots, N$ (initial states at time $t = 1$), weighted by $\gamma_{i_1}$, with transition probability $a_{i_1 s_2}$, generating the symbol $o_2$ with probability $b_{s_2}(o_2)$. This process continues until the last transition is made to state $s_T$ at time $t = T$ from a mixture of states $i_{T-1}$, where $i_{T-1} = 1, \ldots, N$, weighted by $\gamma_{i_{T-1}}$, with transition probability $a_{i_{T-1} s_T}$, generating the symbol $o_T$ with probability $b_{s_T}(o_T)$. It can be seen that the computation involved in (18) requires on the order of $2NT$ calculations, rather than the $N^2 T$ calculations required in the HMM.

3.2. The estimation problem

We first define $\eta_t(i, j)$, the probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$, given the model and the observation at time $t+1$,

$$\eta_t(i, j) = P(s_t = i, s_{t+1} = j \mid o_{t+1}, \lambda) = \frac{u_{ij}\, b_j(o_{t+1})}{\sum_{k=1}^{N} \sum_{l=1}^{N} u_{kl}\, b_l(o_{t+1})}, \qquad (20)$$

and $u_t(j)$, the probability of being in state $j$ at time $t$, given the model and the observation at time $t$,

$$u_t(j) = P(s_t = j \mid o_t, \lambda) = \sum_{i=1}^{N} \eta_{t-1}(i, j). \qquad (21)$$

The model parameters $\bar{\lambda} = \{\bar{\pi}, \bar{U}, \bar{B}\}$ are determined as follows:

$$\bar{\pi}_j = u_1(j), \qquad \bar{u}_{ij} = \frac{1}{T-1} \sum_{t=1}^{T-1} \eta_t(i, j), \qquad \bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} u_t(j)}{\sum_{t=1}^{T} u_t(j)}. \qquad (22)$$

They can be applied directly, without the scaling procedure, since (20) does not contain products over time as in the HMM. The re-estimation formulas in (22) can be proven by maximising the following function:

$$Q(\lambda, \bar{\lambda}) = \sum_{t} \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} \frac{P(s_t, s_{t+1}, o_{t+1} \mid \lambda)}{P(o_{t+1} \mid \lambda)}\, \log P(s_t, s_{t+1}, o_{t+1} \mid \bar{\lambda}). \qquad (23)$$

It can be shown that if $Q(\lambda, \bar{\lambda}) \ge Q(\lambda, \lambda)$ then $P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$. Indeed, it follows that

$$\log \frac{P(o_{t+1} \mid \bar{\lambda})}{P(o_{t+1} \mid \lambda)} = \log \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} \frac{P(s_t, s_{t+1}, o_{t+1} \mid \bar{\lambda})}{P(o_{t+1} \mid \lambda)} = \log \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} \frac{P(s_t, s_{t+1}, o_{t+1} \mid \lambda)}{P(o_{t+1} \mid \lambda)} \cdot \frac{P(s_t, s_{t+1}, o_{t+1} \mid \bar{\lambda})}{P(s_t, s_{t+1}, o_{t+1} \mid \lambda)} \ge \sum_{\text{all } s_t} \sum_{\text{all } s_{t+1}} \frac{P(s_t, s_{t+1}, o_{t+1} \mid \lambda)}{P(o_{t+1} \mid \lambda)}\, \log \frac{P(s_t, s_{t+1}, o_{t+1} \mid \bar{\lambda})}{P(s_t, s_{t+1}, o_{t+1} \mid \lambda)}. \qquad (24)$$

Summing this inequality over $t$ gives

$$\log \frac{P(O \mid \bar{\lambda})}{P(O \mid \lambda)} \ge Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda). \qquad (25)$$

Since $P(s_t, s_{t+1}, o_{t+1} \mid \lambda) = P(s_t, s_{t+1} \mid \lambda)\, P(o_{t+1} \mid s_{t+1}, \lambda) = u_{s_t s_{t+1}} b_{s_{t+1}}(o_{t+1})$, and noting that $\pi_{s_1}$ is denoted by $\sum_{\text{all } s_0} u_{s_0 s_1}$ for simplicity, we can regroup (23) into three terms as follows:

$$Q(\lambda, \bar{\lambda}) = \sum_{j=1}^{N} u_1(j) \log \bar{\pi}_j + \sum_{i=1}^{N} \sum_{j=1}^{N} \Big( \sum_{t=1}^{T-1} \eta_t(i, j) \Big) \log \bar{u}_{ij} + \sum_{j=1}^{N} \sum_{k=1}^{M} \Big( \sum_{t:\, o_t = v_k} u_t(j) \Big) \log \bar{b}_j(k). \qquad (26)$$

Applying the Lagrange multiplier method and using the constraints

$$\sum_{j=1}^{N} \bar{\pi}_j = 1, \qquad \sum_{i=1}^{N} \sum_{j=1}^{N} \bar{u}_{ij} = 1, \qquad \sum_{k=1}^{M} \bar{b}_j(k) = 1, \qquad (27)$$

we can show that $Q(\lambda, \bar{\lambda})$ is maximised with the model parameters $\bar{\lambda} = \{\bar{\pi}, \bar{U}, \bar{B}\}$ as determined in (22).
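As a concrete illustration of Eqs. (18) and (20)-(22), the sketch below evaluates the SMM log-likelihood and performs one re-estimation pass. It is a minimal reading of the formulas under assumed toy parameters, not the authors' implementation; in particular, the state posterior at t = 1 is taken as the normalised product pi_j b_j(o_1), which is one way to realise u_1(j) and is an assumption of this sketch.

```python
# SMM log-likelihood, Eq. (18), and one re-estimation pass, Eqs. (20)-(22).
# A minimal sketch; U = {u_ij} is the state-joint probability distribution.
import numpy as np

def smm_log_likelihood(pi, U, B, O):
    # Eq. (18): the t = 0 term uses pi in place of sum over the dummy state s_0.
    ll = np.log(pi @ B[:, O[0]])
    for t in range(1, len(O)):
        ll += np.log((U * B[:, O[t]][None, :]).sum())   # sum_ij u_ij b_j(o_{t+1})
    return ll

def smm_reestimate(pi, U, B, O):
    N, T = len(pi), len(O)
    eta = np.zeros((T - 1, N, N))
    for t in range(T - 1):                              # Eq. (20)
        eta[t] = U * B[:, O[t + 1]][None, :]
        eta[t] /= eta[t].sum()
    u_t = np.vstack([(pi * B[:, O[0]]) / (pi @ B[:, O[0]]),  # assumed posterior at t = 1
                     eta.sum(axis=1)])                  # Eq. (21): u_t(j) = sum_i eta_{t-1}(i, j)
    new_pi = u_t[0]                                     # Eq. (22)
    new_U = eta.mean(axis=0)
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = u_t[np.array(O) == k].sum(axis=0) / u_t.sum(axis=0)
    return new_pi, new_U, new_B
```

Each frame contributes a single sum over the N x N state-joint terms, which is where the 2NT operation count quoted above comes from.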
4. A proposed fuzzy approach to state mixture modelling

So far the SMM has been considered in an iterative scheme via maximum likelihood estimation. An iterative scheme for the SMM via fuzzy estimation is proposed in this section. It is based on the fuzzy c-means (FCM) clustering method, the most widely used approach, in both theory and practical applications, of fuzzy clustering techniques to unsupervised classification. The FCM algorithms minimise the FCM functionals, in which fuzzy mean vectors are iteratively updated. Gustafson and Kessel (1979) proposed a modification of the FCM algorithms in which fuzzy covariance matrices of clusters are defined. Recently, an alternative modification has been proposed (Tran et al., 1998) in which fuzzy mixture weights (prior probabilities) of clusters are defined. The infinite family of FCM functionals generalised by Bezdek (1987) is

$$J_F = \sum_{t=1}^{T} \sum_{i=1}^{c} m_{it}^{F} d_{it}^{2}, \qquad (28)$$

where $m_{it}$ represents the degree to which sample $x_t$ belongs to the $i$th cluster and satisfies $0 \le m_{it} \le 1$, $\sum_{i=1}^{c} m_{it} = 1$; $F \ge 1$ is a weight exponent on each fuzzy membership $m_{it}$, called the degree of fuzziness; and $d_{it} = d(x_t, \mu_i)$ is the distance from $x_t$ to the mean vector $\mu_i$, used as a measure of dissimilarity. Minimising the fuzzy objective function $J_F$ gives

$$m_{it} = \Big\{ \sum_{k=1}^{c} \big[ d(x_t, \mu_i) / d(x_t, \mu_k) \big]^{2/(F-1)} \Big\}^{-1}, \qquad \mu_i = \frac{\sum_{t=1}^{T} m_{it}^{F} x_t}{\sum_{t=1}^{T} m_{it}^{F}}. \qquad (29)$$

For the SMM, since there are two state variables, at time $t$ and at time $t+1$, the distance is defined as

$$d_{ijt}^{2} = -\log P(s_t = i, s_{t+1} = j, o_{t+1} \mid \lambda). \qquad (30)$$

An infinite family of functionals for the SMM is proposed as

$$J_F = \sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} m_{ijt}^{F} d_{ijt}^{2} = -\sum_{t=0}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} m_{ijt}^{F} \log P(s_t = i, s_{t+1} = j, o_{t+1} \mid \lambda), \qquad (31)$$

where

$$0 \le m_{ijt} \le 1, \qquad \sum_{i=1}^{N} \sum_{j=1}^{N} m_{ijt} = 1. \qquad (32)$$

Regrouping (31) into three terms as in (26) and then applying the Lagrange multiplier method with the constraints in (27), the fuzzy re-estimation formulas for the SMM are

$$\bar{\pi}_j = \frac{\sum_{i=1}^{N} m_{ij0}^{F}}{\sum_{i=1}^{N} \sum_{j=1}^{N} m_{ij0}^{F}}, \qquad \bar{u}_{ij} = \frac{\sum_{t=1}^{T-1} m_{ijt}^{F}}{\sum_{t=1}^{T-1} \sum_{i=1}^{N} \sum_{j=1}^{N} m_{ijt}^{F}}, \qquad \bar{b}_j(k) = \frac{\sum_{t:\, o_t = v_k} \sum_{i=1}^{N} m_{ij(t-1)}^{F}}{\sum_{t=1}^{T} \sum_{i=1}^{N} m_{ij(t-1)}^{F}}, \qquad (33)$$

where

$$m_{ijt} = \Big\{ \sum_{k=1}^{N} \sum_{l=1}^{N} \big( d_{ijt} / d_{klt} \big)^{2/(F-1)} \Big\}^{-1}. \qquad (34)$$

It can be shown that, as $F$ tends to 1, the fuzzy estimation formulas in (33) approach the maximum likelihood estimation formulas in (22).
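The following sketch turns Eqs. (30)-(34) into one fuzzy estimation pass for a discrete SMM. It is an illustrative reading rather than the authors' implementation: how the t = 0 joint probability distributes pi over the dummy initial mixture state is not spelled out in the formulas above, so the uniform split used here is an assumption, as is the value F = 1.2 (chosen inside the range [1.05, 1.25] used in the experiments below).

```python
# One fuzzy estimation pass for a discrete SMM, Eqs. (30)-(34). A sketch only:
# the uniform split of pi over the dummy initial state at t = 0 is an assumption.
import numpy as np

def fuzzy_smm_step(pi, U, B, O, F=1.2):
    N, T = len(pi), len(O)
    # P(s_t = i, s_{t+1} = j, o_{t+1} | lambda) for t = 0, ..., T-1, following Eq. (17)
    joint = np.zeros((T, N, N))
    joint[0] = (pi * B[:, O[0]])[None, :] / N        # assumed uniform split over i
    for t in range(1, T):
        joint[t] = U * B[:, O[t]][None, :]           # u_ij b_j(o_{t+1})
    d2 = -np.log(joint)                              # squared distances, Eq. (30)
    r = d2 ** (1.0 / (F - 1.0))                      # d_ijt^(2/(F-1))
    m = (1.0 / r) / (1.0 / r).sum(axis=(1, 2), keepdims=True)  # memberships, Eq. (34)
    mF = m ** F
    new_pi = mF[0].sum(axis=0) / mF[0].sum()         # Eq. (33)
    new_U = mF[1:].sum(axis=0) / mF[1:].sum()
    new_B = np.zeros_like(B)
    obs = np.array(O)
    for k in range(B.shape[1]):
        new_B[:, k] = mF[obs == k].sum(axis=(0, 1)) / mF.sum(axis=(0, 1))
    return new_pi, new_U, new_B
```

As F approaches 1, the memberships concentrate on the closest state pair and the updates approach the maximum likelihood formulas in (22), consistent with the remark above.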
5. Experimental results

5.1. The TI46 speech database

This corpus of speech was designed and collected at Texas Instruments (TI). It contains 16 speakers, 8 female and 8 male, labelled f1-f8 and m1-m8, respectively. There are 46 words per speaker: the ten digits from 0 to 9, the 26 letters from a to z and ten command words: enter, erase, go, help, no, rubout, repeat, stop, start and yes. Each speaker repeated the words ten times in a single training session, and then twice in each of 8 later testing sessions. The corpus was sampled at 12 500 samples per second and 12 bits per sample. The data were processed in 20.48 ms frames (256 samples) at a frame rate of 125 frames per second (100-sample shift). Frames were Hamming windowed and pre-emphasised with a coefficient of 0.9. For each frame, 46 mel-spectral bands of width 110 mel and 20 mel-frequency cepstral coefficients (MFCC) were determined (Wagner, 1996).

5.2. Speech recognition

Experiments comparing SMMs and HMMs were carried out using the TI46 speech data corpus. Four subsets were used for isolated word recognition in speaker-dependent mode: the E set, comprising the 9 letters b, c, d, e, g, p, t, v and z; the 10-digit set; the 10-command set; and the 46-word set. Both HMMs and SMMs employ the same codebook, obtained using the standard LBG algorithm proposed by Linde et al. (1980). The experimental results for 6-state HMMs and 6-state SMMs are given in Table 1. SMMs performed better than HMMs on the E set, by 5.08 percentage points. The improvement obtained by applying fuzzy estimation to SMMs is shown in Table 2. For the E set, the dependence of the recognition error rate on the codebook size is shown in Fig. 1.

Table 1
Recognition error rates (%) for HMMs and SMMs, codebook size 128

Subset            Discrete HMMs   Discrete SMMs
E set             33.98           28.90
10-digit set      0.39            0.37
10-command set    1.65            1.37

Table 2
Recognition error rates (%) for SMMs and fuzzy SMMs, F in [1.05, 1.25]

Subset            SMMs    Fuzzy SMMs
E set             28.90   28.21
10-digit set      0.37    0.21
10-command set    1.37    1.14

Fig. 1. Plots of recognition error rates (%) versus codebook size on the E set.

6. Conclusion

Some of the theoretical and practical issues of SMMs have been presented in this paper. With the steady-state distribution assumption, the state-joint distribution has been proposed in order to apply Markov chain theory to speech modelling in an alternative way. The forward-backward algorithm and the scaling procedure are not required in this approach. The optimisation methods used for HMMs are directly applicable to SMMs. The SMM is a new approach and deserves further study. There are two reasons why the SMM and the fuzzy approach achieve better recognition results. Firstly, the SMM is able to select the most likely candidate from a wider range of mixed units, while the HMM can only select a limited number of strings from the next state and may therefore sometimes miss the most likely candidate. Secondly, the fuzzy approach defines the search space better and gives each candidate a more appropriate degree of belonging, while the non-fuzzy approach simply rejects or accepts a candidate without considering its real nature, and therefore reduces the available number of candidates. Finally, the simple computations of SMMs are an advantage for implementation.

For further reading, see (Baum, 1972; Levinson et al., 1983; Rabiner, 1989; Upper, 1997; Bezdek and Pal, 1992; Duda and Hart, 1973; Dunn, 1974; Juang, 1985).

Discussion

Kamel: You mentioned that the SMM is simpler and needs fewer calculations. Can you give us an idea how much less this is in terms of time?

Zheng: Our approach uses 2NT calculations. With the HMM you need 2N^T T, so this is a huge difference. When you use the forward-backward algorithm with the HMM, you need N^2 T. This is why a lot of people do not want to use the HMM, because of the huge calculation costs.

Sagerer: For speech recognition with HMMs there exist very efficient searches, such as the Viterbi forward calculation. The forward-backward algorithm is mainly relevant if you are really training, adapting the parameters. I do not understand why you introduce this scheme, because you can use an efficient algorithm for the search with HMMs. Time is not a serious problem; you can run HMMs for lexicons of, say, more than 2000 words in real time.

Zheng: 2000 words is very small.

Sagerer: But you are dealing with six states and a very small lexicon.

Zheng: No, this is only for testing purposes.
While you are training you have many more words. Our method is not only more efficient, the results are also better. With the forward-backward algorithm you only search a string, but the last string might not be the best choice.

References

Baum, L.E., 1972. An inequality and associated maximisation technique in statistical estimation for probabilistic functions of a Markov process. Inequalities 3, 1-8.
Bezdek, J.C., 1987. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York.
Bezdek, J.C., Pal, S.K., 1992. Fuzzy Models for Pattern Recognition. IEEE Press, New York.
Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Dunn, J., 1974. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J. Cybernetics 3, 32-57.
Gustafson, D.E., Kessel, W., 1979. Fuzzy clustering with a fuzzy covariance matrix. In: Fu, K.S. (Ed.), Proc. IEEE-CDC, Vol. 2. IEEE Press, Piscataway, NJ, pp. 761-766.
Huang, X.D., Ariki, Y., Jack, M.A., 1990. Hidden Markov Models for Speech Recognition. Edinburgh University Press.
Juang, B.H., 1985. Maximum likelihood estimation for multivariate observations of Markov sources. AT&T Technical J. 64, 1235-1239.
Kulkarni, V.G., 1995. Modeling and Analysis of Stochastic Systems. Chapman & Hall, UK.
Levinson, S.E., Rabiner, L.R., Sondhi, M.M., 1983. An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition. The Bell System Technical J. 62 (4), 1035-1074.
Linde, Y., Buzo, A., Gray, R.M., 1980. An algorithm for vector quantisation. IEEE Trans. Comm. 28, 84-95.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77 (2), 257-286.
Rabiner, L.R., Juang, B.H., 1986. An introduction to hidden Markov models. IEEE Acoust. Speech Signal Process. Soc. Mag. 3 (1), 4-16.
Rabiner, L.R., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice-Hall PTR, Englewood Cliffs, NJ.
Reynolds, D.A., 1992. A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. thesis, Georgia Institute of Technology, USA.
Tran, D., Le, T.V., Wagner, M., 1998. Fuzzy Gaussian mixture models for speaker recognition. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP98), Sydney, Australia, Vol. 3, pp. 759-762.
Upper, D.R., 1997. Theory and algorithms for hidden Markov models and generalised hidden Markov models. Ph.D. thesis in Mathematics, University of California at Berkeley.
Wagner, M., 1996. Combined speech-recognition/speaker-verification system with modest training requirements. In: Proceedings of the 6th Australian International Conference on Speech Science and Technology, Adelaide, Australia, pp. 139-143.