INTERSPEECH 2004 - ICSLP
8th International Conference on Spoken
Language Processing
ICC Jeju, Jeju Island, Korea
October 4-8, 2004
ISCA Archive
http://www.isca-speech.org/archive
Speech Enhanced Multi-span Language Model
A. Nayeemulla Khan and B. Yegnanarayana
Speech and Vision Laboratory
Department of Computer Science and Engineering
Indian Institute of Technology Madras, Chennai - 600 036, India
Email: nayeem,yegna@cs.iitm.ernet.in
10.21437/Interspeech.2004-496
Abstract
To capture local and global constraints in a language, statistical n-grams are used in combination with multi-span language models for improved language modelling. The use of latent semantic analysis (LSA) to capture global semantic constraints, and of bigram models to capture local constraints, has been shown to reduce the perplexity of the model. In this paper we propose a method in which the multi-span LSA language model is developed from the speech signal. Reference pattern vectors are derived from the speech signal for each word in the vocabulary. The LSA model is then built from the normalised distances between the reference word pattern vectors and the pattern vector of each word in the training data. We show that this model, in combination with a standard bigram model, performs better than the conventional bigram + LSA model. The results are demonstrated for a limited vocabulary on a database for the Indian language Tamil.
1. Introduction
In every language there exist dependencies in the usage of words, which could be syntactic, semantic or pragmatic. Local constraints are captured by means of statistical n-gram models. n-gram models are unable to predict long range dependencies, as this requires a large value of n, making the parameter estimates of the model unreliable given the limited training data available. To model long range dependencies, equivalence classes on the n-gram history [1] and structured language models [2] are useful for limited domains; in less constrained domains they are not as useful. Trigger based language models [3] are also potential ways in which long range dependencies can be captured, but trigger pair selection is a complex task, with different pairs displaying different behaviours. The use of latent semantic analysis to capture long range dependencies has been shown to be effective; in combination with n-gram models it results in a substantial reduction in perplexity [4][5]. Conventional language models use no knowledge of the language: the data being modelled could as well be a sequence of arbitrary symbols. It is essential to use the knowledge sources available to enhance the performance of statistical language models.
One application of statistical language models is in speech recognition. The use of speech knowledge, prosodic constraints, and large span semantic and local syntactic constraints, when integrated with the speech recogniser, would improve its performance. In this paper we propose a method in which the semantic constraints, in terms of the co-occurrence of words in a document, are captured indirectly from the speech signal in the latent semantic analysis (LSA) framework. We show that the speech enhanced LSA language model performs better than both the n-gram model and the hybrid n-gram + LSA model. The reduction in perplexity on a test set is used to measure the performance of the model.
The paper is organised as follows. The next section briefly describes the technique of LSA. Section 3 describes the development of the speech enhanced multi-span language model. Section 4 describes the database used. Section 5 details the evaluation of the model, followed by a discussion of the results in Section 6. We summarise the study in Section 7.
2. Latent semantic analysis
A brief overview of related work on LSA relevant to this study, as described in [4][5][6], is presented here. LSA is an algebraic technique that can be used to infer the relationships among words from the co-occurrence of the words in identical contexts. Given a set of N documents from a text corpus T, with a vocabulary V of M words, it specifies a mapping between the discrete sets V and T and a continuous vector space S. A document could arbitrarily be a sentence, a paragraph or a larger unit of text.
A matrix W containing the co-occurrence statistics between words and documents is constructed. Here word order is ignored, unlike in conventional n-gram modelling. Each element of W is weighted by the normalised word entropy and scaled for the document length. The element (i,j) of W is given by

W(i,j) = (1 - eps_i) c_{i,j} / n_j

where c_{i,j} is the number of times word w_i occurs in document d_j, n_j is the total number of words present in d_j, and eps_i is the normalised entropy of w_i in the entire corpus T:

eps_i = -(1 / log N) sum_j (c_{i,j} / t_i) log(c_{i,j} / t_i)

with t_i the total number of occurrences of w_i in the corpus. The matrix W can be approximated by its order-R singular value decomposition (SVD). This results in three matrices:

W ~ W_R = U S V^T.
U and V are column orthonormal and S is a diagonal matrix of singular values. This transformation to the lower dimensional space captures the major structural associations between the words and the documents and removes noise. It also provides an R-dimensional representation for both the words and the documents. Based on information retrieval and language modelling studies [5], values of R in the range of 100 to 300 seem to work reasonably well. The R-dimensional scaled representations of word w_i and document d_j are given by u_i S and v_j S, where u_i and v_j are the corresponding rows of U and V. Any new document (test document), with weighted count vector d, can be considered as an additional column of the matrix W. Its representation in the reduced dimensional space is given by v = d^T U S^{-1}.
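As a concrete illustration, the entropy-weighted construction of W and the SVD/projection steps described above can be sketched in numpy as follows. This is a minimal sketch; the function names are ours, and empty documents and zero-count words are assumed not to occur.

```python
import numpy as np

def build_lsa_matrix(counts):
    """Entropy-weighted word-document matrix W.

    counts: (M, N) array with counts[i, j] = c_ij, the occurrences of
    word i in document j.  Returns W with
        W[i, j] = (1 - eps_i) * c_ij / n_j,
    where n_j is the document length and eps_i the normalised entropy
    of word i over the N documents.
    """
    counts = np.asarray(counts, dtype=float)
    M, N = counts.shape
    n_j = counts.sum(axis=0)                   # words per document
    t_i = counts.sum(axis=1, keepdims=True)    # total count of each word
    p = np.divide(counts, t_i, out=np.zeros_like(counts), where=t_i > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    eps = -plogp.sum(axis=1) / np.log(N)       # normalised entropy in [0, 1]
    return (1.0 - eps)[:, None] * counts / n_j[None, :]

def truncated_svd(W, R):
    """Order-R SVD approximation W ~ U S V^T (U: MxR, S: RxR, V: NxR)."""
    U, s, Vt = np.linalg.svd(np.asarray(W, dtype=float), full_matrices=False)
    return U[:, :R], np.diag(s[:R]), Vt[:R, :].T

def project_document(d, U, S):
    """Reduced-space representation of a new document column d:
    v = d^T U S^{-1}."""
    return np.asarray(d, dtype=float) @ U @ np.linalg.inv(S)
```

For a full-rank decomposition, projecting a training column of W recovers the corresponding row of V exactly; for R below the rank it gives the least-squares placement of the document in the reduced space.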
For language modelling, given such a representation and a distance metric in the R-dimensional space, it is possible to combine the standard n-grams and the LSA model to derive a hybrid n-gram + LSA language model, as detailed in [4][5]. In the following sections we detail the construction of the W matrix from the acoustic signal. Using this speech based W matrix we develop the speech enhanced hybrid n-gram + LSA model.
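The combination in [4][5] is a Bayesian integration of the n-gram and LSA probabilities. As a simpler illustrative stand-in (our assumption, not the exact formula of those papers), a renormalised geometric interpolation of the two distributions, together with the perplexity measure used throughout this paper, can be written as:

```python
import numpy as np

def combine_bigram_lsa(p_bigram, p_lsa, lam=0.5):
    """Renormalised geometric interpolation of two word distributions
    over the vocabulary; lam weights the LSA model.  Note: [4][5]
    derive a proper Bayesian combination; this is only a stand-in."""
    p = np.asarray(p_bigram) ** (1.0 - lam) * np.asarray(p_lsa) ** lam
    return p / p.sum()

def perplexity(word_probs):
    """Perplexity of a test sequence from its per-word probabilities."""
    p = np.asarray(word_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(p))))
```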
3. Speech enhanced multi-span language model

The block diagram of the proposed system for the construction of the matrix W is shown in Figure 1. Availability of a database segmented in terms of words is assumed. The duration of a word segment is variable. To find the closeness of a pattern vector representing a word to the other words of the vocabulary, it is desirable to have fixed dimensional pattern vectors for all the words in the vocabulary. From the speech signal corresponding to a word segment, for every frame of 15 msec, with a frame shift of 10 msec, we derive 13 dimensional mel frequency cepstral coefficients (MFCC). The Euclidean distance between adjacent pairs of feature vectors is computed for all the frames corresponding to the word segment. Depending on the number of frames needed to construct the desired pattern vector, frames are added or dropped. If the number of frames in the word segment is less than desired, the frame with the minimum Euclidean distance to its neighbour is replicated. For a word segment with more frames than desired, a frame is dropped if its Euclidean distance to its neighbour is the minimum among the distances computed. This is repeated until the desired number of frames is obtained. It is assumed that adding or dropping frames in this manner introduces minimal distortion or loss. The selected frames are concatenated to form the fixed dimensional pattern vector representing the word. The resulting pattern vector is large (390 to 572 dimensions), and comparing pattern vectors in such a high dimensional space is not preferable. It has been shown that non-linear compression of large dimensional pattern vectors of speech using autoassociative neural network (AANN) models does not degrade speech recognition performance [7]. We use AANN models to compress the large dimensional pattern vector into 40 to 100 dimensions. Thus a reduced dimension pattern vector is derived for each word segment in the entire training data. The pattern vectors corresponding to a word in the training set are used to derive a mean pattern vector, which serves as the reference pattern vector for that word in the vocabulary. One such reference pattern vector is derived for each word in the vocabulary.

For every word segment in a training document (speech file), a compressed pattern vector is derived as explained above. The Euclidean distance between this pattern vector and all the reference pattern vectors in the vocabulary is determined, and the resulting distances are normalised between zero and one. The membership, defined as (1 - normalised distance), indicates how close the current pattern vector is to each of the reference pattern vectors in the vocabulary. If this membership is above a certain threshold, the appropriate element W(i,j) is incremented by the membership value. The elements of W are also scaled for the length of the document (number of words) and weighted by the entropy of the term. Thus the W matrix is derived from the acoustic signal.

Figure 1: Block diagram for the construction of W in the proposed speech enhanced LSA language model (feature extraction to 13 dimensional MFCC, feature selection into a pattern vector, AANN compression, normalised distance computation against the mean pattern vectors from template generation, thresholding of the closeness, and update of W).

4. Database

The database used for the study is the Indian language speech corpus [8]. TV news bulletins from Doordarshan in the Tamil language were collected. Speech pertaining to the news reader was manually transcribed and segmented into words, representing around 4 hours of speech. Among these bulletins, 23 are spoken by females and 10 by males. For this task the database was partitioned manually into news stories belonging to 8 different categories. The details of the database in terms of news stories are shown in Tables 1 and 2. There is no standard text corpus of news bulletins or news wire corpora in the Tamil language. As it is preferable to use bigram models trained on data pertaining to the domain of use, we used a bigram
model derived from the limited training data for integration with the LSA model.
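The two central steps of the Section 3 pipeline, equalising the frame count of a word segment and accumulating memberships into W, can be sketched as follows. This is a minimal sketch under our reading of the procedure: tie-breaking in the frame selection and the min-max form of the distance normalisation are assumptions, and the function names are ours.

```python
import numpy as np

def fix_frame_count(frames, target):
    """Equalise a word segment to `target` frames.  If the segment is
    short, the frame closest to its right neighbour is replicated; if
    long, that frame is dropped.  Repeated until the count matches."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    while len(frames) != target:
        dists = [np.linalg.norm(frames[i] - frames[i + 1])
                 for i in range(len(frames) - 1)]
        i = int(np.argmin(dists))          # most redundant adjacent pair
        if len(frames) < target:
            frames.insert(i + 1, frames[i].copy())   # replicate
        else:
            frames.pop(i)                            # drop
    return np.stack(frames)

def add_word_to_w(W, doc_index, pattern, references, threshold):
    """Accumulate memberships for one word segment into column
    `doc_index` of W.  `references` holds the per-word mean pattern
    vectors, one row per vocabulary word."""
    d = np.linalg.norm(references - pattern, axis=1)
    d = (d - d.min()) / (d.max() - d.min())   # normalise to [0, 1]
    membership = 1.0 - d
    above = membership > threshold
    W[above, doc_index] += membership[above]
```

The fixed dimensional pattern vector is then the concatenation fix_frame_count(frames, 44).ravel(), giving 44 x 13 = 572 dimensions before AANN compression.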
Table 1: Description of the database in terms of stories (number of documents)

    Story category     Training set    Test set
    Economics                44            2
    Events                  104           23
    Others                  124           24
    Politics                104            4
    Sports                   34            3
    War                     163           32
    Weather                   6            1
    World politics           64            3
    Total                   643           92
Table 2: Database statistics

                           Training set    Test set
    No. of documents            643            92
    No. of words             26,380         3,706
    Min. words per doc            6             8
    Max. words per doc          159           136
    Avg. words per doc           41            40
5. Experimental evaluation
From the transcription of the training data we chose a limited vocabulary of 1,278 words (inclusive of the unknown word tag UNK) that had at least 4 occurrences in the training data. For the 643 training documents, a matrix W of size 1278 x 643 is created. The average duration of these words in the database is 431 msec. Assuming a frame shift of 10 msec, 44 frames are chosen, using the procedure described in Section 3, to represent a word.
The feature vectors concatenated together result in pattern vectors of 572 dimensions. Such pattern vectors are derived for each word in the training data. The pattern vectors are non-linearly compressed to a smaller dimension (60, 80 or 100) using AANN models. The structure of the AANN model is 572L 858N kN 858N 572L, where L represents a linear unit, N represents a non-linear unit, and k is the dimension of the desired compressed pattern vector. All the word pattern vectors in the training data are used to train the AANN model, which is trained for 200 epochs. The compressed feature vector is obtained from the compression layer of the trained AANN model. These compressed feature vectors are used in the construction of the W matrix. As the structure of the AANN model is large and the training patterns are limited, the AANN model may not generalise well. An alternative compact representation of a word, using only 30 frames concatenated to form a 390 dimensional pattern vector, was therefore also employed. These 390 dimensional pattern vectors were compressed to 40, 60 or 80 dimensions using an AANN model with the structure 390L 585N kN 585N 390L. Using these compressed pattern vectors of different dimensionalities, the corresponding W matrices were constructed. The integrated bigram + LSA model was derived as described in [4][5]. We report results for pattern vectors compressed to 60 dimensions.
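For illustration, the 572L 858N kN 858N 572L structure can be written as a forward pass with tanh units at the N layers and linear input/output layers. The weights below are random placeholders (training, 200 epochs in our setup, is omitted), biases are omitted for brevity, and the function name is ours:

```python
import numpy as np

def aann_forward(x, weights):
    """Forward pass through a 572L-858N-kN-858N-572L AANN.

    weights: list of 4 (in, out) matrices.  Layers marked N use tanh;
    the input and output layers are linear.  Returns (reconstruction,
    compressed), where `compressed` is the k-dimensional bottleneck
    activation used as the pattern vector.
    """
    h = np.tanh(x @ weights[0])        # 858N hidden layer
    z = np.tanh(h @ weights[1])        # kN compression layer
    h2 = np.tanh(z @ weights[2])       # 858N hidden layer
    y = h2 @ weights[3]                # 572L linear output
    return y, z

# illustrative shapes for the 572 -> 60 configuration
rng = np.random.default_rng(0)
dims = [572, 858, 60, 858, 572]
weights = [rng.standard_normal((a, b)) * 0.01
           for a, b in zip(dims[:-1], dims[1:])]
y, z = aann_forward(rng.standard_normal(572), weights)
```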
To test the performance of the language model, the perplexity of the speech enhanced hybrid bigram + LSA model was computed on the test data of Table 1. During testing, based on the transcription of the speech document and in a manner similar to the standard LSA language model, the appropriate scaled word and document vectors in the reduced dimensional space are used to compute the probabilities for every word in the test document. The out of vocabulary rate for the test set is very high (41%), due to the limited vocabulary chosen and the fact that the data pertains to news bulletins. The out of vocabulary words were ignored in the perplexity computation.
6. Results

The performance of the speech enhanced hybrid bigram + LSA model is shown in Table 3 for different thresholds on the membership value and an SVD order of 75 (the optimal order balancing reconstruction error and noise suppression). If the threshold is high (0.98), the W matrix is similar to its text based counterpart in its sparseness. As the threshold is lowered, more elements of the W matrix are filled, which acts like smoothing, and the performance of the model improves marginally. For still lower thresholds the performance is likely to deteriorate. This behaviour is observed for both representations of the word pattern vectors. The performance of the model using pattern vectors compressed from 390 to 60 dimensions is better than that of the model using pattern vectors compressed from 572 to 60 dimensions.

Table 3: Perplexity of the speech enhanced bigram + LSA model for different pattern representations and membership thresholds, for an SVD order of 75

    Word pattern representation    Membership greater than    Perplexity
    Compressed from 572 to 60               0.98                  227
                                            0.97                  231
                                            0.96                  231
                                            0.92                  233
    Compressed from 390 to 60               0.98                  199
                                            0.97                  196
                                            0.96                  195
                                            0.92                  197

The performance comparison of the three language models is shown in Table 4. The perplexity of the speech enhanced hybrid bigram + LSA model is better than that of the standard bigram model by 21%, and shows an improvement of 6% over the conventional text based bigram + LSA model for an SVD order of 250. Order 250 is chosen due to its better performance compared to SVD order 75.

Table 4: Comparison of the performance of three different language models. The LSA models use an SVD of order 250.

    Model                            Perplexity    Improvement over bigram
    Bigram                               234                  -
    Bigram + LSA                         199                 15%
    Speech enhanced bigram + LSA         185                 21%
7. Summary
In this study we proposed an approach for developing a speech enhanced multi-span language model. We have shown that the performance of the system is better than that of the text based bigram + LSA model for a limited vocabulary of words. No word level or document level smoothing [5] is used; employing it would further reduce the perplexity of the model. The different parameters of the system, such as the word pattern vector representation, the degree of compression of the pattern vector, the membership threshold, the SVD order and the scaling factor for the LSA probabilities, have not been optimised; doing so may improve the performance of the language model. This method of indirect incorporation of speech information may be a small step towards using speech level constraints in language models for better speech recognition performance. One limitation in extending the study is the lack of a large speech corpus, of the size required for language modelling, segmented in terms of words for Indian languages.
8. References
[1] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer, “Class-based n-gram models of natural language,” Computational Linguistics, vol. 18, no. 4, pp. 467–479, 1992.
[2] C. Chelba and F. Jelinek, “Recognition performance
of a structured language model,” in Proc. 6th Eur.
Conf. Speech Commun. Technol., Budapest, Hungary, Sept. 1999, vol. 4, pp. 1567–1570.
[3] R. Lau, R. Rosenfeld, and S. Roukos, “Trigger-based
language models: A maximum entropy approach,”
in Proc. IEEE Int. Conf. Acoust., Speech, and Signal
Processing, Minneapolis, USA, Apr. 1993, vol. 2, pp.
45–48.
[4] N. Coccaro and D. Jurafsky, “Toward better integration of semantic predictors in statistical language
modeling,” in Proc. Int. Conf. Spoken Language
Processing, Sydney, Australia, Dec. 1998, pp. 2403–
2406.
[5] J. R. Bellegarda, “Exploiting latent semantic information in statistical language modeling,” Proc.
IEEE, vol. 88, no. 8, pp. 1279–1296, Aug. 2000.
[6] T. K. Landauer, P. W. Foltz, and D. Laham, “An introduction to latent semantic analysis,” Discourse Processes, vol. 25, pp. 259–284, 1998.
[7] S. V. Gangashetty, C. Chandra Sekhar, and B. Yegnanarayana, “Dimension reduction using autoassociative neural network models for recognition of
consonant-vowel units of speech,” in Proc. Fifth Int.
Conf. Advances in Pattern Recognition, ISI Calcutta,
India, Dec. 2003, pp. 156–159.
[8] A. Nayeemulla Khan, Suryakanth V. Gangashetty,
and S. Rajendran, “Speech database for Indian
languages- A preliminary study,” in Proc. Int. Conf.
Natural Language Processing, NCST, Mumbai, India, Dec. 2002, pp. 295–301.