Lecture 5 PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 24

Speech Recognition

Lecture 5: Statistical Language Models -


Software Libray

Mehryar Mohri
Courant Institute of Mathematical Sciences
[email protected]
Software Library
GRM Library: Grammar Library. General software
collection for constructing and modifying weighted
automata and transducers representing grammars
and statistical language models (Allauzen, MM, and
Roark, 2005).

http://www.research.att.com/projects/mohri/grm

Mehryar Mohri - Speech Recognition page 2 Courant Institute, NYU


This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Mehryar Mohri - Speech Recognition page 3 Courant Institute, NYU


Overview
Generality: to support the representation and use
of the various grammars in dynamic speech
recognition.
Efficiency: to support competitive large-vocabulary
dynamic recognition using automata of several
hundred million states and transitions.
Reliability: to serve as a solid foundation for
research in statistical language modeling.

Mehryar Mohri - Speech Recognition page 4 Courant Institute, NYU


Content
Statistical Language Models: creating, shrinking,
and converting language models.
Grammar compilation:
• weighted context-dependent rules, weighted
context-free grammars (CFGs).
• regular approximations of CFGs.
Text and grammar processing utilities:
• local grammars, suffix automata.
• local determinization.
• counting and merging.
Mehryar Mohri - Speech Recognition page 5 Courant Institute, NYU
Language Modeling Tools
Counts: automata (strings or lattices), merging.
Models:

• Backoff or deleted interpolation smoothing.


• Katz or absolute discounting.
• Knesser-Ney models.
Shrinking: weighted difference or relative entropy.
Class-based modeling: straightforward.

Mehryar Mohri - Speech Recognition page 6 Courant Institute, NYU


Corpus
Input:
Corpus Labels
hello. <s> 1
bye. </s> 2
hello. <unknown> 3
bye bye. hello 4
bye 5

Program:
farcompilestrings -i labels corpus.txt > foo.far
or cat lattice1.fsm ... latticeN.fsm > foo.far

Mehryar Mohri - Speech Recognition page 7 Courant Institute, NYU


Counting
Weights:

• use fsmpush to remove initial weight and create


a probabilistic automaton.

• counting from far files.


• counts produced in log semiring.
Algorithm:

• applies to all probabilistic automata.


• In particular, no cycle with weight zero or less.
Mehryar Mohri - Speech Recognition page 8 Courant Institute, NYU
Counting
Program:
grmcount -n 2 -s 1 -f 2 foo.far > foo.2.counts.fsm
grmmerge foo.counts.fsm bar.counts.fsm > foobar.counts.fsm

Graphical representation:
hello/2
2/0
<s>/4
bye/2

hello/2 3/0 </s>/2

0/0 1/0
</s>/4

bye/3 </s>/2

4/0 bye/1

Mehryar Mohri - Speech Recognition page 9 Courant Institute, NYU


This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Mehryar Mohri - Speech Recognition page 10 Courant Institute, NYU


Creating Back-off Model
Making a backoff model from counts
ram:
Program:
make foo.2g.counts.fsm > foo.2g.lm.fsm
grmmake foo.2.counts.fsm > foo.2.lm.fsm
hical Representation:

Graphical representation:

bye/1.108 </s>/0.410
</s>/0.810 0/0
!!"#$%%
1 </s>/0.005
bye/0.698
bye/1.098 4 hello/1.504

!!&#&()
!!&#'%& 2
3 hello/0.698

Mehryar Mohri - Speech Recognition page 11 Courant Institute, NYU


Shrinking
Shrinking a backoffBack-off
model Model
Program:
k -s 4 foo.2g.lm.fsm > foo.2g.s4.lm.fsm
grmshrink -s 4 foo.2.lm.fsm > foo.2.s4.lm.fsm
Representation:

Graphical representation:

</s>/0.005
</s>/0.810 0/0
bye/1.098
1 !!"#$%"
hello/0.698
hello/1.504
3
!!"#"&'

2 bye/0.698

Mehryar Mohri - Speech Recognition page 12 Courant Institute, NYU


Back-off Smoothing
Definition: for a bigram model,
 dc(wi−1 wi ) c(wi−1 wi )

if k > 0;
Pr[wi |wi−1 ] = c(wi−1 )
α Pr[wi ] otherwise;

where

1 if k > 5;
dk = (k + 1)nk+1
≈ otherwise.
knk

Mehryar Mohri - Speech Recognition page 13 Courant Institute, NYU


Conversion of Back-off Model
Interpolated model:
grmconvert foo.lm.fsm -m interpolated > foo.int.lm.fsm

Failure function model - failure class:


• efficient representation of backoff models.
• requires on-the-fly composition for decoding.
grmconvert foo.lm.fsm -e failure_transitions > foo.flm.fsm
Failure function model - basic class:
• state-splitting: correct in tropical semiring.
• pre-optimization: on-the-fly composition not
required.
grmconvert foo.lm.fsm -e state_splitting > foo.elm.fsm

Mehryar Mohri - Speech Recognition page 14 Courant Institute, NYU


This Lecture
Counting
Model creation, shrinking, and conversion
Class-based models

Mehryar Mohri - Speech Recognition page 15 Courant Institute, NYU


Class-Based Models
Simple class-based models:
Pr[wi |h] = Pr[wi |Ci ] Pr[Ci |h].

Methods in GRM: no special utility needed.


• create transducer mapping strings to classes.
• use fsmcompose to map from word corpus to
classes.
• build and make model over classes.
• use fsmcompose to map from classes to words.
Generality: classes defined by weighted automata.
Mehryar Mohri - Speech Recognition page 16 Courant Institute, NYU
YE = {bye, bye bye}
Class-Based Model - Example
epresentation:
Example: BYE = {bye, bye bye}.
Graphical representation: mapping from strings to
classes.

hello:hello/0
bye:BYE/0.693
hello:hello/0
0/0 1/0
bye:!!"

Mehryar Mohri - Speech Recognition page 17 Courant Institute, NYU


Class-Based
Original counts
Model
Class-based counts –
- Counts
Graphical Representation

hello/2 hello/2
2/0 2/0
<s>/4 <s>/4
bye/2 BYE/2

hello/2 3/0 </s>/2 hello/2 3/0 </s>/2


1/0
0/0 1/0 0/0
</s>/4 </s>/4

bye/3 </s>/2 BYE/2 </s>/2

4/0 bye/1 4/0

Original counts. Class-based counts.


Mehryar
Weighted Mohri
Context-Free Grammars Weighted Context-Free
22 Grammars

Mehryar Mohri - Speech Recognition page 18 Courant Institute, NYU


Original Model

Models
bye/1.108 </s>/0.410
</s>/0.810 0/0
1 !!"#$%%
</s>/0.005
bye/0.698
bye/1.098 4 hello/1.504

!!&#&()
original model.
!!&#'%& 2
3 hello/0.698
Class-based model – Graphical Representation

</s>/0.005
</s>/0.693 0/0
1 !!"#$%&
</s>/0.005
BYE/0.698 BYE/1.386
Weighted Context-Free hello/1.386
Grammars
4 24

class-based model.
!!"#$%&
!!"#$%& 2
3 hello/0.698

Mehryar Mohri - Speech Recognition page 19 Courant Institute, NYU


Final Class-Based Model
Final model – Graphical Representation

bye/1.391
!!"#$%&
3 </s>/0.005
bye/0
bye/2.079

</s>/0.693
!!"#$%&
0 1 !!"#$%& </s>/0.005 4/0
5
!!"#$%&
hello/0.698 </s>/0.005

hello/1.386
2

Mohri Weighted Context-Free Grammars 25

Mehryar Mohri - Speech Recognition page 20 Courant Institute, NYU


Conclusion
GRM Library: general utilities for text and
grammar processing.

• generality: e.g., counts from arbitrary automata


or a class-based-model.

• efficiency: practical, e.g., counting lattices in


1/50th real-time.

• testing: statistical tests for reliability.

Mehryar Mohri - Speech Recognition page 21 Courant Institute, NYU


References
• Cyril Allauzen, Mehryar Mohri, and Brian Roark. Generalized Algorithms for Constructing
Statistical Language Models. In 41st Meeting of the Association for Computational Linguistics
(ACL 2003), Proceedings of the Conference, Sapporo, Japan. July 2003.

• Cyril Allauzen, Mehryar Mohri, and Brian Roark. The Design Principles and Algorithms of a
Weighted Grammar Library. International Journal of Foundations of Computer Science, 16(3):
403-421, 2005.

• Peter F. Brown,Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L.
Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics, 18
(4):467-479.

• Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for
language modeling. Technical Report, TR-10-98, Harvard University. 1998.

• William Gale and Kenneth W. Church. What’s wrong with adding one? In N. Oostdijk and
P. de Hann, editors, Corpus-Based Research into Language. Rodolpi, Amsterdam.

• Good, I. The population frequencies of species and the estimation of population


parameters, Biometrica, 40, 237-264, 1953.

Mehryar Mohri - Speech Recognition page 22 Courant Institute, NYU


References
• Frederick Jelinek and Robert L. Mercer. 1980. Interpolated estimation of Markov source
parameters from sparse data. In Proceedings of the Workshop on Pattern Recognition in
Practice, pages 381-397.

• Slava Katz . Estimation of probabilities from sparse data for the language model
component of a speech recognizer, IEEE Transactions on Acoustics, Speech and Signal
Processing, 35, 400-401, 1987.

• Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing,
volume 1, pages 181-184, 1995.

• David A. McAllester, Robert E. Schapire: On the Convergence Rate of Good-Turing


Estimators. Proceedings of Conference on Learning Theory (COLT) 2000: 1-6.

• Mehryar Mohri. Weighted Grammar Tools: the GRM Library. In Robustness in Language and
Speech Technology. pages 165-186. Kluwer Academic Publishers, The Netherlands, 2001.

• Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic
dependences in stochastic language modeling. Computer Speech and Language, 8:1-38.

Mehryar Mohri - Speech Recognition page 23 Courant Institute, NYU


References
• Kristie Seymore and Ronald Rosenfeld. Scalable backoff language models. In Proceedings of
the International Conference on Spoken Language Processing (ICSLP), 1996.

• Andreas Stolcke. 1998. Entropy-based pruning of back-off language models. In Proceedings


of DARPA Broadcast News Transcription and Understanding Workshop, pages 270-274.

• Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the
probabilities of novel events in adaptive text compression, IEEE Transactions on Information
Theory, 37(4):1085-1094, 1991.

Mehryar Mohri - Speech Recognition page 24 Courant Institute, NYU

You might also like