Lecture 5 PDF
Mehryar Mohri
Courant Institute of Mathematical Sciences
[email protected]
Software Library
GRM Library: Grammar Library. General software
collection for constructing and modifying weighted
automata and transducers representing grammars
and statistical language models (Allauzen, Mohri, and
Roark, 2005).
http://www.research.att.com/projects/mohri/grm
Program:
farcompilestrings -i labels corpus.txt > foo.far
or cat lattice1.fsm ... latticeN.fsm > foo.far
Graphical representation:
[Figure: weighted automaton with states 0/0, 1/0, 2/0, and 4/0 and transitions <s>/4, hello/2, bye/2, bye/3, bye/1, </s>/4, and </s>/2.]
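The integer weights in the figure above read like n-gram counts. As a rough illustration of how such counts could be collected, here is a minimal sketch. It is not the GRM Library's implementation: the function name `ngram_counts` and the two-sentence toy corpus are hypothetical, and the `<s>` and `</s>` boundary markers are simply borrowed from the figure labels.

```python
from collections import Counter


def ngram_counts(sentences, order=2):
    """Count all n-grams up to `order`, padding each sentence with <s> and </s>.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical toy corpus, one sentence per entry.
corpus = ["hello bye", "bye"]
counts = ngram_counts(corpus)

for ngram, c in sorted(counts.items()):
    print(" ".join(ngram), c)
```

Each key is a tuple of tokens, so unigram and bigram counts live in the same table and can be looked up as `counts[("bye",)]` or `counts[("bye", "</s>")]`.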
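Graphical representation:
[Figure: weighted automaton with states 0/0, 1, 2, 3, and 4 and transitions bye/1.108, bye/1.098, bye/0.698, hello/1.504, hello/0.698, </s>/0.410, </s>/0.810, and </s>/0.005; three further transition labels are not recoverable from the extraction.]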
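Graphical representation:
[Figure: weighted automaton with states 0/0, 1, 2, and 3 and transitions bye/1.098, bye/0.698, hello/1.504, hello/0.698, </s>/0.810, and </s>/0.005; two further transition labels are not recoverable from the extraction.]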
where
$$
d_k =
\begin{cases}
1 & \text{if } k > 5; \\
\approx \dfrac{(k+1)\, n_{k+1}}{k\, n_k} & \text{otherwise.}
\end{cases}
$$
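The discount formula can be sketched in a few lines of code. The section does not define n_k; the sketch assumes the usual reading, in which n_k is the number of distinct n-grams observed exactly k times. The function name `discount`, the `threshold` parameter, and the count-of-counts values below are hypothetical and chosen only for illustration.

```python
def discount(k, n, threshold=5):
    """Return d_k: 1 for k > threshold, else (k + 1) * n_{k+1} / (k * n_k).

    `n` maps a count k to n_k (assumed: number of n-grams seen exactly k times).
    """
    if k < 1:
        raise ValueError("k must be a positive count")
    if k > threshold:
        return 1.0
    if n.get(k, 0) == 0:
        raise ValueError(f"n_{k} is zero, so d_{k} is undefined")
    return (k + 1) * n.get(k + 1, 0) / (k * n[k])


# Hypothetical count-of-counts table: n_1 = 10, n_2 = 4, n_3 = 2, n_4 = 1.
n = {1: 10, 2: 4, 3: 2, 4: 1}

d1 = discount(1, n)  # (2 * 4) / (1 * 10)
d2 = discount(2, n)  # (3 * 2) / (2 * 4)
d6 = discount(6, n)  # above the threshold, so no discounting
```

On small tables the ratio can be zero or exceed 1 when n_{k+1} is missing or large relative to n_k, which is one reason the raw formula is only applied to small k.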
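[Figure: weighted transducer with states 0/0 and 1/0 and transitions hello:hello/0 and bye:BYE/0.693 (the output label of one further bye transition is not recoverable), shown alongside two automata, one with transitions hello/2, <s>/4, and bye/2 and one with transitions hello/2, <s>/4, and BYE/2, each with state 2/0.]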
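The weight 0.693 on the bye:BYE transition is consistent with a negative natural-log probability, since -ln(0.5) ≈ 0.693. That weight convention is an assumption here, not something the section states. Under that assumption, a minimal sketch of the conversion in both directions looks like this (the function names are hypothetical):

```python
import math


def prob_to_weight(p):
    """Convert a probability to a negative natural-log weight (assumed convention)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log(p)


def weight_to_prob(w):
    """Convert a negative natural-log weight back to a probability."""
    return math.exp(-w)


w_half = prob_to_weight(0.5)   # close to the 0.693 shown on bye:BYE
p_back = weight_to_prob(0.693)  # close to 0.5
p_zero = weight_to_prob(0.0)    # a weight of 0 corresponds to probability 1
```

With this reading, the weight 0 on hello:hello corresponds to probability 1, and weights along a path add where the corresponding probabilities multiply.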
Models
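[Figure: weighted automaton of the original model, with states 0/0, 1, 2, 3, and 4 and transitions bye/1.108, bye/1.098, bye/0.698, hello/1.504, hello/0.698, </s>/0.410, </s>/0.810, and </s>/0.005; three further transition labels are not recoverable from the extraction.]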
Class-based model – Graphical Representation
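[Figure: weighted automaton of the class-based model, with states 0/0, 1, 2, 3, and 4 and transitions BYE/0.698, BYE/1.386, hello/0.698, hello/1.386, </s>/0.693, and </s>/0.005; the remaining transition labels are not recoverable from the extraction.]
Weighted Context-Free Grammars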
[Figure: weighted automaton with states 0, 1, 2, 3, 4/0, and 5 and transitions bye/0, bye/1.391, bye/2.079, hello/0.698, hello/1.386, </s>/0.693, and </s>/0.005; the remaining transition labels are not recoverable from the extraction.]
• Cyril Allauzen, Mehryar Mohri, and Brian Roark. The Design Principles and Algorithms of a
Weighted Grammar Library. International Journal of Foundations of Computer Science, 16(3):
403-421, 2005.
• Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jennifer C. Lai, and Robert L.
Mercer. 1992. Class-based n-gram models of natural language. Computational Linguistics,
18(4):467-479.
• Stanley Chen and Joshua Goodman. An empirical study of smoothing techniques for
language modeling. Technical Report, TR-10-98, Harvard University. 1998.
• William Gale and Kenneth W. Church. What’s wrong with adding one? In N. Oostdijk and
P. de Haan, editors, Corpus-Based Research into Language. Rodopi, Amsterdam.
• Slava Katz. Estimation of probabilities from sparse data for the language model
component of a speech recognizer. IEEE Transactions on Acoustics, Speech and Signal
Processing, 35, 400-401, 1987.
• Reinhard Kneser and Hermann Ney. Improved backing-off for m-gram language modeling.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing,
volume 1, pages 181-184, 1995.
• Mehryar Mohri. Weighted Grammar Tools: the GRM Library. In Robustness in Language and
Speech Technology. pages 165-186. Kluwer Academic Publishers, The Netherlands, 2001.
• Hermann Ney, Ute Essen, and Reinhard Kneser. 1994. On structuring probabilistic
dependences in stochastic language modeling. Computer Speech and Language, 8:1-38.
• Ian H. Witten and Timothy C. Bell. The zero-frequency problem: Estimating the
probabilities of novel events in adaptive text compression, IEEE Transactions on Information
Theory, 37(4):1085-1094, 1991.