A Survey of Applications of Finite Automata in Natural Language Processing

Download as pdf or txt
Download as pdf or txt
You are on page 1of 3

et

International Journal on Emerging Technologies (Special Issue NCETST-2017) 8(1): 62-64(2017)


(Published by Research Trend, Website: www.researchtrend.net)
ISSN No. (Print) : 0975-8364
ISSN No. (Online) : 2249-3255

A Survey of Applications of Finite Automata in Natural Language


Processing
Raj Kishor Bisht
Department of Applied Sciences, Amrapali Institute of Technology and Sciences,
Haldwani, (Uttarakhand), India
ABSTRACT: In the present paper a survey of applications of finite automata in natural language processing
has been done. Applications to different natural language processing tasks like computational lexicography,
morphological analysis etc. have been shown by considering some examples. Finally use of finite automata in
Hindi tense recognition has been elaborated.
Keywords: Finite automata, Natural Language Processing, Morphology, Parsing.
I. INTRODUCTION class, that is, variation in tense, number etc. comes
under inflectional morphology. For example, girl-girls,
Natural language processing (NLP) comprises a number
play-played etc. A combination of a stem and an affix
of tasks that deal with the processing of natural
that derives a word of new class, that is, which alters
language through computers. The theory of automata
the meaning of the word comes under derivational
plays a significant role in providing solutions of many
morphology. For example, kind- kindness, computer-
problems in natural language processing. For example,
computerization etc. Inflections can further be
speech recognition, spelling correction, information
categorized in two parts; regular and irregular.
retrieval etc. Finite state methods are quite useful in
Inflections that follow a regular pattern in converting
processing natural language as the modeling of
form singular to plural, present to past etc. are called
information using rules has many advantages for
regular inflections. For example, boy-boys, girl-girls,
language modeling. Finite state automaton has a
play-played etc.. Inflections that do not have a regular
mathematical model which is quite understandable; data
pattern for such conversion are called irregular. For
can be represented in a compacted form using finite
example, man- men, go-went etc.
state automaton and it allows automatic compilation of
Finite state automata (deterministic and non
system components.
deterministic finite automata) provide decision
Here we will discuss some of the NLP tasks.
regarding acceptance and rejection of a string while
Morphology is concerned with the study of the structure
transducers provide some output for a given input. Thus
of a word and the pattern of word formation. A lexeme
the two machines are quite useful in language
is a basic unit in a word that contains meaning. For
processing tasks. Finite state automata are useful in
example, ‘walk’, ‘walks’, ‘walking’, ‘walked’ have a
deciding whether a given word belongs to a particular
common lexeme ‘walk’. A morpheme is a minimal
language or not. Similarly, transducers are useful in
meaning bearing unit or grammatical function that is
parsing and generation of words from their lexical
used to form a word. For example, the word ‘walks’
form.
contains two morphemes; ‘walk’ and ‘s’. A morpheme
In the literature, we found a number of research
is called free morpheme if it has independent existence
papers regarding the application of finite automata in
as a single word and bound morpheme if it does not
natural language processing. Jayara, Kornai and
have independent existence as a single word, it must be
Sakarovitch [1] discussed the progress of work done in
attached to another word to get meaning. The
the direction of finite state methods and models in
morpheme ‘walk’ is free morpheme while the
natural language processing. Some early work
morpheme ‘s’ is bound morpheme. A free morpheme
describing morphological analysis through finite state
can also be described as a stem and bound morpheme is
machine can be seen in Kaplan and Kay [8],
described as an affix. An affix may be a prefix, or
Koskenniemi [4]. Shrivastava and Bhattachrayya [6]
suffix (circumfix or infix also in some of the
proposed HMM based part of speech tagging for Hindi.
languages). Morphology can be further categorized in
Sharma and Paul [7] developed a system identification
two ways: inflectional and derivational. A combination
and classification of clauses in Hindi text. Kumar, Deng
of a stem and an affix that produces a word of same
and Byrene [9] presented a weighted finite state

Bisht 62
transducer translation template model for statistical scale dictionary compilation and indexation of natural
machine translation. Manning and Schutze [2] and language text.
Jurafsky and Martin [3] provided details of applications
of finite automata in NLP in their books.
II. FINITE AUTOMATA IN NLP
In this section a survey of applications of finite
automata in natural language processing has been made.
A. Language Recognizer
There are many tasks that need language recognizing
Fig. 1. NFA for the words ‘boy’ and ‘bat’.
mechanism. For example, spelling checker,
morphological analysis, language identification etc.. Inflectional morphology can be well recognized by
Finite state machine are quite useful as a language finite automata. For each category of words we can
recognizer. For a given word, a NFA can be designed form a separate NFA and then combine them using ߣ
easily that recognize the word. For example, NFA for transitions. For example, nouns and their plural can be
the words ‘boy’ and ‘bat’ is shown in the Fig 1. recognized through one NFA and verbs and their
Similarly for every word a NFA can be designed and different forms can be recognized through another NFA
the different NFA’s can be combined to form spelling and finally the two NFA can be combined. Figure 2
checker or dictionary compilation for a language. Mohri shows the NFA for some words and their
[ ] showed the application of finite automata in large morphological variations.

Fig. 2. NFA for the some words and their morphological variations.
B. Morphological parsing and generation used for a noun. The word ‘girl’ can be replaced by
Morphological parsing is the process of producing a any other regular noun like ‘boy’.
lexical structure of a word, that is, breaking a word into
stem and affixes and labeling the morphemes with
category label. For example the word ‘books’ can be
parsed as book+s and further as book + N+PL, the
word ‘went’ can be parsed as go+V+PAST etc.
generation is the reverse process of parsing, that is
combining the lexical form of a word to produce the
word. For example, box+N+Pl generates the word Fig. 3. Generation of word from its lexical form
‘boxes’. Finite state transducers are quite useful in through transducer.
morphological parsing. Let us consider the lexicon form
Here simple regular nouns have been considered on
of regular inflectional ‘girl +N +PL’. Fig. 3 shows the
transducer which convert the lexicon form ‘girl+N+PL’ exemplarily basis, orthographic rules can be applied for
into the word ‘girls’. Let us assume that x represents the irregular inflections and derivations and NFAs can be
word ‘girl’ for simplification purpose as for every designed for every word and its variations in a certain
regular singular noun the input and output will remain language.
same in a transducer, that’s why a variable x can be

Bisht 63
C. Tense recognition in Hindi introduce the concept. Fig. 5 shows the recognition of
Tenses in Hindi are important part of grammar. Their two tenses in Hindi sentences through NFA. The
knowledge is quite useful in translation from Hindi to abbreviations showing final states are as follows: PI-
another language. Here we will show the application of Present indefinite and PC-Present continuous. Further
finite state machines in automatic recognition of tenses ‫{∈ ݔ‬Rkk gS] rh gS] rs gS} , y ∈{ jgk gS] jgh gS] jgs
in Hindi language. We have some grammatical rules in gS},
Hindi through which we can identify the tense of a
III. CONCLUSION
sentence and translate the sentence in English. We
know that there are three tenses and each tense has a In the present paper a survey of applications of finite
four sub tenses. For example, the sentences which end automata in natural language processing has been done.
with ‘Rkk gS] rh gS] rs gS’ etc. come under present Some examples have been taken to show the
indefinite tense, like *eksgu QqVcky [ksyrk gS^ and morphological analysis and parsing of English
the sentence which end with ‘jgk gS] jgh gS] jgs gS’’ sentences. Finally the application of finite automata has
etcetc. come under present continuous tense, like *xhrk been shown in tense recognition of Hindi sentences
[ksy jgh gS^. We may have the following parse tree which is an important part of rule based machine
structure of the second Hindi sentence for tense translation.
recognition purpose. Here the non terminal EF stands
for ‘End Form’, other non terminal have usual meaning. REFERENCES
[1]. A. Jayara, A. Kornai and A.Sakarovitch, “Finite-state
methods and models in natural language processing, Natural
Language Engineering 17 (2): 141–144, 2011.
[2]. C.D. Manning, H. Schutze, Foundations of Statistical
Natural Language Processing. MIT Press, 2002.
[3]. D. Jurafsky, and J.H. Martin, Speech and Language
Processing, Pearson Eucation India, 2004.
[4]. K. Koskenniemi, “Finite state morphology and and
information retrieval” in the proceedings of the ECAI
workshop extended finite state models of language, 1996, pp.
42-45.
[5]. M. Mohri, “On some applications of finite automata
theory to natural language processing”, Natural Language
Engineering, Vol. 1(1), pp. 000-000, 1995.
Fig. 4. Parse tree for the sentence *xhrk [ksy jgh gS^. [6]. M. Shrivastava, P. Bhattachrayya, “Hindi POS Tagger
Using Naive Stemming : Harnessing Morphological
Information Without Extensive Linguistic Knowledge, in the
proceedings of ICON-2008: 6th International Conference on
Natural Language Processing, Macmillan Publishers, India.
[7]. R. Sharma, S.Paul, “A rule based approach for automatic
clause boundary detection and classification in Hindi”,
Proceedings of the 5th Workshop on South and Southeast
Asian NLP, 25th International Conference on Computational
Fig. 5. NFA for recognition of two tenses. Linguistics, pp. 102–111, 2014.
[8]. R.M.Kaplan, M. Kay, “Regular models of rule syatems”,
A finite state machine can be designed for the verb Computational Linguistics, vol. 20(3), pp. 331-378, 1994.
phrase that appears at the end of the sentence for [9]. S. Kumar, Y. Deng and W. Byrne, “A weighted finite
deciding the tense of the sentence. Tough there are state transducer translation template model for statistical
number of complexities and variations in Hindi machine translation”, Natural Language Engineering 12 (1):
sentences but here a simple example has been taken to 35–75, 2005.

Bisht 64

You might also like