Parsing Bangla Grammar Using Context Free Grammar
Parsing Bangla Grammar Using Context Free Grammar
Parsing Bangla Grammar Using Context Free Grammar
net/publication/314048397
CITATIONS READS
0 136
4 authors, including:
Some of the authors of this publication are also working on these related projects:
All content following this page was uploaded by K. M. Azharul Hasan on 22 July 2019.
Chapter 7
Parsing Bangla Grammar
Using Context Free Grammar
Al-Mahmud
Khulna University of Engineering and Technology (KUET), Bangladesh
Bishnu Sarker
Khulna University of Engineering and Technology (KUET), Bangladesh
K. M. Azharul Hasan
Khulna University of Engineering and Technology (KUET), Bangladesh
ABSTRACT
Parsing plays a very prominent role in computational linguistics. Parsing a Bangla sentence is a pri-
mary need in Bangla language processing. This chapter describes the Context Free Grammar (CFG)
for parsing Bangla language, and hence, a Bangla parser is proposed based on the Bangla grammar.
This approach is very simple to apply in Bangla sentences, and the method is well accepted for parsing
grammar. This chapter introduces a parser for Bangla language, which is, by nature, a predictive parser,
and the parse table is constructed for recognizing Bangla grammar. Parse table is an important tool to
recognize syntactical mistakes of Bangla sentences when there is no entry for a terminal in the parse
table. If a natural language can be successfully parsed then grammar checking of this language becomes
possible. The parsing scheme in this chapter works based on a top-down parsing method. CFG suffers
from a major problem called left recursion. The technique of left factoring is applied to avoid the problem.
DOI: 10.4018/978-1-4666-3970-6.ch007
Copyright © 2013, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.
Parsing Bangla Grammar Using Context Free Grammar
Bangla (or Bengali) is one of the more im- some kind of parse tree, abstract syntax tree or
portant Indo-Iranian languages, is the sixth most other hierarchical structure) implicitly in the
popular in the world and spoken by a population input tokens. Parsing can be defined as a method
that now exceeds 250 million. Geographically, where a parser algorithm is used to determine
Bangla-speaking population percentages are as whether a given input string is grammatically
follows: Bangladesh (over 95%), and the Indian correct or not for a given grammar. Parsing is a
States of Andaman & Nicobar Islands (26%), fundamental problem in language processing for
Assam (28%), Tripura (67%), and West Bengal both machines and humans. In general, the parsing
(85%). The global total includes those who are now problem includes the definition of an algorithm to
in diaspora in Canada, Malawi, Nepal, Pakistan, map any input sentence to its associated syntactic
Saudi Arabia, Singapore, United Arab Emirates, tree structure (Saha, 2006). The parser often uses
United Kingdom, and United States. a separate lexical analyzer to create tokens from
Bangla is still in degraded stage at least as far the sequence of input characters. Parsers may be
as work in the area of computational linguistics programmed by hand or may be automatically or
is concerned. Natural languages like English and semi-automatically generated (in some program-
even Hindi is rapidly progressing as far as work ming languages) by a tool.
done in processing by computers is concerned. A parse tree for a grammar is a tree where the
Unfortunately, Bangla lags more or less behind in root of the tree is the start symbol for the gram-
some crucial areas of research like parts of speech mar, the interior nodes are the non-terminals of
tagging, text summarization and categorization, the grammar, the leaf nodes are the terminals of
information retrieval and most importantly in the the grammar and the children of a node starting
area of grammar checking. The grammar checking from the left to the right correspond to the sym-
for a language has a wide variety of applications. bols on the right hand side of some production
The activity of breaking down a sentence into for the node in the grammar. Every valid parse
its constituent parts is known as parsing. Parsing is tree represents a string generated by the grammar
an earlier term for the diagramming of sentences (Yarowsky, 1995).
of natural languages, and is still used for the dia- A parser analyzes the sequence of symbols
gramming. Parsing a sentence involves the use of presented to it based on the grammar (Yarowsky,
linguistic knowledge of a language to discover the 1995). Natural language applications namely In-
way in which a sentence is structured. Exactly how formation Extraction, Machine Translation, and
this linguistic knowledge is represented and can be Speech Recognition, need to have an accurate
used to understand sentences is one of the questions parser (Haque & Khan, 2005). Parsing natural
that has engaged the interest of psycholinguists, language text is more difficult than the computer
linguists, computational linguists, and computer languages such as compiler and word processor
scientists. Bangla parsing is a challenging task. because the grammars for natural languages are
This chapter has a detail discussion over Bangla complex, ambiguous, and infinite in number of
parsing using Context Free Grammar. vocabulary. It is difficult to prepare formal rules
to describe informal behavior even though it is
clear that some rules are being followed. For a
BACKGROUND syntax based grammar checking the sentence is
completely parsed to check the correctness of it.
In computing, a parser is one of the components If the syntactic parsing fails, the text is considered
in an interpreter or compiler that checks for incorrect. On the other hand, for statistics based
correct syntax and builds a data structure (often approach, Parts of Speech (POS) tag sequences
138
Parsing Bangla Grammar Using Context Free Grammar
are prepared from an annotated corpus, and hence If we transfer it as word to word then the
the frequency and the probability (Sengupta & English form will become:
Chaudhuri, 1997). The text is considered correct if
the POS-tagged text contains POS sequences with I + rice + eat.
frequencies higher than some threshold (Anwar,
Anwar & Bhuiyan, 2009) Here, in English object is appeared after the
In this chapter, we implemented Bangla dic- verb whereas in Bangla the object is appeared
tionary in XML format using the corresponding before the verb. It’s clearly states that how diffi-
word as tag name and it’s POS as value. It is a cult to parse Bangla sentences rather than English
very useful technique because it needs less time to sentences.
store data as well as data retrieval from this data Roughly speaking, phrases and idioms are
storage is very easy and fast compared to other expressions whose meaning cannot be completely
popular forms of data storage. It helps to search understood from the meanings of the component
the dictionary very fast. Bangla grammar has a parts. For example, whereas it is possible to work
large number of forms and rules. The aim of this out the meaning of (1a) on the basis of knowledge
chapter is to describe a parser that parses a Bangla of English grammar and the meaning of words,
sentence, rather than generating sentences. A parse this would not be sufficient to work out that (1b)
tree is displayed if the Bangla parser decides that can mean something like ‘If Sam dies, her children
sentence is correct otherwise parse tree are not will be rich’. This is because kick the bucket is
generated. This technique of parsing system with an idiom.
XML checks whether every sentence of given
Bangla passage is grammatically correct or not 1a: If Sam mends the bucket, her children will
correct. It also indicates the location of problems. be rich.
If any sentence of the passage is found accepted 1b: If Sam kicks the bucket, her children will be
or correct then it generates a parse tree. rich.
There has been a structural difference between
the Bangla and English grammar. It is challenging The problem with phrases and idioms, in a
to parse Bangla and English sentence with the parsing context, is that it is not usually possible
same grammatical structure. If we match word to to parse them using the normal rules.
word with Bangla and English grammar structure Throughout the chapter there will be a detailed
then the parser will not give the correct answer. discussion over Context Free Grammar for pars-
For example, considering the grammatical ing Bangla Text and predictive Bangla parser for
structure: constructing parse table. Here a top down parsing
scheme is adopted and the problem of left recur-
Subject + verb + object (I + eat+ rice) sion is avoided applying left factoring over the
proposed CFG.
In Bangla it appears as:
But it is a non-traditional form. The traditional Modern Bangla morphology is very productive,
form is: especially for verbs, with the root verbs takes
around 168 different forms. Bangla lexicon also
আমি + ভাত + খাই. has a very large number of compound words
139
Parsing Bangla Grammar Using Context Free Grammar
(words that have more than one root), which parser blindly reaches the root of terminal using
can be created from at most any combination of the productions (Sengupta & Chaudhuri, 1997).
nouns, pronouns and adjectives. An effort is made Recursive descent Parsing is another way to
at building a complex morphological parser for produce parse trees for an input sentence. This
Bangla, where it can only handle simple words parser interprets the grammar as a specification of
with a single root. Though the addition of in- how to break a high-level goal into several lower-
flectional suffixes in Bangla compound word is level sub-goals. The top-level goal is to find the
fairly complex, the compound word’s individual start symbol of the grammar and sub-goals are the
root words may retain their inflectional suffixes. production that breaks down the start symbols of
Such a compound word morphological parser is the grammar. Such sub-goals are matched with an
developed which can efficiently parse compound input sentence and succeed to its next state if the
words having inflectional suffix and also resolves next word is matched. If there is no match the parser
ambiguities. Some rules are used to develop the backtracks to a previous node and tries a different
morphological analysis of simple and compound alternative. One problem with this recursive decent
Bangla words that can be used to make Universal method is that when a production is recursive,
Natural Language (UNL)-Bangla dictionary for the parser keeps on generating a never-ending
converting the natural Bangla sentences to UNL parse tree when it selects the recursive production
documents and vice versa. (Sengupta & Chaudhuri, 1997). An advantage for
There has been no organized formal initiative recursive decent over the shift-reduce method is
for generating a context-free grammar for Bangla that the parser can be forced to choose a specific
Language. There are some books available on production to continue parsing. Thus, this makes
context-free grammars for Bangla but most of it easier to produce correct parse tree of a sentence
them have been limited to the writers’ way of in a given grammar which may not be possible
generating the grammar rather than producing with a shift-reduce parser for the same grammar.
a generic grammar to cover the whole language A rule-based Bangla parser has been proposed
(Anwar, Anwar & Bhuiyan, 2009; Aho, Sethi & in (Saha, 2006), that handles semantics as well
Ullman, 2002). as POS identification from Bangla sentences
Shift-reduce parsing is the simplest kind of and ease the task of handling semantic issues in
bottom-up parsing. Here the parser takes an input machine translation.
string (sentence) and repeatedly pushes the next An open source morphological analyzer for
input word onto a stack which is the shift operation. Bangla using finite state technology is described
If the top n items on the stack match the n items in (Faridee & Tyers, 2009). They developed the
on the right-hand side of some production, then monolingual dictionary called Monodix, stored
they are popped off the stack and the item on the in XML file.
left-hand side of the production is pushed on the Anwar, Anwar & Bhuiyan (2009) address a
stack. This replacement operation is the reduce method to analyze syntactically Bangla sentence
part of the parsing. The parser ends its work when using context-sensitive grammar and interpret the
all the input is consumed and there is only one input Bangla sentence to English using the NLP
item remaining on the stack which is a parse tree conversion unit. The system is based on analyzing
with the start symbol of the grammar as its root. an input sentence and converting into a structural
The parser fails to parse a sentence when there representation.
are no remaining input words to shift or when There have been several parsing technique
there is no way to reduce the remaining items on proposed for Bangla Language. A parsing meth-
the stack. The second problem occurs because the odology for Bangla natural language sentences
140
Parsing Bangla Grammar Using Context Free Grammar
is proposed by Hoque and Ali (2003) and shows Bangla sentences in semantic manner is presented
how phrase structure rules can be implemented by in (Hoque, Rahman & Dhar, 2007) and (Hasan,
top-down and bottom-up parsing approach to parse Mondal & Saha, 2010) presents a technique for
simple sentences of Bangla. An approach named detecting the named entity based on classifier
Predicate Preserving Parser (PPP) is described for Bangla documents. But these papers (Anwar,
in (Ali, Ripon and Allayear 2012) which maps Shume & Bhuiyan, 2010; Hoque, Rahman &
Bangla text in UNL which then can be translated KumarDhar, 2007; Hasan, Mondal & Saha, 2010)
to any other natural languages. A technique of do not deal with the detail grammar recognition
unsupervised morphological learning for Bengali for Bangla sentences.
language is introduced in (Dasgupta & Ng, 2006).
A Bengali Dependency Parser based on statistical
data driven parsing system followed by a rule PARSING BANGLA SENTENCES
based post processing is presented in (Ghosh et. WITH CONTEXT FREE GRAMMAR
al., 2010). A comprehensive approach for Bangla
syntax analysis was developed (Mehedy, Arifin Context Free Grammar
& Kaykobad, 2003) where a formal language is
defined as a set of strings. Each string is a con- The most commonly used and mathematical sys-
catenation of terminal symbols. tem for modeling constituent structure in natural
Some other approaches such as Lexical Func- languages is the Context-Free Grammar, or CFG.
tional Grammar (LFG) (Sengupta & Chaudhuri, Context-free grammars are also called Phrase
1997), and Context Sensitive Grammar (CSG) Structure Grammars. A Context-Free Grammar
(Hoque & Ali, 2004; Murshed, 1998; Selim (CFG) is a set of recursive rewriting rules called
& Iqbal, 1999) have also been developed for productions used to generate string patterns. A
parsing Bangla sentences. LFG is a monotonic CFG consists of the following components (Aho,
theory of syntax. Instead of postulating different Sethi & Ullman, 2002):
derivational levels represented in the same formal
language, it incorporates different parallel levels 1. A start symbol for the grammar.
of information, which can all potentially access 2. A set of terminal symbols that are characters
each other, each with its own formal language. The that appear in the strings generated by the
assumption about parallel levels of information grammar.
extends even to non-syntactic aspects of gram- 3. A set of non-terminal symbols that are
mar. Thus, for example, semantic information placeholders for patterns of terminal symbols
is assumed to be available to various levels of that can be generated by the non-terminal
syntax, and syntactic levels can input into phonol- symbols.
ogy (Joshi, 1993). 4. A set of productions that are used to replace
Some developed Bangla parser using SQL to the non-terminals with other non-terminals
check the correctness of sentence; but its space or terminals.
complexity is inefficient. Besides, it takes more
time for executing SQL command. As a result A formal language is context-free if there is a
those parsers becomes slower. A technique is context-free grammar that generates it.
implemented to perform structural analysis of This chapter describes a way of producing
Bangla sentences of different tenses using Context- CFG for Bangla Language and designing a Bangla
Free Grammar (CFG) rule (Anwar, Shume & parser. The grammar discussed throughout the
Bhuiyan, 2010). A methodology for analyzing the chapter can successfully parse a number of sen-
141
Parsing Bangla Grammar Using Context Free Grammar
tences. We have defined a tag set and according Context-free grammars play a central role in the
to the tag set, we tagged the word. In the next step description and design of programming languages
of producing CFG, we defined the constituents and compilers. They are also used for analyzing
and finally the rules were generated. We used the syntax of natural languages. Noam Chomsky
top-down parsing techniques for designing Bangla (1956) has posited that all human languages are
parser by using XML. Our parser decides which based on context-free grammars at their core, with
sentence is correct and which is to be rejected. additional processes that can manipulate the output
And finally generate a parse tree for this sentence of the context-free component (the transformations
by using Bangla Parser. of early Chomskyan theory).
There are a number of important subclasses It is found that most parsing methods fall into
of the context-free grammars: one of these two classes, named as top-down
parsing method and bottom-up parsing method.
• LR(k) grammars (also known as deter- These terms refer to the order in which nodes in
ministic context-free grammars) allow the parse tree are constructed.
parsing (string recognition) with deter- In top-down parsing method, construction
ministic pushdown automata, but they can starts at the root and proceeds towards the leaves
only describe deterministic context-free i.e., this method works from sentence symbol to
languages. the sentence. Parsers designed using top-down
• Simple LR, Look-Ahead LR grammars are parsing method are known as top-down parsers.
subclasses that allow further simplification On the contrary, in case of bottom-up pars-
of parsing. ing method, construction starts at the leaves and
• LL(k) and LL(*) grammars allow parsing proceeds towards the root, i.e., this method works
by direct construction of a leftmost deriva- from the sentence to sentence symbol. And pars-
tion as described above, and describe even ers which are designed using bottom-up parsing
fewer languages. method are known as bottom-up parser.
• Simple grammars are a subclass of the There are some well-recognized problems
LL(1) grammars mostly interesting for its when concerning with bottom up parsing. Bottom
theoretical property that language equal- up parsing suffers from termination problem. It is
ity of simple grammars is decidable, while inefficient when dealing with empty categories.
language inclusion is not. Bottom up parsing is data directed as it attempts
• Bracketed grammars have the property that to parse the words that are there and inefficient
the terminal symbols are divided into left when there is great lexical ambiguity (grammar-
and right bracket pairs that always match driven control might help here).
up in rules. Parsing of Bangla sentences by using grammar
• Linear grammars have no rules with more rules is still in a rudimentary stage. Very little
than one non terminal in the right hand research has been conducted regarding parsing
side. of Bangla sentences but a significant number of
• Regular grammars are a subclass of the research activities have been conducted on the
linear grammars and describe the regular recognition of Bangla characters. Some developers
languages, i.e., they correspond to finite developed Bangla parser using SQL to check the
automata and regular expressions. correctness of sentence. But it takes more space
on our computers, meaning its space complexity
142
Parsing Bangla Grammar Using Context Free Grammar
is inefficient. Besides, it takes more time for ex- 3. Parse Table: A two dimensional array
ecuting SQL command. As a result those parsers M[A,n] where A is nonterminal and n is a
becomes slower. To decrease time and space, terminal or $ sign.
XML files re used for storing data which is more 4. Output.
speedy and takes up less space.
The extensions and refinements presented in Building the Word Repository/
this chapter over previous works especially in Dictionary Using XML
Hasan et al. (2011) are as follows:
XML (eXtensible Markup Language) has been
• A set of CFG rules. designed as a markup language and a textual for-
• Store data into XML file. mat. It provides for a description of a document’s
• Design a Bangla Parser by using table- contents, with non-predefined tags, and does not
driven method with top-down approach. provide for any presentational characteristics. The
• Remove grammar ambiguity by using left eXtensible Markup Language (XML) is now used
factoring. almost everywhere for its simplicity and ease of
• Construct a parse tree. use. It is a good format to store any kind of data.
The tasks behind using XML are always the same:
Bangla words are stored as tag and its corre- reading data from XML and writing data into it.
sponding constituents or POS are used as value. XML have several advantages. Some of them are:
More over the idea of left factoring have been
applied to remove the left recursion and ambigu- 1. Ease of access.
ity of the grammar and hence according to the 2. Efficient in response to access time.
designed grammar, the parser can detect the errors 3. Easy to format the grammar structure.
in a Bangla sentence. 4. No need to use any space consuming DBMS.
5. Ease of defining the category of words.
6. Easy to implement and integrate.
SOLUTION AND
RECOMMENDATIONS The general format of XML tag is
143
Parsing Bangla Grammar Using Context Free Grammar
Noun (noun)
রহিম, বিদ্যালয়ে, ছেলে, বই,
ভাত,টাকা| NP −> noun conjunction noun | noun ip| noun
pronoun conjunction pronoun| noun pronoun
Pronoun (pronoun) আমি, আমরা, তু মি, সে, তারা।
ip| noun pronoun noun| noun pronoun adjective|
Adjective (adjective) ভাল, দশ, অনেক, শখ। noun pronoun pronoun| noun pronoun tp| noun
Verb (verb) খাই, খেলে, পড়ছে,পড়া। adjective| noun noun conjunction pronoun| noun
Conjunction (conjunction) এবং, ও, চেয়ে| noun aw| noun noun| noun| noun pronoun| pronoun
Negative Description (neg) না , নয়| conjunction pronoun| pronoun ip| pronoun noun
conjunction noun| pronoun noun ip| pronoun noun
Modifier (modifier) এ, একটি, একদিন|
adjective| pronoun noun conjunction pronoun|
pronoun noun aw| pronoun noun| pronoun ad-
jective| pronoun pronoun conjunction pronoun|
144
Parsing Bangla Grammar Using Context Free Grammar
Table 3. Constituents for developing the grammar VP −> noun verb| noun verb verb ptrn| noun verb
Constituents
verb adjective verb| noun verb verb adjective noun|
Productions Examples
(Symbol) noun verb verb adjective pronoun| noun verb verb
NP −> noun রহিম, adjective| noun verb verb noun adjective verb|
Noun Phrase NP → pronoun noun verb verb noun ptrn| noun verb verb noun
(NP) NP → modifier noun আমি,
etc. এ পথে। aw| noun verb verb noun adjective verb| noun verb
VP −> noun verb
verb noun verb ptrn| noun verb verb noun verb aw|
বই পড়ছে, noun verb verb noun verb adjective| noun verb
Verb Phrase VP −> noun verb verb
(VP) VP −> noun verb ptrn পথ্য সেবন করে, verb noun ptrn| nounverb verb noun pronoun|
etc. খাওয়া হল না। noun verb verb| pronoun adjective verb| pronoun
AP −> adjective
Adjective
AP −> adjective noun
দশ, verb verb ptrn| pronoun verb verb adjective verb|
Phrase (AP) ভাল ছেলে| pronoun verb verb adjective noun| pronoun verb
etc .
verb adjective pronoun| pronoun verb ptrn| pro-
noun verb aw| pronoun verb adjective| pronoun
pronoun pronoun aw| pronoun pronoun| pronoun verb pronoun| pronoun verb noun| pronoun ptrn |
tp| pronoun| modifier noun| modifier adjective ptrn| verb verb verb adjective| verb verb verb adjective
modifier adjective| modifier pronoun| modifier verb| verb verb verb ptrn| verb verb verb pronoun|
conjunction adjective| modifier| adjective noun| verb verb ptrn| verb verb adjective verb| verb verb
adjective pronoun| adjective conjunction adjective| adjective noun| verb verb adjective pronoun| verb
adjective ptrn| adjective| ip| tp| xp ip| xp pronoun verb noun adjective verb| verb verb noun verb
conjunction pronoun| xp pronoun ip| xp pronoun ptrn| verb verb noun verb aw| verb verb noun
noun| xp pronoun adjective| xp pronoun tp| xp verb adjective| verb verb noun ptrn| verb verb
adjective| xp noun conjunction pronoun| xp noun noun verb pronoun| verb adjective| verb adjective
aw| xp noun| xp tp| xp aw| tp pronoun| tp adjective| verb ptrn| verb adjective noun| verb adjective verb
tp ip| tp pronoun conjunction pronoun| tp pronoun noun| verb adjective verb noun verb | adjective
noun| tp pronoun adjective| tp pronoun tp| tp noun noun verb adjective verb| adjective noun verb
conjunction pronoun| tp noun aw| tp aw. ptrn| adjective noun verb pronoun| adjective noun
adjective ptrn| adjective noun adjective aw| adjec-
AP −> adjective noun| adjective pronoun| tive noun adjective | adjective pronoun| adjective
adjective ptrn ptrn | conjunction
145
Parsing Bangla Grammar Using Context Free Grammar
The above grammar G is ambiguous. The parser AP1 −> noun | adjective AP2 | pronoun |
generated from this kind of grammar is not ef- conjunction AP | ε
ficient as it requires backtracking. To remove
the ambiguity from the grammar the grammar AP2 −> ptrn | ε
productions are reconstructed by left factoring.
Left factoring is a grammar transformation VP −> noun VP1| pronoun VP4 | verb VP2 | AP
useful for producing a grammar suitable for pre- VP3 | conjunction
dictive parsing. The basic idea is that when it is
not clear which of the productions are to use to VP1 −> verb VP2 | adjective VP3 | pronoun
expand a non terminal then it can defer to take noun |noun pronoun
decision until we get an input to expand it. In
general, if we have productions of form: VP2 −> verb VP3 | ptrn | aw | AP VP3 | ε
A → αβ1 αβ2 VP3 −> verb VP4 | ptrn | adjective VP5 | noun
VP4 | ε
We left factored productions by getting the
input α and break it as follows: VP4 −> AP verb | verb VP2| ptrn | pronoun
A′ → β1 β2
There have been lexical productions which
are as follows:
The above grammar is correct and is free from
conflicts. After left factoring the grammar G, the noun −> রহিম| বিদ্যালয়ে| ছেলে| বই| ভাত| টাকা
reconstructed left factored grammar productions | কণে | সিংহ | ঘাস
are as below:
pronoun −> আমি | আমরা | তু মি | সে | তারা
S −> NP VP
verb −> খাই | খেলে | করবে | পড়া | চলে | যাই
NP −> noun NP1|pronoun NP2|modifier
AP1|conjunction NP1| AP | NP2 |xp NP1 |tp adjective −>
ভাল | বলবান | গুণবান | একটু |
NP1 দশ | অনেক | শখ | দ্রুত
NP1 −> conjunction noun |ip | pronoun NP2 | conjunction −> এবং | ও | চেয়ে
adjective | noun NP3 | tp | aw |ε
modifier −> এ | একটি | একদিন | এই
NP2 −> conjunction pronoun | ip | noun NP1 |
adjective | pronoun NP3 | tp | ε ptrn −> না | নয়
NP3 −> conjunction pronoun| aw | ε
146
Parsing Bangla Grammar Using Context Free Grammar
147
Parsing Bangla Grammar Using Context Free Grammar
FOLLOW(S) = {noun, adjective, pronoun, Parse table is a two-dimensional array where each
modifier, ip, xp, tp, conjunction} row is leveled with non terminal symbol and each
column is marked with terminal or special symbol
FOLLOW(NP) ={noun, verb, adjective, $. Each cell holds a production rule.
pronoun} Let M[m, n] be a matrix where m is the num-
ber of non terminals in grammar G and n is the
FOLLOW(NP1) = {noun, verb, adjective, number of distinct input symbols that may occur
pronoun} in a sentence of grammar G. Table 4 shows the
resulting parse table for the grammar G constructed
FOLLOW(NP2) = {noun, verb, adjective, by applying the following rules:
pronoun}
1. For each production of the form A → α of
FOLLOW(NP3) = {noun, verb, adjective, the grammar G for each terminal a in first(α),
pronoun} add A → α to M[A, a].
2. If ε is in first(α) then for each terminal in
FOLLOW(AP) = {noun, verb, adjective,
FOLLOW(A) add A → α to M[A, a].
pronoun, ptrn, modifier, ip, xp, tp, conjunction}
The table shows all the valid entries for the
FOLLOW(AP1) = {noun, verb, adjective,
parse table. Entries not in this table are error en-
pronoun, ptrn, modifier, ip, xp, tp, conjunction}
tries. All other undefined entries of the parsing
FOLLOW(AP2) = {noun, verb, adjective, table are error entries.
pronoun, ptrn, modifier, ip, xp, tp, conjunction}
Parse Tree Generation
FOLLOW(VP) = {noun, adjective, pronoun,
modifier, ip, xp, tp, conjunction, $} A parse tree for a grammar G is a tree where the
root is the start symbol for G, the interior nodes
FOLLOW(VP1) = {noun, adjective, pronoun, are the non terminals of G and the leaf nodes are
modifier, ip, xp, tp, conjunction, $} the terminal symbols of G. The children of a node
T (from left to right) correspond to the symbols
FOLLOW(VP2) = {noun, adjective, pronoun, on the right hand side of some production for T
modifier, ip, xp, tp, conjunction, $} in G. Every terminal string generated by a gram-
mar has a corresponding parse tree and every
FOLLOW(VP3) = {noun, adjective, pronoun, valid parse tree represents a string generated by
modifier, ip, xp, tp, conjunction, $} the grammar. We store the parse table M using a
148
Parsing Bangla Grammar Using Context Free Grammar
two-dimensional array. To read an element from a Initially, the parser is in a configuration with
two-dimensional array, we must identify the sub- s$ in the input buffer and the start symbol S on
script of the corresponding row and then identify top of the stack and then $.
the subscript of the corresponding column. For
example, the production “NP→ modifier noun” Managing Non-Traditional Forms
is in row 2, column 3, (see Table 4) so it is identi-
fied as M[2][3]. The structure of Bangla language may change in
Let us explain the grammar for the sentence a sentence although the meaning does not change.
একটি ছেলে বই পড়ছে. Using the XML data file For example, the sentence আমি ভাত খাই can also
we get the tags “modifier noun noun verb”. Us- be written as আমি খাই ভাত or ভাত আমি খাই.
ing the production S→ NPVP of the grammar The latter two forms are also correct and have the
G the sentence matches to NP noun verb; in the same meaning. Hence it is sometimes difficult to
second iteration the noun verb part matches to detect such reorganized non traditional forms in
noun VP1. The VP1 in turn matches to verb VP4 a single grammar. The proposed grammar can
and VP4 produces ε. Table 5 shows the moves of detect the the non traditional forms. For example,
our implementation using a stack. the grammar G detects the forms “আমি খাই ভাত”
and “ভাত আমি খাই” are shown in Table 6 and
149
Parsing Bangla Grammar Using Context Free Grammar
Table 5. Moves made by a Bangla parser on in- Table 7. Moves made by a Bangla parser on input
put “modifier noun noun verb” for correct sen- “noun pronoun verb” for correct sentence “ভাত
tence আমি খাই”
Table 6. Moves made by a Bangla parser on input Figure 2. Parse tree for Bangla parser on input
“pronoun verb noun” for correct sentence “আমি “modifier noun noun verb”
খাই ভাত”
150
Parsing Bangla Grammar Using Context Free Grammar
151
Parsing Bangla Grammar Using Context Free Grammar
152
Parsing Bangla Grammar Using Context Free Grammar
plex, ambiguous, and specified by collections of Anwar, M. M., Anwar, M. Z., & Bhuiyan, M. A.
examples rather than complete formal rules. The A. (2009). Syntax analysis and machine transla-
aim of this Chapter was to Design and Develop tion of Bangla sentences. International Journal
algorithms for Bangla Parser, which can parse of Computer Science and Network Security, 9(8),
Bangla language. Parsing is a process of trans- 317–326.
forming natural language into an internal system
Anwar, M. M., Shume, N. S., & Bhuiyan, M. A.
representation, which can be trees, dependency
A. (2010). Structural analysis of Bangla sentences
graphs, frames or some other SR (Structural Rep-
of different tenses for automatic Bangla machine
resentation). If a natural language is successfully
translator. International Journal of Computer Sci-
parse then grammar checking from this language
ence and Information Security, 8(9).
becomes easy. Parsing a sentence that involves
finding a possible legal structure for sentence Dasgupta, S., & Ng, V. (2006). Unsupervised
and finally gets an SR. An SR generally graphic morphological parsing of Bengali. Language Re-
object and this representation cannot be deals sources and Evaluation, 40, 311–330. doi:10.1007/
with computer. The standard representation of an s10579-007-9031-y.
SR is a list that are one of the data structure that
Faridee, A. Z. M., & Tyers, F. M. (2009). Devel-
can be implemented and manipulated very easily
opment of a morphological analyzer for Bengali.
within a computer. Our proposed CFG rules can
In Proceedings of the First International Work-
be assigned all types of Bangla sentences into an
shop on Free/Open-Source Rule-Based Machine
SR. To represents a complex and a compound
Translation, (pp. 43-50). IEEE.
sentence into an SR, we used the decomposition
technique. Further modify and extending the CFG Ghosh, A., Das, A., Bhaskar, P., & Bandyopadhyay,
rules (with the addition of sentences consisting S. (2010). Bengali parsing system. Paper presented
of idioms and phrases, double word, change of at ICON NLP Tool Contest 2010.
voice, and narration), we can able to represents
Haque, M. N., & Khan, M. (2005). Parsing Bangla
the all kinds of Bangla sentences which are used
using LFG: An introduction. BRAC University
as an input of an MT engine to produced other
Journal, 2(1), 105–110.
equivalent sentences.
Hasan, K. M. A., Mahmud, A., Mondal, A., &
Saha, A. (2011). Recognizing Bangla grammar
REFERENCES using predictive parser. International Journal of
Computer Science & Information Technology,
Aho, A. V., Sethi, R., & Ullman, J. D. (2002). 3(6). doi: doi:10.5121/ijcsit.2011.3605.
Compilers principles, techniques and tools. New
York: Pearson Education. Hasan, K. M. A., Mondal, A., & Saha, A. (2010).
A context free grammar and its predictive parser
Ali, M. N. Y., Ripon, S., & Allayear, S. M. (2012). for Bangla grammar recognition. In Proceedings
UNL based Bangla natural text conversion: of International Conference on Computer and
Predicate preserving parser approach. Interna- Information Technology, (pp. 87 – 91). IEEE.
tional Journal of Computer Science Issues, 9(3),
259–265. Hoque, M. M., & Ali, M. M. (2003). A parsing
methodology for Bangla natural language sen-
tences. In Proceedings of International Confer-
ence on Computer and Information Technology,
(277-282). IEEE.
153
Parsing Bangla Grammar Using Context Free Grammar
Hoque, M. M., & Ali, M. M. (2004). Context- Yarowsky, D. (1995). Unsupervised word sense
sensitive phrase structure rule for structural rep- disambiguation rivaling supervised methods. In
resentation of Bangla natural language sentences. Proceedings of 33rd Annual Meeting of the ACL,
In Proceedings of International Conference on (pp. 189-196). ACL.
Computer and Information Technology, (pp.
615-620). IEEE.
Hoque, M. M., Rahman, M. J., & Dhar, P. K. KEY TERMS AND DEFINITIONS
(2007). Lexical semantics: A New approach to ana-
lyze the Bangla sentence with semantic features. Bottom-Up Parsing: Construction starts at
In Proceedings of the International Conference the leaves and proceeds towards the root, i.e.
on Information and Communication Technology, this method works from the sentence to sentence
(pp. 87 – 91). IEEE. symbol.
Context-Free Grammar (CFG): A set of
Joshi, S. (n.d.). Selection of grammatical and logi- recursive rewriting rules called productions used
cal functions in Marathi. (PhD Thesis). Stanford to generate string patterns.
University. Palo Alto, CA. FIRST(): FIRST(α) be the set of terminals that
Mehedy, L., Arifin, N., & Kaykobad, M. (2003). begin the strings derived from α. If α → ε then ε
Bangla syntax analysis: A comprehensive ap- is also included in FIRST(α).
proach. In Proceedings of International Confer- FOLLOW(): FOLLOW(A) of a non terminal
ence on Computer and Information Technology A is the set of terminals a that can appear im-
(ICCIT), (pp. 287-293). ICCIT. mediately to the right of A. If A is the right most
symbol in the sentential form then $ is added to
Murshed, M. M. (1998). Parsing of Bengali natural FOLLOW(A).
language sentences. In Proceedings of Interna- Parse Tree: A tree where the root of the tree
tional Conference on Computer and Information is the start symbol for the grammar, the interior
Technology (ICCIT), (pp. 185-189). ICCIT. nodes are the non-terminals of the grammar, the
Saha, G. K. (2006). Parsing Bengali text: An leaf nodes are the terminals of the grammar and
intelligent approach. ACM Ubiquity, 7(13), 1–5. the children of a node starting from the left to
doi:10.1145/1132512.1127026. the right correspond to the symbols on the right
hand side of some production for the node in the
Selim, M. R., & Iqbal, M. Z. (1999). Syntax analy- grammar.
sis of phrases and different types of sentences in Parser: A parser analyzes the sequence of
Bangla. In Proceedings of International Confer- symbols presented to it based on the grammar.
ence on Computer and Information Technology Parsing: A method where a parser algorithm
(ICCIT), (pp. 175-186). ICCIT. is used to determine whether a given input string
Sengupta, P., & Chaudhuri, B. B. (1997). A delayed is grammatically acceptable or not for a particular
syntactic-encoding-based LFG parsing strategy language.
for an Indian language – Bangla. Computational Top-Down Parsing: In top-down parsing
Linguistics, 23(2), 345–351. method, construction starts at the root and proceeds
towards the leaves i.e., this method works from
sentence symbol to the sentence. Parsers, which
are designed using Top-down parsing method, are
known as top-down parser.
154