Myanmar Spell Checker
Abstract: Natural Language Processing (NLP) is one of the most important research areas in the study of human language. For every language, a spell checker is an essential component of many office automation systems and machine translation systems. In this paper, we develop a Myanmar Spell Checker system which can handle typographic errors, sequence errors, phonetic errors, and context errors. A Myanmar text corpus is created for developing the spell checker. To check typographic errors, a corpus look-up approach is applied. Myanmar3 Unicode is used in this system so that the character sequence is reordered automatically. A compound misused word detection algorithm is proposed for checking phonetic errors, and a Naïve Bayesian classifier is applied for checking context errors. The Levenshtein Distance Algorithm is used to improve users' efficiency by providing a suggestion list for misspelled Myanmar words. We provide evaluation results of the system, which show that our approach can handle various types of Myanmar spelling errors.
Keywords: Levenshtein Distance Algorithm, Myanmar Spell Checker, Myanmar Text Corpus, Natural Language Processing, Naïve Bayesian Classifier
words. The user can select a suggestion from the list, ignore the suggestion, or add the word to the dictionary. Htay et al. [15] presented Myanmar word segmentation using a syllable-level longest matching approach. They used a combination of stored lists, suffix removal, morphological analysis and syllable-level n-grams to hypothesize valid words with about 99% accuracy. The author of [2] presented an approach consisting of an approximate word matching method, an N-best word segmentation algorithm and a statistical language model. This word-based correction method was proposed for Optical Character Recognition errors and outperforms the conventional character-based correction method. Fossati et al. [14] addressed the problem of real-word spell checking and proposed a methodology based on a mixed trigrams language model. Their model was trained and tested with data from the Penn Treebank, and their approach was evaluated in terms of hit rate, false positive rate and coverage.
Golding [3] proposed a hybrid method for context-sensitive spelling correction that combines a Bayesian classifier and decision lists. Semantic and grammatical features were extracted from the context of the members of a confusion set using corpora. Chaudhuri [5] described a technique for locating and correcting non-word errors. It pinpoints the error position in a large majority of cases and thus greatly reduces the number of correct alternatives. The approach is based on matching the string in the normal as well as a reversed dictionary. It is combined with a phonetic similarity key based approach in which phonetically similar characters are mapped into a single symbol and a nearly phonetic dictionary is formed.
In this paper, we propose a Myanmar spell checker system for handling Myanmar spelling errors by applying a Myanmar text corpus, the Levenshtein Distance Algorithm and a Naïve Bayesian classifier.

3. Nature and Collation of Myanmar Language
Myanmar is a very rich language and is used as the official language of the Union of Myanmar. A Myanmar syllable has a base character, and may also have a pre-base character, a post-base character, an above-base character and a below-base character. Syllables have to be constructed from these characters, and each syllable begins with a base consonant. The Myanmar language has 33 consonants; a consonant combines with vowels, and sometimes with medials, to form a complete syllable. In addition, there is no delimiter between syllables or between words. Myanmar words are collated syllable by syllable. A Myanmar syllable encoded in Unicode can be broken into 5 parts for collation [9]: <consonants> <vowels> <medial> <final> <tone>. In a given sentence, typographic errors (non-word errors) and cognitive errors (phonetic errors) involve two or more syllables. Context errors (real-word errors), in contrast, involve only one syllable and are ambiguous for a poor reader.
The Myanmar saying “the pronunciation is merely the sound, whilst the orthography is correct” (a&;awmhtrSef? zwfawmhtoH) reflects the differences between spoken and written Myanmar, as spelling is often not an accurate reflection of pronunciation. Some writers spell words as they are pronounced and are careless about spelling errors. In the Myanmar language, every isolated word has a meaning, and there are also compound words. However, some words cannot be combined into a compound word, and when two words are combined the meaning of the compound may differ from that of its parts. For example, (pdrf; → green) and (vef; → fresh) combine to give (pdrf;vef; → green and lush). A typist may confuse the word (vef;) with (vrf; → road). The two words (vef; and vrf;) have the same pronunciation but different meanings, and there is no valid combination of (pdrf; and vrf;). Collocation of Myanmar words depends on the meaning of the preceding words, and one word can have different meanings and different usages. Spell checking is therefore a major issue and challenge for all computerized applications of the Myanmar language. Myanmar syllables and their defined symbols are shown in Table 1.

Table 1: Types of Myanmar Syllables and Defined Symbols

Syllables                    Type of Syllables    Defined Symbols
u-t                          Consonants           C
j- ? -s                      Medials              M1
-S ? -G                      Medials              M2
a                            Vowels               V1
-m?-d ?-D?-k ?-l?–J ?-H      Vowels               V2
-f                           Final                F
-h ?-;                       Tone                 T

Common misused characters and sample words are shown in Table 2.

Table 2: Common Misused Characters and Sample Words

Common Misused Characters    Sample Words
u?c?*                        url? *rl ? cHk;wHwm;? *Hk;wHwm;
p ? q ? Z ? ps               Z&uf ? quf&u?f pm;? qm;
P?e                          tEkjrL? tPkjrL
y?z?b?A                      zl;? bl;? Al; ? ykef;? bkef;
o?w                          oHk; ?wHk;
‘' ? "                       "g; ? 'g;
,?&                          ,uf ? &uf
M ? –s                       Mum; ?usm;
wf ? uf ? yf                 wwf ?wuf? wyf
5.1 Corpus Creation
A corpus is a large and structured set of texts. It is used for spell checking, for checking occurrences, and for validating linguistic rules within a specific domain. It is also a fundamental basis of much research in NLP. Building a text corpus is very helpful for the development of a spell checker. In this work, a Myanmar text corpus is created manually for use in the Myanmar Spell Checker system. It contains the various senses of ambiguous Myanmar words, compound words and training sentences. All words are collected from the example sentences of “Myanmar Grammar” [10], “Myanmar Words Commonly Misspelled and Misused” [7], “Ornagai Dictionary” [16] and “Wxpy Dictionary” [17].
The Myanmar syllable file, which consists of 1908 syllables, is used for checking typographic errors. The Myanmar compound words file, which consists of 62582 compound words, is used for checking compound misused errors (words misused because of phonetic errors) and for segmenting the input string into words. The Myanmar training sentences file consists of 3600 sentences with an average of 12 words per sentence. The training sentences are used for calculating the probabilities for context word errors.

5.2 Tokenization
Tokenization is a preprocessing step for this system. It is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics and in computer science, where it forms part of lexical analysis. Typically, tokenization occurs at the word level. However, it is sometimes difficult to define what is meant by a "word". Tokens are typically separated by whitespace characters, such as a space or line break, or by punctuation characters. In languages such as English, where words are delimited by whitespace, this approach is straightforward [19]. However, tokenization is more difficult for languages such as Myanmar, Thai, Japanese, and Chinese, which have no explicit word boundaries.
Myanmar text is a string of characters without explicit word boundaries, so it is hard to define where a word begins and ends. In this paper, we describe a regular expression and pattern for tokenization of a word boundary with Finite State Automata. An automaton can be said to recognize a string [1]. In Myanmar3, the start state always begins with a consonant (C), and the end state is represented with a double circle. Each character in the input string passes along the corresponding edge to the next state. When the final state is reached, the automaton accepts the input string and returns a word with its boundary. According to the Myanmar word collation rule (e.g., ေက်ာင္း = က -် ေ- -ာ င -္ -း <C M1 V1 V2 C F T>), we define the Finite State Automata in Figure 3.

Examples of Myanmar Syllable collation
ေက်ာင္း = က -် ေ- -ာ င -္ -း   <C M1 V1 V2 C F T>
လ်ွင္ = လ -် -ွ -င -္   <C M1 M2 C F>
ျမန္ = မ ျ - န -္   <C M1 C F>
ေကာက္ = က ေ- -ာ က -္   <C V1 V2 C F>
စိုက္ = စ -ိ -ု က -္   <C V2 V2 C F>

6. Implementation of Myanmar Spell Checker
6.1 Detection of Typographic Errors
Non-word error correction is an important task. It focuses on generating and ranking a list of possible spelling corrections for each word that does not exist in the corpus. It covers both isolated-word error checking and suggestion generation. The main steps of the typographic error checking process are:
1. Look up the word in the corpus.
2. If the word exists, pass on to the next word.
3. If the word is not found in the corpus, calculate the similarity between the erroneous word and the words in the corpus to generate a suggestion list.

6.2 Detection of Phonetic Errors
A phonetic error is a special class of real-word error in which the writer substitutes a phonetically correct but orthographically incorrect sequence of letters for the intended word. Moreover, there exists a class of real-word errors in which the misspelling results in a valid word. This occurs because the language contains words with similar pronunciation but different meanings. In this paper, we
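
As an illustration of the syllable patterns in Section 5.2, the sketch below maps each character to a symbol of Table 1 and checks the resulting symbol sequence with a regular expression. The code-point ranges, the function names `symbols` and `is_syllable`, and the pattern itself are assumptions chosen so that the five example decompositions are accepted; they are not the automaton of Figure 3, which is not reproduced here. The input is assumed to be in standard Unicode storage order rather than a legacy font encoding.

```python
import re

def symbol(ch: str) -> str:
    """Map one Myanmar code point to a Table 1 symbol (assumed ranges)."""
    cp = ord(ch)
    if 0x1000 <= cp <= 0x1021:
        return "C"    # consonants
    if cp in (0x103B, 0x103C):
        return "M1"   # medials ya and ra
    if cp in (0x103D, 0x103E):
        return "M2"   # medials wa and ha
    if cp == 0x1031:
        return "V1"   # vowel sign E
    if cp in (0x102B, 0x102C, 0x102D, 0x102E, 0x102F, 0x1030, 0x1032, 0x1036):
        return "V2"   # remaining vowel signs (anusvara counted here)
    if cp == 0x103A:
        return "F"    # asat, marks a final consonant
    if cp in (0x1037, 0x1038):
        return "T"    # tone marks
    raise ValueError(f"unsupported character U+{cp:04X}")

def symbols(syllable: str) -> str:
    """Return the space-separated symbol sequence of a syllable."""
    return " ".join(symbol(ch) for ch in syllable)

# Hypothetical pattern: a syllable starts with a consonant, followed by
# optional medials, vowels, a final consonant with asat, and a tone mark.
SYLLABLE = re.compile(r"C( M1)?( M2)?( V1)?( V2)*( C F)?( T)?")

def is_syllable(syllable: str) -> bool:
    return SYLLABLE.fullmatch(symbols(syllable)) is not None

# First example of Section 5.2, written with Unicode escapes.
print(symbols("\u1000\u103b\u1031\u102c\u1004\u103a\u1038"))
```

A string that does not begin with a consonant, such as a lone vowel sign followed by a consonant, is rejected because the pattern requires `C` as its first symbol, mirroring the start state described above.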
Volume 2 Issue 1, January 2013
336
www.ijsr.net
International Journal of Science and Research (IJSR), India Online ISSN: 2319-7064
spelt word is encountered and operates with Insert, Delete and Substitute transformations. At the end, the bottom-right element of the array contains the answer. The resulting distance is the number of deletions, insertions, or substitutions required to transform s into t [18].

6.4 Detection and Suggestion Generation of Context Errors
In the Myanmar language, most context words are verbs. For example, the members of the confusion set {rSD? rD} have the same pronunciation but different meanings. The word “rSD” translates to English as “base on some fact or evidence”, whereas “rD” combines with other Myanmar nouns and verbs, for example in the sense “be with (reach, time, limit)” (as in vufvSrf;rD? tcsdefrD), and can also be used as a particle meaning “before; prior to” (as in roGm;rD). A confusion set consists of words that are likely to be misused in place of one another. In the following sentence, “rSD” is misused instead of “rD”: “olonfblwm&HkodkhtcsdefrSDvmonf/”. The intended meaning of that sentence is “He comes to the railway station on time”.
In the Myanmar language, all context words can be corrected by statistical techniques except for the confusion sets {zl;?bl;} and {bJ ? yJ}. The words (bl; and bJ) are always used in negative statements and always combine with (r), for example: ra&;bl;/ (not write), rpm;bJaeonf/ (live without eating). A Myanmar verb is always placed between r and bl; / bJ to express a negative statement. Myanmar context words (confusion sets) are shown in Table 4.

Table 4: Myanmar Context Words

Confusion set
yJ     bJ
zl;    bl;
zuf    buf
rSD    rD

present for word senses is that it looks at the words around the confusion set in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The supervised training of the classifier assumes that we have a corpus in which each use of an ambiguous word is labeled with its correct sense. For context error detection and correction, given a word w, candidate classification variables S = (s1, s2, ..., sk) that represent the senses of the ambiguous word, and features F = (f1, f2, ..., fn) that describe the context in which the ambiguous word occurs, the Naïve Bayesian classifier finds the proper sense s' for the ambiguous word w by selecting the sense that maximizes the conditional probability P(w = si | F). Suppose C is the context of the target word w and F = (f1, f2, ..., fn) is the set of features extracted from C. To find the right sense s' of w given context C, we have:

$$s' = \arg\max_{s_i} \left[ \sum_{f_j \in C} \log P(f_j \mid w = s_i) + \log P(w = s_i) \right] \qquad (2)$$

where
S = (s1, s2, ..., sk) are the senses of the context word,
F = (f1, f2, ..., fn) is the set of features extracted from the sentence in which a confusion word occurs,
P(si) is the probability of sense (ambiguous word) si,
P(fj|si) is the conditional probability of feature fj given sense si.

The probability of sense si, P(si), and the conditional probability of feature fj given sense si, P(fj|si), are computed via Maximum-Likelihood Estimation:

$$P(s_i) = \frac{C(s_i)}{C(w)} \qquad (3)$$

$$P(f_j \mid w = s_i) = \frac{C(f_j, s_i)}{C(s_i)} \qquad (4)$$

where C(fj, si) is the number of occurrences of fj in a context of sense si in the training corpus, C(si) is the number of occurrences of si in the training corpus, and C(w) is the total number of occurrences of the ambiguous word w [4]. To avoid the effects of zero counts when estimating the conditional probabilities of the model, when a new feature fj is met in a context of the test dataset, we set P(fj | w = si) equal to 1/C(w) for each sense si.

b) Disambiguation
for all senses si of w do
    score(si) = log P(si)
    for all words fj in the context window C do
        score(si) = score(si) + log P(fj|si)
    end
end
choose s' = arg max score(si)
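
The training and disambiguation steps of Equations (2) to (4) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each feature is counted once per training context, the fallback 1/C(w) is applied whenever C(fj, si) is zero, and the function names `train` and `disambiguate`, the sense labels and the English context words are hypothetical toy data rather than entries from the paper's corpus.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: (sense, context_words) pairs for one confusion set."""
    sense_count = Counter()            # C(s_i)
    feat_count = defaultdict(Counter)  # C(f_j, s_i)
    for sense, context in examples:
        sense_count[sense] += 1
        # Assumption: a feature counts once per context, so the estimate
        # of Eq. (4) never exceeds 1.
        feat_count[sense].update(set(context))
    return sense_count, feat_count

def disambiguate(context, sense_count, feat_count):
    """Return the sense with the highest score, following Eq. (2)."""
    total = sum(sense_count.values())  # C(w)
    best, best_score = None, -math.inf
    for sense, c_s in sense_count.items():
        score = math.log(c_s / total)  # log P(s_i), Eq. (3)
        for f in context:
            c_fs = feat_count[sense][f]
            # Eq. (4), with the 1/C(w) fallback for zero counts.
            p = c_fs / c_s if c_fs else 1.0 / total
            score += math.log(p)
        if score > best_score:
            best, best_score = sense, score
    return best

# Toy training data with hypothetical sense labels.
data = [
    ("sense_a", ["station", "time", "arrive"]),
    ("sense_a", ["train", "time"]),
    ("sense_b", ["wall", "lean"]),
    ("sense_b", ["evidence", "based"]),
]
counts = train(data)
print(disambiguate(["time", "station"], *counts))
```

Working in log space, as in the pseudocode, avoids numerical underflow when many context features are multiplied together.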
Figure 5. Process of Detection and Suggestion Generation of Myanmar Context Errors
Figure 7. Similarity Score of Suggestion Generation for Typographic Errors and Phonetic Errors based on Levenshtein Distance Algorithm
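
The suggestion step based on the Levenshtein distance can be sketched as follows. This is a minimal illustration, assuming that distances are computed over Unicode code points and that candidates are ranked by distance with ties broken alphabetically; the paper does not state these details, and the names `levenshtein` and `suggest` as well as the English toy words are hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Number of insertions, deletions and substitutions turning s into t."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                    # delete all of s[:i]
    for j in range(cols):
        d[0][j] = j                    # insert all of t[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1]                   # bottom-right element of the array

def suggest(word, corpus_words, limit=5):
    """Return up to `limit` corpus words closest to a word not in the corpus."""
    if word in corpus_words:
        return []                      # word exists, nothing to suggest
    ranked = sorted(corpus_words, key=lambda w: (levenshtein(word, w), w))
    return ranked[:limit]

print(levenshtein("kitten", "sitting"))
print(suggest("speling", {"spelling", "spewing", "selling", "apple"}, limit=2))
```

Scanning the whole corpus for every unknown word is quadratic in word length per comparison, which is acceptable for a sketch; a production checker would normally restrict the candidate set first.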
7. Experimental Results
The performance of this system is evaluated in terms of precision, recall and F-measure. Precision (P) is the percentage of correct words suggested by the system divided by the total number of errors detected by the system. Recall (R) is the percentage of correct words suggested by the system divided by the total number of sentences. The F-score is the harmonic mean of recall and precision, that is F = 2PR / (P + R). The test sentences used for evaluation consist of sentences whose words are included in the corpus, sentences that are not exactly the same as sentences in the corpus, and sentences with new words. The corpus grows over time because the tested sentences are manually added to it, in order to improve accuracy for new words that were not previously included.
In this system, we tested with 500 sentences to measure the accuracy of the system. The average number of words per sentence is 12. Figure 6 shows the detection accuracy on the test sentences using the compound word error detection algorithm. Figure 7 shows the similarity score of suggestion generation for typographic errors and phonetic errors using the Levenshtein Distance Algorithm. In that figure, suggestion generation for typographic errors reaches 100% accuracy, whereas for phonetic errors the suggestion lists reach a similarity score of 91%. Table 5 shows the accuracy of context error detection and suggestion generation. On average, the overall system achieves 95% precision, 92.33% recall and 93% F-score.

Table 5. Context Errors Detection and Suggestion Generation Results

Sentence Types                                                        Accuracy (%)
Test sentences in the corpus                                          98
Test sentences whose words are only partly included in the corpus     91
Test sentences not included in the corpus                             82

Figure 8. Overall System Evaluation Results on Accuracy of Correct Words vs. No. of Sentences

8. Conclusion
We implemented a spelling checker system for the Myanmar
language which can handle typographic errors, sequence
errors, phonetic errors and context errors. A Myanmar text
corpus is created and Myanmar3 Unicode is applied for
implementing the Myanmar Spell Checker system. We
applied the Levenshtein Distance Algorithm for generating
suggestion list. The proposed algorithm is very useful in checking compound misused word errors of the Myanmar language. This system focuses on Myanmar sentences which follow Myanmar grammar rules, and it cannot handle Pali words. This system can be applied in Myanmar NLP applications. Evaluation results show that this system can provide promising accuracy.

References
[1] P. M. Lewis II, D. J. Rosenkrantz, R. E. Stearns, Compiler Design Theory, Addison-Wesley Publishing Company, Third Printing, November 1978.
[2] M. Nagata, Context-Based Spelling Correction for Japanese OCR, Proceedings of the 16th International Conference on Computational Linguistics, 05-09, 1996.
[3] A. Golding, A Bayesian hybrid method for context sensitive spelling correction, In Proceedings of the 3rd Workshop on Very Large Corpora, 3, Jun 1996.
[4] Christopher D. Manning, Foundations of Statistical Natural Language Processing, The MIT Press, Cambridge, Massachusetts, London, England, 1999.
[5] B. Baran Chaudhuri, Reversed word dictionary and phonetically similar word grouping based spell-checker to Bangla text, LESAL workshop, 2001.
[6] T. Dhanabalan, Ranjani Parthasarathi, T. V. Geetha, Tamil Spell Checker, Resource Center for India Language Technology Solutions, TDIL newsletter, 2003, Tamilnadu, India.
[7] Myanmar Words Commonly Misspelled and Misused Book, Department of Myanmar Language Commission, Ministry of Education, Union of Myanmar, July 2003.
[8] N. UzZaman, M. Khan, A Bangla Phonetic Encoding for Better Spelling Suggestion, Center for Research on Bangla Language Processing, Proceedings of the 7th International Conference on Computer and Information Technology, Dhaka, Bangladesh, Dec 2004.
[9] K. Stribley, Collation of Myanmar in Unicode, 22 August 2005.
[10] Myanmar Grammar, Department of Myanmar Language Commission, Ministry of Education, Union of Myanmar, June 2005.
[11] N. UzZaman, M. Khan, A Comprehensive Bangla Spelling Checker, Center for Research on Bangla Language Processing, Proceedings of the International Conference on Computer Processing on Bangla, 2006.
[12] ျမန္မာစာ ျမန္မာ စကား, Department of Myanmar Language Commission, Ministry of Education, Union of Myanmar, June 2007.
[13] Md. Munshi Abdullah, Md. Zahurul Islam, Mumit Khan, Error tolerant Finite State Recognizer and String Pattern Similarity Based Spelling Checker for Bangla, Proceedings of the 5th International Conference on Natural Language Processing (ICON), 2007.
[14] D. Fossati, B. D. Eugenio, A Mixed Trigrams Approach for Context Sensitive Spell Checking, Proceedings of the 8th International Conference on Computational Linguistics and Intelligence Text, 2007.
[15] H. H. Htay, K. N. Murthy, Myanmar Word Segmentation using Syllable Level Longest Matching, Proceedings of the 6th Workshop on Asian Language Resources, 2008.
[16] http://www.ornagai dictionary.com
[17] http://www.Wxpy dictionary.com
[18] http://www.Enclyclopedia.com/ Natural language understanding / Levenshtein_distance.html
[19] http://www.wikipedia.com/Tokenization.hml

Author Profile
Aye Myat Mon is currently pursuing a Ph.D. degree at the University of Computer Studies, Mandalay, Myanmar, holds an M.C.Sc. from the University of Computer Studies, Yangon (2008), and is also an assistant lecturer. Current research interest: Natural Language Processing.
E-mail: [email protected]

Thandar Thein received her M.Sc. (Computer Science) and Ph.D. (Information Technology) degrees in 1996 and 2004, respectively, from the University of Computer Studies, Yangon (UCSY), Myanmar. She held a post-doctoral research fellowship in the Computer Engineering Department of Korea Aerospace University, with a scholarship awarded by the Korea Research Foundation Grant funded by the Korean Government. She has been a faculty member of UCSY since 1996. Currently she is a professor and supervises Ph.D. students. Her current research interests are in the areas of Software Aging, Virtualization, Green Computing, Wireless Sensor Networks and Natural Language Processing.
E-mail: [email protected]