

International Journal of Computer Applications (0975 – 8887)

Volume 39– No.9, February 2012

Sentence Boundary Detection in Kannada Language

Deepamala. N
Assistant Professor
Dept. of Computer Science
R.V. College of Engineering
Bangalore, India

Ramakanth Kumar. P
Professor and Head
Dept. of Information Science
R.V. College of Engineering
Bangalore, India

ABSTRACT
Sentence Boundary Detection is a pre-processing step for any Natural Language Processing application. Various algorithms have been used to achieve Sentence Boundary Detection or Disambiguation in different languages. In this paper, a rule-based method is proposed and tested to achieve Sentence Boundary Detection for the Kannada language. Kannada, a grammatically rich Indian language, is analyzed based on semantics and tested with a 227K bytes corpus. The code is written in C using wide characters, with support for Unicode. Results showed 99.2% success in detecting sentence boundaries.

General Terms
Natural Language Processing, Kannada Language

Keywords
Sentence Boundary Detection, Verb Suffix, Abbreviation.

1. INTRODUCTION
Sentence Boundary Detection is a preliminary step in any Natural Language Processing application. Many methods have been implemented and tested for Sentence Boundary Detection in English. Indian languages differ in semantics from English and therefore require a different kind of approach. Kannada is a southern Dravidian Indian language which is grammatically different from English. It is one of the 30 most spoken languages in the world. In this paper, a rule-based algorithm is proposed, with results, to detect sentence boundaries in Kannada text. Sentence-ending verb suffixes and abbreviations are used as the parameters to classify the sentences.

2. LITERATURE REVIEW
2.1 Sentence Boundary Detection in English and other languages
Researchers have tried many algorithms and techniques to detect sentence boundaries in English. The methods developed include a rule-based sentence boundary detection algorithm by Manning et al. [1], a Maximum Entropy method by Reynar and Ratnaparkhi [2], the Satz system by Palmer and Hearst [3], and the use of POS tagging information by Mikheev [4].

Further, Kiss and Strunk propose a language-independent method [5] based on identifying abbreviations, called the Punkt Sentence Tokenizer. Walker et al. [6] compare three approaches for boundary detection. Yuya et al. [7] follow a Statistical Language Model (SLM) and Support Vector Machine (SVM) approach to find sentence boundaries in Japanese. Pritam et al. [8] propose an algorithm using Maximum Entropy and a stop word algorithm. Many approaches have thus been followed by different researchers to disambiguate sentence boundaries in different languages.

2.2 Sentence Boundary Detection in Indian Languages
Mona et al. [9] have developed a methodology to disambiguate the period in Kannada by using lists of words below some threshold, extracted from a corpus. Very limited research has been undertaken in the area of Sentence Boundary Disambiguation for Indian languages.

3. PRESENT WORK
3.1 Algorithm
The algorithm used for Sentence Boundary Detection in this paper is based on the steps given by Manning et al. in [1]. It is a heuristic sentence division algorithm with the following steps:
• Place putative sentence boundaries after all occurrences of . ? ! (and maybe ; : -)
• Move the boundary after following quotation marks, if any.
• Disqualify a period boundary in the following circumstances:
  - If it is preceded by a known abbreviation of a sort that does not normally occur word finally, but is commonly followed by a capitalized proper name, such as Prof. or vs.
  - If it is preceded by a known abbreviation and not followed by an uppercase word. This will deal correctly with most usages of abbreviations like etc. or Jr. which can occur sentence medially or finally.
• Disqualify a boundary with a ? or ! if:
  - It is followed by a lowercase letter (or a known name).
• Regard other putative sentence boundaries as sentence boundaries.

In Kannada, there is no concept of upper case or lower case letters. Hence, the above algorithm is modified, and the steps followed are listed below:

Step1: Place putative sentence boundaries after all occurrences of . ? ! ; : - . Let the text up to the first such boundary be Sentence1.
  - If Sentence1 ends with ? ! ; : or - , regard the putative sentence boundary as a sentence boundary.
Step2: Move the putative boundary after any following quotation marks, then find the next occurrence of . ? ! ; : - . Let the text up to that delimiter be Sentence2.
Step3: Consider the last word of Sentence1 before the period. Disqualify a period boundary of Sentence1 in the following circumstances:


  - If the period is preceded by a known abbreviation of a sort that does not normally occur word finally. Such abbreviations are listed in the ABBREVIATIONS file.
Step4: Regard the putative sentence boundary of Sentence1 as a sentence boundary:
  - If the last word matches any of the verb forms that can possibly end a sentence. Such verb suffixes are listed in the VERBS_SUFFIX file.
Step5: Make Sentence2 the new Sentence1 and repeat from Step2.

The Sentence Boundary Detection algorithm proposed in this paper uses two files, discussed below.

ABBREVIATIONS File:
The ABBREVIATIONS file contains a list of abbreviations collected from Kannada newspapers, such as ಪ್ರೊ. (Prof.) and ಡಾ. (Dr.), and the Kannada transliterations of English letters such as ಎ., ಬಿ., ಸಿ. … (A, B, C, …). The latter are used as initials before the first name of a person, e.g. ಎನ್. ದೀಪಮಾಲ (N. Deepamala).

VERBS_SUFFIX File:
All three parts of a sentence take different forms to indicate different tenses (ಕಾಲ). The sentence-ending verb takes different forms based on its suffix. The VERBS_SUFFIX file contains the suffix forms of different verbs. A Kannada sentence is divided into ಕರ್ತೃ ಪದ (noun), ಕರ್ಮ ಪದ (object) and ಕ್ರಿಯಾ ಪದ (verb).
E.g.: ರಾಮನು (noun) ಕಾಡಿಗೆ (object) ಹೋದನು (verb).
Translation: Rama went to the forest.

The word preceding the period of a putative sentence is first checked against the ABBREVIATIONS file and, if it does not match, against the list of suffixes in the VERBS_SUFFIX file.

3.2 Verb Suffixes
The verb suffix forms based on tense are listed in Table 1. In Table 1, MG is masculine gender, FG is feminine gender and NG is neutral gender. Verb types and their suffixes based on meaning are listed in Table 2. Some special verb suffixes, which describe the action in terms such as When? or How?, are listed in Table 3.

Table 1. Verb classification with suffixes based on tense
Tense (ಕಾಲ) | Form, singular/plural (ವಚನ) | Verb Stem + Tense Phrase + Verb Suffix = Inflected Verb (ಧಾತು + ಕಾಲಸೂಚಕ ಪ್ರತ್ಯಯ + ಆಖ್ಯಾತ ಪ್ರತ್ಯಯ = ಕ್ರಿಯಾಪದ)
Past tense (ಭೂತ ಕಾಲ) | Singular (ಏಕವಚನ) | ಹರಿ+ದ+ಎನು=ಹರಿದೆನು
Past tense (ಭೂತ ಕಾಲ) | Singular (ಏಕವಚನ) | ಹರಿ+ದ+ಎ=ಹರಿದೆ
Past tense (ಭೂತ ಕಾಲ) | Singular (ಏಕವಚನ) | ಹರಿ+ದ+ಅನು=ಹರಿದನು (MG)
Past tense (ಭೂತ ಕಾಲ) | Singular (ಏಕವಚನ) | ಹರಿ+ದ+ಅಳು=ಹರಿದಳು (FG)
Past tense (ಭೂತ ಕಾಲ) | Singular (ಏಕವಚನ) | ಹರಿ+ದ+ಇತು=ಹರಿಯಿತು (NG)
Past tense (ಭೂತ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಹರಿ+ದ+ಎವು=ಹರಿದೆವು
Past tense (ಭೂತ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಹರಿ+ದ+ಇರಿ=ಹರಿದಿರಿ
Past tense (ಭೂತ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಹರಿ+ದ+ಅರು=ಹರಿದರು (MG)
Past tense (ಭೂತ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಹರಿ+ದ+ಉವು=ಹರಿದುವು (NG)
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉತ್ತ+ಎನೆ=ಕೊಡುತ್ತೇನೆ
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉತ್ತ+ಈಯೆ=ಕೊಡುತ್ತೀಯೆ
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉತ್ತ+ಆನೆ=ಕೊಡುತ್ತಾನೆ (MG)
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉತ್ತ+ಆಳೆ=ಕೊಡುತ್ತಾಳೆ (FG)
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉತ್ತ+ಅದೆ=ಕೊಡುತ್ತದೆ (NG)
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉತ್ತ+ಎವೆ=ಕೊಡುತ್ತೇವೆ
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉತ್ತ+ಈರಿ=ಕೊಡುತ್ತೀರಿ
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉತ್ತ+ಆರೆ=ಕೊಡುತ್ತಾರೆ (MG)
Present tense (ವರ್ತಮಾನ ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉತ್ತ+ಅವೆ=ಕೊಡುತ್ತವೆ (NG)
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉವ+ಎನು=ಕೊಡುವೆನು
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉವ+ಎ=ಕೊಡುವೆ
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉವ+ಅನು=ಕೊಡುವನು (MG)
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉವ+ಅಳು=ಕೊಡುವಳು (FG)
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Singular (ಏಕವಚನ) | ಕೊಡು+ಉವ+ಉದು=ಕೊಡುವುದು (NG)
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉವು+ಎವು=ಕೊಡುವೆವು
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉವು+ಇರಿ=ಕೊಡುವಿರಿ
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉವು+ಅರು=ಕೊಡುವರು (MG)
Future tense (ಭವಿಷ್ಯತ್ಕಾಲ) | Plural (ಬಹುವಚನ) | ಕೊಡು+ಉವು+ಉವು=ಕೊಡುವುವು (NG)

Table 2. Verb classification with suffixes based on meaning
Verb Type | Verb suffix | Example
ವಿಧ್ಯರ್ಥ ಕ್ರಿಯಾಪದ (imperative verb) | ಎನು, ಇರಿ, ಉದು, ಆಗು, ಅಲಿ, ಉವ, ಉ, ಓಣ, ಗೆ, ಆ, ಇ | ಮಾಡುವೆನು, ಮಾಡುವುದು, ಮಾಡು, ಮಾಡಲಿ, ಮಾಡಿರಿ, ಮಾಡೋಣ
ಸಂಭಾವನಾರ್ಥಕ ಕ್ರಿಯಾಪದ (potential verb) | ಆನು, ಆರು, ಆಳು, ಆವು, ಈತು/ಆತು, ಈರಿ, ಈಯೆ, ಏವು, ಏನು | ಬಂದಾರು, ಬಂದಾನು, ಹಚ್ಚೇನು, ಹಚ್ಚೇವು, ತಿಂದೀತು
ನಿಷೇಧಾರ್ಥಕ ಕ್ರಿಯಾಪದ (negative verb) | ಎನು, ಅರು, ಎವು, ಅ, ಎ, ಅಳು, ಇರಿ, ಅದು, ಅರಿ, ಅವು, ಅನು | ಮಾಡೆನು, ಮಾಡೆ, ಮಾಡರು, ಮಾಡಳು, ಮಾಡದು
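The suffix lookup that Tables 1 and 2 support can be sketched in C with wide-character functions. This is a minimal sketch under stated assumptions, not the paper's code: the names `ends_with`, `ends_with_any` and `DEMO_SUFFIXES` are hypothetical, and because the paper does not list the contents of the VERBS_SUFFIX file, the two entries below are assumed surface-form endings taken from inflected verbs in Table 1.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Returns 1 if `word` ends with `suffix`, 0 otherwise.
   An empty suffix never matches. */
int ends_with(const wchar_t *word, const wchar_t *suffix)
{
    assert(word != NULL && suffix != NULL);
    size_t wlen = wcslen(word);
    size_t slen = wcslen(suffix);
    if (slen == 0 || slen > wlen)
        return 0;
    /* Compare the tail of the word with the suffix. */
    return wcscmp(word + (wlen - slen), suffix) == 0;
}

/* Returns 1 if `word` ends with any of the `count` entries in `suffixes`. */
int ends_with_any(const wchar_t *word,
                  const wchar_t *const *suffixes, size_t count)
{
    for (size_t i = 0; i < count; i++)
        if (ends_with(word, suffixes[i]))
            return 1;
    return 0;
}

/* Hypothetical stand-in for VERBS_SUFFIX entries (an assumption; the
   real file contents are not given in the paper). Universal character
   names keep the source file ASCII-only. */
const wchar_t *const DEMO_SUFFIXES[] = {
    L"\u0CA6\u0CA8\u0CC1",             /* -danu, ending of ಹರಿದನು (past, MG) */
    L"\u0CA4\u0CCD\u0CA4\u0CA6\u0CC6"  /* -ttade, ending of ಕೊಡುತ್ತದೆ (present, NG) */
};
```

With these two entries, `ends_with_any(word, DEMO_SUFFIXES, 2)` returns 1 for ಕೊಡುತ್ತದೆ and 0 for ರಾಮನು. The entries are stored as written endings (for example ದನು rather than the bare suffix ಅನು) because a suffix-initial independent vowel such as ಅ does not appear as a separate character in the inflected word; whether the authors' file stores suffixes this way is not stated.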


Table 3: Special verb descriptors
Verb description | Example
ಇದೆ | ಕಷ್ಟವಾಗಿ ಇದೆ
ಇವೆ | ಸಾಮಾನ್ಯವಾಗಿ ಇವೆ
ಇಲ್ಲ | ಮಾಡುವುದಿಲ್ಲ
ಅಲ್ಲ | ಮಾಡುವುದಲ್ಲ

3.3 Implementation
The Sentence Boundary Detection algorithm for Kannada discussed in the previous section is implemented in C. Wide characters are used instead of characters to support Unicode. The C implementation uses wide-character string operations such as wcslen, wcstok, wcsspn and wcschr.

[Flow chart, recovered as text]
Start
1. Search for a delimiter . ? ! : ; - ; the text up to it is "Sentence 1".
2. Search for the next delimiter . ? ! : ; - ; the text up to it is "Sentence 2".
3. Is the delimiter '.'? If YES, get the word preceding '.' from Sentence 1 ("last word"). If NO, Sentence 1 is a final sentence.
4. Is "last word" found in the ABBREVIATIONS file, or is it a number (IsNumber)? If YES, append Sentence 2 to Sentence 1. If NO, search for a suffix match in the VERBS_SUFFIX file.
5. Is the suffix found in the list? If YES, Sentence 1 is a final sentence; copy Sentence 2 to Sentence 1. If NO, the sentence boundary is not detected (Failed).

Fig 1: Sentence Boundary Detection Flow chart
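The decision flow of Fig 1 can be sketched as a single pass over a wide-character string. This is a hedged illustration, not the authors' implementation: `find_boundaries`, `in_list`, `has_suffix` and `is_number` are hypothetical names, the two lists are passed in as arrays instead of being read from the ABBREVIATIONS and VERBS_SUFFIX files, abbreviations are assumed to be stored without their trailing period, and the "Failed" branch is treated here simply as "no boundary at this period". Quotation-mark handling (Step2) is left out, and `iswdigit` only recognizes ASCII digits.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Delimiters searched for in the flow chart. */
static const wchar_t DELIMS[] = L".?!:;-";

/* 1 if the slice word[0..len) equals one of the n list entries. */
int in_list(const wchar_t *word, size_t len,
            const wchar_t *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (wcslen(list[i]) == len && wcsncmp(list[i], word, len) == 0)
            return 1;
    return 0;
}

/* 1 if the slice word[0..len) ends with one of the n list entries. */
int has_suffix(const wchar_t *word, size_t len,
               const wchar_t *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        size_t slen = wcslen(list[i]);
        if (slen > 0 && slen <= len &&
            wcsncmp(word + len - slen, list[i], slen) == 0)
            return 1;
    }
    return 0;
}

/* 1 if the slice is non-empty and made only of digits. */
int is_number(const wchar_t *word, size_t len)
{
    if (len == 0)
        return 0;
    for (size_t i = 0; i < len; i++)
        if (!iswdigit((wint_t)word[i]))
            return 0;
    return 1;
}

/* Writes the index just past each accepted boundary into ends[]
   (at most max_ends entries) and returns how many were found. */
size_t find_boundaries(const wchar_t *text,
                       const wchar_t *const *abbrevs, size_t n_abbrevs,
                       const wchar_t *const *suffixes, size_t n_suffixes,
                       size_t *ends, size_t max_ends)
{
    assert(text != NULL && ends != NULL);
    size_t count = 0, start = 0, pos = 0;
    while (count < max_ends) {
        size_t i = pos + wcscspn(text + pos, DELIMS); /* next delimiter */
        if (text[i] == L'\0')
            break;
        int accept = 1;                 /* ? ! : ; - always end a sentence */
        if (text[i] == L'.') {
            size_t w = i;               /* walk back to the last word */
            while (w > start && !iswspace((wint_t)text[w - 1]))
                w--;
            const wchar_t *word = text + w;
            size_t len = i - w;
            if (in_list(word, len, abbrevs, n_abbrevs) || is_number(word, len))
                accept = 0;             /* abbreviation or number: keep going */
            else if (!has_suffix(word, len, suffixes, n_suffixes))
                accept = 0;             /* "Failed" branch, assumed non-boundary */
        }
        if (accept) {
            ends[count++] = i + 1;
            start = i + 1;
        }
        pos = i + 1;
    }
    return count;
}
```

Returning offsets rather than copying substrings keeps the sketch short; a caller can slice the input between consecutive offsets to obtain the sentences. Extending a sentence across a disqualified period falls out of the loop naturally, which corresponds to the "Append Sentence 2 to Sentence 1" box in the flow chart.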


4. DISCUSSION
Many approaches have been followed to achieve sentence boundary detection, but very little research has been undertaken on sentence boundary detection for Indian languages. Kannada, one of the Indian languages, follows a grammar and syntax completely different from English. For example, the rule that a period is followed by a capital letter at the beginning of the next sentence in English is used as a feature for sentence boundary identification [1, 5, 10, 11]. This feature is not applicable to Kannada. Palmer and Hearst [3] and Mikheev [4] use POS tagging information, but Sentence Boundary Detection can itself be a pre-processing task for a POS tagger. The algorithm proposed in Section 3 does not require POS tagging, and its rules are based on Kannada grammar.

Mona et al. [9] have explained disambiguation of sentence boundaries for Kannada. Two lists, L1 and L2, are maintained, where L1 is a sentence-ending word list and L2 is a word list extracted from a corpus. The comparison with L1 and L2 is made if the length of the last word is below 5 (the threshold). However, the authors have not mentioned the programming language used for implementation.

The algorithm proposed in Section 3 is unique in that a generic list of verb suffixes is maintained for comparison. The last word before a period, whatever its length, is matched against the abbreviation file and, if not found there, against the ending suffixes. A substring match function is used to match the verb suffix with the ending word. If a suffix matches, the end of the sentence is identified. The identified sentences are correct without ambiguity. Implementation using C wide characters makes the application more portable.

5. RESULTS
The developed application has been tested using the EMILLE corpus. A corpus of 23,561 Kannada words (487KB) was given as input to the Sentence Boundary Detection software, which detected 2152 sentences. Wrongly detected sentences were identified manually, and the software developed using the proposed algorithm was found to achieve an accuracy of 99.2%.

The erroneous sentence boundary predictions were due to the following reasons:
• A '?' occurring within a sentence was wrongly predicted as a boundary.
• A '.' or '?' occurring within quotes was wrongly predicted as a boundary.
• If the verb has no suffix, the sentence is wrongly predicted.

6. CONCLUSION
Sentence Boundary Detection is a pre-processing step for any Natural Language Processing application. In the present implementation of Sentence Boundary Detection, the sentence-ending verb and its different suffixes are used to detect sentence boundaries for Kannada. The result is almost 99.2% accurate. This technique is effective because no POS tagging or any other prerequisite is required. As it is coded in C using wide characters, it is more portable. The same software can be used for similar Indian languages like Telugu by changing the ABBREVIATIONS and VERBS_SUFFIX files accordingly.

7. REFERENCES
[1] Manning, C.D. and Schütze, H. 2002. Foundations of Statistical Natural Language Processing. The MIT Press, London.
[2] Reynar, J. and Ratnaparkhi, A. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., pp. 16-19.
[3] Palmer, D.D. and Hearst, M.A. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23:241–267.
[4] Mikheev, A. 2000. Tagging Sentence Boundaries. In: Proceedings of the NAACL, Seattle, pp. 264-271.
[5] Kiss, T. and Strunk, J. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.
[6] Walker, D.J., Clements, D.E., Darwin, M. and Amtrup, J.W. 2001. Sentence boundary detection: a comparison of paradigms for improving MT quality. In: Proceedings of the MT Summit VIII, Santiago de Compostela, Spain.
[7] Akita, Y. 2006. Sentence Boundary Detection of Spontaneous Japanese Using Statistical Language Model and Support Vector Machines. In: Proceedings of Interspeech-ICSLP, Pittsburgh, PA.
[8] Singh, Preetam, Negi, Rauthan M.M.S and Dhami, H.S. 2010. Sentence Boundary Disambiguation: a User Friendly Approach. IJCA, Vol. 7, No. 8.
[9] Mona Parakh, Rajesha N. and Ramya M. 2011. Sentence Boundary Disambiguation in Kannada Texts. Language in India, www.languageinindia.com, 11:5, May 2011, Special Volume: Problems of Parsing in Indian Languages, pp. 17-19.
[10] Gillick, D. 2009. Sentence Boundary Detection and the Problem with the U.S. In: Proceedings of the NAACL HLT: Short Papers, Boulder, Colorado.
[11] Agarwal, N., Ford, K. and Shneider, M. Sentence Boundary Detection using a MaxEnt Classifier. citeseerx.ist.psu.edu
[12] Wang, H. and Huang, Y. 2003. Bondec - A Sentence Boundary Detector. CS224N Project, Stanford.
