International Journal of Computer Applications (0975 – 8887)
Volume 39– No.9, February 2012
[Figure: flowchart of the sentence boundary detection procedure. Recoverable labels: Start; "If delimiter is '.'" (YES/NO); "Get the word preceding '.' from Sentence" ("last word"); "Search for the suffix match in VERBS_SUFFIX file"; "Suffix found in the list" (YES/NO); "Sentence Boundary not detected: Failed".]
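The decision steps recoverable from the flowchart, together with the abbreviation check described in the Discussion, can be sketched in C with wide characters as follows. This is a minimal sketch under stated assumptions, not the authors' code: the function names, the in-memory tables standing in for the abbreviation and verb-suffix files, and the sample Kannada entries are all hypothetical. The paper mentions a substring match function; the sketch assumes a stricter ends-with comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Illustrative entries only: hypothetical stand-ins for the contents of
 * the abbreviation and verb-suffix files, which the paper does not list. */
static const wchar_t *const ABBREVIATIONS[] = { L"ಡಾ", L"ಶ್ರೀ" };
static const wchar_t *const VERB_SUFFIXES[] = { L"ತ್ತಾನೆ", L"ತ್ತಾಳೆ", L"ತ್ತದೆ" };

#define COUNT(a) (sizeof (a) / sizeof (a)[0])

/* True if `word` ends with `suffix` (an empty suffix never matches). */
static bool ends_with(const wchar_t *word, const wchar_t *suffix)
{
    size_t wlen = wcslen(word);
    size_t slen = wcslen(suffix);
    return slen > 0 && slen <= wlen &&
           wcscmp(word + (wlen - slen), suffix) == 0;
}

/* Copy the word that ends just before text[dot] into out (capacity cap). */
static void last_word_before(const wchar_t *text, size_t dot,
                             wchar_t *out, size_t cap)
{
    size_t start = dot;
    while (start > 0 && text[start - 1] != L' ' &&
           text[start - 1] != L'\t' && text[start - 1] != L'\n')
        start--;
    size_t len = dot - start;
    if (len >= cap) {               /* keep the tail, where the suffix is */
        start = dot - (cap - 1);
        len = cap - 1;
    }
    wmemcpy(out, text + start, len);
    out[len] = L'\0';
}

/* Decide whether the period at text[dot] ends a sentence:
 * abbreviation -> no boundary; verb suffix -> boundary; else not detected. */
bool is_sentence_boundary(const wchar_t *text, size_t dot)
{
    assert(text != NULL && text[dot] == L'.');

    wchar_t word[64];
    last_word_before(text, dot, word, COUNT(word));

    for (size_t i = 0; i < COUNT(ABBREVIATIONS); i++)
        if (wcscmp(word, ABBREVIATIONS[i]) == 0)
            return false;           /* period belongs to an abbreviation */

    for (size_t i = 0; i < COUNT(VERB_SUFFIXES); i++)
        if (ends_with(word, VERB_SUFFIXES[i]))
            return true;            /* sentence-ending verb found */

    return false;                   /* boundary not detected */
}
```

Keeping the two tables separate from the decision logic mirrors the paper's remark that another language can be supported by swapping the abbreviation and verb-suffix files; in a fuller version the tables would be loaded from those files rather than compiled in.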
4. DISCUSSION
Many approaches have been followed to achieve sentence boundary detection, but very little research has been undertaken on sentence boundary detection for Indian languages. Kannada, one of the Indian languages, follows a grammar and syntax completely different from English. For example, the rule that a period is followed by a capital letter at the beginning of the next sentence is used in English as a feature for sentence boundary identification [1, 5, 10, 11], but this feature is not applicable to Kannada, whose script has no capital letters. The approaches in [3, 4] use POS tagging information, but sentence boundary detection can itself be a pre-processing task for a POS tagger. The algorithm proposed in Section 3 does not require POS tagging, and its rules are based on Kannada grammar.

Mona et al. [9] have explained disambiguation of sentence boundaries for Kannada. Two lists, L1 and L2, are maintained, where L1 is a list of sentence-ending words and L2 is a word list extracted from the corpus. The last word is compared with L1 and L2 only if its length is below 5 (the threshold). However, the authors have not mentioned the programming language used for implementation.

The algorithm proposed in Section 3 is unique in that a generic list of verb suffixes is maintained for comparison. The last word before a period, whatever its length, is first matched against the abbreviation file; if it is not found there, it is matched against the ending suffixes. A substring match function is used to match the verb suffix with the ending word. If a suffix matches, the end of the sentence is identified. The identified sentences are correct without ambiguity. Implementation using C wide characters makes the application more portable.

5. RESULTS
The developed application was tested using the EMILLE corpus. A corpus of 23,561 Kannada words (487 KB) was given as input to the sentence boundary detection software, which detected 2152 sentences. Wrongly detected sentences were identified manually, and the software developed using the proposed algorithm was found to achieve an accuracy of 99.2%.

The erroneous sentence boundary predictions were due to the following reasons:
- A ‘?’ occurring within a sentence was wrongly predicted as a boundary.
- A ‘.’ or ‘?’ occurring within quotes was wrongly predicted as a boundary.
- If the verb has no suffix, the sentence boundary is wrongly predicted.

6. CONCLUSION
Sentence boundary detection is a pre-processing step for any natural language processing application. In the present implementation of sentence boundary detection, the sentence-ending verb and its different suffixes are used to detect the sentence boundary for the Kannada language. The result is almost 99.2% accurate. This technique is effective, as no POS tagging or any other prerequisite is required. As it is coded in C using wide characters, it is more portable. The same software can be used for similar Indian languages such as Telugu by changing the ABBREVIATIONS and VERB_SUFFIX files accordingly.

7. REFERENCES
[1] Manning, C.D. and Schütze, H. 2002. Foundations of Statistical Natural Language Processing. The MIT Press, London.
[2] Reynar, J. and Ratnaparkhi, A. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., pp. 16-19.
[3] Palmer, D.D. and Hearst, M.A. 1997. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23:241–267.
[4] Mikheev, A. 2000. Tagging Sentence Boundaries. In: Proceedings of the NAACL, Seattle, pp. 264-271.
[5] Kiss, T. and Strunk, J. 2006. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485–525.
[6] Walker, D.J., Clements, D.E., Darwin, M. and Amtrup, J.W. 2001. Sentence boundary detection: a comparison of paradigms for improving MT quality. In: Proceedings of the MT Summit VIII, Santiago de Compostela, Spain.
[7] Akita, Y. 2006. Sentence Boundary Detection of Spontaneous Japanese Using Statistical Language Model and Support Vector Machines. In: Proceedings of Interspeech-ICSLP, Pittsburgh, PA.
[8] Negi, P.S., Rauthan, M.M.S. and Dhami, H.S. 2010. Sentence Boundary Disambiguation: a User Friendly Approach. IJCA, Vol. 7, No. 8.
[9] Mona Parakh, Rajesha N. and Ramya M. 2011. Sentence Boundary Disambiguation in Kannada Texts. Language in India, www.languageinindia.com, 11:5, May 2011, Special Volume: Problems of Parsing in Indian Languages, pp. 17-19.
[10] Gillick, D. 2009. Sentence Boundary Detection and the Problem with the U.S. In: Proceedings of the NAACL HLT: Short Papers, Boulder, Colorado.
[11] Agarwal, N., Ford, K. and Shneider, M. Sentence Boundary Detection using a MaxEnt Classifier. citeseerx.ist.psu.edu
[12] Wang, H. and Huang, Y. 2003. Bondec - A Sentence Boundary Detector. CS224N Project, Stanford.