Developing A Transliteration System For Telugu Script
1.0 Introduction

Transliteration is a method of representing text written in one writing system in another writing system. It does not reproduce the pronunciation of the source; it attempts only to replicate the way the text is written in the source. During transliteration, the source is preserved in the target script by replacing each source character with characters of the target script, ideally in a one-to-one manner. The level of preservation generally varies. In a fully reversible transliteration, the source spelling is completely preserved in the target script, so the original can be recreated without any loss of information.

1.1 Indic Scripts

All modern Indic scripts are ultimately derived from Brāhmi, the script of the Aśōkan inscriptions. Being Brāhmic descendants, the modern Indic scripts still maintain, for the most part, a one-to-one correspondence with each other. Apart from this common repository of characters, each script has also developed additional characters for representing the phonemes of its regional language. These include the Perso-Arabic extended characters present in the North Indic scripts, and characters such as short e, short o and trill ṟa present in the South Indic scripts.

1.2 Modern Telugu Script

The modern Telugu script is a descendant of the old Kannada-Telugu script that was employed for writing the Telugu and Kannada languages [1]. The old script is in turn derived from the older Kadamba script, which evolved from the southern Brāhmic variant. Telugu script has the common Brāhmic character stock along with several characters required to write the native language.

1.3 Transliteration Standards

Transliteration standards such as IAST and ISO 15919 exist to standardize the way in which Indic scripts are represented using Latin characters. IAST primarily deals with Sanskrit, and ISO 15919 with all the Indic languages. ISCII prescribes a standard that enables the Dēvanāgarī script to represent all the characters of the Indic scripts. In all of these cases, special diacritic marks are introduced to extend the native character set so that it can represent a foreign character set.

1.4 Transliteration System

A Transliteration System is composed of two units. The first is the static constituent: an established Transliteration Standard for representing the non-native characters in the native script. The second is the working element: a converter module that implements the standard by transliterating the input text into the characters of the native script in accordance with the standard. The two components together form a working Transliteration System.
2.0 Establishing a Transliteration Standard

For developing such a Transliteration System in Telugu, the first step is to establish a Telugu transliteration standard to represent all the Indic characters.

2.1 Overview of Telugu Characters

Devanagari is shown as the prototype for the Brāhmic stock (and is also used as the representative script for the North Indic scripts, with Tamil representing the South Indic scripts).

[Table: overview of Telugu characters, grouped as Common Brāhmic Characters, East Indic and South Indic]
2.2 ISCII & Parivṛddha Dēvanāgarī

Indian Standard Code for Information Interchange (ISCII) was the official standard for Indic text processing before the advent of Unicode. ISCII implemented Indic processing through an 8-bit character encoding. One of the important implications of ISCII was its ability to support transliteration across the various Indic scripts. ISCII facilitated transliteration by using the same underlying code layout for all the Indic scripts [2]. The Indic blocks in Unicode were mapped directly from ISCII to maintain compatibility, so Unicode inherited the same ability to facilitate easy transliteration.

Another important feature of the ISCII standard was the development of a Parivṛddha Dēvanāgarī (Extended Dēvanāgarī) to represent all the characters of the Indic scripts in the Dēvanāgarī script itself. This established Devanagari as a pan-Indic transliteration script standard for the entire range of Indic scripts.
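The shared code layout described above can be illustrated with a minimal Python sketch. It assumes only that the Devanagari block starts at U+0900 and the Telugu block at U+0C00, each 128 code points long; the function name `shift_block` and the constants are hypothetical and are not taken from ISCII or from the converter described later in this paper.

```python
DEVANAGARI_START = 0x0900  # first code point of the Devanagari block
TELUGU_START = 0x0C00      # first code point of the Telugu block
BLOCK_SIZE = 0x80          # each of these Indic blocks spans 128 code points


def shift_block(text, src_start=DEVANAGARI_START, dst_start=TELUGU_START):
    """Move every character of the source block to the same offset in the target block."""
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset < BLOCK_SIZE:
            # Same position in the layout, different script.
            out.append(chr(dst_start + offset))
        else:
            # Characters outside the source block pass through unchanged.
            out.append(ch)
    return "".join(out)


# Devanagari bha, vowel sign aa, ra, ta shifted to the Telugu block.
telugu = shift_block("\u092D\u093E\u0930\u0924")
# The same function reverses the mapping when the block starts are swapped.
devanagari = shift_block(telugu, TELUGU_START, DEVANAGARI_START)
```

A plain shift of this kind only works for characters that exist in both scripts. A source character with no counterpart lands on an unassigned position in the target block, which is the gap an extended character set is meant to fill.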
In Parivṛddha Dēvanāgarī, several new characters, such as short e and short o, were created to represent the missing characters. In most cases the diacritic mark Nukta was used to create the new characters; in some other cases existing shapes were modified to create new ones (for instance, by adding a wavy stroke to an existing letter).

2.3 Transliteration to Telugu

At times, mostly in scholarly works and in dictionaries, it is necessary to express other Indic languages in Telugu script with the original spelling of the source preserved. For Latin script, IAST and ISO 15919 enable such lossless transliteration. No comparable lossless transliteration is currently possible in Telugu script. Telugu also retains a presence and influence among the Vaiṣṇava community in and around Tamil Nadu, which creates a need for transliteration. With Urdu as the second official language of Andhra Pradesh, it may also be necessary to represent the Perso-Arabic consonants of Urdu accurately, as the North Indic scripts do. There have been several attempts to transliterate other languages into Telugu script. Consider the sample below, printed in 1928.
Tamil written in Telugu Script [3]. Note how two Tamil letters have been imported as Telugu characters by using Telugu mātrās. (For some reason the typesetter did not use the native archaic Telugu equivalents, even though they exist.)

3.0 Parivṛddha Telugu
As with Dēvanāgarī, and for similar reasons, there is a need to develop a Parivṛddha Telugu. Based on the overview of characters seen earlier, the following characters need to be added or re-introduced to the standard Telugu script to establish a Parivṛddha Telugu. The important part is to get these characters encoded in the Telugu Unicode block. Detailed proposals, based on feedback and discussion from the user community, must be made to the UTC to have these characters included in the Telugu Unicode standard.
3.1 Telugu Sign Combining Chandrabindu

The Telugu Unicode block already has a sign named Telugu Sign Chandrabindu encoded at U+0C01. It is not, however, a true equivalent of the North Indic Chandrabindu but a Telugu prosody character named arasunna. A true Chandrabindu is needed for transliteration. Several Vedic Sanskrit books in Telugu script already employ this character.
Vedic Sanskrit in Telugu Script [4]

A Unicode proposal for encoding this Chandrabindu has already been submitted to and accepted by the UTC, with the character to be encoded at U+0C00.

3.2 Telugu Sign LLLA

The letter LLLA is currently encoded in the Unicode blocks of three South Indic languages: Tamil, Kannada and Malayalam. The Telugu Unicode block does not currently have this letter. The Old Telugu script, however, included this now archaic letter before it dropped out of use in Telugu. The letter is also found in śāsanas (inscriptions) written in the Telugu language.
Telugu script notes, TDIL [5]

Telugu Sign LLLA looks very similar to the equivalent Kannada character. It must be included at U+0C34.

3.3 Telugu Sign Nukta

Nukta (Arabic for "dot") is a dot-shaped sign used to extend the basic character set to represent new sounds. It is present in all the North Indic scripts and also in Kannada. (In Tamil, U+0B83 plays the de facto role of the Nukta.) The North Indic scripts use the Nukta to extend the basic characters to represent the Perso-Arabic consonants. Kannada uses the Nukta to represent /f/ and /z/.
In principle, the presence of a Nukta enables a script to be extended to denote any non-native sound. There have been efforts to use a Nukta in the Telugu script, such as the one below [6].

However, a Nukta at the center of the consonant overlaps with the mahāprāṇa mark in some consonants. A better solution is provided by TDIL (Technology Development for Indian Languages), which has proposed a Telugu Nukta placed at the left of the consonant [7].

However, a formal proposal was never made to the Unicode Consortium. The Nukta sign can be used to create the equivalent of all the Indic characters in Telugu. It greatly eases the transliteration of North Indic scripts into Telugu.
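How a Nukta eases transliteration from a North Indic script can be sketched in Python as follows. The sketch assumes that the Telugu Nukta sits at U+0C3C, the position parallel to the Devanagari Nukta at U+093C; this paper does not state a code point for it. The function name `devanagari_nukta_to_telugu` is hypothetical.

```python
import unicodedata

# Distance between the Devanagari block (U+0900) and the Telugu block (U+0C00).
BLOCK_OFFSET = 0x0C00 - 0x0900


def devanagari_nukta_to_telugu(text):
    """Render Devanagari nukta consonants as Telugu consonant + Nukta."""
    out = []
    for ch in text:
        # U+0958..U+095F are the precomposed Devanagari nukta consonants
        # (qa, khha, ghha, za, dddha, rha, fa, yya). Only these are split
        # into base consonant + U+093C; other letters are left whole so
        # that, for example, U+0934 (LLLA) keeps its own slot.
        if 0x0958 <= ord(ch) <= 0x095F:
            ch = unicodedata.normalize("NFD", ch)
        for part in ch:
            cp = ord(part)
            if 0x0900 <= cp <= 0x097F:
                # Assumption: the Telugu Nukta is at U+0C3C, so the plain
                # block shift also carries U+093C to the Telugu Nukta.
                out.append(chr(cp + BLOCK_OFFSET))
            else:
                out.append(part)
    return "".join(out)


# Devanagari za (U+095B) becomes Telugu ja (U+0C1C) followed by the Nukta.
za_in_telugu = devanagari_nukta_to_telugu("\u095B")
```

With a Nukta available, no per-letter table is needed for the Perso-Arabic consonants: each one reduces to an existing Telugu consonant plus one combining sign.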
The Nukta can have a variety of other uses as well. For instance, if the need arises to represent the /w/ sound, this can be done by combining the Nukta with the consonant va. In this way, any required sound can be represented. To avoid complications in font rendering, consonants with Nukta can be prevented from forming saṁyuktākṣaras (conjuncts), as in [6].

3.4 Other Characters

Several other characters, such as the Cillakṣaras of Malayalam and Khanda Ta of Bengali, may require a unique representation. In such cases, U+02BC (Modifier Letter Apostrophe) can be used alongside the usual halant forms to denote these special pure-consonant forms.

3.5 Native Conventions

Telugu has some native conventions for writing specific phonemes. A transliteration system must also be able to apply these conventions and transform the text if required. The nasal letters must be converted into Anusvāra. The pure consonant 'm', when it occurs in final position, must also be changed into Anusvāra.
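The two native conventions can be sketched in Python as a post-processing step on Telugu text. Two readings are assumptions here, since the paper's own examples are not available: a nasal is replaced only when it stands with a virama before a stop of its own class (varga), and "final" means not followed by another Telugu character. The function name `apply_native_conventions` is hypothetical.

```python
import re

VIRAMA = "\u0C4D"    # Telugu sign virama (halant)
ANUSVARA = "\u0C02"  # Telugu sign anusvara

# Each class nasal mapped to the four stops of its own varga.
VARGAS = {
    "\u0C19": "\u0C15\u0C16\u0C17\u0C18",  # nga: ka kha ga gha
    "\u0C1E": "\u0C1A\u0C1B\u0C1C\u0C1D",  # nya: ca cha ja jha
    "\u0C23": "\u0C1F\u0C20\u0C21\u0C22",  # nna: tta ttha dda ddha
    "\u0C28": "\u0C24\u0C25\u0C26\u0C27",  # na: ta tha da dha
    "\u0C2E": "\u0C2A\u0C2B\u0C2C\u0C2D",  # ma: pa pha ba bha
}

# Assumption: nasal + virama directly before a stop of the same varga.
_NASAL_CLUSTER = re.compile(
    "|".join(f"{nasal}{VIRAMA}(?=[{stops}])" for nasal, stops in VARGAS.items())
)

# Assumption: "final" means no Telugu character follows ma + virama.
_FINAL_M = re.compile("\u0C2E" + VIRAMA + "(?![\u0C00-\u0C7F])")


def apply_native_conventions(text):
    """Rewrite class nasals and final pure 'm' as anusvara."""
    text = _NASAL_CLUSTER.sub(ANUSVARA, text)
    return _FINAL_M.sub(ANUSVARA, text)


# a + nga + virama + ga becomes a + anusvara + ga.
converted = apply_native_conventions("\u0C05\u0C19\u0C4D\u0C17")
```

Because these rewrites discard the distinction between the nasals, they make the output non-reversible, so a converter would plausibly apply them only when reversibility is not required.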
4.0 Aksharamukha Script Converter

Aksharamukha (http://www.virtualvinodh.com/aksharamukha), an open-source, PHP-based web transliteration application that works on the basis of the proposed standard, has been developed. The application converts all the Indic orthographies, including East Asian scripts, as well as several Roman transliteration standards such as ISO 15919, IAST, Harvard-Kyoto and Velthuis, into Telugu script and vice versa.
Parivṛddha Telugu as discussed above has been implemented by the converter. A working Unicode font called Parivriddha Telugu [8] has been forked from the open-source Telugu font Lohit Telugu, with all the new glyphs (Telugu Sign Combining Chandrabindu, Telugu Letter LLLA and Telugu Sign Nukta) introduced at the proposed locations. The converter is presented as a model implementation of the proposed system.
5.0 Future Developments

This paper, as an initial attempt, deals only with the complete representation of the Indic script characters. In the future, it will be necessary to enable Telugu script to represent other phonemes with additional diacritics. Telugu script can be extended further by standardizing some diacritics that can be used along with existing Telugu characters to represent important non-native sounds, such as the vowel /æ/ in "apple". Dedicated Unicode blocks already exist for diacritics: Spacing Modifier Letters and Combining Diacritical Marks. A required subset of these diacritics could be standardized for use as consonant and vowel modifiers to represent sounds that do not exist in Telugu. One possible use case is an English-Telugu dictionary, where English phonemes need to be represented in Telugu, e.g. cot [kɒt]: <U+0C15, U+0C3E, U+0306, U+0C1F, U+0C4D>.

6.0 Conclusion

In the initial stages, Parivṛddha Telugu may be restricted to scholarly usage and possibly to dictionaries, but over time it may also be adopted in mainstream usage. Scholars will be able to represent other Indic languages in Telugu script with more ease. It will also enable people to read, learn and represent other Indic languages completely in Telugu script itself. The existence of a complete transliteration system will enrich Telugu.

7.0 References

1. http://www.ancientscripts.com/telugu.html
2. Indian Script Code for Information Interchange (ISCII), IS 13194:1991
3. tirumaṁtrārthamu, āṁdhra drāviḍa vyākhyāna sahitamu, śrīmān kaṁdāḍai ācāryulu, 1928
4. Request to encode South Indian Chandrabindu-s, JTC1/SC2/WG2 N3964, http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3964.pdf
5. Vishwabharat @ TDIL, Issue 5, April 2002, http://tdil.mit.gov.in/ori-guru-telu.pdf
6. http://prapatti.com/slokas/telugu/naalaayiram/periyaazvaar/tiruppallaandu.pdf
7. Proposed changes in Unicode Standards for Indic Scripts: Telugu, http://tdil.mit.gov.in/prop_uni/Telugu.pdf
8. http://www.virtualvinodh.com/download/fonts/Parivriddha%20Telugu.ttf