Ref 8
Soundex Algorithm
Zahid, Muhammad Adeel
Image Processing Center, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]
Rao, Naveed Iqbal
Image Processing Center, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]
Siddiqui, Adil Masood
Department of Electrical Engineering, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]
Abstract: Transliteration algorithms are used to convert the Romanized form of Urdu into Urdu script, but the accuracy of such systems is greatly reduced by the presence of English words such as "weak" and "next" in online conversations. In this paper we present a dictionary-based solution for converting English words to Urdu script. In doing so, an accent conversion problem may arise; it is handled through a Soundex-based algorithm in which the relative positions of transcriptions and Urdu language rules are combined to assign codes to English words, which are then mapped to Urdu script. We have integrated our work with an existing Roman Urdu transliteration system, and experimental results show the significance of our work both for standalone English transliteration and as part of a Roman Urdu transliteration framework.

Keywords-transliteration; soundex; orthography;

I. INTRODUCTION

Transliteration is a sub-area of Natural Language Processing that deals with the conversion of text from one script (writing system) to another. This process is mostly rule based and depends upon the phonetic equivalence of letters in the source and target scripts [1]. Ideally, transliteration should be one-to-one (one word in the source language should be mapped to one word in the target language).

To date, Urdu has mostly been written in Romanized form in electronic applications (e-mail, blogs, SMS, chat), mainly for two reasons. First, people are accustomed to the English keyboard. Second, Urdu keyboards are cumbersome to use because Urdu has many more letters than an English keyboard provides.

For Urdu there are schemes that achieve the ideal one-to-one mapping [2,3], but this property is achieved at the cost of user convenience. Since there are more letters in Urdu than in English, these schemes make use of letter capitalization and special symbols. A few Romanization schemes exist [4,5], but they are seldom followed. Learning these schemes is equivalent to learning a new language, which is why they have gained little or no popularity. Moreover, they do not validate whether the transliterated word is a valid Urdu word or phrase and simply rely on a rigid one-to-one mapping from source to target script.

A system for transliterating Roman Urdu to Arabic script has been proposed in [1] that allows the user to write Roman Urdu without any constraints or capitalization. It employs automatic cross-script trie generation to address the problem of Roman-to-Urdu transliteration. It caters for the diversity of spellings used for a single Urdu word and validates words against an Urdu dictionary. The mapping is one-to-many, so a single Roman Urdu lexicon may yield a word list containing more than one legitimate Urdu word. Similar work has been done at Google Labs [6], which takes into account the diversity of Roman spellings for one Roman Urdu word and outputs a list of possible Urdu words for one Roman lexicon.

Tafseer et al. [7] proposed a solution to Roman Urdu transliteration based on the Soundex algorithm. It encodes the input Roman words according to the sound of each individual letter of the Romanized word, and characters with similar sounds are assigned the same code.

The drawback of these systems [1,6,7] is that they do not address the transliteration of English words, which appear frequently not only in speech but also in written Roman Urdu. It is desirable to convert these English words into Arabic script as well, because ignoring them makes the output of the whole system inconsistent. To be effective, such conversion should be performed according to the local accent of native Urdu speakers. The unavailability of an English corpus in the local accent makes this job more challenging. Aniruddha et al. [8] proposed a model that considers morphological analysis of English words, phonological rules, and letter-to-sound rules to generate pronunciation and stress information for Indian English. A dictionary for Indian English was also prepared, but it was smaller and contained a subset of the words in the much larger CMU dictionary.

Abbas et al. [9] discuss an approach to English-to-Urdu transliteration that combines syllabification and some Urdu language rules to convert English to Urdu script and to align it with the local accent of native Urdu speakers. Moreover, it relies on the introduction of vowels when a consonant cluster appears in the onset position of a syllable.

This paper deals with the study of English-to-Urdu transliteration of English words based on the Soundex algorithm.
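
For readers unfamiliar with Soundex, the following minimal sketch shows how the classic American Soundex scheme assigns the same code to letters with similar sounds, so that spelling variants of one word collapse to a single key. It is only an illustration of the general idea: the digit groups below are those of classic Soundex and are not the codes used in this paper or in [7], and the names `soundex` and `group_by_code` are hypothetical helpers introduced here.

```python
from collections import defaultdict

# Digit groups of classic American Soundex (an assumption for illustration;
# not the Urdu-specific code table of the paper or of [7]).
_GROUPS = {
    "bfpv": "1",
    "cgjkqsxz": "2",
    "dt": "3",
    "l": "4",
    "mn": "5",
    "r": "6",
}
_CODE = {ch: digit for letters, digit in _GROUPS.items() for ch in letters}


def soundex(word: str) -> str:
    """Return the 4-character classic Soundex code of an ASCII word."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODE.get(first, "")
    for ch in letters[1:]:
        code = _CODE.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if code and code != prev:
            digits.append(code)
        # 'h' and 'w' do not separate equal codes; vowels (and 'y') do.
        if ch not in "hw":
            prev = code
    # Keep the first letter, pad with zeros, truncate to four characters.
    return (first.upper() + "".join(digits) + "000")[:4]


def group_by_code(words):
    """Group spellings that share a Soundex code (hypothetical helper)."""
    groups = defaultdict(list)
    for w in words:
        groups[soundex(w)].append(w)
    return dict(groups)
```

Under this classic table, Roman spellings such as "mohabbat", "muhabbat", and "mohabat" receive the same code, which is the property a Soundex-style transliterator exploits when it looks up candidate words in a dictionary keyed by code.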