
English to Urdu Transliteration: An Application of Soundex Algorithm
Zahid, Muhammad Adeel
Image Processing Center, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]

Rao, Naveed Iqbal
Image Processing Center, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]

Siddiqui, Adil Masood
Department of Electrical Engineering, College of Signals,
National University of Sciences and Technology (NUST)
Rawalpindi, Pakistan.
[email protected]

Abstract: Transliteration algorithms are used to convert the Romanized form of Urdu into Urdu script. The accuracy of such systems, however, is greatly reduced by the presence of English words such as "weak" and "next" in online conversations. In this paper we present a dictionary-based solution for converting English words to Urdu script. Doing so raises an accent conversion problem, which is handled through a Soundex-based algorithm in which the relative positions of transcriptions and Urdu language rules are combined to assign codes to English words; these codes are then mapped to Urdu script. We have integrated our work with an existing Roman Urdu transliteration system, and experimental results show the significance of our work both for standalone English transliteration and as part of a Roman Urdu transliteration framework.

Keywords-transliteration; soundex; orthography

I. INTRODUCTION
Transliteration is a sub-area of Natural Language Processing that deals with the conversion of text from one script (writing system) to another. This process is mostly rule based and depends upon the phonetic equivalence of letters in the source and target scripts [1]. Ideally, transliteration should be one-to-one (one word in the source language should be mapped to one word in the target language).

Until now, Urdu has mostly been written in Romanized form in electronic applications (e-mails, blogs, SMS, chatting), mainly for two reasons. First, people are accustomed to the English keyboard. Second, Urdu keyboards are cumbersome to use because Urdu has many more letters than an English keyboard provides.

For Urdu there are schemes that achieve the ideal one-to-one mapping [2,3], but this property is achieved at the cost of user convenience. Since there are more letters in Urdu than in English, these schemes make use of letter capitalization and special symbols. A few Romanization schemes exist [4,5], but they are seldom followed. Learning these schemes is equivalent to learning a new language, which is why they have gained little or no popularity. Moreover, they do not validate whether the transliterated word is a valid Urdu word or phrase and simply rely on a rigid one-to-one mapping from source to target script.

A system for transliterating Roman Urdu to Arabic script has been proposed in [1] that allows the user to write Roman Urdu without any constraints or capitalization. It employs automatic cross-script trie generation to address the problem of Roman to Urdu transliteration. It caters for the diversity of spellings used for a single Urdu word and validates the words against an Urdu dictionary. The mapping is one-to-many, so one Roman Urdu lexicon may have a word list containing more than one legitimate Urdu word. Similar work has been done at Google Labs [6]; it takes into account the diversity of Roman spellings for one Roman Urdu word and outputs a list of possible Urdu words for one Roman lexicon.

Tafseer et al. [7] proposed a solution to Roman Urdu transliteration based on the Soundex algorithm. It encodes the input Roman words according to the sound of each individual letter of the Romanized word, and characters with similar sounds are assigned the same code.

The drawback of these systems [1,6,7] is that they do not address the transliteration of English words, which appear frequently not only in speech but also in written Roman Urdu. It is desirable to convert these English words into Arabic script as well, because ignoring them makes the whole system inconsistent. To be effective, such conversion should be performed according to the local accent of native Urdu speakers. The unavailability of an English corpus in the local accent makes this job more challenging. Aniruddha et al. [8] proposed a model that considers the morphological analysis of English words, phonological rules and letter-to-sound rules to generate pronunciation and stress information for Indian English. A dictionary for Indian English was also prepared, but it was smaller in size and contained a subset of the words from the much larger CMU dictionary.

Abbas et al. [9] discuss an approach to English to Urdu transliteration that combines syllabification and some Urdu language rules to convert English to Urdu script and to align it to the local accent of native Urdu speakers. Moreover, it relies on the introduction of vowels when a consonant cluster appears in the onset position of a syllable.

This paper deals with the study of English to Urdu transliteration of English words based on the Soundex algorithm.

978-1-4244-8003-6/10/$26.00 ©2010 IEEE


We want to make transliterated English text as close to the local accent as possible. For this, our approach is based on English transcriptions and on analysis of the English letters within the word being transliterated and their position in the word. We have integrated our solution with an existing Urdu transliteration framework [1], and it can work both as a part of that framework and as a standalone English transliteration system.

Section II describes the problem and the strategy towards its solution. Section III describes the experimental setup and analyzes the results. Section IV sums up this paper with a discussion of conclusions and future work.

II. PROBLEM FORMULATION
We propose a solution to English to Urdu transliteration based on the Soundex algorithm [10]. We have used the CMU pronunciation dictionary [11] to acquire transcriptions of English words. Since we are integrating our work with an existing Urdu transliteration system, our system currently identifies a word as English if it is present in the CMU dictionary. Addressing the out-of-vocabulary problem is outside the scope of this work.

We used two-step coding: forward coding and backward coding. In backward coding, we define a coding of Urdu letters in more readable and easy-to-use English characters. In forward coding, we map transcriptions of English words to our codes based on phonetic similarity. We have formulated our mapping rules so that each code has only one realization in Urdu script. Once we have a code for every English word, mapping to Urdu script is a trivial one-to-one mapping.

A. Mapping Rules for Backward Coding
The dictionary comprises 39 distinct transcriptions, of which 15 are vowels and 24 are consonants. For consonants there is almost always a one-to-one mapping to Urdu script, so we have also assigned a one-to-one mapping to these consonants in our code. The Urdu inventory has more letters than the English inventory, but for English to Urdu transliteration we use 21 consonants and 6 vowels from Urdu orthography. There are consonants in English orthography that can map to more than one consonant in Urdu. For example, the consonant "s" in English orthography can map to "س", "ص" or "ث", but we define only one mapping of "s", i.e. "س". This reduces our mapping effort by half. Furthermore, Urdu letters are mapped to more intelligible backward codes for the sake of simplicity. These characters are assigned unique codes in our coding scheme as described in Table I.

TABLE I. MAPPING RULES FOR CODES

Urdu Letter    Description     Code
آ              (Alif-madda)    AA
ا              (Alif)          A
ب              (Bay)           B
چ              (Chay)          CH
ڈ              (Dal)           D
د              (dal)           DH
ف              (Fay)           F
گ              (Gaf)           G
ه              (Hay)           H
ج              (Jeem)          J
ک              (kaf)           K
ل              (Lam)           L
م              (Meem)          M
ن              (Noon)          N
و              (Wao)           W
پ              (Pay)           P
ر              (Ray)           R
س              (Seen)          S
ش              (Sheen)         SH
ٹ              (Tay)           T
ز              (Zay)           Z
ژ              (Xay)           ZH
ی              (Choti-Yeh)     Y
ے              (Bari-Yeh)      E
ئ              (Yeh-hamza)     I
ؤ              (Wao-Hamza)     O

In the above coding scheme the letter "و" can act both as a vowel and as a consonant, but there is no difference in the written form, so we did not assign separate codes for the vowel case and the consonant case.

B. Mapping Rules for English Transcriptions
For mapping the transcriptions of English words to our coding system we have classified English transcriptions into two groups. The first group contains the transcriptions that have a direct one-to-one mapping to Urdu script and hence to our coding system. The second group contains transcriptions that can have multiple realizations in Urdu script and in our coding scheme. Mapping possibilities for the two groups are listed in Tables II and III respectively.

TABLE II. MAPPING RULES FOR GROUP 1

Transcription    Code
b                B
ch               CH
d                DD
dh               D
p                P
r                R
s                S
sh               SH
t                T
th               TH
uh               W
uw               W
v                W
w                W
y                Y
z                Z
zh               ZH
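Table I amounts to a lookup from code to Urdu letter. The following is a minimal Python sketch of applying it, assuming a greedy longest-match scan over the code string (so that SH is read as one code rather than S followed by H), which is how Section II.D describes the final mapping. The names TABLE_I and codes_to_urdu are hypothetical and are not taken from the authors' implementation; the code points are the Unicode values of the letters listed in Table I.

```python
# Table I as a dictionary: code -> Urdu letter (Unicode escapes, names in comments).
TABLE_I = {
    "AA": "\u0622",  # Alif-madda
    "A": "\u0627",   # Alif
    "B": "\u0628",   # Bay
    "CH": "\u0686",  # Chay
    "D": "\u0688",   # Dal (retroflex)
    "DH": "\u062f",  # dal
    "F": "\u0641",   # Fay
    "G": "\u06af",   # Gaf
    "H": "\u0647",   # Hay
    "J": "\u062c",   # Jeem
    "K": "\u06a9",   # kaf
    "L": "\u0644",   # Lam
    "M": "\u0645",   # Meem
    "N": "\u0646",   # Noon
    "W": "\u0648",   # Wao
    "P": "\u067e",   # Pay
    "R": "\u0631",   # Ray
    "S": "\u0633",   # Seen
    "SH": "\u0634",  # Sheen
    "T": "\u0679",   # Tay (retroflex)
    "Z": "\u0632",   # Zay
    "ZH": "\u0698",  # Xay
    "Y": "\u06cc",   # Choti-Yeh
    "E": "\u06d2",   # Bari-Yeh
    "I": "\u0626",   # Yeh-hamza
    "O": "\u0624",   # Wao-Hamza
}

# Longest code in the table; the scan tries this length first.
MAX_LEN = max(len(code) for code in TABLE_I)


def codes_to_urdu(code: str) -> str:
    """Map a code string to Urdu script, always taking the longest matching code."""
    out = []
    i = 0
    while i < len(code):
        for size in range(MAX_LEN, 0, -1):
            chunk = code[i:i + size]
            if chunk in TABLE_I:
                out.append(TABLE_I[chunk])
                i += len(chunk)
                break
        else:
            # No code of any length matches at this position.
            raise ValueError(f"no Table I entry at position {i} of {code!r}")
    return "".join(out)
```

With this table, codes_to_urdu("DANS") produces the letters Dal, Alif, Noon, Seen in that order. Codes that Table I does not list (for example TH from Table II) raise a ValueError in this sketch rather than being guessed.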
TABLE III. MAPPING RULES FOR GROUP 2

Transcription    Codes               Default Code
aa               A, AA               A
ae               AY, A, Y            Y
ah               A, Y, W, null       null
ao               AW, A               A
aw               AAO, AO             AO
ay               AAI, AI             AI
eh               AY, H, I, Y         Y
er               AR, WR, AIR, A      R
ey               AY, E, Y            Y
ih               A, Y, YI, null      null
oy               AIE, AI             AI
iy               Y, AY               Y
ow               AW, W               W
uh               AW, W               W
uw               AW, W               W

As is evident from Table III, the mapping of transcriptions belonging to group 2 is trickier in the sense that it depends upon the position of the particular sound in the transcription. For example, a transcription can give one sound in the initial position, another sound in the medial position and a different sound in the final position of the word. For this group, a simple one-to-one mapping will not suffice; it requires more sophisticated rules that consider the position of the arpabet symbol as well as the surrounding transcriptions. The next sub-section describes the algorithm that systematically converts English words into equivalent Urdu words while taking all of the above concerns into account.

C. Mapping Algorithm
The coding process starts with reading the transcriptions of English words from the dictionary. We encode these transcriptions using the mapping rules defined in Tables II and III. The resulting dictionary then contains each English word, its transcription and its code. In the second part, when the user inputs a word, its code is fetched from the dictionary and mapped to Urdu script according to the rules defined in Table I. The CMU dictionary that we are using contains more than 130000 words with their transcriptions and can be assumed to be nearly complete.

For the sake of this algorithm we adopt the terminology from [7]. The algorithm takes as input eng_word, the word to be encoded. The transcription of this word, eng_trans, is then read from the dictionary.

The English words that we have selected as examples are 'abduction', 'dance', 'clear', 'strategies', 'adapt' and 'aaron'. There is a brief description after each step.

Step-1: Read the transcription, eng_trans, of the English word that is to be encoded. We use the hyphen (-) sign to mark transcription boundaries.
Example Words: The selected words have the following transcriptions.
ae-b-d-ah-k-sh-n, d-ae-n-s, k-l-ih-r, s-t-r-ae-t-jh-iy-z, ah-d-ae-p-t, eh-r-ah-n

Step-2: Encode all transcriptions of group 1 according to the mapping rules of Table II.
Since Table II contains all transcriptions that have a one-to-one mapping to our coding system, this mapping is performed before we move to the more complex case of transcriptions belonging to group 2, which have multiple realizations in our coding system.
Example Words: ae-B-D-ah-K-SH-N, D-ae-N-S, K-L-ih-R, S-T-R-ae-T-J-Y-Z, ah-D-ae-P-T, eh-R-ah-N

Step-3: If the transcription starts with aa, ae, ah, ao, ay, eh, er, ey, ih or oy, do the following replacements.
aa = AA
ae = AY
ah = A
ao = AW
ay = AAI
eh = AY
er = AR
ey = AY
ih = A
oy = AI
In Urdu, no word can start with a vowel other than alif "ا" or ain "ع". Since we are transliterating English words to Urdu script, we do not use ain, and every word that starts with a vowel arpabet symbol is preceded by "A", the code we have assigned to alif.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-ih-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-4: If the transcription of the English word ends with ah or ih, do the following replacements.
ah = A
ih = Y
"ah" and "ih" in English are equivalent to the Urdu short vowels Zabar and Zer. Since Urdu words cannot end with a short vowel, they are mapped to the closest long vowel.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-ih-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-5: For all occurrences of the vowel "ih" before "R", do the following replacement.
ih = YI
This is a special-case mapping of the short vowel "ih" to the code "YI", and it deals with the series of words like near, bear, clear, ear, year etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-6: If the vowel "ah" appears between T and B, CH and B, TH and B, J and M or V and B, do the following replacement.
ah = Y
In the cases mentioned above, "ah" produces the sound of "yeh" instead of its default sound of Zabar, as in "approachable", "detachable", "eligible", "acknowledgeable" etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-7: If the vowel "ah" appears between L and J, do the following replacement.
ah = W
This case deals with the occurrences of "ah" where, contrary to its default sound, it assumes the sound of the vowel wao. It handles words such as "Zoology", "Biology", "Psychology" and "Morphology", to name a few.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-8: If the vowel "ao" appears between CH and K, F and R, B and R, or Y and R, do the following replacement.
ao = A
This rule deals with cases where "ao" produces the sound of Alif rather than its default sound. Examples include "chalk", "chocolate", "form", "border" etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-9: If the transcription ends with the vowel "eh", do the following replacement.
eh = H
Analysis of the dictionary reveals that only a few words end with the vowel "eh"; these words are borrowed from other languages and map to the consonant "H".
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-10: If the transcription ends with the vowel "ey", do the following replacement.
ey = E
When the vowel "ey" appears at the end of the word it gives the sound of the Urdu vowel bari-yeh rather than its default sound yeh. This case deals with words like "astray", "attaché", "array", "essay" etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-11: If the vowel "ih" is followed by R, do the following replacement.
ih = YI
This rule handles the words where "ih" produces the sound of YI rather than its default sound zer. These words include "fear", "tear", "sheer" etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-12: If the transcription ends with the vowel "oy", do the following replacement.
oy = AIE
The vowel "oy", when appearing at the end of the word, gives the additional sound of bari-yeh along with its default sound. It handles words like "coy", "boy", "destroy", "employ", "enjoy" etc.
Example Words: AY-B-D-ah-K-SH-N, D-ae-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-ae-P-T, AY-R-ah-N

Step-13: If the vowel "ae" appears between D and P or D and N, do the following replacement.
ae = A
When the vowel "ae" appears between the above-mentioned pairs it produces the sound of alif rather than its default sound yeh. The words it deals with include "dance", "adapt", "adapso" etc.
Example Words: AY-B-D-ah-K-SH-N, D-A-N-S, K-L-YI-R, S-T-R-ae-T-J-Y-Z, A-D-A-P-T, AY-R-ah-N

Step-14: Map all vowels that are still not encoded to their default codes according to Table III.
Since we have taken care of all special cases of vowels and of vowels appearing in the initial and final positions, we can now safely map the remaining vowels to their default codes.
Example Words: AY-B-D-K-SH-N, D-A-N-S, K-L-YI-R, S-T-R-Y-T-J-Y-Z, A-D-A-P-T, AY-R-N

Step-15: Remove the hyphen sign that was used as a delimiter and store the resulting code in the dictionary along with the word.
Example Words: AYBDKSHN, DANS, KLYIR, STRYTJYZ, ADAPT, AYRN

D. Mapping to Urdu Script
Once we have encoded all the strings in the dictionary, converting them to Urdu script is a trivial task of one-to-one mapping. When an English word is input we perform the following procedure to get the equivalent word in Urdu script.

Step-1: Fetch the code from the dictionary.

Step-2: Parse the code from left to right and perform the mapping of Table I, considering the longest match. This mapping produces the following output for the example words we have selected.
Example Words: ايبڈکشن, ڈانس, کليئر, سٹريٹجيز, اڈاپٹ, ايرن

III. RESULTS AND DISCUSSION
We implemented this algorithm in Java and used IBM's ICU4J library, which is provided under the International Components for Unicode (ICU) project [12]. Carnegie Mellon University's CMU pronunciation dictionary was used to read the transcriptions of English words. Microsoft Access was used at the back end to store the original words, their transcriptions and codes.
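As an illustration of the encoding procedure of Section II.C (not the authors' Java implementation), the steps can be condensed into a single pass over the transcription. The Python sketch below is written under several assumptions: the names CONSONANTS, INITIAL, FINAL, CONTEXT, DEFAULT and encode are hypothetical; the consonant table covers only the phones that occur in the six example words and follows the codes used in those worked examples (d as D, jh as J); the initial mapping of eh to AY follows the worked example for 'aaron'; and only a subset of the context rules of Steps 6 to 8 and 13 is included.

```python
# Step 2: group-1 consonants (only the phones used by the six example words).
CONSONANTS = {
    "b": "B", "d": "D", "jh": "J", "k": "K", "l": "L", "n": "N",
    "p": "P", "r": "R", "s": "S", "sh": "SH", "t": "T", "z": "Z",
}

# Step 3: word-initial vowels (eh -> AY follows the worked example "aaron").
INITIAL = {
    "aa": "AA", "ae": "AY", "ah": "A", "ao": "AW", "ay": "AAI",
    "eh": "AY", "er": "AR", "ey": "AY", "ih": "A", "oy": "AI",
}

# Steps 4, 9, 10 and 12: word-final vowels.
FINAL = {"ah": "A", "ih": "Y", "eh": "H", "ey": "E", "oy": "AIE"}

# Steps 6-8 and 13: (previous code, vowel, next code) -> code. Subset only.
CONTEXT = {
    ("T", "ah", "B"): "Y",
    ("L", "ah", "J"): "W",
    ("B", "ao", "R"): "A",
    ("D", "ae", "P"): "A",
    ("D", "ae", "N"): "A",
}

# Step 14: default codes from Table III ("" stands for null).
DEFAULT = {
    "aa": "A", "ae": "Y", "ah": "", "ao": "A", "aw": "AO", "ay": "AI",
    "eh": "Y", "er": "R", "ey": "Y", "ih": "", "oy": "AI", "iy": "Y",
    "ow": "W", "uh": "W", "uw": "W",
}


def encode(eng_trans: str) -> str:
    """Encode a hyphen-delimited transcription such as 'd-ae-n-s'."""
    phones = eng_trans.split("-")
    last = len(phones) - 1
    codes = []
    for i, phone in enumerate(phones):
        if phone not in DEFAULT:
            # Consonant: direct lookup (raises KeyError if not covered here).
            codes.append(CONSONANTS[phone])
            continue
        # Codes of neighbouring consonants; None if absent or a vowel.
        prev_code = CONSONANTS.get(phones[i - 1]) if i > 0 else None
        next_code = CONSONANTS.get(phones[i + 1]) if i < last else None
        if i == 0 and phone in INITIAL:
            code = INITIAL[phone]
        elif i == last and phone in FINAL:
            code = FINAL[phone]
        elif phone == "ih" and next_code == "R":
            code = "YI"  # Steps 5 and 11
        elif (prev_code, phone, next_code) in CONTEXT:
            code = CONTEXT[(prev_code, phone, next_code)]
        else:
            code = DEFAULT[phone]
        codes.append(code)
    # Step 15: joining without a delimiter removes the hyphens.
    return "".join(codes)
```

On the six example transcriptions this sketch returns the codes listed under Step-15, for instance encode("k-l-ih-r") gives KLYIR. A single pass with fixed precedence (initial, final, ih before R, context, default) is a simplification of the paper's sequential steps and may differ from them on words where several rules could apply.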
Figure 1. Demo of our Application
There is no English dictionary in the local accent against which we could validate our results. Instead, five Urdu-speaking persons were asked to write English text in Urdu script. These persons are all university graduates, and none of them had ever stayed outside Pakistan for more than a year. Each of them was required to convert 3 different paragraphs, against which we validated our results. These paragraphs were randomly selected from the Internet [13]. The accuracy of our system based on these tests is nearly 87%. Table IV shows the testing results, whereas Figure 1 shows a demonstration of our application with the transliteration of two paragraphs. One English paragraph is taken from the famous English short story "The Chapel". The second paragraph is Roman Urdu containing some English words, which were also transliterated correctly when working in combination with the Roman to Urdu transliteration system [1]. It also shows that a few words are missing owing to the error rate and that some words are not aligned to the local accent of native Urdu speakers.

TABLE IV. RESULTS OF OUR ALGORITHM

Total words (Distinct)       2000
Correct Transliteration      1736
Incorrect Transliteration    264
Success Rate                 86.8%
Error Rate                   13.2%

IV. CONCLUSION AND FUTURE DIRECTIONS
Owing to the penetration of English into written and spoken Urdu, English to Urdu transliteration is a very useful application to be used with Roman Urdu transliteration applications. In this paper we proposed an algorithm that converts the script of English words to Urdu script while taking into account the local accent of native Urdu speakers.

The limitation of our work is that this attempt at accent localization took a lot of effort in manual analysis of words in the dictionary, and because of the enormous number of words some cases might have escaped.

In future our goal is to make a phoneme extraction tool that can help in the automatic extraction of patterns and can map English orthography to the corresponding sound in a word's transcription. Based on that automatic analysis we also intend to come up with a dictionary-independent solution and generate transcriptions automatically without consulting the dictionary.

V. REFERENCES
[1] U. Afzal, N. I. Rao and A. M. Sheri, "Adaptive Transliteration Based on Cross-Script Trie Generation: A Case of Roman-Urdu," Proceedings of the Conference on Language & Technology 2009, pp. 33-40.
[2] ApniUrdu.com>Transliterate, http://www.apniurdu.com/Transliterate.html
[3] Urdu Transliteration, http://www.node.pk/urdu/
[4] Urdu Poetry Archive, Transliteration scheme for writing Urdu in English script. http://www.urdupoetry.com/itrans.html
[5] UNGEGN (Working Group on Romanization Systems), "Report on the Current Status of United Nations Romanization Systems for Geographical Names", United Nations Group of Experts on Geographical Names, 2003, Version 2.2.
[6] Type in Urdu - Google Transliteration, http://www.google.com/transliterate/
[7] T. Ahmed, "Roman to Urdu Transliteration using word list", Online Proceedings of Conference of Language and Technology 09, Lahore, 2009.
[8] A. Sen, "Pronunciation Rules for Indian English Text-to-speech System," Workshop on spoken language processing, Mumbai, 2003, pp. 141-148.
[9] A. R. Ali, M. Ijaz, "English to Urdu Transliteration System", Proceedings of the Conference on Language & Technology 2009, Lahore, pp. 15-23.
[10] Russell, Robert, U.S. patent no. 1261167, 1918.
[11] CMU. The CMU Pronunciation Dictionary, www.speech.cs.cmu.edu/cgi-bin/cmudict, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA, 2006.
[12] IBM, ICU4J 3.4, International Component for Unicode for Java, Version 3.6. http://icu.sourceforge.net
[13] English Reading: Short Stories | EnglishClub.com, http://www.englishclub.com/reading/short-stories.htm
