Urdu Computing Standards: Urdu Zabta Takhti (UZT) 1.01
Contents of UZT
UZT 1.01 is an eight-bit code page with 256 slots. It has been divided into various logical sections, as described later. Figure 1 below shows the specification of UZT 1.01, as approved by the Government of Pakistan.

Figure 1: UZT 1.01

UZT 1.01 is divided into the following logical sections:
i. control characters (0 – 31, 127)
ii. punctuation and arithmetic symbols (32 – 47, 58 – 65)
iii. digits (48 – 57)

* Head, Center for Research in Urdu Language Processing, FAST National University of Computer and Emerging Science, B Block, Faisal Town, Lahore. Email: [email protected].
** Professor, Fauji Foundation Institute of Management and Computer Science, Rawalpindi.

Though the driving force behind the contents of UZT and their arrangement was the Terms of Reference (TOR) agreed with the National Language Authority (discussed in Afzal and Hussain 2001, this volume), care was taken to keep UZT similar to ASCII where possible. This is because people are familiar with the character
distributions in ASCII, as it is a worldwide standard. In addition, owing to its universal acceptability, many hardware and software systems (especially the earlier ones, some of which are still deployed) conform very closely to the ASCII standard. Incompatibility of UZT 1.01 with ASCII would mean incompatibility with these systems as well, which would not be a practical solution.

The team therefore had to propose a solution that had underlying conformance with ASCII along with explicit conformance to the TOR. This point of view is necessary to understand some of the logical groups presented here.

Control Characters
ASCII contains special characters that are not visible but act as control characters for various hardware and software operations. These characters occupy decimal positions 0 – 31 and 127, and include null (0), new line (10), escape (27), delete (127), etc. As these characters are used universally and are hard coded in software and hardwired in hardware, UZT uses the same slots for them to avoid conflicts. A complete list of these characters and their details is available in introductory books on computer programming.

Punctuation and Arithmetic Symbols
The symbols used in ASCII are also mapped to their Urdu equivalents (the same symbols in many cases) to provide symbol coverage for Urdu and to conform to existing ASCII. Table 1 below gives explanations for these symbols. Though the symbols could have been put in one continuous space, they were broken into two spaces (slots 32 – 47 and 58 – 65) to conform with ASCII. Slots 48 – 57 are used for digits in ASCII and correspondingly in UZT 1.01 as well.

Table 1. Punctuation and Arithmetic Symbols in UZT
#    Symbol         Remarks
32   space          Break in Urdu connected words, not an explicit space between words. Explicit space is achieved by hard space (65) (see Figure 2 for details)
33   !              Exclamation mark
34   “              Double apostrophe
35   #              Hash symbol
36   currency       Currency symbol ($ in ASCII) but may be replaced by Rs. or equivalent Urdu symbol in UZT
37   percent        Percent sign, may be inverted for Urdu
38   &              Ampersand sign
39   ‘              Single apostrophe
40   (              Open parenthesis (close parenthesis in Urdu)
41   )              Close parenthesis (open parenthesis in Urdu)
42   *              Asterisk or multiplication symbol
43   +              Plus or addition symbol
44   comma          Comma, inverted for Urdu (different from English)
45   -              Minus sign
46   decimal        Decimal point (different for Urdu than English, e.g. small hamza is sometimes used for “ishariya” in Urdu)
47   /              Division sign or slash
58   :              Colon
59   semi-colon     Semi-colon, inverted for Urdu (different from English)
60   <              Less-than symbol
61   =              Equal sign
62   >              Greater-than symbol
63   question mark  Question mark, inverted for Urdu (different from English)
64   @              “at” symbol now frequently used in email addresses
65   hard space     Actual space in Urdu (also see space (32) and Figure 2)

The same symbols have been included in UZT 1.01 as in ASCII, as these symbols are used in both Urdu and English. Some symbols are logically similar but are transformed into a different shape. Their logical names are given and their physical realizations are left to the vendors. For example, the dollar symbol ‘$’ used in ASCII is replaced by a currency symbol in UZT 1.01. Vendors may choose to provide the relevant currency symbol (e.g. rupee in Pakistan), as this symbol is not standardized. In addition, the percent, comma, decimal, semi-colon and question mark symbols are inverted in Urdu. Even though not relevant for Urdu, the ‘@’ symbol has been included at slot 64 because of its universal usage in email addresses. These changes have been incorporated.

Finally, a hard space (65) has been included and distinguished from the normal space (32) because Urdu has two distinct space requirements in word processing (unlike the single space requirement for English). In English, the writing system is unconnected, which entails that a single space can mark a word boundary. In Urdu, the script is inherently connected. However, breaks between characters may occur within words or across words, as illustrated in Figure 2. Within
words, some breaks are natural (e.g. after characters which do not connect at the end, like alif) and some may be intentionally inserted, still preserving the word, as shown in Figure 2, to create multiple ligatures (a ligature being a sequence of connected characters forming a word or sub-word). To distinguish a word boundary from a ligature boundary, hard space and space are used respectively. Figure 2 illustrates that a single ligature can be broken into two ligatures with a space, without breaking the word. A hard space breaks ligatures into different words.

Figure 2. Hard Space (HS) between Characters at Word Boundary and Space (S) between Characters within Ligatures

Digits
Urdu numerals from zero through nine are included in slots 48 – 57 to keep the parallelism with ASCII. The shapes conform to Urdu numerals and are different from English. Urdu zero is written as a dot (slot 48) and should not be confused with the English decimal point (slot 46).

Urdu Diacritics (Aerab)
Urdu is very rich in diacritics (or aerab), though many are implicit in normal writing. However, word processing for Urdu still requires these aerab to be represented in UZT 1.01. The aerab are included in slots 66 – 79 and slots 123 – 126. Again, the aerab could have been included as a single continuous set of slots. However, they were broken into two logical groups to facilitate sorting. According to the Terms of Reference, the aerab in slots 66 – 79 do not affect the sorting order of words, whereas the aerab in slots 123 – 126 affect the sorting order when used. The latter group is placed after the Urdu characters to get the sorting order required by Urdu. Sorting is discussed in detail in a subsequent section. Explanations of these diacritics are given in Table 2. Some of these symbols are not in common usage in Urdu.

Table 2. Diacritics (Aerab) in UZT
Code  Name            Remarks
66    Hamza-e-Izafat  Hamza used to connect words, e.g. in Idara-e-Tehqiq
67    Kasr-e-Izafat   Zer used to connect words, e.g. in “Bang-e-Dera”
68    Khari Zabar
69    Khari Zer
70    Ulta Pesh
71    Leta Pesh       Not in normal usage
72    Leti Zer        Not in normal usage
73    Do Zabar
74    Do Zer
75    Do Pesh
76    Chota Toay      Not in normal usage
77    Jazm
78    Noon Ghunna     Not in normal usage; used as a diacritic to indicate nasalization
79    Shad
123   Null diacritic  To be used where aerab are necessary to be typed but there are no aerab present; see section on sorting.
124   Zabar
125   Zer
126   Pesh

Each aerab takes one byte to represent; therefore, when characters are written with aerab, two bytes are taken per character (and three bytes if the character includes do-chashmey hay, as explained below).

Urdu Characters
There is no agreement on the total number of characters in Urdu. Various researchers have indicated different numbers of Urdu characters (e.g. see Bokhari 1986, Hussain 1997, Kachru 1987, Khan 1997, and Masica 1993). Differences between the various proposals are not in the scope of the current paper. However, it may be pointed out here that the main disagreement about the total number of characters arises from the number of Urdu voiced and voiceless aspirates (e.g. /lh/, /mh/, /nh/, /wh/, etc.), i.e. the letters formed by characters and ‘do-chashmey hay’. To solve the problem, the National Language Authority was requested to provide the current list of characters in ‘standard’ Urdu, and they were included in the Terms of Reference. The
characters included in UZT 1.01 are all those characters listed in the TOR. However, UZT 1.01 has provision to expand these characters (with time) and also allows for (possibly) all characters to form aspirated versions with ‘do-chashmey hay’. Urdu characters fill slots 80 – 121; therefore 42 characters are included. ‘Do-chashmey hay’ is included at the end of the list in slot 122 (instead of its traditional position adjacent to ‘gol hay’ in slot 117) for two reasons. First, though written separately, ‘do-chashmey hay’ is not a character of Urdu but forms characters when combined with other characters (e.g. /bh/, /ph/, etc.) and therefore needs to be distinguished from characters. Second, the sorting sequence, which requires all words starting with /b/ to be before all words starting with /bh/ (and similarly with other unaspirated-aspirated consonantal pairs), comes out naturally if ‘do-chashmey hay’ is placed at the end of the character list.

A few other notable additions are ‘alif-hamza’ and ‘wao-hamza’ in positions 81 and 116 respectively. These were also included as characters, as prescribed by the National Language Authority (NLA), because they take aerab (zer, zabar, pesh etc.) like other characters. In addition, according to NLA specifications, ‘noon-ghunna’ was placed before ‘noon’.

Reserved Control Space
ASCII is a seven-bit standard, going from 0 – 127. Earlier systems, some of which are possibly still deployed, conform to this standard. Therefore, if a one-byte code is sent to these systems, they truncate the most significant bit and convert the byte code to a seven-bit code. This would be especially dangerous if slots 128 – 159 were used, because truncating the most significant bit would map these slots onto the control characters, resulting in unpredictable behavior. Therefore, these slots have not been used. In addition, slot 255 is also not used, as in such a case it would map onto its seven-bit equivalent ‘delete’ in slot 127. These slots may be used in the future, when eight-bit ASCII is universally acceptable.

For the same reasons of truncation, all the Urdu characters, aerab and digits have been included in the lower 128 slots. This ensures that even if a system conforms to a seven-bit code, it may still effectively use UZT to store information in Urdu.

Special Symbols
The Urdu writing system is very rich in symbols. These symbols come from a variety of sources including religion, poetry and calligraphy and are used in Urdu word processing. Slots 160 – 176 include these symbols, with room for further expansion after them. Slots 192 – 199 include more general symbols also in ASCII, including square brackets, curly brackets, underscore and dash. More room is also left after these symbols for future expansion.

Reserved Expansion Space
Slots 177 – 191, 200 – 207 and 240 – 253 have been reserved for future expansion. The committee devising Urdu standards will fill these slots in the future. The slots have been left vacant in anticipation of the future needs of Urdu.

Vendor Area
The code page covers the general and widely used Urdu language characters, aerab and symbols. However, special needs may arise for formatting or for including other more specialized characters not commonly used in Urdu. Slots 208 – 239 have been reserved for this purpose. Vendors may use these slots to define their own symbols for specific applications or formatting (though see the Urdu Text File Format section for some limitations).

Toggle Character
Finally, a character in slot 254 has been reserved to toggle between various code pages. Currently, the toggle character followed by the character zero (slot 48) means the start of UZT 1.01, and the toggle character followed by one (slot 49) means ASCII. This will be relevant for the filename.utx (Urdu Text File) format discussed below. Other standard code pages and their codes will be defined by the Urdu Standards Committee in due course of time. The absence of a toggle defaults to UZT 1.01.

Urdu Text File Format (filename.utx)
In addition to defining the contents of UZT 1.01, a file format for standard Urdu text files has also been defined. This simple standard has been devised to enable exchange of files across different Urdu applications. This is a text-based format that stores the UZT 1.01 eight-bit codes in key-press order. It may store all the characters in UZT 1.01 except those in the vendor area, as the vendor-defined characters are not portable across vendors. Vendors are encouraged to define their own file formats to store data which contains vendor-specific codes. For example, using Microsoft Word one may use the vendor-specific filename.doc extension to store vendor-specific formatting etc., but may also ignore vendor-specific information and store information as plain filename.txt, which is readable by other software as well.
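The toggle convention and the vendor-area restriction described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not part of the standard: the function names `split_by_code_page` and `contains_vendor_codes`, the code page labels, and the decision to raise an error on an unknown selector are hypothetical. The slot numbers (254 for the toggle, 48 and 49 as selectors, 208 – 239 for the vendor area) are taken from the text.

```python
from typing import List, Tuple

TOGGLE = 254                                  # toggle character slot
CODE_PAGES = {48: "UZT 1.01", 49: "ASCII"}    # selector slot -> code page label
VENDOR_AREA = range(208, 240)                 # slots 208-239


def split_by_code_page(data: bytes, default: str = "UZT 1.01") -> List[Tuple[str, bytes]]:
    """Split a byte stream into (code page, bytes) runs.

    A toggle byte (254) followed by slot 48 starts a UZT 1.01 run and
    followed by slot 49 starts an ASCII run. With no toggle, the data is
    treated as UZT 1.01. Rejecting unknown selectors is an assumption.
    """
    runs: List[Tuple[str, bytes]] = []
    page = default
    current = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == TOGGLE:
            if i + 1 >= len(data):
                raise ValueError("toggle character at end of data")
            selector = data[i + 1]
            if selector not in CODE_PAGES:
                raise ValueError(f"unknown code page selector: {selector}")
            if current:                       # close the run in progress
                runs.append((page, bytes(current)))
                current = bytearray()
            page = CODE_PAGES[selector]
            i += 2                            # skip toggle and selector
            continue
        current.append(byte)
        i += 1
    if current:
        runs.append((page, bytes(current)))
    return runs


def contains_vendor_codes(uzt_bytes: bytes) -> bool:
    """True if a UZT run uses vendor-area slots, which are not portable."""
    return any(b in VENDOR_AREA for b in uzt_bytes)
```

A reader of such a file could call `split_by_code_page` first and then apply `contains_vendor_codes` to each UZT 1.01 run to flag data that should not appear in a portable text file.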
A special point to note is that Urdu is written from right to left but its number system is written from left to right. The format assumes that all characters are written from left to right. Therefore the front-end formatting will have to take care of reversing the digit directions (this file format will store the numbers in the sequence they are typed).

Sorting
Sorting is a complex issue in Urdu because it is achieved through the characters and aerab, but the aerab are normally not written in Urdu. A native speaker of Urdu knows which aerab are implicitly present and can therefore sort these words. However, a computer would need the aerab explicitly defined to do aerab-specific sorting. According to the TOR, the sequence in Figure 3 is used for sorting.

... was not possible to achieve both levels of sorting directly through the code page (even if separate codes are assigned for letters with different aerab, as proposed by one of the intermediate code pages; see Afzal and Hussain, 2001, this volume). However, the characters in the code page have been organized such that first-level sorting without aerab is achieved just based on the UZT 1.01 code of each character (including characters with do-chashmey hay, even though they take two bytes instead of one to store).

Level-two sorting must be achieved through software. Therefore, to perform correct sorting, the sorting algorithm will require the aerab to be ignored in the first pass, and sorting will be done on character codes. Then a second pass will insert the aerab to resolve conflicts between words with the same initial character sequences.
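The two-level scheme described in this section can be sketched in Python as a compound sort key. This is a minimal sketch under stated assumptions, not the algorithm prescribed by the TOR: the function names are hypothetical, words are assumed to be stored as UZT 1.01 byte strings, and the second level is assumed to follow plain byte order once the non-sorting aerab (slots 66 – 79) are dropped. The actual sequence is the one given in Figure 3, which is not reproduced here.

```python
from typing import Iterable, List, Tuple

NON_SORTING_AERAB = frozenset(range(66, 80))   # slots 66-79: never affect order
SORTING_AERAB = frozenset(range(123, 127))     # slots 123-126: affect level two


def level_one_key(word: bytes) -> bytes:
    """Character codes only: all aerab are ignored in the first pass."""
    return bytes(
        b for b in word
        if b not in NON_SORTING_AERAB and b not in SORTING_AERAB
    )


def level_two_key(word: bytes) -> bytes:
    """Character codes plus the aerab that take part in sorting."""
    return bytes(b for b in word if b not in NON_SORTING_AERAB)


def uzt_sort_key(word: bytes) -> Tuple[bytes, bytes]:
    # Tuples compare element by element, so level two is consulted
    # only when two words tie on level one.
    return (level_one_key(word), level_two_key(word))


def sort_words(words: Iterable[bytes]) -> List[bytes]:
    return sorted(words, key=uzt_sort_key)
```

Because do-chashmey hay sits in slot 122, after all Urdu characters, the first-level key already places a word beginning with an unaspirated consonant before one beginning with its aspirated counterpart, as the text describes. Combining both passes into one tuple key is a design choice made here for brevity; a literal two-pass implementation would give the same ordering under these assumptions.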
Future Directions
The standardization effort for Urdu computing has just started. Standardizing the code page (UZT 1.01) was the first and basic step identified at the First Seminar on the Standardization of Urdu Keyboard Layout and Internal Character Representation. However, there is much more work which needs to be accomplished. This section briefly highlights some of the areas that need to be addressed.
... sequence, UZT 1.01 will still have to be consulted. In addition, for memory intensive monolingual Urdu applications, UZT 1.01 gives more optimal single-byte storage (versus double-byte storage for Unicode).

Standardization of Urdu Fonts
Urdu has a rich tradition of calligraphy with many distinct schools of writing. Among the more salient are the Naskh and Nastalique writing traditions acquired through Arabic and Persian (Majeed 1989). Within these schools there is a lot of variation. Work needs to be done to identify and define these fonts for Urdu. In addition, some fonts need to be developed and made available as shareware for use across vendors. These fonts need to be developed for character and ligature based systems.

... to be expanded to represent Urdu completely in Unicode. In addition, more work needs to be done in the areas of fonts and keyboard. Application of these standards to Urdu computing, including desktop publishing, internet and other software systems, needs to be further explored. As more Urdu applications are developed, these standards will be tested and must be revised to achieve optimal solutions.

Acknowledgements
The authors acknowledge the voluntary efforts of all the contributors who have been involved in this standardization process since 1998. Without their contributions, standardization would not have been possible. A partial list of contributors is given in Appendix B of another paper in this volume (Afzal and Hussain, 2001).