An Introduction To Unicode - The Trainer's Friend

Unicode
An Introduction to Unicode
Unicode Concepts and Terminology
Unicode Mappings
Appendix: UTF-32 Character Assignment Ranges
Appendix: UTF-32 <-> UTF-16 <-> UTF-8 sample mappings
by Steve Comstock
The Trainer's Friend, Inc. http://www.trainersfriend.com 303-355-2752
v2.1
Copyright 2012 by Steven H. Comstock
The following terms that may appear in these course materials are trademarks or registered trademarks: Trademarks of the International Business Machines Corporation: AD/Cycle, AIX, AIX/ESA, Application System/400, AS/400, BookManager, CICS, CICS/ESA, COBOL/370, COBOL for MVS and VM, COBOL for OS/390 & VM, Common User Access, CORBA, CUA, DATABASE 2, DB2, DB2 Universal Database, DFSMS, DFSMSds, DFSORT, DOS/VSE, Enterprise System/3090, ES/3090, 3090, ESA/370, ESA/390, Hiperbatch, Hiperspace, IBM, IBMLink, IMS, IMS/ESA, Language Environment, MQSeries, MVS, MVS/ESA, MVS/XA, MVS/DFP, NetView, NetView/PC, Object Management Group, Operating System/400, OS/400, PR/SM, OpenEdition MVS, Operating System/2, OS/2, OS/390, OS/390 UNIX, OS/400, QMF, RACF, RS/6000, SOMobjects, SOMobjects Application Class Library, System/370, System/390, Systems Application Architecture, SAA, System Object Model, TSO, VisualAge, VisualLift, VTAM, VM/XA, VM/XA SP, WebSphere, z/OS, z/VM, z/Architecture, zSeries Trademarks of Microsoft Corp.: Microsoft, Windows, Windows NT, Visual Basic, Microsoft Access, MS-DOS, Windows XP Trademarks of Micro Focus Corp.: Micro Focus Trademarks of America Online, Inc.: America Online Trademarks of Quercus Systems: Personal REXX, REXXTERM Trademark of Chicago-Soft, Ltd: MVS/QuickRef Trademark of Crystal Computer Services: Crystal Reports Trademark of Phoenix Software International: (E)JES Registered Trademarks of Institute of Electrical and Electronic Engineers: IEEE, POSIX Registered Trademarks of Corel Corporation: Corel, CorelDRAW, Corel VENTURA Registered Trademark of Oracle Corporation: Oracle Registered Trademark of The Open Group: UNIX Trademark of Sun Microsystems, Inc.: Java Registered Trademark of Linus Torvalds: LINUX Registered Trademark of Unicode, Inc.: Unicode Trademarks held on behalf of World Wide Web Consortium: W3C, XHTML, XSL, WebFonts
Section Preview
* Introduction to Unicode
  - Characters
  - Characters, Glyphs, and Fonts
  - Coding Schemes
  - Code pages
  - Standards
  - Unicode
* Understanding Unicode Characters
Characters
* A character is an abstract concept that has evolved along with our understanding of language and information
* Initially, when most of us think of characters, we think of a particular character set: alphabetic letters, numeric characters, special characters, and so on
  - But subtleties start to appear when you consider issues of other languages, different fonts, and the need for special meaning characters
    - For example, you need characters to control a communications session when transmitting data
    - Obviously, you cannot use characters from the set of data you are sending to also be control characters: the communications process could not distinguish between data and the control functions
Characters, Glyphs, and Fonts
* In computer terms, a character is a grouping of bits (binary ones and zeros) in packages of 8: one or more bytes
* There are two broad classes of characters: data characters and control characters
  - Although one could make a case that data characters are just control characters whose function is to display a glyph
    - A glyph is the visible representation of a character
  - Consider the [data] character called "upper case A"; the following are various glyphs that represent that character:
    - A - Arial
    - A - Times New Roman
    - A - Courier New
    - A - Garamond
    - A - Bodoni
    - A - Park Avenue
    - A - FlamencoD
  - And so on; notice the concept of a font sneaking in here: a font is a set of glyphs used to represent a collection of characters [usually in a similar style]
Coding Schemes
* But an upper case A is an upper case A, regardless of the glyph used to represent it
* In computers, we assign characters to bit patterns (and vice versa), and a "character" is an abstract thing, independent of any glyph
* The rules we use to make these assignments between characters and bit patterns are called coding schemes, and there are many in use today, for historical reasons
* There are three coding schemes most people in the IS industry need to be cognizant of:
  - EBCDIC - Extended Binary Coded Decimal Interchange Code; used by IBM mainframes and AS/400 machines
  - ASCII - American Standard Code for Information Interchange; used by almost all other hardware
  - Unicode - gaining wide acceptance in use by software
* There are other coding schemes available, but from a practical point of view, we can get the vast majority, if not all, of our work done if we are aware of these three coding schemes
Codepages
* Even awareness of coding schemes is not quite enough to get us all we need for practical use
* Again, for historical and cultural reasons, many coding schemes have several variations, each slightly different from the others
  - For example, in some environments you have need of a symbol like , but in other environments, users are not even aware of this character
  - So computer designers introduced the concept of a codepage, which is a variation of a coding scheme
* After all, in 8 bits you only have 256 possible patterns
  - You can run out of available characters pretty quickly if you allow all those strange foreign, mathematical, scientific, engineering, currency, and other symbols
* The solution was to use codepages (also spelled as two words: "code page" or "code pages")
  - Users could set codepages for different environments
  - However, you cannot mix codepages in a single environment: at any point in time your 256 bit patterns map to exactly one set of characters
Codepages, continued
* When a data character arrives in a computer from a magnetic tape, diskette, CD-ROM, network transmission, or so on, the computer just stores the character as it comes in, no judgements being made
* But consider what happens when a character comes in from a keyboard:
  - The user presses a key with a glyph on it of, say, an upper case A
  - The keyboard electronics assign a bit pattern to the character and transmit it to the computer, where it is received as part of an I/O program
  - This I/O program may reassign the bit pattern before it is stored, depending on the current codepage:
      keyboard bit pattern > codepage > stored bit pattern
* Similarly, when a character is sent by the computer to a printer or display unit, that output device has a codepage mapping followed by a font mapping to decide how to display the character on the device:
      stored bit pattern > codepage > character > font > glyph
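To make the "stored bit pattern > codepage > character" step concrete, here is a minimal sketch in Python. It assumes Python's built-in codec names `cp037` (an EBCDIC codepage), `latin-1`, and `cp437`; the choice of the byte x'C1' is ours, purely for illustration.

```python
# One stored bit pattern, interpreted under three different codepages.
stored = bytes([0xC1])

as_ebcdic = stored.decode("cp037")    # an IBM EBCDIC codepage
as_latin1 = stored.decode("latin-1")  # ISO 8859-1
as_cp437 = stored.decode("cp437")     # the original IBM PC codepage

# The byte never changes; only the codepage used to interpret it does.
for name, char in (("cp037", as_ebcdic), ("latin-1", as_latin1), ("cp437", as_cp437)):
    print(f"x'C1' under {name}: {char!r} (U+{ord(char):04X})")
```

The same byte yields an upper case A under the EBCDIC codepage but a different character under each of the other two, which is exactly why the codepage in effect has to be known.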
Standards
* To ensure consistency and clarity, a number of standards bodies have been created to develop and enhance standards for a variety of areas, including IS; these bodies include:
  - ISO - the International Organization for Standardization
  - ANSI - the American National Standards Institute, which is the US member of ISO
  - The ISO has a standard called ISO 646 which is very close to the ASCII character encoding standard
* Ultimately, one wants a single code page: a single, universal encoding scheme for all characters
* From the perspective of international communications, one needs an encoding scheme that is
  - Universal - covers all characters needed in all likely situations
  - Efficient - avoids escape character sequences for special encoding, for example
  - Unambiguous - every character has one and only one bit pattern mapping
Unicode
* Unicode is an encoding scheme developed by the Unicode Consortium (incorporated under the name Unicode, Inc. in 1991) and the ISO
* The Unicode Consortium is backed by most of the major players in the IS game, including these (and many more):
  Adobe Systems, Apple Computer, Compaq Computer, Ericsson Mobile Communications, Hewlett-Packard, IBM, Microsoft, NCR, Netscape, Oracle, PeopleSoft, Quark, SAP, SAS Institute, SHARE, Software AG, Sun Microsystems, Sybase, Unisys
Unicode, continued
* In 1992, the Unicode Consortium and ISO agreed to merge their character encoding standards, so the character sets map exactly
  - In addition to assigning names and bit-pattern mappings to characters, in conjunction with the ISO, the Unicode standard also provides implementation algorithms, properties, and semantic information
* The basic, original premise was to use 16 bits for every character
  - This allows for 64K unique patterns (65,536)
  - Maintaining compatibility with as many already existing standards as possible
* By the year 2000, however, it was clear that more character space was needed
  - In May of 2001, 44,946 new characters were added (mostly CJK (Chinese, Japanese, Korean) characters, along with some historic scripts and several sets of symbols)
    - As of Unicode standard 3.1 there were 94,140 characters included
  - As of Unicode standard 4.0 (June, 2003), there are 96,382 characters in the standard, and for 4.1 (March, 2005) the count is now 97,655; for 5.0 (July, 2006), there are 99,024; for 6.0 (February, 2011): 109,384; for 6.1 (January, 2012): 110,116
Unicode, continued
* There are three alternative ways of representing Unicode characters:
  - UTF-16: the basic 16-bit encoding scheme: two bytes used for every Unicode character
    - But version 2.0 of the Unicode standard introduced a concept called surrogate pairs that allows some Unicode characters to be represented by a pair of two-byte values
  - UTF-8: an algorithm for converting Unicode characters to a string of characters that are one, two, three, or four bytes in length, and back
  - UTF-32: a 32-bit encoding, the basis for the ultimate character encoding, allowing for 1,114,112 character assignments (note: the leftmost 11 of the 32 bits must be all binary zeros)
    - This encoding was made an official part of the Unicode standard in version 3.1 in May, 2001
* UTF stands for Unicode Transformation Format
* Here are some pointers to Web sites for more information about Unicode:
  - Unicode home page: http://www.unicode.org
  - IBM: http://www-106.ibm.com/developerworks/unicode/
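As a rough illustration of the three representations, the following Python sketch encodes a few characters each way using Python's built-in codecs. The helper name `hex_bytes` and the sample characters are our own choices, not part of the course materials.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hex pairs, e.g. '00 41'."""
    return " ".join(f"{b:02X}" for b in data)

# An ASCII letter, a CJK ideograph, and a character beyond x'FFFF'.
samples = ("A", "\u672C", "\U0001D11E")

for ch in samples:
    print(f"U+{ord(ch):04X}",
          "| UTF-8:", hex_bytes(ch.encode("utf-8")),
          "| UTF-16:", hex_bytes(ch.encode("utf-16-be")),
          "| UTF-32:", hex_bytes(ch.encode("utf-32-be")))
```

The big endian codec variants are used here so that no byte order mark is added to the output.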
Unicode, continued
* So why do we care about this on the mainframe?
  - IBM is trying to position mainframes as the ultimate server for intranets and the Internet / World Wide Web
    - Web pages are generally coded in HTML (HyperText Markup Language) or XHTML (eXtensible HyperText Markup Language)
    - HTML 4 and all versions of XHTML require support for Unicode
  - XML (eXtensible Markup Language) is becoming one of the premier data exchange formats - requires Unicode support
  - At some point in time, ("not too far down the road" to quote one of the z/OS developers) z/OS will require the Unicode support functions be installed
  - DB2 can store / access Unicode data in CHAR, VARCHAR, and CLOB data types
    - In Version 8, the DB2 catalog is stored in UTF-8
  - Many databases and programming languages on UNIX and Windows machines support Unicode
  - Current mainframe compilers (COBOL, PL/I, C, C++) all support Unicode
* Unicode can provide, eventually, the ability to have a single codepage yet support all languages simultaneously
Unicode, concluded
* Although there are people who are against Unicode (and even some competing standards), Unicode seems to be the way of the future
  - Enabling single encoding and data interchange across platforms and simultaneous multiple language support on screens and reports
* Also note that z-series machines have a number of instructions that work with Unicode, in the UTF-16 format (PKU, UNPKU, CLCLU, MVCLU, TROO, TROT, TRTO, TRTT)
  - But UTF-8 seems to be the most widely used format on the Web and, probably, in XML that's not even used on the Web
  - Instructions to convert between UTF-16 and UTF-8 have been available since machines introduced in 1999 (CUTFU, CUUTF)
  - In 2004, instructions were added to convert between:
    - UTF-8 <--> UTF-16 (new names for old instructions)
    - UTF-8 <--> UTF-32 (new instructions)
    - UTF-16 <--> UTF-32 (new instructions)
Section Preview
* Understanding Unicode Characters
  - Unicode Representations
  - UTF-32: Unicode Scalar Values
  - UTF-16
  - Surrogate Pairs
  - UTF-16 -> Unicode Scalar Value
  - UTF-32 -> UTF-16
  - UTF-8
  - UTF-32 -> UTF-8
  - UTF-8 -> UTF-32
  - Other Mappings
  - Unicode - Conclusion
Unicode Representations
* This section is a technical discussion of how Unicode characters are stored using the three formats: UTF-8, UTF-16, and UTF-32
  - According to the standard, these are considered equally valid
    - In the sense that every Unicode character may be represented in any of these formats, and the mapping between formats is well-defined
  - To work with Unicode data, one has to know which encoding format has been used
    - This may be supplied external to the data itself, as in an HTTP header or HTML META statement, for example
    - In some cases, the program processing a string may be able to examine the string and deduce the format being used (but this is not preferred)
UTF-32: Unicode Scalar Values
* Every Unicode character is assigned an integer value, the Unicode scalar value
* UTF-32 is the set of Unicode Scalar Values
  - This is also sometimes called UCS-4, meaning the Universal Character Set as 4-bytes per character
* The possible range of values in the Unicode scalar set is x'00000000' - x'0010FFFF'
  - In binary, the maximum allowed value is
      0000 0000 0001 0000 1111 1111 1111 1111
    so at most 21 bits are significant; in decimal, the values range from 0 to 1,114,111
  - Every Unicode character is assigned a number in this range, a point along the string of integers in this range (this is sometimes called a code point)
    - Although not every number in this range is assigned to a Unicode character
    - Also, a subset of this range is reserved for surrogate pairs...
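The range limits above can be checked with a few lines of Python, whose built-in `ord` and `chr` functions convert between characters and their scalar values. The constant name `MAX_SCALAR` is ours; this is only a sketch.

```python
MAX_SCALAR = 0x10FFFF  # largest Unicode scalar value

print(MAX_SCALAR)               # the decimal upper limit
print(MAX_SCALAR.bit_length())  # number of significant bits
print(f"{ord('A'):08X}")        # scalar value of "A" as eight hex digits

# Values beyond the range are not code points at all.
try:
    chr(MAX_SCALAR + 1)
except ValueError as exc:
    print("rejected:", exc)
```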
UTF-16
* UTF-16 was the beginning point of Unicode character assignments
* Initially, each UTF-16 character was a single two-byte unit
  - But when surrogates needed to be introduced, to accommodate a larger character set, some characters became represented by a single two-byte unit, others by a pair of two-byte units
* When a processing program such as a browser or editor is working with UTF-16 data, it assumes each two-byte unit represents a character
  - Except that certain values are reserved to represent surrogate pairs: situations where it takes two two-byte units to represent a single character
* The theoretical range of Unicode scalar values is x'0000 0000' - x'0010 FFFF'
  - For Unicode scalar values greater than or equal to x'0001 0000', surrogate pairs are used
  - Values in the range x'0000 D800' - x'0000 DFFF' are reserved for use in surrogate pairs
Surrogate Pairs
* How to recognize when a two-byte unit begins a surrogate pair?
  - If a UTF-16 unit has a value in the range x'D800' - x'DBFF', that unit is a high surrogate and you need to combine that two-byte unit and the next, which must be a low surrogate, to determine the actual character that is being represented
  - Low surrogates are in the range x'DC00' - x'DFFF'
    - It is an error for a low surrogate not to be preceded by a high surrogate, and for a high surrogate not to be followed by a low surrogate

  In binary:
    high surrogates:  1101 1000 0000 0000 - 1101 1011 1111 1111
    low surrogates:   1101 1100 0000 0000 - 1101 1111 1111 1111
  In decimal:
    high surrogates:  55,296 - 56,319
    low surrogates:   56,320 - 57,343

Notes
* Each surrogate range contains 1,024 values, so the possible number of surrogate pair values is 1,024 x 1,024 or 1,048,576
* The vast majority of characters do not require surrogate pairs
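The recognition rules above can be sketched as a small Python check. The function names are hypothetical, and the input is assumed to be a list of integers, one per two-byte UTF-16 unit.

```python
def is_high_surrogate(unit: int) -> bool:
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit: int) -> bool:
    return 0xDC00 <= unit <= 0xDFFF

def surrogates_well_formed(units) -> bool:
    """True if every high surrogate is followed by a low surrogate
    and every low surrogate is preceded by a high surrogate."""
    expecting_low = False
    for unit in units:
        if expecting_low:
            if not is_low_surrogate(unit):
                return False      # high surrogate not followed by a low one
            expecting_low = False
        elif is_high_surrogate(unit):
            expecting_low = True
        elif is_low_surrogate(unit):
            return False          # low surrogate with no preceding high one
    return not expecting_low      # a trailing high surrogate is also an error

print(surrogates_well_formed([0x0041, 0xD834, 0xDD1E]))
print(surrogates_well_formed([0xD834, 0x0041]))
```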
UTF-16 -> Unicode Scalar Value
* The algorithm to convert from a UTF-16 Unicode character to the Unicode scalar value (in other words, UTF-16 -> UTF-32) is this:
  - If a two-byte unit is not a surrogate value, the Unicode scalar value is the two-byte value itself
    - So if the two-byte unit is in the range x'0000' - x'D7FF' or x'E000' - x'FFFF', the Unicode scalar value is that value (or, equivalently, x'0000 0000' - x'0000 D7FF' and x'0000 E000' - x'0000 FFFF')
  - If a two-byte unit is a surrogate value, the character is composed of two two-byte units, so calculate the Unicode scalar value as the sum of
    - (The high surrogate - x'D800') * x'0400'
    - (The low surrogate - x'DC00')
    - x'0001 0000'
* We examine this algorithm more carefully ...
UTF-16 -> Unicode Scalar Value, continued
Notes
* The first calculation provides the displacement into the high surrogate range (resulting in a number in the range x'0000' - x'03FF' or, in decimal, 0 - 1023 or, in binary: b'0000 0000 0000 0000' - b'0000 0011 1111 1111')
* Multiplying by x'0400' (decimal 1024) effectively shifts the value to the left 10 bits, producing numbers in the range x'0000 0000' - x'000F FC00' with the last 10 bits all zeros

                          0000 00xx xxxx xxxx
    becomes     0000 0000 0000 xxxx xxxx xx00 0000 0000

* The second value is the displacement into the low surrogate range (also resulting in a number in the range x'0000' - x'03FF' or, in decimal, 0 - 1023)
* Adding the two numbers (inserting leading zeros in the first to make them the same length) and the x'1 0000':

    0000 0000 0000 xxxx xxxx xx00 0000 0000
    0000 0000 0000 0000 0000 00yy yyyy yyyy
    0000 0000 0000 0001 0000 0000 0000 0000

* Adding the x'0001 0000' ensures the resulting Unicode scalar values are in the range x'0001 0000' - x'0010 FFFF'
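The three-term sum can be written directly in Python. This is a sketch only: the function name is ours, and the sample pair x'D834' x'DD1E' is an illustrative value we chose.

```python
def surrogate_pair_to_scalar(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a Unicode scalar value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:04X}")
    return ((high - 0xD800) * 0x0400   # displacement in high range, shifted left 10 bits
            + (low - 0xDC00)           # displacement in low range
            + 0x10000)                 # start of the range reached via surrogate pairs

print(f"{surrogate_pair_to_scalar(0xD834, 0xDD1E):08X}")
```

The lowest and highest possible pairs, x'D800' x'DC00' and x'DBFF' x'DFFF', land on the two ends of the range x'0001 0000' - x'0010 FFFF'.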
UTF-16 -> Unicode Scalar Value, continued
* We can look at Unicode scalar assignments this way:

    x'0000 0000' - x'0000 D7FF'   basic 16-bit codes
    x'0000 D800' - x'0000 DFFF'   surrogate values (assigned, but not to characters)
    x'0000 E000' - x'0000 FFFF'   basic 16-bit codes
    x'0001 0000' - x'0010 FFFF'   computed from surrogate pairs

* These ranges may be further subdivided for study but these subsets are not of interest in this paper
  - However, the next level of detail is presented in the first Appendix to this document
  - The second Appendix lists specific mapping values between UTF-32, UTF-16, and UTF-8
* At one point in the cycle of development, there was a Unicode encoding called UCS-2 (Universal Character Set as 2-bytes per character)
  - UCS-2 is, essentially, UTF-16 without support for surrogate pairs
    - UCS-2 is currently supplanted by UTF-16
UTF-32 -> UTF-16
* On the other hand, given a UTF-32 character, how do you represent it in UTF-16?
  - This algorithm, of course, reverses the steps before:
    - For characters less than x'0001 0000', the UTF-16 representation is simply the rightmost 16 bits
    - For characters in the range x'0001 0000' to x'0010 FFFF' we need to build the high surrogate (HS) and low surrogate (LS) two-byte patterns, as follows...
      Subtract x'0001 0000'; this gives a value in the range x'0000 0000' to x'000F FFFF', call this value char

        HS = x'D800' + (char / x'400')     (integer division)
        LS = x'DC00' + (char % x'400')

    - In other words, consider the 20 rightmost bits of char to be designated this way: xxxx xxxx xx|yy yyyy yyyy
    - then the HS is x'D800' plus the leftmost 10 bits (the x's) and the LS is x'DC00' plus the rightmost 10 bits (the y's)
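A minimal Python sketch of this direction follows. The function name is hypothetical; the result is a list of one or two integers, each standing for a two-byte UTF-16 unit. Rejecting values inside the surrogate range is our addition, based on the earlier statement that those values are reserved.

```python
def scalar_to_utf16(scalar: int) -> list:
    """Return the UTF-16 unit(s) for a Unicode scalar value."""
    if not 0 <= scalar <= 0x10FFFF or 0xD800 <= scalar <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {scalar:X}")
    if scalar < 0x10000:
        return [scalar]                 # simply the rightmost 16 bits
    char = scalar - 0x10000             # now in the range 0 .. x'F FFFF'
    hs = 0xD800 + (char // 0x400)       # leftmost 10 bits (the x's)
    ls = 0xDC00 + (char % 0x400)        # rightmost 10 bits (the y's)
    return [hs, ls]

print([f"{u:04X}" for u in scalar_to_utf16(0x1D11E)])
```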
UTF-8
* One of the major reasons Unicode has been successful is a deliberate decision to encompass as many existing standards as possible
* When it came to the 7-bit ASCII / ISO 646 standard, Unicode allowed an 8-bit representation for characters with binary code points from b'0000 0000' to b'0111 1111'
  - In other words, the first 128 Unicode characters are ASCII!
* UTF-8 allows you to represent any Unicode character as a string of one, two, three, or four bytes!
* Conversely, when a processor knows it is dealing with UTF-8, it starts by looking at one byte and the value will imply whether it needs to use one, two, three, or four bytes to construct the UTF-32 value (Unicode scalar value)
UTF-32 -> UTF-8
* The mapping works like this
  - Unicode scalar values in the range 00000000-0000007F map to single byte values 00-7F
  - Unicode scalar values in the range 00000080-000007FF map to two byte values, where the first byte is in the range C2-DF and the second byte is in the range 80-BF
    - Specifically looking at the bit patterns:
        UTF-32    0000 0yyy yyxx xxxx
        maps to
        UTF-8     110y yyyy 10xx xxxx
  - Unicode scalar values in the range 00000800-00000FFF map to three byte values, where the first byte is E0, the second byte is in the range A0-BF, and the third byte is in the range 80-BF
    - Specifically looking at the bit patterns:
        UTF-32    0000 1yyy yyxx xxxx
        maps to
        UTF-8     1110 0000 101y yyyy 10xx xxxx

UTF-32 -> UTF-8, continued
* The mapping works like this, continued
  - Unicode scalar values in the range 00001000-0000FFFF map to three byte values, where the first byte is in the range E1-EF, the second byte is in the range 80-BF, and the third byte is in the range 80-BF
    - The surrogate values 0000D800-0000DFFF are not characters, so they are excluded from this mapping
    - Specifically looking at the bit patterns:
        UTF-32    zzzz yyyy yyxx xxxx
        maps to
        UTF-8     1110 zzzz 10yy yyyy 10xx xxxx
  - Unicode scalar values in the range 00010000-0003FFFF map to four byte values, where the first byte is F0, the second byte is in the range 90-BF, the third byte is in the range 80-BF, and the fourth byte is in the range 80-BF
    - Specifically looking at the bit patterns:
        UTF-32    0000 00uu zzzz yyyy yyxx xxxx
        maps to
        UTF-8     1111 0000 10uu zzzz 10yy yyyy 10xx xxxx
UTF-32 -> UTF-8, continued
* The mapping works like this, continued
  - Unicode scalar values in the range 00040000-000FFFFF map to four byte values, where the first byte is in the range F1-F3, the second byte is in the range 80-BF, the third byte is in the range 80-BF, and the fourth byte is in the range 80-BF
    - Specifically looking at the bit patterns:
        UTF-32    0000 uuuu zzzz yyyy yyxx xxxx
        maps to
        UTF-8     1111 00uu 10uu zzzz 10yy yyyy 10xx xxxx
  - Unicode scalar values in the range 00100000-0010FFFF map to four byte values, where the first byte is F4, the second byte is in the range 80-8F, the third byte is in the range 80-BF, and the fourth byte is in the range 80-BF
    - Specifically looking at the bit patterns:
        UTF-32    000u uuuu zzzz yyyy yyxx xxxx
        maps to
        UTF-8     1111 0uuu 10uu zzzz 10yy yyyy 10xx xxxx
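The bit patterns above collapse into four cases when written as code. The following Python function is a sketch under that reading (the name `scalar_to_utf8` is ours); in practice one would simply use a library routine such as Python's own `str.encode("utf-8")`, which serves here as a cross-check.

```python
def scalar_to_utf8(cp: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 using the bit patterns above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:X}")
    if cp <= 0x7F:                       # one byte: 0xxx xxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # two bytes: 110y yyyy 10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # three bytes: 1110 zzzz 10yy yyyy 10xx xxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # four bytes: 1111 0uuu 10uu zzzz ...
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check a value from each range against Python's built-in encoder.
for cp in (0x41, 0x80, 0x7FF, 0x800, 0x672C, 0xFFFF, 0x10000, 0x1D11E, 0x10FFFF):
    assert scalar_to_utf8(cp) == chr(cp).encode("utf-8")
```

The restricted byte ranges listed above (for example, a second byte of A0-BF after E0, or 80-8F after F4) fall out of the arithmetic automatically when encoding; they only need explicit checks when decoding.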
UTF-8 -> UTF-32
* Obviously, the reverse mapping works based on the value a processor finds in the first byte of a UTF-8 string:
  - If a byte is in the range 00-7F, it constructs the UTF-32 output by prefixing hex 000000
  - If a byte is in the range C2-DF, it knows to take that byte and the next to build the UTF-32 value
  - If a byte is E0-EF, it knows to take that byte and the next two bytes to build the UTF-32 value
  - If a byte is F0-F4, it knows to take that byte and the next three to build the UTF-32 scalar value (and, if the target format is UTF-16, to then map that value to a surrogate pair)
  - Any other values in the first byte indicate an error
* The mechanics should be reasonably apparent from the preceding pages, and details are left as a task for the interested reader
  - The second Appendix to this document shows many code point mappings between UTF-32, UTF-16, and UTF-8
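For readers who would like a starting point, here is one possible Python sketch of the first-byte dispatch described above. The function name is hypothetical, a non-empty input is assumed, and the sketch is deliberately simplified: it checks that trailing bytes are in the range 80-BF but does not enforce the narrower second-byte ranges that apply after E0, ED, F0, and F4.

```python
def utf8_first_scalar(data: bytes):
    """Decode the first character of a non-empty UTF-8 byte string.

    Returns (scalar value, number of bytes consumed).
    """
    lead = data[0]
    if lead <= 0x7F:
        return lead, 1                     # one byte: value is the byte itself
    if 0xC2 <= lead <= 0xDF:
        length, value = 2, lead & 0x1F     # keep the 5 payload bits
    elif 0xE0 <= lead <= 0xEF:
        length, value = 3, lead & 0x0F     # keep the 4 payload bits
    elif 0xF0 <= lead <= 0xF4:
        length, value = 4, lead & 0x07     # keep the 3 payload bits
    else:
        raise ValueError(f"invalid first byte: {lead:02X}")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if not 0x80 <= byte <= 0xBF:
            raise ValueError(f"invalid trailing byte: {byte:02X}")
        value = (value << 6) | (byte & 0x3F)   # append 6 payload bits
    return value, length

print(utf8_first_scalar(bytes.fromhex("E69CAC")))
```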
Other Mappings
* We have not discussed these mappings
  - UTF-8 -> UTF-16
  - UTF-16 -> UTF-8
* You can accomplish these mappings by using UTF-32 as an intermediate format, combining the UTF-32 <-> UTF-8 and UTF-32 <-> UTF-16 mappings already described
  - Or it's possible to devise direct mappings based on what we've discussed
* On IBM z/Architecture machines, there are instructions that convert between Unicode formats:
  - CU12 - from UTF-8 to UTF-16 (also known as CUTFU)
  - CU21 - from UTF-16 to UTF-8 (also known as CUUTF)
  - CU14 - from UTF-8 to UTF-32
  - CU41 - from UTF-32 to UTF-8
  - CU24 - from UTF-16 to UTF-32
  - CU42 - from UTF-32 to UTF-16
* Our interest in this paper has been simply in demonstrating the interrelationships between the three Unicode formats and we feel that has been explored thoroughly enough at this point
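On platforms with a Unicode-aware language, the "intermediate format" approach is often what a library does for you. As a sketch in Python (function names are ours): a Python string holds scalar values, so decoding from one format and encoding to the other passes through exactly that intermediate form.

```python
def utf8_to_utf16be(data: bytes) -> bytes:
    """UTF-8 -> scalar values (a Python str) -> big endian UTF-16."""
    return data.decode("utf-8").encode("utf-16-be")

def utf16be_to_utf8(data: bytes) -> bytes:
    """Big endian UTF-16 -> scalar values (a Python str) -> UTF-8."""
    return data.decode("utf-16-be").encode("utf-8")

sample = bytes.fromhex("41 E69CAC F09D849E")   # three characters in UTF-8
print(utf8_to_utf16be(sample).hex(" ").upper())
```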
Endian-ness
* One issue we didn't raise that needs to be raised in certain circumstances: UTF-16 and UTF-32 each have variants depending on the order the bytes are physically stored
  - In mainframes (and many other systems) these characters are stored in Big Endian (BE) format: most significant byte first
    - Officially called UTF-16BE and UTF-32BE
  - In other systems, characters are stored in Little Endian (LE) format: least significant byte first
    - Officially designated UTF-16LE and UTF-32LE
* When passing UTF-16 and UTF-32 strings, one has to specify the "endian-ness" of the strings, in headers or by inserting a hex value called a Byte Order Mark (BOM) at the front of the data
  - In UTF-16, an initial byte sequence of x'FEFF' indicates big endian encoding, while x'FFFE' indicates little endian encoding
  - In UTF-32, an initial byte sequence of x'0000 FEFF' indicates big endian encoding, while x'FFFE 0000' indicates little endian encoding
  - In both cases, any BOM is not considered part of the data, and the absence of any BOM value implies big endian encoding
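A BOM check along these lines might look as follows in Python. This is a sketch: the function name `detect_bom` is hypothetical, while the four BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 is tested first: the UTF-32LE BOM (FF FE 00 00) begins with
# the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = (
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
)

def detect_bom(data: bytes):
    """Return (format name or None, data with any BOM removed)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data   # no BOM: the caller falls back to big endian

print(detect_bom(bytes.fromhex("FEFF0041")))
print(detect_bom(bytes.fromhex("FFFE4100")))
```

Note one ambiguity a sniffing approach cannot resolve by itself: little endian UTF-16 data whose first character is x'0000' starts with the same four bytes as the UTF-32LE BOM, which is one more reason to prefer an explicit declaration of the format.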
Unicode - Conclusion
* Unicode is here, is well-defined, and is gaining wide acceptance on a variety of hardware and software platforms
* Although new characters and additional refinements continue to be made, the Unicode Consortium strives to make each new version backward compatible
  - In other words, current algorithms and mappings will carry forward
* There is plenty of work to do for those interested in exploring the internationalization of the Web and of computer work in general
  - The ultimate goal is effective communication between men and women of all languages and cultures on the planet
Section Preview
* Appendix A
  - UTF-32 character allocation ranges
  - Tools for working with Unicode
UTF-32 Character Allocation Ranges
* Every Unicode character is assigned a scalar value: an integer
  - This is a 21-bit number in the range:

      binary    0 0000 0000 0000 0000 0000 - 1 0000 1111 1111 1111 1111
      hex       00 00 00 - 10 FF FF
      decimal   0 - 1,114,111
UTF-32 Assignments, continued
* UTF-32 is really just the 21-bit numbers, right justified, padded on the left with 11 binary zeros (so only six relevant hex digits)
* General allocation is this (not all gaps and details are shown):

  00 00 00 - 00 00 7F   7-bit ASCII
  00 00 80 - 00 00 FF   Controls and Latin-1 Supplement
  00 01 00 - 00 01 7F   Latin Extended-A
  00 01 80 - 00 02 4F   Latin Extended-B
  00 02 50 - 00 02 AF   IPA extensions (some Latin and Greek)
  00 02 B0 - 00 02 FF   Spacing modifier letters
  00 03 00 - 00 03 6F   Combining diacritical marks
  00 03 70 - 00 03 FF   Greek and Coptic
  00 04 00 - 00 04 FF   Cyrillic
  00 05 00 - 00 05 2F   Cyrillic supplementary
  00 05 30 - 00 05 8F   Armenian
  00 05 90 - 00 05 FF   Hebrew
  00 06 00 - 00 06 FF   Arabic
  00 07 00 - 00 07 4F   Syriac
  00 07 80 - 00 07 BF   Thaana
                        (note gap here)
  00 09 00 - 00 09 7F   Devanagari
  00 09 80 - 00 09 FF   Bengali
  00 0A 00 - 00 0A 7F   Gurmukhi
  00 0A 80 - 00 0A FF   Gujarati
  00 0B 00 - 00 0B 7F   Oriya
  00 0B 80 - 00 0B FF   Tamil
  00 0C 00 - 00 0C 7F   Telugu
  00 0C 80 - 00 0C FF   Kannada
  00 0D 00 - 00 0D 7F   Malayalam
  00 0D 80 - 00 0D FF   Sinhala
  00 0E 00 - 00 0E 7F   Thai
  00 0E 80 - 00 0E FF   Lao
  00 0F 00 - 00 0F FF   Tibetan
  00 10 00 - 00 10 9F   Myanmar
  00 10 A0 - 00 10 FF   Georgian
  00 11 00 - 00 11 FF   Hangul Jamo
* UTF-32 scalar values allocation, continued:

  00 12 00 - 00 13 7F   Ethiopic
  00 13 A0 - 00 13 FF   Cherokee
  00 14 00 - 00 16 7F   Unified Canadian Aboriginal syllabics
  00 16 80 - 00 16 9F   Ogham
  00 16 A0 - 00 16 FF   Runic
  00 17 00 - 00 17 1F   Tagalog
  00 17 20 - 00 17 3F   Hanunoo
  00 17 40 - 00 17 5F   Buhid
  00 17 60 - 00 17 7F   Tagbanwa
  00 17 80 - 00 17 FF   Khmer
  00 18 00 - 00 18 AF   Mongolian
                        (note gap here)
  00 19 00 - 00 19 4F   Limbu
  00 19 50 - 00 19 7F   Tai Le
                        (note gap here)
  00 19 E0 - 00 19 FF   Khmer symbols
                        (note gap here)
  00 1D 00 - 00 1D 7F   Phonetic extensions
                        (note gap here)
  00 1E 00 - 00 1E FF   Latin extended additional
  00 1F 00 - 00 1F FF   Greek extended
  00 20 00 - 00 20 6F   General punctuation
  00 20 70 - 00 20 9F   Superscripts and subscripts
  00 20 A0 - 00 20 CF   Currency symbols
  00 20 D0 - 00 20 FF   Combining diacritical marks for symbols
  00 21 00 - 00 21 4F   Letterlike symbols
  00 21 50 - 00 21 8F   Number forms
  00 21 90 - 00 21 FF   Arrows
  00 22 00 - 00 22 FF   Mathematical operators
  00 23 00 - 00 23 FF   Miscellaneous technical
  00 24 00 - 00 24 3F   Control pictures
  00 24 40 - 00 24 5F   Optical character recognition
  00 24 60 - 00 24 FF   Enclosed alphanumerics
  00 25 00 - 00 25 7F   Box drawing
  00 25 80 - 00 25 9F   Block elements
  00 25 A0 - 00 25 FF   Geometric shapes
  00 26 00 - 00 26 FF   Miscellaneous symbols

  00 27 00 - 00 27 BF   Dingbats
  00 27 C0 - 00 27 EF   Miscellaneous mathematical symbols-A
  00 27 F0 - 00 27 FF   Supplemental arrows-A
  00 28 00 - 00 28 FF   Braille patterns
  00 29 00 - 00 29 7F   Supplemental arrows-B
  00 29 80 - 00 29 FF   Miscellaneous mathematical symbols-B
  00 2A 00 - 00 2A FF   Supplemental mathematical operators
  00 2B 00 - 00 2B FF   Miscellaneous symbols and arrows
  00 2C 00 - 00 2E 7F   unassigned
  00 2E 80 - 00 2E FF   CJK radicals supplement
  00 2F 00 - 00 2F DF   Kangxi radicals
                        (note gap here)
  00 2F F0 - 00 2F FF   Ideographic description characters
  00 30 00 - 00 30 3F   CJK symbols and punctuation
  00 30 40 - 00 30 9F   Hiragana
  00 30 A0 - 00 30 FF   Katakana
  00 31 00 - 00 31 2F   Bopomofo
  00 31 30 - 00 31 8F   Hangul compatibility Jamo
  00 31 90 - 00 31 9F   Kanbun
  00 31 A0 - 00 31 BF   Bopomofo extended
                        (note gap here)
  00 31 F0 - 00 31 FF   Katakana phonetic extensions
  00 32 00 - 00 32 FF   Enclosed CJK letters and months
  00 33 00 - 00 33 FF   CJK compatibility
  00 34 00 - 00 4D BF   CJK unified ideographs extension A
  00 4D C0 - 00 4D FF   Yijing Hexagram symbols
  00 4E 00 - 00 9F AF   CJK unified ideographs
  00 A0 00 - 00 A4 8F   Yi syllables
  00 A4 90 - 00 A4 CF   Yi radicals
                        (note gap here)
  00 AC 00 - 00 D7 AF   Hangul syllables
                        (note gap here)
  00 D8 00 - 00 DB FF   high surrogates
  00 DC 00 - 00 DF FF   low surrogates
                        (note gap here)
  00 F9 00 - 00 FA FF   CJK compatibility ideographs
  00 FB 00 - 00 FB 4F   Alphabetic presentation forms
  00 FB 50 - 00 FD FF   Arabic presentation forms-A

  00 FE 00 - 00 FE 0F   Variation selectors
  00 FE 20 - 00 FE 2F   Combining half marks
  00 FE 30 - 00 FE 4F   CJK compatibility forms
  00 FE 50 - 00 FE 6F   Small form variants
  00 FE 70 - 00 FE FF   Arabic presentation forms-B
  00 FF 00 - 00 FF EF   Halfwidth and fullwidth forms
  00 FF F0 - 00 FF FF   Specials
  01 00 00 - 01 00 7F   Linear B syllabary
  01 00 80 - 01 00 FF   Linear B ideograms
  01 01 00 - 01 01 3F   Aegean numbers
                        (note gap here)
  01 03 00 - 01 03 2F   Old Italic
  01 03 30 - 01 03 4F   Gothic
                        (note gap here)
  01 03 80 - 01 03 9F   Ugaritic
                        (note gap here)
  01 04 00 - 01 04 4F   Deseret
  01 04 50 - 01 04 7F   Shavian
  01 04 80 - 01 04 AF   Osmanya
                        (note gap here)
  01 08 00 - 01 08 3F   Cypriot syllabary
                        (note gap here)
  01 D0 00 - 01 D0 FF   Byzantine musical symbols
  01 D1 00 - 01 D1 FF   Musical symbols
                        (note gap here)
  01 D3 00 - 01 D3 5F   Tai Xuan Jing symbols
                        (note gap here)
  01 D4 00 - 01 D7 FF   Mathematical alphanumeric symbols
  01 D8 00 - 01 FF FF   unassigned
  02 00 00 - 02 A6 DF   CJK unified ideographs extension B
  02 A6 E0 - 02 F7 FF   unassigned
  02 F8 00 - 02 FA 1F   CJK compatibility ideographs supplement
  02 FA 20 - 0D FF FF   unassigned
  0E 00 00 - 0E 00 7F   Tags
                        (note gap here)
  0E 01 00 - 0E 01 EF   Variation selectors supplement
  0E 01 F0 - 0E FF FF   unassigned
  0F 00 00 - 0F FF FD   Private use area
  0F FF FE - 0F FF FF   non-characters
  10 00 00 - 10 FF FD   Private use area
  10 FF FE - 10 FF FF   non-characters
Tools for working with Unicode
* There are a variety of resources available on the web; a few:
  - http://sas-crash.homelinux.net/unicode.php - a web page that lets you enter one of: a decimal code point (e.g.: &#26412) or a hexadecimal code point (e.g.: &#x672c) or an actual Unicode character (e.g.: 本) and get back the other two
  - David Stephens publishes a quarterly newsletter, Longpela Expertise; the edition for February 2012 has an excellent article on Unicode in z/OS, including samples of utilities and a pointer to a tool they have for conversions:
    - http://www.longpelaexpertise.com.au/ezine/LostinTranslation3.php
  - http://www.alanwood.net/unicode/index.html is a website that is rich in information about available support for Unicode
  - http://www.utf8-chartable.de/unicode-utf8-table.pl contains a series of pages with the UTF-16 and UTF-8 information in a helpful format
Section Preview
* Appendix B
  - UTF-32 <-> UTF-16 <-> UTF-8 sample mappings
Mappings
UTF-32         UTF-16         UTF-8
00 00 00 00    00 00          00
00 00 00 01    00 01          01
. . .
00 00 00 7E    00 7E          7E
00 00 00 7F    00 7F          7F
00 00 00 80    00 80          C2 80
00 00 00 81    00 81          C2 81
. . .
00 00 00 BE    00 BE          C2 BE
00 00 00 BF    00 BF          C2 BF
00 00 00 C0    00 C0          C3 80
00 00 00 C1    00 C1          C3 81
. . .
00 00 00 FE    00 FE          C3 BE
00 00 00 FF    00 FF          C3 BF
00 00 01 00    01 00          C4 80
00 00 01 01    01 01          C4 81
. . .
00 00 01 3E    01 3E          C4 BE
00 00 01 3F    01 3F          C4 BF
00 00 01 40    01 40          C5 80
00 00 01 41    01 41          C5 81
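Rows like the ones above can be regenerated for any code point, which is a handy way to check a table entry. The following is a small Python sketch under stated assumptions: the function name mapping_row is hypothetical, and big-endian byte order is assumed because that is how the sample values are written here. It relies only on Python's built-in codecs.

```python
def mapping_row(cp):
    """Return the UTF-32, UTF-16 and UTF-8 bytes of code point cp as hex strings."""
    ch = chr(cp)
    return tuple(
        ch.encode(encoding).hex(" ").upper()   # e.g. b'\xc2\x80' -> 'C2 80'
        for encoding in ("utf-32-be", "utf-16-be", "utf-8")
    )


for cp in (0x7E, 0x7F, 0x80, 0x81):
    print("    ".join(mapping_row(cp)))
# 00 00 00 7E    00 7E    7E
# 00 00 00 7F    00 7F    7F
# 00 00 00 80    00 80    C2 80
# 00 00 00 81    00 81    C2 81
```

Code points in the surrogate range (x'D800' through x'DFFF') make the encode call raise UnicodeEncodeError, since a surrogate by itself is not a valid character.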
Mappings, 2
UTF-32         UTF-16         UTF-8
00 00 01 7E    01 7E          C5 BE
00 00 01 7F    01 7F          C5 BF
00 00 01 80    01 80          C6 80
00 00 01 81    01 81          C6 81
. . .
00 00 01 BE    01 BE          C6 BE
00 00 01 BF    01 BF          C6 BF
00 00 01 C0    01 C0          C7 80
00 00 01 C1    01 C1          C7 81
. . .
00 00 01 FE    01 FE          C7 BE
00 00 01 FF    01 FF          C7 BF
00 00 02 00    02 00          C8 80
00 00 02 01    02 01          C8 81
. . .
00 00 02 3E    02 3E          C8 BE
00 00 02 3F    02 3F          C8 BF
00 00 02 40    02 40          C9 80
00 00 02 41    02 41          C9 81
. . .
00 00 02 7E    02 7E          C9 BE
00 00 02 7F    02 7F          C9 BF
00 00 02 80    02 80          CA 80
00 00 02 81    02 81          CA 81
Mappings, 3
UTF-32         UTF-16         UTF-8
00 00 02 BE    02 BE          CA BE
00 00 02 BF    02 BF          CA BF
00 00 02 C0    02 C0          CB 80
00 00 02 C1    02 C1          CB 81
. . .
00 00 02 FE    02 FE          CB BE
00 00 02 FF    02 FF          CB BF
00 00 03 00    03 00          CC 80
00 00 03 01    03 01          CC 81
. . .
00 00 03 3E    03 3E          CC BE
00 00 03 3F    03 3F          CC BF
00 00 03 40    03 40          CD 80
00 00 03 41    03 41          CD 81
. . .
00 00 03 7E    03 7E          CD BE
00 00 03 7F    03 7F          CD BF
00 00 03 80    03 80          CE 80
00 00 03 81    03 81          CE 81
Mappings, 4
UTF-32         UTF-16         UTF-8
00 00 03 BE    03 BE          CE BE
00 00 03 BF    03 BF          CE BF
00 00 03 C0    03 C0          CF 80
00 00 03 C1    03 C1          CF 81
. . .
00 00 03 FE    03 FE          CF BE
00 00 03 FF    03 FF          CF BF
00 00 04 00    04 00          D0 80
00 00 04 01    04 01          D0 81
. . .
00 00 04 3E    04 3E          D0 BE
00 00 04 3F    04 3F          D0 BF
00 00 04 40    04 40          D1 80
00 00 04 41    04 41          D1 81
. . .
00 00 04 7E    04 7E          D1 BE
00 00 04 7F    04 7F          D1 BF
00 00 04 80    04 80          D2 80
00 00 04 81    04 81          D2 81
Mappings, 5
UTF-32         UTF-16         UTF-8
00 00 04 BE    04 BE          D2 BE
00 00 04 BF    04 BF          D2 BF
00 00 04 C0    04 C0          D3 80
00 00 04 C1    04 C1          D3 81
. . .
00 00 04 FE    04 FE          D3 BE
00 00 04 FF    04 FF          D3 BF
00 00 05 00    05 00          D4 80
00 00 05 01    05 01          D4 81
. . .
00 00 05 3E    05 3E          D4 BE
00 00 05 3F    05 3F          D4 BF
00 00 05 40    05 40          D5 80
00 00 05 41    05 41          D5 81
. . .
00 00 05 7E    05 7E          D5 BE
00 00 05 7F    05 7F          D5 BF
00 00 05 80    05 80          D6 80
00 00 05 81    05 81          D6 81
Mappings, 6
UTF-32         UTF-16         UTF-8
00 00 05 BE    05 BE          D6 BE
00 00 05 BF    05 BF          D6 BF
00 00 05 C0    05 C0          D7 80
00 00 05 C1    05 C1          D7 81
. . .
00 00 05 FE    05 FE          D7 BE
00 00 05 FF    05 FF          D7 BF
00 00 06 00    06 00          D8 80
00 00 06 01    06 01          D8 81
. . .
00 00 06 3E    06 3E          D8 BE
00 00 06 3F    06 3F          D8 BF
00 00 06 40    06 40          D9 80
00 00 06 41    06 41          D9 81
. . .
00 00 06 7E    06 7E          D9 BE
00 00 06 7F    06 7F          D9 BF
00 00 06 80    06 80          DA 80
00 00 06 81    06 81          DA 81
Mappings, 7
UTF-32         UTF-16         UTF-8
00 00 06 BE    06 BE          DA BE
00 00 06 BF    06 BF          DA BF
00 00 06 C0    06 C0          DB 80
00 00 06 C1    06 C1          DB 81
. . .
00 00 06 FE    06 FE          DB BE
00 00 06 FF    06 FF          DB BF
00 00 07 00    07 00          DC 80
00 00 07 01    07 01          DC 81
. . .
00 00 07 3E    07 3E          DC BE
00 00 07 3F    07 3F          DC BF
00 00 07 40    07 40          DD 80
00 00 07 41    07 41          DD 81
. . .
00 00 07 7E    07 7E          DD BE
00 00 07 7F    07 7F          DD BF
00 00 07 80    07 80          DE 80
00 00 07 81    07 81          DE 81
Mappings, 8
UTF-32         UTF-16         UTF-8
00 00 07 BE    07 BE          DE BE
00 00 07 BF    07 BF          DE BF
00 00 07 C0    07 C0          DF 80
00 00 07 C1    07 C1          DF 81
. . .
00 00 07 FE    07 FE          DF BE
00 00 07 FF    07 FF          DF BF
00 00 08 00    08 00          E0 A0 80
00 00 08 01    08 01          E0 A0 81
. . .
00 00 08 3E    08 3E          E0 A0 BE
00 00 08 3F    08 3F          E0 A0 BF
00 00 08 40    08 40          E0 A1 80
00 00 08 41    08 41          E0 A1 81
. . .
00 00 08 7E    08 7E          E0 A1 BE
00 00 08 7F    08 7F          E0 A1 BF
00 00 08 80    08 80          E0 A2 80
00 00 08 81    08 81          E0 A2 81
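The rows above cross x'0800', the point where UTF-8 grows from two bytes to three. A hand-written encoder makes the bit arithmetic behind those bytes visible. This is a minimal Python sketch for illustration only: the function name utf8_encode is hypothetical, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(cp):
    """Encode one Unicode scalar value as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x07FF, 0x0800):
    print(f"{cp:04X} -> {utf8_encode(cp).hex(' ').upper()}")
# 07FF -> DF BF
# 0800 -> E0 A0 80
```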
Mappings, 9
UTF-32         UTF-16         UTF-8
00 00 08 BE    08 BE          E0 A2 BE
00 00 08 BF    08 BF          E0 A2 BF
00 00 08 C0    08 C0          E0 A3 80
00 00 08 C1    08 C1          E0 A3 81
. . .
00 00 08 FE    08 FE          E0 A3 BE
00 00 08 FF    08 FF          E0 A3 BF
00 00 09 00    09 00          E0 A4 80
00 00 09 01    09 01          E0 A4 81
. . .
00 00 09 3E    09 3E          E0 A4 BE
00 00 09 3F    09 3F          E0 A4 BF
00 00 09 40    09 40          E0 A5 80
00 00 09 41    09 41          E0 A5 81
-- a big jump here, since the pattern is established --
00 00 0F FE    0F FE          E0 BF BE
00 00 0F FF    0F FF          E0 BF BF
00 00 10 00    10 00          E1 80 80
00 00 10 01    10 01          E1 80 81
Mappings, 10
UTF-32         UTF-16         UTF-8
-- another big jump here, since the pattern is established --
00 00 1B FE    1B FE          E1 AF BE
00 00 1B FF    1B FF          E1 AF BF
00 00 1C 00    1C 00          E1 B0 80
00 00 1C 01    1C 01          E1 B0 81
-- another big jump here, since the pattern is established --
00 00 1F FE    1F FE          E1 BF BE
00 00 1F FF    1F FF          E1 BF BF
00 00 20 00    20 00          E2 80 80
00 00 20 01    20 01          E2 80 81
Mappings, 11
UTF-32         UTF-16         UTF-8
-- another big jump here, since the pattern is established --
00 00 4F FE    4F FE          E4 BF BE
00 00 4F FF    4F FF          E4 BF BF
00 00 50 00    50 00          E5 80 80
00 00 50 01    50 01          E5 80 81
-- another big jump here, since the pattern is established --
00 00 D7 FE    D7 FE          ED 9F BE
00 00 D7 FF    D7 FF          ED 9F BF
-- at this point, we have reached the place where surrogate characters occur; a surrogate character by itself is not a valid Unicode character; we pick up again, after the surrogate points:
00 00 F9 00    F9 00          EF A4 80
00 00 F9 01    F9 01          EF A4 81
. . .
00 00 FF FE    FF FE          EF BF BE
00 00 FF FF    FF FF          EF BF BF
-- the next code point, x'010000', and all code points after this, will require pairs of surrogate characters for the UTF-16 values ...
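For code points from x'010000' upward, the UTF-16 surrogate pair is obtained by subtracting x'10000' and splitting the remaining 20 bits into two 10-bit halves. The following is a minimal Python sketch of that arithmetic; the function names to_surrogates and from_surrogates are hypothetical, chosen only for this illustration.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high surrogate, low surrogate)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"no surrogate pair needed or possible for {cp:#x}")
    offset = cp - 0x10000                  # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it represents."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10000)
print(f"{high:04X} {low:04X}")                 # D800 DC00
print(f"{from_surrogates(0xDB40, 0xDC01):06X}")  # 0E0001
```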
Mappings, 12
UTF-32         UTF-16         UTF-8
00 01 00 00    D8 00 DC 00    F0 90 80 80
00 01 00 01    D8 00 DC 01    F0 90 80 81
. . .
00 01 01 FE    D8 00 DD FE    F0 90 87 BE
00 01 01 FF    D8 00 DD FF    F0 90 87 BF
00 01 02 00    D8 00 DE 00    F0 90 88 80
00 01 02 01    D8 00 DE 01    F0 90 88 81
. . .
00 01 02 FE    D8 00 DE FE    F0 90 8B BE
00 01 02 FF    D8 00 DE FF    F0 90 8B BF
-- another big jump here, since the pattern is established --
00 01 04 00    D8 01 DC 00    F0 90 90 80
00 01 04 01    D8 01 DC 01    F0 90 90 81
-- another big jump here, since the pattern is established --
00 02 00 00    D8 40 DC 00    F0 A0 80 80
00 02 00 01    D8 40 DC 01    F0 A0 80 81
Mappings, 13
UTF-32         UTF-16         UTF-8
-- a final big jump here, up to the end of assigned characters --
00 0E 00 00    DB 40 DC 00    F3 A0 80 80
00 0E 00 01    DB 40 DC 01    F3 A0 80 81
. . .
00 0E 01 EE    DB 40 DD EE    F3 A0 87 AE
00 0E 01 EF    DB 40 DD EF    F3 A0 87 AF