INTERNET-DRAFT Larry Masinter Xerox Corporation Martin Duerst draft-masinter-url-i18n-05.txt W3C/Keio University Expires End of September 2000 March 2000 Internationalized Uniform Resource Identifiers (IURI) Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet- Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This document is not a product of any working group, but may be discussed on the mailing list or . For more information on the topic of this internet-draft, please also see [W3C IURI]. Abstract URIs [RFC 2396] are defined as sequences of characters chosen from a limited subset of the repertoire of ASCII characters both for transmission in network protocols and representation in spoken and written human communication. This document defines IURIs (Internationalized URIs) as a sequence of characters from the repertoire of the UCS (Universal Character Set). A mapping of IURIs to URIs and guidelines for the use and deployment of IURIs in various elements of software that deal with URIs are given. 0. Change History From -04 to -05 - Added www-international@w3.org for discussion - Updated reference to Unicode TR #15. - Added reference [IETFNorm]. - Tried to make sure that everything also applies to URI References. Added text to Intro, and added Section 2.5. - Word polishing. 
- Made sure it's clear that the first limitation in 1.2 is of general nature. - Added reference to [RFC 2732]. - Updated reference to [RFC 2640]. - Added an example for not using characters outside ASCII for syntactical purposes. - Rewrote 2.2 Mapping of IURIs to URIs to make it a more direct mapping from characters to characters, and to align it with [CharMod]. - Added a note that Normalization may not be actually necessary in many cases. - Reworded text about conversion back from IURIs to URIs, to be much more careful. - Tried to explain 'component' in 2.4. - Changed MUST to SHOULD for alternative display of IURIs in 3.2. - Updated Acknowledgements. From -03 to -04 Changed copyright statement, added/updated some references. From -02 to -03 The main change from draft-masinter-url-i18n-02.txt is a rewrite to introduce IURIs as sequences of (abstract) characters. This mainly affected the overall structure and wording, but not the actual details. 1. Introduction 1.1 Overview URIs [RFC 2396] are defined as sequences of characters chosen from a limited subset of the repertoire of ASCII characters. This document defines IURIs (Internationalized URIs) as a sequence of characters from a much wider repertoire. The base for the repertoire is the UCS (Universal Character Set, [ISO 10646]), but as in the case of URIs and ASCII, certain restrictions apply. The characters in URIs are frequently used for representing words of natural languages. However, due to the limited character repertoire of URIs, this favors some languages over others; most languages of the world are not merely written with the letters A-Z. Using words from natural languages in identifiers has various advantages. This should be quite obvious from the fact that such identifiers are extremely widespread. 
Such identifiers are:

- easier to memorize
- easier to interpret
- easier to transcribe
- easier to create
- easier to guess
- easier to identify with

Also, for native speakers, all these operations are much easier to do in the script they are used to; handling Latin letters is as difficult for many people around the world as handling the letters of another script is for people used to the Latin alphabet, and transcriptions to Latin letters usually introduce additional ambiguities.

In addition, URIs are not primary identifiers, but define a mechanism to integrate a large number of different kinds of identifiers and mechanisms into a uniform representation. Using characters beyond the ASCII repertoire is a widespread and increasing practice for some kinds of primary identifiers, and some conventions for converting these to URIs in a well-defined way are necessary.

[RFC 2396] also defines URI References, which are URIs followed by a '#' and a fragment identifier. This document applies to all kinds of URI variants such as relative URIs and URI References.

1.2 Limitations

The use of words from natural languages in identifiers can also bring with it some problems, which are briefly discussed here for completeness.

A first problem, of a general nature and also present for ASCII-only identifiers, is that natural language identifiers seemingly create an association between the meaning(s) of a word and the contents or function of a resource. This tends to exclude other meanings that may be associated with the resource with equal or better reason, and makes it impossible to associate the same meaning with another resource that would also deserve this association. It may also lead to misunderstandings about the content of the resource because most words have more than one meaning associated with them.
In addition, because the content of resources and the meaning of words change over time, it is difficult to maintain the association over time, which means that either the identifier becomes meaningless or misleading, or it has to be changed, thus breaking existing references. The advantages and disadvantages of creating identifiers from natural languages therefore have to be carefully considered.

The use of characters outside the strictly limited repertoire of a subset of ASCII introduces additional limitations, and gives rise to additional considerations, which are discussed wherever appropriate throughout this document. As a base for this discussion, it should be noted that the infrastructure for the appropriate handling of characters from local scripts is widely deployed in local versions of software, and that software that can handle a wide variety of scripts and languages at the same time is becoming more common. It is therefore not appropriate to force users in all language communities to be restricted to a single alphabet. In particular, it is not appropriate to impose the difficulties of using an unfamiliar alphabet in cases where for example 99.9% or more of the potential users of an identifier are more comfortable when using that identifier in the appropriate language and script.

However, the decision of where and how the use of identifiers with characters other than A-Z is appropriate remains with the creators and users of these identifiers. This document defines IURIs and their mapping to URIs in order to remove technical restrictions on user-oriented decisions, and in order to extend the benefits of using native languages and scripts without excluding those that do not know these languages or scripts or do not have the appropriate software.

1.3 Notation

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

2. Syntax

This section defines the syntax of Internationalized Uniform Resource Identifiers (IURI), and their mapping to URIs.

In accordance with [RFC 2396, Section 1.5], an IURI is defined as a sequence of characters, which is not always represented as a sequence of octets. This ensures that IURIs can not only be transmitted electronically, but can also be written on paper or read over the radio.

Defining IURIs as characters also takes into account the varieties of encodings used on different computer systems worldwide. While this is of considerably higher importance for IURIs than for URIs, please note that this is also relevant for URIs. URIs transported e.g. in HTML documents that are encoded in EBCDIC or UTF-16 are not represented using the same octet sequences as URIs in an ASCII-encoded HTML document.

Note: Please note that the encoding and the octets discussed here are not the same as those discussed in [RFC 2396, Section 2.1]. Here, we are discussing the encoding of URIs/IURIs, defined in terms of characters, for digital transfer and storage. There, what is defined is the encoding, into URI characters, of the octets used in the protocols for which URIs are defined.

2.1 IURI Syntax

IURIs are defined by extending the syntax of URIs defined in [RFC 2396], as extended by [RFC 2732] (which adds '[' and ']' to the reserved category) as follows:

The character category "unreserved" [RFC 2396, Section 2.3] is extended by adding all the characters of the UCS (Universal Character Set, [ISO 10646/Unicode]) beyond U+0080, subject to the limitations given in Section 2.3.

The syntax and use of URI components and reserved characters is exactly the same as that in [RFC 2396, Section 3]. This means that all the operations defined in [RFC 2396], such as the resolution of relative URIs, can be applied to IURIs by IURI-processing software exactly in the same way as this is done by URI-processing software.
Characters outside the ASCII range cannot and therefore MUST NOT be used for syntactical purposes, i.e. to delimit components in newly defined URL schemes. As an example, it would not be possible to use U+00A2, CENT SIGN, as a delimiter in an URL scheme.

2.2 Mapping of IURIs to URIs

This section defines the mapping from IURIs to URIs:

1) Represent the IURI characters as a sequence of ISO 10646 characters.

2) Normalize the character sequence according to Normalization Form C as defined in [UNI15] and [IETFNorm]. Please refer to further discussion in Section 2.3. Further advice can also be found in [CharMod].

3) For each character that is syntactically not allowed by the generic URI syntax (all non-ASCII characters, plus the excluded characters in [RFC 2396, Section 2.4.3] except "#" and "%" (and "[" and "]")), apply the following:

   3.1) Convert the character to a sequence of one or more bytes using UTF-8 [RFC 2279].

   3.2) Escape each of the bytes in the sequence with the URI escaping mechanism [RFC 2396, Section 2.4.1] (i.e. convert each byte to %HH, where HH is the hexadecimal notation of the byte value).

   3.3) Replace the original non-allowed character by the resulting character sequence.

Note: Step 2) in many cases is not actually necessary, because of the way Normalization Form C is defined. In many cases, Step 2) can be avoided by making sure that the transcoding used (from e.g. Latin-1 to ISO 10646, or from characters on paper to UTF-8) produces normalized results.

In step 3), octets allowed in URIs MUST NOT be escaped further, because they are already in their correct escaping stage in IURIs.

The above mapping produces an URI fully conforming to [RFC 2396] out of each IURI. In addition, due to the properties of the UTF-8 character encoding, it results in the identity transformation for URIs. Every URI is therefore by definition an IURI.
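As an illustration only, the mapping of steps 1) to 3) can be sketched in Python as follows. The function name iuri_to_uri and the constant _EXCLUDED_ASCII are hypothetical names chosen for this sketch; the set of excluded ASCII characters is an assumed reading of [RFC 2396, Section 2.4.3] (control characters, space, "delims" and "unwise", minus "#", "%", "[" and "]"). The sketch does not check the limitations of Section 2.3.

```python
import unicodedata

# Assumed reading of the "excluded" ASCII characters of RFC 2396,
# Section 2.4.3 (control, space, delims, unwise), minus "#", "%",
# "[" and "]", which step 3) leaves untouched.
_EXCLUDED_ASCII = (
    set(' <>"{}|\\^`') | {chr(c) for c in range(0x20)} | {"\x7f"}
)


def iuri_to_uri(iuri: str) -> str:
    """Map a native IURI to an escaped IURI (steps 1 to 3 of Section 2.2)."""
    # Step 1: a Python str is already a sequence of UCS characters.
    # Step 2: normalize to Normalization Form C.
    normalized = unicodedata.normalize("NFC", iuri)
    out = []
    for ch in normalized:
        if ord(ch) > 0x7F or ch in _EXCLUDED_ASCII:
            # Steps 3.1 to 3.3: UTF-8 encode, then escape each byte as %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            # Characters already allowed in URIs (including existing %HH
            # escapes) are not escaped further.
            out.append(ch)
    return "".join(out)
```

Applied to the homepage IURI of the second author written with u-umlaut, this sketch yields the escaped form given in the authors' addresses, and it leaves a string that is already an URI unchanged.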
URIs obtained by converting IURIs as above are also called "escaped IURIs" in circumstances where it is necessary to distinguish them from URIs in general. In contrast to this, IURIs as defined in Section 2.1 are called "native IURIs" when an explicit distinction is necessary.

Escaped IURIs may under some circumstances be converted back to their native form, but great care should be applied before actually doing so. Some cases are trivial because the URI and the IURI are identical. In many cases, it may not be clear whether an URI is the result of escaping an IURI or not. Converting back may end up with characters that have nothing to do with the URI in question. Due to the regularity of the UTF-8 character encoding, the chances that an URI that looks like an escaped IURI actually is an escaped IURI are rather high, but never 100%. The actual conversion from URIs back to IURIs would be done by inverting the steps 1) to 3) above.

Please note that escaped IURIs are also consistent with the URN syntax [RFC 2141], and with recent URL scheme definitions [RFC 2192], [RFC 2384], because these already are based on UTF-8 to encode characters outside the ASCII repertoire. As a consequence, in contexts where IURIs are acceptable and will be transformed to URIs when necessary, URNs and IMAP and POP URIs can be expressed in the form of IURIs.

In some cases, some components of an URI may be convertible to native IURI form, whereas other components contain %HH escapings that cannot be converted back, or that are not converted back. In these cases, we use the term "partial IURI". In partial IURIs, the conversion between native and escaped representation and back can be applied whenever and wherever appropriate and possible, but those parts that cannot be converted MUST be left intact.

2.3 IURI Syntax Limitations

This section gives the limitations on characters and character sequences usable for IURIs.
These limitations are of varying nature, with respect to their strictness (expressed with the usual vocabulary from [RFC 2119]) and with respect to their enforceability. In particular, some limitations are very strict, but are not easily or only partially enforceable due to the fact that the repertoire of the UCS is still being expanded. In addition, the list of limitations contains limitations that can be expressed in terms of codepoints, but not in terms of characters. - The repertoire of characters allowed in each URI component is limited by the definition of this component. For example, the definition of host names currently does not allow the use of e.g. "_", nor of any character outside the ASCII repertoire. This specification does not relax any such limitations, it only provides the base to relaxing this limitation in the context of URIs when and where this is thought to be appropriate. In particular, please note that the scheme component, which is fully defined in [RFC 2396], is not extended by this specification, i.e. scheme names with characters outside the ASCII repertoire are not allowed. - Characters similar to those in the categories "space", "delims", and "unwise" in [RFC 2396, Section 2.4.3] MUST NOT be used. [exact definition; do we need an exception for Persian here? Do we want to repeat the categories from [RFC 2396], Section 2.4.3? do we want to give a list of characters?] - Full-width ASCII equivalents, half-width Katakana,... - "Control characters" MUST NOT be used. This includes symmetric swapping, plane-14 language tag characters,... (for BIDI, see below) - Code points reserved for private use MUST NOT be used. - Code points reserved for surrogates MUST NOT be used. - Where there exist duplicate ways of encoding a certain character as visible to the user, Normalization Form C as defined in [UNI15][IETFNorm] MUST be used. - Other cases, to be studied,... 2.4 Bidirectional IURIs Bidirectional (BIDI) IURIs, i.e. 
IURIs containing characters with an inherent right-to-left writing direction, require additional attention when being converted from a visual representation to a digital representation and back. In digital representations (as well as when read/spelled), the sequence of components and characters is in logical order. This conforms to the specifications for the UCS and allows generic operations, such as the resolution of relative IURIs, to be carried out without special provisions. A visual representation placing the IURI characters strictly from left to right would make some of its components, such as words written in Arabic or Hebrew, unreadable. On the other hand, an uncontrolled reversion of the whole IURI would make components with Latin or other left-to-right words unreadable, and/or would obscure the sequence of the IURI components. In addition, a direct application of the Unicode bidirectionality algorithm [????] would relocate the reserved characters that define the structure of an URI because most of them have neutral directionality. The visual representation of IURIs is therefore defined as follows: - The IURI as a whole is presented from left to right, component by component. Components of an URI are parts of an URI that are delimited by reserved characters. - Within each component, the Unicode bidi algorithm is applied, assuming a left-to-right embedding context. For display, this behavior can be achieved by preceding the IURI with an LRE (left to right embedding) character, following it with a PDF (pop directional formatting) character, and preceding and following each reserved character by an LRM (left to right mark) character. In this form, it can be passed to a display engine supporting the Unicode BIDI algorithm. 2.5 IURI references While this document has been written discussing URIs and their internationalization to IURIs, its application to URI references (URIs followed by a '#' and a fragment identifier) is straightforward. 
The terms IURI reference, escaped IURI reference, native IURI reference, and so on have their obvious meaning.

3. Software requirements

Supporting IURIs in the same places where URIs are currently used requires cooperation from the providers of several different components of the URI infrastructure: software interfaces that handle IURIs, software that allows users to enter IURIs, software that generates IURIs, software that displays IURIs, formats that transport IURIs, and software that interprets IURIs. This section tries to explain the issues arising in each case.

3.1 IURI software interfaces

Software interfaces that handle URIs, such as URI-handling APIs and protocols transferring URIs, SHOULD be upgraded to handle IURIs. In case the current handling is based on ASCII, UTF-8 SHOULD be chosen as the encoding for IURIs, because this is compatible with ASCII, is in accordance with the recommendations of [RFC 2277], makes it easy to convert to escaped IURIs where necessary, and can significantly reduce the space needed for IURIs. In any case, the encoding used MUST NOT be left undefined.

Upgrading to IURIs is important because for certain scripts, for example Thai or Georgian, a character that needs one octet in a native representation expands to nine octets in an escaped IURI, but only three octets in UTF-8. While it can be assumed that there should in general be enough slack in the existing length limits for URIs to accommodate an expansion to three octets, an expansion by a factor of nine is more dangerous.

Software components that transfer from components that allow IURIs to components that can only handle URIs MUST escape IURIs. Software components that transfer in the other direction MAY unescape IURIs. It is preferable to not unescape IURIs when there is a chance that this cannot be done correctly. For example, if it cannot be checked whether the sequence of %HH escapes corresponds to a valid sequence of UTF-8 octets, unescaping should not be done.
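The check described in the last paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a complete unescaper: the name unescape_if_utf8 is hypothetical, only escapes of octets with values of 0x80 and above are considered (so that escaped ASCII characters such as %20 or %2F are never touched), and the limitations of Section 2.3 are not verified on the result.

```python
import re

# A run of consecutive %HH escapes whose octet values are all >= 0x80.
_HIGH_ESCAPE_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def unescape_if_utf8(uri: str) -> str:
    """Unescape only those escape runs that form valid UTF-8 sequences."""

    def convert(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            return octets.decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8 (e.g. produced from a legacy encoding):
            # leave the whole run intact, as for a partial IURI.
            return match.group(0)

    return _HIGH_ESCAPE_RUN.sub(convert, uri)
```

A run such as "%FC", which is what a Latin-1-based user agent might produce for u-umlaut, is not valid UTF-8 and is therefore left in its escaped form.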
3.2 IURI entry

One component of software that deals with IURIs allows users to enter an IURI, e.g. by typing or dictation. For example, a person viewing a visual representation of an IURI (as a sequence of glyphs, in some order, in some visual display) might use a keyboard entry method for keys in that language to create the IURI. Depending on the script and the input method used, this may be a more or less complicated process.

The process of IURI entry MUST assure as far as possible that the limitations defined in Section 2.3 are met. This may be done by choosing appropriate input methods or variants thereof, by appropriately converting the characters being input, by eliminating characters that cannot be converted, and/or by issuing a warning or error message to the user.

An input field primarily or only used for the input of URIs/IURIs SHOULD allow the user to view an IURI in its escaped form. Places where the input of IURIs is frequent SHOULD provide the possibility for viewing an IURI in its escaped form. An IURI input component that interfaces to components that handle URIs, but not IURIs, MUST escape the IURI before passing it to such a component.

The input of IURIs with right-to-left characters requires additional care to keep the visual and the internal representation in sync, and to eliminate control characters and marks used to control the display before passing the IURI over to a resolver. IURI input fields that allow the input of right-to-left characters MUST provide the appropriate functionality.

3.3 URI generation

Systems that are offering resources through the Internet, where those resources have logical names, sometimes offer the ability to generate URIs for the resources they offer. For example, some HTTP servers offer the ability to generate a 'directory listing' for file directories under their purview, and then to respond to the generated URIs with the files.

Many legacy character encodings are in use in various file systems.
Currently deployed systems do not transform the local character representation of the underlying system before generating URIs.

For maximum interoperability, systems that generate resource identifiers should do the appropriate transformations and use escaped IURIs in cases where it cannot be expected that the recipient understands native IURIs. Due to the way most user agents currently work, native IURIs, encoded in UTF-8, may be used if the recipient announces that it can interpret UTF-8.

This recommendation in particular applies to HTTP servers. For FTP servers, similar considerations apply, see in particular [RFC 2640].

3.4 URI selection

In some cases, resource owners and publishers have control over the IURIs used to identify their resources. Such control is mostly exercised by controlling the resource names, such as file names, directly.

In such cases, it is RECOMMENDED to avoid choosing IURIs that are easily confused. For example, for ASCII, the lower-case ell "l" is easily confused with the digit one "1", and the upper-case oh "O" is easily confused with the digit zero "0". Publishers should avoid unintentionally confusing users with "br0ken" or "1ame" identifiers.

Outside of the ASCII range, there are many more opportunities for confusion; a complete set of guidelines is too lengthy to include here. As long as names are limited to characters from a single script, native writers of a given script or language will know best when ambiguities can appear, and how they can be avoided. What may look ambiguous to a stranger may be completely obvious to the average native user.

Please note that the limitations defined in Section 2.3 and the recommendations given here are of a different nature. The limitations defined in Section 2.3 are necessary to avoid duplicate encodings that are artifacts of digital representation and that the user has no way to distinguish visually.
On the other hand, in a given context, an identifier such as "BOX0021" can be completely appropriate, and it is impossible to find an algorithm that distinguishes the appropriate from the confusing identifiers.

Say something about Latin vs. Greek vs. Cyrillic "A"???? Here or in 2.1????

3.5 Display of URIs

Many systems contain software that presents URIs to users as part of the system's user interface (sometimes presenting 'friendly' URIs; do we need a definition for 'friendly' URIs? I don't know what it is.). This section applies to this presentation, as well as to the strategy for printing URIs in magazines, newspapers, or reading them over the radio.

Software that displays identifiers to users should follow a general principle: "Don't display something to a user that the user would not be able to enter." The consequences of this principle require judgement about the availability of software that implements the entry methods described in Section 3.2.

a) In situations where a viewer is not likely to have software that implements non-ASCII character entry (as described in Section 3.2), or where it can be expected that only a limited range of non-ASCII characters can be entered, any part of an IURI containing characters outside the range allowed in [RFC 2396] or any additions SHOULD be escaped before being displayed.

b) In situations where a viewer _is_ likely to have such software, IURIs MAY be displayed directly.

For display of BIDI IURIs, please see Section 2.4.

3.6 Interpretation of URIs

Software that interprets IURIs as the names of local resources SHOULD accept IURIs in multiple forms, and convert and match them with the appropriate local resource names.

First, multiple representations include both IURIs in the native character encoding of the protocol (UTF-8 if not otherwise defined) and escaped IURIs.

Second, it MAY include URIs constructed based on other character encodings than UTF-8.
Such URIs may be produced by user agents that do not conform to this specification and use legacy encodings to convert non-ASCII characters to URIs. Whether this is necessary, and what character encodings to cover, depends on a number of factors, such as the local character encodings and the distribution of various versions of user agents. For example, software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8. [we have to say more clearly that we are speaking about HTTP here; I'm not sure how much this is applicable in general] Third, it MAY include additional mappings to be more user-friendly and robust against transmission errors. These would be similar to how currently some servers treat URIs as case-insensitive, or perform additional matchings to account for spelling errors. For characters beyond the ASCII repertoire, this may e.g. include ignoring the accents on received IURIs or resource names where appropriate. [add warning about dependency of casing and "accents" on language] It may seem to be difficult to unambiguously identify a resource if too many mappings are taken into consideration. This can indeed be the case. However, because escaped and native IURIs can easily be distinguished, and because of the regularity of UTF-8, the potential for collisions is usually lower than it may seem at first sight. 3.7 Transportation of IURIs in document formats Document formats that transport URIs should be upgraded to allow the transport of IURIs. In those cases where the document as a whole has a native character encoding, IURIs should also be encoded in this encoding, and converted accordingly by the parser and interpreter. IURI characters that are not expressible in the native encoding SHOULD be escaped according to Section 2.2, or MAY be escaped in another way if the document format provides a way to do this (e.g. numeric character references in HTML/XML/SGML). 
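The escaping rule of Section 3.7 for characters that are not expressible in the native encoding of a document can be sketched in Python as follows. This is only an illustration under assumptions: the function name escape_unrepresentable and the host example.org are hypothetical, the document encoding is assumed to be known by a name that Python's codec registry accepts, and the alternative of using a format-specific mechanism such as numeric character references is not shown.

```python
def escape_unrepresentable(iuri: str, document_encoding: str) -> str:
    """Escape, per Section 2.2, only the IURI characters that the
    document's native encoding cannot express."""
    out = []
    for ch in iuri:
        try:
            # Expressible in the document encoding: keep the native character.
            ch.encode(document_encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Not expressible: UTF-8 encode and escape each byte as %HH.
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(out)


# In a Latin-1 document, u-umlaut stays native but U+65E5 must be escaped.
mixed = escape_unrepresentable("http://example.org/D\u00fcrst/\u65e5", "iso-8859-1")
```

The result is a partial IURI in the sense of Section 2.2: some characters remain in native form while others are escaped.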
Please note that an interpretation of characters in URIs outside the ASCII repertoire as IURIs, i.e. conforming to this specification, is already defined as error behavior in HTML 4.0 [HTML4] and in XML 1.0 [XML1]. Also, it is under discussion to require this behavior from all W3C formats [CharMod].

4. Upgrading strategy

As this recommendation places further constraints on software for which many instances are already deployed, it is important to introduce upgrades carefully, and to be aware of the various interdependencies.

4.1 Upgrade dependencies

If IURIs cannot be interpreted correctly, they should not be generated or transported. This suggests that upgrading URI interpreting software to accept IURIs should have highest priority.

On the other hand, a single IURI is interpreted only by a single or very few interpreters that are known in advance, while it may be entered and transported very widely.

Therefore, IURIs benefit most from a broad upgrade of software to be able to enter and transport IURIs, but before publishing any individual IURI, care should be taken to upgrade the corresponding interpreting software in order to cover the forms expected to be received by various versions of entry and transport software.

The upgrade of generating software to IURIs (instead of a local encoding) should happen only after the service is upgraded to accept IURIs. Similarly, IURIs should only be generated when the service accepts IURIs and the intervening infrastructure and protocol is known to transport them safely.

Display software should be upgraded only after upgraded entry software has been widely deployed to the population that will see the displayed result.

These recommendations, when taken together, will allow for the extension from URIs to IURIs in order to handle scripts other than ASCII while minimizing interoperability problems.
4.2 Examples: upgrading to IURIs within various contexts 4.2.1 IURIs within HTTP The HTTP protocol [RFC 2616] includes the URI of the resource being accessed as the 'Request-URI' in the request line. Most deployed HTTP servers do not restrict the octets allowed in the protocol. Therefore, upgrading from URIs to IURIs encoded in UTF-8 according to the recommendations of Section 3.1 will not be difficult. However, most deployed HTTP servers that access resources with localized non-ASCII naming do not currently translate the Request-URI's character encoding to a local form, and will need to be upgraded to accept such aliases. In order for URI composition and transmission software to know that the recipient HTTP server has been upgraded, it may be useful to define an extension field for HTTP which explicitly informs the client about the server's capabilities and translation rules in this area. For this purpose, the OPTIONS method can be used, with a return value that includes a header which has two known enumerated values: inturi = "inturi" ":" ("iuri" | "utf8") "iuri" means the server accepts and correctly interprets escaped IURIs. "utf8" means that the server also accepts IURIs sent in UTF-8, according to Section 3.1. This doesn't guarantee that the transport path can handle native UTF-8 all the way through a chain of proxies (a hop-by-hop header would be needed to ensure that). 5. Security Considerations If IURI entry software normalizes the characters entered, but the resource names on the interpreting side are not normalized accordingly, and the interpreting software does not take this into account, there is a possibility of "spoofing". Similar possibilities turn up when interpreting software accepts URIs in various native encodings or allows accents and similar things to be ignored. 
"Spoofing" means that somebody may add a resource name that looks the same or similar to the user while actually being different, or a resource name that contains the same characters, but in a different encoding. The added resource may pretend to be the real resource by looking very similar, but may contain all kinds of changes that may be difficult to spot but can cause all kinds of problems. Conceptually, this is no different from the problems surrounding the use of case-insensitive web servers. For example, a popular web page with a mixed case name (http://big.site/PopularPage.html) might be "spoofed" by someone who obtains access to (http://big.site/popularpage.html). However, the introduction of character normalization, of additional mappings for user convenience, and of mappings for various encodings may increase the number of spoofing possibilities. In some cases, in particular for Latin-based resource names, this is usually easy to detect because UTF-8-encoded names, when interpreted and viewed as legacy encodings, produce mostly garbage. In other cases, when concurrently used encodings have a similar structure, but there are no characters that have exactly the same encoding, detection is more difficult. A good example may be the concurrent use of Shift_JIS and EUC-JP on a Japanese server. Administrators of large sites which allow independent users to create subareas may need to be careful that the aliasing rules do not create chances for spoofing. Acknowledgements The issue addressed here has been discussed at numerous times over the last many years; for example, there was a thread in the HTML working group in August 1995 (under the topic of "Globalizing URIs") in the www-international mailing list in July 1996 (under the topic of "Internationalization and URLs"), and ad-hoc meetings at the Unicode conferences in September 1995 and September 1997. Thanks to Francois Yergeau, Chris Wendt, Yaron Goland, Graham Klyne, Roy Fielding, M.T. 
Carrasco Benitez, James Clark, Andrea Vine, and many others for help with understanding the issues and possible solutions. Thanks also to the members of the W3C I18N Working Group and Interest Group for their work on [CharMod].

Copyright

Copyright (C) The Internet Society, 1997. All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Authors' addresses

Larry Masinter
Xerox Corporation
3333 Coyote Hill Road
Palo Alto, CA 94304
masinter@parc.xerox.com
http://www.parc.xerox.com/masinter
Fax: +1 650 812-4333

Martin J. Duerst
W3C/Keio University
5322 Endo, Fujisawa
252-8520 Japan
duerst@w3.org
http://www.w3.org/People/D%C3%BCrst/
Tel/Fax: +81 466 49 1170

Note: The homepage URI of the second author contains a working escaped IURI.

Note: Please write "Duerst" with u-umlaut wherever possible, i.e. as "Dürst" in HTML.

References

[CharMod] M. Duerst, Ed., "Character Model for the World Wide Web".

[HTML4] "HTML 4.0", World Wide Web Consortium.

[IETFNorm] M. Duerst, M. Davis, "Character Normalization in IETF Protocols", Internet Draft, March 2000, work in progress.

[RFC 2119] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", March 1997.

[RFC 2141] R. Moats, "URN Syntax", May 1997.

[RFC 2192] C. Newman, "IMAP URL Scheme", September 1997.

[RFC 2279] F. Yergeau, "UTF-8, a transformation format of ISO 10646", January 1998.

[RFC 2384] R. Gellens, "POP URL Scheme", August 1998.

[RFC 2396] T. Berners-Lee, R. Fielding, L. Masinter, "Uniform Resource Identifiers (URI): Generic Syntax", August 1998.

[RFC 2616] R. Fielding, J. Gettys, et al., "Hypertext Transfer Protocol -- HTTP/1.1", June 1999.

[RFC 2640] B. Curtis, "Internationalization of the File Transfer Protocol", July 1999.

[RFC 2732] R. Hinden, B. Carpenter, L. Masinter, "Format for Literal IPv6 Addresses in URL's", December 1999.

[UNI15] M. Davis and M. Duerst, "Unicode Normalization Forms", Unicode Technical Report #15, November 1999.

[W3C IURI] "Internationalization - URIs and other identifiers".

[XML1] "XML 1.0", World Wide Web Consortium Recommendation.

Glossary/Index (to be completed)

URI, URN, IURI, UCS, escaped IURI, native IURI, URI reference, IURI reference,...