IETF

Internet protocols

In order for two computer applications to communicate, they must do so using an agreed protocol - a "language" between them
Messages can be connectionless (like letters) or connection-oriented (like a phone call)
Connectionless protocols include
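- HTTP (stateless: each request/response exchange stands alone, although it is normally carried over a TCP connection)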
Connection-oriented protocols include
- telnet
- ftp
From the viewpoint of i18n, connection-oriented or connectionless is not important
Protocols can send messages in
- byte format (e.g. short integers use 2 bytes)
- string oriented (e.g. a 5 digit integer takes 5 bytes in an ASCII encoding, 10 bytes in UTF-16)
String oriented formats are very important to i18n
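
The size difference between the two formats can be seen with a small sketch. The value 12345 and the choice of a 16-bit network-order integer are only illustrative assumptions.

```python
import struct

value = 12345

# Byte format: an unsigned 16-bit integer in network (big-endian) order.
binary = struct.pack("!H", value)

# String-oriented formats: the same number written as decimal digits.
ascii_text = str(value).encode("ascii")
utf16_text = str(value).encode("utf-16-be")  # explicit byte order, no BOM

print(len(binary), len(ascii_text), len(utf16_text))  # 2 5 10
```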

Distributed versus local applications

A local application can assume known data-types and message formats

e.g. a Java program can ask for the set of locales and will get an array of Locale objects
The definition of Locale objects gives methods to get strings for country and language

A distributed application will send messages from a process on one computer to a process on another
The format of messages must be either
- assumed; or
- negotiated
Text messages have three aspects
- character set e.g. Unicode
- coded character set e.g. Unicode code points (U+0000 to U+10FFFF)
- character encoding e.g. UTF-8 or UTF-16
There may also be transport encoding issues e.g. big-endian or little-endian byte order
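
As an illustration, the same four characters produce different octets depending on the character encoding and the byte order. The sample string is an arbitrary choice.

```python
text = "caf\u00e9"  # "cafe" with U+00E9 as the last character

utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")  # big-endian
utf16_le = text.encode("utf-16-le")  # little-endian

print(utf8.hex(" "))      # 63 61 66 c3 a9
print(utf16_be.hex(" "))  # 00 63 00 61 00 66 00 e9
print(utf16_le.hex(" "))  # 63 00 61 00 66 00 e9 00
```

A receiver that assumes the wrong encoding or byte order will decode different text, which is why the format must be assumed or negotiated.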

Architectural Principles of the Internet

RFC 1958 "Architectural Principles of the Internet" gives general guidelines for applications, protocols etc designed for the internet http://www.faqs.org/rfcs/rfc1958.html
With regard to names, it states: "4.3 Public (i.e. widely visible) names should be in case-independent ASCII. Specifically, this refers to DNS names, and to protocol elements that are transmitted in text format."
"5.4 Designs should be fully international, with support for localisation (adaptation to local character sets). In particular, there should be a uniform approach to character set tagging for information content."

Charset Registration Procedures

RFC 2278 - IANA Charset Registration Procedures http://www.faqs.org/rfcs/rfc2278.html gives rules on how to register new names
The syntax of charset identifiers is defined here, and is basically: a charset name is an ASCII string - e.g. ""
Of course, the charset described can have non-ASCII characters in its character set!
The site "Character Sets" http://www.iana.org/assignments/character-sets/ contains a list of charsets registered

e.g.

Name: ISO-10646-UCS-2
MIBenum: 1000
Source: the 2-octet Basic Multilingual Plane, aka Unicode
        this needs to specify network byte order: the standard
        does not specify (it is a 16-bit integer space)
Alias: csUnicode

IETF terminology

RFC 3536 "Terminology Used in Internationalization in the IETF" at http://www.faqs.org/rfcs/rfc3536.html gives a glossary of i18n-related terms and defines them for use in other IETF documents. Many of these terms are used as defined by other groups
Fundamental terms

language, script, character, coded character, coded character set, character encoding form (CEF), repertoire, glyph, glyph code, transcoding, character encoding scheme (CES), charset, internationalization, localization, i18n, l10n, multilingual, displaying and rendering text

Language

A language is a way that humans interact...the most common of which are speech, writing and signing. Language identifiers for use in internet protocols are defined in RFC3066

Script

A set of graphic characters used for the written form of one or more languages

CCS

Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers

CES

Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets. A definition of a character encoding scheme consists of:
- A description of an algorithm which transforms every possible sequence of octets to either a sequence of pairs <CCS identifier, code point> or to the error state "illegal octet sequence"
- Specifications, either by reference to CCS's registered by IANA or in text, of each CCS upon which this CES is based.
'Encoding' is used synonymously with 'CES'.

charset

The term "charset" means a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry ... (Note that this is NOT a term used by other standards bodies, such as ISO).
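
The layering of these terms can be sketched in a few lines. The euro sign is an arbitrary example, and the Python codec names stand in for charset identifiers.

```python
ch = "\u20ac"  # EURO SIGN

# Coded character set: abstract character -> integer (code point).
code_point = ord(ch)

# Character encoding scheme: code point -> octets.
utf8_octets = ch.encode("utf-8")
utf16_octets = ch.encode("utf-16-be")

# charset: the rule that maps a sequence of octets back to characters.
assert utf8_octets.decode("utf-8") == ch
assert utf16_octets.decode("utf-16-be") == ch

print(hex(code_point), utf8_octets.hex(), utf16_octets.hex())  # 0x20ac e282ac 20ac
```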

Domain Name System

The DNS is a core part of the internet, mapping names to IP addresses
Names in the current DNS use a subset of US ASCII, defined in RFC1035: upper- and lower-case ASCII letters, digits and the hyphen '-'. Names must start with a letter. Domains and subdomains are names which are separated by a full stop '.'. The subset is sometimes labelled the "LDH" (letters, digits, hyphen) subset of US ASCII
The DNS is a distributed lookup system, with name servers arranged in a tree-like hierarchy. Requests are made of nodes, and the request may be passed up and down the tree until a match is found. Servers can cache names to improve speed of common lookups
The only pattern-matching allowed is that the domain may sometimes be inferred e.g. a search for www may have the local domain added, giving www.monash.edu.au
Name matching is exact - a byte-by-byte comparison of the ASCII strings. No wildcard matching, or near-miss matching is supported by DNS servers. However, the match is case insensitive
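
A minimal sketch of the LDH rule and of case-insensitive matching, following the description above rather than the full RFC1035 grammar. The function names `is_ldh_name` and `names_match` are made up for this example.

```python
import re

# Letters, digits and hyphen, starting with a letter (as described above).
LDH_LABEL = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

def is_ldh_name(name: str) -> bool:
    """True if every dot-separated label uses only the LDH subset."""
    return all(LDH_LABEL.fullmatch(label) for label in name.split("."))

def names_match(a: str, b: str) -> bool:
    """Exact comparison of the ASCII bytes, ignoring only letter case."""
    return a.encode("ascii").lower() == b.encode("ascii").lower()

print(is_ldh_name("www.monash.edu.au"))                       # True
print(is_ldh_name("b\u00fccher.example"))                     # False
print(names_match("WWW.Monash.EDU.au", "www.monash.edu.au"))  # True
```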
Note that names are usually not words. e.g. www is not a word in any language
Any changes to the DNS must be backward compatible and cannot break any existing DNS server
Specialising to i18n, any attempt to internationalise the DNS must not break existing DNS servers
i18n attempts also run into locale and Unicode problems, which is why the work hasn't been finalised yet

Internationalizing Domain Names in Applications (IDNA, RFC3490)

Existing DNS servers use a subset of ASCII for domain names/labels, so any i18n versions of domain names should only use this subset too
IDNA allows labels to be in Unicode (version 3.2)
IDNA defines two operations toASCII and toUnicode
A label that is to be stored or matched against a DNS entry must be converted to ASCII using toASCII
A label that should be displayed to the user should be converted to Unicode by toUnicode. This means that any application that does e.g.
```
print getHostByName()
```
should be altered to
```
print toUnicode(getHostByName())
```
or even to
```
print toBig5(toUnicode(getHostByName()))
```
A label in any other character set (e.g. Big-5) must be converted to or from Unicode
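
Python ships an implementation of these operations in its standard library (`encodings.idna` and the `"idna"` codec), so the two conversions can be tried directly. The label "bücher" and the `.example` domain are arbitrary test values.

```python
from encodings.idna import ToASCII, ToUnicode

label = "b\u00fccher"  # "bucher" with U+00FC as the second letter

ace = ToASCII(label)                     # form stored in or matched against the DNS
shown = ToUnicode(ace.decode("ascii"))   # form displayed to the user

print(ace)             # b'xn--bcher-kva'
print(shown == label)  # True

# The "idna" codec applies the same conversions to each label of a full name.
wire = "b\u00fccher.example".encode("idna")
print(wire)            # b'xn--bcher-kva.example'
```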

The full stop '.'

Domain labels in ASCII are separated by the ASCII character '.' with code point 0x2E
In IDNA, whenever dots are used as label separators, the following characters MUST be recognized as dots: U+002E (full stop), U+3002 (ideographic full stop), U+FF0E (fullwidth full stop), U+FF61 (halfwidth ideographic full stop).
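
Splitting a name into labels therefore has to accept all four dots. A minimal sketch (the helper name `split_labels` is hypothetical):

```python
import re

# U+002E, U+3002, U+FF0E and U+FF61 all act as label separators in IDNA.
IDNA_DOTS = re.compile("[\u002e\u3002\uff0e\uff61]")

def split_labels(name: str) -> list[str]:
    return IDNA_DOTS.split(name)

# Ideographic full stop and fullwidth full stop used as separators.
print(split_labels("www\u3002example\uff0ecom"))  # ['www', 'example', 'com']
```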

The hyphen

The hyphen character '-' with ASCII code point 0x2D is allowed in the LDH subset
There are many possible hyphen characters in Unicode: 002D HYPHEN-MINUS, 058A ARMENIAN HYPHEN, 1806 MONGOLIAN TODO SOFT HYPHEN, 2010..2015 HYPHEN..HORIZONTAL BAR, 2053 SWUNG DASH, 207B SUPERSCRIPT MINUS, 208B SUBSCRIPT MINUS, 2212 MINUS SIGN, 301C WAVE DASH, 3030 WAVY DASH, FE58 SMALL EM DASH, FE63 SMALL HYPHEN-MINUS, and FF0D FULLWIDTH HYPHEN-MINUS

ToASCII algorithm

1. If the sequence contains any code points outside the ASCII range (0..7F) then proceed to step 2, otherwise skip to step 3.
2. Perform the steps specified in [NAMEPREP] and fail if there is an error. The AllowUnassigned flag is used in [NAMEPREP].
3. If the UseSTD3ASCIIRules flag is set, then perform these checks:
   (a) Verify the absence of non-LDH ASCII code points; that is, the absence of 0..2C, 2E..2F, 3A..40, 5B..60, and 7B..7F.
   (b) Verify the absence of leading and trailing hyphen-minus; that is, the absence of U+002D at the beginning and end of the sequence.
4. If the sequence contains any code points outside the ASCII range (0..7F) then proceed to step 5, otherwise skip to step 8.
5. Verify that the sequence does NOT begin with the ACE prefix.
6. Encode the sequence using the encoding algorithm in [PUNYCODE] and fail if there is an error.
7. Prepend the ACE prefix.
8. Verify that the number of code points is in the range 1 to 63 inclusive.
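
The steps above can be sketched in Python using the standard library's `nameprep` function and `"punycode"` codec. This is an illustration, not a reference implementation: the function name `to_ascii` and its parameter are made up here, and the AllowUnassigned flag is not modelled.

```python
from encodings.idna import nameprep

ACE_PREFIX = "xn--"

def to_ascii(label: str, use_std3_ascii_rules: bool = True) -> str:
    # Steps 1-2: only labels with non-ASCII code points go through NamePrep.
    if any(ord(c) > 0x7F for c in label):
        label = nameprep(label)  # raises UnicodeError (a ValueError) on failure
    # Step 3: optional host name checks on the ASCII code points.
    if use_std3_ascii_rules:
        for c in label:
            if ord(c) <= 0x7F and not (c.isalnum() or c == "-"):
                raise ValueError(f"non-LDH code point U+{ord(c):04X}")
        if label.startswith("-") or label.endswith("-"):
            raise ValueError("leading or trailing hyphen-minus")
    # Step 4: pure ASCII labels skip straight to the length check.
    if any(ord(c) > 0x7F for c in label):
        # Step 5: the input must not already look like an ACE label.
        if label.lower().startswith(ACE_PREFIX):
            raise ValueError("label already begins with the ACE prefix")
        # Steps 6-7: Punycode-encode, then prepend the ACE prefix.
        label = ACE_PREFIX + label.encode("punycode").decode("ascii")
    # Step 8: DNS labels are 1 to 63 code points long.
    if not 1 <= len(label) <= 63:
        raise ValueError("label length out of range")
    return label

print(to_ascii("b\u00fccher"))  # xn--bcher-kva
print(to_ascii("example"))      # example
```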

ToUnicode algorithm

1. If all code points in the sequence are in the ASCII range (0..7F) then skip to step 3.
2. Perform the steps specified in [NAMEPREP] and fail if there is an error. (If step 3 of ToASCII is also performed here, it will not affect the overall behavior of ToUnicode, but it is not necessary.) The AllowUnassigned flag is used in [NAMEPREP].
3. Verify that the sequence begins with the ACE prefix, and save a copy of the sequence.
4. Remove the ACE prefix.
5. Decode the sequence using the decoding algorithm in [PUNYCODE] and fail if there is an error. Save a copy of the result of this step.
6. Apply ToASCII.
7. Verify that the result of step 6 matches the saved copy from step 3, using a case-insensitive ASCII comparison.
8. Return the saved copy from step 5.
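
A matching sketch of the steps above, again using Python's standard library. The name `to_unicode` is made up for this example, and step 6 simply reuses the library's own `ToASCII`. Note that in RFC 3490 ToUnicode never fails: if any step fails, the original input is returned unchanged, which is what the `except` clause models.

```python
from encodings.idna import ToASCII, nameprep

ACE_PREFIX = "xn--"

def to_unicode(label: str) -> str:
    original = label
    try:
        # Steps 1-2: NamePrep is only needed for non-ASCII input.
        if any(ord(c) > 0x7F for c in label):
            label = nameprep(label)
        # Step 3: check for the ACE prefix (case-insensitively), keep a copy.
        if not label.lower().startswith(ACE_PREFIX):
            return original
        saved_ace = label
        # Step 4: remove the ACE prefix.
        encoded = label[len(ACE_PREFIX):]
        # Step 5: Punycode-decode and keep the result.
        decoded = encoded.encode("ascii").decode("punycode")
        # Step 6: apply ToASCII to the decoded label.
        round_trip = ToASCII(decoded).decode("ascii")
        # Step 7: the round trip must reproduce the saved ACE label.
        if round_trip.lower() != saved_ace.lower():
            return original
        # Step 8: return the decoded Unicode label.
        return decoded
    except UnicodeError:
        return original

print(to_unicode("xn--bcher-kva") == "b\u00fccher")  # True
print(to_unicode("example"))                         # example
```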

ACE prefix

The ACE prefix, used in the conversion operations, is two alphanumeric ASCII characters followed by two hyphen-minuses.
The prefix cannot be any of the prefixes already used in earlier documents, which includes the following: "bl--", "bq--", "dq--", "lq--", "mq--", "ra--", "wq--" and "zq--". The ToASCII and ToUnicode operations MUST recognize the ACE prefix in a case-insensitive manner.
The ACE prefix for IDNA is "xn--" or any capitalization thereof.
This means that an ACE label might be "xn--de-jg4avhby1noc0d", where "de-jg4avhby1noc0d" is the part of the ACE label that is generated by the encoding steps in [PUNYCODE].
While all ACE labels begin with the ACE prefix, not all labels beginning with the ACE prefix are necessarily ACE labels. Non-ACE labels that begin with the ACE prefix will confuse users and SHOULD NOT be allowed in DNS zones.

Punycode (RFC 3492)

Punycode is a simple and efficient transfer encoding syntax designed for use with Internationalized Domain Names in Applications (IDNA).
It uniquely and reversibly transforms a Unicode string into an ASCII string.
ASCII characters in the Unicode string are represented literally, and non-ASCII characters are represented by ASCII characters that are allowed in host name labels (letters, digits, and hyphens).
For details, see RFC3492 http://www.faqs.org/rfcs/rfc3492.html
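
Python's standard `"punycode"` codec can be used to see the encoding at work. The sample label is arbitrary; note that the ASCII letters appear literally before the final hyphen.

```python
label = "b\u00fccher"  # "bucher" with U+00FC as the second letter

encoded = label.encode("punycode")
decoded = encoded.decode("punycode")

print(encoded)           # b'bcher-kva'
print(decoded == label)  # True: the transformation is reversible
```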

StringPrep (RFC3454)

StringPrep is a general framework for preparing Unicode strings so that strings from multiple sources (from a DNS server, entered at a keyboard, etc) can be meaningfully compared
StringPrep is general. A profile is a particular specification of StringPrep algorithms
Applications need to specify a StringPrep profile (e.g. DNS specifies the NamePrep profile)

StringPrep algorithm

1. Map -- For each character in the input, check if it has a mapping and, if so, replace it with its mapping
   - map a character to nothing e.g. 00AD SOFT HYPHEN
   - map upper to lower case
   - additional mapping e.g. 2102 DOUBLE-STRUCK CAPITAL C to 0063 LATIN SMALL LETTER C
2. Normalize -- Possibly normalize the result of step 1 using Unicode normalization. This is described in section 4 of RFC 3454.
3. Prohibit -- Check for any characters that are not allowed in the output. If any are found, return an error. This is described in section 5 of RFC 3454.
4. Check bidi -- Possibly check for right-to-left characters, and if any are found, make sure that the whole string satisfies the requirements for bidirectional strings. If the string does not satisfy the requirements for bidirectional strings, return an error. This is described in section 6 of RFC 3454.
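
The four steps can be sketched with Python's `stringprep` module, which exposes the RFC 3454 tables. The function name `prepare` is made up, and the particular choice of tables (B.1, B.2, NFKC, and the prohibition tables listed) is an assumption borrowed from the NamePrep profile; other profiles choose differently.

```python
import stringprep
import unicodedata

# Prohibition tables chosen for this sketch (the ones NamePrep uses).
PROHIBITED = (
    stringprep.in_table_c12, stringprep.in_table_c22, stringprep.in_table_c3,
    stringprep.in_table_c4, stringprep.in_table_c5, stringprep.in_table_c6,
    stringprep.in_table_c7, stringprep.in_table_c8, stringprep.in_table_c9,
)

def prepare(text: str) -> str:
    # 1. Map: drop characters in table B.1, case-fold the rest with table B.2.
    text = "".join(
        stringprep.map_table_b2(ch) for ch in text
        if not stringprep.in_table_b1(ch)
    )
    # 2. Normalize: form KC, using the Unicode 3.2 database.
    text = unicodedata.ucd_3_2_0.normalize("NFKC", text)
    # 3. Prohibit: reject any character found in a prohibition table.
    for ch in text:
        if any(check(ch) for check in PROHIBITED):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    # 4. Check bidi: right-to-left strings must not mix in left-to-right
    #    characters and must start and end with a right-to-left character.
    if any(stringprep.in_table_d1(ch) for ch in text):
        if any(stringprep.in_table_d2(ch) for ch in text):
            raise ValueError("mixed right-to-left and left-to-right text")
        if not (stringprep.in_table_d1(text[0])
                and stringprep.in_table_d1(text[-1])):
            raise ValueError("right-to-left text must start and end RTL")
    return text

print(prepare("A\u00adB"))  # ab  (soft hyphen removed, case folded)
print(prepare("\u2102"))    # c   (DOUBLE-STRUCK CAPITAL C folded)
```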

NamePrep (RFC3491)

This is the StringPrep profile used by IDNA
This profile uses Unicode 3.2
This profile specifies mapping using StringPrep tables B.1 (characters commonly mapped to nothing) and B.2 (case folding for use with normalisation form KC)
This profile specifies using Unicode normalization form KC
This profile specifies prohibiting non-ASCII space characters, non-ASCII control characters, and others unsuitable for display
This profile specifies checking bidirectional strings as described in [STRINGPREP] section 6. This excludes characters such as 202A; LEFT-TO-RIGHT EMBEDDING
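
Python's standard library includes a NamePrep implementation (`encodings.idna.nameprep`), which makes the profile easy to try out. The inputs below are arbitrary test strings.

```python
from encodings.idna import nameprep

# Upper case is folded and "u" + COMBINING DIAERESIS is composed by form KC.
prepared = nameprep("Bu\u0308cher")
print(prepared == "b\u00fccher")  # True

# LEFT-TO-RIGHT EMBEDDING (U+202A) is rejected.
try:
    nameprep("a\u202ab")
    rejected = False
except UnicodeError:
    rejected = True
print(rejected)  # True
```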

So: What's wrong with IDNA?

Reference: "Review and Recommendations for Internationalized Domain Names"

Unicode is now at version 5.0. IDNA requires Unicode 3.2, and the upgrade path is not clear. For example, code points not explicitly listed in NamePrep cannot be used, so the new characters in later versions of Unicode cannot be used

IDNA is character based, not language based, so language normalisations are not used

Scripts are ignored in string equivalences, so e.g. the three ways of writing in Japanese are not treated as equivalent

Visually confusable characters were ignored, opening the door to phishing attacks. e.g. GREEK SMALL LETTER OMICRON can be used in place of LATIN SMALL LETTER O and will look the same. A phisher could register such names, until IANA said that all characters must belong to the same script (scripts are defined at http://www.unicode.org/reports/tr24/)
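
A small sketch of the problem. The label is made up, and taking the first word of each Unicode character name is only a crude stand-in for the real script property defined in TR24, which Python's standard library does not expose.

```python
import unicodedata

genuine = "monash"
spoof = "m\u03bfnash"  # GREEK SMALL LETTER OMICRON in place of "o"

def scripts_by_name(label: str) -> set[str]:
    # Crude approximation: "LATIN SMALL LETTER M" -> "LATIN".
    return {unicodedata.name(ch).split()[0] for ch in label}

print(genuine == spoof)                  # False: different code points
print(sorted(scripts_by_name(genuine)))  # ['LATIN']
print(sorted(scripts_by_name(spoof)))    # ['GREEK', 'LATIN']

# IDNA accepts the mixed-script label and gives it its own ACE form.
print(spoof.encode("idna").startswith(b"xn--"))  # True
```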

User issues

Users get domain names from a variety of sources: not only electronically (as in email) but also from conversation, billboards, TV, etc
Users will remember domain names and then try to recall them later
The characters LATIN SMALL LETTER O WITH STROKE and LATIN SMALL LETTER O WITH DIAERESIS are treated as equivalent by most people in Sweden and Norway, so the actual character may not be recalled correctly. They are not equivalent in Unicode and cannot be made equivalent (the second is used in German, but not the first). Should registration of one name mean automatic registration of another?
Japanese has three scripts: strings in one script are not equivalent to strings in another. So a label remembered in Kanji and recalled in Katakana will be different. Should registration of one name mean automatic registration of another?
There are no transformations between language variants, such as Traditional Chinese and Simplified Chinese
Bidi text brings in special issues: "ABCD" will appear the same as "R-TO-L DCBA"
Some characters belong to many scripts and may have different meanings in each script (e.g. the characters common to Japanese, Chinese and Korean). Names containing these characters will be ambiguous unless the intended script is known, and this information cannot be included
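
Several of the cases above can be checked against Python's NamePrep implementation. The sample strings are chosen only for illustration; each pair stays distinct after NamePrep, so the two spellings would be different domain labels.

```python
from encodings.idna import nameprep

pairs = {
    "o with stroke / o with diaeresis": ("\u00f8", "\u00f6"),
    "hiragana / katakana": ("\u306b\u307b\u3093", "\u30cb\u30db\u30f3"),
    "traditional / simplified Chinese": ("\u570b", "\u56fd"),
}

# True would mean NamePrep treats the two spellings as the same label.
equivalent = {
    name: nameprep(a) == nameprep(b) for name, (a, b) in pairs.items()
}

for name, same in equivalent.items():
    print(name, same)  # False for every pair
```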

Jan Newmarch <jan@newmarch.name>

Last modified: Mon Sep 4 11:56:26 EST 2006