Unicode

Intro

Unicode is a coded character set. It was originally designed with code values up to 65,535, so the "natural" representation was as 16-bit numbers; it has since grown beyond that range
Unicode is intended to represent characters, not glyphs
Each character is intended to have a unique code point
Various compromises to these ideals exist for backward compatibility with older character sets
Main site is www.unicode.org
Reference book is "Unicode - a Primer" by Tony Graham

Character names

Each character is known by a unique name, such as DOUBLE-STRUCK ITALIC CAPITAL D (indexed as "D, DOUBLE-STRUCK ITALIC CAPITAL") with code point U+2145 and glyph ⅅ
The index of these is at http://www.unicode.org/charts/charindex.html

Code charts

Unicode covers a large number of characters from different languages and scripts
The characters are organised into code charts such as Coptic, Cyrillic, Greek, Basic Latin, Latin-1, Latin Extended, Hiragana, Katakana, etc
The charts are listed at http://www.unicode.org/charts/
Each chart lists a number of characters, a sample glyph, the code point and some properties
e.g. the Yen sign (Japanese currency symbol) is called "YEN SIGN" with code point U+00A5. A sample glyph is ¥. It is in the "Latin-1 Supplement" code chart
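As a minimal sketch, the name and chart of a code point can be looked up from Java. The class CharacterLookup and its method describe are illustrative names, Character.getName needs Java 7 or later, and Java reports the chart as a UnicodeBlock constant rather than by the chart's printed title:
```java
// Illustrative sketch: look up the Unicode name and block of a code point.
class CharacterLookup {
    // Formats a code point as "U+XXXX NAME (BLOCK)".
    static String describe(int codePoint) {
        return String.format("U+%04X %s (%s)",
                codePoint,
                Character.getName(codePoint),
                Character.UnicodeBlock.of(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x00A5)); // U+00A5 YEN SIGN (LATIN_1_SUPPLEMENT)
        System.out.println(describe(0x2145));
    }
}
```
The result depends on the version of the Unicode data bundled with the JDK, so very recent characters may come back without a name on an older runtime.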

Character type info

Type information commonly used by programs using ASCII characters is
- Alphabetic: is it a letter?
- Case: if so, is it uppercase or lowercase?
- Is it a digit?
- Is it a punctuation character?
- Is it whitespace?
These classifications also apply to Unicode, plus some extra categories
- All categories are expanded e.g. Numeric includes DIGIT ONE, ARABIC-INDIC DIGIT ONE, DEVANAGARI DIGIT ONE, BENGALI DIGIT ONE, ...
- Not only upper- and lower-case, but also title-case
- Whitespace is separated into space (SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, MONGOLIAN VOWEL SEPARATOR,...) and line breaking possibilities (SPACE, EXCLAMATION MARK, QUOTATION MARK,...)
- Directionality (left-to-right, right-to-left)
- East Asian width
- Surrogate/Decomposition/Combining (see later)
Info about character types is stored in a set of files such as UnicodeData.txt and PropList.txt, from the Unicode site
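As a rough sketch of how a program might read this data, each record in UnicodeData.txt is a single line of fields separated by semicolons, starting with the code point, name, general category, combining class, directionality and decomposition. The class UnicodeDataRecord is a hypothetical helper, and the sample record is the shortened one quoted in these notes rather than a complete line from the file:
```java
// Illustrative sketch: split one UnicodeData.txt record into its leading fields.
class UnicodeDataRecord {
    final int codePoint;
    final String name;
    final String generalCategory; // e.g. "Lu" = uppercase letter
    final int combiningClass;     // 0 for ordinary (non-combining) characters
    final String bidiClass;       // e.g. "L" = left-to-right
    final String decomposition;   // empty if the character does not decompose

    UnicodeDataRecord(String line) {
        // The limit -1 keeps trailing empty fields instead of dropping them.
        String[] f = line.split(";", -1);
        codePoint = Integer.parseInt(f[0], 16);
        name = f[1];
        generalCategory = f[2];
        combiningClass = Integer.parseInt(f[3]);
        bidiClass = f[4];
        decomposition = f[5];
    }

    public static void main(String[] args) {
        UnicodeDataRecord r = new UnicodeDataRecord(
                "00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;");
        System.out.println(r.name + ": category " + r.generalCategory
                + ", direction " + r.bidiClass
                + ", decomposes to " + r.decomposition);
    }
}
```
A real reader would also have to handle the remaining fields and the ranges that the file expresses as a pair of "First" and "Last" records.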

Combining characters

In some of the earlier character encodings, a pair of characters would represent a single character - typically an alphabetic character plus a non-spacing accent
For backward compatibility, Unicode also supports such combining sequences
e.g. U+04D6 CYRILLIC CAPITAL LETTER IE WITH BREVE is a single character. It is equivalent to U+0415 CYRILLIC CAPITAL LETTER IE combined with the breve accent U+0306 COMBINING BREVE
The two strings "U+04D6" and "U+0415 U+0306" both display the same glyph, but are not equal
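The inequality can be seen directly in Java. This is only a sketch, and CombiningDemo and codePoints are illustrative names:
```java
// Illustrative sketch: two strings that display alike but differ code point by code point.
class CombiningDemo {
    // Lists the code points of a string as "U+XXXX U+YYYY ...".
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String precomposed = "\u04D6";    // CYRILLIC CAPITAL LETTER IE WITH BREVE
        String combined = "\u0415\u0306"; // IE followed by COMBINING BREVE
        System.out.println(codePoints(precomposed));      // U+04D6
        System.out.println(codePoints(combined));         // U+0415 U+0306
        System.out.println(precomposed.equals(combined)); // false
    }
}
```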

Canonical forms

To check equality of strings, you can't just compare characters for equality - you also need to check for combined forms
Unicode distinguishes between two forms of transformation: canonical and compatibility
A canonical transformation preserves the identity of the character, such as 'A' combined with 'ring above' (U+0041 U+030A) being equivalent to 'A with ring above' (U+00C5) and also to 'Angstrom' (U+212B)
Canonical transforms are given in UnicodeData.txt:
```
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;
212B;ANGSTROM SIGN;Lu;0;L;00C5;
```
Angstrom sign is equivalent to U+00C5 which in turn is equivalent to U+0041 U+030A
A transformation can be a decomposition (one character into two) or a composition (two characters into one)
A character such as 'a with ring above and cedilla below' can be transformed into several decompositions:
- 'a' with 'ring above' with 'cedilla below'
- 'a' with 'cedilla below' with 'ring above'
A canonical reordering algorithm will choose a standard order out of these possibilities
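A small sketch of the reordering, using java.text.Normalizer from the standard library (ReorderDemo and nfd are illustrative names). It assumes the combining marks involved are U+030A COMBINING RING ABOVE and U+0327 COMBINING CEDILLA:
```java
import java.text.Normalizer;

// Illustrative sketch: canonical decomposition puts combining marks in a standard order.
class ReorderDemo {
    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String ringThenCedilla = "a\u030A\u0327"; // a + ring above + cedilla
        String cedillaThenRing = "a\u0327\u030A"; // a + cedilla + ring above
        System.out.println(ringThenCedilla.equals(cedillaThenRing));           // false
        System.out.println(nfd(ringThenCedilla).equals(nfd(cedillaThenRing))); // true
    }
}
```
The order chosen comes from the combining class of each mark, which is the fourth field of its UnicodeData.txt record.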
A compatibility transformation is a one-way transformation that may lose information. For example, a ligature such as the single character 'fi' may often be represented by the two characters 'f' and 'i'. This info is in UnicodeData.txt labelled by <compat>
```
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;
```

Unicode recognises four transformation algorithms (normalization forms D, C, KD and KC)

| | Not followed by canonical composition | Followed by canonical composition |
|---|---|---|
| Canonical decomposition | D | C |
| Compatibility decomposition | KD | KC |

If you want to compare strings, you first transform them both using the same transformation, and then compare them character by character
More strings are equal under compatibility transformations than under canonical ones. For example, the ligature fi (U+FB01) is compatibility-equivalent, but not canonically equivalent, to f+i
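A minimal sketch of such a comparison, using java.text.Normalizer, whose forms NFD, NFC, NFKD and NFKC correspond to D, C, KD and KC. NormalizedCompare and equalUnder are illustrative names:
```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Illustrative sketch: compare strings after applying the same transformation to both.
class NormalizedCompare {
    // True if the two strings are equal once both are normalised with the given form.
    static boolean equalUnder(Form form, String a, String b) {
        return Normalizer.normalize(a, form).equals(Normalizer.normalize(b, form));
    }

    public static void main(String[] args) {
        String angstrom = "\u212B"; // ANGSTROM SIGN
        String aRing = "A\u030A";   // A followed by COMBINING RING ABOVE
        String ligature = "\uFB01"; // LATIN SMALL LIGATURE FI
        System.out.println(equalUnder(Form.NFC, angstrom, aRing));  // true
        System.out.println(equalUnder(Form.NFD, ligature, "fi"));   // false
        System.out.println(equalUnder(Form.NFKD, ligature, "fi"));  // true
    }
}
```
Which form to use is a design choice: the canonical forms are the safer default, while the compatibility forms suit tasks such as searching where losing the distinction is acceptable.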

Character presentations

Some languages such as Arabic have different presentation forms based on context
The glyph in one context might be different to a glyph in another - see pp31-35 of the textbook
Unicode has these presentation forms for compatibility with previous standards
It is expected that ordinary programs will not use these - specialised programs such as renderers might use them

Text direction

Text direction is left-to-right for most languages, but right-to-left for some such as Arabic and Hebrew
A string is stored in logical order (the order in which it is read), not in displayed order: a right-to-left string is stored starting from its rightmost displayed character, e.g.
- left-to-right "abcde" is stored as "abcde"
- right-to-left "edcba" is stored as "abcde"
- Mixed: L-to-R "abcde" + R-to-L "jihgf" is stored as "abcdefghij"
Most characters have a directional property defined in UnicodeData.txt
Text may be bidirectional, a mixture of R-to-L and L-to-R. The directional property is used to lay out the text
You can override the layout algorithm by embedding directional characters in the text: RIGHT-TO-LEFT MARK (U+200F), RIGHT-TO-LEFT EMBEDDING (U+202B), RIGHT-TO-LEFT OVERRIDE (U+202E), POP DIRECTIONAL FORMATTING (U+202C)
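As a sketch, the directional property and the layout analysis are both available in Java through Character.getDirectionality and java.text.Bidi. DirectionDemo and isRightToLeft are illustrative names, and the sample text is an arbitrary mix of Latin and Hebrew letters:
```java
import java.text.Bidi;

// Illustrative sketch: query the directional property and detect mixed-direction text.
class DirectionDemo {
    // True for the two strong right-to-left classes (Hebrew-style and Arabic-style).
    static boolean isRightToLeft(int codePoint) {
        byte d = Character.getDirectionality(codePoint);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    public static void main(String[] args) {
        System.out.println(isRightToLeft('a'));    // false
        System.out.println(isRightToLeft(0x05D0)); // true: HEBREW LETTER ALEF
        // Stored in logical order: the Latin letters first, then the Hebrew letters.
        String mixed = "abc \u05D0\u05D1\u05D2";
        Bidi bidi = new Bidi(mixed, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        System.out.println(bidi.isMixed());        // true
    }
}
```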

Extended Characters

The original 16-bit range isn't big enough to represent all characters in all languages
It has been extended with supplementary characters, with code points from U+10000 up to U+10FFFF
In 16-bit form, each supplementary character is a surrogate pair: a first 16-bit value in the range U+D800-U+DBFF followed by a second in the range U+DC00-U+DFFF
Some algorithms will only look at the 16-bit chars. Others may look at the value of the first 16 bits and then also look at the next 16 bits
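A short sketch of how this looks in Java, where a char is 16 bits and a character outside that range occupies two of them. SurrogateDemo and utf16Units are illustrative names, and U+1D11E MUSICAL SYMBOL G CLEF is just a convenient sample:
```java
// Illustrative sketch: a character outside the 16-bit range takes two 16-bit units.
class SurrogateDemo {
    // Returns the 16-bit units that represent a code point, in hex.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int clef = 0x1D11E; // MUSICAL SYMBOL G CLEF
        System.out.println(utf16Units('A'));  // 0041
        System.out.println(utf16Units(clef)); // D834 DD1E
        String s = new String(Character.toChars(clef));
        System.out.println(s.length());                             // 2 (16-bit units)
        System.out.println(s.codePointCount(0, s.length()));        // 1 (character)
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
    }
}
```
Code that indexes a String by char position is in the first group of algorithms; code that uses codePointAt and codePointCount is in the second.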

Character properties in Java

The Character class has a number of static methods to give character properties
- getDirectionality(char ch) Returns the Unicode directionality property for the given character.
- getType(char ch) Returns a value indicating a character's general category, such as COMBINING_SPACING_MARK, CURRENCY_SYMBOL, DASH_PUNCTUATION, DECIMAL_DIGIT_NUMBER, END_PUNCTUATION, ...
- isDigit(char ch) Determines if the specified character is a digit.
- isLetter(char ch) Determines if the specified character is a letter.
- isLetterOrDigit(char ch) Determines if the specified character is a letter or digit.
- isLowerCase(char ch) Determines if the specified character is a lowercase character.
- isSpaceChar(char ch) Determines if the specified character is a Unicode space character.
- isTitleCase(char ch) Determines if the specified character is a titlecase character.
- isUpperCase(char ch) Determines if the specified character is an uppercase character.
Plus some conversion functions
- toLowerCase(char ch) Converts the character argument to lowercase using case mapping information from the UnicodeData file.
- toTitleCase(char ch) Converts the character argument to titlecase using case mapping information from the UnicodeData file.

The class java.lang.CharacterData is responsible for storing and accessing Unicode character data
Methods such as Character.isLetter() are passed to CharacterData
The data in this class is generated from the Unicode files "UnicodeData.txt" and "SpecialCasing.txt"
It is highly optimised: the 1M of Unicode text data is condensed to 14520 bytes, and the methods use fast bit-matching patterns where possible
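A brief sketch of these methods applied to non-ASCII characters. CharacterProperties and describe are illustrative names, and the int code point overloads are used here in place of the char versions listed above so that supplementary characters also work:
```java
// Illustrative sketch: summarise a few Character properties for one code point.
class CharacterProperties {
    static String describe(int cp) {
        StringBuilder sb = new StringBuilder(String.format("U+%04X:", cp));
        if (Character.isLetter(cp)) sb.append(" letter");
        if (Character.isDigit(cp)) sb.append(" digit=").append(Character.digit(cp, 10));
        if (Character.isUpperCase(cp)) sb.append(" upper");
        if (Character.isLowerCase(cp)) sb.append(" lower");
        if (Character.isTitleCase(cp)) sb.append(" title");
        if (Character.isSpaceChar(cp)) sb.append(" space");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe('A'));    // U+0041: letter upper
        System.out.println(describe(0x0661)); // U+0661: digit=1 (ARABIC-INDIC DIGIT ONE)
        System.out.println(describe(0x01C5)); // U+01C5: letter title
        System.out.println(describe(0x00A0)); // U+00A0: space (NO-BREAK SPACE)
    }
}
```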

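UTF-16 encoding

To store strings in files or to send them between applications, a character encoding must be used
The simplest encoding for Unicode is UTF-16
Each code point in the 16-bit range becomes a single 16-bit integer (a supplementary character becomes a pair of them), and a sequence of code points becomes a sequence of bytes
The order of the two bytes within each 16-bit integer (big-endian or little-endian) has to be agreed
This is the "natural" encoding for Unicode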
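As a sketch, the bytes produced by UTF-16 can be inspected with the standard charsets in java.nio.charset.StandardCharsets. Utf16Demo and hex are illustrative names:
```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: the same string as UTF-16 bytes in both byte orders.
class Utf16Demo {
    // Formats bytes as two-digit hex values separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u20AC"; // LATIN CAPITAL LETTER A, EURO SIGN
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41 20 AC
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00 AC 20
    }
}
```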
  
UTF-8 encoding

Most existing text uses only the ASCII character set
Using 16 bits for characters that can be written in 8 bits is wasteful of space
UTF-8 is an encoding which uses 8, 16, 24 or 32 bits for each character, where the ASCII characters are written using 8 bits (top bit zero)
When the top bit is one, the byte is part of a sequence of 2, 3 or 4 bytes that represents one character
This is a variable-length encoding, but not a stateful one: the first byte of a sequence says how many bytes it contains, and the following bytes all have the form 10xxxxxx
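A minimal sketch of the byte counts, and of how the first byte of a sequence alone tells a decoder how many bytes the character occupies. Utf8Demo, encode and sequenceLength are illustrative names:
```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: UTF-8 sequence lengths, measured and read from the lead byte.
class Utf8Demo {
    // Encodes a single code point as UTF-8.
    static byte[] encode(int codePoint) {
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    // Length of the sequence that starts with this byte, or -1 if it cannot start one.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;           // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;                        // 10xxxxxx continuation byte, or invalid
    }

    public static void main(String[] args) {
        int[] samples = {'A', 0x00E9, 0x20AC, 0x1D11E};
        for (int cp : samples) {
            byte[] bytes = encode(cp);
            System.out.printf("U+%04X: %d byte(s), lead byte says %d%n",
                    cp, bytes.length, sequenceLength(bytes[0]));
        }
    }
}
```
The four samples take 1, 2, 3 and 4 bytes respectively.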
  
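Other encodings

ISO 10646 is a character set with a 32-bit form, which coincides with Unicode. It has encodings as UCS-4 (4 bytes per character) and UCS-2 (2 bytes per character). UCS-2 is the same as UTF-16 on the 16-bit subset, but has no surrogate pairs and so cannot represent supplementary characters
UTF-7 is a 7-bit encoding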
  
Checking encoding

The ZERO WIDTH NO-BREAK SPACE U+FEFF takes up zero space in display and shows nothing
In the different possible encodings of Unicode, it is represented by
- 00 00 FE FF UCS-4, big-endian
- FF FE 00 00 UCS-4, little-endian
- EF BB BF UTF-8
- FE FF UCS-2 or UTF-16, big-endian
- FF FE UCS-2 or UTF-16, little-endian
If given as the first character in a string, it can be used as a signal for the encoding (a "byte order mark")
Both sender and receiver have to use this convention for it to work
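A receiver might apply the convention along these lines. This is only a sketch: BomSniffer and detect are hypothetical names, the byte patterns are those of U+FEFF in each encoding, and data without the mark is simply reported as unknown:
```java
// Illustrative sketch: guess the encoding from a leading U+FEFF, if one is present.
class BomSniffer {
    static String detect(byte[] data) {
        // The 4-byte patterns are tested first, because FF FE 00 00 also starts with FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UCS-4 big-endian";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UCS-4 little-endian";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16 big-endian";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16 little-endian";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(sample)); // UTF-8
    }
}
```
The check is ambiguous in one case: UTF-16 little-endian text whose first real character is U+0000 looks the same as UCS-4 little-endian, so the result is best treated as a hint.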
  
Jan Newmarch <jan@newmarch.name>

Last modified: Fri Apr 15 13:47:49 EST 2005