Normalization, sorting and searching

Normalization

Some characters can be represented in multiple ways. For example,
- LATIN CAPITAL LETTER A WITH RING ABOVE U+00C5
- ANGSTROM SIGN U+212B
- LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE, U+0041 U+030A
are all regarded by Unicode as canonically equivalent, whereas in
- LATIN SMALL LIGATURE FI U+FB01
- LATIN SMALL LETTER F followed by LATIN SMALL LETTER I, U+0066 U+0069
transforming the first to the second is a compatibility transformation that loses some information (here, the fact that a ligature was used)

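Unicode defines four normalization forms, distinguished by the kind of decomposition and by whether it is followed by canonical composition

	Not followed by canonical composition	Followed by canonical composition
Canonical decomposition	D (NFD)	C (NFC)
Compatibility decomposition	KD (NFKD)	KC (NFKC)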
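
The four forms can be tried out with the public java.text.Normalizer class, which has been part of the standard library since Java 6. This is a minimal sketch and is not the sun.text.Normalizer code discussed in this section; the class name NormalForms and the helpers hex and form are made up for illustration.

```java
import java.text.Normalizer;

class NormalForms {
    // Render each UTF-16 code unit as U+XXXX so that invisible differences show up
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Thin wrapper around the standard normalize() call
    static String form(String s, Normalizer.Form f) {
        return Normalizer.normalize(s, f);
    }

    public static void main(String[] args) {
        // A-ring, Angstrom sign, A + combining ring, fi ligature
        String[] samples = { "\u00C5", "\u212B", "A\u030A", "\uFB01" };
        for (String s : samples) {
            System.out.println(hex(s)
                + "  D: " + hex(form(s, Normalizer.Form.NFD))
                + "  C: " + hex(form(s, Normalizer.Form.NFC))
                + "  KD: " + hex(form(s, Normalizer.Form.NFKD))
                + "  KC: " + hex(form(s, Normalizer.Form.NFKC)));
        }
    }
}
```

The three A-ring variants all give U+0041 U+030A in form D and U+00C5 in form C. The ligature U+FB01 is left alone by forms D and C and only becomes U+0066 U+0069 under KD and KC.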

Java Normalizer class

This class was developed by IBM and is publicly available in the ... package
It is also included as a private implementation class in Sun's Java, as sun.text.Normalizer. Code that uses it may break in later releases. It is used internally by the Collator class


      package sun.text;
      
      public class Normalizer {
          public static Mode COMPOSE;
          public static Mode COMPOSE_COMPAT;
          public static Mode  DECOMP;
          public static Mode DECOMP_COMPAT;
      
          public static String normalize(String str, Mode mode, int options);
          public static String compose(String source, boolean compat, int options);
          public static String decompose(String source, boolean compat, int options);
      }

Warning: this program uses IBM code private to Sun's implementation of Java and may not be supported in later versions of Java




import sun.text.Normalizer;

public class Normal {

    public static void main(String[] args) {
        // A-ring in three encodings, then the fi ligature and the plain letters f, i
        String str = "\u00C5 \u212B \u0041\u030A \uFB01 \u0066\u0069";

        System.out.println("Chars are\n" +
                           "   U+00C5  LATIN CAPITAL LETTER A WITH RING ABOVE\n" +
                           "   U+212B  ANGSTROM SIGN\n" +
                           "   U+0041  LATIN CAPITAL LETTER A\n" +
                           "   U+030A  COMBINING RING ABOVE\n" +
                           "   U+FB01  LATIN SMALL LIGATURE FI\n" +
                           "   U+0066  LATIN SMALL LETTER F\n" +
                           "   U+0069  LATIN SMALL LETTER I");
 
	printUnicode("Original\t", str);
	printUnicode("Decomposed\t", Normalizer.normalize(str, Normalizer.DECOMP, 0));
	printUnicode("Composed\t", Normalizer.normalize(str, Normalizer.COMPOSE, 0));
	printUnicode("Decomposed compat", Normalizer.normalize(str, Normalizer.DECOMP_COMPAT, 0));
	printUnicode("Composed compat\t", Normalizer.normalize(str, Normalizer.COMPOSE_COMPAT, 0));
    }

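    // Print every UTF-16 code unit of s as a hex escape, keeping spaces as separators
    private static void printUnicode(String label, String s) {
        System.out.print(label + '\t');
        for (int n = 0; n < s.length(); n++) {
            char ch = s.charAt(n);
            if (ch == ' ') {
                System.out.print(' ');
            } else {
                // Pad to four hex digits so that U+0041 keeps its leading zeros
                System.out.print(String.format("\\u%04x", (int) ch));
            }
        }
        System.out.println();
    }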
}

String equality

In ASCII, two strings are equal if they are byte-by-byte equal
Two Unicode strings may be considered equal if their normalised forms are equal
There are two distinct equalities: canonical equality, where the canonical forms (D or C) are equal, and compatibility equality, where the compatibility forms (KD or KC) are equal. It doesn't matter whether the strings are transformed to composed or decomposed form, as long as both are transformed the same way
If two strings are canonically equal, then they are also compatibility equal, but not necessarily the other way round
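
The two equalities can be sketched with the public java.text.Normalizer class (Java 6 and later). The class and method names below are illustrative only, and decomposed forms are used simply because either choice works.

```java
import java.text.Normalizer;

class UnicodeEquality {
    // Canonical equality: compare the canonical decompositions (form D)
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
            .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // Compatibility equality: compare the compatibility decompositions (form KD)
    static boolean compatibilityEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
            .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C5";   // A with ring above
        String combining = "A\u030A";    // A followed by combining ring
        String ligature = "\uFB01";      // fi ligature

        System.out.println(precomposed.equals(combining));             // false
        System.out.println(canonicallyEqual(precomposed, combining));  // true
        System.out.println(canonicallyEqual(ligature, "fi"));          // false
        System.out.println(compatibilityEqual(ligature, "fi"));        // true
        System.out.println(compatibilityEqual(precomposed, combining)); // true
    }
}
```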

Collator equality

The Java Collator class is the main public class for locale-sensitive comparison of strings
It has a string equality method boolean equals(String source, String target) and a string comparison method int compare(String source, String target)
A Collator is obtained from the factory method Collator.getInstance(), which can take a Locale parameter or use the default locale
Collator is an abstract class that must be subclassed. The JDK supplies one subclass, RuleBasedCollator
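
As a small sketch of these two methods, the following obtains collators from the factory method and calls equals and compare. Locale.ENGLISH is an arbitrary choice made so that the result does not depend on the default locale, and the class name CollatorBasics and helper english are made up. The strength setting used here is the one described under Strength in this section.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorBasics {
    // Hypothetical helper: an English collator at the given strength
    static Collator english(int strength) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(strength);
        return c;
    }

    public static void main(String[] args) {
        Collator tertiary = english(Collator.TERTIARY);
        Collator primary = english(Collator.PRIMARY);

        System.out.println(tertiary.equals("abc", "abc"));       // true
        System.out.println(tertiary.equals("abc", "ABC"));       // false: case differs
        System.out.println(primary.equals("abc", "ABC"));        // true: case is ignored
        System.out.println(tertiary.compare("abc", "abd") < 0);  // true: c sorts before d
    }
}
```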



 Collator normalisation


  
      Collator can apply the same decompositions as Normalizer
      (canonical or compatibility) before it compares strings


      Normalisation can be set by
      setDecomposition(int decompositionMode)


      Values are
       CANONICAL_DECOMPOSITION   (Normalization Form D)
       FULL_DECOMPOSITION   (Normalization Form KD)
       NO_DECOMPOSITION   (default)
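
A minimal sketch of the effect of the decomposition mode follows. It assumes an English collator at its default strength; the class name CollatorDecomposition and the helper equalUnder are invented for this example.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorDecomposition {
    // Compare two strings for collator equality under the given decomposition mode
    static boolean equalUnder(int decomposition, String a, String b) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setDecomposition(decomposition);
        return c.equals(a, b);
    }

    public static void main(String[] args) {
        // Precomposed A-ring versus A followed by a combining ring:
        // canonical decomposition makes the two sequences identical
        System.out.println(equalUnder(Collator.CANONICAL_DECOMPOSITION, "\u00C5", "A\u030A"));

        // The fi ligature versus the two letters: only the compatibility
        // (full) decomposition expands the ligature
        System.out.println(equalUnder(Collator.FULL_DECOMPOSITION, "\uFB01", "fi"));
        System.out.println(equalUnder(Collator.CANONICAL_DECOMPOSITION, "\uFB01", "fi"));
    }
}
```

The first two lines are expected to print true, since in each case both strings decompose to the same sequence before comparison. The third depends on how the collator's rules treat the undecomposed ligature.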
        


 Strength 

  
      In addition, it applies strength rules

Strength	Description	Example
PRIMARY	The base letters are different	A versus B
SECONDARY	The base letters are the same, but the accents are different	A versus Á
TERTIARY	The letters are the same but differ by case	A versus a
IDENTICAL	The letters are identical	A versus A
  
  
 



 String comparison 



  
      In ASCII, one string is less than another if, at the first position
      where a left-to-right scan finds a difference, the byte in the first
      string is less than the byte in the second


      In ASCII, "Abc" is less than "abc" because 'A' < 'a'


      Other orders are possible, even in English

	    In ASCII ordering,
	    'A' < 'B' < ... < 'a' < 'b'

	    A dictionary ordering might be
	    'a' < 'A' < 'b' < 'B' < 'c' ...

	    In telephone books, "Saint" < "St" < "San"
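
The difference between code-value ordering and a dictionary-style ordering can be sketched by putting String.compareTo next to a Collator. The class name OrderingDemo and its helpers are made up, and Locale.ENGLISH is an assumption chosen to keep the result independent of the default locale.

```java
import java.text.Collator;
import java.util.Locale;

class OrderingDemo {
    // Ordering by UTF-16 code unit value, which agrees with ASCII ordering for ASCII text.
    // Returns -1, 0 or 1.
    static int codeUnitOrder(String a, String b) {
        return Integer.signum(a.compareTo(b));
    }

    // Ordering by the English collation rules. Returns -1, 0 or 1.
    static int dictionaryOrder(String a, String b) {
        Collator c = Collator.getInstance(Locale.ENGLISH);
        return Integer.signum(c.compare(a, b));
    }

    public static void main(String[] args) {
        String[][] pairs = { { "Abc", "abc" }, { "Zebra", "apple" } };
        for (String[] p : pairs) {
            System.out.println(p[0] + " vs " + p[1]
                + "  code units: " + codeUnitOrder(p[0], p[1])
                + "  collator: " + dictionaryOrder(p[0], p[1]));
        }
    }
}
```

By code unit value "Zebra" comes before "apple", because every upper-case ASCII letter is less than every lower-case one, whereas the collator puts "apple" first because it compares the base letters before looking at case.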
	
      
  




 String comparison in German 



  
      German has the special character LATIN SMALL LETTER SHARP S U+00DF
      which is "SS" in uppercase
  
  
      The lower case character is treated as "ss" in sorting
  
  
      Other characters are also treated as pairs of characters for comparison
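
The upper-casing of sharp s can be checked directly, and the collator's treatment of it can be printed for inspection. This is only a sketch: the class name SharpS and its helpers are invented, and the outcome of the two collator comparisons depends on the German rules shipped with the Java runtime, so no result is claimed for them here.

```java
import java.text.Collator;
import java.util.Locale;

class SharpS {
    // Upper-case using German casing rules
    static String upper(String s) {
        return s.toUpperCase(Locale.GERMAN);
    }

    // Compare with a German collator at the given strength
    static int compareGerman(String a, String b, int strength) {
        Collator c = Collator.getInstance(Locale.GERMAN);
        c.setStrength(strength);
        return c.compare(a, b);
    }

    public static void main(String[] args) {
        String street = "Stra\u00DFe";

        // One character becomes two, so the string grows when upper-cased
        System.out.println(upper(street));                                       // STRASSE
        System.out.println(street.length() + " -> " + upper(street).length());  // 6 -> 7

        // How the sharp s compares with "ss" at two strengths
        System.out.println(compareGerman(street, "Strasse", Collator.PRIMARY));
        System.out.println(compareGerman(street, "Strasse", Collator.TERTIARY));
    }
}
```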
  




 String comparison in French 



  
      Strings are first compared as strings with no accents on any character
  
  
      Strings that compare as equal are then compared for accents
      from right to left
  




 Ordering in Chinese 



  
      Ordering may be in Pinyin - with or without accents
  
  
      Ordering may be on radicals first, then on stroke count for non-radical component
  
  
      Ordering may be on coded character set values, like the default ASCII ordering
  





 Collator comparison 



  
      The Collator class has a factory method that takes a Locale
      Collator.getInstance(Locale)


      The resultant Collator orders strings according to the locale rules,
      the collator strength and the decomposition mode


      if (collate.compare(str1, str2) > 0) ...
  


import java.text.Collator;
import java.util.Locale;

public class Compare1 {
    static Collator collate = Collator.getInstance();

    public static void main(String[] args) {
	collate.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

	System.out.println("Locale is " + Locale.getDefault().toString());

	System.out.println("Default strength is TERTIARY");
	compare("aaa", "bbb");
	compare("Abc", "abc");
	compare("Abc", "bbc");
	compare("\u00c0bc", "abc");

	collate.setStrength(Collator.SECONDARY);
	System.out.println("\nStrength is SECONDARY");
	compare("Abc", "abc");
	compare("\u00c0bc", "Abc");
	compare("\u00c0bc", "\u00c1bc");
	compare("Abc", "\u00c1bc");

	collate.setStrength(Collator.PRIMARY);
	System.out.println("\nStrength is PRIMARY");
	compare("Abc", "abc");
	compare("\u00c0bc", "abc");
    }
    
    static void compare(String s1, String s2) {
	int comp = collate.compare(s1, s2);
	if (comp == 0) {
	    print("equals", s1, s2);
	} else if (comp < 0) {
	    print("is before", s1, s2);
	} else {
	    print("is after", s1, s2);
	}
    }
	
    static void print(String state, String s1, String s2) {
        // Quote both strings (the closing quote after s2 was missing)
        System.out.println("\"" + s1 + "\" " + state + " \"" + s2 + "\"");
    }
}




Making your own rules



      You can make your own rules for RuleBasedCollator

import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class Compare2 {
    // Custom ordering in which 'c' sorts before 'b'
    static String rule = "< a < c < b";
    static RuleBasedCollator collate;

    public static void main(String[] args) throws java.text.ParseException {
	collate = new RuleBasedCollator(rule);

	System.out.println("Locale is " + Locale.getDefault().toString());

	System.out.println("Default strength is TERTIARY");
	compare("aaa", "bbb");
	compare("bbb", "ccc");
    }
    
    static void compare(String s1, String s2) {
	int comp = collate.compare(s1, s2);
	if (comp == 0) {
	    print("equals", s1, s2);
	} else if (comp < 0) {
	    print("is before", s1, s2);
	} else {
	    print("is after", s1, s2);
	}
    }
	
    static void print(String state, String s1, String s2) {
        // Quote both strings (the closing quote after s2 was missing)
        System.out.println("\"" + s1 + "\" " + state + " \"" + s2 + "\"");
    }
}



 Sorting 



  
      Sorting is based on string comparison: if one string is "less than" another,
      then order it earlier
  

      For example, the class Collections has a static method
      sort(List list, Comparator c) 
      and Collator implements Comparator
  
  
      String sorting algorithms are not changed by i18n - they just rely on comparing
      string values
  
  
      Most sorting algorithms measure complexity by the number of
      comparisons. This remains a valid measure for Unicode strings,
      where each comparison can be expensive
  


import java.text.Collator;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Sort {
    static Collator collate = Collator.getInstance();

    public static void main(String[] args) {
        collate.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        // A typed list avoids the unchecked warnings of a raw Vector
        List<String> list = new ArrayList<String>();
        list.add("abc");
        list.add("aaa");
        list.add("aab");

        // Collator implements Comparator, so it can drive the sort directly
        Collections.sort(list, collate);

        for (String s : list) {
            System.out.println(s);
        }
    }
}
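
To show that only the comparison changes and not the sorting algorithm, the following sketch sorts the same list twice with Collections.sort: once in natural String order and once with a Collator. The class name SortCompare, the helper names and the sample words are invented, and Locale.ENGLISH is an assumption.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortCompare {
    // Sort a copy by natural String order (UTF-16 code unit values)
    static List<String> naturalSort(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sort a copy with the collation rules of the given locale
    static List<String> collatorSort(List<String> words, Locale locale) {
        Collator c = Collator.getInstance(locale);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, c);
        return copy;
    }

    public static void main(String[] args) {
        // U+00E9 is e with an acute accent
        List<String> words = Arrays.asList("zebra", "\u00E9clair", "apple");
        System.out.println(naturalSort(words));                   // apple, zebra, eclair (accented)
        System.out.println(collatorSort(words, Locale.ENGLISH));  // apple, eclair (accented), zebra
    }
}
```

In natural order the accented word lands after "zebra" because U+00E9 is numerically greater than 'z'. The collator treats it as an 'e' with an accent and places it between the other two words.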




  Text boundaries 



  
      In English, words are separated by whitespace or punctuation characters
  
  
      Sometimes it can be hard to decide what a punctuation character is:
      e.g. "e.g." is an abbreviation and is a single word; 1.23 is a
      number that should be treated as a single word in English
      (but in French it would be two numbers with punctuation) 
  
  
      Chinese and some other languages do not separate words with whitespace:
      words of one or more characters just run into each other
  
  
      Possible boundaries are character, word, line and sentence (but not paragraph)
  




 BreakIterator class 



  
      The BreakIterator class can be used to segment text
  
  
      There are factory methods to get iterators:
      
      BreakIterator.getCharacterInstance(Locale l);
      BreakIterator.getWordInstance(Locale l);
      BreakIterator.getLineInstance(Locale l);
      BreakIterator.getSentenceInstance(Locale l);
      
  
  
      Each break iterator needs to be told the text it is to use
      iterator.setText(String s)
  
  
      This doesn't act like a standard Java iterator:
      it has methods
      
      int iterator.first();
      int iterator.next();
      
      which return indexes into the current text string. next() returns
      the constant BreakIterator.DONE when there are no more boundaries
  


import java.text.BreakIterator;

public class WordBreak {

    public static void main(String[] args) {
        String str = "A string, with!???! and others";
        System.out.println(str);
        BreakIterator iterator = BreakIterator.getWordInstance();
        iterator.setText(str);
        // first() is the boundary at the start of the text; each next()
        // is the following boundary, or DONE when none remain
        int start = iterator.first();
        int end = iterator.next();
        // Test for DONE before calling substring(): DONE is -1,
        // which is not a valid index
        while (end != BreakIterator.DONE) {
            System.out.print("Boundary at " + end);
            String word = str.substring(start, end);
            System.out.println(", word is \"" + word + "\"");
            start = end;
            end = iterator.next();
        }
	
    }
    
}
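
The same first()/next() loop works for every kind of boundary. The sketch below wraps it in a helper that returns the pieces of text between boundaries, then applies it to sentence, character and word iterators. The class name Segments, the helper split and the sample strings are made up, and Locale.ENGLISH is an assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class Segments {
    // Collect the substrings between successive boundaries of the iterator
    static List<String> split(BreakIterator it, String text) {
        List<String> parts = new ArrayList<String>();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sentence boundaries: two sentences, the first keeping its trailing space
        System.out.println(split(BreakIterator.getSentenceInstance(Locale.ENGLISH),
                                 "Hello world. How are you?"));

        // Character boundaries: A + combining ring counts as one user-visible character
        System.out.println(split(BreakIterator.getCharacterInstance(Locale.ENGLISH),
                                 "A\u030Ab").size());   // 2

        // Word boundaries around an abbreviation and a decimal number
        System.out.println(split(BreakIterator.getWordInstance(Locale.ENGLISH),
                                 "e.g. 1.23"));
    }
}
```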





Word break in Chinese  



  
      Chinese words/phrases are made up of one or more characters
  
  
      The characters are run together with no spaces between them
  
  
    A possible sentence might be
      
      
      
      consisting of a one-character word and a two-character word.
  
  
      The break iterator classes supplied just give this as one word
  
  
      Finding word breaks in Chinese is still a research problem
  
  e.g. see 
      http://bar.austlii.edu.au/au/other/CompLRes/2003/4.html




 Regular expressions  



  
      Regular expressions are a powerful way of doing pattern matching on strings
  
  
      Originally from the Unix editors, they are used by sed, awk, perl, ...
  
  
      You have a pattern such as "a*b" which means "zero or more 'a's followed by a 'b'" 
  
  
      Then strings such as "b", "ab", "aab" etc all match
  
  
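      Java has regexp support in java.util.regex, but it has no notion of
      collator strength. Case can be ignored with the CASE_INSENSITIVE and
      UNICODE_CASE flags, and canonical equivalence is only taken into
      account when a pattern is compiled with the CANON_EQ flag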
  



import java.util.regex.*;

public class Regex {

    public static void main(String[] args) {
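        // The pattern has an upper-case 'A' but the input is lower case:
        // a plain pattern does not ignore case the way a weak collator strength does
        Pattern p = Pattern.compile("A*b");
        Matcher m = p.matcher("aaaaab");
        boolean b = m.matches();
        if (b) {
            System.out.println("Matched");
        } else {
            System.out.println("Didn't match");
        }

        // U+00C0 is A with a grave accent: it is not matched as an accented 'A' either
        m = p.matcher("\u00c0b");
        b = m.matches();
        if (b) {
            System.out.println("Matched");
        } else {
            System.out.println("Didn't match");
        }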

    }
    
}
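
The standard Pattern flags give some of what a Collator offers. The sketch below tries CASE_INSENSITIVE, UNICODE_CASE and CANON_EQ; the class name RegexFlags and the helper matches are invented for this example. None of these flags makes a pattern ignore accents, so there is no direct counterpart to PRIMARY strength.

```java
import java.util.regex.Pattern;

class RegexFlags {
    // Compile with the given flags and test the whole input against the pattern
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // Case: similar in spirit to ignoring tertiary differences
        System.out.println(matches("A*b", 0, "aaaaab"));                         // false
        System.out.println(matches("A*b", Pattern.CASE_INSENSITIVE, "aaaaab"));  // true

        // Non-ASCII case folding also needs UNICODE_CASE
        int unicodeCase = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches("\u00C0b", Pattern.CASE_INSENSITIVE, "\u00E0b")); // false
        System.out.println(matches("\u00C0b", unicodeCase, "\u00E0b"));              // true

        // Canonical equivalence: a + combining ring against precomposed a-ring
        System.out.println(matches("a\u030A", 0, "\u00E5"));                 // false
        System.out.println(matches("a\u030A", Pattern.CANON_EQ, "\u00E5"));  // true
    }
}
```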






Jan Newmarch <jan@newmarch.name>

Last modified: Mon Aug 28 11:25:09 EST 2006


Copyright © Jan Newmarch, Monash University, 2007



This work is licensed under a
Creative Commons License


The moral right of Jan Newmarch to be identified as the author of this page has been asserted.
