Decomposition | Not followed by canonical composition | Followed by canonical composition |
---|---|---|
Canonical decomposition | NFD | NFC |
Compatibility decomposition | NFKD | NFKC |
sun.text.Normalizer
- this is an internal, unsupported class, so code using it may break in later JDK releases.
It is used internally by the Collator class
package sun.text;
public class Normalizer {
    public static Mode COMPOSE;
    public static Mode COMPOSE_COMPAT;
    public static Mode DECOMP;
    public static Mode DECOMP_COMPAT;
    public static String normalize(String str, Mode mode, int options);
    public static String compose(String source, boolean compat, int options);
    public static String decompose(String source, boolean compat, int options);
}
import sun.text.Normalizer;
public class Normal {
    public static void main(String[] args) {
        // Three spellings of A-ring, then the fi ligature and plain f + i
        String str = "\u00C5 \u212B \u0041\u030A \uFB01 \u0066\u0069";
        System.out.println("Chars are\n" +
            " U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE\n" +
            " U+212B ANGSTROM SIGN\n" +
            " U+0041 LATIN CAPITAL LETTER A\n" +
            " U+030A COMBINING RING ABOVE\n" +
            " U+FB01 LATIN SMALL LIGATURE fi\n" +
            " U+0066 LATIN SMALL LETTER F\n" +
            " U+0069 LATIN SMALL LETTER I");
        printUnicode("Original\t", str);
        printUnicode("Decomposed\t", Normalizer.normalize(str, Normalizer.DECOMP, 0));
        printUnicode("Composed\t", Normalizer.normalize(str, Normalizer.COMPOSE, 0));
        printUnicode("Decomposed compat", Normalizer.normalize(str, Normalizer.DECOMP_COMPAT, 0));
        printUnicode("Composed compat\t", Normalizer.normalize(str, Normalizer.COMPOSE_COMPAT, 0));
    }
    // Print each char as a four-digit hex escape, leaving spaces as they are
    private static void printUnicode(String label, String s) {
        System.out.print(label + '\t');
        for (int n = 0; n < s.length(); n++) {
            int ch = s.charAt(n);
            if (ch == ' ') {
                System.out.print(' ');
            } else {
                System.out.print(String.format("\\u%04x", ch));
            }
        }
        System.out.println();
    }
}
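Since Java 6 the JDK also has a public class, java.text.Normalizer, which offers the same four normalization forms without depending on sun.text. The following is a minimal sketch of the same experiment using that class, assuming Java 6 or later; the class name NormalForms and the helper escape are made up for illustration.

```java
import java.text.Normalizer;

class NormalForms {
    // Render each char as a four-digit hex escape, keeping spaces readable
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int n = 0; n < s.length(); n++) {
            char ch = s.charAt(n);
            if (ch == ' ') {
                sb.append(' ');
            } else {
                sb.append(String.format("\\u%04X", (int) ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same test string as above: three spellings of A-ring, the fi ligature, f + i
        String str = "\u00C5 \u212B \u0041\u030A \uFB01 \u0066\u0069";
        System.out.println("Original\t" + escape(str));
        // Normalizer.Form has exactly four constants: NFD, NFC, NFKD, NFKC
        for (Normalizer.Form form : Normalizer.Form.values()) {
            System.out.println(form + "\t\t" + escape(Normalizer.normalize(str, form)));
        }
    }
}
```

Under the canonical forms the ligature U+FB01 is left alone, while the compatibility forms replace it by the two letters f and i.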
The Collator
class is the main public class for comparing
strings. It has an equality test
boolean equals(String source,
String target)
and a string comparison method
int compare(String source, String target)
- Collator can take a Locale parameter in its factory method getInstance, or use the default locale
- Collator is an abstract class that must be subclassed. The JDK supplies one subclass, RuleBasedCollator
Collator
can apply the decomposition rules of Normalizer
(canonical or compatibility) before comparing strings.
The mode is set with
setDecomposition(int decompositionMode)
and can be one of
- CANONICAL_DECOMPOSITION (Normalization Form D)
- FULL_DECOMPOSITION (Normalization Form KD)
- NO_DECOMPOSITION (default)
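To see what the decomposition mode does, one can compare a precomposed string with the same text spelled using combining marks. This is a small sketch, not a definitive test; the class name DecompositionDemo and the helper compareWith are invented for the example, and Locale.ENGLISH is an arbitrary choice.

```java
import java.text.Collator;
import java.util.Locale;

class DecompositionDemo {
    // Compare a precomposed string with its decomposed spelling under the given mode
    static int compareWith(int decompositionMode) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setDecomposition(decompositionMode);
        // A-ring and o-diaeresis as single code points
        String precomposed = "\u00C5ngstr\u00F6m";
        // The same letters as base letter followed by a combining mark
        String decomposed = "A\u030Angstro\u0308m";
        return collator.compare(precomposed, decomposed);
    }

    public static void main(String[] args) {
        System.out.println("NO_DECOMPOSITION:        "
            + compareWith(Collator.NO_DECOMPOSITION));
        System.out.println("CANONICAL_DECOMPOSITION: "
            + compareWith(Collator.CANONICAL_DECOMPOSITION));
        System.out.println("FULL_DECOMPOSITION:      "
            + compareWith(Collator.FULL_DECOMPOSITION));
    }
}
```

With canonical or full decomposition both strings reduce to the same character sequence, so compare returns 0. With NO_DECOMPOSITION the result depends on the collation rules of the locale, so it is best not to rely on it for text that may contain combining marks.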
Strength | Description | Example |
---|---|---|
PRIMARY | The base letters are different | A versus B |
SECONDARY | The base letters are the same, but the accents are different | A versus Á |
TERTIARY | The letters are the same but differ by case | A versus a |
IDENTICAL | The letters are identical | A versus A |
The Collator
class has a factory method that takes a Locale
Collator.getInstance(Locale)
Collator
orders strings according to the locale rules,
the collator strength and the decomposition mode
if (collate.compare(str1, str2) > 0) ...
import java.text.Collator;
import java.util.Locale;
public class Compare1 {
    static Collator collate = Collator.getInstance();
    public static void main(String[] args) {
        collate.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        System.out.println("Locale is " + Locale.getDefault().toString());
        System.out.println("Default strength is TERTIARY");
        compare("aaa", "bbb");
        compare("Abc", "abc");
        compare("Abc", "bbc");
        compare("\u00c0bc", "abc");
        // SECONDARY ignores case but still sees accents
        collate.setStrength(Collator.SECONDARY);
        System.out.println("\nStrength is SECONDARY");
        compare("Abc", "abc");
        compare("\u00c0bc", "Abc");
        compare("\u00c0bc", "\u00c1bc");
        compare("Abc", "\u00c1bc");
        // PRIMARY ignores both case and accents
        collate.setStrength(Collator.PRIMARY);
        System.out.println("\nStrength is PRIMARY");
        compare("Abc", "abc");
        compare("\u00c0bc", "abc");
    }
    static void compare(String s1, String s2) {
        int comp = collate.compare(s1, s2);
        if (comp == 0) {
            print("equals", s1, s2);
        } else if (comp < 0) {
            print("is before", s1, s2);
        } else {
            print("is after", s1, s2);
        }
    }
    static void print(String state, String s1, String s2) {
        // Quote both strings (the closing quote after s2 was missing)
        System.out.println("\"" + s1 + "\" " + state + " \"" + s2 + "\"");
    }
}
You can make your own rules for RuleBasedCollator
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;
public class Compare2 {
    // Custom rule: a comes first, and c sorts before b
    static String rule = "< a < c < b";
    static RuleBasedCollator collate;
    public static void main(String[] args) throws java.text.ParseException {
        collate = new RuleBasedCollator(rule);
        System.out.println("Locale is " + Locale.getDefault().toString());
        System.out.println("Default strength is TERTIARY");
        compare("aaa", "bbb");
        // Under the rule above "bbb" is after "ccc"
        compare("bbb", "ccc");
    }
    static void compare(String s1, String s2) {
        int comp = collate.compare(s1, s2);
        if (comp == 0) {
            print("equals", s1, s2);
        } else if (comp < 0) {
            print("is before", s1, s2);
        } else {
            print("is after", s1, s2);
        }
    }
    static void print(String state, String s1, String s2) {
        // Quote both strings (the closing quote after s2 was missing)
        System.out.println("\"" + s1 + "\" " + state + " \"" + s2 + "\"");
    }
}
Collections
has a static method
sort(List<T> list, Comparator<? super T> c)
and Collator
implements Comparator, so a Collator can be passed to sort as the ordering
import java.text.Collator;
import java.util.Locale;
import java.util.Vector;
import java.util.Collections;
public class Sort {
    static Collator collate = Collator.getInstance();
    public static void main(String[] args) {
        collate.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        // Typed Vector avoids raw-type warnings
        Vector<String> list = new Vector<String>();
        list.add("abc");
        list.add("aaa");
        list.add("aab");
        // The Collator is used as the Comparator
        Collections.sort(list, collate);
        for (int n = 0; n < list.size(); n++) {
            System.out.println(list.elementAt(n));
        }
    }
}
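The three strings above sort the same way with or without a Collator. The difference shows up with mixed case and accented letters, where plain String ordering goes by character code. A small sketch of the contrast, assuming Locale.ENGLISH; the class name SortCompare and its two helper methods are made up for illustration.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class SortCompare {
    // Sort a copy by String.compareTo, i.e. by character code
    static List<String> naturalOrder(List<String> words) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy);
        return copy;
    }

    // Sort a copy using the collation rules of the given locale
    static List<String> collatedOrder(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Zebra", "apple", "\u00e9clair", "banana");
        // By character code, uppercase Z precedes every lowercase letter
        // and the accented e follows all unaccented letters
        System.out.println("Natural:  " + naturalOrder(words));
        // The collator orders by base letter first: a, b, e, z
        System.out.println("Collated: " + collatedOrder(words, Locale.ENGLISH));
    }
}
```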
The BreakIterator
class can be used to segment text. It has a factory method for each kind of boundary
BreakIterator.getCharacterInstance(Locale l);
BreakIterator.getWordInstance(Locale l);
BreakIterator.getLineInstance(Locale l);
BreakIterator.getSentenceInstance(Locale l);
The text to segment is set with
iterator.setText(String s)
and the boundaries are stepped through with
int iterator.first();
int iterator.next();
which return indexes into the current text string, and BreakIterator.DONE when there
are no more
import java.text.BreakIterator;
public class WordBreak {
    public static void main(String[] args) {
        String str = "A string, with!???! and others";
        System.out.println(str);
        BreakIterator iterator = BreakIterator.getWordInstance();
        iterator.setText(str);
        int start = iterator.first();
        int end = iterator.next();
        // Test for DONE before using end as an index:
        // DONE is -1, so substring(start, DONE) would throw
        while (end != BreakIterator.DONE) {
            System.out.print("Boundary at " + end);
            String word = str.substring(start, end);
            System.out.println(", word is \"" + word + "\"");
            start = end;
            end = iterator.next();
        }
    }
}
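The character instance is walked in the same way, and it ties in with normalization: a base letter followed by a combining mark is reported as one character even though it occupies two chars in the String. A minimal sketch; the class name CharBreak and the helper characters are invented for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class CharBreak {
    // Split text into user-perceived characters using the character instance
    static List<String> characters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        List<String> result = new ArrayList<String>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // A followed by COMBINING RING ABOVE, then b: three chars in the String
        String str = "A\u030Ab";
        System.out.println("String length:   " + str.length());
        System.out.println("Character count: " + characters(str).size());
    }
}
```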
import java.util.regex.*;
public class Regex {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("A*b");
        // Matching is case-sensitive by default, so "A*" does not match the lowercase a's
        Matcher m = p.matcher("aaaaab");
        boolean b = m.matches();
        if (b) {
            System.out.println("Matched");
        } else {
            System.out.println("Didn't match");
        }
        // U+00C0 (A with grave) is a different character from A, so this does not match either
        m = p.matcher("\u00c0b");
        b = m.matches();
        if (b) {
            System.out.println("Matched");
        } else {
            System.out.println("Didn't match");
        }
    }
}
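Regular expressions do not use a Collator, so neither case nor canonical equivalence is taken into account unless it is asked for with Pattern flags. The sketch below tries the standard flags CASE_INSENSITIVE, UNICODE_CASE and CANON_EQ; the class name RegexFlags and the helper matches are made up for illustration.

```java
import java.util.regex.Pattern;

class RegexFlags {
    // Compile regex with the given flags and test it against the whole input
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // No flags: case-sensitive
        System.out.println(matches("A*b", 0, "aaaaab"));
        // CASE_INSENSITIVE alone only folds US-ASCII letters
        System.out.println(matches("A*b", Pattern.CASE_INSENSITIVE, "aaaaab"));
        System.out.println(matches("\u00C0b", Pattern.CASE_INSENSITIVE, "\u00E0b"));
        // Adding UNICODE_CASE folds non-ASCII letters as well
        System.out.println(matches("\u00C0b",
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00E0b"));
        // CANON_EQ treats canonically equivalent spellings as the same
        System.out.println(matches("a\u030A", Pattern.CANON_EQ, "\u00E5"));
    }
}
```

None of these flags makes the pattern A match an accented A: there is no regex equivalent of PRIMARY strength, so accent-insensitive matching needs the text to be decomposed and stripped of combining marks first.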