Locale
Intro
-
Applications need to be written in ways that do not require hard-coded
strings and images
-
At runtime an application should be able to find strings, images,
fonts, sounds etc that allow the application to display and collect information
-
A way is needed to express
-
Country of location
-
Language of choice
-
Culture
-
Locales offer a simple way of doing this - not perfect, but better than nothing
-
The idea is that by specifying a locale, you specify everything:
date formats, addresses, fonts, images, cultural terms, ...
Unrealistic?
Countries
-
Countries are represented by 2-letter codes such as "AU" or 3-letter codes
such as "AUS"
-
They are standardised by
ISO 3166
-
All the codes are in ASCII, and are sometimes derived from the English name
("CN" for China), sometimes from the local name ("ES" for Spain, España)
-
ISO changes the standard on request
-
1998: Replace the old official name "Independent State of Western Samoa"
with the new official name "Independent State of Samoa".
Codes and abbreviations remain the same
-
2002: Replace the name "East Timor" with "Timor-Leste" and change its
abbreviations from "TP" and "TMP" to "TL" and "TLS", but keep its code
value the same
-
1991: Yugoslavia collapses and new countries Croatia, etc, are created
-
2003: The remaining part of Yugoslavia is renamed "Serbia and Montenegro"
and has its 2-letter abbreviation changed from "YU" to "CS"
-
Politics means that ISO 3166 codes will keep changing - not a good idea for
a supposedly fixed computing database - there is no "version control"
mechanism
Java ISO country listing
The following program lists the 2-letter country codes and their
names. It cannot handle 3-letter codes. It is not complete - it does not include
all the countries in ISO 3166.
This information is hard-coded as an array of strings into the
Locale.java
source code, so the class definition is unstable
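A minimal sketch of such a listing program, assuming a Java 7 or later runtime. The class name IsoCountries and the helper describe are illustrative names, not taken from the original program; Locale.getISOCountries() and getDisplayCountry() are standard methods of java.util.Locale.

```java
import java.util.Locale;

class IsoCountries {
    // Returns "CODE name" for one 2-letter ISO 3166 code, with the name in English.
    // (describe is a hypothetical helper introduced for this sketch.)
    static String describe(String code) {
        Locale locale = new Locale.Builder().setRegion(code).build();
        return code + " " + locale.getDisplayCountry(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // The list of codes comes from whatever the running JDK ships with.
        for (String code : Locale.getISOCountries()) {
            System.out.println(describe(code));
        }
    }
}
```

The list printed depends on the JDK version, which is one way to observe the instability described above.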
Language codes
-
Canadians speak French and English, so one language per country is not enough
-
Language codes are standardised by
ISO 639
as 2-letter (e.g. "en" for English) or 3-letter (e.g. "eng") codes
-
Most changes are additions such as Cornish and Manx in 1998
-
Some changes are to codes: Indonesian changed from "in" to "id" in 1989
-
Some languages disappear: Serbo-Croatian was replaced by Bosnian, Croatian, Serbian
etc following the breakup of Yugoslavia - dialects became separate languages
-
Many languages (such as Achinese, Acoli, Adangme, ...) do not have a 2-letter code, only a
3-letter code
-
Extensions to ISO 639 add in features such as fonts
Java ISO language listing
The following program lists the 2-letter language codes and their
names. It cannot handle 3-letter codes. It is not complete.
This information is hard-coded as an array of strings into the
Locale.java
source code, so the class definition is unstable
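A minimal sketch of such a listing program, assuming a Java 7 or later runtime. The class name IsoLanguages and the helper describe are illustrative names; Locale.getISOLanguages() and getDisplayLanguage() are standard methods of java.util.Locale.

```java
import java.util.Locale;

class IsoLanguages {
    // Returns "code name" for one 2-letter ISO 639 code, with the name in English.
    // (describe is a hypothetical helper introduced for this sketch.)
    static String describe(String code) {
        Locale locale = new Locale.Builder().setLanguage(code).build();
        return code + " " + locale.getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // The list of codes comes from whatever the running JDK ships with.
        for (String code : Locale.getISOLanguages()) {
            System.out.println(describe(code));
        }
    }
}
```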
Variants
-
Non-standard extra information about a locale
-
Example use: specify euros instead of traditional currency in European countries
-
Example: specify dictionary versus phonebook collation
-
Example: specify a dialect within a country/language e.g. Glaswegian
Locale representation - ASCII
Locales can be written in textual form
-
Language only: "en", "fr"
-
Language and country: "en_GB", "fr_FR", "fr_CA", "fr_CH"
-
ISO 639 extensions which add extra information to languages use a "dot"
notation e.g. "zh_TW.big5" (Chinese Taiwan - Big5) or
"zh_TW.eucTW" (Chinese Taiwan - EUC)
-
Variants are added using an "@" notation e.g. "zh_TW.eucTW@radical" or
"zh_TW.eucTW@stroke"
Locale listing
A Java program can list information about all known locales.
Supported locales for Java 1.4.2 are described in
http://java.sun.com/j2se/1.4.2/docs/guide/intl/locale.doc.html
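A minimal sketch of a locale listing program. The class name LocaleListing and the helper describe are illustrative names; Locale.getAvailableLocales() and getDisplayName() are standard methods of java.util.Locale.

```java
import java.util.Locale;

class LocaleListing {
    // Returns the textual form of a locale followed by its English display name.
    // (describe is a hypothetical helper introduced for this sketch.)
    static String describe(Locale locale) {
        return locale + ": " + locale.getDisplayName(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // The set of installed locales varies between Java versions and vendors.
        for (Locale locale : Locale.getAvailableLocales()) {
            System.out.println(describe(locale));
        }
    }
}
```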
Locale display
-
Locales act as labels for different countries and languages
-
Locales do not enforce any particular way of handling or displaying
information
-
The Java
Locale
class does however have methods
to display locale information using locales themselves
to guide the display
-
A program, LocaleDisplay, can display information in both the default locale
and in particular locales
-
Partial output from this program (in my en_US locale) by running
java LocaleDisplay
is
Default locale is en_US
--------------
In default locale for en_US
Country: United States
Language: English
In locale en_US
Country: United States
Language: English
--------------
In default locale for fr_FR
Country: France
Language: French
In locale fr_FR
Country: France
Language: français
-
Many implementations of Java allow the default locale to be overridden
at the commandline (or in properties files) as in
java -Duser.language=fr -Duser.country=FR LocaleDisplay
-
Running the same program in the French locale results in
Default locale is fr_FR
--------------
In default locale for en_US
Country: Etats-Unis
Language: anglais
In locale en_US
Country: United States
Language: English
--------------
In default locale for fr_FR
Country: France
Language: français
In locale fr_FR
Country: France
Language: français
where the default locale print statements have resulted in different output
-
This information is pulled out of locale data held in Java class files, with source in
JAVA_SRC/sun/text/resources
such as
LocaleElements.java
, LocaleElements_fr.java
and LocaleElements_fr_CA.java
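A minimal sketch of what a display program along the lines of LocaleDisplay could look like. This is a reconstruction based on the output shown above, not the original source; the helper describe is a hypothetical name. It uses the standard getDisplayCountry(Locale) and getDisplayLanguage(Locale) methods, where the argument selects the language in which the names are written.

```java
import java.util.Locale;

class LocaleDisplay {
    // Country and language names of 'subject', written in the language of 'display'.
    static String describe(Locale subject, Locale display) {
        return "Country: " + subject.getDisplayCountry(display) + "\n"
             + "Language: " + subject.getDisplayLanguage(display);
    }

    public static void main(String[] args) {
        Locale[] subjects = { Locale.US, Locale.FRANCE };
        System.out.println("Default locale is " + Locale.getDefault());
        for (Locale subject : subjects) {
            System.out.println("--------------");
            // Names as seen by a user of the default locale.
            System.out.println("In default locale for " + subject);
            System.out.println(describe(subject, Locale.getDefault()));
            // Names as seen by a user of the locale itself.
            System.out.println("In locale " + subject);
            System.out.println(describe(subject, subject));
        }
    }
}
```

The exact spelling of some names (for example the French name of the United States) depends on the locale data shipped with the Java runtime.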
Locales in ANSI C
-
C is the most common systems programming language and is one of the most influential
languages for i18n standards
-
The header file
locale.h
defines structures, functions and constants
for i18n
-
The structure
lconv
holds all i18n related information for number and money
formatting
struct lconv {
    /* Describes formatting of monetary and other numeric values: */
    char* decimal_point;
    /* decimal point for non-monetary values */
    char* grouping;
    /* sizes of digit groups for non-monetary values */
    char* thousands_sep;
    /* separator for digit groups for non-monetary values (left of "decimal point") */
    char* currency_symbol;
    /* currency symbol */
    char* int_curr_symbol;
    /* international currency symbol */
    char* mon_decimal_point;
    /* decimal point for monetary values */
    char* mon_grouping;
    /* sizes of digit groups for monetary values */
    char* mon_thousands_sep;
    /* separator for digit groups for monetary values (left of "decimal point") */
    char* negative_sign;
    /* negative sign for monetary values */
    char* positive_sign;
    /* positive sign for monetary values */
    char frac_digits;
    /* number of digits to be displayed to right of "decimal point" for monetary values */
    char int_frac_digits;
    /* number of digits to be displayed to right of "decimal point" for international monetary values */
    char n_cs_precedes;
    /* whether currency symbol precedes (1) or follows (0) negative monetary values */
    char n_sep_by_space;
    /* whether currency symbol is (1) or is not (0) separated by space from negative monetary values */
    char n_sign_posn;
    /* format for negative monetary values:
       0  parentheses surround quantity and currency symbol
       1  sign precedes quantity and currency symbol
       2  sign follows quantity and currency symbol
       3  sign immediately precedes currency symbol
       4  sign immediately follows currency symbol */
    char p_cs_precedes;
    /* whether currency symbol precedes (1) or follows (0) non-negative monetary values */
    char p_sep_by_space;
    /* whether currency symbol is (1) or is not (0) separated by space from non-negative monetary values */
    char p_sign_posn;
    /* format for non-negative monetary values, with values as for n_sign_posn */
};
-
It defines the following constants, giving a fine granularity to locale control:
LC_ALL
category argument for all categories
LC_NUMERIC
category for numeric formatting information
LC_MONETARY
category for monetary formatting information
LC_COLLATE
category for information affecting collating functions
LC_CTYPE
category for information affecting character class test functions
LC_TIME
category for information affecting time conversion functions
-
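It defines two functions
struct lconv* localeconv(void);
/* returns pointer to formatting information for current locale */
char* setlocale(int category, const char* locale);
/* sets the locale for the given category, or queries it if locale is NULL;
   returns the locale name, or NULL on failure */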
C Locale
-
This is the minimal locale needed to read and compile C programs
-
No other locales are defined by ANSI C
GNU C
Unix locale
-
Defines a locale
POSIX
equivalent to the C locale
-
A locale name is typically of the form language[_territory][.codeset][@modifier],
where language
is an ISO 639 language code, territory is an ISO 3166 country code,
and codeset is a character set
or encoding identifier like ISO-8859-1 or UTF-8.
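The name format above can be taken apart mechanically. The following is a minimal sketch of such a parser, written in Java to match the other examples; the class PosixLocaleName and its field names are hypothetical and are not part of any standard library. It assumes the name is well formed and does no validation of the individual codes.

```java
class PosixLocaleName {
    // Parts of language[_territory][.codeset][@modifier]; missing parts are "".
    final String language;
    final String territory;
    final String codeset;
    final String modifier;

    PosixLocaleName(String name) {
        String rest = name;

        // Strip the optional "@modifier" suffix first.
        int at = rest.indexOf('@');
        modifier = at >= 0 ? rest.substring(at + 1) : "";
        if (at >= 0) rest = rest.substring(0, at);

        // Then the optional ".codeset" part.
        int dot = rest.indexOf('.');
        codeset = dot >= 0 ? rest.substring(dot + 1) : "";
        if (dot >= 0) rest = rest.substring(0, dot);

        // What remains is language, optionally followed by "_territory".
        int underscore = rest.indexOf('_');
        territory = underscore >= 0 ? rest.substring(underscore + 1) : "";
        language = underscore >= 0 ? rest.substring(0, underscore) : rest;
    }

    public static void main(String[] args) {
        PosixLocaleName n = new PosixLocaleName("zh_TW.eucTW@stroke");
        System.out.println(n.language + " / " + n.territory + " / "
                + n.codeset + " / " + n.modifier);
    }
}
```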
Java
-
The
Locale
class is used to define a locale
-
This class can be used by other Java classes such as
Collator
and DateFormat
-
A program can use multiple locales by creating multiple locale objects
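A minimal sketch of one program using several locale objects at once with Collator and DateFormat. The class name MultipleLocales and the helpers longDate and compare are illustrative names; the library calls are standard java.text and java.util methods.

```java
import java.text.Collator;
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;

class MultipleLocales {
    // Formats a date in the long style of the given locale.
    static String longDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }

    // Compares two strings using the collation rules of the given locale.
    static int compare(String a, String b, Locale locale) {
        return Collator.getInstance(locale).compare(a, b);
    }

    public static void main(String[] args) {
        Date now = new Date();
        Locale[] locales = { Locale.US, Locale.FRANCE, Locale.GERMANY };
        // Each locale object is independent; none of them changes the default.
        for (Locale locale : locales) {
            System.out.println(locale + ": " + longDate(now, locale));
        }
        // Raw code-point order puts "Zebra" before "apple"; US collation does not.
        System.out.println("String.compareTo: " + "apple".compareTo("Zebra"));
        System.out.println("Collator (en_US): " + compare("apple", "Zebra", Locale.US));
    }
}
```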
Locale names
-
ANSI C defines the locale "C" and Unix defines the locale "POSIX"
-
Country and language codes are defined differently by ISO, Microsoft and
Apple (Unix usually follows ISO)
-
Example: French is "fr" (ISO), "LANG_FRENCH" (Windows) and
"langFrench" (Apple)
-
A list is
Language Codes: ISO 639, Microsoft and Macintosh
Issues with locales
-
Neither country codes nor language codes are stable in time.
There is no guarantee that a locale used by a program now
will be the same in 20 years time - maintenance, evolution issues
-
A locale is meant to identify a "culture" - language/country is only
an approximation to this
-
The gypsies (Romany) form an identifiable culture, as do the Jews
and (to a lesser extent) religious groups such as Sunni Muslims.
They don't belong to any single country/language combination
-
Is Los Angeles "hip-hop" a culture? Locales don't go down to that level
of detail
-
Europe is now switching to the euro for currency. Should "fr_FR"
use French francs or euros?
-
If an American is in Australia, should they use their American locale
(zip codes vs postcodes) or Australian (dd-mm-yy vs mm-dd-yy)?
-
Different software (e.g. ANSI C, Unix, Java) has different scopes and
capabilities for locales. An ANSI C application will not be as portable
as one using Unix C, which in turn may not be as flexible as a Java one.
Should the user need to know the source language of an application?
-
There are about 160 language codes and 240 country codes. That makes nearly
40,000 combinations. While most make no sense, which ones do? Different
vendors implement different subsets
-
How do you perform matches in a partially complete environment? For example,
if the locale is "fr_CH" (French spoken in Switzerland)
and this doesn't exist, should it match
"fr_CA" (French spoken in Canada) or "it_CH" (Italian spoken in Switzerland)?
Different software matches in different ways, giving an inconsistent user experience
-
Use of 2-letter ASCII language codes only allows 26*26 combinations (676) -
too few for the 6000+ languages currently in existence
-
See Tex Texin,
What's wrong with locales?
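As an illustration of the matching question raised above, here is a minimal sketch of one possible strategy: drop the most specific part of the locale name until something available is found. This resembles the first stage of the lookup that Java's ResourceBundle performs, but it is only one choice among several; the class LocaleFallback and the helpers fallbackChain and match are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class LocaleFallback {
    // Candidate names from most to least specific, ending with "" for the root.
    static List<String> fallbackChain(String name) {
        List<String> chain = new ArrayList<>();
        String current = name;
        while (!current.isEmpty()) {
            chain.add(current);
            int cut = current.lastIndexOf('_');
            current = cut >= 0 ? current.substring(0, cut) : "";
        }
        chain.add("");
        return chain;
    }

    // First candidate that is actually available, or null if none is.
    static String match(String requested, Set<String> available) {
        for (String candidate : fallbackChain(requested)) {
            if (available.contains(candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> available = new HashSet<>(Arrays.asList("fr", "fr_CA", "it_CH"));
        System.out.println(fallbackChain("fr_CH"));
        System.out.println(match("fr_CH", available));
    }
}
```

Under this strategy "fr_CH" matches neither "fr_CA" nor "it_CH": it falls back to plain "fr" if that exists, and otherwise to nothing. Software that matches on country first would behave differently.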
Summary
-
Culture-specific information is meant to be labelled by a locale
-
Different programming languages have differing support for locales
-
Locales rely on ISO countries and languages which are not completely stable
-
Locales do not really map ideally onto cultures
-
If a person from one culture is in a different cultural environment,
locales do not say what elements should be used
Jan Newmarch (http://jan.newmarch.name)
jan@newmarch.name
Last modified: Fri Mar 11 22:03:36 EST 2005
Copyright ©Jan Newmarch