# Fonts

## Character

• i18n needs to be very careful about definitions of various terms
• A character is a unit of information that roughly corresponds to a grapheme (written symbol) of a natural language, such as a letter, numeral, or punctuation mark (Wikipedia)
• A character is "the smallest component of written language that has a semantic value" (Unicode)
• This includes letters such as 'a' and 'À' (or letters in any other language), digits such as '2', punctuation characters such as ',' and various symbols such as the English pound currency symbol '£'
• It is some sort of abstraction of any actual symbol: the character 'a' is to any written 'a' as a Platonic circle is to any actual circle
• The concept also includes control characters, which do not correspond to natural language symbols but to other bits of information used to process texts of the language
• A character does not have any particular appearance, although we use the appearance to help recognise the character.
• In mathematics, if you see the symbol π (pi) it is the character for the ratio of the circumference to the diameter of a circle.
• If you are reading Greek text, it is the sixteenth letter of the alphabet: π ρ ο σ is the Greek word for "with"

## Character repertoire/character set

• A set of distinct characters, such as the Latin alphabet
• No particular ordering is assumed. In English, although we say that 'a' is earlier in the alphabet than 'z', we wouldn't say that 'a' is less than 'z'. The "phone book" ordering which puts "McPhee" before "MacRea" shows that "alphabetic ordering" isn't critical to the characters
• A repertoire specifies the names of the characters and often a sample of how the characters might look, e.g. the letter 'a' might be shown in several different typefaces. But it doesn't force them to look like that - they are just samples
• The repertoire may make distinctions such as upper and lower case, so that 'a' and 'A' are different. But it may regard them as the same, just with different sample appearances. (Just like some programming languages treat upper and lower as different - e.g. Java - but some don't e.g. Basic.)
• On the other hand, a repertoire might contain different characters with the same sample appearance: the repertoire for a Greek mathematician would have two different characters with appearance π
• This is also called a noncoded character set

## Character code

• A character code is a mapping from characters to integers
• This is also called a coded character set or code set
• The value of each character in this mapping is often called a code point
• ASCII is a code set. The codepoint for 'a' is 97 and for 'A' is 65 (decimal)
```
Oct   Dec   Hex   Char           Oct   Dec   Hex   Char
------------------------------------------------------------
000   0     00    NUL '\0'       100   64    40    @
001   1     01    SOH            101   65    41    A
002   2     02    STX            102   66    42    B
003   3     03    ETX            103   67    43    C
004   4     04    EOT            104   68    44    D
005   5     05    ENQ            105   69    45    E
006   6     06    ACK            106   70    46    F
007   7     07    BEL '\a'       107   71    47    G
010   8     08    BS  '\b'       110   72    48    H
011   9     09    HT  '\t'       111   73    49    I
012   10    0A    LF  '\n'       112   74    4A    J
013   11    0B    VT  '\v'       113   75    4B    K
014   12    0C    FF  '\f'       114   76    4C    L
015   13    0D    CR  '\r'       115   77    4D    M
016   14    0E    SO             116   78    4E    N
017   15    0F    SI             117   79    4F    O
020   16    10    DLE            120   80    50    P
021   17    11    DC1            121   81    51    Q
022   18    12    DC2            122   82    52    R
023   19    13    DC3            123   83    53    S
024   20    14    DC4            124   84    54    T
025   21    15    NAK            125   85    55    U
026   22    16    SYN            126   86    56    V
027   23    17    ETB            127   87    57    W
030   24    18    CAN            130   88    58    X
031   25    19    EM             131   89    59    Y
032   26    1A    SUB            132   90    5A    Z
033   27    1B    ESC            133   91    5B    [
034   28    1C    FS             134   92    5C    \   '\\'
035   29    1D    GS             135   93    5D    ]
036   30    1E    RS             136   94    5E    ^
037   31    1F    US             137   95    5F    _
040   32    20    SPACE          140   96    60    `
041   33    21    !              141   97    61    a
042   34    22    "              142   98    62    b
043   35    23    #              143   99    63    c
044   36    24    $              144   100   64    d
045   37    25    %              145   101   65    e
046   38    26    &              146   102   66    f
047   39    27    '              147   103   67    g
050   40    28    (              150   104   68    h
051   41    29    )              151   105   69    i
052   42    2A    *              152   106   6A    j
053   43    2B    +              153   107   6B    k
054   44    2C    ,              154   108   6C    l
055   45    2D    -              155   109   6D    m
056   46    2E    .              156   110   6E    n
057   47    2F    /              157   111   6F    o
060   48    30    0              160   112   70    p
061   49    31    1              161   113   71    q
062   50    32    2              162   114   72    r
063   51    33    3              163   115   73    s
064   52    34    4              164   116   74    t
065   53    35    5              165   117   75    u
066   54    36    6              166   118   76    v
067   55    37    7              167   119   77    w
070   56    38    8              170   120   78    x
071   57    39    9              171   121   79    y
072   58    3A    :              172   122   7A    z
073   59    3B    ;              173   123   7B    {
074   60    3C    <              174   124   7C    |
075   61    3D    =              175   125   7D    }
076   62    3E    >              176   126   7E    ~
077   63    3F    ?              177   127   7F    DEL
```
• There are many, many, many code sets. EBCDIC is another for American English
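As a small illustration of a code set as a mapping from characters to integers, the sketch below prints the code points of a few characters in the same bases as the table. The class name `CodePoints` is made up for this example. It relies on the fact that Java `char` values are Unicode code units and that Unicode agrees with ASCII for the first 128 code points.

```java
class CodePoints {

    // Format a character's code point as octal, decimal and hex, like the ASCII table
    static String describe(char c) {
        int codePoint = c;  // widening a char gives its numeric code
        return String.format("%03o  %-3d  %02X  %c", codePoint, codePoint, codePoint, c);
    }

    public static void main(String[] args) {
        System.out.println("Oct  Dec  Hex  Char");
        for (char c : new char[] {'A', 'a', '2', '~'}) {
            System.out.println(describe(c));
        }
    }
}
```

The line printed for 'a' shows 141 (octal), 97 (decimal) and 61 (hex), matching the table entry.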

## Character encoding

• To communicate a character via computer you need to encode it in some way. To transmit a string, you need to encode all characters in the string
• There are many possible encodings for any code set
• 7-bit ASCII values can be encoded as themselves into 8-bit bytes (an octet). So ASCII 'A' (with codepoint 65) is encoded as the 8-bit octet 01000001
• A different encoding would be to use the top bit for parity checking e.g. with odd parity ASCII 'A' would be the octet 11000001
• Some protocols such as Sun's RPC use 32-bit word-length encoding. ASCII 'A' would be encoded as 00000000 00000000 00000000 01000001
• The encoding extends to strings of characters. A word-length even parity encoding of "ABC" might be
10000000 (parity bit in high byte) 01000011 (C) 01000010 (B) 01000001 (A in low byte)
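The parity encodings above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not any particular protocol: the class and method names are invented here, and the word layout (first character in the low byte, one parity bit at the top of the high byte) is simply the one used in the "ABC" example.

```java
class ParityEncoding {

    // Encode a 7-bit ASCII code point as an octet whose top bit makes
    // the total number of 1 bits odd
    static int oddParity(int codePoint) {
        int ones = Integer.bitCount(codePoint & 0x7F);
        return (ones % 2 == 0) ? (codePoint | 0x80) : (codePoint & 0x7F);
    }

    // Pack up to three ASCII characters into a 32-bit word, first character
    // in the low byte, with one even-parity bit at the top of the high byte
    static int evenParityWord(String s) {
        int word = 0;
        for (int i = 0; i < s.length() && i < 3; i++) {
            word |= (s.charAt(i) & 0x7F) << (8 * i);
        }
        if (Integer.bitCount(word) % 2 != 0) {
            word |= 0x80000000;  // set the parity bit to make the count even
        }
        return word;
    }

    // Render the low 'width' bits of a value as a binary string
    static String bits(int value, int width) {
        StringBuilder sb = new StringBuilder();
        for (int i = width - 1; i >= 0; i--) {
            sb.append((value >>> i) & 1);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(bits(oddParity('A'), 8));        // 11000001
        System.out.println(bits(evenParityWord("ABC"), 32)); // 10000000010000110100001001000001
    }
}
```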

## Glyphs

• Glyphs are the visual representation of characters
• Three glyphs for the character 'a' are 'a', 'a' or 'a'. Look at http://www.fonts101.com/ for many other glyphs for common characters
• There isn't always a one-to-one correspondence between glyphs and characters - glyphs are units of visual representation, characters are units of textual representation
• When the character 'f' is displayed with other characters, it is often combined with them to form a glyph called a ligature
• Examples of characters having multiple glyphs arise in "handwriting" scripts where letters come in various appearances so that they can "join up" with others. Arabic has many different glyphs for some characters
• Fonts (see later) are collections of glyphs

## ASCII

• ASCII is the most common coded character set in use today
• The repertoire is the set of upper and lower case English letters, the digits 0-9, various punctuation characters and some control characters
• The code points are defined by the familiar ASCII table
```
Char  Dec  Oct  Hex | Char  Dec  Oct  Hex | Char  Dec  Oct  Hex | Char Dec  Oct   Hex
-------------------------------------------------------------------------------------
(nul)   0 0000 0x00 | (sp)   32 0040 0x20 | @      64 0100 0x40 | `      96 0140 0x60
(soh)   1 0001 0x01 | !      33 0041 0x21 | A      65 0101 0x41 | a      97 0141 0x61
(stx)   2 0002 0x02 | "      34 0042 0x22 | B      66 0102 0x42 | b      98 0142 0x62
(etx)   3 0003 0x03 | #      35 0043 0x23 | C      67 0103 0x43 | c      99 0143 0x63
(eot)   4 0004 0x04 | $      36 0044 0x24 | D      68 0104 0x44 | d     100 0144 0x64
(enq)   5 0005 0x05 | %      37 0045 0x25 | E      69 0105 0x45 | e     101 0145 0x65
(ack)   6 0006 0x06 | &      38 0046 0x26 | F      70 0106 0x46 | f     102 0146 0x66
(bel)   7 0007 0x07 | '      39 0047 0x27 | G      71 0107 0x47 | g     103 0147 0x67
(bs)    8 0010 0x08 | (      40 0050 0x28 | H      72 0110 0x48 | h     104 0150 0x68
(ht)    9 0011 0x09 | )      41 0051 0x29 | I      73 0111 0x49 | i     105 0151 0x69
(nl)   10 0012 0x0a | *      42 0052 0x2a | J      74 0112 0x4a | j     106 0152 0x6a
(vt)   11 0013 0x0b | +      43 0053 0x2b | K      75 0113 0x4b | k     107 0153 0x6b
(np)   12 0014 0x0c | ,      44 0054 0x2c | L      76 0114 0x4c | l     108 0154 0x6c
(cr)   13 0015 0x0d | -      45 0055 0x2d | M      77 0115 0x4d | m     109 0155 0x6d
(so)   14 0016 0x0e | .      46 0056 0x2e | N      78 0116 0x4e | n     110 0156 0x6e
(si)   15 0017 0x0f | /      47 0057 0x2f | O      79 0117 0x4f | o     111 0157 0x6f
(dle)  16 0020 0x10 | 0      48 0060 0x30 | P      80 0120 0x50 | p     112 0160 0x70
(dc1)  17 0021 0x11 | 1      49 0061 0x31 | Q      81 0121 0x51 | q     113 0161 0x71
(dc2)  18 0022 0x12 | 2      50 0062 0x32 | R      82 0122 0x52 | r     114 0162 0x72
(dc3)  19 0023 0x13 | 3      51 0063 0x33 | S      83 0123 0x53 | s     115 0163 0x73
(dc4)  20 0024 0x14 | 4      52 0064 0x34 | T      84 0124 0x54 | t     116 0164 0x74
(nak)  21 0025 0x15 | 5      53 0065 0x35 | U      85 0125 0x55 | u     117 0165 0x75
(syn)  22 0026 0x16 | 6      54 0066 0x36 | V      86 0126 0x56 | v     118 0166 0x76
(etb)  23 0027 0x17 | 7      55 0067 0x37 | W      87 0127 0x57 | w     119 0167 0x77
(can)  24 0030 0x18 | 8      56 0070 0x38 | X      88 0130 0x58 | x     120 0170 0x78
(em)   25 0031 0x19 | 9      57 0071 0x39 | Y      89 0131 0x59 | y     121 0171 0x79
(sub)  26 0032 0x1a | :      58 0072 0x3a | Z      90 0132 0x5a | z     122 0172 0x7a
(esc)  27 0033 0x1b | ;      59 0073 0x3b | [      91 0133 0x5b | {     123 0173 0x7b
(fs)   28 0034 0x1c | <      60 0074 0x3c | \      92 0134 0x5c | |     124 0174 0x7c
(gs)   29 0035 0x1d | =      61 0075 0x3d | ]      93 0135 0x5d | }     125 0175 0x7d
(rs)   30 0036 0x1e | >      62 0076 0x3e | ^      94 0136 0x5e | ~     126 0176 0x7e
(us)   31 0037 0x1f | ?      63 0077 0x3f | _      95 0137 0x5f | (del) 127 0177 0x7f

```
• This ASCII set is US ASCII
• The Europeans wanted only a subset, to allow their own characters. ISO 646 defines the subset
• This table shows all US ASCII characters with the ISO 646 in blue and the additional US characters in red
 ! " # $ % & ' ( ) * + , - . 0 1 2 3 4 5 6 7 8 9 : ; < = > @ A B C D E F G H I J K L M N P Q R S T U V W X Y Z [ \ ] ^ ` a b c d e f g h i j k l m n p q r s t u v w x y z { | } ~
• The C language caters for these variations by defining a set of trigraphs to represent the possibly missing characters. So instead of
``#define MAX     32``
you can write (if you have to!)
``??=define MAX     32``
Few other languages accommodate this (C++ inherited trigraphs from C)
• The languages that can comfortably use US ASCII are Latin, Swahili, Hawaiian and American English
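One way to test whether a piece of text stays inside the US ASCII repertoire is to ask an ASCII encoder whether it can encode it. The sketch below assumes nothing beyond the standard `java.nio.charset` classes; the class name `AsciiCheck` and the sample words are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {

    // True if every character of the text is in the US ASCII repertoire
    static boolean isAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String[] samples = { "aloha", "habari", "na\u00EFve", "\u00A35" };
        for (String s : samples) {
            System.out.println(s + ": " + isAscii(s));
        }
    }
}
```

The first two samples are pure ASCII; the last two contain 'ï' and '£', which US ASCII does not have.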

## ASCII national variants

This is from A tutorial on character code issues by Jukka Korpela (this is really good!)

| Dec | Oct | Hex | Glyph | Official Unicode name | National variants |
|---|---|---|---|---|---|
| 35 | 43 | 23 | # | number sign | £ Ù |
| 36 | 44 | 24 | $ | dollar sign | ¤ |
| 64 | 100 | 40 | @ | commercial at | É § Ä à ³ |
| 91 | 133 | 5B | [ | left square bracket | Ä Æ ° â ¡ ÿ é |
| 92 | 134 | 5C | \ | reverse solidus | Ö Ø ç Ñ ½ ¥ |
| 93 | 135 | 5D | ] | right square bracket | Å Ü § ê é ¿ \| |
| 94 | 136 | 5E | ^ | circumflex accent | Ü î |
| 95 | 137 | 5F | _ | low line | è |
| 96 | 140 | 60 | ` | grave accent | é ä µ ô ù |
| 123 | 173 | 7B | { | left curly bracket | ä æ é à ° ¨ |
| 124 | 174 | 7C | \| | vertical line | ö ø ù ò ñ f |
| 125 | 175 | 7D | } | right curly bracket | å ü è ç ¼ |
| 126 | 176 | 7E | ~ | tilde | ü ¯ ß ¨ û ì ´ _ |

Some of the languages which can use these variants are discussed in Der Globalzeichensatz Unicode im Betriebssystem Unix

## Extended ASCII

• This is not a standard, but is still useful for graphics applications that want to draw boxes around text, particularly in the MSDOS world
• This is from Extended ASCII chart:
• This was often used for the ability to draw "boxes" and represent tables using the characters with code points 179-218
• It was often used with the ANSI control sequences. e.g. ANSI.SYS Escape Sequences
• Extended ASCII is still used by many point-of-sale terminals and other systems that never got upgraded from DOS to Windows

## ISO 6937

• This was ignored as a standard by the computing community
• It was adopted by the teletex and videotex community
• Basically it adds in the accented characters of European languages by adding to ASCII a set of non-spacing characters - these produce two-byte sequences of non-spacing accent followed by ordinary spacing character
• Some combinations are fine - an acute accent '´' followed by an 'e' is the French character 'é'
• Most others are not - an acute accent followed by a 'z' is illegal
• ISO 6937 also has tables of "legal" combinations: how non-spacing characters and spacing characters can be combined
• Another 8-bit set using the same non-spacing mechanism is MARC-8: used by libraries for bibliographic databases
• The important concept from ISO 6937 is that some characters may be "decomposed" into smaller bits that could (but probably shouldn't) be considered as characters by themselves
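The idea of decomposing an accented character survives in Unicode, and Java exposes it through `java.text.Normalizer`. The sketch below uses Unicode normalisation as a stand-in; it is not an ISO 6937 implementation. One difference worth noting: in Unicode the combining accent follows the base letter, whereas ISO 6937 puts the non-spacing accent first. The class name `Decompose` is made up for the example.

```java
import java.text.Normalizer;

class Decompose {

    // List the code points of a string as U+XXXX values
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";  // 'é' as a single character
        // NFD is Unicode's canonical decomposition
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        System.out.println(codePoints(precomposed));  // U+00E9
        System.out.println(codePoints(decomposed));   // U+0065 U+0301
    }
}
```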

## ISO 8859

• Octets are now standard for bytes. This allows 128 extra code points for extensions to ASCII
• A number of different code sets to capture the repertoires of various subsets of European languages are the 8859- series
• This is being extended to other languages which have "small enough" alphabets, such as Thai - but probably not Vietnamese, because its accents are too complex
• ISO 8859-1 is also known as Latin-1 and covers many languages in western Europe
• The variants of ISO 8859 cover many languages (see Wikipedia: ISO 8859)
• They are also described in The ISO 8859 Alphabet Soup
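Because every part of ISO 8859 reuses the same 128 upper code points, one octet stands for different characters depending on which part is assumed. The following sketch decodes the octet 0xE6 under several parts. Only ISO-8859-1 is guaranteed to be available on every Java platform, so the other names are checked with `Charset.isSupported` first; the class name and the choice of octet are just for illustration.

```java
import java.nio.charset.Charset;

class SameByteDifferentCharacter {

    // Decode a single octet under the named charset
    static String decode(int octet, String charsetName) {
        byte[] bytes = { (byte) octet };
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String[] parts = { "ISO-8859-1", "ISO-8859-2", "ISO-8859-5", "ISO-8859-7" };
        for (String name : parts) {
            // Skip any part this Java runtime does not provide
            if (Charset.isSupported(name)) {
                String s = decode(0xE6, name);
                System.out.printf("%-11s 0xE6 -> U+%04X %s%n", name, (int) s.charAt(0), s);
            }
        }
    }
}
```

In ISO-8859-1 the octet decodes to U+00E6 ('æ'); the other parts give different characters for the same octet.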

## ISO 2022

• This adds a further variation in how you can pack lots of characters into an 8-bit format
• ISO 2022 allows a number of different 7-bit code pages - up to four (G0, G1, G2 and G3). Each page can have 94 or 96 characters, so it will support up to 384 characters - enough for most of Europe
• The design of ISO 2022 was influenced by the DEC VT100 terminal which supported up to four pages (see the Linux man page for charsets)
• One code page can be "loaded" into the bottom half of 8-bit codes (GL), another can be loaded into the top half (GR)
• Each code page has the escape character in the same location
• Pages are loaded by locking shifts.
• LOCKING-SHIFT ZERO (LS0), which invokes the G0 set into the GL area, is coded as 00/15;
• LOCKING-SHIFT ONE (LS1), which invokes the G1 set into the GL area, is coded as 00/14
• LOCKING-SHIFT TWO (LS2), which invokes the G2 set into the GL area, is coded as ESC 06/14
• LOCKING-SHIFT ONE RIGHT (LS1R), which invokes the G1 set into the GR area, is coded as ESC 07/14; etc
• See Basic principles of ISO/IEC 2022
• The specification costs money from ISO, but is free from ECMA as Character Code Structure and Extension
• The major contribution of ISO 2022 is that of a state model where many different characters can be encoded depending on the state of the encoding

## Issues with ISO 2022

• To print a character, you have to know the state of the system: which page is loaded where (e.g. G1 in GL, G2 in GR)
• A character might have a unique codepoint on one of the four pages G0-G3, but the encoding is not unique.
• Suppose 'a' is in page G1 at location N/M
• `LS1 N/M` prints 'a'
• `LS1R (N+8)/M` also prints 'a'

• To check if two strings are equal, you have to check character by character. Checking equality of code values won't work - you have to check whether state plus value gives the same character
• You can't search for a character or substring by just starting at the middle, because the state "in the middle" is a combination of all the state transforms from the beginning
• ISO 2022 also has non-locking (single) shifts which change state for one character only and then change back again
• Locking and non-locking shifts make equality tests even harder
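The state problem can be seen in a toy decoder. This is a deliberately simplified sketch and not a conforming ISO 2022 implementation: it handles only the locking shifts LS0 (00/15) and LS1 (00/14), it ignores designation escape sequences, and its "G1" page is invented for the example (it maps 0x41, 0x42, ... onto Greek capital letters).

```java
import java.util.Arrays;

class ToyShiftDecoder {

    static final int LS1 = 0x0E;  // 00/14: invoke G1 into GL
    static final int LS0 = 0x0F;  // 00/15: invoke G0 into GL

    // Decode 7-bit data, starting in the given state (true = G1 is in GL)
    static String decode(int[] data, boolean g1Active) {
        StringBuilder out = new StringBuilder();
        for (int b : data) {
            if (b == LS1) {
                g1Active = true;
            } else if (b == LS0) {
                g1Active = false;
            } else if (g1Active) {
                out.append((char) (0x0350 + b));  // hypothetical G1: 0x41 -> U+0391
            } else {
                out.append((char) b);             // G0 is plain ASCII
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] stream = { 0x41, LS1, 0x41, 0x42, LS0, 0x42 };

        // Decoding from the start tracks the shifts correctly
        System.out.println(decode(stream, false));

        // Starting in the middle loses the LS1, so the same bytes decode differently
        int[] tail = Arrays.copyOfRange(stream, 2, stream.length);
        System.out.println(decode(tail, false));
    }
}
```

The byte 0x41 appears twice in the stream and means 'A' the first time and 'Α' (Greek alpha) the second, which is why comparing or searching raw code values is unreliable.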

## Two-byte coded character sets

• Languages based on alphabets can usually be fitted into single-byte encodings
• The ISO 8859 series can handle several languages within each set, but it is cumbersome
• Code page switching techniques such as ISO 2022 can extend them further
• But Chinese, Korean, Japanese, early Egyptian, Mayan, ... have far too many characters to fit into one byte, so they need 2-byte coded character sets

## Chinese character sets

• Traditional Chinese is still used in Taiwan, Hong Kong, etc
• Two character sets for this are Big5 and EUC-TW. See Answers.com Big5
• The People's Republic of China simplified several thousand of the characters
• The simplified character sets include GB2312 and GBK/GBX. See Answers.com Guobiao code
• Pinyin is a romanisation of Mandarin pronunciation using the Latin alphabet plus tone marks. It is not a character set for the Chinese characters themselves

## Japanese character sets

There is a discussion of Japanese character sets at http://www.debian.org/doc/manuals/intro-i18n/ch-languages.en.html

## Unicode

• Unicode is supposed to be the answer to all this confusion about character sets and encodings
• It began as a 16-bit coded character set, with a repertoire large enough to cover most existing languages. Since version 2.0 the codepoints run from U+0000 to U+10FFFF, so not every character fits in 16 bits
• Unicode is now up to version 4; version 3 had over 49,000 characters
• Each Unicode character has a name such as "LATIN CAPITAL LETTER A WITH GRAVE" and codepoint written as "U+00C0" in hexadecimal
• An alphabetical list of characters is at http://www.unicode.org/charts/charindex.html and lists of charts showing representative glyphs are at http://www.unicode.org/charts/
• There are a variety of different encodings for Unicode: the "native" encoding UCS-2 uses the codepoints as two-byte values; UTF-8 is an optimised representation using one byte where it can; UTF-7 is a 7-bit encoding
• Java uses Unicode for internal representation of characters. The escape sequence for a Unicode character in Java strings is '\uXXXX' where XXXX is the 4-digit hexadecimal codepoint (the hex digits may be upper or lower case). Characters beyond U+FFFF need a pair of such escapes
• Much more on Unicode in later lectures...
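The character name, codepoint and encodings mentioned above can be inspected from Java. This is a small sketch using only standard library calls (`Character.getName` needs Java 7 or later); the class name and the `hex` helper are invented for the example. UTF-16BE is used to show the two-byte form, which for a character like U+00C0 is the same byte sequence as big-endian UCS-2.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEncodings {

    // Render bytes as space-separated two-digit hex values
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00C0";  // escape sequence for U+00C0

        System.out.println(Character.getName(s.codePointAt(0)));
        // LATIN CAPITAL LETTER A WITH GRAVE

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));  // 00 C0
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));     // C3 80
        // UTF-8 uses a single byte for ASCII characters
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_8)));   // 41
    }
}
```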

## Java and the console

• Java writes to the console using `System.out` which is of type `PrintStream`
• Javadoc: "All characters printed by a PrintStream are converted into bytes using the platform's default character encoding" - so if that encoding doesn't support the character you want, the character cannot be printed correctly
• To display e.g. ISO 8859-2 characters in a console window, you have to work around this: start a console window supporting that font e.g. in Linux
``````
``````
Then write the appropriate bytes into that window e.g.
``````
System.out.write(65);   // 'A'
System.out.write(0xA1); // in ISO 8859-2, 'Ą' (A with ogonek)
System.out.flush();     // make sure the raw bytes reach the console
``````

## Java character encodings

• Text files contain information written using a character encoding
• The class `java.nio.charset.Charset` is "a named mapping between sequences of sixteen-bit Unicode characters and sequences of bytes"
• It can be used to convert in either direction between internal Unicode characters and various encodings
• A list of recognised encodings and their aliases is given by
``````

import java.nio.charset.Charset;
import java.util.Map;

public class ListEncodings {

    public static void main(String[] args) {
        // Map from canonical charset name to Charset object
        Map<String, Charset> availableEncodings = Charset.availableCharsets();
        for (Charset encoding : availableEncodings.values()) {
            System.out.println(encoding);
            // Each charset may also be known by several alias names
            for (String alias : encoding.aliases()) {
                System.out.println("    " + alias);
            }
        }
    }
}
``````
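Converting "in either direction" with a `Charset` might look like the following sketch. The class and method names are invented for the example. It also shows what happens when the target encoding lacks a character: `Charset.encode` substitutes the charset's replacement byte (a '?' for US-ASCII) instead of failing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {

    // Encode a string with the named charset and count the bytes produced
    static int encodedLength(String text, String charsetName) {
        ByteBuffer bytes = Charset.forName(charsetName).encode(text);
        return bytes.remaining();
    }

    // Encode to bytes and decode back again with the same charset
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer bytes = cs.encode(text);
        CharBuffer chars = cs.decode(bytes);
        return chars.toString();
    }

    public static void main(String[] args) {
        String text = "na\u00EFve";  // "naïve"

        System.out.println(encodedLength(text, "ISO-8859-1"));  // 5
        System.out.println(encodedLength(text, "UTF-8"));       // 6
        System.out.println(encodedLength(text, "UTF-16BE"));    // 10

        System.out.println(roundTrip(text, "UTF-8").equals(text));  // true
        System.out.println(roundTrip(text, "US-ASCII"));            // na?ve
    }
}
```

To detect unsupported characters before encoding, `cs.newEncoder().canEncode(text)` can be called first.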

## Reading and writing text files

• The classes `InputStreamReader` and `OutputStreamWriter` have constructors which take a `Charset` parameter, and perform conversion as they read and write
• This program converts an input file in one encoding to an output file in another
``````

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.util.ResourceBundle;

// Usage: java EncodingConverter infile inencoding outfile outencoding
public class EncodingConverter {

    // Error messages are looked up in EncodingConverter.properties
    public static ResourceBundle bundle;

    public static void main(String[] args) throws Exception {
        bundle = ResourceBundle.getBundle("EncodingConverter");

        if (args.length != 4) {
            fatal("Usage");
        }

        new EncodingConverter(args[0], args[1], args[2], args[3]);
    }

    // List the known encodings, print the localised message and exit
    public static void fatal(String errorType) {
        for (Charset encoding : Charset.availableCharsets().values()) {
            System.out.println(encoding);
        }

        String errorMsg = bundle.getString(errorType);
        System.out.println("Fatal error: " + errorMsg);
        System.exit(1);
    }

    public EncodingConverter(String inFileName, String inEncoding,
                             String outFileName, String outEncoding) {
        FileInputStream fin = null;
        FileOutputStream fout = null;
        try {
            fin = new FileInputStream(inFileName);
            fout = new FileOutputStream(outFileName);
        } catch (FileNotFoundException e) {
            fatal("FileNotFound");
        }

        BufferedReader in = null;
        OutputStreamWriter out = null;
        try {
            // Decode bytes to characters on input, encode characters to bytes on output
            in = new BufferedReader(new InputStreamReader(fin, inEncoding));
            out = new OutputStreamWriter(fout, outEncoding);
        } catch (UnsupportedEncodingException e) {
            fatal("NoEncoding");
        }

        String s = null;
        try {
            while ((s = in.readLine()) != null) {
                out.write(s + "\n");
            }
        } catch (IOException e) {
            System.err.println("I/O error: " + e.getMessage());
        } finally {
            try {
                // Closing the writer flushes any buffered characters to the file
                out.close();
                in.close();
            } catch (IOException e) {
                // nothing more can be done
            }
        }
    }
}
``````
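On Java 7 and later the same conversion can be written more compactly with `java.nio.file.Files` and try-with-resources. This is an alternative sketch rather than a replacement for the program above: the class name `Transcode` is invented, error messages are not localised, and an in-memory variant is included to show that the conversion is just "decode with one charset, encode with another".

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Transcode {

    // In-memory conversion: decode the bytes, then re-encode the characters
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    // File conversion, line by line; both files are closed automatically
    static void convert(Path in, Charset inCs, Path out, Charset outCs) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCs);
             BufferedWriter writer = Files.newBufferedWriter(out, outCs)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write("\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("Usage: java Transcode infile inencoding outfile outencoding");
            System.exit(1);
        }
        convert(Paths.get(args[0]), Charset.forName(args[1]),
                Paths.get(args[2]), Charset.forName(args[3]));
    }
}
```

For example, `java Transcode latin1.txt ISO-8859-1 utf8.txt UTF-8` would rewrite a Latin-1 file as UTF-8 (the file names here are placeholders).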

## Fonts

• From http://en.wikipedia.org/wiki/Typeface A font (originally fount, from typefoundry) is a set of glyphs (images) representing the characters from a particular character set in a particular typeface.
• Traditionally a font was specific to a given size (the actual height of characters), weight (how dark the text appears e.g. bold, light) and style (most commonly regular, italic or condensed).
• In digital fonts, the image of each character may be encoded either as a bitmap (in a bitmap font) or by a higher-level description in terms of lines and curves enclosing space (an outline font, also called "vector font").
• There is a blur between characters and glyphs in a font: sometimes a character may need several glyphs, or a set of characters form a glyph. Should you make up extra "dummy" characters for these extra glyphs? Unicode is basically a character set, but has sections for things like "Alphabetic Presentation Forms" which includes the "presentation character" `LATIN SMALL LIGATURE FFI`, and similar "Arabic Presentation Forms"
• There are hundreds of fonts listed and available for download at David McCreedy's Gallery of Unicode Fonts
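The "presentation character" for the ffi ligature can be examined from Java. The sketch below uses Unicode compatibility normalisation (NFKD) to map the ligature back to the three ordinary characters it stands for; the class and method names are made up for the example.

```java
import java.text.Normalizer;

class PresentationForms {

    // Replace compatibility characters such as ligatures by their plain equivalents
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ligature = "\uFB03";

        System.out.println(Character.getName(ligature.codePointAt(0)));
        // LATIN SMALL LIGATURE FFI

        String word = "o" + ligature + "ce";
        System.out.println(word.length());          // 4
        System.out.println(expand(word));           // office
        System.out.println(expand(word).length());  // 6
    }
}
```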

## Java fonts

• Internally, Java represents all characters using Unicode - see later
• To draw glyphs in GUI components, Java uses fonts
• Font support in Java has improved from poor to okay
• Drawing glyphs in AWT components such as `Label` requires support from the underlying window system since they use "native" window objects (the Java `Label` uses a "peer" object - a Motif `XmLabel` under X Windows)
• For Swing objects, Java can draw the glyphs itself and has better support

## Java logical fonts

• The Java logical fonts were developed for Java 1.0, where cross-platform support using native GUI objects was the primary aim
• Logical fonts are Serif, SansSerif, Monospaced, Dialog, and DialogInput
• These are the only fonts that can be used in AWT components
• The fonts can have size and style (plain, italics, bold) specified in a constructor e.g.
``Font f = new Font("Serif", Font.PLAIN, 12)``
• These fonts are mapped to system fonts using the locale value, using files in `JRE_HOME/lib/fonts`, e.g. in Fedora Linux RC2:
```
font.properties
font.properties.ja
font.properties.ja.Redhat6.1
font.properties.ja.Redhat6.2
font.properties.ja.Redhat7.2
font.properties.ja.Redhat7.3
font.properties.ja.Redhat8.0
font.properties.ja.Turbo
font.properties.ja.Turbo6.0
font.properties.ko.Redhat8.0
font.properties.Redhat6.1
font.properties.Redhat8.0
font.properties.SuSE8.0
font.properties.zh_CN.Redhat8.0
font.properties.zh.Turbo
font.properties.zh_TW.Redhat8.0
```
• Lines in these files map logical fonts to X Window fonts e.g.
```
serif.0=-b&h-lucidabright-medium-r-normal--*-%d-*-*-p-*-iso8859-1
serif.italic.0=-b&h-lucidabright-medium-i-normal--*-%d-*-*-p-*-iso8859-1
serif.bold.0=-b&h-lucidabright-demibold-r-normal--*-%d-*-*-p-*-iso8859-1
```
(to the right of the '=' is a pattern for an X Windows font name). These are all ISO 8859-1 fonts
• The X Window fonts are stored in `/usr/X11R6/lib/X11/fonts/`, e.g. in the subdirectory `75dpi` the file `fonts.dir` has lines like
```
lubI10-ISO8859-1.pcf.gz -b&h-lucidabright-medium-i-normal--10-100-75-75-p-57-iso8859-1
```
which map the X Windows font name to an actual font file `lubI10-ISO8859-1.pcf.gz`
• Under Unix, the code set can be seen using the command `xfd`
``xfd -fn '-b&h-lucidabright-medium-r-normal--*-*-*-*-p-*-iso8859-1'``
where '%d' is replaced by the wild-card '*'

## TrueType fonts

• A font file contains a glyph for each codepoint in the repertoire
• TrueType fonts store each glyph as a scalable outline rather than a bitmap
• To maintain the right "shape" for each glyph, there is a small program (per glyph!) to control how it maps to pixels
• See TrueType Hinting

## Java physical fonts

• Java only uses TrueType fonts, stored in `JRE_HOME/lib/fonts`
• The file `fonts.dir` lists the X font names provided by each TrueType file e.g. the entry for LucidaTypewriterRegular.ttf contains
```
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-ascii-0
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-fcd8859-15
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso10646-1
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-1
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-10
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-15
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-2
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-3
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-4
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-5
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-6
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-7
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-8
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-iso8859-9
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-koi8-r
-b&h-Lucida Sans Typewriter-medium-r-normal--0-0-0-0-m-0-koi8-ru
```
This file contains various 8-bit fonts

## Font families

• Fonts are typically designed by a company, to give the "same" appearance to sets of characters
• This is an orthogonal concept to character repertoires: the English characters form one repertoire, the French form another. But if one company designs glyphs for both of these sets that have a common style, then they belong to the same family
• Fonts come in families, representing a set of fonts which are designed on the same principles of appearance (Web Typography)

• Within each family are variations in the glyphs: bold, italics, etc

## Java font families

This program lists the font families known to Java

``````
import java.awt.GraphicsEnvironment;

public class ListFontFamilies {

    public static void main(String[] args) {

        String[] fontFamilyNames = GraphicsEnvironment.
            getLocalGraphicsEnvironment().getAvailableFontFamilyNames();

        System.out.println("Font family names:");
        for (int n = 0; n < fontFamilyNames.length; n++) {
            System.out.println("   " + fontFamilyNames[n]);
        }

    }
}
``````
These can be used to create fonts as in
``````
Font f = new Font("Bitstream Charter", Font.PLAIN, 12);
``````

## Java fonts

• The different fonts within each font family can be listed by
``````
import java.awt.Font;
import java.awt.GraphicsEnvironment;

public class ListFonts {

    public static void main(String[] args) {
        GraphicsEnvironment env = GraphicsEnvironment.getLocalGraphicsEnvironment();

        // Every individual font face known to the system
        Font[] allfonts = env.getAllFonts();

        System.out.println("Fonts");
        for (int n = 0; n < allfonts.length; n++) {
            System.out.println(allfonts[n].toString());
        }

        // The families those faces belong to
        String[] fontFamilyNames = env.getAvailableFontFamilyNames();

        System.out.println("\n\n\nFont family names");
        for (int n = 0; n < fontFamilyNames.length; n++) {
            System.out.println(fontFamilyNames[n]);
        }

    }
}
``````
• An entry such as
``````
java.awt.Font[family=Bitstream Charter,name=Bitstream Charter Bold,style=plain,size=1]
``````
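is interpreted as: the font family name is "Bitstream Charter" and the font face name is "Bitstream Charter Bold". The style and size belong to the `Font` instance rather than to the face: `getAllFonts()` returns plain, 1-point instances, which can be resized with `deriveFont()`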
• The site http://java.sun.com/j2se/1.4.2/docs/guide/intl/font.html gives info on what these fonts support for JDK 1.4, and http://java.sun.com/j2se/1.5.0/docs/guide/intl/font.html does the same for JDK 1.5

• The US version of Java for Linux doesn't have a font with enough glyphs to show all of the Unicode characters - in particular, it doesn't have glyphs for Chinese, Japanese and Korean
• There may be Asian fonts on the Windows NT distribution. For Linux, the RPM `ttfonts-zh` will have the `zysong.ttf` TrueType fonts. The `Cyberbit.ttf` font is available from e.g. http://www.carfield.com.hk/mirror/pub/font/Cyberbit.ttf
• Copy one or more of these font files to `JRE_HOME/lib/fonts`. For X Windows, run `ttmkfdir > fonts.dir` to regenerate the font directory
• Then you can create fonts capable of handling Asian languages by
``````
Font font = new Font("ZYSong18030", Font.PLAIN, 24);
// or
font = new Font("Bitstream Cyberbit", Font.PLAIN, 24);
``````
• If you want to make these the default fonts, edit the `font.properties` files in `JRE_HOME/lib`
• Large fonts have size penalties: an ISO 8859-1 font file (non-TrueType) is typically about 5 kilobytes, an ISO 8859-1 TrueType file is about 80 kilobytes whereas `Cyberbit.ttf` is 13 Megabytes

## Checking font coverage

• The class `Font` has methods `canDisplay()` and `canDisplayUpTo()` to check if the font will display characters and strings
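A minimal sketch of how these two methods might be used follows. `canDisplay(char)` answers for a single character, and `canDisplayUpTo(String)` returns -1 if the whole string can be displayed, or otherwise the index of the first character with no glyph. The class name, the `describe` helper and the sample text are invented for the example, and the results depend entirely on the fonts installed on the machine, so no output is shown.

```java
import java.awt.Font;

class FontCoverage {

    // Turn the result of canDisplayUpTo into a readable message
    static String describe(String text, int firstMissing) {
        if (firstMissing == -1) {
            return "all of \"" + text + "\" can be displayed";
        }
        return "no glyph for '" + text.charAt(firstMissing)
            + "' at index " + firstMissing;
    }

    public static void main(String[] args) {
        // "Serif" is a logical font, so it exists on every Java platform
        Font font = new Font("Serif", Font.PLAIN, 12);

        // Latin text followed by two Chinese characters
        String text = "caf\u00E9 \u4E2D\u6587";

        System.out.println("'a' displayable: " + font.canDisplay('a'));
        System.out.println(describe(text, font.canDisplayUpTo(text)));
    }
}
```

A check like this could be run before calling `setFont()` on a text component, falling back to another font when the preferred one lacks glyphs.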

## Displaying text files using Java

This is based on a program from Chinese in Java

``````

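import java.awt.Font;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.util.ResourceBundle;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

// Usage: java FileReader filename encoding
// Note: this class has the same simple name as java.io.FileReader,
// which is not used here
public class FileReader {

    public static ResourceBundle bundle;
    private Font unicodeFont = new Font("Bitstream Cyberbit",
                                        Font.PLAIN, 16);

    public static void main(String[] args) throws Exception {
        // Assumption: messages are in FileReader.properties, by analogy
        // with the EncodingConverter program
        bundle = ResourceBundle.getBundle("FileReader");

        if (args.length != 2) {
            fatal("Usage");
        }

        new FileReader(args[0], args[1]);
    }

    public static void fatal(String errorType) {
        String errorMsg = bundle.getString(errorType);
        System.out.println("Fatal error: " + errorMsg);
        System.exit(1);
    }

    public FileReader(String fileName, String encoding) {
        FileInputStream fin = null;
        try {
            fin = new FileInputStream(fileName);
        } catch (FileNotFoundException e) {
            fatal("FileNotFound");
        }

        BufferedReader in = null;
        try {
            // Decode the file's bytes into Unicode characters
            in = new BufferedReader(new InputStreamReader(fin, encoding));
        } catch (UnsupportedEncodingException e) {
            fatal("NoEncoding");
        }

        JFrame frame = new JFrame();
        JTextArea text = new JTextArea(20, 100);
        JScrollPane pane = new JScrollPane(text);
        frame.getContentPane().add(pane);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);

        // Use a font that has glyphs for the characters in the file
        text.setFont(unicodeFont);

        frame.setSize(600, 600);
        frame.setVisible(true);

        String s = null;
        try {
            while ((s = in.readLine()) != null) {
                text.append(s + "\n");
            }
        } catch (IOException e) {
            System.err.println("I/O error: " + e.getMessage());
        } finally {
            try {
                in.close();
            } catch (IOException e) {
                // nothing more can be done
            }
        }
    }
}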
``````

Jan Newmarch (http://jan.newmarch.name)
jan@newmarch.name