Text: Characters and Strings

General

Introduction

Text occurs in multiple ways in a computer system. Text can be stored in files, be used in the programs of particular programming languages, appear as the input or output to programs or be sent between programs, either on the same host or on different hosts.

Once upon a time ASCII and EBCDIC were the predominant forms for almost all of these. EBCDIC has pretty much disappeared, but even ASCII is showing problems.

ASCII (American Standard Code for Information Interchange) is oriented to the US version of English. It doesn't include, for example, the UK pound symbol '£'. It also doesn't include the various symbols of the European languages such as 'â' and 'ß'. There are various versions of ASCII which allow for some of these.

Wider sets of characters are provided by the ISO 8859 series, and at one stage the ISO 8859-1 set was used as the 'standard' for the web.

But even these don't include the characters of Chinese, Thai, Arabic, Japanese and many other languages. There have been multiple ways of representing these, but fortunately all of them are giving way to Unicode.

Unicode

Unicode attempts to represent all the characters of all the different human languages (Klingon and Tolkien's Elvish scripts have been proposed too, although neither has been accepted into the standard). Originally fewer than 64k such characters were considered, and the Basic Multilingual Plane (BMP) of these characters fits into 2 bytes. Some languages such as Java set their character type as 2-byte integers.

Now at 143,696 graphic characters, Unicode 13.0 requires more than 2 bytes, and some more recent languages use 4 bytes to represent each character.

However, using 32 bits per character is generally wasteful of space, so there are more compact encodings:

UTF-32: Uses a full 32 bits for every character
UTF-16: Uses 16 bits for characters in the BMP. Characters outside the BMP require 32 bits (a pair of 16-bit surrogates)
UTF-8: Uses 8 bits for some characters (notably the ASCII ones) and 16, 24 or 32 bits for the rest, giving the complete set. UTF-8 is now the most popular format for Web pages and for transporting Web documents across the network
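
The size differences between these encodings can be checked directly. The following is a minimal sketch in Python; the helper name encoded_sizes and the sample characters are chosen here for illustration only. The big-endian codec variants are used so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    # The "-be" codecs fix the byte order, so no byte order mark is added.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


# One ASCII character, one Latin-1 character, one BMP character
# and one character outside the BMP.
for ch in ["A", "£", "日", "😀"]:
    utf8, utf16, utf32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8}, UTF-16 {utf16}, UTF-32 {utf32} bytes")
```

Only the last sample, which lies outside the BMP, needs two 16-bit units in UTF-16.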

Unicode normalisation

Some characters do not have a unique representation in Unicode: for example, 'é' can be a single precomposed code point or an 'e' followed by a combining acute accent. Comparing two strings therefore cannot reliably be done by a code point by code point comparison unless both are first converted to the same normal form. There are 4 'normal' forms that strings can be converted to (NFC, NFD, NFKC and NFKD), and NFC is the most commonly used one.
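
As an illustration of why normalisation matters for comparison, here is a small Python sketch using the standard unicodedata module. The function name same_text and the two sample strings are assumptions made for this example.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The raw comparison looks at code points, which differ.
print(precomposed == decomposed)
# After normalisation the two strings compare equal.
print(same_text(precomposed, decomposed))
```

Both strings display identically, which is exactly why the raw comparison is a trap.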

Internationalized domain names

The name geschäft.com, with IDN form xn--geschft-9wa.com, resolves to a domain which is for sale (for $2,795!) - but never mind, it is a good test for IDNs. For more details on IDNs, see Internet in All Languages: Internationalized Domain Names.
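
The conversion between the two forms of this name can be tried out with Python's built-in idna codec. This is only a sketch: the helper names to_idn and from_idn are invented here, and the built-in codec implements the older IDNA 2003 rules, which is enough for this test name but may differ from newer rules for other names.

```python
def to_idn(hostname):
    """Convert a Unicode hostname to its ASCII (IDN) form."""
    return hostname.encode("idna").decode("ascii")


def from_idn(hostname):
    """Convert an ASCII (IDN) hostname back to its Unicode form."""
    return hostname.encode("ascii").decode("idna")


print(to_idn("geschäft.com"))
print(from_idn("xn--geschft-9wa.com"))
```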

General Resources

ICU - International Components for Unicode "ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications"
The Unicode® Standard Version 13.0 – Core Specification

Java

Character and string representations

The Java char type is a 2-byte integer. It can only represent characters from the Basic Multilingual Plane. The String type is a sequence of chars, each of 2 bytes.

To represent characters outside of the BMP, you need to use a string of two Java chars, one for the high surrogate and the other for the low surrogate. The class Character can be used to construct an array of chars of length one or two, using the static method


      char[] Character.toChars(int codepoint)

Unicode normalization

The Java class Normalizer can be used to convert a string to normalized form:


    String normalized_string = Normalizer.normalize(target_chars,
                                                    Normalizer.Form.NFD);

where the target_chars can be a String.

Converting strings to and from UTF-8

Strings can be converted to an array of UTF-8 bytes by


      byte[] bytes = string.getBytes(StandardCharsets.UTF_8);

To convert the other way, use


      String string = new String(bytes, StandardCharsets.UTF_8);

Reading and writing UTF-8 strings

Java uses the default charset for many classes converting between strings and bytes. Its value can be seen by


      Charset.defaultCharset().displayName();

The page Guide to Character Encoding claims that on macOS this will be UTF-8 while for Windows systems it will be Windows-1252. To avoid errors, specify the charset explicitly when constructing the relevant classes rather than relying on the default.

To read UTF-8 strings from an InputStream, wrap it in an InputStreamReader with the character encoding:


      new InputStreamReader(inputStream, StandardCharsets.UTF_8)

To write UTF-8 strings to an OutputStream, wrap it in an OutputStreamWriter with the character encoding:


      new OutputStreamWriter(outputStream, StandardCharsets.UTF_8)

Other relevant classes are dealt with similarly.

A revised EchoClient using this has only a few lines changed: EchoClient.java:


import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class EchoClient {

    public static final int SERVER_PORT = 2000;
    
    public static void main(String[] args){

	if (args.length != 1) {
	    System.err.println("Usage: Client address");
	    System.exit(1);
	}

	InetAddress address = null;
	try {
	    address = InetAddress.getByName(args[0]);
	} catch(UnknownHostException e) {
	    e.printStackTrace();
	    System.exit(2);
	}

	Socket sock = null;
	try {
	    sock = new Socket(address, SERVER_PORT);
	} catch(IOException e) {
	    e.printStackTrace();
	    System.exit(3);
	}

	InputStream in = null;
	try {
	    in = sock.getInputStream();
	} catch(IOException e) {
	    e.printStackTrace();
	    System.exit(4);
	}

	OutputStream out = null;
	try {
	    out = sock.getOutputStream();
	} catch(IOException e) {
	    e.printStackTrace();
	    System.exit(5);
	}

	BufferedReader socketReader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));

	PrintStream socketWriter = new PrintStream(out, false, StandardCharsets.UTF_8);

	BufferedReader consoleReader =  
                   new BufferedReader(new InputStreamReader(System.in));

	String line = null;
	while (true) {
	    line = null;
	    try {
		System.out.print("Enter line:");
		line = consoleReader.readLine();
		System.out.println("Read '" + line + "'");
	    } catch(IOException e) {
		e.printStackTrace();
		System.exit(6);
	    }

	    // readLine() returns null at end of input, so check before comparing
	    if (line == null || line.equals("BYE"))
		break;
	    
	    try {
		socketWriter.println(line);
	    } catch(Exception e) {
		e.printStackTrace();
		System.exit(7);
	    }

	    try {
		System.out.println(socketReader.readLine());
	    } catch(IOException e) {
		e.printStackTrace();
		System.exit(8);
	    }
	}
	
	System.exit(0);
    }
} // EchoClient

Internationalized domain names

To convert a hostname to an IDN name, use


      String IDN.toASCII(hostname)

To convert back, use


      String IDN.toUnicode(hostname)

Java Resources

Go

Character and string representations

A character is represented by a rune, which is an alias for int32. A rune holds a Unicode code point as an integer; it is the string, not the rune, that is stored in UTF-8 format.

A string is a sequence of bytes. Usually it is used to hold text in UTF-8 format. This means it can be accessed in two ways, shown here for a string str holding "日本語":

As a sequence of bytes


for i := 0; i < len(str); i++ {
    fmt.Printf("%x  starts at byte position %d\n", str[i], i)
}

with output


e6  starts at byte position 0
97  starts at byte position 1
a5  starts at byte position 2
e6  starts at byte position 3
9c  starts at byte position 4
ac  starts at byte position 5
e8  starts at byte position 6
aa  starts at byte position 7
9e  starts at byte position 8

As a sequence of runes


for index, runeValue := range str {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}

with output


U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6

These can be checked against a site such as Unicode Converter - Decimal, text, URL, and unicode converter which shows that '日' for example has UTF-8 format of '\xe6\x97\xa5'.

Unicode normalization

Normalization is done using the package norm, imported as "golang.org/x/text/unicode/norm". For example, to normalize a byte slice b, use norm.NFC.Bytes(b).

Converting strings to and from UTF-8

Strings will usually contain characters encoded in UTF-8. The UTF-8 bytes are obtained by converting the string to a slice of bytes: []byte(str). A slice of UTF-8 bytes can be converted to a string by the conversion string(bytes). The conversion itself does not validate the bytes; when Go later decodes a string that is not valid UTF-8 (for example, in a range loop), it gives the Unicode Replacement Character \uFFFD for each invalid sequence.

Reading and writing UTF-8 strings

Nothing special has to be done.

Internationalized domain names

Go has the package golang.org/x/net/idna with functions ToASCII() and ToUnicode().

Go Resources

Python

Character and string representations

Python does not have a char type: single characters are strings of length one. In Python 3, strings are sequences of Unicode code points; UTF-8 is one way of encoding them as bytes, not the form in which a program sees them. See Unicode HOWTO.
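
A short sketch of what this means in practice, using the built-in functions ord(), chr() and len(). The helper name describe and the sample text are chosen for this example.

```python
text = "日本語"


def describe(s):
    """Return (code point, UTF-8 byte count) for each character in s."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in s]


print(len(text))                   # number of characters (code points)
print(len(text.encode("utf-8")))   # number of bytes once encoded as UTF-8
print(describe(text))
print(chr(0x65E5))                 # code point back to a one-character string
```

Indexing and len() work in characters, so the byte count only appears once the string is explicitly encoded.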

Unicode normalization

This can be done using the unicodedata.normalize() function


      unicodedata.normalize('NFD', s)
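
The four normal forms can give different results for the same input. The sketch below applies each of them to a sample word containing a ligature and an accented letter; the name all_forms and the sample word are assumptions made for this example.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(s):
    """Return a dict mapping each normal form name to the normalised string."""
    return {form: unicodedata.normalize(form, s) for form in FORMS}


# "ﬁancé" written with the ligature U+FB01 and the precomposed é U+00E9.
sample = "\ufb01anc\u00e9"

for form, value in all_forms(sample).items():
    print(form, len(value), [f"U+{ord(ch):04X}" for ch in value])
```

The compatibility forms (NFKC, NFKD) split the ligature into 'f' and 'i', while the decomposed forms (NFD, NFKD) split 'é' into a base letter plus a combining accent.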

Converting strings to and from UTF-8

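A Python 3 string holds Unicode code points rather than UTF-8 bytes, so the conversion is explicit: encode a string to get UTF-8 bytes, and decode UTF-8 bytes to get a string.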

Reading and writing UTF-8 strings

To encode a string as UTF-8 bytes, use


      str.encode('utf-8')

To decode UTF-8 bytes to a string, use


      str(bytes, encoding='utf-8')
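
For reading and writing through a stream rather than converting whole strings, one option is to wrap the byte stream in an io.TextIOWrapper with an explicit encoding. The sketch below uses in-memory byte streams so that it is self-contained; the function names write_utf8 and read_utf8 are invented for this example. The same encoding argument is accepted by open() and by socket.makefile().

```python
import io


def write_utf8(text):
    """Write text through a UTF-8 text layer and return the raw bytes."""
    raw = io.BytesIO()
    writer = io.TextIOWrapper(raw, encoding="utf-8", newline="\n")
    writer.write(text)
    writer.flush()      # push buffered text down to the byte stream
    data = raw.getvalue()
    writer.detach()     # separate the text layer without closing raw
    return data


def read_utf8(data):
    """Read raw bytes through a UTF-8 text layer and return the text."""
    reader = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8", newline="\n")
    return reader.read()


data = write_utf8("日本語\n")
print(data)
print(read_utf8(data))
```

Naming the encoding explicitly avoids depending on the platform default.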

Internationalized domain names

The module encodings.idna has functions ToASCII() and ToUnicode(), each of which works on a single label of a hostname (the parts between the dots).
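
A minimal sketch of applying these functions to a whole hostname, one label at a time. The wrapper names host_to_ascii and host_to_unicode are assumptions for this example; ToASCII() returns bytes, so each result is decoded to a string before the labels are joined again.

```python
from encodings import idna


def host_to_ascii(hostname):
    """Apply ToASCII to each dot-separated label of hostname."""
    labels = [idna.ToASCII(label).decode("ascii") for label in hostname.split(".")]
    return ".".join(labels)


def host_to_unicode(hostname):
    """Apply ToUnicode to each dot-separated label of hostname."""
    return ".".join(idna.ToUnicode(label) for label in hostname.split("."))


print(host_to_ascii("geschäft.com"))
print(host_to_unicode("xn--geschft-9wa.com"))
```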

Python Resources

JavaScript

Character and string representations

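JavaScript has no separate character type. A single string element is a 16-bit UTF-16 code unit, which on its own covers only the BMP subset of Unicode; characters outside the BMP take two elements (a surrogate pair).

Strings are a sequence of 16-bit integer values. Normally this would be a sequence of UTF-16 encoded characters.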

Unicode normalization

JavaScript simplifies normalized text handling by leaving it to others: source code is assumed to be in Unicode Normalised Form C, and "textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it." ( ECMA: The String Type )

However, there is also the function String.prototype.normalize() to convert strings to normal form.

Converting strings to and from UTF-8

In Node.js, the function socket.write() by default writes a string in UTF-8 format. For reading, the socket can have its encoding set by socket.setEncoding('utf8'), and then data read will be decoded from UTF-8 into a JavaScript string.

Converting between strings and byte arrays is discussed at Stack Overflow: How to convert UTF8 string to byte array?

Internationalized domain names

The Node.js punycode module has been deprecated. The recommendation is to use the userland punycode.js module instead, described as "A robust Punycode converter that fully complies to RFC 3492 and RFC 5891".

JavaScript Resources

Internationalization Support

Rust

Character and string representations

The Rust char type is 32 bits in size and holds a single Unicode scalar value. From Rust by Example: Strings:

A String is stored as a vector of bytes (Vec<u8>), but guaranteed to always be a valid UTF-8 sequence. String is heap allocated, growable and not null terminated. &str is a slice (&[u8]) that always points to a valid UTF-8 sequence, and can be used to view into a String, just like &[T] is a view into Vec<T>.

Unicode normalization

There are several crates on crates.io which offer Unicode normalization.

Converting strings to and from UTF-8

Rust strings have the function as_bytes() which returns a byte slice of the string's contents.

A slice of u8 can be converted to a string slice by str::from_utf8(). This returns a Result, which is an error if the bytes are not valid UTF-8.

Internationalized domain names

There are several crates on crates.io which offer conversion to IDN.

Rust Resources

Julia

Character and string representations

The Julia Char is 32 bits in size, and can represent all Unicode characters. The function Int(ch) will return the Unicode codepoint value, while the function Char(int) will convert a Unicode code point to a Char.

Strings are encoded in UTF-8 format. A string can be indexed by byte location, as in str[1]. But non-ASCII characters occupy two or more bytes, so an index that falls in the middle of a character is invalid and throws an error.

The length of a string length(s) is the number of characters it contains, which may be less than the number of bytes. However, a string is an iterable object, so you can loop through all the characters in a string by


      for c in s
           println(c)
      end

Unicode normalization

Julia Unicode strings can be normalized using the function


      Unicode.normalize(s::AbstractString, normalform::Symbol)

where the normalform is one of :NFC, :NFD, :NFKC, or :NFKD.

Converting strings to and from UTF-8

Julia strings are already in UTF-8 form.

Reading and writing UTF-8 strings

Julia reads and writes in UTF-8 anyway.

Internationalized domain names

This does not appear to have been dealt with yet. There is a Punycoder.jl which should do it. There is also a post on GitHub, stdlib/Sockets: `getaddrinfo("☃.net")` non-ASCII hostname (RFC 3492), suggesting in 2018 that upstream libuv would do this automatically, but it hasn't flowed through to Julia on my machine yet.

Julia Resources

Strings

Copyright © Jan Newmarch, jan@newmarch.name

" Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .
Based on a work at https://jan.newmarch.name/NetworkProgramming/ .
