Text: Characters and Strings

General

Introduction

Text occurs in multiple ways in a computer system. Text can be stored in files, be used in the programs of particular programming languages, appear as the input or output to programs or be sent between programs, either on the same host or on different hosts.

Once upon a time ASCII and EBCDIC were the predominant forms for almost all of these. EBCDIC has pretty much disappeared, but even ASCII is showing problems.

ASCII (American Standard Code for Information Interchange) is oriented to the US version of English. It doesn't include, for example, the UK pound symbol '£'. It also doesn't include the various symbols of the European languages such as 'â' and 'ß'. There are various versions of ASCII which allow for some of these.

The ISO 8859 series defines wider sets of characters, and at one stage the ISO 8859-1 set was used as the 'standard' for the web.

But even these don't include the characters of Chinese, Thai, Arabic, Japanese and many other languages. There have been multiple ways of representing these, but fortunately all of these are giving way to Unicode.
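The limits of the older character sets can be checked directly. The following is a minimal Python sketch, not part of the original text; the helper name can_encode is made up for illustration. It tries to encode a few sample characters in ASCII, ISO 8859-1 and UTF-8 and reports which attempts succeed.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A plain letter, two Western European characters and a Chinese character.
samples = ["A", "£", "ß", "中"]

for ch in samples:
    # Code points are printed rather than the characters themselves so the
    # output is safe on terminals that are not configured for UTF-8.
    print(
        f"U+{ord(ch):04X}",
        "ascii:", can_encode(ch, "ascii"),
        "iso-8859-1:", can_encode(ch, "iso-8859-1"),
        "utf-8:", can_encode(ch, "utf-8"),
    )
```

Only 'A' fits in ASCII, '£' and 'ß' additionally fit in ISO 8859-1, and all four can be encoded in UTF-8.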

Unicode

Unicode attempts to represent all the characters of all the different human languages (scripts for Klingon and Tolkien's Elvish have even been proposed, although neither is part of the standard). Originally fewer than 64k such characters were considered, and the Basic Multilingual Plane (BMP) of these characters would fit into 2 bytes. Some languages such as Java set their character type as 2-byte integers.

Now at 143,696 graphic characters, Unicode 13.0 requires more than 2 bytes, and some more recent languages use 4 bytes to represent each character.

However, using 32 bits per character is generally wasteful of space, so there are several encodings, some of them more compact:

UTF-32: Uses the full 32 bits for every character
UTF-16: Uses 16 bits for each character in the BMP. Characters outside the BMP require 32 bits (a pair of 16-bit units)
UTF-8: Uses 8 bits for some of the characters (notably the ASCII ones) and 16, 24 or 32 bits for the rest, to give the complete set. UTF-8 is now the most popular format for Web pages and for transporting Web documents across the network
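The size differences between the three encodings can be seen by encoding single characters and counting bytes. This is a small Python sketch under stated assumptions: the function name encoded_sizes is hypothetical, and the little-endian codec variants are used only so that Python does not prepend a byte order mark to the result.

```python
def encoded_sizes(ch: str) -> dict:
    """Number of bytes needed for one character in each Unicode encoding."""
    # The "-le" codecs fix the byte order, so no byte order mark is added
    # and the lengths reflect the character alone.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# An ASCII letter, a Latin-1 letter, a BMP Chinese character, and an emoji
# (U+1F600) that lies outside the Basic Multilingual Plane.
for ch in ["A", "ß", "中", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

The ASCII letter takes 1, 2 and 4 bytes respectively, while the character outside the BMP takes 4 bytes in all three encodings.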

Unicode normalisation

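Some characters do not have a unique representation in Unicode, having both single-character and double-character representations. For example, 'é' can be the single code point U+00E9 or the letter 'e' followed by the combining acute accent U+0301. Two strings that look identical may therefore fail a character-by-character comparison unless both are first converted to the same form. There are 4 'normal' forms that strings can be converted to (NFC, NFD, NFKC and NFKD), and NFC is the most commonly used one.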
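The effect of normalisation on comparison can be sketched with Python's standard unicodedata module. The helper same_text is a hypothetical name introduced here, and choosing NFC as the common form is an assumption that follows the remark above about NFC being the most commonly used.

```python
import unicodedata

composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# A direct comparison sees different code point sequences.
print(composed == decomposed)

# After normalisation the two spellings compare as equal.
print(same_text(composed, decomposed))
```

The first comparison fails because the strings have different lengths (1 and 2 code points); the second succeeds because NFC combines the base letter and the accent into U+00E9.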

Internationalized domain names

The name geschäft.com, with IDN form xn--geschft-9wa.com, resolves to a domain which is for sale (for $2,795!), but never mind: it is a good test for IDNs. For more details on IDNs, see Internet in All Languages: Internationalized Domain Names.
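The conversion between the two forms of this name can be tried with Python's built-in "idna" codec. This is only a sketch: the function names to_ascii and to_unicode are made up for illustration, and the built-in codec follows the older IDNA 2003 rules (RFC 3490), so some names, for instance those containing 'ß', are handled differently from the newer IDNA 2008 rules.

```python
def to_ascii(name: str) -> bytes:
    """Convert a Unicode domain name to its ASCII (xn--) form."""
    return name.encode("idna")


def to_unicode(ascii_name: bytes) -> str:
    """Convert an ASCII (xn--) domain name back to Unicode."""
    return ascii_name.decode("idna")


name = "geschäft.com"
ascii_name = to_ascii(name)

# The ASCII form is what is actually sent in DNS queries.
print(ascii_name.decode("ascii"))

# Converting back recovers the original name.
print(to_unicode(ascii_name) == name)
```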

General Resources

ICU - International Components for Unicode: "ICU is a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support for software applications"
The Unicode® Standard Version 13.0 – Core Specification

Copyright © Jan Newmarch, jan@newmarch.name

"Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://jan.newmarch.name/NetworkProgramming/.
