Text occurs in multiple ways in a computer system. Text can be stored in files, be used in the programs of particular programming languages, appear as the input or output to programs or be sent between programs, either on the same host or on different hosts.
Once upon a time ASCII and EBCDIC were the predominant forms for almost all of these. EBCDIC has pretty much disappeared, but even ASCII is showing problems.
ASCII (American Standard Code for Information Interchange) is oriented to the US version of English. It doesn't include, for example, the UK pound symbol '£'. It also doesn't include the various symbols of the European languages such as 'â' and 'ß'. There are various versions of ASCII which allow for some of these.
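As a small illustration of this limit, the following Python sketch checks whether a character can be encoded as ASCII. The helper name `is_ascii_encodable` is made up for this example.

```python
def is_ascii_encodable(ch: str) -> bool:
    """Return True if ch can be represented in 7-bit ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

# 'A' and '$' are in ASCII; the pound sign and accented letters are not.
for ch in ["A", "$", "£", "â", "ß"]:
    print(repr(ch), is_ascii_encodable(ch))
```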
The ISO 8859 series defines wider sets of characters, and at one stage the ISO 8859-1 set was used as the 'standard' for the web.
But even these don't include the characters of Chinese, Thai, Arabic, Japanese and many other languages. There have been multiple ways of representing these, but fortunately all of these are giving way to Unicode.
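A short Python sketch of what ISO 8859-1 can and cannot hold: the Western European symbols mentioned above each fit in one byte, while a Chinese or Thai character has no encoding at all. The helper `try_latin1` is a hypothetical name used only here.

```python
def try_latin1(ch: str):
    """Return the ISO 8859-1 byte for ch, or None if the set has no such character."""
    try:
        return ch.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None

# The first three are in ISO 8859-1; the Chinese and Thai characters are not.
for ch in ["£", "â", "ß", "中", "ก"]:
    print(repr(ch), try_latin1(ch))
```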
Unicode attempts to represent all the characters of all the different human languages (and there have even been proposals for Klingon and Tolkien's Elvish!). Originally fewer than 64k such characters were considered, and the Basic Multilingual Plane (BMP) of these characters fits into 2 bytes. Some languages, such as Java, set their character type as a 2-byte integer.
Now at 143,696 graphic characters, Unicode 13.0 requires more than 2 bytes, and some more recent languages use 4 bytes to represent each character.
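To make the 2-byte limit concrete, here is a minimal Python sketch that reports whether a character lies in the BMP and how many 16-bit units it needs in UTF-16. Characters outside the BMP need two such units (a surrogate pair), which is why a 2-byte character type cannot hold them in one value. The function name `utf16_units` is an assumption of this example.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode ch in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2

for ch in ["A", "€", "😀"]:
    code_point = ord(ch)
    in_bmp = code_point <= 0xFFFF
    print(f"U+{code_point:04X} in_bmp={in_bmp} utf16_units={utf16_units(ch)}")
```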
However, using 32 bits per character is generally wasteful of space, so there are more compact encodings such as UTF-8 and UTF-16.
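The space cost can be measured directly. The following Python sketch compares the encoded size of a string in UTF-8 and UTF-16 (two widely used compact encodings) against the fixed 4 bytes per character of UTF-32. The function name `encoded_sizes` is made up for this example.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes text occupies in each encoding (no byte order mark)."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

print(encoded_sizes("hello"))    # 5, 10 and 20 bytes respectively
print(encoded_sizes("日本語"))    # 9, 6 and 12 bytes respectively
```

For plain English text UTF-8 is a quarter of the size of UTF-32, while for the Japanese sample UTF-16 is the smallest of the three.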
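Some characters do not have a unique representation in Unicode: they can be written either as a single character or as a sequence of two characters (a base character followed by a combining mark). For example, 'é' can be the single character U+00E9 or 'e' followed by the combining acute accent U+0301. As a result, comparing two strings cannot reliably be done using a character by character comparison. There are 4 'normal' forms that strings can be converted to (NFC, NFD, NFKC and NFKD), and NFC is the most commonly used one.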
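A minimal Python sketch of the problem and of normalization as the fix, using the standard `unicodedata` module. The variable names and the helper `nfc_equal` are assumptions of this example.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after converting both to normal form NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: the code point sequences differ
print(nfc_equal(composed, decomposed))  # True: both normalize to the same string
```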
The name geschäft.com, with IDN form xn--geschft-9wa.com, resolves to a domain which is for sale (for $2,795!). Never mind: it is still a good test for IDNs.
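The conversion between the two forms of this name can be sketched in Python with the built-in `idna` codec. Note that this codec implements the older IDNA 2003 rules, so treat it as an illustration rather than a complete IDN implementation; the variable names are made up for this example.

```python
name = "geschäft.com"

# Encode each label of the name into its ASCII-compatible (Punycode) form.
ascii_form = name.encode("idna")
print(ascii_form)    # b'xn--geschft-9wa.com'

# Decoding recovers the original Unicode name.
round_trip = ascii_form.decode("idna")
print(round_trip)
```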
For more details on IDNs, see
Internet in All Languages: Internationalized Domain Names.
Copyright © Jan Newmarch, jan@newmarch.name
"Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://jan.newmarch.name/NetworkProgramming/.