Text occurs in multiple ways in a computer system. Text can be stored in files, be used in the programs of particular programming languages, appear as the input or output to programs or be sent between programs, either on the same host or on different hosts.
Once upon a time ASCII and EBCDIC were the predominant forms for almost all of these. EBCDIC has pretty much disappeared, but even ASCII is showing problems.
ASCII (American Standard Code for Information Interchange) is oriented to the US version of English. It doesn't include, for example, the UK pound symbol '£'. It also doesn't include the various symbols of the European languages such as 'â' and 'ß'. There are various versions of ASCII which allow for some of these.
Wider sets of characters are the ISO8859 series, and at one stage the ISO8859-1 set was used as the 'standard' for the web.
But even these don't include the characters of Chinese, Thai, Arabic, Japanese and many other languages. There have been multiple ways of representing these, but fortunately all of these are giving way to Unicode.
Unicode attempts to represent all the characters of all the different human languages (and Klingon and Tolkien's Elvish!). Originally there were fewer than 64k such characters considered, and the Basic Multilingual Plane (BMP) of these characters would fit into 2 bytes. Some languages such as Java set their character type as 2-byte integers.
Now at 143,696 graphic characters, Unicode 13.0 requires more than 2 bytes, and some more recent languages use 4 bytes to represent each character.
However, using 32 bits per character is generally wasteful of space, so there are more compact encodings such as UTF-8 and UTF-16.
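As a rough illustration of the space trade-off, the following Python sketch compares how many bytes a few sample characters take in UTF-8, UTF-16 and UTF-32. The sample characters and the helper name `encoded_sizes` are chosen here for illustration only.

```python
def encoded_sizes(ch):
    """Return the byte counts of one character in UTF-8, UTF-16 and UTF-32."""
    # The -le variants are used so that no byte order mark is added.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)

for ch in ["a", "£", "日", "😀"]:
    # Print the code point rather than the character itself.
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

ASCII characters take a single byte in UTF-8, while UTF-32 always spends four bytes, whatever the character.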
Some characters do not have a unique representation in Unicode: they can be written either as a single code point or as a base character followed by a combining character. Two strings that display identically therefore cannot reliably be compared character by character. There are 4 'normal' forms that strings can be converted to, and NFC is the most commonly used one.
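The comparison problem can be sketched in Python with the standard `unicodedata` module. The helper name `same_text` and the choice of 'â' as the sample character are assumptions made for this example.

```python
import unicodedata

composed = "\u00e2"       # 'â' as a single code point
decomposed = "a\u0302"    # 'a' followed by COMBINING CIRCUMFLEX ACCENT

def same_text(a, b, form="NFC"):
    """Compare two strings after converting both to the same normal form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(composed == decomposed)            # compares raw code point sequences
print(same_text(composed, decomposed))   # compares after normalizing to NFC
print(len(unicodedata.normalize("NFD", composed)))  # length once decomposed
```

The raw comparison fails even though both strings display as 'â'; normalizing both sides to the same form first makes them compare equal.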
The name geschäft.com, with IDN form xn--geschft-9wa.com, resolves to a domain which is for sale (for $2,795!), but never mind, it is a good test for IDNs.
For more details on IDNs, see Internet in All Languages: Internationalized Domain Names.
The Java char type is a 2-byte integer. It can only represent characters from the Basic Multilingual Plane (BMP).
The String type is a sequence of characters, each of 2 bytes. To represent characters outside of the BMP, you need to use a string of two Java chars, one for the high surrogate and the other for the low surrogate.
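The arithmetic behind a surrogate pair is the same in every language, so it is sketched here in Python rather than Java. The function name `surrogate_pair` is hypothetical; the constants come from the UTF-16 definition.

```python
def surrogate_pair(codepoint):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is inside the BMP or out of range")
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low

high, low = surrogate_pair(0x1F600)    # U+1F600 lies outside the BMP
print(hex(high), hex(low))
```

These are the two 16-bit values that a Java String would hold for this one character.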
The class Character can be used to construct an array of chars of length one or two, using the static method
char[] Character.toChars(int codepoint)
The Java class Normalizer can be used to convert a string to normalized form:
String normalized_string = Normalizer.normalize(target_chars, Normalizer.Form.NFD);
where the target_chars can be a String.
Strings can be converted to an array of UTF-8 bytes by
byte[] bytes = string.getBytes(StandardCharsets.UTF_8);
To convert the other way, use
String string = new String(bytes, StandardCharsets.UTF_8);
Java uses the default charset for many classes converting between strings and bytes. Its value can be seen by
Charset.defaultCharset().displayName();
The page Guide to Character Encoding claims that on macOS this will be UTF-8 while for Windows systems it will be Windows-1252. To avoid errors, the relevant classes should be given the charset explicitly.
To read UTF-8 strings from an InputStream, wrap it in an InputStreamReader with the character encoding:
new InputStreamReader(inputStream, StandardCharsets.UTF_8)
To write UTF-8 strings to an OutputStream, wrap it in an OutputStreamWriter with the character encoding:
new OutputStreamWriter(outputStream, StandardCharsets.UTF_8)
Other relevant classes are dealt with similarly.
A revised EchoClient using this has only a few lines changed: EchoClient.java:
import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class EchoClient {

    public static final int SERVER_PORT = 2000;

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: Client address");
            System.exit(1);
        }

        InetAddress address = null;
        try {
            address = InetAddress.getByName(args[0]);
        } catch (UnknownHostException e) {
            e.printStackTrace();
            System.exit(2);
        }

        Socket sock = null;
        try {
            sock = new Socket(address, SERVER_PORT);
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(3);
        }

        InputStream in = null;
        try {
            in = sock.getInputStream();
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(4);
        }

        OutputStream out = null;
        try {
            out = sock.getOutputStream();
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(5);
        }

        // The socket streams are explicitly wrapped as UTF-8.
        // PrintStream(OutputStream, boolean, Charset) needs Java 10 or later.
        BufferedReader socketReader =
            new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        PrintStream socketWriter = new PrintStream(out, false, StandardCharsets.UTF_8);

        // The console is still read using the platform default charset.
        BufferedReader consoleReader =
            new BufferedReader(new InputStreamReader(System.in));

        String line = null;
        while (true) {
            line = null;
            try {
                System.out.print("Enter line:");
                line = consoleReader.readLine();
                System.out.println("Read '" + line + "'");
            } catch (IOException e) {
                e.printStackTrace();
                System.exit(6);
            }

            // readLine() returns null at end of input, so check before equals()
            if (line == null || line.equals("BYE"))
                break;

            try {
                socketWriter.println(line);
            } catch (Exception e) {
                e.printStackTrace();
                System.exit(7);
            }

            try {
                System.out.println(socketReader.readLine());
            } catch (IOException e) {
                e.printStackTrace();
                System.exit(8);
            }
        }
        System.exit(0);
    }
} // EchoClient
To convert a hostname to an IDN name, use
String IDN.toASCII(hostname)
To convert back, use
String IDN.toUnicode(hostname)
In Go, a character is represented by a rune, which is an alias for int32. A rune holds a Unicode code point; it is strings, not individual runes, that are stored in UTF-8 format.
A string is a sequence of bytes. Usually it is used to hold text in UTF-8 format. This means it can be accessed in two ways, byte by byte or rune by rune:
for i := 0; i < len(str); i++ {
fmt.Printf("%x starts at byte position %d\n", str[i], i)
}
with output
e6 starts at byte position 0
97 starts at byte position 1
a5 starts at byte position 2
e6 starts at byte position 3
9c starts at byte position 4
ac starts at byte position 5
e8 starts at byte position 6
aa starts at byte position 7
9e starts at byte position 8
for index, runeValue := range str {
fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
with output
U+65E5 '日' starts at byte position 0
U+672C '本' starts at byte position 3
U+8A9E '語' starts at byte position 6
Normalization is done using the package norm, imported by import "golang.org/x/text/unicode/norm". For example, to normalize a byte slice, use norm.NFC.Bytes(b).
Strings will usually contain characters encoded in UTF-8.
The UTF-8 bytes will be given by treating the string as an array/slice of bytes.
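A slice of UTF-8 bytes can be converted to a string by a conversion:
string([]byte)
The conversion copies the bytes without validating them. If Go later cannot decode some of the bytes as UTF-8, for example in a range loop over the string, it yields the Unicode Replacement Character \uFFFD for them.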
Nothing special has to be done to read or write UTF-8 strings.
Go has the package golang.org/x/net/idna with functions ToASCII() and ToUnicode().
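Python does not have a char type: single characters are strings of length one. In Python 3, strings are sequences of Unicode code points; how they are stored internally is an implementation detail, and UTF-8 only enters when a string is encoded to bytes. See Unicode HOWTO.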
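A short sketch of what this means in practice, using the same sample text '日本語' that appears in the byte and rune listings earlier in this chapter:

```python
s = "日本語"

print(len(s))                    # number of code points in the string
print(len(s.encode("utf-8")))    # number of bytes once encoded as UTF-8

# Indexing and iteration work on code points, not on bytes.
for index, ch in enumerate(s):
    print(index, ch, f"U+{ord(ch):04X}")

# ord() and chr() convert between a one-character string and its code point.
print(chr(0x65E5) == s[0])
```

Here the indices run over characters, so they do not match the byte positions a UTF-8 byte view would give.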
Normalization can be done using the unicodedata.normalize() function:
unicodedata.normalize('NFD', s)
Strings hold Unicode code points, so they have to be encoded to obtain UTF-8 bytes.
To encode a string to an array of bytes use
str.encode('utf-8')
To decode an array of bytes to a string use
str(bytes, encoding='utf-8')
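A minimal round trip, with a look at what happens when the bytes are not valid UTF-8. The sample word and the ISO 8859-1 byte are chosen for illustration; `bytes.decode()` is used as an equivalent of the `str(bytes, encoding=...)` form.

```python
text = "geschäft"

data = text.encode("utf-8")          # str -> bytes
print(data)
print(data.decode("utf-8") == text)  # bytes -> str gives back the original

# 0xE4 is 'ä' in ISO 8859-1, but it does not form valid UTF-8 here.
bad = b"gesch\xe4ft"
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)

# errors="replace" substitutes U+FFFD instead of raising an exception.
print(bad.decode("utf-8", errors="replace"))
```

Decoding is strict by default, which is usually what you want for network input: a wrong charset shows up as an error rather than as silently corrupted text.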
The module encodings.idna has functions ToASCII() and ToUnicode(), which convert a single label of a domain name.
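For whole hostnames, a convenient alternative is Python's built-in "idna" codec, which implements the older IDNA 2003 rules. The sketch below uses the codec rather than the functions named above, with the test name used earlier in this chapter.

```python
hostname = "geschäft.com"

# Encoding with the "idna" codec converts each non-ASCII label to its
# ASCII-compatible "xn--" form and returns bytes.
ascii_name = hostname.encode("idna")
print(ascii_name)

# Decoding with the same codec converts back to the Unicode form.
print(ascii_name.decode("idna"))
```

The ASCII form is what is actually looked up in the DNS.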
In JavaScript, characters are (normally) represented as 2-byte UTF-16 integers, covering the BMP subset of Unicode.
Strings are a sequence of 16-bit integer values. Normally this would be a sequence of UTF-16 encoded characters.
JavaScript simplifies normalized text handling by leaving it to others: source code is assumed to be in Unicode Normalised Form C, and "textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it." ( ECMA: The String Type )
However, there is also the function String.prototype.normalize() to convert strings to normal form.
The function socket.write() by default writes a string in UTF-8 format. For reading, the socket can have its encoding set by
socket.setEncoding('utf8')
and then data read will be decoded from UTF-8 into JavaScript strings.
Converting between strings and byte arrays has been discussed on Stack Overflow at How to convert UTF8 string to byte array?
The Node.js punycode module has been deprecated; the recommendation is to use the userland punycode.js module instead, described as "A robust Punycode converter that fully complies to RFC 3492 and RFC 5891".
The Rust char type is 32 bits in size.
From Rust by Example: Strings:
A String is stored as a vector of bytes (Vec<u8>), but guaranteed to always be a valid UTF-8 sequence. String is heap allocated, growable and not null terminated. &str is a slice (&[u8]) that always points to a valid UTF-8 sequence, and can be used to view into a String, just like &[T] is a view into Vec<T>.
There are several crates on crates.io which offer Unicode normalization.
Rust strings have the function as_bytes() which returns a byte slice of the string's contents.
An array of u8 (presumably of UTF-8 bytes) can be converted to a string by str::from_utf8().
There are several crates on crates.io which offer conversion to IDN.
The Julia Char is 32 bits in size, and can represent all Unicode characters. The function Int(ch) will return the Unicode code point value, while the function Char(int) will convert a Unicode code point to a Char.
Strings are encoded in UTF-8 format. Strings treated as arrays of bytes can be indexed by byte location as in str[1]. But non-ASCII characters occupy two or more bytes, so many indices fall in the middle of a character, and these are invalid and throw an error.
The length of a string length(s) is the number of characters it contains, which may be less than the number of bytes.
However, a string is an iterable object, so you can loop through all the
characters in a string by
for c in s
    println(c)
end
Julia Unicode strings can be normalized using the function
Unicode.normalize(s::AbstractString, normalform::Symbol)
where the normalform is one of :NFC, :NFD, :NFKC, or :NFKD.
Julia strings are already in UTF-8 form.
Julia reads and writes in UTF-8 anyway.
IDN conversion does not appear to have been dealt with yet in Julia's standard library.
There is a Punycoder.jl package which should do it. There is also an issue on GitHub, stdlib/Sockets: `getaddrinfo("☃.net")` non-ASCII hostname (RFC 3492), suggesting that in 2018 upstream libuv will do this automatically, but it hasn't flowed through to Julia on my machine yet.
Copyright © Jan Newmarch, jan@newmarch.name
"Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://jan.newmarch.name/NetworkProgramming/.