Text: Characters and Strings

JavaScript

Character and string representations

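JavaScript has no separate character type. A character is normally represented as a single 16-bit UTF-16 code unit, which covers only the Basic Multilingual Plane (BMP) subset of Unicode. A character outside the BMP takes two code units (a surrogate pair).

A string is a sequence of 16-bit unsigned integer values. Normally these are UTF-16 code units, but the language does not require the sequence to be well-formed UTF-16.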
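The difference between 16-bit values and characters shows up as soon as a string contains a character outside the BMP. The following is a minimal sketch; the helper name describeString is made up for this illustration and is not part of any library.

```javascript
// Hypothetical helper: count the 16-bit code units and the Unicode
// code points in a string.
function describeString(s) {
  return {
    codeUnits: s.length,              // length counts 16-bit values
    codePoints: Array.from(s).length, // the string iterator walks by code point
  };
}

const bmp = '\u00E9';       // U+00E9, inside the BMP
const astral = '\u{1F600}'; // U+1F600, outside the BMP

console.log(describeString(bmp));    // { codeUnits: 1, codePoints: 1 }
console.log(describeString(astral)); // { codeUnits: 2, codePoints: 1 }

// charCodeAt() sees one half of the surrogate pair,
// codePointAt() sees the whole character.
console.log(astral.charCodeAt(0).toString(16));  // d83d
console.log(astral.codePointAt(0).toString(16)); // 1f600
```

Code that indexes or slices a string by position works on code units, so it can split a surrogate pair in two.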

Unicode normalization

JavaScript simplifies normalized text handling by leaving it to others: source code is assumed to be in Unicode Normalised Form C, and the intent is that "textual data coming into the execution environment from outside (e.g., user input, text read from a file or received over the network, etc.) be converted to Unicode Normalised Form C before the running program sees it" (ECMA: The String Type).

However, the function String.prototype.normalize() is also available to convert a string to a chosen normal form (NFC by default).
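As a small illustration of why normalization matters, the sketch below compares two spellings of the same visible character. The helper sameText is a hypothetical name used only here.

```javascript
const composed = '\u00E9';    // "é" as a single code point
const decomposed = 'e\u0301'; // "e" followed by a combining acute accent

// The two strings look the same but hold different 16-bit sequences.
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize('NFC')); // true
console.log(composed.normalize('NFD') === decomposed); // true

// Hypothetical helper: compare two strings after bringing both to NFC.
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(sameText(composed, decomposed)); // true
```

Text that arrives from the network may not be in Form C, so normalizing before comparing or storing is a reasonable precaution.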

Converting strings to and from UTF-8

The function socket.write() by default writes a string in UTF-8 format. For reading, the socket can have its encoding set by socket.setEncoding('utf8'), and data read will then be decoded from UTF-8 into JavaScript strings.

Converting between strings and byte arrays is discussed on Stack Overflow in "How to convert UTF8 string to byte array?".
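The following sketch shows both directions over a loopback TCP connection using the Node.js net module. The function name roundTrip, the choice of an ephemeral port on 127.0.0.1 and the sample text are assumptions made for this illustration.

```javascript
const net = require('node:net');

// Hypothetical helper: send `text` to a throwaway server on the loopback
// interface and resolve with the string the server decoded.
function roundTrip(text) {
  return new Promise((resolve, reject) => {
    const server = net.createServer((socket) => {
      socket.setEncoding('utf8'); // 'data' events now deliver strings, not Buffers
      let received = '';
      socket.on('data', (chunk) => { received += chunk; });
      socket.on('end', () => {
        server.close();           // stop listening so the process can exit
        resolve(received);
      });
    });
    server.on('error', reject);

    // Port 0 asks the operating system for any free port.
    server.listen(0, '127.0.0.1', () => {
      const { port } = server.address();
      const client = net.connect(port, '127.0.0.1', () => {
        // end(data) writes the string (UTF-8 by default) and then closes.
        client.end(text);
      });
      client.on('error', reject);
    });
  });
}

roundTrip('caf\u00E9 \u{1F600}').then((received) => {
  console.log(received === 'caf\u00E9 \u{1F600}');
});
```

Setting the encoding on the socket, rather than calling toString() on each Buffer, also takes care of multi-byte characters that happen to be split across two chunks.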
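One way to do the conversion, sketched here with the standard TextEncoder and TextDecoder classes (available as globals in current Node.js and in browsers), is shown below. The helper names toUtf8Bytes and fromUtf8Bytes are made up for this example, and this is not necessarily the approach taken in the linked discussion.

```javascript
const encoder = new TextEncoder();        // always encodes to UTF-8
const decoder = new TextDecoder('utf-8');

// Hypothetical helpers wrapping the two directions.
function toUtf8Bytes(s) {
  return encoder.encode(s);               // returns a Uint8Array
}

function fromUtf8Bytes(bytes) {
  return decoder.decode(bytes);           // returns a JavaScript string
}

const bytes = toUtf8Bytes('\u00E9');      // "é"
console.log(Array.from(bytes));           // [ 195, 169 ]
console.log(fromUtf8Bytes(bytes) === '\u00E9'); // true

// A character outside the BMP is two UTF-16 code units but four UTF-8 bytes.
console.log(toUtf8Bytes('\u{1F600}').length);   // 4
```

In Node.js, Buffer.from(s, 'utf8') and buf.toString('utf8') perform the same conversion with Buffer objects.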

Internationalized domain names

The Node.js punycode module has been deprecated. The recommendation is to use the userland punycode.js module instead, described as "A robust Punycode converter that fully complies to RFC 3492 and RFC 5891".
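The sketch below does not use punycode.js, since that module has to be installed separately. It swaps in the functions url.domainToASCII() and url.domainToUnicode() from the built-in Node.js url module, which convert whole domain names to and from their Punycode form. The domain name is only an example.

```javascript
const url = require('node:url');

const unicodeName = 'espa\u00F1ol.com'; // "español.com"

// Convert to the ASCII form used on the wire.
const asciiName = url.domainToASCII(unicodeName);
console.log(asciiName); // xn--espaol-zwa.com

// Convert back to the form shown to users.
console.log(url.domainToUnicode(asciiName) === unicodeName); // true
```

Both functions return an empty string when given an invalid domain, so the result should be checked before it is used.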

JavaScript Resources

Internationalization Support

Copyright © Jan Newmarch, jan@newmarch.name

" Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .
Based on a work at https://jan.newmarch.name/NetworkProgramming/ .
