unicode

terminology

The word "character" is overloaded and may refer to different things. https://utf8everywhere.org/#characters

When a programming language or a library documentation says ‘character’, it typically means a code unit.

When end users are asked about the number of characters in a string, they count the user-perceived characters.

The Unicode Standard uses the word character as a synonym for coded character.

For example, U+1F428 is a coded character which represents the abstract character 🐨 koala.
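The difference between these meanings shows up as soon as a string is counted. The following is a minimal Python sketch; the helper name describe is made up for illustration. It counts the same text in code points, UTF-8 code units and UTF-16 code units. Counting user-perceived characters needs the grapheme cluster rules, which the Python standard library does not implement, so that count is only noted in a comment.

```python
import unicodedata


def describe(text: str) -> dict:
    """Count the same string in three different units (hypothetical helper)."""
    return {
        # A Python str is a sequence of code points.
        "code_points": len(text),
        # UTF-8 code units are 8 bits wide, so one byte each.
        "utf8_code_units": len(text.encode("utf-8")),
        # UTF-16 code units are 16 bits wide, so two bytes each.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


koala = "\U0001F428"   # the coded character U+1F428
e_acute = "e\u0301"    # "e" + COMBINING ACUTE ACCENT: one user-perceived character

print(unicodedata.name(koala), describe(koala))
print(describe(e_acute))
```

The koala is one code point but four UTF-8 code units and two UTF-16 code units, while the accented "e" is one user-perceived character made of two code points.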

UTF-8 (encoding)

unicode normalization

TLDR

There are two standards to have in mind when talking about Unicode:

The Universal Coded Character Set (UCS) is a standard set of characters defined by the international standard ISO/IEC 10646.

Since 1991, the Unicode Consortium and the ISO/IEC have developed The Unicode Standard and ISO/IEC 10646 in tandem.

The repertoire, character names, and code points of The Unicode Standard Version 2.0 exactly match those of ISO/IEC 10646-1:1993.

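From ISO/IEC 10646-1:1993 plus Amendments 5 to 7 (equivalent to Unicode 2.0) onwards, the codespace is limited to 17 planes, U+0000 to U+10FFFF. That is 1,114,112 code points, of which 1,112,064 can be encoded once the 2,048 surrogate code points are excluded. The restriction exists because this is the range that UTF-16 can reach.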
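The 1,112,064 figure can be checked with a little arithmetic. The sketch below uses only the plane size, the surrogate range U+D800 to U+DFFF, and the fact that a UTF-16 surrogate pair carries 20 bits; the variable names are illustrative.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000       # 65,536 code points per plane
SURROGATES = 0xDFFF - 0xD800 + 1      # 2,048 code points reserved for UTF-16

# Whole codespace, U+0000..U+10FFFF.
codespace = PLANES * CODE_POINTS_PER_PLANE

# Code points that can actually be encoded (surrogates excluded).
scalar_values = codespace - SURROGATES

# A surrogate pair carries 10 + 10 bits, so UTF-16 reaches
# 2**20 code points above the Basic Multilingual Plane.
highest_code_point = 0x10000 + 2**20 - 1

print(codespace, scalar_values, hex(highest_code_point))
```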

The same characters with the same numbers exist in both standards, although The Unicode Standard releases new versions and adds new characters more often.

The Unicode Standard adds rules for collation, normalisation forms, and the bidirectional algorithm for right-to-left scripts.

If bidirectional scripts are used, it is not enough to support ISO/IEC 10646; The Unicode Standard must be implemented.
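As an illustration of one of the rules The Unicode Standard adds on top of the character set, here is a minimal sketch of normalisation using Python's unicodedata module. The helper name same_text is made up for illustration, and the choice of NFC for the comparison is an assumption; other normalisation forms may suit other uses.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # compared code point by code point
print(same_text(composed, decomposed))  # compared after normalisation
```

Both strings render the same, but they only compare equal once they are brought to the same normalisation form.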

The latest version of ISO/IEC 10646: ISO/IEC 10646:2021 (this is in sync with The Unicode Standard Version 14.0)

The latest version for The Unicode Standard: The Unicode Standard Version 15.1

What to implement

  1. Use the character set that is defined in the latest ISO/IEC 10646 version.
  2. Use the UTF-8 encoding for the external encoding.
  3. For internal representation probably use the UTF-8 encoding.
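A minimal sketch of the external-encoding rule in the list above, in Python. The function names to_external and from_external are hypothetical; the point is only that text crosses the program boundary as UTF-8 and that malformed input is rejected rather than guessed at.

```python
def to_external(text: str) -> bytes:
    """Encode text as UTF-8 before it leaves the program (hypothetical helper)."""
    return text.encode("utf-8")


def from_external(data: bytes) -> str:
    """Decode incoming bytes as UTF-8; malformed input raises UnicodeDecodeError."""
    return data.decode("utf-8", errors="strict")


message = "h\u00e9llo \U0001F428"
wire = to_external(message)
print(wire, from_external(wire) == message)
```

Strict decoding is a design choice here: it surfaces bad input at the boundary instead of letting replacement characters spread through the program.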

There are several encodings (mappings between code points and bytes):

UCS-2 is obsolete terminology (this term should now be avoided!) which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard.

UCS-4. UCS-4 stands for "Universal Character Set coded in 4 octets." It is now treated simply as a synonym for UTF-32, and is considered the canonical form for representation of characters in 10646.

UCS-2. UCS-2 stands for "Universal Character Set coded in 2 octets" and is also known as "the two-octet BMP form." It was documented in earlier editions of 10646 as the two-octet (16-bit) encoding consisting only of code positions for plane zero, the Basic Multilingual Plane. This documentation has been removed from ISO/IEC 10646:2011 and subsequent editions, and the term UCS-2 should now be considered obsolete. It no longer refers to an encoding form in either 10646 or the Unicode Standard.
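What UCS-2 could not express is a code point outside plane zero. UTF-16 handles those with a surrogate pair, which the following sketch computes by hand; the function name utf16_code_units is made up for illustration, and the result is compared with Python's own codecs.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Basic Multilingual Plane: one 16-bit unit, same as the old UCS-2 form.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


koala = 0x1F428
print([hex(unit) for unit in utf16_code_units(koala)])
print(chr(koala).encode("utf-16-be").hex())   # the same two units, big-endian
print(chr(koala).encode("utf-32-be").hex())   # UCS-4 / UTF-32: one 32-bit unit
```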

Byte order

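UTF-8 has no byte-order issue: its code units are single bytes, and the standard fixes the order of the bytes within a multi-byte sequence, so the serialized form is the same on every platform.

The UTF-16 and UTF-32 encoding forms do not specify a byte order. The byte order is fixed by the encoding scheme: UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE name it explicitly, while the plain UTF-16 and UTF-32 schemes rely on a byte order mark (BOM) and are big-endian when no BOM is present.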

Encoding forms: UTF-8, UTF-16, or UTF-32

Encoding schemes: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE, UTF-32LE, or UTF-32
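The difference between the schemes can be seen by serializing one code point in several of them. This is a small Python sketch; note that Python's "utf-16" codec writes a BOM followed by the platform's native byte order, which is a property of that codec and is why the example only inspects the BOM rather than assuming an order.

```python
import codecs

euro = "\u20ac"   # EURO SIGN, U+20AC

utf8 = euro.encode("utf-8")           # byte-oriented, no byte-order variants
utf16_be = euro.encode("utf-16-be")   # most significant byte first
utf16_le = euro.encode("utf-16-le")   # least significant byte first
utf16_bom = euro.encode("utf-16")     # BOM first, then native byte order

print(utf8.hex(), utf16_be.hex(), utf16_le.hex(), utf16_bom.hex())

# A decoder for the plain "utf-16" scheme reads the BOM to pick the byte order.
print((codecs.BOM_UTF16_BE + utf16_be).decode("utf-16") == euro)
print((codecs.BOM_UTF16_LE + utf16_le).decode("utf-16") == euro)
```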

Comparison of programming languages

C

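A string is defined as a sequence of bytes terminated by the null character.

For historical reasons, C's string functions treat a string as a sequence of single-byte characters; the language itself does not say which encoding those bytes are in.

char is at least 8 bits wide, and exactly 8 bits on POSIX systems. UTF-8 encodes the first 128 code points (ASCII) as one byte each, identical to ASCII, and uses no zero byte for any code point other than U+0000, so UTF-8 text is valid in a C string.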
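The two properties that make UTF-8 safe in a C string can be checked from Python. This is a sketch only: it uses ctypes.create_string_buffer to build a null-terminated char buffer, and the variable names are illustrative.

```python
import ctypes

text = "na\u00efve \U0001F428"   # "naïve" plus a koala, 7 code points
encoded = text.encode("utf-8")

# ASCII text has the same bytes in ASCII and in UTF-8.
ascii_unchanged = "naive".encode("utf-8") == "naive".encode("ascii")

# No zero byte appears, so the null terminator stays unambiguous.
has_nul = 0 in encoded

# Every byte of a non-ASCII character has the high bit set,
# so it can never be mistaken for an ASCII character.
non_ascii_bytes_high = all(
    b >= 0x80 for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")
)

# A char buffer as C would see it: the UTF-8 bytes plus a trailing null byte.
c_string = ctypes.create_string_buffer(encoded)
print(len(text), len(encoded), ctypes.sizeof(c_string), c_string.value == encoded)
```

Length is the catch: a byte-counting function such as strlen would report the number of bytes here, not the number of code points.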

Glibc

wchar_t is always defined as 32 bits wide.

The wide character set is always UCS-4 in the GNU C Library.

The libunistring library provides functions for manipulating Unicode strings and for manipulating C strings according to the Unicode standard. https://www.gnu.org/software/libunistring/manual/libunistring.html

Go

A string is defined as a sequence of bytes. It can hold arbitrary bytes, so it is not required to be valid UTF-8. With the \xNN notation (00 to FF) a string literal can contain any byte value.

Source code in Go is defined to be UTF-8 encoded; thus string literals will be coded as UTF-8.

rune is an alias for int32; it is used to make clear that an integer value represents a code point. (A rune holds the code point value itself, so no encoding is implied.)
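The same model can be sketched in Python as an analogy, not as Go code: bytes plays the role of a Go string (arbitrary bytes, UTF-8 only by convention) and the integers returned by ord() play the role of runes. The helper name is_valid_utf8 is made up for illustration; Go's own counterpart lives in its unicode/utf8 package.

```python
raw = b"a\xffb"   # like the Go literal "a\xffb": three bytes, not valid UTF-8


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form well-formed UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Decoding yields code points; the invalid byte becomes U+FFFD,
# the replacement character.
code_points = [ord(ch) for ch in raw.decode("utf-8", errors="replace")]

print(len(raw), is_valid_utf8(raw), [hex(cp) for cp in code_points])
```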

Java

https://openjdk.org/jeps/254

JavaScript

https://mathiasbynens.be/notes/javascript-encoding

Python

https://peps.python.org/pep-0393/

https://peps.python.org/pep-0623/

https://tenthousandmeters.com/blog/python-behind-the-scenes-9-how-python-strings-work/

Rust

Lua