Module ende::utf8

UTF-8 encoding and decoding.

§Encoding

A Unicode code point is encoded in UTF-8 using one to four bytes, depending on its value.

A code point in the range 0x0000 to 0x007F is encoded using one byte.
A code point in the range 0x0080 to 0x07FF is encoded using two bytes.
A code point in the range 0x0800 to 0xFFFF is encoded using three bytes, except for the surrogates 0xD800 to 0xDFFF, which are not valid in UTF-8.
A code point in the range 0x10000 to 0x10FFFF is encoded using four bytes.
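
The ranges above can be sketched as a single lookup. This is a minimal illustration, not this module's implementation: the name `utf8_len` is hypothetical, and rejecting surrogates and out-of-range values with `None` is an assumption about how invalid input is handled.

```rust
/// Number of bytes needed to encode `cp` in UTF-8, or `None` if `cp`
/// cannot be encoded (a surrogate, or a value above 0x10FFFF).
/// Hypothetical helper, not part of this module.
fn utf8_len(cp: u32) -> Option<usize> {
    match cp {
        0x0000..=0x007F => Some(1),
        0x0080..=0x07FF => Some(2),
        // Surrogates sit inside the three-byte range and must be
        // checked before it, because the first matching arm wins.
        0xD800..=0xDFFF => None,
        0x0800..=0xFFFF => Some(3),
        0x1_0000..=0x10_FFFF => Some(4),
        _ => None,
    }
}
```

For instance, `utf8_len(0x20AC)` (the euro sign) evaluates to `Some(3)`, and `utf8_len(0xD800)` evaluates to `None`.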

§Decoding

A UTF-8 code point (the encoded byte sequence) is decoded into a Unicode code point by inspecting the leading bits of its first byte:

If the first bit of the first byte is 0, the Unicode code point is represented using one byte.
If the first three bits of the first byte are 110, the Unicode code point is represented using two bytes.
If the first four bits of the first byte are 1110, the Unicode code point is represented using three bytes.
If the first five bits of the first byte are 11110, the Unicode code point is represented using four bytes.

When a Unicode code point is represented using two, three or four bytes, every byte after the first is a continuation byte. Continuation bytes start with the bit pattern 10.
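
The prefix rules above amount to a few mask-and-compare tests. The sketch below is illustrative only: `sequence_len` and `is_continuation` are hypothetical names, and the check looks at the prefix bits alone, so it does not reject leading bytes such as 0xC0 or 0xF5 that carry a valid prefix but can never appear in well-formed UTF-8.

```rust
/// Sequence length announced by a leading byte, or `None` if the byte
/// cannot start a sequence (a continuation byte or an unknown prefix).
/// Hypothetical helper, not part of this module.
fn sequence_len(first: u8) -> Option<usize> {
    if first & 0b1000_0000 == 0 {
        Some(1) // 0xxxxxxx
    } else if first & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 110xxxxx
    } else if first & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 1110xxxx
    } else if first & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 11110xxx
    } else {
        None
    }
}

/// True if `byte` has the continuation prefix 10xxxxxx.
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}
```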

§Representation

Note:

UTF-8 is a prefix code: no UTF-8 code point is a prefix of another UTF-8 code point. The first byte of a UTF-8 code point is therefore enough to determine its length, so a byte stream can be decoded unambiguously.
UTF-8 is capable of encoding all 1,112,064 valid Unicode code points, that is, every code point from 0x0000 to 0x10FFFF except the 2,048 surrogates (0xD800 to 0xDFFF).
In the diagrams below, x marks a bit that carries data and n marks a high bit that carries none. The number of xs to the right of the 0 in the Unicode code point equals the number of free bits in the UTF-8 code point.
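
The figure of 1,112,064 follows from the size of the code space minus the surrogate block. A short check, with constant names chosen here purely for illustration:

```rust
// Hypothetical constants, not part of this module.
const CODE_SPACE: u32 = 0x10_FFFF + 1; // 0x0000..=0x10FFFF: 1,114,112 values
const SURROGATES: u32 = 0xDFFF - 0xD800 + 1; // 2,048 values
const ENCODABLE: u32 = CODE_SPACE - SURROGATES; // 1,112,064 values
```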

§One byte

Encoding: If the eighth bit of the Unicode code point and every bit above it are 0, the Unicode code point is represented in UTF-8 using one byte.

Decoding: If the UTF-8 code point starts with a 0, the Unicode code point is given by the seven least significant bits of that byte.

Unicode code point: nnnnnnnn|nnnnnnnn|nnnnnnnn|0xxxxxxx
UTF-8 code point: 0xxxxxxx
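
In the one-byte case the two representations are numerically identical. A minimal sketch, assuming hypothetical helpers `encode_1` and `decode_1` that return `None` when the input does not fit this form:

```rust
/// Encode a code point that fits in seven bits. Hypothetical helper.
fn encode_1(cp: u32) -> Option<[u8; 1]> {
    if cp <= 0x7F {
        Some([cp as u8])
    } else {
        None
    }
}

/// Decode a single byte of the form 0xxxxxxx. Hypothetical helper.
fn decode_1(bytes: [u8; 1]) -> Option<u32> {
    if bytes[0] & 0b1000_0000 == 0 {
        Some(bytes[0] as u32)
    } else {
        None
    }
}
```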

§Two bytes

Encoding: If the Unicode code point does not fit in one byte, and its twelfth bit and every bit above it are 0, the Unicode code point is represented in UTF-8 using two bytes.

Decoding: If the UTF-8 code point starts with 110, it has exactly one continuation byte, and the Unicode code point is formed from the eleven bits that remain once the prefix bits are removed: five from the first byte and six from the continuation byte.

Unicode code point: nnnnnnnn|nnnnnnnn|nnnn0xxx|xxxxxxxx
UTF-8 code point: 110xxxxx|10xxxxxx
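
The bit layout above translates into one shift and two masks in each direction. The following is a sketch under stated assumptions: `encode_2` and `decode_2` are hypothetical names, and rejecting overlong forms (values below 0x80 stored in two bytes) is a choice made here, not something this page specifies.

```rust
/// Encode a code point in 0x0080..=0x07FF as 110xxxxx 10xxxxxx.
/// Hypothetical helper, not part of this module.
fn encode_2(cp: u32) -> Option<[u8; 2]> {
    if !(0x80..=0x7FF).contains(&cp) {
        return None;
    }
    Some([
        0b1100_0000 | (cp >> 6) as u8,        // top five bits
        0b1000_0000 | (cp & 0b11_1111) as u8, // low six bits
    ])
}

/// Decode 110xxxxx 10xxxxxx back into a code point.
fn decode_2(b: [u8; 2]) -> Option<u32> {
    if b[0] & 0b1110_0000 != 0b1100_0000 || b[1] & 0b1100_0000 != 0b1000_0000 {
        return None;
    }
    let cp = ((b[0] & 0b0001_1111) as u32) << 6 | (b[1] & 0b0011_1111) as u32;
    // Values below 0x80 belong in the one-byte form (overlong otherwise).
    if cp >= 0x80 { Some(cp) } else { None }
}
```

For example, U+00E9 encodes to the bytes 0xC3 0xA9, and decoding those bytes yields 0xE9 again.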

§Three bytes

Encoding: If the Unicode code point does not fit in two bytes, and its seventeenth bit and every bit above it are 0, the Unicode code point is represented in UTF-8 using three bytes.

Decoding: If the UTF-8 code point starts with 1110, it has two continuation bytes, and the Unicode code point is formed from the sixteen bits that remain once the prefix bits are removed: four from the first byte and six from each continuation byte.

Unicode code point: nnnnnnnn|nnnnnnn0|xxxxxxxx|xxxxxxxx
UTF-8 code point: 1110xxxx|10xxxxxx|10xxxxxx
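
A sketch of the three-byte layout follows. The names `encode_3` and `decode_3` are hypothetical, and the handling of surrogates and overlong forms (both rejected with `None`) is an assumption rather than documented behavior of this module.

```rust
/// Encode a code point in 0x0800..=0xFFFF (surrogates excluded) as
/// 1110xxxx 10xxxxxx 10xxxxxx. Hypothetical helper.
fn encode_3(cp: u32) -> Option<[u8; 3]> {
    if !(0x800..=0xFFFF).contains(&cp) || (0xD800..=0xDFFF).contains(&cp) {
        return None;
    }
    Some([
        0b1110_0000 | (cp >> 12) as u8,               // top four bits
        0b1000_0000 | ((cp >> 6) & 0b11_1111) as u8,  // middle six bits
        0b1000_0000 | (cp & 0b11_1111) as u8,         // low six bits
    ])
}

/// Decode 1110xxxx 10xxxxxx 10xxxxxx back into a code point.
fn decode_3(b: [u8; 3]) -> Option<u32> {
    let tail_ok = b[1..].iter().all(|&x| x & 0b1100_0000 == 0b1000_0000);
    if b[0] & 0b1111_0000 != 0b1110_0000 || !tail_ok {
        return None;
    }
    let cp = ((b[0] & 0b0000_1111) as u32) << 12
        | ((b[1] & 0b0011_1111) as u32) << 6
        | (b[2] & 0b0011_1111) as u32;
    // Reject overlong forms and surrogates.
    if (0x800..=0xFFFF).contains(&cp) && !(0xD800..=0xDFFF).contains(&cp) {
        Some(cp)
    } else {
        None
    }
}
```

For example, U+20AC (the euro sign) encodes to 0xE2 0x82 0xAC.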

§Four bytes

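Encoding: If the Unicode code point does not fit in three bytes, and its twenty-second bit and every bit above it are 0, the Unicode code point is represented in UTF-8 using four bytes. The value must also not exceed 0x10FFFF, the largest Unicode code point.

Decoding: If the UTF-8 code point starts with 11110, it has three continuation bytes, and the Unicode code point is formed from the twenty-one bits that remain once the prefix bits are removed: three from the first byte and six from each continuation byte.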

Unicode code point: nnnnnnnn|nn0xxxxx|xxxxxxxx|xxxxxxxx
UTF-8 code point: 11110xxx|10xxxxxx|10xxxxxx|10xxxxxx
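
The four-byte layout can be sketched in the same way. As before, `encode_4` and `decode_4` are hypothetical names, and returning `None` for values outside 0x10000 to 0x10FFFF is an assumption about error handling.

```rust
/// Encode a code point in 0x10000..=0x10FFFF as
/// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Hypothetical helper.
fn encode_4(cp: u32) -> Option<[u8; 4]> {
    if !(0x1_0000..=0x10_FFFF).contains(&cp) {
        return None;
    }
    Some([
        0b1111_0000 | (cp >> 18) as u8,               // top three bits
        0b1000_0000 | ((cp >> 12) & 0b11_1111) as u8, // next six bits
        0b1000_0000 | ((cp >> 6) & 0b11_1111) as u8,  // next six bits
        0b1000_0000 | (cp & 0b11_1111) as u8,         // low six bits
    ])
}

/// Decode 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx back into a code point.
fn decode_4(b: [u8; 4]) -> Option<u32> {
    let tail_ok = b[1..].iter().all(|&x| x & 0b1100_0000 == 0b1000_0000);
    if b[0] & 0b1111_1000 != 0b1111_0000 || !tail_ok {
        return None;
    }
    let cp = ((b[0] & 0b0000_0111) as u32) << 18
        | ((b[1] & 0b0011_1111) as u32) << 12
        | ((b[2] & 0b0011_1111) as u32) << 6
        | (b[3] & 0b0011_1111) as u32;
    // Twenty-one bits can hold values above 0x10FFFF; those are invalid,
    // as are overlong forms below 0x10000.
    if (0x1_0000..=0x10_FFFF).contains(&cp) { Some(cp) } else { None }
}
```

For example, U+1F600 encodes to 0xF0 0x9F 0x98 0x80.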

§Functions

decode_from_utf8
Decode a vector of UTF-8 code points into a vector of Unicode code points.
encode_in_utf8
Encode a vector of Unicode code points into a vector of UTF-8 code points.
print_utf8
Pretty-print a vector of UTF-8 code points in hexadecimal and decimal.
print_utf8_b
Pretty-print a vector of UTF-8 code points in hexadecimal, binary and decimal.
