Module ende::utf8

UTF-8 encoding and decoding.

§Encoding

A Unicode code point is represented using one to four bytes in UTF-8, depending on its value.

  • If the Unicode code point is in the range 0x0000 to 0x007F, it is represented using one byte.
  • If the Unicode code point is in the range 0x0080 to 0x07FF, it is represented using two bytes.
  • If the Unicode code point is in the range 0x0800 to 0xFFFF, it is represented using three bytes. The surrogate range 0xD800 to 0xDFFF is reserved for UTF-16 and is not a valid code point to encode.
  • If the Unicode code point is in the range 0x10000 to 0x10FFFF, it is represented using four bytes.
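The ranges above can be sketched as a single lookup. This is only an illustration: `utf8_len` is a hypothetical helper, not a function of this module, and rejecting surrogates is an assumption based on strict UTF-8 rather than on this module's behaviour.

```rust
/// Number of bytes UTF-8 needs for `code_point`, or `None` if the value
/// cannot be encoded (a surrogate, or a value above 0x10FFFF).
/// Hypothetical helper for illustration; not part of `ende::utf8`.
fn utf8_len(code_point: u32) -> Option<usize> {
    match code_point {
        0x0000..=0x007F => Some(1),
        0x0080..=0x07FF => Some(2),
        // Surrogates are reserved for UTF-16 (assumed rejected here).
        0xD800..=0xDFFF => None,
        0x0800..=0xFFFF => Some(3),
        0x1_0000..=0x10_FFFF => Some(4),
        _ => None,
    }
}
```

For valid characters the result should agree with the standard library's `char::len_utf8`.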

§Decoding

A UTF-8 code point (the encoded byte sequence) is decoded into a Unicode code point by inspecting the leading bits of its first byte:

  • If the first bit of the first byte is 0, the Unicode code point is represented using one byte.
  • If the first three bits of the first byte are 110, the Unicode code point is represented using two bytes.
  • If the first four bits of the first byte are 1110, the Unicode code point is represented using three bytes.
  • If the first five bits of the first byte are 11110, the Unicode code point is represented using four bytes.

When a Unicode code point is represented using two, three or four bytes, every byte after the first is a continuation byte. Continuation bytes start with the bit pattern 10.
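A minimal sketch of these prefix checks, using bit masks on the first byte. The names `sequence_len` and `is_continuation` are hypothetical and are not taken from this module.

```rust
/// Length of the sequence announced by `first_byte`, or `None` if the byte
/// is a continuation byte or not a valid lead byte.
/// Hypothetical helper for illustration.
fn sequence_len(first_byte: u8) -> Option<usize> {
    if first_byte & 0b1000_0000 == 0 {
        Some(1) // 0xxxxxxx
    } else if first_byte & 0b1110_0000 == 0b1100_0000 {
        Some(2) // 110xxxxx
    } else if first_byte & 0b1111_0000 == 0b1110_0000 {
        Some(3) // 1110xxxx
    } else if first_byte & 0b1111_1000 == 0b1111_0000 {
        Some(4) // 11110xxx
    } else {
        None // 10xxxxxx (continuation) or 11111xxx (never valid)
    }
}

/// True if `byte` has the continuation prefix 10.
fn is_continuation(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}
```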

§Representation

Note:

  • UTF-8 is a prefix code, which means that no UTF-8 code point is a prefix of another UTF-8 code point. This means that the first byte of a UTF-8 code point is enough to determine the length of the UTF-8 code point and decode it in an unambiguous way.
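  • UTF-8 is capable of encoding all 1,112,064 valid Unicode code points, that is, every value from 0x0000 to 0x10FFFF except the 2,048 surrogates from 0xD800 to 0xDFFF.
  • In the diagrams in the following sections, the number of xs to the right of the 0 in the Unicode code point is the number of free (payload) bits in the UTF-8 code point. The bits marked n are not used.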

§One byte

Encoding: If the Unicode code point fits in seven bits (the eighth bit and all higher bits are 0), it is represented in UTF-8 using one byte.

Decoding: If the UTF-8 code point starts with a 0, the Unicode code point is given by the seven least significant bits of that byte.

  • Unicode code point: nnnnnnnn|nnnnnnnn|nnnnnnnn|0xxxxxxx
  • UTF-8 code point: 0xxxxxxx
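As a small illustration of the one-byte case, the code point and its encoding have the same numeric value. `encode_one` and `decode_one` are hypothetical names used only for this sketch.

```rust
/// Encode a code point in 0x00..=0x7F as a single byte.
/// Hypothetical helper for illustration.
fn encode_one(code_point: u32) -> u8 {
    assert!(code_point <= 0x7F, "code point needs more than one byte");
    code_point as u8
}

/// Decode a single byte whose leading bit is 0.
fn decode_one(byte: u8) -> u32 {
    assert!(byte & 0b1000_0000 == 0, "not a one-byte sequence");
    byte as u32
}
```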

§Two bytes

Encoding: If the Unicode code point does not fit in seven bits but fits in eleven (the twelfth bit and all higher bits are 0), it is represented in UTF-8 using two bytes.

Decoding: If the UTF-8 code point starts with 110 it has only one continuation byte, and the Unicode code point is formed from the eleven bits that remain after the prefix bits are removed: five from the first byte and six from the continuation byte.

  • Unicode code point: nnnnnnnn|nnnnnnnn|nnnn0xxx|xxxxxxxx
  • UTF-8 code point: 110xxxxx|10xxxxxx
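The two-byte layout can be sketched with shifts and masks. The function names below are hypothetical, and the decoder assumes its input is already known to be a well-formed two-byte sequence.

```rust
/// Encode a code point in 0x80..=0x7FF as 110xxxxx 10xxxxxx.
/// Hypothetical helper for illustration.
fn encode_two(code_point: u32) -> [u8; 2] {
    assert!((0x80..=0x7FF).contains(&code_point));
    [
        0b1100_0000 | (code_point >> 6) as u8,          // top 5 bits
        0b1000_0000 | (code_point & 0b0011_1111) as u8, // low 6 bits
    ]
}

/// Decode a two-byte sequence by stripping the prefixes and joining the
/// 5 + 6 payload bits. Does not validate the prefixes.
fn decode_two(bytes: [u8; 2]) -> u32 {
    let high = (bytes[0] & 0b0001_1111) as u32;
    let low = (bytes[1] & 0b0011_1111) as u32;
    (high << 6) | low
}
```

For example, U+00E9 (é) encodes to the bytes 0xC3 0xA9.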

§Three bytes

Encoding: If the Unicode code point does not fit in eleven bits but fits in sixteen (the seventeenth bit and all higher bits are 0), it is represented in UTF-8 using three bytes.

Decoding: If the UTF-8 code point starts with 1110 it has two continuation bytes, and the Unicode code point is formed from the sixteen bits that remain after the prefix bits are removed: four from the first byte and six from each continuation byte.

  • Unicode code point: nnnnnnnn|nnnnnnn0|xxxxxxxx|xxxxxxxx
  • UTF-8 code point: 1110xxxx|10xxxxxx|10xxxxxx
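A sketch of the three-byte layout follows. The names are hypothetical, the decoder assumes a well-formed sequence, and for brevity the encoder does not reject surrogates, which a strict encoder would.

```rust
/// Encode a code point in 0x800..=0xFFFF as 1110xxxx 10xxxxxx 10xxxxxx.
/// Hypothetical helper for illustration; surrogates are not rejected here.
fn encode_three(code_point: u32) -> [u8; 3] {
    assert!((0x800..=0xFFFF).contains(&code_point));
    [
        0b1110_0000 | (code_point >> 12) as u8,                 // top 4 bits
        0b1000_0000 | ((code_point >> 6) & 0b0011_1111) as u8,  // middle 6 bits
        0b1000_0000 | (code_point & 0b0011_1111) as u8,         // low 6 bits
    ]
}

/// Decode a three-byte sequence by joining the 4 + 6 + 6 payload bits.
/// Does not validate the prefixes.
fn decode_three(bytes: [u8; 3]) -> u32 {
    let high = (bytes[0] & 0b0000_1111) as u32;
    let mid = (bytes[1] & 0b0011_1111) as u32;
    let low = (bytes[2] & 0b0011_1111) as u32;
    (high << 12) | (mid << 6) | low
}
```

For example, U+20AC (€) encodes to the bytes 0xE2 0x82 0xAC.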

§Four bytes

Encoding: If the Unicode code point does not fit in sixteen bits but fits in twenty-one (the twenty-second bit and all higher bits are 0), it is represented in UTF-8 using four bytes. The largest valid code point is 0x10FFFF.

Decoding: If the UTF-8 code point starts with 11110 it has three continuation bytes, and the Unicode code point is formed from the twenty-one bits that remain after the prefix bits are removed: three from the first byte and six from each continuation byte.

  • Unicode code point: nnnnnnnn|nn0xxxxx|xxxxxxxx|xxxxxxxx
  • UTF-8 code point: 11110xxx|10xxxxxx|10xxxxxx|10xxxxxx
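The four-byte layout can be sketched the same way. The names are hypothetical, and the decoder assumes its input is a well-formed four-byte sequence.

```rust
/// Encode a code point in 0x10000..=0x10FFFF as
/// 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
/// Hypothetical helper for illustration.
fn encode_four(code_point: u32) -> [u8; 4] {
    assert!((0x1_0000..=0x10_FFFF).contains(&code_point));
    [
        0b1111_0000 | (code_point >> 18) as u8,                 // top 3 bits
        0b1000_0000 | ((code_point >> 12) & 0b0011_1111) as u8, // next 6 bits
        0b1000_0000 | ((code_point >> 6) & 0b0011_1111) as u8,  // next 6 bits
        0b1000_0000 | (code_point & 0b0011_1111) as u8,         // low 6 bits
    ]
}

/// Decode a four-byte sequence by joining the 3 + 6 + 6 + 6 payload bits.
/// Does not validate the prefixes.
fn decode_four(bytes: [u8; 4]) -> u32 {
    let b0 = (bytes[0] & 0b0000_0111) as u32;
    let b1 = (bytes[1] & 0b0011_1111) as u32;
    let b2 = (bytes[2] & 0b0011_1111) as u32;
    let b3 = (bytes[3] & 0b0011_1111) as u32;
    (b0 << 18) | (b1 << 12) | (b2 << 6) | b3
}
```

For example, U+1F600 encodes to the bytes 0xF0 0x9F 0x98 0x80.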

Functions§

  • Decode a vector of UTF-8 code points into a vector of Unicode code points.
  • Encode a vector of Unicode code points into a vector of UTF-8 code points.
  • Pretty print the UTF-8 encoding in hexadecimal and decimal of a vector of UTF-8 code points.
  • Pretty print the UTF-8 encoding in hexadecimal, binary and decimal of a vector of UTF-8 code points.