UTF-8 encoding and decoding.
§Encoding
A Unicode code point is represented using one to four bytes in UTF-8, depending on its value.
- If the Unicode code point is in the range 0x0000 to 0x007F, it is represented using one byte.
- If the Unicode code point is in the range 0x0080 to 0x07FF, it is represented using two bytes.
- If the Unicode code point is in the range 0x0800 to 0xFFFF, it is represented using three bytes.
- If the Unicode code point is in the range 0x10000 to 0x10FFFF, it is represented using four bytes.
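The ranges above map directly onto a match expression. The following is a minimal sketch, not this crate's API: the name utf8_len is hypothetical, and rejecting the surrogate range 0xD800 to 0xDFFF is an assumption taken from the Unicode standard rather than from the list above.

```rust
/// Number of bytes UTF-8 needs for `code_point`, or `None` if the value
/// cannot be encoded. `utf8_len` is a hypothetical helper for illustration.
fn utf8_len(code_point: u32) -> Option<usize> {
    match code_point {
        0x0000..=0x007F => Some(1),
        0x0080..=0x07FF => Some(2),
        // Assumption: surrogates are reserved for UTF-16 and are rejected.
        0xD800..=0xDFFF => None,
        0x0800..=0xFFFF => Some(3),
        0x10000..=0x10FFFF => Some(4),
        // Anything above 0x10FFFF is outside the Unicode code space.
        _ => None,
    }
}
```

For valid values the result should agree with the standard library's char::len_utf8, which can serve as a cross-check.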
§Decoding
A UTF-8 code point is decoded into a Unicode code point using the following rules:
- If the first bit of the UTF-8 code point is 0, the Unicode code point is represented using one byte.
- If the first three bits of the UTF-8 code point are 110, the Unicode code point is represented using two bytes.
- If the first four bits of the UTF-8 code point are 1110, the Unicode code point is represented using three bytes.
- If the first five bits of the UTF-8 code point are 11110, the Unicode code point is represented using four bytes.
When a Unicode code point is represented using two, three or four bytes, every byte after the first is a continuation byte. Continuation bytes start with the bit pattern 10.
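The prefix rules above can be sketched as bit-mask tests on the first byte. The names sequence_len and is_continuation are hypothetical and not part of this crate. The sketch only inspects prefix bits, so it does not reject first bytes that can never appear in valid UTF-8.

```rust
/// Length in bytes of the UTF-8 code point that starts with `first_byte`,
/// judged from its prefix bits alone. Hypothetical helper for illustration.
fn sequence_len(first_byte: u8) -> Option<usize> {
    if (first_byte & 0b1000_0000) == 0b0000_0000 {
        Some(1) // 0xxxxxxx
    } else if (first_byte & 0b1110_0000) == 0b1100_0000 {
        Some(2) // 110xxxxx
    } else if (first_byte & 0b1111_0000) == 0b1110_0000 {
        Some(3) // 1110xxxx
    } else if (first_byte & 0b1111_1000) == 0b1111_0000 {
        Some(4) // 11110xxx
    } else {
        None // a continuation byte (10xxxxxx) or 11111xxx
    }
}

/// True if `byte` carries the continuation prefix 10.
fn is_continuation(byte: u8) -> bool {
    (byte & 0b1100_0000) == 0b1000_0000
}
```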
§Representation
Note:
- UTF-8 is a prefix code, which means that no UTF-8 code point is a prefix of another UTF-8 code point. As a result, the first byte of a UTF-8 code point is enough to determine its length and decode it unambiguously.
- UTF-8 is capable of encoding all 1,112,064 valid Unicode code points, that is, every value from 0x0000 to 0x10FFFF except the surrogate range 0xD800 to 0xDFFF.
- The number of x characters to the right of the 0 in the Unicode code point is the number of free bits in the UTF-8 code point.
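The figure of 1,112,064 can be checked with a little arithmetic: take the whole code space from 0x0000 to 0x10FFFF and subtract the 2,048 surrogate values from 0xD800 to 0xDFFF, which are reserved for UTF-16. The function name below is hypothetical and only illustrates the calculation.

```rust
/// Number of Unicode code points that UTF-8 can encode.
/// Hypothetical helper; it only reproduces the arithmetic.
fn valid_code_point_count() -> u32 {
    let code_space: u32 = 0x10FFFF + 1; // 1,114,112 values
    let surrogates: u32 = 0xDFFF - 0xD800 + 1; // 2,048 values
    code_space - surrogates
}
```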
§One byte
Encoding: If the eighth bit and all higher bits of the Unicode code point are 0, the Unicode code point is represented in UTF-8 using one byte.
Decoding: If the UTF-8 code point starts with a 0, it has no continuation bytes, and the Unicode code point is represented using the seven least significant bits (excluding the prefix bit).
- Unicode code point: nnnnnnnn|nnnnnnnn|nnnnnnnn|0xxxxxxx
- UTF-8 code point: 0xxxxxxx
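A minimal sketch of the one-byte case, using the hypothetical names encode_one and decode_one (they are not this crate's functions). The byte and the code point have the same numeric value, so both directions reduce to a range check.

```rust
/// Encode a code point in the range 0x0000 to 0x007F as one byte.
/// Hypothetical helper for illustration.
fn encode_one(code_point: u32) -> Option<[u8; 1]> {
    if code_point <= 0x7F {
        Some([code_point as u8]) // 0xxxxxxx
    } else {
        None
    }
}

/// Decode a one-byte UTF-8 code point; the leading bit must be 0.
fn decode_one(bytes: [u8; 1]) -> Option<u32> {
    if (bytes[0] & 0b1000_0000) == 0 {
        Some(bytes[0] as u32)
    } else {
        None
    }
}
```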
§Two bytes
Encoding: If the Unicode code point does not fit in one byte and its twelfth bit and all higher bits are 0, the Unicode code point is represented in UTF-8 using two bytes.
Decoding: If the UTF-8 code point starts with 110, it has only one continuation byte, and the Unicode code point is represented using the eleven least significant bits (excluding the prefix bits).
- Unicode code point: nnnnnnnn|nnnnnnnn|nnnn0xxx|xxxxxxxx
- UTF-8 code point: 110xxxxx|10xxxxxx
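The two-byte layout can be sketched with shifts and masks: the upper five free bits go into the first byte and the lower six into the continuation byte. The names encode_two and decode_two are hypothetical. Rejecting decoded values below 0x0080 (overlong forms) is an assumption based on the UTF-8 specification, not on the text above.

```rust
/// Encode a code point in the range 0x0080 to 0x07FF as two bytes.
/// Hypothetical helper for illustration.
fn encode_two(code_point: u32) -> Option<[u8; 2]> {
    if !(0x0080..=0x07FF).contains(&code_point) {
        return None;
    }
    let first = 0b1100_0000 | (code_point >> 6) as u8; // 110xxxxx
    let second = 0b1000_0000 | (code_point & 0b0011_1111) as u8; // 10xxxxxx
    Some([first, second])
}

/// Decode a two-byte UTF-8 code point.
fn decode_two(bytes: [u8; 2]) -> Option<u32> {
    let [first, second] = bytes;
    if (first & 0b1110_0000) != 0b1100_0000 || (second & 0b1100_0000) != 0b1000_0000 {
        return None;
    }
    let code_point = (((first & 0b0001_1111) as u32) << 6) | ((second & 0b0011_1111) as u32);
    // Assumption: values that fit in one byte are overlong and rejected.
    if code_point < 0x0080 { None } else { Some(code_point) }
}
```

For example, U+00E9 (é) encodes to the bytes 0xC3 0xA9, which matches the bytes of the string "é" in Rust.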
§Three bytes
Encoding: If the Unicode code point does not fit in two bytes and its seventeenth bit and all higher bits are 0, the Unicode code point is represented in UTF-8 using three bytes.
Decoding: If the UTF-8 code point starts with 1110, it has two continuation bytes, and the Unicode code point is represented using the sixteen least significant bits (excluding the prefix bits).
- Unicode code point: nnnnnnnn|nnnnnnn0|xxxxxxxx|xxxxxxxx
- UTF-8 code point: 1110xxxx|10xxxxxx|10xxxxxx
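The three-byte layout splits the sixteen free bits into groups of four, six and six. The sketch below uses the hypothetical names encode_three and decode_three. Rejecting surrogates and overlong forms is an assumption based on the UTF-8 specification.

```rust
/// Encode a code point in the range 0x0800 to 0xFFFF as three bytes.
/// Hypothetical helper for illustration.
fn encode_three(code_point: u32) -> Option<[u8; 3]> {
    // Assumption: the surrogate range 0xD800 to 0xDFFF is not encodable.
    if !(0x0800..=0xFFFF).contains(&code_point) || (0xD800..=0xDFFF).contains(&code_point) {
        return None;
    }
    let first = 0b1110_0000 | (code_point >> 12) as u8; // 1110xxxx
    let second = 0b1000_0000 | ((code_point >> 6) & 0b0011_1111) as u8; // 10xxxxxx
    let third = 0b1000_0000 | (code_point & 0b0011_1111) as u8; // 10xxxxxx
    Some([first, second, third])
}

/// Decode a three-byte UTF-8 code point.
fn decode_three(bytes: [u8; 3]) -> Option<u32> {
    let [first, second, third] = bytes;
    let is_continuation = |b: u8| (b & 0b1100_0000) == 0b1000_0000;
    if (first & 0b1111_0000) != 0b1110_0000 || !is_continuation(second) || !is_continuation(third) {
        return None;
    }
    let code_point = (((first & 0b0000_1111) as u32) << 12)
        | (((second & 0b0011_1111) as u32) << 6)
        | ((third & 0b0011_1111) as u32);
    // Assumption: overlong forms and surrogates are rejected.
    if code_point < 0x0800 || (0xD800..=0xDFFF).contains(&code_point) {
        None
    } else {
        Some(code_point)
    }
}
```

For example, U+20AC (€) encodes to the bytes 0xE2 0x82 0xAC.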
§Four bytes
Encoding: If the Unicode code point does not fit in three bytes and its twenty-second bit and all higher bits are 0, the Unicode code point is represented in UTF-8 using four bytes.
Decoding: If the UTF-8 code point starts with 11110, it has three continuation bytes, and the Unicode code point is represented using the twenty-one least significant bits (excluding the prefix bits).
- Unicode code point: nnnnnnnn|nn0xxxxx|xxxxxxxx|xxxxxxxx
- UTF-8 code point: 11110xxx|10xxxxxx|10xxxxxx|10xxxxxx
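The four-byte layout splits the twenty-one free bits into groups of three, six, six and six. The names encode_four and decode_four below are hypothetical. The upper bound of 0x10FFFF and the rejection of overlong forms are assumptions taken from the encoding ranges and the UTF-8 specification.

```rust
/// Encode a code point in the range 0x10000 to 0x10FFFF as four bytes.
/// Hypothetical helper for illustration.
fn encode_four(code_point: u32) -> Option<[u8; 4]> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None;
    }
    let first = 0b1111_0000 | (code_point >> 18) as u8; // 11110xxx
    let second = 0b1000_0000 | ((code_point >> 12) & 0b0011_1111) as u8; // 10xxxxxx
    let third = 0b1000_0000 | ((code_point >> 6) & 0b0011_1111) as u8; // 10xxxxxx
    let fourth = 0b1000_0000 | (code_point & 0b0011_1111) as u8; // 10xxxxxx
    Some([first, second, third, fourth])
}

/// Decode a four-byte UTF-8 code point.
fn decode_four(bytes: [u8; 4]) -> Option<u32> {
    let [first, second, third, fourth] = bytes;
    let is_continuation = |b: u8| (b & 0b1100_0000) == 0b1000_0000;
    if (first & 0b1111_1000) != 0b1111_0000
        || !is_continuation(second)
        || !is_continuation(third)
        || !is_continuation(fourth)
    {
        return None;
    }
    let code_point = (((first & 0b0000_0111) as u32) << 18)
        | (((second & 0b0011_1111) as u32) << 12)
        | (((third & 0b0011_1111) as u32) << 6)
        | ((fourth & 0b0011_1111) as u32);
    // Assumption: overlong forms and values above 0x10FFFF are rejected.
    if (0x10000..=0x10FFFF).contains(&code_point) {
        Some(code_point)
    } else {
        None
    }
}
```

For example, U+1F600 encodes to the bytes 0xF0 0x9F 0x98 0x80.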
Functions§
- Decode a vector of UTF-8 code points into a vector of Unicode code points.
- Encode a vector of Unicode code points into a vector of UTF-8 code points.
- Pretty print the UTF-8 encoding in hexadecimal and decimal of a vector of UTF-8 code points.
- Pretty print the UTF-8 encoding in hexadecimal, binary and decimal of a vector of UTF-8 code points.