Module utf8

Definitions for UTF-8 encoding and decoding of character sequences.

UTF-8 is a variable-width character encoding scheme. Each character is encoded with between 1 and 4 bytes. Specifications for encoding and decoding characters to their UTF-8 byte sequences are given by [encode_utf8] and [decode_utf8], respectively. Characters in the ASCII character set are encoded in UTF-8 with 1-byte encodings identical to those used by ASCII. Thus, some UTF-8 byte sequences can also be considered ASCII byte sequences, as defined in [is_ascii_chars].
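The width and ASCII-compatibility claims can be checked with the Rust standard library alone. The sketch below is illustrative and is not this module's specification: `utf8_bytes` and `bytes_are_ascii` are hypothetical helper names, and only `str::as_bytes` and `char::len_utf8` come from `std`.

```rust
/// Hypothetical helper: the UTF-8 bytes of a string, copied into a Vec.
/// Rust `str` values are always stored as valid UTF-8.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

/// Hypothetical helper: true when every byte is below 0x80, so the
/// sequence can be read as ASCII as well as UTF-8.
fn bytes_are_ascii(bytes: &[u8]) -> bool {
    bytes.iter().all(|b| *b < 0x80)
}

/// Encoded width in bytes of each character in `s`, in order.
fn utf8_widths(s: &str) -> Vec<usize> {
    s.chars().map(|c| c.len_utf8()).collect()
}
```

For instance, `utf8_widths("Aé€😀")` yields one width per character, from 1 byte for the ASCII letter up to 4 bytes for the emoji, and `bytes_are_ascii` holds for `"hello"` but not for `"héllo"`.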

UTF-8 encodes numerical values called Unicode scalars (see below), which assign a unique value to each Unicode character. A scalar value is encoded in UTF-8 using a leading byte and between 0 and 3 continuation bytes, where larger scalar values require more continuation bytes. The first part of the bit pattern in the leading byte is reserved for describing the number of bytes in the scalar’s encoding (e.g., [is_leading_byte_width_1]). The rest of the leading byte contains data bits corresponding to the scalar’s value (e.g., [leading_bits_width_1]). The continuation bytes also follow a specific bit pattern ([is_continuation_byte]) and contain the remainder of the data bits ([continuation_bits]).
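To make the leading-byte and continuation-byte layout concrete, here is a minimal executable sketch of the encoding in plain Rust. It is an assumption-laden illustration, not the module's [encode_utf8] specification: `encode_scalar` and `is_continuation` are hypothetical names, and the bit masks are the standard UTF-8 patterns (`0xxxxxxx`, `110xxxxx`, `1110xxxx`, `11110xxx` for leading bytes, `10xxxxxx` for continuation bytes).

```rust
/// Hypothetical sketch: encode a Unicode scalar value as UTF-8 bytes.
/// Returns None for values that are not scalars (surrogates or > 0x10FFFF).
fn encode_scalar(v: u32) -> Option<Vec<u8>> {
    if v > 0x10FFFF || (0xD800..=0xDFFF).contains(&v) {
        return None;
    }
    let bytes = if v < 0x80 {
        // Width 1: leading byte 0xxxxxxx carries all 7 data bits.
        vec![v as u8]
    } else if v < 0x800 {
        // Width 2: leading byte 110xxxxx, then one continuation byte.
        vec![0xC0 | (v >> 6) as u8, 0x80 | (v & 0x3F) as u8]
    } else if v < 0x10000 {
        // Width 3: leading byte 1110xxxx, then two continuation bytes.
        vec![
            0xE0 | (v >> 12) as u8,
            0x80 | ((v >> 6) & 0x3F) as u8,
            0x80 | (v & 0x3F) as u8,
        ]
    } else {
        // Width 4: leading byte 11110xxx, then three continuation bytes.
        vec![
            0xF0 | (v >> 18) as u8,
            0x80 | ((v >> 12) & 0x3F) as u8,
            0x80 | ((v >> 6) & 0x3F) as u8,
            0x80 | (v & 0x3F) as u8,
        ]
    };
    Some(bytes)
}

/// Hypothetical sketch: a continuation byte has the bit pattern 10xxxxxx.
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}
```

Each continuation byte carries 6 data bits, which is why larger scalar values need more of them. The result can be cross-checked against the standard library, for example by comparing `encode_scalar(0x20AC)` with the bytes of `"€"`.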

This module uses terminology from the Unicode standard. A Unicode scalar is a numerical value (represented in this module as a u32) corresponding to a character that can be encoded in UTF-8. All Rust chars correspond to Unicode scalars ([char_is_scalar]), and every numerical value encoded in a UTF-8 byte sequence must fall within the range defined for Unicode scalars ([is_scalar]).

The Unicode standard also defines a codepoint: any numerical value in the Unicode codespace, 0 through 0x10FFFF. This may sound similar to the definition of scalar, but it is more permissive, because it includes the high-surrogate and low-surrogate ranges (0xD800 through 0xDFFF). Values in those ranges fit the UTF-8 bit patterns but are not legal Unicode values to encode. To align with the Unicode terminology, this module uses the term “scalar” for the numerical values that can be encoded in valid UTF-8 byte sequences, and the term “codepoint” for numerical values obtained by decoding a byte sequence, which may or may not be legal Unicode values.
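The distinction between scalars and codepoints can be sketched as three small predicates. These are hypothetical illustrations in plain Rust, not the module's [is_scalar] definition; the numeric ranges are the ones the Unicode standard gives for the codespace and the surrogate blocks.

```rust
/// Hypothetical sketch: any value in the Unicode codespace.
fn is_codepoint(v: u32) -> bool {
    v <= 0x10FFFF
}

/// Hypothetical sketch: the high- and low-surrogate ranges combined.
fn is_surrogate(v: u32) -> bool {
    (0xD800..=0xDFFF).contains(&v)
}

/// Hypothetical sketch: a scalar is a codepoint that is not a surrogate.
fn is_scalar_value(v: u32) -> bool {
    is_codepoint(v) && !is_surrogate(v)
}
```

The standard library follows the same rule: `char::from_u32` returns `None` for surrogates and for values above 0x10FFFF, and `std::str::from_utf8` rejects the byte sequence `[0xED, 0xA0, 0x80]`, which is what the 3-byte bit pattern would produce for the surrogate 0xD800.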

Functions

group_utf8_lib