
Functions for converting between different in-RAM representations of text and for quickly checking if the Unicode Bidirectional Algorithm can be avoided.

By using slices for output, the functions here seek to enable by-register (ALU register or SIMD register as available) operations in order to outperform iterator-based conversions available in the Rust standard library.
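To make "by-register" concrete, here is a minimal std-only sketch of an all-ASCII check that examines eight bytes per ALU-register load. The function name `is_ascii_by_word` is hypothetical, and this is not the module's actual implementation, which may use SIMD and handle alignment differently.

```rust
/// Returns true if every byte in `bytes` is ASCII (below 0x80).
/// Hypothetical helper for illustration only.
fn is_ascii_by_word(bytes: &[u8]) -> bool {
    // A byte is non-ASCII exactly when its high bit is set.
    const HIGH_BITS: u64 = 0x8080_8080_8080_8080;
    let mut chunks = bytes.chunks_exact(8);
    for chunk in &mut chunks {
        // Load eight bytes into one 64-bit word. Byte order does not matter
        // because the mask is identical for every byte lane.
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        let word = u64::from_ne_bytes(buf);
        if word & HIGH_BITS != 0 {
            return false;
        }
    }
    // Check the tail of fewer than eight bytes one byte at a time.
    chunks.remainder().iter().all(|&b| b < 0x80)
}
```

An iterator over individual bytes performs one comparison per byte, whereas the loop above performs one mask-and-test per eight bytes, which is the kind of saving that slice-based signatures make possible.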

Note: “Latin1” in this module refers to the Unicode range from U+0000 to U+00FF, inclusive, and does not refer to the windows-1252 range. This in-memory encoding is sometimes used as a storage optimization of text when UTF-16 indexing and length semantics are exposed.
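The definition above can be illustrated with a short std-only sketch in which each byte value is taken directly as a code point. The helper name `latin1_to_string` is hypothetical and is not part of this module.

```rust
/// Decodes "Latin1" in the sense used here: the byte value is the code point.
/// Hypothetical helper for illustration only.
fn latin1_to_string(bytes: &[u8]) -> String {
    // `u8 as char` maps 0x00..=0xFF onto U+0000..=U+00FF one-to-one.
    bytes.iter().map(|&b| b as char).collect()
}
```

Under this definition the byte 0x80 decodes to the control character U+0080, whereas windows-1252 would map it to U+20AC (the euro sign).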

The FFI bindings for this module are in the encoding_c_mem crate.

Enums

Classification of text as Latin1 (all code points are below U+0100), as left-to-right with some non-Latin1 characters, or as containing at least some right-to-left characters.

Functions

Checks whether a valid UTF-8 buffer contains code points that trigger right-to-left processing or is all-Latin1.

Checks whether a potentially invalid UTF-8 buffer contains code points that trigger right-to-left processing or is all-Latin1.

Checks whether a potentially invalid UTF-16 buffer contains code points that trigger right-to-left processing or is all-Latin1.

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-8 such that the validity of the output is signaled using the Rust type system.

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-8, with potentially insufficient output space, such that the validity of the output is signaled using the Rust type system.

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-8.

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-8, with potentially insufficient output space.

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-16.

Converts valid UTF-8 to valid UTF-16.

If the input is valid UTF-8 representing only Unicode code points from U+0000 to U+00FF, inclusive, converts the input into output that represents the value of each code point as the unsigned byte value of each output byte.
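As a rough std-only sketch of the narrowing described above: the hypothetical helper `str_to_latin1` returns `None` when a code point is above U+00FF. That return type is an assumption of the sketch; the description above only says what happens when the precondition holds.

```rust
/// Narrows a string to Latin1 bytes if every code point is at most U+00FF.
/// Hypothetical helper for illustration only.
fn str_to_latin1(s: &str) -> Option<Vec<u8>> {
    s.chars()
        .map(|c| {
            let cp = c as u32;
            // The output byte is simply the code point value.
            if cp <= 0xFF { Some(cp as u8) } else { None }
        })
        .collect()
}
```

For example, U+00E9 occupies two bytes in UTF-8 (0xC3 0xA9) but becomes the single byte 0xE9 in the output.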

Converts potentially-invalid UTF-8 to valid UTF-16 with errors replaced with the REPLACEMENT CHARACTER.
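The replacement behavior can be approximated with the standard library alone. This sketch allocates its output instead of writing into a caller-provided slice, the name `utf8_to_utf16_lossy` is hypothetical, and the exact number of REPLACEMENT CHARACTERs emitted for a given malformed sequence is assumed, not verified, to match this module.

```rust
/// Decodes possibly-invalid UTF-8 and re-encodes the result as UTF-16.
/// Hypothetical helper for illustration only.
fn utf8_to_utf16_lossy(bytes: &[u8]) -> Vec<u16> {
    // `from_utf8_lossy` substitutes U+FFFD for each malformed sequence,
    // so the UTF-16 produced from it is always valid.
    String::from_utf8_lossy(bytes).encode_utf16().collect()
}
```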

Converts potentially-invalid UTF-8 to valid UTF-16 signaling on error.

If the input is valid UTF-16 representing only Unicode code points from U+0000 to U+00FF, inclusive, converts the input into output that represents the value of each code point as the unsigned byte value of each output byte.

Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced with the REPLACEMENT CHARACTER such that the validity of the output is signaled using the Rust type system.

Converts potentially-invalid UTF-16 to valid UTF-8, with errors replaced with the REPLACEMENT CHARACTER and with potentially insufficient output space, such that the validity of the output is signaled using the Rust type system.

Converts potentially-invalid UTF-16 to valid UTF-8 with errors replaced with the REPLACEMENT CHARACTER.

Converts potentially-invalid UTF-16 to valid UTF-8, with errors replaced with the REPLACEMENT CHARACTER and with potentially insufficient output space.

Copies ASCII from source to destination up to the first non-ASCII byte (or the end of the input if it is ASCII in its entirety).
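A minimal sketch of this prefix copy, assuming the destination must be at least as long as the source and that the return value is the number of bytes copied. The name `copy_ascii_prefix` is hypothetical.

```rust
/// Copies the leading ASCII run of `src` into `dst` and returns its length.
/// Hypothetical helper for illustration only.
fn copy_ascii_prefix(src: &[u8], dst: &mut [u8]) -> usize {
    // Assumption of this sketch: a too-short destination is a caller error.
    assert!(dst.len() >= src.len(), "destination too short");
    // Index of the first byte with the high bit set, or the whole length.
    let n = src.iter().position(|&b| b >= 0x80).unwrap_or(src.len());
    dst[..n].copy_from_slice(&src[..n]);
    n
}
```

A caller can use the returned count to resume from the first non-ASCII byte with a slower, fully general conversion.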

Copies ASCII from source to destination zero-extending it to UTF-16 up to the first non-ASCII byte (or the end of the input if it is ASCII in its entirety).

Copies Basic Latin from source to destination narrowing it to ASCII up to the first non-Basic Latin code unit (or the end of the input if it is Basic Latin in its entirety).

Converts bytes whose unsigned value is interpreted as a Unicode code point (i.e. U+0000 to U+00FF, inclusive) to UTF-8.

If the input is valid UTF-8 representing only Unicode code points from U+0000 to U+00FF, inclusive, converts the input into output that represents the value of each code point as the unsigned byte value of each output byte.

Replaces unpaired surrogates in the input with the REPLACEMENT CHARACTER.
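A sketch of in-place surrogate repair, written against a plain `&mut [u16]` buffer. The name `replace_unpaired_surrogates` is hypothetical, and the loop is a straightforward scalar version, not the module's implementation.

```rust
/// Overwrites every unpaired surrogate in `buf` with U+FFFD.
/// Hypothetical helper for illustration only.
fn replace_unpaired_surrogates(buf: &mut [u16]) {
    let mut i = 0;
    while i < buf.len() {
        let unit = buf[i];
        if (0xD800..=0xDBFF).contains(&unit) {
            // A high surrogate is valid only if a low surrogate follows.
            if i + 1 < buf.len() && (0xDC00..=0xDFFF).contains(&buf[i + 1]) {
                i += 2;
                continue;
            }
            buf[i] = 0xFFFD;
        } else if (0xDC00..=0xDFFF).contains(&unit) {
            // A low surrogate that was not consumed as part of a pair.
            buf[i] = 0xFFFD;
        }
        i += 1;
    }
}
```

Because U+FFFD is a single UTF-16 code unit, the replacement never changes the buffer length.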

Checks whether the buffer is all-ASCII.

Checks whether the buffer is all-Basic Latin (i.e. UTF-16 representing only ASCII characters).

Checks whether a scalar value triggers right-to-left processing.

Checks whether a valid UTF-8 buffer contains code points that trigger right-to-left processing.

Checks whether the buffer represents only code points less than or equal to U+00FF.

Checks whether a potentially-invalid UTF-8 buffer contains code points that trigger right-to-left processing.

Checks whether the buffer is valid UTF-8 representing only code points less than or equal to U+00FF.

Checks whether a UTF-16 buffer contains code points that trigger right-to-left processing.

Checks whether a UTF-16 code unit triggers right-to-left processing.

Checks whether the buffer represents only code points less than or equal to U+00FF.

Returns the index of the first byte that starts a non-Latin1 byte sequence, or the length of the string if there are none.
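For a valid `&str`, the index described above can be sketched with `char_indices`. The name `latin1_prefix_len` is hypothetical.

```rust
/// Returns the byte index of the first character above U+00FF,
/// or `s.len()` if every character is within Latin1.
/// Hypothetical helper for illustration only.
fn latin1_prefix_len(s: &str) -> usize {
    s.char_indices()
        .find(|&(_, c)| c > '\u{FF}')
        .map_or(s.len(), |(i, _)| i)
}
```

The result is a byte index, not a character count, so it can be used directly to split the string at a character boundary.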

Returns the index of the first byte that starts an invalid byte sequence or a non-Latin1 byte sequence, or the length of the string if there are neither.

Returns the index of the first unpaired surrogate or, if the input is valid UTF-16 in its entirety, the length of the input.
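The same kind of scan can be sketched for UTF-16 validity. The name `utf16_valid_prefix_len` is hypothetical and the code is a scalar illustration only.

```rust
/// Returns the index of the first unpaired surrogate in `buf`,
/// or `buf.len()` if the whole buffer is valid UTF-16.
/// Hypothetical helper for illustration only.
fn utf16_valid_prefix_len(buf: &[u16]) -> usize {
    let mut i = 0;
    while i < buf.len() {
        match buf[i] {
            // High surrogate: must be followed by a low surrogate.
            0xD800..=0xDBFF => {
                if i + 1 < buf.len() && (0xDC00..=0xDFFF).contains(&buf[i + 1]) {
                    i += 2;
                } else {
                    return i;
                }
            }
            // Low surrogate with no preceding high surrogate.
            0xDC00..=0xDFFF => return i,
            _ => i += 1,
        }
    }
    buf.len()
}
```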