Crate rustpython_wtf8


An implementation of WTF-8, a UTF-8-compatible encoding that allows unpaired surrogate codepoints. This implementation additionally allows paired surrogates (a lead surrogate immediately followed by a trail surrogate) that are nonetheless kept as two separate codepoints.
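As a rough illustration of what "UTF-8-compatible, but with surrogates" means at the byte level, the sketch below encodes code points with the usual UTF-8 bit layout and simply does not reject the surrogate range. `encode_generalized_utf8` and `demo_encoding` are hypothetical helpers written for this example; they are not part of this crate's API.

```rust
/// Encode one code point with the UTF-8 bit layout, surrogates included.
/// Hypothetical helper for illustration only, not this crate's API.
fn encode_generalized_utf8(cp: u32, out: &mut Vec<u8>) {
    assert!(cp <= 0x10FFFF, "not a Unicode code point");
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        // Surrogates (U+D800..=U+DFFF) fall in this three-byte range.
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

fn demo_encoding() {
    let mut buf = Vec::new();
    encode_generalized_utf8('h' as u32, &mut buf);
    encode_generalized_utf8(0xD800, &mut buf); // unpaired lead surrogate
    assert_eq!(buf, [b'h', 0xED, 0xA0, 0x80]);
    // Strict UTF-8 validation rejects the encoded surrogate.
    assert!(std::str::from_utf8(&buf).is_err());
}
```

For non-surrogate code points the output is byte-for-byte identical to ordinary UTF-8, which is what makes the encoding compatible.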

RustPython uses this because CPython internally uses a variant of UCS-1/2/4 as its string storage: each string is stored as an array of u8, u16, or u32 values (depending on the highest codepoint value in the string), and each value is treated simply as an integer codepoint, unlike UTF-8 or UTF-16, where some characters are encoded using multi-unit sequences. CPython additionally doesn’t disallow surrogates in strs (in UTF-16, surrogates pair together to represent codepoints with a value higher than u16::MAX) and in fact takes quite extensive advantage of the fact that they’re allowed. The surrogateescape codec-error handler uses them to represent bytes that are invalid in the given codec (e.g. undecodable bytes with their high bit set in ASCII or UTF-8) by mapping each such byte into the surrogate range. surrogateescape is the default error handler in Python for interacting with the filesystem, so if RustPython is to properly support surrogateescape, its strs must be able to represent surrogates.
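The surrogateescape mapping can be sketched in a few lines. The example assumes the mapping defined for Python by PEP 383, where an undecodable byte `b` in 0x80..=0xFF becomes the lone surrogate U+DC00 + `b`, and it handles only the ASCII codec for brevity. The function names are hypothetical and not part of this crate.

```rust
/// Decode bytes as ASCII; each byte >= 0x80 becomes the lone surrogate
/// U+DC00 + byte (so 0x80 -> U+DC80 and 0xFF -> U+DCFF).
/// Hypothetical helper for illustration only.
fn decode_ascii_surrogateescape(bytes: &[u8]) -> Vec<u32> {
    bytes
        .iter()
        .map(|&b| if b < 0x80 { b as u32 } else { 0xDC00 + b as u32 })
        .collect()
}

/// Inverse mapping: escaped surrogates turn back into the original bytes.
/// Returns None for code points that ASCII + surrogateescape cannot encode.
fn encode_ascii_surrogateescape(cps: &[u32]) -> Option<Vec<u8>> {
    cps.iter()
        .map(|&cp| match cp {
            0..=0x7F => Some(cp as u8),
            0xDC80..=0xDCFF => Some((cp - 0xDC00) as u8),
            _ => None,
        })
        .collect()
}

fn demo_surrogateescape() {
    // A file name containing a byte that is not valid ASCII.
    let raw = [b'a', 0xFF, b'b'];
    let decoded = decode_ascii_surrogateescape(&raw);
    assert_eq!(decoded, [0x61_u32, 0xDCFF, 0x62]);
    // The round trip restores the original bytes exactly.
    assert_eq!(encode_ascii_surrogateescape(&decoded), Some(raw.to_vec()));
}
```

Because the decoded string contains U+DCFF, any string type that cannot hold a lone surrogate would lose the original byte, which is the motivation given above.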

We use WTF-8 over something more similar to CPython’s string implementation because of its compatibility with UTF-8, meaning that in the case where a string has no surrogates, it can be viewed as a UTF-8 Rust str without needing any copies or re-encoding.
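A minimal sketch of that zero-copy property, using only the standard library on raw bytes: `as_utf8_str` is a hypothetical stand-in, and the crate's own accessor may be named and implemented differently (for instance, it need not re-validate the way `std::str::from_utf8` does).

```rust
/// Borrow WTF-8 bytes as a &str when they contain no surrogates,
/// i.e. when they are also strict UTF-8. Hypothetical stand-in.
fn as_utf8_str(wtf8_bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(wtf8_bytes).ok()
}

fn demo_zero_copy() {
    let plain = "héllo".as_bytes();
    let view = as_utf8_str(plain).expect("no surrogates present");
    assert_eq!(view, "héllo");
    // The &str points at the same memory: nothing was copied or re-encoded.
    assert_eq!(view.as_ptr(), plain.as_ptr());

    // 'a' followed by the WTF-8 encoding of the lone surrogate U+D800.
    let with_surrogate = [b'a', 0xED, 0xA0, 0x80];
    assert!(as_utf8_str(&with_surrogate).is_none());
}
```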

This implementation is mostly copied from the WTF-8 implementation in the Rust 1.85 standard library, which is used as the backing for OsStr on Windows targets. As previously mentioned, however, it is modified to not join two surrogates into one codepoint when concatenating strings, in order to match CPython’s behavior.
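The difference in concatenation behavior can be shown on raw bytes. In the sketch below, `concat_joining` imitates the joining rule described for the standard library's implementation, and `concat_preserving` imitates the behavior described for this crate. Both functions and `surrogate_at` are hypothetical helpers; they assume well-formed WTF-8 input and are not the crate's actual code.

```rust
/// Decode a three-byte WTF-8 surrogate sequence (ED A0..BF 80..BF).
fn surrogate_at(bytes: &[u8]) -> Option<u16> {
    match bytes {
        &[0xED, b2 @ 0xA0..=0xBF, b3 @ 0x80..=0xBF] => {
            Some(0xD000 | (((b2 & 0x3F) as u16) << 6) | ((b3 & 0x3F) as u16))
        }
        _ => None,
    }
}

/// Keep both surrogates as separate code points (plain byte append).
fn concat_preserving(left: &[u8], right: &[u8]) -> Vec<u8> {
    [left, right].concat()
}

/// Merge a trailing lead surrogate with a leading trail surrogate
/// into one four-byte supplementary code point.
fn concat_joining(left: &[u8], right: &[u8]) -> Vec<u8> {
    let split = left.len().saturating_sub(3);
    let lead = surrogate_at(&left[split..]);
    let trail = right.get(..3).and_then(surrogate_at);
    match (lead, trail) {
        (Some(hi @ 0xD800..=0xDBFF), Some(lo @ 0xDC00..=0xDFFF)) => {
            let cp = 0x10000 + ((hi as u32 - 0xD800) << 10) + (lo as u32 - 0xDC00);
            let ch = char::from_u32(cp).expect("supplementary code point");
            let mut out = left[..split].to_vec();
            out.extend_from_slice(ch.encode_utf8(&mut [0; 4]).as_bytes());
            out.extend_from_slice(&right[3..]);
            out
        }
        _ => concat_preserving(left, right),
    }
}

fn demo_concat() {
    let lead: [u8; 3] = [0xED, 0xA0, 0xBD]; // U+D83D
    let trail: [u8; 3] = [0xED, 0xB8, 0x80]; // U+DE00
    // Joining yields U+1F600 as a single four-byte sequence.
    assert_eq!(concat_joining(&lead, &trail), [0xF0_u8, 0x9F, 0x98, 0x80]);
    // Preserving keeps two three-byte surrogate sequences.
    assert_eq!(
        concat_preserving(&lead, &trail),
        [0xED_u8, 0xA0, 0xBD, 0xED, 0xB8, 0x80]
    );
}
```

With the preserving variant, a string built from two surrogates still has two code points, matching what CPython reports for such a string.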

Macros§

wtf8_concat: Concatenate values into a Wtf8Buf, preserving surrogates.

Structs§

CodePoint: A Unicode code point: from U+0000 to U+10FFFF.
EncodeWide: Generates a wide character sequence for potentially ill-formed UTF-16.
LeadSurrogate
TrailSurrogate
Wtf8: A borrowed slice of well-formed WTF-8 data.
Wtf8Buf: An owned, growable string of well-formed WTF-8 data.
Wtf8Chunks
Wtf8CodePointIndices
Wtf8CodePoints: Iterator for the code points of a WTF-8 string.

Enums§

Wtf8Chunk

Traits§

Wtf8Concat: Trait for types that can be appended to a Wtf8Buf, preserving surrogates.

Functions§

check_utf8_boundary: Verify that index is at the edge of either a valid UTF-8 codepoint (i.e. a codepoint that’s not a surrogate) or of the whole string.
from_boxed_wtf8_unchecked ⚠: Safety
slice_error_fail: Copied from core::str::raw::slice_error_fail
slice_unchecked ⚠: Copied from core::str::raw::slice_unchecked