An implementation of WTF-8, a UTF-8-compatible encoding that allows for unpaired surrogate codepoints. This implementation additionally allows a lead surrogate to be directly followed by a trail surrogate while still treating the two as separate codepoints.
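As a rough illustration of what "UTF-8-compatible but surrogate-tolerant" means at the byte level, the sketch below encodes code points with the ordinary UTF-8 bit layout and simply does not reject U+D800..=U+DFFF. The names `encode_code_point` and `encode` are hypothetical helpers written for this example; they are not part of this crate's API.

```rust
/// Encode one code point (0..=0x10FFFF) with UTF-8's bit layout, without
/// rejecting surrogates. Hypothetical helper, for illustration only.
fn encode_code_point(cp: u32, out: &mut Vec<u8>) {
    assert!(cp <= 0x10FFFF, "not a Unicode code point");
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        // Surrogates fall in this 3-byte range and are encoded like any
        // other code point; strict UTF-8 would refuse them here.
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}

/// Encode a sequence of code points into one byte buffer.
fn encode(code_points: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    for &cp in code_points {
        encode_code_point(cp, &mut out);
    }
    out
}

fn main() {
    // 'a' followed by the lone lead surrogate U+D800.
    let bytes = encode(&['a' as u32, 0xD800]);
    assert_eq!(bytes, vec![0x61_u8, 0xED, 0xA0, 0x80]);
    // The same bytes are rejected by strict UTF-8 validation.
    assert!(std::str::from_utf8(&bytes).is_err());
}
```

For any input without surrogates, the output is byte-for-byte identical to UTF-8, which is the compatibility property the description refers to.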
RustPython uses this because CPython internally uses a variant of UCS-1/2/4
as its string storage, which stores each codepoint as a plain u8, u16, or
u32 integer (depending on the highest codepoint value in the string), unlike
UTF-8 or UTF-16, where some characters are encoded as sequences of several
code units. CPython also does not forbid surrogates in strs (in UTF-16,
surrogates pair together to represent codepoints with a value higher than
u16::MAX) and in fact takes quite extensive advantage of the fact that
they’re allowed. The surrogateescape codec-error handler uses them to
represent bytes which are invalid in the given codec (e.g. bytes with their
high bit set in ASCII, or malformed sequences in UTF-8) by mapping each such
byte into the surrogate range U+DC80 to U+DCFF. surrogateescape is the
default error handler in Python for interacting with the filesystem, and
thus if RustPython is to properly support surrogateescape, its strs must be
able to represent surrogates.
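The surrogateescape mapping can be sketched with the standard library alone. This is a minimal approximation, assuming the UTF-8 codec and assuming that each undecodable byte `b` becomes the lone surrogate U+DC00 + `b`; `decode_surrogateescape` is a hypothetical name, not a function from this crate, and the result is a `Vec<u32>` because Rust's `char` cannot hold a surrogate.

```rust
/// Decode `bytes` as UTF-8, mapping each undecodable byte `b` to the lone
/// surrogate U+DC00 + b. Hypothetical helper, for illustration only.
fn decode_surrogateescape(mut bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                out.extend(s.chars().map(|c| c as u32));
                return out;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                // `valid` was just accepted by from_utf8, so this cannot fail.
                let valid = std::str::from_utf8(valid).unwrap();
                out.extend(valid.chars().map(|c| c as u32));
                // Escape only the first offending byte, then retry after it.
                // Every byte that is invalid in UTF-8 is >= 0x80, so the
                // result lands in U+DC80..=U+DCFF.
                out.push(0xDC00 + rest[0] as u32);
                bytes = &rest[1..];
            }
        }
    }
}

fn main() {
    // 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
    let code_points = decode_surrogateescape(b"caf\xE9");
    assert_eq!(code_points, vec![0x63, 0x61, 0x66, 0xDCE9]);
}
```

Because the escaped bytes survive as distinct code points, the original byte string can be recovered later by reversing the mapping, which is why a str type that cannot hold surrogates cannot support this handler.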
We use WTF-8 over something more similar to CPython’s string implementation
because of its compatibility with UTF-8, meaning that in the case where a
string has no surrogates, it can be viewed as a UTF-8 Rust str without
needing any copies or re-encoding.
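The zero-copy view can be illustrated on raw bytes. The sketch assumes its input is already well-formed WTF-8, so that the only way strict UTF-8 validation can fail is an encoded surrogate; `as_str_if_utf8` is a hypothetical name for this example and does not show how this crate exposes the conversion.

```rust
/// Borrow WTF-8 bytes as `&str` when they contain no surrogates.
/// Assumes `bytes` is well-formed WTF-8. Hypothetical helper.
fn as_str_if_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let plain: &[u8] = "h\u{e9}llo".as_bytes();
    let view = as_str_if_utf8(plain).expect("no surrogates present");
    // Same address and length: the &str reinterprets the buffer in place.
    assert_eq!(view.as_ptr(), plain.as_ptr());
    assert_eq!(view.len(), plain.len());

    // "a" followed by the lone surrogate U+D800 (ED A0 80).
    let with_surrogate: &[u8] = &[0x61, 0xED, 0xA0, 0x80];
    assert!(as_str_if_utf8(with_surrogate).is_none());
}
```

A UCS-1/2/4 style representation would instead need to re-encode the string before handing it to any Rust API that expects `&str`.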
This implementation is mostly copied from the WTF-8 implementation in the
Rust 1.85 standard library, which is used as the backing for OsStr on
Windows targets. As previously mentioned, however, it is modified to not
join two surrogates into one codepoint when concatenating strings, in order
to match CPython’s behavior.
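The difference between the two concatenation behaviors can be sketched on raw WTF-8 bytes. This is an illustration under stated assumptions, not the code from either implementation: `decode_surrogate`, `concat_joining`, and `concat_preserving` are hypothetical names, and both inputs are assumed to be well-formed WTF-8.

```rust
/// If `bytes` is exactly the 3-byte encoding of a surrogate, return its value.
fn decode_surrogate(bytes: &[u8]) -> Option<u16> {
    match bytes {
        [0xED, b2 @ 0xA0..=0xBF, b3 @ 0x80..=0xBF] => {
            Some(0xD000 | ((*b2 as u16 & 0x3F) << 6) | (*b3 as u16 & 0x3F))
        }
        _ => None,
    }
}

/// Joining concatenation: a trailing lead surrogate and a leading trail
/// surrogate are merged into one 4-byte supplementary code point.
fn concat_joining(a: &[u8], b: &[u8]) -> Vec<u8> {
    let lead = a.len().checked_sub(3).and_then(|i| decode_surrogate(&a[i..]));
    let trail = b.get(..3).and_then(|s| decode_surrogate(s));
    match (lead, trail) {
        (Some(hi @ 0xD800..=0xDBFF), Some(lo @ 0xDC00..=0xDFFF)) => {
            let cp = 0x10000 + (((hi - 0xD800) as u32) << 10) + (lo - 0xDC00) as u32;
            let c = char::from_u32(cp).expect("a joined pair is a scalar value");
            let mut out = a[..a.len() - 3].to_vec();
            out.extend_from_slice(c.encode_utf8(&mut [0; 4]).as_bytes());
            out.extend_from_slice(&b[3..]);
            out
        }
        _ => concat_preserving(a, b),
    }
}

/// Preserving concatenation: the bytes are appended unchanged, so the two
/// surrogates remain separate code points.
fn concat_preserving(a: &[u8], b: &[u8]) -> Vec<u8> {
    [a, b].concat()
}

fn main() {
    let lead: [u8; 3] = [0xED, 0xA0, 0xBD]; // U+D83D
    let trail: [u8; 3] = [0xED, 0xB8, 0x80]; // U+DE00

    // Joined: a single code point, U+1F600.
    assert_eq!(concat_joining(&lead, &trail), "\u{1F600}".as_bytes().to_vec());
    // Preserved: still two 3-byte surrogates, six bytes in total.
    assert_eq!(
        concat_preserving(&lead, &trail),
        vec![0xED_u8, 0xA0, 0xBD, 0xED, 0xB8, 0x80]
    );
}
```

With the preserving variant, the number of code points in the result is always the sum of the code points in the inputs, which is what a Python program would observe with `len()` after concatenating two strs.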
Macros§
- wtf8_concat - Concatenate values into a Wtf8Buf, preserving surrogates.
Structs§
- CodePoint - A Unicode code point: from U+0000 to U+10FFFF.
- EncodeWide - Generates a wide character sequence for potentially ill-formed UTF-16.
- LeadSurrogate
- TrailSurrogate
- Wtf8 - A borrowed slice of well-formed WTF-8 data.
- Wtf8Buf - An owned, growable string of well-formed WTF-8 data.
- Wtf8Chunks
- Wtf8CodePointIndices
- Wtf8CodePoints - Iterator for the code points of a WTF-8 string.
Enums§
Traits§
- Wtf8Concat - Trait for types that can be appended to a Wtf8Buf, preserving surrogates.
Functions§
- check_utf8_boundary - Verify that index is at the edge of either a valid UTF-8 codepoint (i.e. a codepoint that’s not a surrogate) or of the whole string.
- from_boxed_wtf8_unchecked ⚠ - Safety
- slice_error_fail - Copied from core::str::raw::slice_error_fail
- slice_unchecked ⚠ - Copied from core::str::raw::slice_unchecked