Lossless conversion between UTF-8 and bytes in Rust.
Non-UTF-8 bytes (>= 128) are encoded in a subset of the Unicode Private Use Area, U+EF80..U+EFFF. Conflicting Unicode characters are escaped with a U+EF00 prefix.
This is useful for passing data that is mostly UTF-8, but occasionally contains invalid UTF-8, through a UTF-8-only format such as JSON. After receiving the UTF-8 data, the original bytes can be reconstructed losslessly.
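The scheme described above can be sketched roughly as follows. This is an illustrative sketch, not this crate's implementation. The names sketch_bytes_to_string, sketch_string_to_bytes, push_escaped and needs_escape are hypothetical. The sketch also assumes that U+EF00 itself is escaped, and that a dangling U+EF00 at the end of the input is kept literally. The text above does not state either.

```rust
// Hypothetical sketch of the encoding described above (not the crate's code).
const ESCAPE: char = '\u{EF00}';

/// True for characters that would collide with the encoding itself.
/// Assumption: U+EF00 is escaped as well as U+EF80..U+EFFF.
fn needs_escape(c: char) -> bool {
    c == ESCAPE || ('\u{EF80}'..='\u{EFFF}').contains(&c)
}

/// Appends valid UTF-8 text, prefixing conflicting characters with U+EF00.
fn push_escaped(out: &mut String, valid: &str) {
    for c in valid.chars() {
        if needs_escape(c) {
            out.push(ESCAPE);
        }
        out.push(c);
    }
}

/// Maps arbitrary bytes to a valid UTF-8 `String`.
fn sketch_bytes_to_string(mut input: &[u8]) -> String {
    let mut out = String::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(valid) => {
                push_escaped(&mut out, valid);
                break;
            }
            Err(e) => {
                // Everything before `valid_up_to` is valid UTF-8.
                let (valid, rest) = input.split_at(e.valid_up_to());
                push_escaped(&mut out, std::str::from_utf8(valid).unwrap());
                // The offending byte is always >= 0x80, so it lands in
                // U+EF80..U+EFFF.
                let b = rest[0];
                out.push(char::from_u32(0xEF00 + u32::from(b)).unwrap());
                input = &rest[1..];
            }
        }
    }
    out
}

/// Inverse of `sketch_bytes_to_string`.
fn sketch_string_to_bytes(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len());
    let mut buf = [0u8; 4];
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c == ESCAPE {
            // The character after the escape is taken literally.
            // Assumption: a dangling escape at the end is kept as-is.
            let lit = chars.next().unwrap_or(c);
            out.extend_from_slice(lit.encode_utf8(&mut buf).as_bytes());
        } else if ('\u{EF80}'..='\u{EFFF}').contains(&c) {
            // An unescaped character in the range stands for one raw byte.
            out.push((c as u32 - 0xEF00) as u8);
        } else {
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }
    out
}
```

The sketch re-validates the remaining input after every invalid byte, which is simple but slow for inputs with many invalid bytes. Without the escape, the raw byte 0x80 and the valid character U+EF80 would both encode to U+EF80, and the round trip would no longer be lossless.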
§About PEP 383 (surrogateescape)
PEP 383 (surrogateescape) is Python’s attempt to solve a similar problem. It uses U+DC80..U+DCFF (surrogates) for non-UTF-8 bytes.
According to the Unicode FAQ, surrogate pairs are for UTF-16, and are invalid in UTF-8:
The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined.
A standard-conformant UTF-8 implementation like Rust’s str would error out on surrogates. "\u{dc80}" does not compile. char::from_u32(0xdc80) is None, meaning U+DC80 is not a valid Rust char.
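The behavior above can be checked with the standard library alone. In this small sketch the helper names is_rust_char and is_valid_utf8 are hypothetical. The bytes ED B2 80 are the 3-byte form that U+DC80 would take if surrogates were encoded like ordinary code points.

```rust
/// True if `cp` is a Unicode scalar value, i.e. representable as a Rust `char`.
fn is_rust_char(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// True if `bytes` is accepted as UTF-8 by Rust's `str`.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}

/// Collects the checks: a surrogate is rejected both as a `char` and as
/// encoded bytes, while the Private Use Area code point U+EF80 is accepted.
fn surrogate_vs_private_use() -> [bool; 4] {
    [
        is_rust_char(0xDC80),                // surrogate: not a char
        is_valid_utf8(&[0xED, 0xB2, 0x80]),  // encoded surrogate: rejected
        is_rust_char(0xEF80),                // Private Use Area: a char
        is_valid_utf8(&[0xEE, 0xBE, 0x80]),  // UTF-8 for U+EF80: accepted
    ]
}
```

This is why a Private Use Area range can be stored in a str while the surrogate range used by PEP 383 cannot.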
Therefore, although Python is widely used and it would be nice to be
compatible with Python, this crate has to use a different encoding.
The U+EF80..U+EFFF range was originally chosen by MirBSD. This crate uses an additional U+EF00 for escaping to achieve lossless round-trip.
Functions§
- bytes_to_str - Converts a byte slice to UTF-8 str.
- str_to_bytes - Inverse of bytes_to_str.