Crate ef80escape

Lossless conversion between UTF-8 and bytes in Rust.

Bytes that are not valid UTF-8 (always >= 128) are encoded as code points in U+EF80..U+EFFF, a subset of the Unicode Private Use Area. Valid Unicode characters that would conflict with this encoding are escaped with a U+EF00 prefix.

This is useful for passing data that is mostly UTF-8 but occasionally contains invalid sequences through a UTF-8-only format such as JSON. The receiver of the UTF-8 data can then reconstruct the original bytes losslessly.
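The mapping described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not this crate's implementation: the names `encode`, `decode`, `is_reserved` and `push_escaped` are hypothetical, an invalid byte `b` is assumed to map to the code point 0xEF00 + `b`, and U+EF00 itself is assumed to be escaped along with U+EF80..U+EFFF.

```rust
use std::str;

/// Escape prefix for characters that collide with the encoding.
const ESCAPE: char = '\u{ef00}';

/// Assumption: the escape character itself and the whole U+EF80..U+EFFF
/// range are treated as reserved and need the U+EF00 prefix.
fn is_reserved(c: char) -> bool {
    c == ESCAPE || ('\u{ef80}'..='\u{efff}').contains(&c)
}

/// Appends valid UTF-8 text to `out`, prefixing reserved characters.
fn push_escaped(out: &mut String, s: &str) {
    for c in s.chars() {
        if is_reserved(c) {
            out.push(ESCAPE);
        }
        out.push(c);
    }
}

/// Hypothetical encoder: arbitrary bytes to a valid `String`.
fn encode(mut bytes: &[u8]) -> String {
    let mut out = String::new();
    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                push_escaped(&mut out, s);
                return out;
            }
            Err(e) => {
                // Everything before `valid_up_to` is valid UTF-8.
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                push_escaped(&mut out, str::from_utf8(valid).unwrap());
                // The first offending byte is always >= 0x80, so the
                // result lands in U+EF80..U+EFFF and is a valid char.
                let b = rest[0];
                out.push(char::from_u32(0xef00 + u32::from(b)).unwrap());
                bytes = &rest[1..];
            }
        }
    }
}

/// Hypothetical decoder: reverses `encode`.
fn decode(s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4];
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c == ESCAPE {
            // The character after the prefix is taken literally.
            // A trailing lone prefix is silently dropped in this sketch.
            if let Some(literal) = chars.next() {
                out.extend_from_slice(literal.encode_utf8(&mut buf).as_bytes());
            }
        } else if ('\u{ef80}'..='\u{efff}').contains(&c) {
            // Unescaped reserved character: recover the original byte.
            out.push((c as u32 - 0xef00) as u8);
        } else {
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }
    out
}
```

Under these assumptions, `encode(b"ab\xffc")` yields `"ab\u{efff}c"`, and `decode` applied to that string returns the original four bytes. Input that already contains U+EF80 as valid UTF-8 comes back unchanged as well, because `encode` writes it as U+EF00 followed by U+EF80.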

About PEP 383 (surrogateescape)

PEP 383 (surrogateescape) is Python’s attempt to solve a similar problem. It uses U+DC80..U+DCFF (surrogates) for non-UTF-8 bytes.

According to the Unicode FAQ, surrogate pairs are for UTF-16, and are invalid in UTF-8:

The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined.

A standard-conformant UTF-8 implementation like Rust’s str rejects surrogates. "\u{dc80}" does not compile, and char::from_u32(0xdc80) is None, meaning U+DC80 is not a valid Rust char.
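These claims can be checked with standard library calls only. The two helper names below are hypothetical and exist purely for illustration.

```rust
/// Returns true if `cp` is a Unicode scalar value, i.e. a valid Rust `char`.
fn is_scalar_value(cp: u32) -> bool {
    char::from_u32(cp).is_some()
}

/// Returns true if the standard library accepts `bytes` as UTF-8.
fn is_valid_utf8(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

`is_scalar_value(0xdc80)` is false while `is_scalar_value(0xef80)` is true. Likewise, the 3-byte pattern `[0xed, 0xb2, 0x80]` that would spell U+DC80 is rejected by `is_valid_utf8`, whereas `[0xee, 0xbe, 0x80]` (U+EF80) is accepted.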

Therefore, although Python is widely used and it would be nice to be compatible with Python, this crate has to use a different encoding. The U+EF80..U+EFFF range was originally chosen by MirBSD. This crate uses an additional U+EF00 for escaping to achieve lossless round-trip.

Functions

Converts a byte slice to UTF-8 str.