Crate cesu8str


A simple library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters. Yes, this is ugly.

Use of this encoding is discouraged by the Unicode Consortium. It’s OK for working with existing internal APIs, but it should not be used for transmitting or storing data.

use std::borrow::Cow;
use cesu8str::{Cesu8Str, Variant};

// Characters in the Basic Multilingual Plane (up to U+FFFF) are encoded
// identically in UTF-8 and CESU-8.
assert_eq!("aé日".as_bytes(), Cesu8Str::from_utf8("aé日", Variant::Standard).as_bytes());
assert_eq!("aé日", Cesu8Str::from_cesu8("aé日".as_bytes(), Variant::Standard).unwrap());

// This string is CESU-8 data containing a 6-byte surrogate pair,
// which decodes to a 4-byte UTF-8 string.
let data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
assert_eq!("\u{10401}", Cesu8Str::from_cesu8(data, Variant::Standard).unwrap());

A note about security

While this library tries its best to check for and reject malformed input, CESU-8 is a legacy data format that should only be used for interacting with legacy libraries. It is intended as an internal-only format; malformed data should be assumed to be either improperly encoded (a bug) or crafted by an attacker.

Java and U+0000, and other variants

Java uses the CESU-8 encoding as described above, but with one difference: the null character U+0000 is represented as the overlong two-byte sequence C0 80 rather than as a single zero byte. This variant is selected with Variant::Java, e.g. Cesu8Str::from_cesu8(bytes, Variant::Java), and calling as_bytes() on a Java-variant string yields bytes in this encoding.
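The null-character rule can be seen without the crate at all. The sketch below (the function name encode_java_nul is illustrative, not part of this crate's API) applies only the U+0000 rule to a string of BMP characters; supplementary characters would additionally need the surrogate-pair handling described in the next section.

```rust
// Sketch: Java's Modified UTF-8 encodes U+0000 as the overlong
// sequence C0 80 instead of a single zero byte, so the encoded
// output never contains an interior NUL byte.
// Note: this toy encoder handles only the NUL rule; characters
// above U+FFFF would also need surrogate-pair encoding.
fn encode_java_nul(s: &str) -> Vec<u8> {
    let mut out = Vec::new();
    for b in s.bytes() {
        if b == 0x00 {
            out.extend_from_slice(&[0xC0, 0x80]); // overlong NUL
        } else {
            out.push(b); // all other bytes pass through as UTF-8
        }
    }
    out
}

fn main() {
    // "a", NUL, "b" becomes 61 C0 80 62 -- four bytes, no zero byte.
    assert_eq!(encode_java_nul("a\u{0}b"), vec![0x61, 0xC0, 0x80, 0x62]);
    println!("ok");
}
```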

Surrogate pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. These are 16-bit numbers in the range 0xD800 to 0xDFFF.

  • 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.

  • 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
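A minimal decoder for these bit patterns can be written by hand, which also verifies the 6-byte example from the crate documentation above. The helper names (decode_surrogate, decode_pair) are illustrative, not part of this crate's API.

```rust
// Sketch: decode one 6-byte CESU-8 surrogate-pair sequence into a
// Unicode scalar value, following the bit layouts listed above.

// Unpack a 3-byte UTF-8-style sequence: 1110xxxx 10yyyyyy 10zzzzzz
// yields the 16-bit value xxxxyyyyyyzzzzzz.
fn decode_surrogate(b: &[u8; 3]) -> u16 {
    ((b[0] as u16 & 0x0F) << 12) | ((b[1] as u16 & 0x3F) << 6) | (b[2] as u16 & 0x3F)
}

fn decode_pair(bytes: &[u8; 6]) -> char {
    let high = decode_surrogate(&[bytes[0], bytes[1], bytes[2]]);
    let low = decode_surrogate(&[bytes[3], bytes[4], bytes[5]]);
    assert!((0xD800..=0xDBFF).contains(&high)); // first half of the pair
    assert!((0xDC00..=0xDFFF).contains(&low));  // second half of the pair
    // Recombine: 0x10000 + (high offset << 10) + low offset.
    let cp = 0x10000 + (((high - 0xD800) as u32) << 10) + (low - 0xDC00) as u32;
    char::from_u32(cp).unwrap()
}

fn main() {
    // The 6-byte example from the docs above decodes to U+10401.
    assert_eq!(decode_pair(&[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81]), '\u{10401}');
    println!("ok");
}
```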

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
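The four steps above translate directly into code. This sketch (the name to_surrogate_pair is illustrative, not part of this crate) reproduces the U+10437 worked example:

```rust
// Sketch of the conversion steps above: split a supplementary code
// point (U+10000..=U+10FFFF) into its UTF-16 surrogate pair.
fn to_surrogate_pair(cp: u32) -> (u16, u16) {
    assert!((0x10000..=0x10FFFF).contains(&cp));
    let v = cp - 0x10000;                    // 20-bit value
    let high = 0xD800 + (v >> 10);           // high 10 bits -> high surrogate
    let low = 0xDC00 + (v & 0x3FF);          // low 10 bits -> low surrogate
    (high as u16, low as u16)
}

fn main() {
    // U+10437 from the worked example: high 0xD801, low 0xDC37.
    assert_eq!(to_surrogate_pair(0x10437), (0xD801, 0xDC37));
    println!("ok");
}
```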

Structs

  • Errors which can occur when attempting to interpret a str or sequence of u8 as a CESU8 string.
  • A CESU-8 or Modified UTF-8 string.

Enums

  • Which variant of the encoding are we working with?

Functions

  • Convert CESU-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
  • Convert Java’s modified UTF-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
  • Check whether a Rust string contains valid CESU-8 data.
  • Check whether a Rust string contains valid data in Java’s modified UTF-8 encoding.
  • Convert a Rust &str to CESU-8 bytes.
  • Convert a Rust &str to Java’s modified UTF-8 bytes.