Crate cesu8str

A simple library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters. Yes, this is ugly.

Use of this encoding is discouraged by the Unicode Consortium. It’s OK for working with existing internal APIs, but it should not be used for transmitting or storing data.

use std::borrow::Cow;
use cesu8str::Cesu8Str;

// 16-bit Unicode characters are the same in UTF-8 and CESU-8.
const TEST_STRING: &str = "aé日";
const TEST_UTF8: &[u8] = TEST_STRING.as_bytes();
assert_eq!(TEST_UTF8, Cesu8Str::from_utf8(TEST_STRING).as_bytes());
let cesu_from_bytes = Cesu8Str::try_from_bytes(TEST_UTF8).unwrap();
assert_eq!(TEST_UTF8, cesu_from_bytes.as_bytes());

// This string is CESU-8 data containing a 6-byte surrogate pair,
// which decodes to a 4-byte UTF-8 string.
let data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
assert_eq!("\u{10401}", Cesu8Str::try_from_bytes(data).unwrap().to_str());
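To make the byte arithmetic behind that last assertion concrete, here is a minimal standalone sketch that decodes one 6-byte CESU-8 surrogate pair by hand. It uses only the standard library; `decode_surrogate_pair` is a hypothetical helper written for illustration and is not part of this crate's API.

```rust
/// Decode a single 6-byte CESU-8 surrogate pair into a `char`.
/// Hypothetical helper for illustration; not part of the cesu8str API.
fn decode_surrogate_pair(bytes: &[u8; 6]) -> Option<char> {
    let [b0, b1, b2, b3, b4, b5] = *bytes;
    let is_continuation = |b: u8| (b & 0xC0) == 0x80;

    // Each half is 0xED, then 0xA0..=0xAF (high) or 0xB0..=0xBF (low),
    // then one ordinary continuation byte.
    let well_formed = b0 == 0xED
        && b3 == 0xED
        && (b1 & 0xF0) == 0xA0
        && (b4 & 0xF0) == 0xB0
        && is_continuation(b2)
        && is_continuation(b5);
    if !well_formed {
        return None;
    }

    // Each surrogate carries 10 payload bits: 4 from the second byte
    // and 6 from the third.
    let high = (((b1 & 0x0F) as u32) << 6) | (b2 & 0x3F) as u32;
    let low = (((b4 & 0x0F) as u32) << 6) | (b5 & 0x3F) as u32;

    // Recombine the 20 payload bits and add back the 0x10000 offset.
    char::from_u32(0x1_0000 + ((high << 10) | low))
}
```

Passing the six bytes from the example above yields U+10401, which occupies four bytes once re-encoded as UTF-8. Two halves in the wrong order are rejected rather than guessed at.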

§A note about security

While this library tries its best to check for malformed input and fail on it, CESU-8 is a legacy data format that should only be used for interacting with legacy libraries. It is intended as an internal-only format, so malformed data should be assumed to be either improperly encoded (a bug) or the work of an attacker.

§Java and U+0000, and other variants

Java uses the CESU-8 encoding as described above, but with one difference: The null character U+0000 is represented as an overlong UTF-8 sequence C0 80. This is supported by the Cesu8Str::from_cesu8(bytes, Variant::Java) and java_variant_str.as_bytes() methods.
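As a rough illustration of the nul rule only, the sketch below rewrites each U+0000 in a Rust string as the overlong pair C0 80. It relies on the fact that a 0x00 byte in valid UTF-8 can only be U+0000. `escape_nul_java_style` is a hypothetical name used for this example, not a function provided by the crate, and it deliberately ignores the surrogate-pair rule.

```rust
/// Rewrite every U+0000 in `text` as the overlong sequence C0 80.
/// Hypothetical helper for illustration; not part of the cesu8str API.
///
/// Only the nul rule is handled here. Supplementary characters are
/// passed through as ordinary 4-byte UTF-8, whereas real Java-variant
/// data would store them as surrogate pairs.
fn escape_nul_java_style(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len());
    for &byte in text.as_bytes() {
        if byte == 0x00 {
            // In valid UTF-8, a zero byte is always the character U+0000.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(byte);
        }
    }
    out
}
```

The practical benefit of this rule is that the encoded bytes never contain 0x00, so the data can be handed to APIs that treat a zero byte as a terminator.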

§Surrogate pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each half of a pair (a surrogate) is a 16-bit number in the range 0xD800 to 0xDFFF.

  • 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.

  • 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
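The bit patterns in the two ranges above come from the ordinary three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) applied to the surrogate value. A minimal sketch, using a hypothetical helper `encode_surrogate` that is not part of this crate:

```rust
/// Encode one UTF-16 surrogate (0xD800..=0xDFFF) as three CESU-8 bytes.
/// Hypothetical helper for illustration; not part of the cesu8str API.
fn encode_surrogate(surrogate: u16) -> Option<[u8; 3]> {
    if !(0xD800..=0xDFFF).contains(&surrogate) {
        return None;
    }
    Some([
        // 1110xxxx: the top four bits are always 1101, giving 0xED.
        0xE0 | (surrogate >> 12) as u8,
        // 10xxxxxx: the middle six bits.
        0x80 | ((surrogate >> 6) & 0x3F) as u8,
        // 10xxxxxx: the low six bits.
        0x80 | (surrogate & 0x3F) as u8,
    ])
}
```

Feeding in the endpoints of each range reproduces the byte patterns listed above: 0xD800 gives ED A0 80, 0xDBFF gives ED AF BF, 0xDC00 gives ED B0 80, and 0xDFFF gives ED BF BF.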

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
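The four steps above translate directly into a few lines of Rust. This is a sketch for illustration only; `to_surrogate_pair` is a hypothetical name and not a function exported by this crate.

```rust
/// Split a supplementary code point (U+10000..=U+10FFFF) into a UTF-16
/// surrogate pair, returned as (high, low).
/// Hypothetical helper for illustration; not part of the cesu8str API.
fn to_surrogate_pair(code_point: u32) -> Option<(u16, u16)> {
    if !(0x1_0000..=0x10_FFFF).contains(&code_point) {
        return None;
    }
    // Step 1: subtract 0x10000, leaving a 20-bit value.
    let offset = code_point - 0x1_0000;
    // Step 2: split into the high and low 10-bit halves.
    let high_bits = (offset >> 10) as u16;
    let low_bits = (offset & 0x3FF) as u16;
    // Steps 3 and 4: add the surrogate bases.
    Some((0xD800 + high_bits, 0xDC00 + low_bits))
}
```

For U+10437 this returns the pair (0xD801, 0xDC37), matching the worked example. The standard library's `char::encode_utf16` performs the same conversion for any `char`.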

Modules§

prelude
A prelude including the most relevant structs for the crate, along with re-exports of some stdlib helpers such as Cow, CStr, CString, and Deref.

Structs§

Cesu8Error
Errors which can occur when attempting to interpret a str or sequence of u8 as a CESU8 string.
Cesu8Str
A borrowed CESU-8 string. This type is not nul-terminated, may contain interior nuls, and encodes characters that are normally four bytes in UTF-8 as a pair of three-byte surrogates.
Cesu8String
An owned CESU-8 encoded string.
FromBytesWithNulError
A possible error value when converting a Vec<u8> into the requested string.
FromMutf8BytesWithNulError
The error when trying to create a Mutf8CString from a byte buffer.
FromStrWithNulError
An error indicating that a nul byte was not in the expected position.
LegacyCesu8Str (Deprecated)
A CESU-8 or Modified UTF-8 string.
Mutf8CStr
A nul-terminated Modified UTF-8 string, valid for use with JNI. Representation of a borrowed Mutf8 C string.
Mutf8CString
An owned Mutf8 byte buffer, with a terminating nul byte.
Mutf8Str
A borrowed MUTF-8 string.
Mutf8String
An owned MUTF-8 encoded string.
TryFromUtf8Error
An error signifying that a buffer was too small, when trying to convert UTF8 to another string type.

Enums§

NGCesu8CError
An error type for creating a Cesu8/Mutf8 CStr/CString
Variant
Which variant of the encoding are we working with?

Functions§

from_cesu8
Convert CESU-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
from_java_cesu8
Convert Java’s modified UTF-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
is_valid_cesu8
Check whether a Rust string contains valid CESU-8 data.
is_valid_java_cesu8
Check whether a Rust string contains valid Java modified UTF-8 data.
to_cesu8
Convert a Rust &str to CESU-8 bytes.
to_java_cesu8
Convert a Rust &str to Java’s modified UTF-8 bytes.