A simple library implementing the CESU-8 compatibility encoding scheme. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 data as 8-bit characters. Yes, this is ugly.
Use of this encoding is discouraged by the Unicode Consortium. It’s OK for working with existing internal APIs, but it should not be used for transmitting or storing data.
use std::borrow::Cow;
use cesu8str::Cesu8Str;
// 16-bit Unicode characters are the same in UTF-8 and CESU-8.
const TEST_STRING: &str = "aé日";
const TEST_UTF8: &[u8] = TEST_STRING.as_bytes();
assert_eq!(TEST_UTF8, Cesu8Str::from_utf8(TEST_STRING).as_bytes());
let cesu_from_bytes = Cesu8Str::try_from_bytes(TEST_UTF8).unwrap();
assert_eq!(TEST_UTF8, cesu_from_bytes.as_bytes());
// This string is CESU-8 data containing a 6-byte surrogate pair,
// which decodes to a 4-byte UTF-8 string.
let data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
assert_eq!("\u{10401}", Cesu8Str::try_from_bytes(data).unwrap().to_str());
§A note about security
While this library tries its best to detect and reject malformed input, CESU-8 is a legacy data format that should only be used for interacting with legacy libraries. Because CESU-8 is intended as an internal-only format, malformed data should be assumed to be either improperly encoded (a bug) or the work of an attacker.
§Java and U+0000, and other variants
Java uses the CESU-8 encoding as described above, but with one difference: the null character U+0000 is represented as the overlong UTF-8 sequence C0 80. This is supported by the Cesu8Str::from_cesu8(bytes, Variant::Java) and java_variant_str.as_bytes() methods.
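As a rough illustration of the U+0000 rule only, the sketch below rewrites each zero byte of a UTF-8 string as C0 80 using nothing but the standard library. The function name encode_nuls_java_style is hypothetical and is not part of this crate; the sketch also deliberately ignores surrogate pairs, so it is not a complete Java-variant encoder.

```rust
/// Hypothetical helper (not part of this crate): rewrite every U+0000 in a
/// UTF-8 string as the overlong two-byte sequence C0 80.
/// Supplementary characters are NOT converted to surrogate pairs here.
fn encode_nuls_java_style(utf8: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(utf8.len());
    for &byte in utf8.as_bytes() {
        if byte == 0 {
            // In valid UTF-8, a 0x00 byte can only be the code point U+0000.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(byte);
        }
    }
    out
}

fn main() {
    let encoded = encode_nuls_java_style("a\0b");
    assert_eq!(encoded, vec![b'a', 0xC0, 0x80, b'b']);
    // The result contains no zero bytes, which is what makes this form
    // convenient for nul-terminated C strings.
    assert!(!encoded.contains(&0));
}
```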
§Surrogate pairs and UTF-8
The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. These are 16-bit numbers in the range 0xD800 to 0xDFFF.
- 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.
- 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
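The bit patterns above follow from applying the ordinary three-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) to each 16-bit surrogate. A minimal sketch using only the standard library is shown here; encode_surrogate is a hypothetical name for illustration and is not this crate's API.

```rust
/// Hypothetical helper (not part of this crate): encode one 16-bit surrogate
/// with the three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx.
/// The caller is assumed to pass a value in 0xD800..=0xDFFF.
fn encode_surrogate(surrogate: u16) -> [u8; 3] {
    [
        0xE0 | (surrogate >> 12) as u8,         // top 4 bits
        0x80 | ((surrogate >> 6) & 0x3F) as u8, // middle 6 bits
        0x80 | (surrogate & 0x3F) as u8,        // low 6 bits
    ]
}

fn main() {
    // First-half (high) surrogates span ED A0 80 to ED AF BF.
    assert_eq!(encode_surrogate(0xD800), [0xED, 0xA0, 0x80]);
    assert_eq!(encode_surrogate(0xDBFF), [0xED, 0xAF, 0xBF]);
    // Second-half (low) surrogates span ED B0 80 to ED BF BF.
    assert_eq!(encode_surrogate(0xDC00), [0xED, 0xB0, 0x80]);
    assert_eq!(encode_surrogate(0xDFFF), [0xED, 0xBF, 0xBF]);
}
```

Every encoded surrogate starts with the byte 0xED, and the second byte (A0 to AF versus B0 to BF) tells the two halves apart.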
Wikipedia explains the code point to UTF-16 conversion process:
Consider the encoding of U+10437 (𐐷):
- Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
- Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
- Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
- Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
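The steps above can be sketched in a few lines of Rust. The function to_surrogate_pair is a hypothetical name used for illustration, not something this crate exports; the result is cross-checked against the standard library's char::encode_utf16.

```rust
/// Hypothetical helper (not part of this crate): split a supplementary code
/// point (U+10000..=U+10FFFF) into its UTF-16 high and low surrogates.
fn to_surrogate_pair(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None; // BMP code points and out-of-range values have no pair
    }
    let offset = code_point - 0x10000;   // 20-bit value
    let high = 0xD800 + (offset >> 10);  // high 10 bits
    let low = 0xDC00 + (offset & 0x3FF); // low 10 bits
    Some((high as u16, low as u16))
}

fn main() {
    // The worked example: U+10437 becomes 0xD801, 0xDC37.
    assert_eq!(to_surrogate_pair(0x10437), Some((0xD801, 0xDC37)));

    // Cross-check against the standard library's UTF-16 encoder.
    let mut buf = [0u16; 2];
    let _ = '\u{10437}'.encode_utf16(&mut buf);
    assert_eq!(buf, [0xD801u16, 0xDC37u16]);

    // Code points below U+10000 are not split.
    assert_eq!(to_surrogate_pair(0x41), None);
}
```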
Modules§
- prelude: A prelude including most relevant structs for the crate, including re-exports of some stdlib helpers such as Cow, CStr, CString, and Deref.
Structs§
- Cesu8Error: Errors which can occur when attempting to interpret a str or sequence of u8 as a CESU-8 string.
- Cesu8Str: A borrowed CESU-8 string. This type is not nul-terminated, may contain interior nuls, and encodes characters that are normally four bytes in UTF-8 as a pair of three-byte surrogates.
- Cesu8String: An owned CESU-8 encoded string.
- FromBytesWithNulError: A possible error value when converting a Vec<u8> into the requested string.
- FromMutf8BytesWithNulError: The error when trying to create a Mutf8CString from a byte buffer.
- FromStrWithNulError: An error indicating that a nul byte was not in the expected position.
- LegacyCesu8Str (Deprecated): A CESU-8 or Modified UTF-8 string.
- Mutf8CStr: A nul-terminated Modified UTF-8 string, valid for use with JNI. Representation of a borrowed Mutf8 C string.
- Mutf8CString: An owned Mutf8 byte buffer, with a terminating nul byte.
- Mutf8Str: A borrowed MUTF-8 string.
- Mutf8String: An owned MUTF-8 encoded string.
- TryFromUtf8Error: An error signifying that a buffer was too small when trying to convert UTF-8 to another string type.
Enums§
- NGCesu8CError: An error type for creating a Cesu8/Mutf8 CStr/CString.
- Variant: Which variant of the encoding are we working with?
Functions§
- from_cesu8: Convert CESU-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
- from_java_cesu8: Convert Java’s modified UTF-8 data to a Rust string, re-encoding only if necessary. Returns an error if the data cannot be represented as valid UTF-8.
- is_valid_cesu8: Check whether a Rust string contains valid CESU-8 data.
- is_valid_java_cesu8: Check whether a Rust string contains valid Java modified UTF-8 data.
- to_cesu8: Convert a Rust &str to CESU-8 bytes.
- to_java_cesu8: Convert a Rust &str to Java’s modified UTF-8 bytes.