Crate cesu8
A library which implements lightweight encoding and decoding functions for converting to and from CESU-8. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 as 8-bit characters.
Use of this encoding is discouraged by the Unicode Consortium. This encoding should only be used for working with existing internal APIs.
```rust
use std::borrow::Cow;

let str = "Hello, world!";
// 16-bit Unicode characters are the same in UTF-8 and CESU-8:
assert_eq!(cesu8::encode(str), Cow::Borrowed(str.as_bytes()));
assert_eq!(cesu8::decode(str.as_bytes()).unwrap(), Cow::Borrowed(str));

let str = "\u{10401}";
let cesu8_data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
// 'cesu8_data' is a byte slice containing a 6-byte surrogate pair which
// becomes a 4-byte UTF-8 character.
assert_eq!(cesu8::decode(cesu8_data).unwrap(), Cow::Borrowed(str));
```
Security
As a general rule, this library is intended to fail on malformed or unexpected input. This is desirable: CESU-8 should only be used internally, so any error signifies either a bug in a developer’s code or an attacker trying to improperly encode data to evade security checks.
Surrogate Pairs and UTF-8
The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each surrogate is a 16-bit number in the range 0xD800 to 0xDFFF.
- 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.
- 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
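The byte patterns in the list above follow from applying the ordinary 3-byte UTF-8 layout (1110xxxx 10xxxxxx 10xxxxxx) to each 16-bit surrogate. The following is a minimal sketch of that bit packing; `encode_surrogate` is a hypothetical helper written for illustration and is not part of this crate's API.

```rust
/// Packs one UTF-16 surrogate (0xD800..=0xDFFF) into the three bytes that
/// CESU-8 uses for it. Returns `None` for any other 16-bit value.
/// Hypothetical helper for illustration; not a function of the cesu8 crate.
fn encode_surrogate(surrogate: u16) -> Option<[u8; 3]> {
    if !(0xD800..=0xDFFF).contains(&surrogate) {
        return None;
    }
    Some([
        // Leading byte: 1110xxxx carrying the top 4 bits (always 1101 here).
        0b1110_0000 | (surrogate >> 12) as u8,
        // Continuation byte: 10xxxxxx carrying the middle 6 bits.
        0b1000_0000 | ((surrogate >> 6) & 0b0011_1111) as u8,
        // Continuation byte: 10xxxxxx carrying the low 6 bits.
        0b1000_0000 | (surrogate & 0b0011_1111) as u8,
    ])
}
```

For example, `encode_surrogate(0xD801)` yields `[0xED, 0xA0, 0x81]` and `encode_surrogate(0xDC01)` yields `[0xED, 0xB0, 0x81]`, which together form the 6-byte sequence used for U+10401 in the crate-level example.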
Wikipedia explains the code point to UTF-16 conversion process:
Consider the encoding of U+10437 (𐐷):
- Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
- Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
- Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
- Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
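The four steps above can be sketched in a few lines of Rust. The function name `to_surrogate_pair` is an assumption chosen for this example; it is not provided by this crate.

```rust
/// Splits a supplementary code point (U+10000..=U+10FFFF) into its UTF-16
/// surrogate pair, returned as (high, low). Returns `None` for code points
/// outside that range, which need no surrogates.
/// Hypothetical helper for illustration; not a function of the cesu8 crate.
fn to_surrogate_pair(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None;
    }
    // Step 1: subtract 0x10000, leaving a 20-bit value.
    let offset = code_point - 0x10000;
    // Steps 2 and 3: the high 10 bits plus 0xD800 form the high surrogate.
    let high = 0xD800 + (offset >> 10) as u16;
    // Steps 2 and 4: the low 10 bits plus 0xDC00 form the low surrogate.
    let low = 0xDC00 + (offset & 0x3FF) as u16;
    Some((high, low))
}
```

Calling `to_surrogate_pair(0x10437)` gives `Some((0xD801, 0xDC37))`, matching the worked example. The standard library's `char::encode_utf16` performs the same conversion and can serve as a cross-check.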
Related Work
This crate is a modified version of Eric Kidd’s cesu-rs repository.
This crate was developed for Residua as part of their technical philosophy to have no external dependencies.
Structs
DecodingError | The error type which is returned from decoding CESU-8 data to UTF-8. |
Functions
decode | Converts a slice of bytes to a string slice. |
encode | Converts a string slice to CESU-8 bytes. |
is_valid | Returns |