Crate cesu8

A library that implements lightweight functions for encoding to and decoding from CESU-8. CESU-8 is a non-standard variant of UTF-8 used internally by some systems that need to represent UTF-16 data as a sequence of 8-bit code units.

Use of this encoding is discouraged by the Unicode Consortium. This encoding should only be used for working with existing internal APIs.

use std::borrow::Cow;

let str = "Hello, world!";
// Code points up to U+FFFF are encoded identically in UTF-8 and CESU-8:
assert_eq!(cesu8::encode(str), Cow::Borrowed(str.as_bytes()));
assert_eq!(cesu8::decode(str.as_bytes()).unwrap(), Cow::Borrowed(str));

let str = "\u{10401}";
let cesu8_data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
// 'cesu8_data' is a byte slice containing a 6-byte surrogate pair which
// becomes a 4-byte UTF-8 character.
assert_eq!(cesu8::decode(cesu8_data).unwrap(), Cow::Borrowed(str));

Security

As a general rule, this library is intended to fail on malformed or unexpected input. This is deliberate: CESU-8 should only be used internally, so any error signals either a bug in the developer’s code or an attacker trying to encode data improperly in order to evade security checks.

Surrogate Pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each half of a pair is a 16-bit code unit in the range 0xD800 to 0xDFFF:

  • 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.

  • 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
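The byte ranges above follow from applying the ordinary three-byte UTF-8 bit layout (1110xxxx 10xxxxxx 10xxxxxx) to each 16-bit surrogate. A minimal sketch of that step is shown below; `encode_surrogate` is a hypothetical helper written for illustration and is not part of this crate's API.

```rust
/// Encodes one UTF-16 surrogate (0xD800..=0xDFFF) as the three-byte
/// sequence CESU-8 uses for it.
/// Hypothetical helper for illustration; not part of the cesu8 crate.
fn encode_surrogate(surrogate: u16) -> [u8; 3] {
    assert!(
        (0xD800..=0xDFFF).contains(&surrogate),
        "value is not a UTF-16 surrogate"
    );
    [
        // Lead byte 1110xxxx carries the top 4 bits (always 1101 for surrogates).
        0b1110_0000 | ((surrogate >> 12) as u8),
        // First continuation byte 10xxxxxx carries the middle 6 bits.
        0b1000_0000 | (((surrogate >> 6) & 0b0011_1111) as u8),
        // Second continuation byte 10xxxxxx carries the low 6 bits.
        0b1000_0000 | ((surrogate & 0b0011_1111) as u8),
    ]
}
```

Under this sketch, the surrogates 0xD801 and 0xDC01 (the UTF-16 form of U+10401) map to ED A0 81 and ED B0 81, which matches the six-byte sequence in the example at the top of this page.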

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
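The steps above can be sketched in a few lines of Rust. The function name `to_surrogate_pair` is an assumption made for this illustration and does not appear in the crate; in real code the standard library's `char::encode_utf16` performs the same conversion.

```rust
/// Splits a supplementary code point (U+10000..=U+10FFFF) into its UTF-16
/// high and low surrogates, following the steps listed above.
/// Hypothetical helper for illustration; not part of the cesu8 crate.
fn to_surrogate_pair(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        // Code points outside this range are not encoded as a surrogate pair.
        return None;
    }
    // Step 1: subtract 0x10000, leaving a 20-bit value.
    let offset = code_point - 0x10000;
    // Steps 2 and 3: the high 10 bits plus 0xD800 form the high surrogate.
    let high = 0xD800 + (offset >> 10) as u16;
    // Steps 2 and 4: the low 10 bits plus 0xDC00 form the low surrogate.
    let low = 0xDC00 + (offset & 0x3FF) as u16;
    Some((high, low))
}
```

For U+10437 this yields the pair (0xD801, 0xDC37), the same values derived in the walkthrough.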

This crate is a modified version of Eric Kidd’s cesu-rs repository. This crate was developed for Residua as part of their technical philosophy to have no external dependencies.

Structs

DecodingError

The error type which is returned from decoding CESU-8 data to UTF-8.

Functions

decode

Converts a slice of bytes to a string slice.

encode

Converts a string slice to CESU-8 bytes.

is_valid

Returns true if a string slice contains UTF-8 data that is also valid CESU-8. This is mainly used to test whether a string slice needs to be explicitly encoded using cesu8::encode().
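To illustrate when UTF-8 data is also valid CESU-8: the two encodings differ only for code points above U+FFFF, which UTF-8 stores as a four-byte sequence and CESU-8 stores as a six-byte surrogate pair. The sketch below expresses that condition using only the standard library. The name `utf8_is_also_cesu8` is hypothetical, and this is not necessarily how the crate's is_valid function is implemented.

```rust
/// Returns true when every character in `text` lies in the range
/// U+0000..=U+FFFF, in which case its UTF-8 bytes are already valid CESU-8.
/// Hypothetical helper for illustration; not part of the cesu8 crate.
fn utf8_is_also_cesu8(text: &str) -> bool {
    // Only code points above U+FFFF are encoded differently in CESU-8.
    text.chars().all(|c| (c as u32) <= 0xFFFF)
}
```

With this check, "Hello, world!" passes unchanged, while a string containing U+10401 does not and would need explicit encoding.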