Crate cesu8

A library for converting between CESU-8 and UTF-8.

Examples

A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in CESU-8 exactly as it is in UTF-8.

If cesu8::encode() or cesu8::decode() encounters only data that is valid as both CESU-8 and UTF-8, the crate returns the input borrowed inside a clone-on-write smart pointer (Cow) instead of copying it. No conversion work is done and no memory is allocated:

use std::borrow::Cow;

let str = "Hello, world!";
assert_eq!(cesu8::encode(str), Cow::Borrowed(str.as_bytes()));
assert_eq!(cesu8::decode(str.as_bytes()).unwrap(), Cow::Borrowed(str));

When the data does need to be converted, the functions behave as one might expect:

use std::borrow::Cow;

let str = "\u{10401}";
let cesu8_data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
assert_eq!(cesu8::decode(cesu8_data).unwrap(), Cow::Borrowed(str));

Technical Details

The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate is encoded as if it were a UTF-8 code point. Therefore, CESU-8 needs six bytes (3 bytes per surrogate) for each Unicode supplementary character while UTF-8 needs only four. Though not specified in the technical report, unpaired surrogates are also encoded as 3 bytes each, so CESU-8 output is exactly what an older UCS-2 to UTF-8 converter produces when it is applied to UTF-16 data.
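The encoding rule above can be sketched with the standard library alone. This is only an illustration of the scheme, not the crate's implementation; the names encode_three_bytes and cesu8_sketch are hypothetical.

```rust
/// Encodes one 16-bit value in the three-byte UTF-8 layout
/// (1110xxxx 10xxxxxx 10xxxxxx). Only meaningful for values >= 0x0800,
/// which includes every surrogate (0xD800 to 0xDFFF).
fn encode_three_bytes(unit: u16) -> [u8; 3] {
    [
        0xE0 | (unit >> 12) as u8,
        0x80 | ((unit >> 6) & 0x3F) as u8,
        0x80 | (unit & 0x3F) as u8,
    ]
}

/// Hypothetical helper: builds CESU-8 bytes for a string using std only.
fn cesu8_sketch(text: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(text.len());
    for ch in text.chars() {
        if ch.len_utf8() < 4 {
            // BMP code point: the CESU-8 bytes are the UTF-8 bytes.
            let mut buf = [0u8; 4];
            out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
        } else {
            // Supplementary code point: split into a surrogate pair,
            // then write each surrogate as three bytes.
            let mut units = [0u16; 2];
            for unit in ch.encode_utf16(&mut units).iter() {
                out.extend_from_slice(&encode_three_bytes(*unit));
            }
        }
    }
    out
}
```

Under these assumptions, cesu8_sketch("\u{10401}") yields the six bytes ED A0 81 ED B0 81, while an ASCII string comes back byte for byte unchanged.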

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.

Security

As a general rule, this library is intended to fail on malformed or unexpected input. This is deliberate: because CESU-8 should only be used internally, any error signals either a bug in a developer’s code or an attacker trying to improperly encode data to evade security checks.

Surrogate Pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each half of a pair is a 16-bit number in the range 0xD800 to 0xDFFF.

  • 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as CESU-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.

  • 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
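The two bit patterns above can be recognised directly from the bytes. The following is a minimal sketch of that check, assuming a three-byte input; SurrogateHalf and classify_surrogate are hypothetical names and are not part of the crate.

```rust
#[derive(Debug, PartialEq)]
enum SurrogateHalf {
    High, // 0xD800 to 0xDBFF, first half of a pair
    Low,  // 0xDC00 to 0xDFFF, second half of a pair
}

/// Hypothetical helper: returns which half a three-byte sequence encodes,
/// together with the 16-bit surrogate value, or None if the bytes are not
/// an encoded surrogate.
fn classify_surrogate(bytes: [u8; 3]) -> Option<(SurrogateHalf, u16)> {
    // Every encoded surrogate starts with 11101101 (0xED) and ends with
    // a continuation byte of the form 10xxxxxx.
    if bytes[0] != 0xED || (bytes[2] & 0xC0) != 0x80 {
        return None;
    }
    // The second byte is 1010xxxx for a high surrogate and 1011xxxx for
    // a low surrogate. Anything else is an ordinary code point or invalid.
    let half = match bytes[1] & 0xF0 {
        0xA0 => SurrogateHalf::High,
        0xB0 => SurrogateHalf::Low,
        _ => return None,
    };
    // Reassemble the value: 0xD from the lead byte, then 6 + 6 payload bits.
    let unit = 0xD000 | (u16::from(bytes[1] & 0x3F) << 6) | u16::from(bytes[2] & 0x3F);
    Some((half, unit))
}
```

For instance, the bytes ED A0 80 classify as a high surrogate with value 0xD800, and ED BF BF as a low surrogate with value 0xDFFF.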

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
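The four steps above translate almost line for line into code. This is a sketch using only integer arithmetic; to_surrogates is a hypothetical name, and in practice char::encode_utf16 from the standard library performs the same conversion.

```rust
/// Hypothetical helper: splits a supplementary code point into its
/// UTF-16 surrogate pair, or returns None outside U+10000 to U+10FFFF.
fn to_surrogates(code_point: u32) -> Option<(u16, u16)> {
    if !(0x10000..=0x10FFFF).contains(&code_point) {
        return None;
    }
    // Step 1: subtract 0x10000, leaving a 20-bit value.
    let offset = code_point - 0x10000;
    // Steps 2 and 3: the high 10 bits, offset by 0xD800.
    let high = 0xD800 + (offset >> 10) as u16;
    // Steps 2 and 4: the low 10 bits, offset by 0xDC00.
    let low = 0xDC00 + (offset & 0x3FF) as u16;
    Some((high, low))
}
```

With this sketch, to_surrogates(0x10437) evaluates to Some((0xD801, 0xDC37)), matching the worked example.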

Structs

DecodingError

The error type which is returned from decoding CESU-8 data to UTF-8.

Functions

decode

Converts a slice of bytes to a string slice.

encode

Converts a string slice to CESU-8 bytes.

encoded_len

Given a string slice, this function returns how many bytes in CESU-8 are required to encode the string slice.
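The length rule follows from the encoding: characters that take up to three bytes in UTF-8 keep their length, and each four-byte character grows to six bytes. A minimal sketch under that assumption, using only the standard library (cesu8_len_sketch is a hypothetical name and may not match how the crate computes the value):

```rust
/// Hypothetical helper: number of bytes a string would occupy in CESU-8.
fn cesu8_len_sketch(text: &str) -> usize {
    text.chars()
        .map(|ch| match ch.len_utf8() {
            // Supplementary character: two surrogates, three bytes each.
            4 => 6,
            // BMP character: same length as in UTF-8.
            n => n,
        })
        .sum()
}
```

For example, "Hello, world!" needs 13 bytes and "\u{10401}" needs 6.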

is_valid

Returns true if a string slice contains UTF-8 data that is also valid CESU-8. This is mainly used in testing if a string slice needs to be explicitly encoded using cesu8::encode().
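Because the two encodings differ only for supplementary characters, the check reduces to asking whether the string contains any four-byte UTF-8 sequence. The following sketch illustrates the idea with the standard library alone; is_also_cesu8 is a hypothetical name and the crate's own check may be implemented differently.

```rust
/// Hypothetical helper: true if the UTF-8 bytes of `text` are also valid
/// CESU-8, i.e. the string contains no supplementary characters.
fn is_also_cesu8(text: &str) -> bool {
    // In well-formed UTF-8, a byte of 0xF0 or above can only be the lead
    // byte of a four-byte sequence, which CESU-8 does not allow.
    text.bytes().all(|b| b < 0xF0)
}
```

Under this sketch, is_also_cesu8("Hello, world!") is true and is_also_cesu8("\u{10401}") is false.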