Crate mutf8

A library which implements lightweight encoding and decoding functions for converting to and from MUTF-8. This is a non-standard variant of UTF-8 that is used internally by some systems that need to represent UTF-16 as 8-bit characters while ensuring no strings can contain embedded null characters.

Use of this encoding is discouraged by the Unicode Consortium. This encoding should only be used for working with existing internal APIs.

use std::borrow::Cow;

let str = "Hello, world!";
// Non-null characters in the Basic Multilingual Plane (U+0001 to U+FFFF)
// are encoded the same in UTF-8 and MUTF-8:
assert_eq!(mutf8::encode(str), Cow::Borrowed(str.as_bytes()));
assert_eq!(mutf8::decode(str.as_bytes()).unwrap(), Cow::Borrowed(str));

let str = "\u{10401}";
let mutf8_data = &[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81];
// 'mutf8_data' is a byte slice containing a 6-byte surrogate pair which
// becomes a 4-byte UTF-8 character.
assert_eq!(mutf8::decode(mutf8_data).unwrap(), Cow::Borrowed(str));

let str = "\0";
let mutf8_data = &[0xC0, 0x80];
// 'str' is a null character which becomes a two-byte MUTF-8 representation.
assert_eq!(mutf8::encode(str), Cow::Borrowed(mutf8_data));
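
For contrast, Rust's strict UTF-8 validation rejects both MUTF-8 byte sequences used above, which is why a dedicated decoder is needed. The following is a minimal std-only sketch; the helper name `std_rejects` is hypothetical and not part of this crate.

```rust
/// Returns true if Rust's strict UTF-8 validation rejects `bytes`.
/// Hypothetical helper for illustration; not part of the mutf8 crate.
fn std_rejects(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_err()
}
```

Under this sketch, both the encoded surrogate pair `[0xED, 0xA0, 0x81, 0xED, 0xB0, 0x81]` and the two-byte null `[0xC0, 0x80]` are rejected, while plain ASCII such as `"Hello, world!"` is accepted.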

Security

As a general rule, this library is intended to fail on malformed or unexpected input. This is deliberate: because MUTF-8 should only be used internally, any error signals either a bug in a developer’s code or an attacker trying to improperly encode data to evade security checks.

Surrogate Pairs and UTF-8

The UTF-16 encoding uses “surrogate pairs” to represent Unicode code points in the range from U+10000 to U+10FFFF. Each half of a pair is a 16-bit number in the range 0xD800 to 0xDFFF.

  • 0xD800 to 0xDBFF: First half of surrogate pair. When encoded as MUTF-8, these become 11101101 10100000 10000000 to 11101101 10101111 10111111.

  • 0xDC00 to 0xDFFF: Second half of surrogate pair. These become 11101101 10110000 10000000 to 11101101 10111111 10111111.
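
The bit patterns in the two ranges above follow from the generic three-byte UTF-8 layout, 1110xxxx 10xxxxxx 10xxxxxx, applied to each surrogate on its own. A minimal sketch, assuming a hypothetical helper `encode_surrogate` that is not part of this crate:

```rust
/// Encodes one UTF-16 surrogate (0xD800..=0xDFFF) as three bytes using the
/// generic three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx.
/// Hypothetical helper for illustration; not part of the mutf8 crate.
fn encode_surrogate(surrogate: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&surrogate), "not a surrogate");
    [
        0xE0 | (surrogate >> 12) as u8,         // top 4 bits
        0x80 | ((surrogate >> 6) & 0x3F) as u8, // middle 6 bits
        0x80 | (surrogate & 0x3F) as u8,        // low 6 bits
    ]
}
```

For any surrogate the first byte is 0xED (11101101), and the second byte distinguishes the halves: 0xA0 to 0xAF for a first half, 0xB0 to 0xBF for a second half.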

Wikipedia explains the code point to UTF-16 conversion process:

Consider the encoding of U+10437 (𐐷):

  • Subtract 0x10000 from 0x10437. The result is 0x00437, 0000 0000 0100 0011 0111.
  • Split this into the high 10-bit value and the low 10-bit value: 0000000001 and 0000110111.
  • Add 0xD800 to the high value to form the high surrogate: 0xD800 + 0x0001 = 0xD801.
  • Add 0xDC00 to the low value to form the low surrogate: 0xDC00 + 0x0037 = 0xDC37.
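
The steps above can be sketched in a few lines of Rust. The function name `to_surrogate_pair` is hypothetical and only illustrates the arithmetic; it is not part of this crate.

```rust
/// Splits a supplementary code point (U+10000..=U+10FFFF) into its
/// UTF-16 high and low surrogates, following the steps listed above.
/// Hypothetical helper for illustration; not part of the mutf8 crate.
fn to_surrogate_pair(code_point: u32) -> (u16, u16) {
    assert!((0x10000..=0x10FFFF).contains(&code_point));
    let offset = code_point - 0x10000;          // 20-bit value
    let high = 0xD800 + (offset >> 10) as u16;  // top 10 bits
    let low = 0xDC00 + (offset & 0x3FF) as u16; // bottom 10 bits
    (high, low)
}
```

For U+10437 this yields 0xD801 and 0xDC37, matching the worked example. For U+10401 it yields 0xD801 and 0xDC01, the two surrogates behind the bytes 0xED 0xA0 0x81 0xED 0xB0 0x81 in the decoding example at the top of this page.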

This crate is a modified version of Eric Kidd’s cesu-rs repository. This crate was developed for Residua as part of their technical philosophy to have no external dependencies.

Structs

DecodingError

The error type which is returned from decoding MUTF-8 data to UTF-8.

Functions

decode

Converts a slice of MUTF-8 bytes to a string slice.

encode

Converts a string slice to MUTF-8 bytes.

encoded_len

Given a string slice, this function returns how many bytes in MUTF-8 are required to encode the string slice.
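
As an illustration of how such a length can be derived, the sketch below counts bytes per code point: two for U+0000, six for anything above U+FFFF (a surrogate pair at three bytes per half), and the usual UTF-8 width otherwise. The name `mutf8_len_sketch` is hypothetical, and the crate's own `encoded_len` may be implemented differently.

```rust
/// Counts the bytes needed to store `s` as MUTF-8.
/// Hypothetical sketch; the crate's own `encoded_len` may differ.
fn mutf8_len_sketch(s: &str) -> usize {
    s.chars()
        .map(|c| match c as u32 {
            0 => 2usize,         // U+0000 becomes 0xC0 0x80
            0x01..=0x7F => 1,    // ASCII is unchanged
            0x80..=0x7FF => 2,   // same as UTF-8
            0x800..=0xFFFF => 3, // same as UTF-8
            _ => 6,              // surrogate pair, 3 bytes per half
        })
        .sum()
}
```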

is_valid

Returns true if a string slice contains UTF-8 data that is also valid MUTF-8. This is mainly used in testing if a string slice needs to be explicitly encoded using mutf8::encode().
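
The condition can be stated compactly: a string slice's UTF-8 bytes are already valid MUTF-8 when it contains no U+0000 and no code point above U+FFFF. A minimal sketch under that assumption follows. The name `is_already_mutf8` is hypothetical, and the crate's own `is_valid` may differ in signature and implementation.

```rust
/// Returns true if the UTF-8 bytes of `s` are already valid MUTF-8,
/// i.e. `s` contains no U+0000 and no code point above U+FFFF.
/// Hypothetical sketch; the crate's own `is_valid` may differ.
fn is_already_mutf8(s: &str) -> bool {
    s.chars().all(|c| c != '\0' && (c as u32) <= 0xFFFF)
}
```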