pub enum Utf8ErrorKind {
    TooFewBytes,
    NonUtf8Byte,
    UnexpectedContinuationByte,
    InterruptedSequence,
    OverlongEncoding,
    Utf16ReservedCodepoint,
    TooHighCodepoint,
}

The types of errors that can occur when decoding a UTF-8 codepoint.

The variants are more technical than what an end user is likely to be interested in, but might be useful for deciding how to handle the error.

They can be grouped into three categories:

  • Will happen regularly if decoding chunked or buffered text: TooFewBytes.
  • Input might be binary, in a different encoding, or corrupted (broken UTF-8 sequence): UnexpectedContinuationByte and InterruptedSequence.
  • Less likely to happen accidentally and might be malicious: OverlongEncoding, Utf16ReservedCodepoint and TooHighCodepoint. Note that these can still be caused by certain valid latin-1 strings such as "Á©" (b"\xC1\xA9").
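The grouping above could be turned into a handling policy roughly as follows. This is only a sketch: the enum is re-declared locally so the example compiles on its own, and `Handling` and `suggested_handling` are hypothetical names that are not part of this crate. NonUtf8Byte is not listed in any of the three categories; placing it in the second group is an assumption based on its description.

```rust
/// Local mirror of the enum so the sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8ErrorKind {
    TooFewBytes,
    NonUtf8Byte,
    UnexpectedContinuationByte,
    InterruptedSequence,
    OverlongEncoding,
    Utf16ReservedCodepoint,
    TooHighCodepoint,
}

/// Hypothetical policy type (not part of the crate).
#[derive(Debug, PartialEq, Eq)]
pub enum Handling {
    /// Keep the bytes and wait for the next chunk or buffer.
    WaitForMoreInput,
    /// The input is probably binary, corrupted, or in another encoding.
    TreatAsNotUtf8,
    /// Unlikely to be accidental; consider rejecting the input.
    TreatAsSuspicious,
}

/// Maps each error kind to one of the three categories.
pub fn suggested_handling(kind: Utf8ErrorKind) -> Handling {
    use Utf8ErrorKind::*;
    match kind {
        TooFewBytes => Handling::WaitForMoreInput,
        // NonUtf8Byte is placed here by assumption, see the note above.
        NonUtf8Byte | UnexpectedContinuationByte | InterruptedSequence => {
            Handling::TreatAsNotUtf8
        }
        OverlongEncoding | Utf16ReservedCodepoint | TooHighCodepoint => {
            Handling::TreatAsSuspicious
        }
    }
}
```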

Variants

TooFewBytes

There are too few bytes to decode the codepoint.

This can happen when a slice is empty or too short, or an iterator returned None while in the middle of a codepoint.
This error is never produced by functions accepting fixed-size [u8; 4] arrays.

If the text arrives in chunks (such as in buffers passed to Read), the remaining bytes, including the byte this error was produced for, should be carried over into the next chunk or buffer.
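The carry-over idea can be illustrated with the standard library alone. The sketch below does not use this crate's API: it relies on `std::str::from_utf8`, where `Utf8Error::error_len()` returning `None` plays the role of TooFewBytes (the input ended in the middle of a sequence). The function name `push_chunk` is hypothetical.

```rust
use std::str::{self, Utf8Error};

/// Appends `chunk` to the bytes carried over from earlier chunks, decodes
/// as much as possible into `out`, and keeps an incomplete trailing
/// sequence in `carry` for the next call.
///
/// Returns an error if the bytes are invalid rather than merely incomplete.
pub fn push_chunk(
    carry: &mut Vec<u8>,
    chunk: &[u8],
    out: &mut String,
) -> Result<(), Utf8Error> {
    carry.extend_from_slice(chunk);
    let valid = match str::from_utf8(carry) {
        Ok(_) => carry.len(),
        // `error_len() == None` means the input ended mid-sequence.
        Err(e) if e.error_len().is_none() => e.valid_up_to(),
        Err(e) => return Err(e),
    };
    // The prefix was validated above, so decoding it again cannot fail.
    out.push_str(str::from_utf8(&carry[..valid]).expect("validated prefix"));
    // Drop the decoded bytes; the incomplete tail stays for the next chunk.
    carry.drain(..valid);
    Ok(())
}
```

Feeding `b"caf\xC3"` and then `b"\xA9!"` through this function leaves the lone `0xC3` in `carry` after the first call and completes the "é" on the second.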

NonUtf8Byte

A byte which is never used by well-formed UTF-8 was encountered.

This means that the input is using a different encoding, is corrupted or binary.

This error is returned when a byte in the following ranges is encountered anywhere in a UTF-8 sequence:

  • 192 and 193 (0b1100_000x): Indicates an overlong encoding of a single-byte (ASCII) character, and should therefore never occur.
  • 245..=247 (0b1111_0101..=0b1111_0111): Indicates a codepoint that is too high (above \u10ffff).
  • 248.. (0b1111_1xxx): Sequences cannot be longer than 4 bytes.
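The byte ranges listed above can be expressed as a single predicate. This is a minimal sketch; `is_never_utf8` is a hypothetical helper and not a function provided by this crate.

```rust
/// Returns true for byte values that never occur in well-formed UTF-8:
/// 192 and 193, and everything from 245 upwards.
pub fn is_never_utf8(byte: u8) -> bool {
    matches!(byte, 0xC0 | 0xC1 | 0xF5..=0xFF)
}
```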

UnexpectedContinuationByte

The first byte is not a valid start of a codepoint.

This might happen as a result of slicing into the middle of a codepoint, the input not being UTF-8 encoded or being corrupted. Errors of this type coming right after another error should probably be ignored, unless returned more than three times in a row.

This error is returned when the first byte has a value in the range 128..=191 (0b1000_0000..=0b1011_1111).
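As an illustration of the range above, a continuation byte can be recognized by masking its two highest bits. The helper name `is_continuation_byte` is an assumption made for this sketch.

```rust
/// Returns true if `byte` matches the pattern 0b10xx_xxxx,
/// i.e. lies in the range 128..=191.
pub fn is_continuation_byte(byte: u8) -> bool {
    byte & 0b1100_0000 == 0b1000_0000
}
```

Slicing "é" (bytes 0xC3 0xA9) in the middle leaves 0xA9 as the first byte, which is such a continuation byte.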

InterruptedSequence

A byte at index 1..=3 should be a continuation byte, but doesn’t match the pattern 0b10xx_xxxx.

When the input slice or iterator has too few bytes, TooFewBytes is returned instead.
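The relationship between the two errors might look like the sketch below, which checks the length first and only then looks for a non-continuation byte. `TailCheck` and `check_tail` are hypothetical names, and the exact order of checks inside the crate is an assumption inferred from the sentence above.

```rust
/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TailCheck {
    Complete,
    TooFewBytes,
    /// Index of the first byte that is not a continuation byte.
    InterruptedAt(usize),
}

/// Checks the bytes following the first byte of a sequence that is
/// expected to be `len` bytes long (`len` must be in 1..=4).
pub fn check_tail(bytes: &[u8], len: usize) -> TailCheck {
    assert!((1..=4).contains(&len));
    // Too short input takes precedence over an interrupted sequence.
    if bytes.len() < len {
        return TailCheck::TooFewBytes;
    }
    match bytes[1..len].iter().position(|&b| b & 0xC0 != 0x80) {
        Some(offset) => TailCheck::InterruptedAt(offset + 1),
        None => TailCheck::Complete,
    }
}
```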

OverlongEncoding

The encoding of the codepoint has so many leading zeroes that it could be a byte shorter.

Successfully decoding this can present a security issue: doing so could allow an attacker to circumvent input validation that only checks for ASCII characters, and to smuggle in characters or strings that would otherwise be rejected, such as /../.

This error is only returned for 3- and 4-byte encodings; for overlong encodings of other lengths, NonUtf8Byte is returned for the byte that starts them.
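For 3- and 4-byte sequences, an overlong encoding can be recognized from the first two bytes alone. The following is a sketch under the assumption that the remaining bytes are valid continuation bytes; `is_overlong` is a hypothetical helper, not part of this crate.

```rust
/// Returns true if `bytes` starts a 3- or 4-byte sequence whose codepoint
/// would have fit in a shorter encoding.
pub fn is_overlong(bytes: &[u8]) -> bool {
    match bytes {
        // 3 bytes are only needed from U+0800, which starts with E0 A0.
        [0xE0, second, ..] => *second < 0xA0,
        // 4 bytes are only needed from U+10000, which starts with F0 90.
        [0xF0, second, ..] => *second < 0x90,
        _ => false,
    }
}
```

For example, `[0xE0, 0x80, 0xAF]` is an overlong 3-byte encoding of "/", which a validator that only looks for the ASCII byte 0x2F would miss. The standard library's `std::str::from_utf8` rejects such sequences as well.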

Utf16ReservedCodepoint

The codepoint is reserved for UTF-16 surrogate pairs.

(Utf8Char cannot be used to work with the WTF-8 encoding for UCS-2 strings.)

This error is returned for codepoints in the range \ud800..=\udfff (which are three bytes long as UTF-8).
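To see why these codepoints are three bytes long, the payload bits of a 3-byte sequence can be combined by hand. This is a sketch with hypothetical helper names; it assumes the input already has the shape of a 3-byte sequence and performs no validation itself.

```rust
/// Combines the payload bits of a 3-byte sequence (1110xxxx 10xxxxxx 10xxxxxx).
pub fn decode_three_bytes(bytes: [u8; 3]) -> u32 {
    (u32::from(bytes[0] & 0x0F) << 12)
        | (u32::from(bytes[1] & 0x3F) << 6)
        | u32::from(bytes[2] & 0x3F)
}

/// Returns true for codepoints reserved for UTF-16 surrogate pairs.
pub fn is_utf16_surrogate(codepoint: u32) -> bool {
    (0xD800..=0xDFFF).contains(&codepoint)
}
```

The bytes `[0xED, 0xA0, 0x80]` combine to 0xD800, the first surrogate. `char::from_u32` from the standard library returns `None` for the same range.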

TooHighCodepoint

The codepoint is higher than \u10ffff, which is the highest codepoint Unicode permits.
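The limit can be checked by combining the payload bits of a 4-byte sequence. As before, this is only a sketch: `decode_four_bytes` and `is_too_high` are hypothetical names, and the input is assumed to already have the shape of a 4-byte sequence.

```rust
/// Combines the payload bits of a 4-byte sequence
/// (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).
pub fn decode_four_bytes(bytes: [u8; 4]) -> u32 {
    (u32::from(bytes[0] & 0x07) << 18)
        | (u32::from(bytes[1] & 0x3F) << 12)
        | (u32::from(bytes[2] & 0x3F) << 6)
        | u32::from(bytes[3] & 0x3F)
}

/// Returns true for values above the highest Unicode codepoint.
pub fn is_too_high(codepoint: u32) -> bool {
    codepoint > 0x10FFFF
}
```

The sequence `[0xF4, 0x8F, 0xBF, 0xBF]` is the last valid one (0x10FFFF), while `[0xF4, 0x90, 0x80, 0x80]` combines to 0x110000 and is too high.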
