Function decode_auto 

pub fn decode_auto(input: &[u8]) -> Result<Cow<'_, str>, DecodeError>

Decode Aozora source bytes to UTF-8, detecting the encoding.

Aozora material reaches this crate in two shapes: the canonical Shift_JIS archive files and already-decoded UTF-8 mirrors (e.g. a corpus that has been pre-converted). Forcing every caller to commit to one encoding up front is the wrong default; it is why corpus tooling historically hard-coded decode_sjis and silently broke on UTF-8 input. With this entry point the caller no longer has to choose: hand it bytes, get back UTF-8.

  • Valid UTF-8 is returned borrowed, with no copy.
  • Otherwise the bytes are decoded as Shift_JIS and returned as an owned string.
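
The two branches above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `decode_auto_sketch`, `SketchError` and `toy_fallback` are hypothetical names, and the fallback is a toy stand-in that understands only ASCII and the single-byte half-width katakana range, where a real Shift_JIS decoder would go.

```rust
use std::borrow::Cow;

/// Hypothetical error type for this sketch; the real function returns `DecodeError`.
#[derive(Debug, PartialEq)]
enum SketchError {
    FallbackInvalid,
}

/// UTF-8 first, fallback second. Valid UTF-8 is borrowed from `input`;
/// anything else is handed to `fallback`, which must produce an owned `String`.
fn decode_auto_sketch<'a>(
    input: &'a [u8],
    fallback: impl FnOnce(&[u8]) -> Option<String>,
) -> Result<Cow<'a, str>, SketchError> {
    match std::str::from_utf8(input) {
        // Zero-copy path: the returned &str points into `input`.
        Ok(s) => Ok(Cow::Borrowed(s)),
        // Owned path: the bytes had to be transcoded.
        Err(_) => fallback(input)
            .map(Cow::Owned)
            .ok_or(SketchError::FallbackInvalid),
    }
}

/// Toy stand-in for a Shift_JIS decoder (an assumption of this sketch):
/// accepts ASCII and the single-byte half-width katakana range 0xA1..=0xDF,
/// which maps to U+FF61..=U+FF9F. Every other byte is rejected.
fn toy_fallback(bytes: &[u8]) -> Option<String> {
    bytes
        .iter()
        .map(|&b| match b {
            0x00..=0x7F => Some(b as char),
            0xA1..=0xDF => char::from_u32(0xFF61 + u32::from(b - 0xA1)),
            _ => None,
        })
        .collect()
}
```

Returning `Cow` lets the common UTF-8 case avoid an allocation while keeping a single return type for both branches; callers that need a `String` can call `into_owned()`.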

UTF-8 is tried first on purpose. Valid UTF-8 is a near-unambiguous signal, because Shift_JIS Japanese text essentially never forms a wholly valid UTF-8 sequence. The converse does not hold: a UTF-8 document can contain byte runs that decode as Shift_JIS without erroring, so sniffing Shift_JIS first risks mojibake on UTF-8 input.
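
The asymmetry can be seen on a two-character example. The sketch below is illustrative only: `looks_like_shift_jis` is a hypothetical helper that checks the Shift_JIS lead/trail byte layout and nothing more (it does not consult a code table, so it is weaker than a real decoder). The Shift_JIS bytes for 青空 are written out by hand.

```rust
/// Structural check only (hypothetical helper): does `bytes` follow the
/// Shift_JIS layout of single bytes and lead/trail pairs? It does not verify
/// that each pair is an assigned character.
fn looks_like_shift_jis(bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // ASCII range and single-byte half-width katakana.
            0x00..=0x7F | 0xA1..=0xDF => i += 1,
            // Lead byte of a two-byte character: a valid trail byte must follow.
            0x81..=0x9F | 0xE0..=0xFC => match bytes.get(i + 1).copied() {
                Some(0x40..=0x7E) | Some(0x80..=0xFC) => i += 2,
                _ => return false,
            },
            _ => return false,
        }
    }
    true
}

/// "青空" encoded as Shift_JIS: 0x90C2 (青), 0x8BF3 (空).
const AOZORA_SJIS: [u8; 4] = [0x90, 0xC2, 0x8B, 0xF3];

/// Shift_JIS input is rejected by the UTF-8 check, so trying UTF-8 first is safe:
/// 0x90 is a UTF-8 continuation byte and cannot start a sequence.
fn sjis_is_rejected_as_utf8() -> bool {
    std::str::from_utf8(&AOZORA_SJIS).is_err()
}

/// UTF-8 input passes the Shift_JIS layout check, so trying Shift_JIS first
/// would accept it and produce mojibake instead of an error.
fn utf8_passes_sjis_layout() -> bool {
    // UTF-8 bytes of "青空": E9 9D 92 E7 A9 BA.
    looks_like_shift_jis("青空".as_bytes())
}
```

Both helper functions return `true`, which is the asymmetry in miniature: a failed UTF-8 check is informative, while a passed Shift_JIS check is not.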

BOM stripping, CRLF folding and NFC normalisation are the parser’s Phase-0 responsibility and are deliberately not applied here.

§Errors

Returns DecodeError::ShiftJisInvalid when the bytes are neither valid UTF-8 nor valid Shift_JIS.