pub fn decode_auto(input: &[u8]) -> Result<Cow<'_, str>, DecodeError>
Decode Aozora source bytes to UTF-8, detecting the encoding.
Aozora material reaches this crate in two shapes: the canonical
Shift_JIS archive files, and already-decoded UTF-8 mirrors (e.g. a
corpus that has been pre-converted). Forcing every caller to commit
to one encoding up front is the wrong default — it is why corpus
tooling historically hard-coded decode_sjis and silently broke on
UTF-8 input. This entry point removes that concern: hand it bytes,
get back UTF-8.
- Valid UTF-8 is returned borrowed, zero-copy.
- Otherwise the bytes are decoded as Shift_JIS (owned).
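The borrowed/owned split above can be sketched as follows. This is a minimal illustration under stated assumptions, not this crate's implementation: `decode_utf8_first` is a hypothetical helper, and the `fallback` closure stands in for whatever Shift_JIS decoder the crate actually uses.

```rust
use std::borrow::Cow;

/// Hypothetical helper, not part of this crate: tries UTF-8 first and
/// only calls `fallback` (a stand-in for a Shift_JIS decoder) when the
/// input is not valid UTF-8.
pub fn decode_utf8_first<'a, E>(
    input: &'a [u8],
    fallback: impl FnOnce(&'a [u8]) -> Result<String, E>,
) -> Result<Cow<'a, str>, E> {
    match std::str::from_utf8(input) {
        // Valid UTF-8: borrow the caller's bytes, no allocation.
        Ok(text) => Ok(Cow::Borrowed(text)),
        // Anything else goes to the fallback decoder, which allocates.
        Err(_) => fallback(input).map(Cow::Owned),
    }
}
```

Returning `Cow` lets callers that only read the text skip a copy on the UTF-8 path, while callers that need a `String` can still call `into_owned()`.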
UTF-8 is tried first on purpose. Valid UTF-8 is a near-unambiguous
signal, because Shift_JIS Japanese text essentially never forms a wholly
valid UTF-8 sequence. The converse does not hold: a UTF-8
document can contain byte runs that also decode as Shift_JIS
without erroring, so sniffing Shift_JIS first risks mojibake on
UTF-8 input.
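The asymmetry can be illustrated with a small byte-level sketch. `is_structurally_sjis` is a hypothetical check written for this example: it only tests the lead/trail byte ranges commonly given for Shift_JIS and does not consult a mapping table, so it is looser than a real decoder.

```rust
/// Hypothetical structural check, not a decoder: accepts any byte
/// sequence whose single bytes and lead/trail pairs fall inside the
/// assumed Shift_JIS ranges.
pub fn is_structurally_sjis(bytes: &[u8]) -> bool {
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            // ASCII range and half-width katakana are single bytes.
            0x00..=0x7F | 0xA1..=0xDF => i += 1,
            // Lead byte of a double-byte character: needs a valid trail byte.
            0x81..=0x9F | 0xE0..=0xFC => match bytes.get(i + 1).copied() {
                Some(0x40..=0x7E) | Some(0x80..=0xFC) => i += 2,
                _ => return false,
            },
            // 0x80, 0xA0 and 0xFD..=0xFF are not valid starting bytes here.
            _ => return false,
        }
    }
    true
}
```

With this check, the Shift_JIS bytes for "あ" (`0x82 0xA0`) are rejected by `std::str::from_utf8`, whereas the UTF-8 bytes for "ああ" (`E3 81 82 E3 81 82`) pair up as three structurally plausible double-byte sequences. That is the situation in which trying Shift_JIS first would yield mojibake instead of an error.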
BOM stripping, CRLF folding and NFC normalisation are the parser’s Phase-0 responsibility and are deliberately not applied here.
§Errors
Returns DecodeError::ShiftJisInvalid when the bytes are neither
valid UTF-8 nor valid Shift_JIS.