pub fn detect_encoding_or_binary(
bytes: &[u8],
truncated: bool,
) -> (Encoding, bool)
Detect the text encoding and whether content is binary.
Returns (Encoding, is_binary) where:
- Encoding is the detected encoding (or default if binary)
- is_binary is true if the content should be treated as raw binary
When truncated is true, an incomplete multi-byte UTF-8 sequence at the
end of the sample is tolerated (up to 3 bytes), since it most likely results
from the caller truncating a larger stream. When truncated is false, such
trailing bytes cause the sample to be rejected as UTF-8.
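The truncation rule can be illustrated with a minimal sketch built only on the standard library. The helper name is_utf8_sample is hypothetical and is not part of this crate; the sketch assumes that "incomplete trailing sequence" corresponds to the case where Utf8Error::error_len() returns None, which the standard library documents as the input ending unexpectedly. The actual implementation may differ.

```rust
use std::str;

/// Hypothetical helper (not part of this crate): returns true if `bytes`
/// can be accepted as UTF-8 under the truncation rule described above.
fn is_utf8_sample(bytes: &[u8], truncated: bool) -> bool {
    match str::from_utf8(bytes) {
        Ok(_) => true,
        Err(e) => {
            // `error_len() == None` means the input ended in the middle of
            // a multi-byte sequence; everything before it was valid.
            let incomplete_tail = e.error_len().is_none();
            // An incomplete UTF-8 sequence is at most 3 bytes long, so this
            // check is redundant here; it is kept to mirror the documented rule.
            let tail_len = bytes.len() - e.valid_up_to();
            truncated && incomplete_tail && tail_len <= 3
        }
    }
}
```

Under these assumptions, the sample b"caf\xC3" (the first byte of a two-byte sequence, cut off) is accepted when truncated is true and rejected when it is false, while a sample containing an invalid byte such as 0xFF is rejected in both cases.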
§Detection Strategy
- Check for a BOM (Byte Order Mark): highest priority; a BOM means the content is not binary
- Try UTF-8 validation (fast path for most files): valid UTF-8 is not binary
- Check for UTF-16 patterns without a BOM: a match is not binary
- Check for binary control characters (null bytes, etc.): if any are found, the content is binary
- Use chardetng for statistical detection of legacy encodings
- If encoding detection is uncertain, default to Windows-1252
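The ordering of these checks can be sketched with the standard library alone. Everything in the block is an assumption made for illustration: the SketchEncoding enum stands in for the real Encoding type, the UTF-16 heuristic (counting zero bytes in each half of every 2-byte unit) and the set of "binary" control bytes are guesses, the encoding returned for binary content is a placeholder, and the chardetng step is replaced by going straight to the Windows-1252 fallback. It shows the control flow only, not the crate's actual logic.

```rust
/// Illustrative stand-in for the real `Encoding` type (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
enum SketchEncoding {
    Utf8,
    Utf16Le,
    Utf16Be,
    Windows1252,
}

/// Assumed heuristic: text in UTF-16 that is mostly Latin has a zero byte
/// in the same half of most 2-byte units and none in the other half.
fn guess_utf16(bytes: &[u8]) -> Option<SketchEncoding> {
    let (mut pairs, mut first_zero, mut second_zero) = (0usize, 0usize, 0usize);
    for pair in bytes.chunks_exact(2) {
        pairs += 1;
        if pair[0] == 0 { first_zero += 1; }
        if pair[1] == 0 { second_zero += 1; }
    }
    if pairs == 0 {
        None
    } else if second_zero * 2 >= pairs && first_zero == 0 {
        Some(SketchEncoding::Utf16Le)
    } else if first_zero * 2 >= pairs && second_zero == 0 {
        Some(SketchEncoding::Utf16Be)
    } else {
        None
    }
}

/// Hypothetical sketch of the documented check order.
fn detect_sketch(bytes: &[u8]) -> (SketchEncoding, bool) {
    // 1. BOM: highest priority, never binary.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) { return (SketchEncoding::Utf8, false); }
    if bytes.starts_with(&[0xFF, 0xFE]) { return (SketchEncoding::Utf16Le, false); }
    if bytes.starts_with(&[0xFE, 0xFF]) { return (SketchEncoding::Utf16Be, false); }
    // 2. UTF-8 validation (fast path).
    if std::str::from_utf8(bytes).is_ok() { return (SketchEncoding::Utf8, false); }
    // 3. UTF-16 patterns without a BOM.
    if let Some(enc) = guess_utf16(bytes) { return (enc, false); }
    // 4. Binary control characters (assumed set: C0 controls except
    //    tab, LF, FF, CR, and ESC). The encoding returned alongside
    //    `true` is a placeholder; the real default is not stated.
    let is_control = |b: &u8| *b < 0x20 && !matches!(*b, b'\t' | b'\n' | 0x0C | b'\r' | 0x1B);
    if bytes.iter().any(is_control) { return (SketchEncoding::Windows1252, true); }
    // 5-6. Statistical detection (chardetng) is omitted in this sketch;
    //      fall back directly to Windows-1252.
    (SketchEncoding::Windows1252, false)
}
```

Note that the UTF-16 check has to run before the control-character check, as in the documented order: UTF-16 text for mostly-Latin content is full of null bytes and would otherwise be classified as binary.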