Function detect_encoding_or_binary 

pub fn detect_encoding_or_binary(
    bytes: &[u8],
    truncated: bool,
) -> (Encoding, bool)

Detect the text encoding and whether content is binary.

Returns (Encoding, is_binary) where:

  • Encoding is the detected encoding (or default if binary)
  • is_binary is true if the content should be treated as raw binary

When truncated is true, an incomplete multi-byte UTF-8 sequence at the end of the sample (up to 3 bytes) is tolerated, since it most likely results from the caller cutting a larger stream at an arbitrary byte offset. When truncated is false, the same trailing bytes cause the sample to fail UTF-8 validation.
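
The truncation rule above can be sketched with the standard library alone. This is an illustration, not the function's actual implementation: the helper name is_utf8_allowing_truncation is hypothetical, and it assumes that "incomplete sequence at the end" corresponds to std's Utf8Error::error_len() returning None.

```rust
use std::str;

/// Hypothetical helper (not part of the documented API): returns true if
/// `bytes` is valid UTF-8, optionally tolerating an incomplete multi-byte
/// sequence at the very end of the sample.
fn is_utf8_allowing_truncation(bytes: &[u8], truncated: bool) -> bool {
    match str::from_utf8(bytes) {
        Ok(_) => true,
        Err(e) => {
            // `error_len()` is `None` only when the input ended in the
            // middle of a sequence; `Some(_)` means a genuinely invalid byte.
            let incomplete_tail = e.error_len().is_none();
            // Number of bytes after the last fully valid character.
            let tail_len = bytes.len() - e.valid_up_to();
            truncated && incomplete_tail && tail_len <= 3
        }
    }
}
```

With this sketch, the bytes b"caf\xC3" (the first half of "é") pass when truncated is true and fail when it is false, while b"caf\xFF" fails in both cases because 0xFF is never valid in UTF-8.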

§Detection Strategy

  1. Check for a BOM (Byte Order Mark). This has the highest priority; content with a BOM is text, not binary.
  2. Try UTF-8 validation (fast path for most files). Content that validates is text, not binary.
  3. Check for UTF-16 patterns without a BOM. Content that matches is text, not binary.
  4. Check for binary control characters (null bytes, etc.). If any are found, it’s binary.
  5. Use chardetng for statistical detection of legacy encodings.
  6. If encoding detection is uncertain, default to Windows-1252.
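
The ordering of steps 1, 2, and 4 can be illustrated with a small std-only sketch. Everything named here (Verdict, detect_bom, classify) is hypothetical and exists only for illustration; the real function returns (Encoding, bool). Step 3 is skipped because the UTF-16 heuristics are not specified, steps 5 and 6 are collapsed into a placeholder variant, and only the UTF-8 and UTF-16 BOMs are assumed.

```rust
/// Hypothetical result type for this sketch only.
#[derive(Debug, PartialEq)]
enum Verdict {
    Text(&'static str),
    Binary,
    /// Stands in for steps 5 and 6 (statistical detection, then fallback).
    NeedsStatisticalDetection,
}

/// Step 1: look for a byte order mark at the start of the sample.
fn detect_bom(bytes: &[u8]) -> Option<&'static str> {
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some("UTF-8")
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some("UTF-16LE")
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some("UTF-16BE")
    } else {
        None
    }
}

fn classify(bytes: &[u8]) -> Verdict {
    // Step 1: a BOM wins over every other signal.
    if let Some(name) = detect_bom(bytes) {
        return Verdict::Text(name);
    }
    // Step 2: valid UTF-8 is treated as text.
    if std::str::from_utf8(bytes).is_ok() {
        return Verdict::Text("UTF-8");
    }
    // Step 3 (BOM-less UTF-16 heuristics) is omitted from this sketch.
    // Step 4: a null byte in non-UTF-8 data marks the content as binary.
    if bytes.contains(&0) {
        return Verdict::Binary;
    }
    Verdict::NeedsStatisticalDetection
}
```

Because the checks run in order, a sample such as [0xFF, 0xFE, b'a', 0x00] is classified as UTF-16LE text by the BOM check even though it contains a null byte, while [0x89, 0x00, 0x01] has no BOM, fails UTF-8 validation, and is therefore reported as binary.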