Module encoding

Text encoding detection and conversion

This module handles:

  • Detecting text encodings from byte content (UTF-8, UTF-16, Latin-1, CJK encodings, etc.)
  • Binary file detection (distinguishing text from binary content)
  • Converting between encodings (normalizing to UTF-8 on load, converting back on save)
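The load/save round trip can be illustrated with a minimal sketch that uses only the standard library and handles UTF-16 LE alone. The function names `utf16le_to_string` and `string_to_utf16le` are hypothetical and are not this module's API; the module's own `convert_to_utf8` and `convert_from_utf8` may have different signatures and cover many more encodings.

```rust
/// Hypothetical helper: decode UTF-16 LE bytes into a Rust `String` (UTF-8).
/// Returns `None` for an odd byte count or an unpaired surrogate.
pub fn utf16le_to_string(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    // Reassemble little-endian byte pairs into 16-bit code units.
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}

/// Hypothetical helper: encode UTF-8 text back to UTF-16 LE bytes for saving.
/// No BOM is written in this sketch.
pub fn string_to_utf16le(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}
```

Normalizing to UTF-8 on load means the rest of the program only deals with `String`/`&str`; the original encoding has to be remembered separately so the same bytes can be produced on save.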

Encoding Detection Strategy

  1. BOM Detection: Check for Byte Order Marks (UTF-8 BOM, UTF-16 LE/BE)
  2. UTF-8 Validation: Fast path for most modern files
  3. UTF-16 Heuristics: Detect UTF-16 without BOM via null byte patterns
  4. Binary Detection: Check for control characters that indicate binary content
  5. Statistical Detection: Use chardetng for legacy encoding detection
  6. Fallback: Default to Windows-1252 for ambiguous cases
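The ordering above can be sketched with the standard library alone. Everything named here (`Guess`, `guess_encoding`, `is_binary_control`, `guess_utf16_without_bom`) is hypothetical, and the thresholds and control-byte set are assumptions chosen for illustration; the module's real `Encoding` enum and `detect_encoding` may behave differently. Step 5 (statistical detection with chardetng) is omitted, so anything that survives the earlier checks falls straight through to the Windows-1252 default. The sketch also assumes the UTF-8 fast path rejects NUL bytes, since otherwise BOM-less UTF-16 ASCII text would pass UTF-8 validation.

```rust
/// Hypothetical result type for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Guess {
    Utf8,
    Utf8Bom,
    Utf16Le,
    Utf16Be,
    Binary,
    Windows1252,
}

/// Assumed set of "binary" control bytes: C0 controls and DEL,
/// excluding tab, LF, VT, FF, CR and ESC, which occur in text files.
fn is_binary_control(byte: u8) -> bool {
    matches!(byte, 0x00..=0x08 | 0x0E..=0x1A | 0x1C..=0x1F | 0x7F)
}

/// Assumed heuristic: mostly-ASCII UTF-16 has NUL in every other byte.
fn guess_utf16_without_bom(bytes: &[u8]) -> Option<Guess> {
    let pairs = bytes.len() / 2;
    if pairs < 2 {
        return None;
    }
    let (mut first_nulls, mut second_nulls) = (0usize, 0usize);
    for pair in bytes.chunks_exact(2) {
        if pair[0] == 0 {
            first_nulls += 1;
        }
        if pair[1] == 0 {
            second_nulls += 1;
        }
    }
    if second_nulls * 2 > pairs && first_nulls == 0 {
        Some(Guess::Utf16Le) // low byte first, zero high byte second
    } else if first_nulls * 2 > pairs && second_nulls == 0 {
        Some(Guess::Utf16Be) // zero high byte first
    } else {
        None
    }
}

pub fn guess_encoding(bytes: &[u8]) -> Guess {
    // 1. BOM detection.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Guess::Utf8Bom;
    }
    if bytes.starts_with(&[0xFF, 0xFE]) {
        return Guess::Utf16Le;
    }
    if bytes.starts_with(&[0xFE, 0xFF]) {
        return Guess::Utf16Be;
    }
    // 2. UTF-8 validation (assumption: NUL bytes disqualify the fast path).
    if !bytes.contains(&0) && std::str::from_utf8(bytes).is_ok() {
        return Guess::Utf8;
    }
    // 3. UTF-16 heuristics via NUL byte patterns.
    if let Some(guess) = guess_utf16_without_bom(bytes) {
        return guess;
    }
    // 4. Binary detection.
    if bytes.iter().copied().any(is_binary_control) {
        return Guess::Binary;
    }
    // 5. Statistical detection is skipped in this sketch.
    // 6. Fallback.
    Guess::Windows1252
}
```

Running the cheap, unambiguous checks first (BOM, UTF-8 validity) keeps the common case fast and leaves the guesswork for files that are genuinely ambiguous.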

Enums

Encoding
Supported text encodings for file I/O

Functions

convert_from_utf8
Convert UTF-8 content to the specified encoding for saving
convert_to_utf8
Convert bytes from a specific encoding to UTF-8
detect_and_convert
Detect encoding and convert bytes to UTF-8
detect_encoding
Detect the text encoding from a sample of bytes
detect_encoding_or_binary
Detect the text encoding and whether content is binary
is_binary_control_char
Check if a byte is a binary control character