Text encoding detection and conversion
This module handles:
- Detecting text encodings from byte content (UTF-8, UTF-16, Latin-1, CJK, etc.)
- Binary file detection (distinguishing text from binary content)
- Converting between encodings (normalizing to UTF-8 on load, converting back on save)
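As an illustration of the load/save round trip, here is a minimal sketch using only the standard library and ISO-8859-1 (Latin-1), the simplest legacy encoding to convert by hand. The helper names `latin1_to_utf8` and `utf8_to_latin1` are hypothetical and are not part of this module; the module's own `convert_to_utf8` and `convert_from_utf8` presumably cover many more encodings and may have different signatures.

```rust
/// Hypothetical helper: decode ISO-8859-1 (Latin-1) bytes into a UTF-8 `String`.
/// Every Latin-1 byte maps to the Unicode code point with the same numeric value.
fn latin1_to_utf8(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Hypothetical helper: encode UTF-8 text back to Latin-1 for saving.
/// Returns `None` if any character falls outside the Latin-1 range (above U+00FF).
fn utf8_to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code_point = c as u32;
            if code_point <= 0xFF {
                Some(code_point as u8)
            } else {
                None
            }
        })
        .collect()
}
```

Returning `Option` on the save path makes the lossy case explicit: text that gained characters outside the original encoding while being edited as UTF-8 cannot be written back unchanged.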
Encoding Detection Strategy
- BOM Detection: Check for Byte Order Marks (UTF-8 BOM, UTF-16 LE/BE)
- UTF-8 Validation: Fast path for most modern files
- UTF-16 Heuristics: Detect UTF-16 without BOM via null byte patterns
- Binary Detection: Check for control characters that indicate binary content
- Statistical Detection: Use chardetng for legacy encoding detection
- Fallback: Default to Windows-1252 for ambiguous cases
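The strategy above can be sketched with the standard library alone, reading the list as an ordered sequence of checks (an assumption). This is not the module's implementation: the `Guess` enum, the function names, the exact set of control bytes, and the null-byte thresholds are all hypothetical, and the statistical step is left out because it depends on the external chardetng crate. One further assumption is that the UTF-8 fast path rejects samples containing binary control bytes, so that ASCII-heavy UTF-16 without a BOM (which is technically valid UTF-8) still reaches the UTF-16 heuristic.

```rust
/// Hypothetical result type for this sketch; the module's real `Encoding` enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Guess {
    Utf8,
    Utf8Bom,
    Utf16Le,
    Utf16Be,
    Windows1252,
    Binary,
}

/// Assumed definition of a "binary" control byte: any C0 control other than
/// tab, LF, form feed, CR, and ESC. The real `is_binary_control_char` may differ.
fn looks_like_binary_control(b: u8) -> bool {
    b < 0x20 && !matches!(b, b'\t' | b'\n' | 0x0C | b'\r' | 0x1B)
}

/// UTF-16 heuristic: ASCII-heavy UTF-16 has a zero byte in the same half of
/// most 16-bit units and none in the other half. Thresholds are illustrative.
fn guess_utf16_without_bom(sample: &[u8]) -> Option<Guess> {
    if sample.len() < 4 {
        return None;
    }
    let pairs = sample.len() / 2;
    let (mut even_nulls, mut odd_nulls) = (0usize, 0usize);
    for pair in sample.chunks_exact(2) {
        if pair[0] == 0 {
            even_nulls += 1;
        }
        if pair[1] == 0 {
            odd_nulls += 1;
        }
    }
    if odd_nulls * 2 > pairs && even_nulls == 0 {
        Some(Guess::Utf16Le) // low byte first, high byte zero
    } else if even_nulls * 2 > pairs && odd_nulls == 0 {
        Some(Guess::Utf16Be) // high byte zero, low byte second
    } else {
        None
    }
}

fn guess_encoding(sample: &[u8]) -> Guess {
    // 1. BOM detection.
    if sample.starts_with(&[0xEF, 0xBB, 0xBF]) {
        return Guess::Utf8Bom;
    }
    if sample.starts_with(&[0xFF, 0xFE]) {
        return Guess::Utf16Le;
    }
    if sample.starts_with(&[0xFE, 0xFF]) {
        return Guess::Utf16Be;
    }
    let has_binary_control = sample.iter().any(|&b| looks_like_binary_control(b));
    // 2. UTF-8 validation (fast path), skipped when control bytes are present.
    if !has_binary_control && std::str::from_utf8(sample).is_ok() {
        return Guess::Utf8;
    }
    // 3. UTF-16 heuristics based on null byte positions.
    if let Some(guess) = guess_utf16_without_bom(sample) {
        return guess;
    }
    // 4. Binary detection.
    if has_binary_control {
        return Guess::Binary;
    }
    // 5. Statistical detection (chardetng) is omitted from this sketch.
    // 6. Fallback for ambiguous cases.
    Guess::Windows1252
}
```

Under these assumptions, `guess_encoding(b"hello")` yields `Guess::Utf8`, the bytes `h\0i\0!\0` yield `Guess::Utf16Le`, and a PNG file header yields `Guess::Binary`.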
Enums
- Encoding - Supported text encodings for file I/O
Functions
- convert_from_utf8 - Convert UTF-8 content to the specified encoding for saving
- convert_to_utf8 - Convert bytes from a specific encoding to UTF-8
- detect_and_convert - Detect encoding and convert bytes to UTF-8
- detect_encoding - Detect the text encoding from a sample of bytes
- detect_encoding_or_binary - Detect the text encoding and whether content is binary
- is_binary_control_char - Check if a byte is a binary control character