Advanced character encoding detection command implementation.
This module detects the character encoding of subtitle files, helping users identify and resolve encoding issues that cause non-ASCII characters to display incorrectly. It combines several detection algorithms and heuristics to identify the encoding and report how reliable each result is.
§Detection Algorithms
The encoding detection system employs multiple complementary approaches:
§Byte Order Mark (BOM) Detection
- UTF-8: EF BB BF byte sequence
- UTF-16LE: FF FE byte sequence
- UTF-16BE: FE FF byte sequence
- UTF-32LE: FF FE 00 00 byte sequence
- UTF-32BE: 00 00 FE FF byte sequence
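As an illustration of the BOM checks listed above, here is a minimal sketch using only the standard library. The `Bom` enum and `sniff_bom` function are hypothetical names chosen for this example and are not part of this module's API. The sketch tests the 4-byte UTF-32 marks before the 2-byte UTF-16 marks, because the UTF-32LE mark begins with the same FF FE bytes as the UTF-16LE mark.

```rust
/// Encodings that can be identified from a byte order mark.
/// Hypothetical type for illustration; not part of this crate's API.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

/// Returns the BOM found at the start of `bytes`, if any.
pub fn sniff_bom(bytes: &[u8]) -> Option<Bom> {
    // Longer marks first: FF FE 00 00 (UTF-32LE) shares its prefix
    // with FF FE (UTF-16LE).
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some(Bom::Utf32Le)
    } else if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some(Bom::Utf32Be)
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Bom::Utf8)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(Bom::Utf16Le)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(Bom::Utf16Be)
    } else {
        None
    }
}
```

A file without a BOM yields `None`, in which case detection would fall through to the statistical and format-specific checks. Note that FF FE 00 00 is also a valid start for UTF-16LE text whose first character is NUL; this sketch assumes that case does not occur in subtitle files.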
§Statistical Analysis
- Character Frequency: Analyze byte patterns for specific encodings
- Bigram Analysis: Examine two-byte character combinations
- Language Heuristics: Apply language-specific character patterns
- Confidence Scoring: Quantify detection reliability
§Format-Specific Detection
- ASCII Compatibility: Check for pure ASCII content
- Extended ASCII: Detect Windows-1252, ISO-8859-1 variants
- Multi-byte Encodings: Identify UTF-8, GB2312, Shift_JIS patterns
- Legacy Encodings: Support for regional and historical encodings
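The first two checks in this list can be sketched with the standard library alone. The `Coarse` enum and `classify` function below are hypothetical names for illustration, not this module's implementation. The sketch only separates pure ASCII, valid UTF-8, and everything else, and a real detector would go on to examine the remaining bytes for legacy encodings.

```rust
/// Coarse classification of a byte buffer.
/// Hypothetical type for illustration; not part of this crate's API.
#[derive(Debug, PartialEq, Eq)]
pub enum Coarse {
    /// Every byte is below 0x80.
    Ascii,
    /// Not pure ASCII, but well-formed UTF-8.
    Utf8,
    /// Neither; likely an extended-ASCII or legacy multi-byte encoding.
    Unknown,
}

/// Classifies `bytes` using only strict validity checks.
pub fn classify(bytes: &[u8]) -> Coarse {
    if bytes.is_ascii() {
        Coarse::Ascii
    } else if std::str::from_utf8(bytes).is_ok() {
        Coarse::Utf8
    } else {
        Coarse::Unknown
    }
}
```

For example, the word "café" encoded as ISO-8859-1 ends in the single byte E9, which is not a complete UTF-8 sequence, so it is classified as `Unknown` instead of `Utf8`.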
§Supported Encodings
§Unicode Family
- UTF-8: Universal encoding, recommended for all new files
- UTF-16LE/BE: Unicode with byte order variants
- UTF-32: Full Unicode support with fixed width
§Western European
- ISO-8859-1 (Latin-1): Basic Western European characters
- Windows-1252: Microsoft’s Western European encoding
- ISO-8859-15: Latin-1 with Euro symbol support
§East Asian
- GB2312/GBK: Simplified Chinese encodings
- Big5: Traditional Chinese encoding
- Shift_JIS: Japanese encoding
- EUC-JP: Alternative Japanese encoding
- EUC-KR: Korean encoding
§Cyrillic and Others
- Windows-1251: Russian and Cyrillic languages
- KOI8-R: Russian encoding
- ISO-8859-5: Cyrillic alphabet
§Detection Features
- Confidence Scoring: Reliability percentage for each detection
- Alternative Suggestions: Multiple encoding candidates with scores
- Content Sampling: Display decoded text samples for verification
- Language Hints: Detect probable language from character patterns
- Format Validation: Verify encoding produces valid subtitle content
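To make confidence scoring and alternative suggestions concrete, the following sketch ranks a set of candidates by score. The `Candidate` struct, the `rank` function, and the `min_confidence` threshold are assumptions made for this example. The module's actual result types and scoring scale are not shown in this documentation.

```rust
/// One possible encoding for a file, with a reliability score.
/// Hypothetical type for illustration; not part of this crate's API.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub encoding: &'static str,
    /// Assumed to lie in the range 0.0..=1.0.
    pub confidence: f32,
}

/// Drops candidates below `min_confidence` and returns the rest,
/// most confident first.
pub fn rank(mut candidates: Vec<Candidate>, min_confidence: f32) -> Vec<Candidate> {
    candidates.retain(|c| c.confidence >= min_confidence);
    // `total_cmp` gives a total order on f32, so sorting cannot panic on NaN.
    candidates.sort_by(|a, b| b.confidence.total_cmp(&a.confidence));
    candidates
}
```

The first element of the returned list would be the reported encoding, and the remaining elements would be the alternative suggestions.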
§Examples
use subx_cli::commands::detect_encoding_command;
// Detect encoding for multiple files
let files = vec![
"subtitle1.srt".to_string(),
"subtitle2.ass".to_string(),
];
detect_encoding_command::detect_encoding_command(&files, true)?;
// Basic detection without verbose output
detect_encoding_command::detect_encoding_command(&["file.srt".to_string()], false)?;
Functions§
- detect_encoding_command - Execute character encoding detection for subtitle files based on input arguments.
- detect_encoding_command_with_config - Execute encoding detection command with injected configuration service.