UTF-8 and text encoding utilities for ASS script processing
Provides BOM handling, encoding detection, and UTF-8 validation utilities optimized for ASS subtitle script processing with zero-copy design.
§Features
- BOM detection and stripping for common encodings
- UTF-8 validation with detailed error reporting
- Encoding detection for legacy ASS files
- no_std compatible implementation
- Zero-copy operations where possible
§Examples
use ass_core::utils::utf8::{strip_bom, detect_encoding, validate_utf8};
// Strip BOM if present
let input = "\u{FEFF}[Script Info]\nTitle: Test";
let (stripped, had_bom) = strip_bom(input);
assert_eq!(stripped, "[Script Info]\nTitle: Test");
assert!(had_bom);
// Detect encoding
let text = "[Script Info]\nTitle: Test";
let encoding = detect_encoding(text.as_bytes());
assert_eq!(encoding.encoding, "UTF-8");
assert!(encoding.confidence > 0.8);
// Validate UTF-8
let valid_text = "Hello, 世界! 🎵";
assert!(validate_utf8(valid_text.as_bytes()).is_ok());
Structs§
- EncodingInfo - Detected text encoding information with confidence scoring
Enums§
- BomType - Byte Order Mark (BOM) signatures for common encodings
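To illustrate what BOM signature matching involves, here is a minimal std-only sketch. It does not reproduce this crate's code: the `Bom` enum, its variants, and the `sniff_bom` and `split_bom` functions are hypothetical names chosen for the example, and the variants of the real `BomType` are not listed on this page. Only the byte signatures themselves are standard.

```rust
/// Hypothetical stand-in for a BOM enum (not the crate's `BomType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

impl Bom {
    /// Length of the BOM signature in bytes.
    pub const fn byte_len(self) -> usize {
        match self {
            Bom::Utf8 => 3,
            Bom::Utf16Le | Bom::Utf16Be => 2,
            Bom::Utf32Le | Bom::Utf32Be => 4,
        }
    }
}

/// Return the BOM found at the start of `bytes`, if any.
pub fn sniff_bom(bytes: &[u8]) -> Option<Bom> {
    // The UTF-32 LE signature (FF FE 00 00) begins with the UTF-16 LE
    // signature (FF FE), so the longer pattern is tested first.
    if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some(Bom::Utf8)
    } else if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some(Bom::Utf32Le)
    } else if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some(Bom::Utf32Be)
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some(Bom::Utf16Le)
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some(Bom::Utf16Be)
    } else {
        None
    }
}

/// Split off a leading BOM without copying: the returned slice borrows
/// from `bytes` and starts just past the signature.
pub fn split_bom(bytes: &[u8]) -> (&[u8], Option<Bom>) {
    match sniff_bom(bytes) {
        Some(bom) => (&bytes[bom.byte_len()..], Some(bom)),
        None => (bytes, None),
    }
}
```

One caveat of this ordering: UTF-16 LE input whose first character after the BOM is U+0000 would be classified as UTF-32 LE. That is unlikely for subtitle scripts, but a real detector may want a further check.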
Functions§
- count_replacement_chars - Count replacement characters in text
- detect_bom - Detect BOM type from byte sequence
- detect_encoding - Detect text encoding with confidence scoring
- is_likely_ass_content - Check if text content contains patterns typical of ASS subtitle files
- is_valid_ass_text - Check if text contains only valid ASS characters
- normalize_line_endings - Normalize line endings to Unix style (\n)
- normalize_whitespace - Normalize whitespace characters for consistent processing
- recover_utf8 - Attempt to recover from UTF-8 errors by replacing invalid sequences
- remove_control_chars - Remove or normalize control characters for safe text processing
- strip_bom - Detect and strip BOM from text input
- trim_lines - Trim whitespace from start and end of each line
- truncate_at_char_boundary - Truncate text at UTF-8 character boundary
- validate_utf8 - Validate UTF-8 with detailed error information
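Several of the operations listed above can be approximated with the standard library alone. The sketch below is an illustration under stated assumptions, not this crate's implementation: `Utf8Issue`, `check_utf8`, `recover_lossy`, `truncate_on_boundary`, and `unix_line_endings` are hypothetical names, and the signatures and return types of the real functions are not shown on this page. It uses `std` for brevity; `core::str::from_utf8` and the `alloc` crate provide equivalents in a no_std setting.

```rust
use std::borrow::Cow;

/// Hypothetical error type: where validation stopped and why.
#[derive(Debug, PartialEq, Eq)]
pub struct Utf8Issue {
    /// Byte offset of the first invalid sequence.
    pub valid_up_to: usize,
    /// Length of the invalid sequence, or `None` if the input
    /// ended in the middle of a multi-byte character.
    pub invalid_len: Option<usize>,
}

/// Validate `bytes` as UTF-8, borrowing the input on success.
pub fn check_utf8(bytes: &[u8]) -> Result<&str, Utf8Issue> {
    std::str::from_utf8(bytes).map_err(|e| Utf8Issue {
        valid_up_to: e.valid_up_to(),
        invalid_len: e.error_len(),
    })
}

/// Replace invalid sequences with U+FFFD and count the replacement
/// characters in the result. The count also includes any U+FFFD that
/// was already present in the input. Valid input is borrowed, not copied.
pub fn recover_lossy(bytes: &[u8]) -> (Cow<'_, str>, usize) {
    let text = String::from_utf8_lossy(bytes);
    let replaced = text.chars().filter(|&c| c == '\u{FFFD}').count();
    (text, replaced)
}

/// Truncate to at most `max_bytes` bytes without splitting a character.
pub fn truncate_on_boundary(text: &str, max_bytes: usize) -> &str {
    if max_bytes >= text.len() {
        return text;
    }
    // Walk back to the nearest boundary; offset 0 is always a boundary,
    // so the loop terminates.
    let mut end = max_bytes;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Convert CRLF and lone CR to LF, allocating only when a CR is present.
pub fn unix_line_endings(text: &str) -> Cow<'_, str> {
    if text.contains('\r') {
        Cow::Owned(text.replace("\r\n", "\n").replace('\r', "\n"))
    } else {
        Cow::Borrowed(text)
    }
}
```

Returning `Cow` keeps the common case (already valid UTF-8, already Unix line endings) free of allocation, which is in the spirit of the zero-copy goal stated in the features list.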