Module utf8


UTF-8 and text encoding utilities for ASS script processing

Provides BOM handling, encoding detection, and UTF-8 validation utilities for ASS subtitle scripts, designed to avoid copying the input wherever possible.

§Features

  • BOM detection and stripping for common encodings
  • UTF-8 validation with detailed error reporting
  • Encoding detection for legacy ASS files
  • no_std-compatible implementation
  • Zero-copy operations where possible

§Examples

use ass_core::utils::utf8::{strip_bom, detect_encoding, validate_utf8};

// Strip BOM if present
let input = "\u{FEFF}[Script Info]\nTitle: Test";
let (stripped, had_bom) = strip_bom(input);
assert_eq!(stripped, "[Script Info]\nTitle: Test");
assert!(had_bom);

// Detect encoding
let text = "[Script Info]\nTitle: Test";
let encoding = detect_encoding(text.as_bytes());
assert_eq!(encoding.encoding, "UTF-8");
assert!(encoding.confidence > 0.8);

// Validate UTF-8
let valid_text = "Hello, 世界! 🎵";
assert!(validate_utf8(valid_text.as_bytes()).is_ok());
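
For readers curious what "detailed error reporting" can look like, here is a minimal sketch built only on the `core` library. The helper name `valid_prefix` is hypothetical and not part of this module; it only illustrates the kind of information (a valid prefix and a byte offset) that a validator can surface, and the actual error type returned by `validate_utf8` may differ.

```rust
/// Hypothetical helper (not part of this module): returns the longest valid
/// UTF-8 prefix of `bytes`, plus the byte offset of the first invalid
/// sequence, if there is one.
pub fn valid_prefix(bytes: &[u8]) -> (&str, Option<usize>) {
    match core::str::from_utf8(bytes) {
        // The whole input is valid, so it can be borrowed as-is.
        Ok(text) => (text, None),
        Err(err) => {
            // `valid_up_to` is the length of the leading valid portion.
            let good = err.valid_up_to();
            // Re-borrowing the prefix cannot fail, because those bytes were
            // already reported as valid; the fallback only avoids a panic path.
            let prefix = core::str::from_utf8(&bytes[..good]).unwrap_or("");
            (prefix, Some(good))
        }
    }
}
```

Because the prefix is a borrow of the input slice, this approach copies nothing, which matches the zero-copy goal stated above.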

Structs§

EncodingInfo
Detected text encoding information with confidence scoring

Enums§

BomType
Byte Order Mark (BOM) signatures for common encodings
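
As an illustration of signature-based BOM detection, the sketch below matches the standard Unicode BOM byte sequences. The enum `Bom`, its variants, and the function `sniff_bom` are hypothetical names chosen for this example; the variants of `BomType` and the signature of `detect_bom` are not shown on this page and may differ.

```rust
/// Hypothetical BOM kinds; the crate's own `BomType` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bom {
    Utf8,
    Utf16Le,
    Utf16Be,
    Utf32Le,
    Utf32Be,
}

/// Returns the BOM kind and its length in bytes if `bytes` starts with one.
pub fn sniff_bom(bytes: &[u8]) -> Option<(Bom, usize)> {
    // The UTF-32 LE signature begins with the UTF-16 LE one, so the
    // four-byte patterns are tested before the two-byte patterns.
    if bytes.starts_with(&[0xFF, 0xFE, 0x00, 0x00]) {
        Some((Bom::Utf32Le, 4))
    } else if bytes.starts_with(&[0x00, 0x00, 0xFE, 0xFF]) {
        Some((Bom::Utf32Be, 4))
    } else if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        Some((Bom::Utf8, 3))
    } else if bytes.starts_with(&[0xFF, 0xFE]) {
        Some((Bom::Utf16Le, 2))
    } else if bytes.starts_with(&[0xFE, 0xFF]) {
        Some((Bom::Utf16Be, 2))
    } else {
        None
    }
}
```

Returning the signature length alongside the kind lets a caller skip the BOM by slicing the input rather than copying it. Note that a UTF-16 LE file whose first character is U+0000 is indistinguishable from UTF-32 LE by signature alone, so this ordering is a heuristic.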

Functions§

count_replacement_chars
Count replacement characters in text
detect_bom
Detect BOM type from byte sequence
detect_encoding
Detect text encoding with confidence scoring
is_likely_ass_content
Check if text content contains patterns typical of ASS subtitle files
is_valid_ass_text
Check if text contains only valid ASS characters
normalize_line_endings
Normalize line endings to Unix style (\n)
normalize_whitespace
Normalize whitespace characters for consistent processing
recover_utf8
Attempt to recover from UTF-8 errors by replacing invalid sequences
remove_control_chars
Remove or normalize control characters for safe text processing
strip_bom
Detect and strip BOM from text input
trim_lines
Trim whitespace from start and end of each line
truncate_at_char_boundary
Truncate text at UTF-8 character boundary
validate_utf8
Validate UTF-8 with detailed error information
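
Two of the operations listed above, truncation at a character boundary and line-ending normalization, can be sketched with the standard library alone. The functions `truncate_sketch` and `normalize_newlines_sketch` are hypothetical stand-ins written for illustration; the real signatures and exact behavior of `truncate_at_char_boundary` and `normalize_line_endings` are not documented on this page and may differ.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for truncation: returns the longest prefix of `text`
/// that is at most `max_bytes` long and ends on a character boundary.
pub fn truncate_sketch(text: &str, max_bytes: usize) -> &str {
    if max_bytes >= text.len() {
        return text;
    }
    let mut end = max_bytes;
    // Step back until `end` no longer falls inside a multi-byte sequence.
    // Offset 0 is always a boundary, so the loop terminates.
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Hypothetical stand-in for line-ending normalization: borrows the input
/// when it has no carriage returns and allocates only when a rewrite is needed.
pub fn normalize_newlines_sketch(text: &str) -> Cow<'_, str> {
    if !text.contains('\r') {
        return Cow::Borrowed(text);
    }
    // Replace CRLF pairs first so a pair does not become two newlines,
    // then convert any remaining lone carriage returns.
    Cow::Owned(text.replace("\r\n", "\n").replace('\r', "\n"))
}
```

Both sketches follow the "zero-copy where possible" idea: truncation always returns a subslice of the input, and normalization returns a borrow unless the text actually contains a carriage return. The `Cow` variant needs an allocator, so a strictly `no_std` build without `alloc` would need a different design.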