pub enum DocumentScript {
Latin,
CJK,
RTL,
Complex,
Mixed,
}
Document script profile for optimization.
OPTIMIZATION (Issue #1 fix): Detect the document's primary script once, then skip the script detection functions that the detected profile cannot need, which speeds up boundary detection.
When a document contains only Latin text, RTL and CJK detection are skipped entirely. When a document is CJK-dominant, RTL detection is skipped. This reduces function call overhead from millions of calls per batch to thousands.
Variants
Latin
Latin-only document (ASCII + extended Latin). Fast path: only check space, TJ offset, and geometric gap.
CJK
CJK-dominant document (Chinese, Japanese, Korean). Skip RTL detection; use the optimized CJK path.
RTL
Right-to-left dominant document (Arabic, Hebrew). Skip CJK detection; use the optimized RTL path.
Complex
Complex scripts (Devanagari, Thai, Khmer, etc.). Use specialized complex script detection.
Mixed
Mixed or unknown scripts. Check all detection functions (slowest path).
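The variant descriptions above can be read as a table of which script checks each profile still runs. The following is a minimal sketch of that reading, not the crate's actual code: `ScriptChecks` and `checks_for` are hypothetical names, the enum is redeclared locally so the block compiles on its own, and the choice to disable complex-script detection on the CJK and RTL paths is an assumption the documentation does not state.

```rust
/// Local copy of the documented enum so this sketch is self-contained.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DocumentScript {
    Latin,
    CJK,
    RTL,
    Complex,
    Mixed,
}

/// Hypothetical: the script-specific checks a boundary detector would run.
#[derive(Debug, PartialEq, Eq)]
pub struct ScriptChecks {
    pub cjk: bool,
    pub rtl: bool,
    pub complex: bool,
}

/// Hypothetical helper mapping a profile to the checks it still needs.
pub fn checks_for(profile: DocumentScript) -> ScriptChecks {
    match profile {
        // Fast path: no script-specific detection at all.
        DocumentScript::Latin => ScriptChecks { cjk: false, rtl: false, complex: false },
        // Documented: RTL detection is skipped. Skipping complex is assumed.
        DocumentScript::CJK => ScriptChecks { cjk: true, rtl: false, complex: false },
        // Documented: CJK detection is skipped. Skipping complex is assumed.
        DocumentScript::RTL => ScriptChecks { cjk: false, rtl: true, complex: false },
        // Only the specialized complex-script detection runs (assumed exclusive).
        DocumentScript::Complex => ScriptChecks { cjk: false, rtl: false, complex: true },
        // Slowest path: every detection function is checked.
        DocumentScript::Mixed => ScriptChecks { cjk: true, rtl: true, complex: true },
    }
}
```

Computing the check set once per document, rather than per character pair, is what would move the cost from the inner loop to a single lookup.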
Implementations
impl DocumentScript
pub fn detect_from_characters(characters: &[CharacterInfo]) -> Self
Detects the document script profile by sampling the first 1000 characters.
This optimization reduces boundary detection overhead by skipping unnecessary script detection for documents with known script profiles.
PERFORMANCE: O(min(n, 1000)) sampling, where n is the number of input characters, executed once per extraction.
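To make the bounded sampling concrete, here is a rough sketch of how such a detector could work. It is not the crate's implementation. `CharacterInfo` is a hypothetical stand-in that models only a code point, `detect_profile_sketch` and `SAMPLE_LIMIT` are illustrative names, and the Unicode block ranges cover only a few representative scripts. The classification rule is also a simplification: any presence of a script counts, and empty input maps to `Mixed` as "unknown", whereas the documented behavior refers to dominance, which a real implementation would more likely decide with thresholds.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DocumentScript {
    Latin,
    CJK,
    RTL,
    Complex,
    Mixed,
}

/// Hypothetical stand-in for the crate's `CharacterInfo`.
pub struct CharacterInfo {
    pub ch: char,
}

/// Sample size taken from the documentation above.
const SAMPLE_LIMIT: usize = 1000;

fn is_cjk(c: char) -> bool {
    // Hiragana + Katakana, CJK Unified Ideographs, Hangul Syllables
    matches!(c as u32, 0x3040..=0x30FF | 0x4E00..=0x9FFF | 0xAC00..=0xD7AF)
}

fn is_rtl(c: char) -> bool {
    // Hebrew, Arabic
    matches!(c as u32, 0x0590..=0x05FF | 0x0600..=0x06FF)
}

fn is_complex(c: char) -> bool {
    // Devanagari, Thai, Khmer
    matches!(c as u32, 0x0900..=0x097F | 0x0E00..=0x0E7F | 0x1780..=0x17FF)
}

/// Classifies at most `SAMPLE_LIMIT` characters, so the cost is O(min(n, 1000)).
pub fn detect_profile_sketch(characters: &[CharacterInfo]) -> DocumentScript {
    if characters.is_empty() {
        // Assumption: nothing to sample means "unknown", so use the safe path.
        return DocumentScript::Mixed;
    }
    let (mut cjk, mut rtl, mut complex) = (false, false, false);
    for info in characters.iter().take(SAMPLE_LIMIT) {
        if is_cjk(info.ch) {
            cjk = true;
        } else if is_rtl(info.ch) {
            rtl = true;
        } else if is_complex(info.ch) {
            complex = true;
        }
    }
    match (cjk, rtl, complex) {
        (false, false, false) => DocumentScript::Latin,
        (true, false, false) => DocumentScript::CJK,
        (false, true, false) => DocumentScript::RTL,
        (false, false, true) => DocumentScript::Complex,
        // Two or more non-Latin script families seen in the sample.
        _ => DocumentScript::Mixed,
    }
}
```

One consequence of sampling only a prefix is that a document whose first 1000 characters are Latin is classified as `Latin` even if other scripts appear later, which is the trade-off accepted in exchange for constant-bounded detection cost.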
Trait Implementations
impl Clone for DocumentScript
fn clone(&self) -> DocumentScript
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.