Shared utility functions.
Modules
- diff
- Document diff — compares two PdfDocument instances to detect structural changes.
- font_metrics_cache
- Font metrics cache — caches computed string widths and font measurements to avoid redundant calculations during pipeline stages.
- image_dedup
- Image deduplication — identifies duplicate images across pages using content hashing to reduce output size and improve processing.
- language_detector
- Simple trigram-based language detection.
- layout_analysis
- Layout analysis utilities — page geometry classification, margin detection, and content density analysis for multi-column and complex layouts.
- page_range
- Page range parsing and filtering.
- sanitizer
- Content sanitization — PII masking, Unicode normalization.
- statistics
- Font and text statistics.
- text_normalizer
- Text normalization for PDF-extracted text.
- xref_index
- Cross-reference builder — creates an index mapping element IDs to their locations and relationships across the document.
- xycut
- XY-Cut++ reading order algorithm.
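
As an illustration of the kind of work the page_range module describes, the following is a minimal sketch of page range parsing. It is not taken from this crate: the function names `parse_page_ranges` and `parse_page`, the spec syntax ("1-3,5,8-"), the 1-based numbering, and the `String` error type are all assumptions made for the example, and the crate's actual API may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not this crate's API): parses a spec such as "1-3,5,8-"
/// into sorted, de-duplicated 1-based page numbers.
/// `page_count` closes open-ended ranges and rejects out-of-range pages.
pub fn parse_page_ranges(spec: &str, page_count: u32) -> Result<Vec<u32>, String> {
    let mut pages = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err("empty range component".to_string());
        }
        let (start, end) = match part.split_once('-') {
            // "a-b", "a-" (through the last page) or "-b" (from the first page)
            Some((a, b)) => {
                let start = if a.trim().is_empty() { 1 } else { parse_page(a)? };
                let end = if b.trim().is_empty() { page_count } else { parse_page(b)? };
                (start, end)
            }
            // A single page number
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start == 0 || start > end || end > page_count {
            return Err(format!("invalid range '{part}' for a {page_count}-page document"));
        }
        pages.extend(start..=end);
    }
    Ok(pages.into_iter().collect())
}

/// Parses one page number, reporting the offending text on failure.
fn parse_page(s: &str) -> Result<u32, String> {
    s.trim()
        .parse::<u32>()
        .map_err(|e| format!("bad page number '{}': {e}", s.trim()))
}
```

Under these assumptions, calling `parse_page_ranges("1-3,5", 10)` yields pages 1, 2, 3, and 5, while a reversed range such as "5-2" or a page beyond `page_count` returns an error. Collecting into a `BTreeSet` keeps the result sorted and free of duplicates when ranges overlap.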