Module utils

Shared utility functions.

Modules

diff
Document diff — compares two PdfDocument instances to detect structural changes.
font_metrics_cache
Font metrics cache — caches computed string widths and font measurements to avoid redundant calculations during pipeline stages.
image_dedup
Image deduplication — identifies duplicate images across pages using content hashing to reduce output size and improve processing.
language_detector
Simple trigram-based language detection.
layout_analysis
Layout analysis utilities — page geometry classification, margin detection, and content density analysis for multi-column and complex layouts.
page_range
Page range parsing and filtering.
sanitizer
Content sanitization — PII masking, Unicode normalization.
statistics
Font and text statistics.
text_normalizer
Text normalization for PDF-extracted text.
xref_index
Cross-reference builder — creates an index mapping element IDs to their locations and relationships across the document.
xycut
XY-Cut++ reading order algorithm.
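
As an illustration of what page range parsing and filtering (the page_range entry above) typically involves, here is a minimal sketch. It is not taken from this crate: the function names `parse_page_ranges` and `parse_page`, the spec syntax ("1-3,5,8-"), the 1-based numbering, and the error type are all assumptions made for the example, and the real module may differ.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not this crate's API): parse a spec such as
/// "1-3,5,8-" into sorted, de-duplicated 1-based page numbers that
/// fall within `1..=page_count`.
pub fn parse_page_ranges(spec: &str, page_count: u32) -> Result<Vec<u32>, String> {
    let mut pages = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err("empty range segment".to_string());
        }
        // "a-b" is an inclusive range; a bare number is a single page.
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => {
                let start = parse_page(a)?;
                // An open end such as "8-" runs to the last page.
                let end = if b.trim().is_empty() {
                    page_count
                } else {
                    parse_page(b)?
                };
                (start, end)
            }
            None => {
                let page = parse_page(part)?;
                (page, page)
            }
        };
        if start > page_count || end > page_count {
            return Err(format!("range '{}' exceeds page count {}", part, page_count));
        }
        if start > end {
            return Err(format!("range '{}' is reversed", part));
        }
        pages.extend(start..=end);
    }
    // BTreeSet iterates in ascending order, so the result is sorted.
    Ok(pages.into_iter().collect())
}

/// Parse one 1-based page number, rejecting 0 and non-numeric text.
fn parse_page(text: &str) -> Result<u32, String> {
    match text.trim().parse::<u32>() {
        Ok(0) => Err("page numbers start at 1".to_string()),
        Ok(n) => Ok(n),
        Err(_) => Err(format!("'{}' is not a page number", text.trim())),
    }
}
```

Under these assumptions, `parse_page_ranges("1-3,5", 10)` yields pages 1, 2, 3, and 5, and a caller could filter a document by keeping only the pages whose number appears in the returned list. Collecting into a `BTreeSet` is one simple way to make overlapping segments such as "3,1-3" harmless.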