Module text_normalizer


Text normalization for PDF-extracted text.

Handles ligature decomposition, soft-hyphen removal, zero-width character stripping, and whitespace normalization.
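The steps named above can be illustrated with a small standard-library-only sketch. Everything here is an assumption made for illustration: the function name `normalize_sketch` is hypothetical, the ligature table covers only five common Latin ligatures, and the set of stripped characters is a guess at what "zero-width" includes. NFC normalization is left out because the Rust standard library does not provide it.

```rust
/// Hypothetical sketch, not the module's actual implementation.
/// Expands a few Latin ligatures, drops soft hyphens and zero-width
/// characters, then collapses whitespace runs into single spaces.
fn normalize_sketch(input: &str) -> String {
    let mut expanded = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            // Latin ligatures from the Alphabetic Presentation Forms block
            // (assumed subset; a real table may be larger).
            '\u{FB00}' => expanded.push_str("ff"),
            '\u{FB01}' => expanded.push_str("fi"),
            '\u{FB02}' => expanded.push_str("fl"),
            '\u{FB03}' => expanded.push_str("ffi"),
            '\u{FB04}' => expanded.push_str("ffl"),
            // Soft hyphen, zero-width space, zero-width non-joiner,
            // zero-width joiner, and zero-width no-break space (BOM).
            '\u{00AD}' | '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{FEFF}' => {}
            other => expanded.push(other),
        }
    }
    // split_whitespace skips leading and trailing whitespace, so joining
    // the pieces with one space both collapses runs and trims.
    expanded.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

Calling `normalize_sketch` on text containing the `ﬃ` ligature (U+FB03) followed by a soft hyphen yields plain `ffi` with the hyphen removed. Ligature expansion runs before whitespace collapsing so that the final pass sees the fully expanded text.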

Functions

collapse_whitespace
Collapse runs of whitespace into single spaces and trim.
full_normalize
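Full normalization pipeline: ligature decomposition, removal of ignorable characters (such as soft hyphens and zero-width characters), typography normalization, whitespace collapsing, and NFC.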
normalize_pdf_text
Normalize extracted PDF text: decompose ligatures, strip zero-width characters, remove soft hyphens, collapse whitespace, and apply NFC.
normalize_typography
Normalize smart/curly quotes and dashes to ASCII equivalents.
strip_diacritics
Remove all diacritical marks from text (NFC → NFD → strip combining marks → NFC). Useful for accent-insensitive searching.
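
As a rough illustration of the kind of mapping `normalize_typography` describes, the sketch below replaces curly quotes and dashes with ASCII characters. The function name `typography_sketch` is hypothetical, and the exact character set and replacements (for example, mapping both the en dash and the em dash to a single hyphen) are assumptions, not the module's documented behavior.

```rust
/// Hypothetical sketch, not the module's actual implementation.
/// Maps curly quotes and dashes to ASCII equivalents, one char to one char.
fn typography_sketch(input: &str) -> String {
    input
        .chars()
        .map(|ch| match ch {
            // Left and right single quotation marks.
            '\u{2018}' | '\u{2019}' => '\'',
            // Left and right double quotation marks.
            '\u{201C}' | '\u{201D}' => '"',
            // En dash and em dash (assumed to both become a hyphen-minus).
            '\u{2013}' | '\u{2014}' => '-',
            other => other,
        })
        .collect()
}
```

Because each replacement is a single character, the character count of the output matches the input, which keeps character offsets stable for callers that track positions in the extracted text.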