Text normalization for PDF-extracted text.
Handles ligature decomposition, soft-hyphen removal, zero-width character stripping, and whitespace normalization.
Functions
- collapse_whitespace - Collapse runs of whitespace into single spaces and trim.
- full_normalize - Full normalization pipeline: ligatures + ignorables + typography + whitespace + NFC.
- normalize_pdf_text - Normalize extracted PDF text: decompose ligatures, strip zero-width characters, remove soft hyphens, collapse whitespace, and apply NFC.
- normalize_typography - Normalize smart/curly quotes and dashes to ASCII equivalents.
- strip_diacritics - Remove all diacritical marks from text (NFC → NFD → strip combining marks → NFC). Useful for accent-insensitive searching.
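
The signatures of these functions are not shown on this page, so the following is only a rough, std-only sketch of what the ligature, ignorable, typography, and whitespace steps could look like. Every name ending in `_sketch` is hypothetical and is not this module's API. The character sets covered are assumptions. The NFC step is omitted because the standard library has no Unicode normalization; it would typically come from an external crate.

```rust
// Hypothetical illustration only: these helpers are NOT the module's functions.

/// Collapse runs of whitespace into single spaces and trim both ends.
fn collapse_whitespace_sketch(input: &str) -> String {
    // `split_whitespace` skips leading, trailing, and repeated whitespace.
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}

/// Decompose a few Latin ligatures and drop soft hyphens and zero-width characters.
/// The exact set handled by the real module is an assumption here.
fn strip_ignorables_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '\u{FB00}' => out.push_str("ff"),
            '\u{FB01}' => out.push_str("fi"),
            '\u{FB02}' => out.push_str("fl"),
            '\u{FB03}' => out.push_str("ffi"),
            '\u{FB04}' => out.push_str("ffl"),
            '\u{FB06}' => out.push_str("st"),
            // Soft hyphen, zero-width space, ZWNJ, ZWJ, word joiner, BOM.
            '\u{00AD}' | '\u{200B}' | '\u{200C}' | '\u{200D}' | '\u{2060}' | '\u{FEFF}' => {}
            _ => out.push(c),
        }
    }
    out
}

/// Map curly quotes and en/em dashes to ASCII equivalents.
fn normalize_typography_sketch(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            '\u{2018}' | '\u{2019}' => '\'',
            '\u{201C}' | '\u{201D}' => '"',
            '\u{2013}' | '\u{2014}' => '-',
            other => other,
        })
        .collect()
}

/// Chain the steps in the order the pipeline description lists them (minus NFC).
fn normalize_sketch(input: &str) -> String {
    let stripped = strip_ignorables_sketch(input);
    let ascii = normalize_typography_sketch(&stripped);
    collapse_whitespace_sketch(&ascii)
}
```

In this sketch, whitespace collapsing runs last so that any gaps left behind by removed characters are cleaned up in the same pass.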