Module textparse


Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).

pdfium reports rendered glyph boxes, which diverge from docling’s docling-parse C++ parser at exactly the points that drive conformance: generated spaces get a zero-width box, combining diacritics get a real-width box, and ligature/fraction glyphs land at different x positions. This module instead reconstructs each glyph’s box from the font’s own advance widths and the PDF text/graphics matrices (the same information docling-parse uses), so a space is as wide as the font says and a combining mark has zero advance.
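As a rough illustration of the idea, the sketch below places one glyph's baseline extent from its advance width and the two matrices. It is not this module's code: `Matrix`, `glyph_baseline`, and all parameter names are hypothetical, and the sketch assumes horizontal writing and ignores horizontal scaling, character/word spacing, and text rise.

```rust
/// PDF affine matrix `[a b c d e f]` in the row-vector convention:
/// a point (x, y) maps to (a*x + c*y + e, b*x + d*y + f).
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

impl Matrix {
    pub const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Pure translation by (tx, ty).
    pub fn translate(tx: f64, ty: f64) -> Matrix {
        Matrix { e: tx, f: ty, ..Matrix::IDENTITY }
    }

    /// Concatenation: the result applies `self` first, then `next`.
    pub fn then(&self, next: &Matrix) -> Matrix {
        Matrix {
            a: self.a * next.a + self.b * next.c,
            b: self.a * next.b + self.b * next.d,
            c: self.c * next.a + self.d * next.c,
            d: self.c * next.b + self.d * next.d,
            e: self.e * next.a + self.f * next.c + next.e,
            f: self.e * next.b + self.f * next.d + next.f,
        }
    }

    /// Map a point through this matrix.
    pub fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (self.a * x + self.c * y + self.e, self.b * x + self.d * y + self.f)
    }
}

/// Hypothetical helper: baseline start and end of one glyph in page space.
/// `advance_1000` is the font's advance width in 1/1000 text-space units.
pub fn glyph_baseline(
    advance_1000: f64,
    font_size: f64,
    text_matrix: &Matrix,
    ctm: &Matrix,
) -> ((f64, f64), (f64, f64)) {
    // Width in text space comes from the font, not from rendered outlines.
    let width = advance_1000 / 1000.0 * font_size;
    // Text space -> user space -> page space.
    let to_page = text_matrix.then(ctm);
    (to_page.apply(0.0, 0.0), to_page.apply(width, 0.0))
}
```

Under these assumptions, a space with a 250-unit advance at font size 10 spans 2.5 units along the baseline, while a combining mark with advance 0 starts and ends at the same point.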

The output is the same [Glyph] stream pdfium produces (native PDF coordinates, y-up), fed straight into the existing docling-parse line sanitizer ([crate::dp_lines]). Only the digital text layer is handled here; pages without one still fall back to OCR upstream.

Functions§

debug_glyphs
Debug helper: the raw glyph stream (ch, ll, lr, lb, lt) in native coordinates for page index, before the sanitizer runs. Intended for comparing character cells against docling-parse.
pdf_textlines
Public entry point: per-page (width, height, line cells) for a PDF, produced by the Rust text parser followed by the docling-parse line sanitizer. Used by the pipeline and the textparse_dump example.