Module textparse

Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).

pdfium reports rendered glyph boxes, which diverge from the output of docling’s C++ parser, docling-parse, at exactly the points that drive conformance: generated spaces get a zero-width box, combining diacritics get a real-width box, and ligature/fraction glyphs land at different x positions. This module instead reconstructs each glyph’s box from the font’s own advance widths and the PDF text/graphics matrices (the same information docling-parse uses), so a space is as wide as the font says and a combining mark has zero advance.
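The reconstruction described above can be sketched as follows. This is a minimal illustration, not this module's actual code: `Matrix`, `TextState`, `FontMetrics`, `GlyphBox` and `glyph_box` are hypothetical names, the font is assumed to use the usual 1/1000 glyph-space units, and only horizontal writing mode is covered. The arithmetic follows the PDF text-space rules: the text rendering matrix is the font-size/scale matrix times the text matrix times the CTM, and the horizontal displacement is `(w0 / 1000 * Tfs + Tc + Tw) * Th`.

```rust
/// PDF affine matrix [a b c d e f], applied to row vectors (x, y, 1).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    fn translate(tx: f64, ty: f64) -> Matrix {
        Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: tx, f: ty }
    }
    /// `self × o`: apply `self` first, then `o`.
    fn then(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (x * self.a + y * self.c + self.e, x * self.b + y * self.d + self.f)
    }
}

/// Hypothetical subset of the PDF text state (Tfs, Th, Tc, Tw, Trise).
struct TextState { font_size: f64, h_scale: f64, char_spacing: f64, word_spacing: f64, rise: f64 }

/// Hypothetical font metrics, in 1/1000 glyph-space units.
struct FontMetrics { ascent: f64, descent: f64 }

/// Axis-aligned box in native PDF coordinates (y-up).
#[derive(Clone, Copy, Debug, PartialEq)]
struct GlyphBox { left: f64, right: f64, bottom: f64, top: f64 }

/// Computes one glyph's box from its advance width `w0` (1/1000 units) and
/// advances the text matrix `tm`. Assumption: the box spans exactly the font's
/// advance width, so a zero-advance combining mark gets a zero-width box.
fn glyph_box(
    w0: f64,
    is_space_code: bool, // word spacing applies only to the single-byte code 32
    font: &FontMetrics,
    ts: &TextState,
    tm: &mut Matrix,
    ctm: Matrix,
) -> GlyphBox {
    // Text rendering matrix: [Tfs*Th 0 0 Tfs 0 Trise] × Tm × CTM.
    let scale = Matrix {
        a: ts.font_size * ts.h_scale, b: 0.0,
        c: 0.0, d: ts.font_size,
        e: 0.0, f: ts.rise,
    };
    let trm = scale.then(*tm).then(ctm);

    // Transform all four corners so rotated text still yields a bounding box.
    let (w, asc, desc) = (w0 / 1000.0, font.ascent / 1000.0, font.descent / 1000.0);
    let pts = [trm.apply(0.0, desc), trm.apply(w, desc), trm.apply(0.0, asc), trm.apply(w, asc)];
    let xs = pts.iter().map(|p| p.0);
    let ys = pts.iter().map(|p| p.1);
    let bbox = GlyphBox {
        left: xs.clone().fold(f64::INFINITY, f64::min),
        right: xs.fold(f64::NEG_INFINITY, f64::max),
        bottom: ys.clone().fold(f64::INFINITY, f64::min),
        top: ys.fold(f64::NEG_INFINITY, f64::max),
    };

    // Horizontal displacement: tx = (w0/1000 * Tfs + Tc + Tw) * Th.
    let tw = if is_space_code { ts.word_spacing } else { 0.0 };
    let tx = (w * ts.font_size + ts.char_spacing + tw) * ts.h_scale;
    *tm = Matrix::translate(tx, 0.0).then(*tm);
    bbox
}
```

With a 10-unit font size and a text matrix translated to (100, 700), a space whose advance is 250 gets a box 2.5 units wide, and a following combining mark with advance 0 gets a zero-width box at the new pen position.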

The output is the same [Glyph] stream pdfium produces (native PDF coordinates, y-up), fed straight into the existing docling-parse line sanitizer ([crate::dp_lines]). Only the digital text layer is handled here; pages without one still fall back to OCR upstream.

Structs§

PageParserCells: One page’s text cells from the pure-Rust parser: prose line cells, per-word cells, and code line cells — all from a single glyph parse. Replaces the pdfium text path (roadmap item 6) when the parser drop is enabled.

Functions§

debug_glyphs: Debug helper: the raw glyph stream, as (ch, ll, lr, lb, lt) tuples in native coordinates, for page index, before the sanitizer runs. Useful for comparing character cells against docling-parse.
pdf_all_cells: Full parser text layer: prose + word + code cells per page, glyphs parsed once. prose/words come from the docling-parse contraction ([crate::dp_lines]); code splits only at the parser’s own space glyphs (monospace keeps its source spacing). Used by the pipeline to retire pdfium’s text path.
pdf_textlines: Public entry: per-page (width, height, line cells) for a PDF, via the Rust text parser + the docling-parse line sanitizer. Used by the pipeline and the textparse_dump example.
pdf_words: Debug/diagnostic entry: per-page (width, height, word cells) for a PDF, via the Rust parser glyphs run through the docling-parse word grouping. Used to compare parser word cells against docling-parse’s word_cells oracle (roadmap item 6).
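
One possible reading of "code splits only at the parser’s own space glyphs" (see pdf_all_cells) is sketched below. The types `SketchGlyph` and `Segment` and the function `split_at_space_glyphs` are hypothetical and are not this crate's [Glyph] type or API. The point of the sketch is only that segmentation is driven by explicit space glyphs and never by horizontal gaps, so monospace source spacing is left untouched.

```rust
/// Hypothetical glyph for illustration: a character plus its horizontal extent.
#[derive(Clone, Debug, PartialEq)]
struct SketchGlyph { ch: char, left: f64, right: f64 }

/// Hypothetical output segment: the text between two space glyphs.
#[derive(Clone, Debug, PartialEq)]
struct Segment { text: String, left: f64, right: f64 }

/// Splits one line of glyphs at explicit space glyphs only.
/// A wide gap between two non-space glyphs does NOT start a new segment.
fn split_at_space_glyphs(glyphs: &[SketchGlyph]) -> Vec<Segment> {
    let mut out = Vec::new();
    let mut current: Option<Segment> = None;
    for g in glyphs {
        if g.ch == ' ' {
            // A space glyph closes the current segment, if there is one.
            if let Some(seg) = current.take() {
                out.push(seg);
            }
            continue;
        }
        if let Some(seg) = current.as_mut() {
            // Extend the open segment, whatever the gap to the previous glyph.
            seg.text.push(g.ch);
            seg.right = g.right;
        } else {
            current = Some(Segment { text: g.ch.to_string(), left: g.left, right: g.right });
        }
    }
    if let Some(seg) = current {
        out.push(seg);
    }
    out
}
```

A gap-based word grouper would split `a` and `b` when they sit far apart; this sketch keeps them in one segment unless a space glyph sits between them.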