Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).
pdfium reports rendered glyph boxes, which diverge from docling’s
docling-parse C++ parser at exactly the points that drive conformance:
generated spaces get a zero-width box, combining diacritics get a real-width
box, and ligature/fraction glyphs land at different x. This module instead
reconstructs each glyph’s box from the font’s own advance widths and the
PDF text/graphics matrices — the same information docling-parse uses — so a
space is as wide as the font says and a combining mark has zero advance.
The output is the same [Glyph] stream pdfium produces (native PDF
coordinates, y-up), fed straight into the existing docling-parse line
sanitizer ([crate::dp_lines]). Only the digital text layer is handled here;
pages without one still fall back to OCR upstream.
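
The box reconstruction described above can be sketched as follows. This is a minimal illustration, not this module's actual code: `Matrix`, `glyph_box`, and every parameter name are hypothetical, it assumes a horizontal writing mode, and it ignores character spacing, word spacing, horizontal scaling, and text rise. The advance, ascent, and descent are taken to be in the usual 1/1000 glyph-space units.

```rust
/// Hypothetical 2-D affine matrix [a b c d e f] in PDF's row-vector convention.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Compose: apply `self` first, then `o` (the product self × o).
    fn then(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }

    /// Transform a point: [x' y' 1] = [x y 1] × M.
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (x * self.a + y * self.c + self.e, x * self.b + y * self.d + self.f)
    }
}

/// Axis-aligned glyph box (x0, y0, x1, y1) in page coordinates (y-up),
/// derived from font metrics rather than from rendered outlines.
/// A zero advance (e.g. a combining mark) yields a zero-width box.
fn glyph_box(
    advance: f64,   // font advance width, 1/1000 units (assumed)
    descent: f64,   // font descent, 1/1000 units, usually negative (assumed)
    ascent: f64,    // font ascent, 1/1000 units (assumed)
    font_size: f64, // Tf size operand
    tm: Matrix,     // current text matrix
    ctm: Matrix,    // current transformation matrix
) -> (f64, f64, f64, f64) {
    // Scale glyph space by the font size, then apply the text matrix and CTM.
    let scale = Matrix { a: font_size, b: 0.0, c: 0.0, d: font_size, e: 0.0, f: 0.0 };
    let trm = scale.then(tm).then(ctm);
    let (w, lo, hi) = (advance / 1000.0, descent / 1000.0, ascent / 1000.0);
    // Transform all four corners so rotated or skewed text still gets a valid box.
    let pts = [(0.0, lo), (w, lo), (0.0, hi), (w, hi)].map(|(x, y)| trm.apply(x, y));
    let x0 = pts.iter().map(|p| p.0).fold(f64::INFINITY, f64::min);
    let x1 = pts.iter().map(|p| p.0).fold(f64::NEG_INFINITY, f64::max);
    let y0 = pts.iter().map(|p| p.1).fold(f64::INFINITY, f64::min);
    let y1 = pts.iter().map(|p| p.1).fold(f64::NEG_INFINITY, f64::max);
    (x0, y0, x1, y1)
}
```

With a 10 pt font whose text matrix places the origin at (100, 700), a space with advance 250 spans x from 100 to 102.5 in this sketch, while a glyph with advance 0 collapses to a zero-width box at x = 100.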
Structs
- `PageParserCells`: One page's text cells from the pure-Rust parser: prose line cells, per-word cells, and code line cells, all from a single glyph parse. Replaces the pdfium text path (roadmap item 6) when the parser drop is enabled.
Functions
- `debug_glyphs`: Debug: raw glyph stream `(ch, ll, lr, lb, lt)` (native coords) for page `index`, before the sanitizer. For comparing char cells to docling-parse.
- `pdf_all_cells`: Full parser text layer: prose + word + code cells per page, glyphs parsed once. `prose`/`words` come from the docling-parse contraction ([crate::dp_lines]); `code` splits only at the parser's own space glyphs (monospace keeps its source spacing). Used by the pipeline to retire pdfium's text path.
- `pdf_textlines`: Public entry: per-page (width, height, line cells) for a PDF, via the Rust text parser + the docling-parse line sanitizer. Used by the pipeline and the `textparse_dump` example.
- `pdf_words`: Debug/diagnostic entry: per-page (width, height, word cells) for a PDF, via the Rust parser glyphs run through the docling-parse word grouping. Used to compare parser word cells against docling-parse's `word_cells` oracle (roadmap item 6).
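
The rule that code cells split only at the parser's own space glyphs can be illustrated with a small sketch. Everything here is an assumption made for illustration: `SketchGlyph`, `SketchCell`, and `split_at_space_glyphs` are hypothetical names, and the crate's real [Glyph] type and cell layout may differ. The point is only that a geometric gap without a space glyph does not start a new cell.

```rust
/// Hypothetical stand-in for one glyph on a single line (not the crate's Glyph).
#[derive(Clone, Copy, Debug)]
struct SketchGlyph { ch: char, x0: f64, x1: f64 }

/// Hypothetical stand-in for a code cell: text plus horizontal extent.
#[derive(Debug, PartialEq)]
struct SketchCell { text: String, x0: f64, x1: f64 }

/// Split one line of glyphs into cells, breaking only where a space glyph
/// appears in the stream. Gaps in x alone never cause a split.
fn split_at_space_glyphs(glyphs: &[SketchGlyph]) -> Vec<SketchCell> {
    let mut cells = Vec::new();
    let mut current: Option<SketchCell> = None;
    for g in glyphs {
        if g.ch == ' ' {
            // A space glyph closes the open cell (if any) and is itself dropped.
            if let Some(cell) = current.take() {
                cells.push(cell);
            }
            continue;
        }
        if let Some(cell) = current.as_mut() {
            // Extend the open cell with this glyph.
            cell.text.push(g.ch);
            cell.x1 = g.x1;
        } else {
            // Start a new cell at this glyph.
            current = Some(SketchCell { text: g.ch.to_string(), x0: g.x0, x1: g.x1 });
        }
    }
    if let Some(cell) = current {
        cells.push(cell);
    }
    cells
}
```

A gap-based word splitter would instead compare each glyph's `x0` with the previous glyph's `x1`; the sketch deliberately omits that check to mirror the "source spacing" behavior described for code cells.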