Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).
pdfium reports rendered glyph boxes, which diverge from the output of docling’s
docling-parse C++ parser at exactly the points that drive conformance:
generated spaces get a zero-width box, combining diacritics get a real-width
box, and ligature/fraction glyphs land at different x positions. This module instead
reconstructs each glyph’s box from the font’s own advance widths and the
PDF text/graphics matrices — the same information docling-parse uses — so a
space is as wide as the font says and a combining mark has zero advance.
The output is the same [Glyph] stream pdfium produces (native PDF
coordinates, y-up), fed straight into the existing docling-parse line
sanitizer ([crate::dp_lines]). Only the digital text layer is handled here;
pages without one still fall back to OCR upstream.
Functions
- debug_glyphs - Debug: raw glyph stream (ch, ll, lr, lb, lt) (native coords) for page index, before the sanitizer. For comparing char cells to docling-parse.
- pdf_textlines - Public entry: per-page (width, height, line cells) for a PDF, via the Rust text parser + the docling-parse line sanitizer. Used by the pipeline and the textparse_dump example.