PDF backend for fleischwolf.
A port of docling’s standard PDF pipeline: pdfium extracts the text layer
(cells with bounding boxes) and renders page images; a discriminative ONNX
stack (layout detection, table structure, OCR) classifies regions; the cells
are assembled in reading order into a DoclingDocument.
Current stages: pdfium text-cell extraction and page rendering (pdfium_backend)
and the deterministic text/reading-order assembly (assemble). The layout,
table-structure and OCR ONNX stages land next, behind Pipeline.
Re-exports§
pub use pdfium_backend::PdfDocument;
pub use pdfium_backend::PdfPage;
pub use pdfium_backend::TextCell;
Modules§
- layout
- Layout detection via the RT-DETR (docling-layout-heron) model exported to ONNX, run with ort. A port of docling-ibm-models’ LayoutPredictor: resize the page image to 640×640 and rescale to [0,1] (the heron processor has do_normalize=false), run the model, then RT-DETR post_process_object_detection (sigmoid → top-k over query×class → center-to-corners boxes scaled to the page).
- pdfium_backend
- pdfium-based text extraction and page rendering.
- resample
- Pixel-exact reimplementations of the OpenCV resize kernels docling uses for TableFormer preprocessing, so the model sees byte-identical input. Verified against cv2 on docling’s own bitmaps (INTER_AREA max diff 1/255, INTER_LINEAR < 1e-4 in float).
- tableformer
- TableFormer: table-structure recovery via docling-ibm-models, exported to ONNX by scripts/export_tableformer.py. The image encoder + tag-transformer encoder run once to a memory tensor; the decoder is then stepped autoregressively to emit an OTSL structure-token sequence (the same model docling runs). See PDF_CONFORMANCE.md.
- textparse
- Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).
- timing
- Lightweight, env-gated per-stage timing for profiling the PDF pipeline.
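The RT-DETR post-processing named in the layout module description above (sigmoid → top-k over query×class → center-to-corners boxes scaled to the page) can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the Detection type, the post_process function, its parameters, the flat row-major tensor layout and the score threshold are all assumptions made for the example.

```rust
/// One detection in page pixel coordinates (hypothetical type, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Detection {
    pub class_id: usize,
    pub score: f32,
    /// [x0, y0, x1, y1] in page pixels.
    pub bbox: [f32; 4],
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Sketch of RT-DETR style post-processing.
/// Assumed layout: `logits` is row-major [num_queries * num_classes];
/// `boxes` is row-major [num_queries * 4] holding normalized (cx, cy, w, h).
pub fn post_process(
    logits: &[f32],
    boxes: &[f32],
    num_classes: usize,
    page_w: f32,
    page_h: f32,
    top_k: usize,
    threshold: f32,
) -> Vec<Detection> {
    assert!(num_classes > 0);
    assert_eq!(logits.len() % num_classes, 0);
    let num_queries = logits.len() / num_classes;
    assert_eq!(boxes.len(), num_queries * 4);

    // Sigmoid every (query, class) logit, remembering its flat index.
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| sigmoid(l))
        .enumerate()
        .collect();
    // Rank the flattened query×class scores, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));

    scored
        .into_iter()
        .take(top_k)
        .filter(|&(_, s)| s > threshold)
        .map(|(flat, score)| {
            // Recover which query and which class the flat index refers to.
            let query = flat / num_classes;
            let class_id = flat % num_classes;
            let b = &boxes[query * 4..query * 4 + 4];
            let (cx, cy, w, h) = (b[0], b[1], b[2], b[3]);
            // Center format to corner format, then scale to page size.
            let bbox = [
                (cx - w / 2.0) * page_w,
                (cy - h / 2.0) * page_h,
                (cx + w / 2.0) * page_w,
                (cy + h / 2.0) * page_h,
            ];
            Detection { class_id, score, bbox }
        })
        .collect()
}
```

Because the top-k runs over the flattened query×class scores, one query can yield several detections with different classes; the sketch applies no overlap resolution afterwards.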
Structs§
- Pipeline
- A reusable PDF pipeline. The primary worker runs its models on every core, so a single-page / small / image / METS input is converted at full intra-op speed with no pool to load. A document with enough pages instead fans out across a pool of narrower workers processed concurrently. Both load lazily and are cached for reuse, so a one-shot conversion only pays for what it uses.
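The lazy, cached split between a wide primary worker and a pool of narrower workers can be illustrated with std::sync::OnceLock. This is only a sketch of the idea: LazyPipeline, Worker, the page threshold, the pool size and the even split of threads across the pool are hypothetical and are not taken from the crate.

```rust
use std::sync::OnceLock;

/// Stand-in for a loaded model stack (hypothetical).
pub struct Worker {
    /// Intra-op threads this worker's models would run on.
    pub threads: usize,
}

impl Worker {
    fn load(threads: usize) -> Worker {
        // A real worker would load ONNX sessions here; the sketch only records the width.
        Worker { threads }
    }
}

/// Hypothetical pipeline that loads each kind of worker on first use only.
pub struct LazyPipeline {
    cores: usize,
    pool_size: usize,
    min_pages_for_pool: usize, // assumed threshold, not the crate's value
    primary: OnceLock<Worker>,
    pool: OnceLock<Vec<Worker>>,
}

impl LazyPipeline {
    pub fn new(cores: usize, pool_size: usize, min_pages_for_pool: usize) -> Self {
        LazyPipeline {
            cores: cores.max(1),
            pool_size: pool_size.max(1),
            min_pages_for_pool,
            primary: OnceLock::new(),
            pool: OnceLock::new(),
        }
    }

    /// Wide worker using every core; loaded once, then reused.
    pub fn primary(&self) -> &Worker {
        self.primary.get_or_init(|| Worker::load(self.cores))
    }

    /// Pool of narrower workers; loaded once, then reused.
    pub fn pool(&self) -> &[Worker] {
        self.pool.get_or_init(|| {
            let per_worker = (self.cores / self.pool_size).max(1);
            (0..self.pool_size).map(|_| Worker::load(per_worker)).collect()
        })
    }

    /// Pick workers for a document: the pool for long inputs, the primary otherwise.
    pub fn workers_for(&self, pages: usize) -> Vec<&Worker> {
        if pages >= self.min_pages_for_pool {
            self.pool().iter().collect()
        } else {
            vec![self.primary()]
        }
    }

    pub fn pool_loaded(&self) -> bool {
        self.pool.get().is_some()
    }
}
```

With this shape, a one-shot conversion of a single page touches only primary(), so the pool is never constructed; a later long document initializes the pool once and every subsequent call reuses it.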
Enums§
- PdfError
- Errors from the PDF backend. Detailed and surfaced (never silently skipped).
Functions§
- convert
- Convenience one-shot conversion (loads the pipeline per call). Errors are detailed and surfaced (never silently skipped).
- convert_image
- Convenience one-shot image conversion (loads the pipeline per call).
- convert_image_with_options
- Like convert_image, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).
- convert_mets_gbs
- convert_mets_gbs_with_options
- Like convert_mets_gbs, but optionally skips loading/running TableFormer (see crate::Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see crate::Pipeline::no_ocr).
- convert_pages
- Convert pre-segmented pages (image + already-known text cells, e.g. METS/hOCR scans) through the shared layout + assembly pipeline.
- convert_pages_with_options
- Like convert_pages, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).
- convert_with_options
- Like convert, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).