Crate fleischwolf_pdf

PDF backend for fleischwolf.

A port of docling’s standard PDF pipeline: pdfium extracts the text layer (cells with bounding boxes) and renders page images; a discriminative ONNX stack (layout detection, table structure, OCR) classifies regions; the cells are assembled in reading order into a DoclingDocument.

Current stages: pdfium text-cell extraction and page rendering (pdfium_backend), plus the deterministic text/reading-order assembly (assemble). The layout, table-structure and OCR ONNX stages land behind Pipeline next.

Re-exports§

pub use pdfium_backend::PdfDocument;
pub use pdfium_backend::PdfPage;
pub use pdfium_backend::TextCell;

Modules§

layout: Layout detection via the RT-DETR (docling-layout-heron) model exported to ONNX, run with ort. A port of docling-ibm-models’ LayoutPredictor: resize the page image to 640×640 and rescale pixel values to [0,1] (the heron processor has do_normalize=false), run the model, then apply RT-DETR’s post_process_object_detection (sigmoid → top-k over query×class → center-to-corners boxes scaled to the page).
pdfium_backend: pdfium-based text extraction and page rendering.
resample: Pixel-exact reimplementations of the OpenCV resize kernels docling uses for TableFormer preprocessing, so the model sees byte-identical input. Verified against cv2 on docling’s own bitmaps (INTER_AREA max diff 1/255, INTER_LINEAR < 1e-4 in float).
tableformer: TableFormer: table-structure recovery via docling-ibm-models (the same model docling runs), exported to ONNX by scripts/export_tableformer.py. The image encoder + tag-transformer encoder run once to produce a memory tensor; the decoder is then stepped autoregressively to emit an OTSL structure-token sequence. See PDF_CONFORMANCE.md.
textparse: Pure-Rust PDF text extraction (replacing pdfium’s glyph layer).
timing: Lightweight, env-gated per-stage timing for profiling the PDF pipeline.
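
The post-processing chain named in the layout module (sigmoid → top-k over query×class → center-to-corners boxes scaled to the page) can be sketched in plain Rust as follows. This is a minimal illustration and not the crate's actual code: the `Detection` type, the `post_process` function, its parameters, and the flat row-major tensor layout are all assumptions made for the example.

```rust
/// One detection in page pixel coordinates (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct Detection {
    label: usize,
    score: f32,
    /// Corner format: [x0, y0, x1, y1].
    bbox: [f32; 4],
}

fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical helper, not part of fleischwolf_pdf.
/// `logits` is row-major [num_queries * num_classes];
/// `boxes` is row-major [num_queries * 4], normalized (cx, cy, w, h).
fn post_process(
    logits: &[f32],
    boxes: &[f32],
    num_classes: usize,
    page_w: f32,
    page_h: f32,
    top_k: usize,
    threshold: f32,
) -> Vec<Detection> {
    assert!(num_classes > 0);
    assert_eq!(logits.len() % num_classes, 0);
    let num_queries = logits.len() / num_classes;
    assert_eq!(boxes.len(), num_queries * 4);

    // Sigmoid over every (query, class) pair, keeping the flat index.
    let mut scored: Vec<(usize, f32)> =
        logits.iter().map(|&l| sigmoid(l)).enumerate().collect();

    // Top-k over the flattened query×class grid, highest score first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);

    scored
        .into_iter()
        .filter(|&(_, score)| score > threshold)
        .map(|(flat, score)| {
            // Recover which query and which class the flat index refers to.
            let query = flat / num_classes;
            let label = flat % num_classes;
            let b = &boxes[query * 4..query * 4 + 4];
            let (cx, cy, w, h) = (b[0], b[1], b[2], b[3]);
            // Center format to corner format, then scale to the page size.
            Detection {
                label,
                score,
                bbox: [
                    (cx - w / 2.0) * page_w,
                    (cy - h / 2.0) * page_h,
                    (cx + w / 2.0) * page_w,
                    (cy + h / 2.0) * page_h,
                ],
            }
        })
        .collect()
}
```

Because the top-k runs over the flattened grid, a single query can yield more than one detection if two of its classes both score highly; the sketch keeps that behavior rather than taking one class per query.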

Structs§

Pipeline: A reusable PDF pipeline. The primary worker runs its models on every core, so a single-page, small, image, or METS input is converted at full intra-op speed with no pool to load. A document with enough pages instead fans out across a pool of narrower workers that process pages concurrently. Both the primary worker and the pool load lazily and are cached for reuse, so a one-shot conversion only pays for what it uses.
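
The lazy, cached loading described for Pipeline can be illustrated with `std::sync::OnceLock`. This is a rough sketch of the pattern only: `LazyPipeline`, `Worker`, `Route`, the `min_pages_for_pool` threshold, and the thread split are hypothetical names and choices, not the crate's real types or tuning.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Stand-in for a loaded model stack (hypothetical).
struct Worker {
    intra_op_threads: usize,
}

/// Which workers a given input is routed to (hypothetical).
enum Route<'a> {
    Primary(&'a Worker),
    Pool(&'a [Worker]),
}

struct LazyPipeline {
    cores: usize,
    pool_size: usize,
    /// Assumed threshold for "enough pages"; the real rule is not documented here.
    min_pages_for_pool: usize,
    primary: OnceLock<Worker>,
    pool: OnceLock<Vec<Worker>>,
    /// Counts load events so the caching is observable in this sketch.
    loads: AtomicUsize,
}

impl LazyPipeline {
    fn new(cores: usize, pool_size: usize, min_pages_for_pool: usize) -> Self {
        LazyPipeline {
            cores,
            pool_size,
            min_pages_for_pool,
            primary: OnceLock::new(),
            pool: OnceLock::new(),
            loads: AtomicUsize::new(0),
        }
    }

    /// Loaded on first use with all cores, then reused.
    fn primary(&self) -> &Worker {
        self.primary.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            Worker { intra_op_threads: self.cores }
        })
    }

    /// Loaded on first use as several narrower workers, then reused.
    fn pool(&self) -> &[Worker] {
        self.pool.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            let threads = (self.cores / self.pool_size.max(1)).max(1);
            (0..self.pool_size)
                .map(|_| Worker { intra_op_threads: threads })
                .collect()
        })
    }

    /// Small inputs use the wide primary worker; long documents use the pool.
    fn route(&self, page_count: usize) -> Route<'_> {
        if page_count >= self.min_pages_for_pool {
            Route::Pool(self.pool())
        } else {
            Route::Primary(self.primary())
        }
    }
}
```

With this shape, a caller that only ever converts single pages never triggers the pool initializer, which is the "only pays for what it uses" property the description mentions.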

Enums§

PdfError: Errors from the PDF backend. Detailed and surfaced (never silently skipped).

Functions§

convert: Convenience one-shot conversion (loads the pipeline per call). Errors are detailed and surfaced (never silently skipped).
convert_image: Convenience one-shot image conversion (loads the pipeline per call).
convert_image_with_options: Like convert_image, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).
convert_mets_gbs
convert_mets_gbs_with_options: Like convert_mets_gbs, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).
convert_pages: Convert pre-segmented pages (image + already-known text cells, e.g. METS/hOCR scans) through the shared layout + assembly pipeline.
convert_pages_with_options: Like convert_pages, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).
convert_with_options: Like convert, but optionally skips loading/running TableFormer (see Pipeline::no_table_former) and/or layout+OCR+TableFormer entirely (see Pipeline::no_ocr).