Fleischwolf 🦀
A Rust port of docling: convert
documents into a unified DoclingDocument for downstream AI workflows.
This is an early, in-progress port. See MIGRATION.md for
the full architecture, the Python → Rust mapping, and the phased plan.
Status
The public API works end to end across Markdown, CSV, HTML, AsciiDoc, DOCX,
PPTX, XLSX, EPUB, ODF, WebVTT, Email, JATS, USPTO, XBRL, LaTeX, JSON, PDF,
images and METS — plus Markdown / docling-JSON output and image extraction.
The discriminative PDF/image pipeline (pdfium + ONNX layout/OCR) lives in
fleischwolf-pdf. Audio/ASR is the main format still on the roadmap (see
MIGRATION.md).
Output is checked against upstream Python docling — declarative formats
byte-for-byte against live docling, the ML pipeline against a deterministic
snapshot baseline. See COMPARING.md and
scripts/conformance.sh.
The API
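use fleischwolf::DocumentConverter;
let converter = DocumentConverter::new();
let result = converter
    .convert(…)
    .unwrap();
println!("{}", result.document.export_to_markdown()); // Markdown
println!("{}", result.document.export_to_json()); // docling DoclingDocument JSON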
JSON output
export_to_json() emits docling-core's native DoclingDocument wire format
(schema 1.10.0) — the same shape Python docling's export_to_dict() /
save_as_json() produce: a body tree of $refs into texts / groups /
tables / pictures, with labels (title, section_header, list_item,
code, formula, …), list grouping, and table grids. The output loads straight
back into Python docling-core (DoclingDocument.load_from_json(...)) and
round-trips to the same Markdown.
Note: Fleischwolf's model bakes inline formatting (bold, links, inline math) into the text, so for those spans the JSON carries the rendered text rather than docling's structured formatting / hyperlink fields. Block structure, headings, lists, tables, code and display equations match.
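To make the `$ref` layout concrete, here is a minimal sketch of how a consumer of the JSON might split a body reference into a collection name and an index. It assumes refs are JSON-pointer-style strings of the form `#/<collection>/<index>` (for example `#/texts/0`), as in docling-core's serialized documents. `Collection` and `parse_ref` are hypothetical names used only for illustration; they are not part of Fleischwolf's API.

```rust
/// The four item collections named above (hypothetical enum, for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Collection {
    Texts,
    Groups,
    Tables,
    Pictures,
}

/// Parse a ref such as "#/texts/3" into (collection, index).
/// Returns None for anything that is not `#/<collection>/<index>`.
fn parse_ref(r: &str) -> Option<(Collection, usize)> {
    // Drop the leading "#/" that marks a document-local pointer.
    let rest = r.strip_prefix("#/")?;
    // Split "texts/3" into the collection name and the index.
    let (name, index) = rest.split_once('/')?;
    let collection = match name {
        "texts" => Collection::Texts,
        "groups" => Collection::Groups,
        "tables" => Collection::Tables,
        "pictures" => Collection::Pictures,
        _ => return None,
    };
    let index = index.parse::<usize>().ok()?;
    Some((collection, index))
}
```

A caller would walk the children of `body`, call `parse_ref` on each `$ref`, and index into the matching array; refs that point elsewhere (such as `#/body` itself) come back as `None` in this sketch.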
Image extraction
Backends that have the image populate Node::Picture { image }: the PDF/image
pipeline crops figure regions, and the DOCX / PPTX backends pull embedded image
blobs. Pick how pictures render with an [ImageMode] — the analogue of
docling's image_mode:
use ImageMode;
// self-contained Markdown: 
let = result.document.export_to_markdown_with_images;
// referenced:  + the bytes to write
let = result.document.export_to_markdown_with_images;
for in files
export_to_json() always embeds extracted images as docling ImageRefs
(data: URIs + size). The default export_to_markdown() stays
<!-- image -->, like docling.
The cropped/extracted pixels are real, but the base64 won't be byte-identical to docling's (different PNG encoder). HTML/EPUB pictures stay placeholders: like docling, external <img src> files aren't fetched.
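As an illustration of what an embedded image looks like on the wire, the sketch below builds a `data:` URI from PNG bytes using only the standard library. It is not Fleischwolf's implementation; `base64_encode` and `png_data_uri` are hypothetical helpers. It also shows why the encoder matters: base64 is a pure function of the input bytes, so any difference in the PNG encoder's output changes the resulting string.

```rust
/// Standard base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encode bytes as standard base64 with `=` padding.
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        // Emit four 6-bit symbols, padding where input bytes were missing.
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}

/// Wrap PNG bytes in a data URI (hypothetical helper, for illustration).
fn png_data_uri(png: &[u8]) -> String {
    format!("data:image/png;base64,{}", base64_encode(png))
}
```

In real code a maintained crate would normally replace the hand-rolled encoder; it is written out here only to keep the example dependency-free.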
strict Markdown (Rust-only)
By default export_to_markdown() reproduces docling's output byte-for-byte,
quirks included (***x*** ., dropped code-fence languages, \_ escaping). Set
strict(true) for cleaner, more conformant Markdown:
let converter = DocumentConverter::new().strict(true);
let result = converter.convert(…).unwrap();
println!("{}", result.document.export_to_markdown()); // ```rust kept, no `***x*** .`
legacy: Foo ***both*** . | ``` (language dropped)
strict: Foo ***both***. | ```rust (language kept)
result.document.export_to_markdown_with(strict) overrides the mode per call.
Python docling has no such switch.
Testing
All commands run from the fleischwolf/ workspace root.
# everything — unit tests + the output-regression suite (pure Rust; no Python/models)
# just the regression suite: re-convert every source under
# crates/fleischwolf/tests/data/<fmt>/sources/ and assert that legacy Markdown,
# strict Markdown and docling JSON match the committed fixtures (catches drift)
# refresh the fixtures after an *intentional* output change, then review `git diff`
FLEISCHWOLF_REGEN=1
# a single crate / a single test (with output)
The ML formats (PDF, images, METS) need pdfium + the ONNX models, so they are
covered by a separate deterministic snapshot harness rather than cargo test:
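The fixture workflow described earlier in this section (compare fresh output against committed files, and rewrite them when `FLEISCHWOLF_REGEN` is set) can be sketched with the standard library alone. This is an illustration of the pattern, not the project's actual test code: `Check`, `check_fixture` and `regen_requested` are hypothetical names, and treating any non-empty value of the variable as "regenerate" is an assumption.

```rust
use std::{env, fs, io, path::Path};

/// Outcome of checking one fixture (hypothetical type, for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Check {
    Match,
    Regenerated,
    Drift,
}

/// Compare `actual` with the committed fixture at `path`.
/// When `regen` is true, overwrite the fixture instead of comparing.
fn check_fixture(path: &Path, actual: &str, regen: bool) -> io::Result<Check> {
    if regen {
        fs::write(path, actual)?;
        return Ok(Check::Regenerated);
    }
    let expected = fs::read_to_string(path)?;
    Ok(if expected == actual {
        Check::Match
    } else {
        Check::Drift
    })
}

/// Assumption: any non-empty value of FLEISCHWOLF_REGEN enables regeneration.
fn regen_requested() -> bool {
    env::var_os("FLEISCHWOLF_REGEN").is_some_and(|v| !v.is_empty())
}
```

A test would call `check_fixture(path, &output, regen_requested())` for each source and fail on `Check::Drift`; after a regeneration run, `git diff` shows exactly what changed.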
Try it
# convert a file from the CLI — Markdown to stdout (add --strict for cleaner MD)
# emit docling's native DoclingDocument JSON instead (--to md is the default)
# extract pictures (PDF/image inputs): embed as data URIs, or write ./artifacts/*.png
# or via the example
# score HTML output vs docling's groundtruth (no Python), or vs live docling
# diff Python docling vs Rust on one file (loads docling from local sources)
# benchmark time / CPU / memory: Python docling vs Rust
The comparison scripts load Python docling from this repo's own sources (an
editable install in .venv-compare, created automatically) — no
pip install docling required. See COMPARING.md.
Layout
| Crate | Role | Python analogue |
|---|---|---|
| fleischwolf-core | DoclingDocument model + serializers | docling-core |
| fleischwolf | DocumentConverter, source loading, backends | docling |
| fleischwolf-cli | command-line interface | docling.cli |
License
MIT, matching upstream docling.