mdkit

A Rust toolkit for getting markdown out of any document. Built for Tauri / Iced / native desktop apps that want best-in-class document extraction without a 350 MB Python sidecar.

Status: v0.7 — API stability candidate for 1.0. Format coverage closed in v0.6 (PDF with scanned + mixed-content OCR, Pandoc-formats, spreadsheets, CSV/TSV, HTML, OCR backends for macOS / Windows / Linux). v0.7 freezes the public surface — see the stability section below for what's locked in. v0.7.x will iterate on examples + cookbook docs. 1.0 ships once the API is exercised by at least one downstream production user. Watch / star the repo to follow along.

Why this exists

Most Rust desktop apps that need to read DOCX / PDF / PPTX today have two choices, both bad:

Bundle a Python sidecar with markitdown — works well, ~350 MB on disk, ~1 second cold-start per parse, single process is GIL-locked. Fine for hobby projects, painful at scale.
Use markitdown-rs — pure Rust, much smaller, but PDF extraction is basic (no layout, no OCR) and DOCX support drops headings, lists, hyperlinks, images.

mdkit is the third option: dispatch to the best tool per format, prefer in-process Rust crates and OS-native APIs, fall back to a single Pandoc binary for the formats Pandoc owns the gold standard for.

The composition we ship by default:

Format	Backend	Why
DOCX, PPTX, EPUB, RTF, ODT, LaTeX	Pandoc sidecar	Best-in-world conversion quality. ~150 MB but you bundle it once.
PDF (text)	`pdfium-render`	Google's Pdfium engine, in-process, layout-aware. ~5 MB.
PDF (scanned + mixed-content)	Pdfium renders pages → platform OCR (Vision / Windows.Media.Ocr)	Per-page composition — pages with text pass through pdfium, empty pages get OCR'd. Auto-wired by `Engine::with_defaults` when both `pdf` and `ocr-platform` features are on.
Standalone images	Platform-native OCR: Vision.framework on macOS, Windows.Media.Ocr on Windows, ONNX-based PaddleOCR (via `oar-ocr`) on Linux	OS-quality on Mac/Win for free. ONNX models on Linux.
XLSX, XLS, ODS	`calamine`	Already the Rust ecosystem standard.
CSV, TSV	`csv`	Stdlib-quality.
HTML	`html2md` (or Pandoc, configurable)	Default cheap, optional best.
Jupyter (IPYNB)	Built-in via `serde_json`	Pure-Rust JSON parse, no external deps. ~50 lines.

Total binary size with all backends: ~50-200 MB depending on which optional features you enable, vs ~350 MB for a Python markitdown sidecar.

Design principles

Best output per format, not uniform mediocrity. A single Rust crate that handles 20 formats poorly is worse than a dispatcher that uses Pandoc for what Pandoc is best at and Pdfium for what Pdfium is best at.
OS-native first. PDFKit and Vision.framework on macOS, Windows.Data.Pdf and Windows.Media.Ocr on Windows: these are battle-tested parsers Apple and Microsoft already paid for. We use them.
In-process where possible, sidecar where necessary. Spawning a process costs ~50-100 ms per file, so a folder of 1,000 files pays roughly 50-100 seconds in spawn overhead alone. Pandoc is the one sidecar we accept; everything else is a Rust crate or OS-native FFI.
Privacy-respecting. Every extractor runs entirely on-device. No telemetry, no cloud round-trips, no analytics. (LLM-based image description is opt-in and uses the caller's own provider key.)
Graceful degradation. A bad PDF doesn't crash the process; it returns a typed error. Missing optional dependencies don't break the build; they disable specific extractors via feature flags.
Small, stable, public surface. The Extractor trait + Engine dispatcher are the API. Backends are implementation details that can be swapped without breaking callers.
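
To make the last principle concrete, here is a minimal, self-contained sketch of extension-based dispatch. It is not mdkit's implementation: `ToyExtractor`, `ToyEngine`, `ToyError`, and `Labelled` are hypothetical stand-ins, and the first-registered-wins rule is an assumption modelled on the roadmap note that the HTML backend is registered before Pandoc so that it wins for HTML.

```rust
use std::path::Path;

/// Hypothetical error type; mdkit's real `Error` enum is richer than this.
#[derive(Debug, PartialEq)]
enum ToyError {
    UnsupportedFormat(String),
}

/// Hypothetical, simplified stand-in for an extractor trait.
trait ToyExtractor {
    fn extensions(&self) -> &[&'static str];
    fn extract(&self, path: &Path) -> Result<String, ToyError>;
}

/// A backend that only reports which backend handled the file.
struct Labelled {
    name: &'static str,
    exts: &'static [&'static str],
}

impl ToyExtractor for Labelled {
    fn extensions(&self) -> &[&'static str] {
        self.exts
    }

    fn extract(&self, path: &Path) -> Result<String, ToyError> {
        Ok(format!("{} handled {}", self.name, path.display()))
    }
}

#[derive(Default)]
struct ToyEngine {
    backends: Vec<Box<dyn ToyExtractor>>,
}

impl ToyEngine {
    fn register(&mut self, backend: Box<dyn ToyExtractor>) {
        self.backends.push(backend);
    }

    /// Picks the first registered backend that claims the file extension.
    fn extract(&self, path: &Path) -> Result<String, ToyError> {
        let ext = path
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .unwrap_or_default();
        self.backends
            .iter()
            .find(|b| b.extensions().iter().any(|e| *e == ext.as_str()))
            .ok_or_else(|| ToyError::UnsupportedFormat(ext.clone()))?
            .extract(path)
    }
}

fn main() {
    let mut engine = ToyEngine::default();
    // Both backends claim "html"; the one registered first wins.
    engine.register(Box::new(Labelled { name: "html2md", exts: &["html", "htm"] }));
    engine.register(Box::new(Labelled { name: "pandoc", exts: &["docx", "html"] }));

    println!("{:?}", engine.extract(Path::new("page.html")));
    println!("{:?}", engine.extract(Path::new("notes.xyz")));
}
```

Because callers only see the trait and the engine, swapping the backend behind an extension changes which entry gets registered, not the calling code.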

Quick start

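```
use mdkit::Engine;
use std::path::Path;

fn main() -> mdkit::Result<()> {
    // Registers every backend enabled through feature flags.
    let engine = Engine::with_defaults();
    // Picks a backend for the file; failures come back as a typed error.
    let doc = engine.extract(Path::new("report.pdf"))?;
    println!("{}", doc.markdown);
    Ok(())
}
```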

To register your own extractor for a custom format:

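```
use mdkit::{Document, Engine, Extractor, Result};
use std::path::Path;

struct MyParser;

impl Extractor for MyParser {
    // File extensions this extractor claims.
    fn extensions(&self) -> &[&'static str] { &["custom"] }

    // Treats the file as UTF-8 text that is already markdown.
    fn extract(&self, path: &Path) -> Result<Document> {
        Ok(Document::new(std::fs::read_to_string(path)?))
    }
}

fn main() {
    // Start from an empty engine and add only the custom backend.
    let mut engine = Engine::new();
    engine.register(Box::new(MyParser));
}
```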

Feature flags

mdkit ships with backends behind feature flags so you only pay for what you use:

```
[dependencies]
mdkit = { version = "0.7", features = ["pdf", "pandoc", "ocr-platform", "calamine"] }
```

Feature	Adds	Approx. size cost
`pdf`	`pdfium-render` for PDF text extraction	~5 MB
`pandoc`	Pandoc sidecar wrapper for DOCX/PPTX/EPUB/RTF/ODT/LaTeX	~150 MB sidecar to bundle separately
`ocr-platform`	macOS Vision.framework + Windows.Media.Ocr (Linux falls back to `ocr-onnx`)	0 on macOS/Win
`ocr-onnx`	ONNX-runtime PaddleOCR via `oar-ocr` — works on all platforms incl. Linux. Caller supplies model files (~12 MB total) and `libonnxruntime`.	~30 MB compiled
`ocr-onnx-download`	Adds `ocr-onnx` and lets `oar-ocr` fetch ONNX Runtime native libs at build / first use	0 (downloads at build)
`calamine`	XLSX / XLS / ODS via `calamine`	~1 MB
`csv`	CSV / TSV	<1 MB
`html`	HTML via `html2md`	<1 MB
`ipynb`	Jupyter notebooks via `serde_json`	<1 MB
`default`	`pdf`, `calamine`, `csv`, `html`, `ipynb` (the in-process Rust ones)	~7 MB

Not enabling pandoc or ocr-platform is fine — extractors for those formats simply won't be registered, and Engine::extract will return Error::UnsupportedFormat for them.

Examples

Runnable example programs live in examples/:

extract.rs — print the extracted markdown for any document. Surfaces backend registration failures (libpdfium / pandoc / etc.) at startup. Run with:

```
cargo run --example extract -- /path/to/document.pdf
cargo run --example extract --features pandoc -- /path/to/report.docx
cargo run --example extract --features ocr-platform -- /path/to/scan.pdf
```

batch.rs — non-recursive folder → .md batch converter. Run with:

```
cargo run --example batch -- /path/to/folder /path/to/output
```

custom_extractor.rs — implements the Extractor trait for a custom file format, showing the registration pattern. Run with:
```
cargo run --example custom_extractor -- /path/to/file.rot
```
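
For readers who want the shape of a non-recursive folder-to-markdown pass without opening the examples directory, the sketch below shows one way to write it with only the standard library. It is not the code in examples/batch.rs: `convert_folder` is a hypothetical helper, and the `convert` closure stands in for a call such as `engine.extract(path)`, returning `None` for files that should be skipped.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Converts every regular file directly inside `input` (no recursion) and
/// writes `<stem>.md` into `output`. Returns the number of files written.
/// `convert` is a hypothetical stand-in for the real extraction call.
fn convert_folder<F>(input: &Path, output: &Path, convert: F) -> io::Result<usize>
where
    F: Fn(&Path) -> Option<String>,
{
    fs::create_dir_all(output)?;
    let mut written = 0;
    for entry in fs::read_dir(input)? {
        let path = entry?.path();
        if !path.is_file() {
            continue; // non-recursive: subdirectories are ignored
        }
        let Some(stem) = path.file_stem() else { continue };
        if let Some(markdown) = convert(&path) {
            // Append ".md" to the stem so names like "a.b.txt" become "a.b.md".
            let mut name = stem.to_os_string();
            name.push(".md");
            fs::write(output.join(name), markdown)?;
            written += 1;
        }
    }
    Ok(written)
}

fn main() -> io::Result<()> {
    let mut args = std::env::args().skip(1);
    let (Some(input), Some(output)) = (args.next(), args.next()) else {
        eprintln!("usage: batch <input-dir> <output-dir>");
        return Ok(());
    };
    // Stand-in converter: treat every readable UTF-8 file as already markdown.
    let count = convert_folder(Path::new(&input), Path::new(&output), |p| {
        fs::read_to_string(p).ok()
    })?;
    println!("wrote {count} file(s)");
    Ok(())
}
```

Skipping files the converter rejects, instead of aborting the whole run, is a design choice of this sketch; a real batch tool may prefer to report each failure.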

License

Dual-licensed under MIT OR Apache 2.0 at your option. SPDX: MIT OR Apache-2.0.

Stability (v0.7+) {#stability-v07}

v0.7 is the API stability candidate for 1.0. The following surface is committed to and will only change with a major version bump:

The Extractor trait shape — required methods, default implementations, Send + Sync bound.
Engine construction + dispatch — new, with_defaults, with_defaults_diagnostic, register, extract, extract_bytes, len, is_empty.
Document field set + Document::new. Marked #[non_exhaustive] so we can add fields (page count, language, confidence) without major bumps.
Error enum semantics. Marked #[non_exhaustive] so we can add variants without major bumps. Pattern-matchers must include a wildcard arm.
Feature flag names: pdf, pandoc, calamine, csv, html, ocr-platform, ocr-onnx, ocr-onnx-download, full.
Backend name() strings — used by callers for filtering / logging.
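
The wildcard-arm requirement is easiest to see in code. The sketch below uses a hypothetical `ToyError` enum rather than mdkit's real `Error`: only `UnsupportedFormat` is named in this README, and the `Io` variant and both payload types are assumptions. Inside the defining crate the compiler does not enforce the wildcard, but downstream crates matching on a `#[non_exhaustive]` enum must include one.

```rust
/// Hypothetical error enum modelled on the description above.
/// `Io` and the `String` payloads are assumptions for illustration.
#[non_exhaustive]
#[derive(Debug)]
enum ToyError {
    UnsupportedFormat(String),
    Io(String),
}

/// Maps an error to a user-facing message.
fn user_message(err: &ToyError) -> String {
    match err {
        ToyError::UnsupportedFormat(ext) => format!("no backend handles .{ext} files"),
        // Catch-all arm: variants added in a later minor release land here
        // instead of breaking the build.
        _ => format!("extraction failed: {err:?}"),
    }
}

fn main() {
    println!("{}", user_message(&ToyError::UnsupportedFormat("xyz".into())));
    println!("{}", user_message(&ToyError::Io("disk full".into())));
}
```

The same attribute on a struct prevents downstream crates from building it with a struct literal, which is why a constructor such as Document::new is part of the committed surface.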

The following are implementation details and may change in minor versions:

Internal layout of any specific extractor (private fields, helper methods).
Exact set of Document.metadata keys per backend (new keys may appear; documented keys stay).
Auto-registration order in Engine::with_defaults (when multiple backends claim overlapping extensions; documented priority stays).
Internal sidecar / FFI details (Pandoc's --server mode, ONNX runtime version, libpdfium binding).

1.0 will be cut once the API is exercised by at least one downstream production user.

Status & roadmap

This is a young project. v0.1 shipped the trait surface, dispatch engine, and a no-op test extractor. Real backends landed release by release:

v0.2 — pdf feature (pdfium-render integration). PdfiumExtractor registers automatically in Engine::with_defaults(); falls back gracefully when libpdfium isn't installed. See src/pdf.rs for libpdfium installation notes.
v0.3 — pandoc feature. PandocExtractor covers DOCX, PPTX, EPUB, RTF, ODT, LaTeX, HTML via the pandoc binary. Auto-registers when pandoc is on PATH; supports with_binary(absolute_path) for shipping pandoc next to your app. CHANGELOG.md added.
v0.4 — calamine + csv + html features (all in-process). CalamineExtractor (XLSX/XLS/XLSB/XLSM/ODS), CsvExtractor (CSV/TSV with auto-delimiter), Html2mdExtractor (HTML/HTM, registered before Pandoc so it wins by default for HTML). Engine registration order documented inline.
v0.5 — ocr-platform feature (macOS Vision + Windows.Media.Ocr). VisionOcrExtractor (macOS) handles PNG / JPG / TIFF / BMP / GIF / HEIC via Apple Vision (neural-network-based, Apple Neural Engine-accelerated on Apple Silicon). WindowsOcrExtractor (Windows) handles PNG / JPG / TIFF / BMP / GIF via the OS-built-in Windows.Media.Ocr engine. v0.5.3 adds scanned-PDF → OCR composition: PdfiumExtractor accepts an OCR fallback at construction; when text extraction yields empty markdown, each page is rendered to a PNG and routed through OCR with ## Page N headings. Linux ONNX backend lands in v0.6.
v0.6 — ocr-onnx feature (PaddleOCR via oar-ocr). OnnxOcrExtractor wraps PaddleOCR's detection + recognition ONNX models through ort; works on Linux + macOS + Windows + WebAssembly. Caller supplies the three model files (det + rec + dict, ~12 MB total English-only) — download from https://github.com/GreatV/oar-ocr/releases. Opt into ocr-onnx-download to let oar-ocr fetch the ONNX Runtime native lib for you.
v0.7 — API stability candidate (#[non_exhaustive] audit, #[must_use] audit, stability commitments doc). v0.7.0 freezes the public surface for 1.0. v0.7.x will iterate on examples + cookbook docs + niche backend polish that doesn't change the public API. 1.0 ships once the API is exercised by at least one downstream production user.
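
The scanned-PDF composition described under v0.5 can be sketched as a per-page loop. This is an illustration under stated assumptions, not the PdfiumExtractor code: `compose_pages` is a hypothetical function, the `ocr` closure stands in for "render the page to a PNG and run the OCR backend", and emitting a `## Page N` heading for every page (not only the OCR'd ones) is a simplification chosen here.

```rust
/// Composes markdown page by page: pages with extracted text pass through,
/// pages without text are sent to the OCR fallback.
/// `ocr` is a hypothetical callback that receives a 1-based page number.
fn compose_pages<F>(page_texts: &[&str], mut ocr: F) -> String
where
    F: FnMut(usize) -> String,
{
    let mut out = String::new();
    for (index, text) in page_texts.iter().enumerate() {
        let page_no = index + 1;
        let body = if text.trim().is_empty() {
            ocr(page_no) // no text layer on this page: fall back to OCR
        } else {
            text.trim().to_string()
        };
        if !out.is_empty() {
            out.push_str("\n\n");
        }
        out.push_str(&format!("## Page {page_no}\n\n{body}"));
    }
    out
}

fn main() {
    // Page 2 has no text layer, so only that page should reach the OCR stub.
    let pages = ["Quarterly summary", "   ", "Appendix"];
    let mut ocr_calls = Vec::new();
    let markdown = compose_pages(&pages, |n| {
        ocr_calls.push(n);
        format!("(OCR text for page {n})")
    });
    println!("{markdown}");
    println!("OCR ran on pages {ocr_calls:?}");
}
```

Deciding per page keeps OCR, the slow path, limited to pages that have no text layer, which matches the per-page composition listed in the backend table.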

Issues, PRs, and design discussion welcome at https://github.com/seryai/mdkit/issues.

Used by

mdkit was extracted from the document-extraction layer of Sery Link, a privacy-respecting data network for the files on your machines. If you use mdkit in your project, please open a PR to add yourself here.

Acknowledgements

mdkit would not exist without:

markitdown — Microsoft's Python implementation, the prior art and quality benchmark for "any-doc-to-markdown."
markitdown-rs — uhobnil's Rust port, which proved the Rust-side feasibility and inspired the dispatch design.
Pandoc — John MacFarlane's universal document converter, the standard the academic publishing world is built on.
Pdfium — Google's PDF engine, free for everyone to use.
calamine — tafia's industry-standard Rust XLSX parser.

mdkit 0.7.4