mdkit
A Rust toolkit for getting markdown out of any document. Built for Tauri / Iced / native desktop apps that want best-in-class document extraction without a 350 MB Python sidecar.
Status: v0.5. Supports PDF (Pdfium), Pandoc formats (DOCX/PPTX/EPUB/RTF/ODT/LaTeX), spreadsheets (calamine), CSV/TSV, HTML (html2md / Pandoc), and macOS image OCR via Apple's Vision framework. The trait surface and dispatch engine are stable. Windows OCR lands in v0.5.x; Linux OCR (ONNX-based) in v0.6. Feature coverage is at parity with Python markitdown for non-audio formats on macOS. Watch / star the repo to follow along.
Why this exists
Most Rust desktop apps that need to read DOCX / PDF / PPTX today have two choices, both bad:
- Bundle a Python sidecar with markitdown — works well, ~350 MB on disk, ~1 second cold-start per parse, single process is GIL-locked. Fine for hobby projects, painful at scale.
- Use markitdown-rs — pure Rust, much smaller, but PDF extraction is basic (no layout, no OCR) and DOCX support drops headings, lists, hyperlinks, images.
mdkit is the third option: dispatch to the best tool per format,
prefer in-process Rust crates and OS-native APIs, fall back to a
single Pandoc binary for the formats Pandoc owns the gold standard
for.
The composition we ship by default:
| Format | Backend | Why |
|---|---|---|
| DOCX, PPTX, EPUB, RTF, ODT, LaTeX | Pandoc sidecar | Best-in-world conversion quality. ~150 MB but you bundle it once. |
| PDF (text) | pdfium-render | Google's Pdfium engine, in-process, layout-aware. ~5 MB. |
| PDF (scanned) + standalone images | Platform-native OCR — Vision.framework on macOS, Windows.Media.Ocr on Windows, ONNX-based (Surya) on Linux | OS-quality on Mac/Win for free. ONNX models on Linux. |
| XLSX, XLS, ODS | calamine | Already the Rust ecosystem standard. |
| CSV, TSV | csv | Stdlib-quality. |
| HTML | html2md (or Pandoc, configurable) | Default cheap, optional best. |
Total binary size with all backends: ~50-200 MB depending on which optional features you enable, vs ~350 MB for a Python markitdown sidecar.
Design principles
- Best output per format, not uniform mediocrity. A single Rust crate that handles 20 formats poorly is worse than a dispatcher that uses Pandoc for what Pandoc is best at and Pdfium for what Pdfium is best at.
- OS-native first. macOS PDFKit + Vision.framework, Windows Windows.Data.Pdf + Windows.Media.Ocr — these are battle-tested parsers Apple and Microsoft already paid for. We use them.
- In-process where possible, sidecar where necessary. Process spawn is ~50-100 ms per file. For a folder of 1,000 files, that's real time. Pandoc is the one sidecar we accept; everything else is a Rust crate or OS-native FFI.
- Privacy-respecting. Every extractor runs entirely on-device. No telemetry, no cloud round-trips, no analytics. (LLM-based image description is opt-in and uses the caller's own provider key.)
- Graceful degradation. A bad PDF doesn't crash the process; it returns a typed error. Missing optional dependencies don't break the build; they disable specific extractors via feature flags.
- Small, stable, public surface. The Extractor trait + Engine dispatcher are the API. Backends are implementation details that can be swapped without breaking callers.
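To put the process-spawn figure in perspective, here is a back-of-the-envelope calculation using the ~50-100 ms range quoted above. The helper name `spawn_overhead` is illustrative and not part of mdkit, and the sketch assumes files are processed one after another.

```rust
use std::time::Duration;

/// Total time spent only on starting processes when a sidecar is
/// spawned once per file, sequentially. Illustrative helper.
pub fn spawn_overhead(files: u32, per_spawn: Duration) -> Duration {
    // `Duration` implements `Mul<u32>`, so this is a plain multiplication.
    per_spawn * files
}
```

For a folder of 1,000 files, `spawn_overhead(1_000, Duration::from_millis(50))` gives 50 seconds and the 100 ms end of the range gives 100 seconds, all before any parsing happens.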
Quick start
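    use mdkit::Engine;
    use std::path::Path;

    let engine = Engine::with_defaults();
    // "report.pdf" is a placeholder path; the `markdown` field name is an assumption.
    let doc = engine.extract(Path::new("report.pdf"))?;
    println!("{}", doc.markdown);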
To register your own extractor for a custom format:
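    use mdkit::{Engine, Extractor};
    use std::path::Path;

    // `MyExtractor` stands in for your own type implementing `Extractor`.
    struct MyExtractor;
    // impl Extractor for MyExtractor { ... }

    let mut engine = Engine::new();
    // Boxing the extractor is an assumption about `register`'s signature.
    engine.register(Box::new(MyExtractor));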
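For readers who want to see how a trait-plus-dispatcher design of this kind can fit together, here is a minimal, self-contained sketch. It does not use the mdkit crate. The trait methods, the `Document` and `Error` shapes, and the `XyzExtractor` type are all assumptions made for illustration, so check the crate documentation for the real signatures.

```rust
use std::path::Path;

/// Result of an extraction. The `markdown` field name is an assumption.
#[derive(Debug, PartialEq)]
pub struct Document {
    pub markdown: String,
}

#[derive(Debug, PartialEq)]
pub enum Error {
    /// No registered extractor claims this file extension.
    UnsupportedFormat(String),
}

/// Hypothetical trait shape: declare the extensions handled, then convert.
pub trait Extractor {
    fn extensions(&self) -> &[&'static str];
    fn extract(&self, path: &Path) -> Result<Document, Error>;
}

#[derive(Default)]
pub struct Engine {
    extractors: Vec<Box<dyn Extractor>>,
}

impl Engine {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn register(&mut self, extractor: Box<dyn Extractor>) {
        self.extractors.push(extractor);
    }

    /// Dispatch on the lowercased file extension. Extractors are tried in
    /// registration order, so the first one registered for a format wins.
    pub fn extract(&self, path: &Path) -> Result<Document, Error> {
        let ext = path
            .extension()
            .and_then(|e| e.to_str())
            .unwrap_or("")
            .to_ascii_lowercase();
        let found = self
            .extractors
            .iter()
            .find(|x| x.extensions().iter().any(|e| *e == ext));
        match found {
            Some(extractor) => extractor.extract(path),
            None => Err(Error::UnsupportedFormat(ext)),
        }
    }
}

/// Example extractor for a made-up ".xyz" format. It does not read the
/// file; it only emits a heading with the path, to keep the sketch small.
pub struct XyzExtractor;

impl Extractor for XyzExtractor {
    fn extensions(&self) -> &[&'static str] {
        &["xyz"]
    }

    fn extract(&self, path: &Path) -> Result<Document, Error> {
        Ok(Document {
            markdown: format!("# {}\n", path.display()),
        })
    }
}
```

With this sketch, registering `Box::new(XyzExtractor)` on `Engine::new()` makes `extract` accept `.xyz` files (case-insensitively), while any other extension yields `Error::UnsupportedFormat`. Trying extractors in registration order is one simple way to let an earlier backend take precedence over a later one for the same format.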
Feature flags
mdkit ships with backends behind feature flags so you only pay for
what you use:
    [dependencies]
    mdkit = { version = "0.1", features = ["pdf", "pandoc", "ocr-platform", "calamine"] }
| Feature | Adds | Approx. size cost |
|---|---|---|
| pdf | pdfium-render for PDF text extraction | ~5 MB |
| pandoc | Pandoc sidecar wrapper for DOCX/PPTX/EPUB/RTF/ODT/LaTeX | ~150 MB sidecar to bundle separately |
| ocr-platform | macOS Vision.framework + Windows.Media.Ocr (Linux falls back to ocr-onnx) | 0 on macOS/Win |
| ocr-onnx | ONNX-based OCR with Surya model; works on all platforms incl. Linux | ~50 MB model |
| calamine | XLSX / XLS / ODS via calamine | ~1 MB |
| csv | CSV / TSV | <1 MB |
| html | HTML via html2md | <1 MB |
| default | pdf, calamine, csv, html (the in-process Rust ones) | ~7 MB |
Not enabling pandoc or ocr-platform is fine — extractors for those
formats simply won't be registered, and Engine::extract will return
Error::UnsupportedFormat for them.
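As a rough illustration of how feature-gated registration can work in Rust, the sketch below builds a list of accepted extensions from `#[cfg(feature = "...")]` blocks. The feature names come from the table above, but the functions and the extension lists are hypothetical and are not taken from mdkit's source.

```rust
/// File extensions that would be accepted, given the Cargo features enabled
/// at compile time. Illustrative only; not mdkit's real registration code.
pub fn registered_extensions() -> Vec<&'static str> {
    #[allow(unused_mut)]
    let mut exts: Vec<&'static str> = Vec::new();

    // Each statement is compiled in only when its feature is enabled, so a
    // disabled backend contributes no code and no registered formats.
    #[cfg(feature = "pdf")]
    exts.push("pdf");

    #[cfg(feature = "pandoc")]
    exts.extend(["docx", "pptx", "epub", "rtf", "odt", "tex"]);

    #[cfg(feature = "calamine")]
    exts.extend(["xlsx", "xls", "ods"]);

    exts
}

/// True when some compiled-in backend claims `ext`.
pub fn is_supported(ext: &str) -> bool {
    registered_extensions().iter().any(|e| *e == ext)
}
```

Compiled on its own, with none of these features enabled, the list is empty and `is_supported("docx")` is false. That mirrors the behaviour described above, where a format without a registered extractor is reported as unsupported instead of breaking the build.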
License
Dual-licensed under MIT OR Apache 2.0
at your option. SPDX: MIT OR Apache-2.0.
Status & roadmap
This is a young project. v0.1 ships the trait surface, dispatch engine, and a no-op test extractor. Real backends land per the roadmap below:
- v0.2: pdf feature (pdfium-render integration). PdfiumExtractor registers automatically in Engine::with_defaults(); falls back gracefully when libpdfium isn't installed. See src/pdf.rs for libpdfium installation notes.
- v0.3: pandoc feature. PandocExtractor covers DOCX, PPTX, EPUB, RTF, ODT, LaTeX, HTML via the pandoc binary. Auto-registers when pandoc is on PATH; supports with_binary(absolute_path) for shipping pandoc next to your app. CHANGELOG.md added.
- v0.4: calamine + csv + html features (all in-process). CalamineExtractor (XLSX/XLS/XLSB/XLSM/ODS), CsvExtractor (CSV/TSV with auto-delimiter), Html2mdExtractor (HTML/HTM, registered before Pandoc so it wins by default for HTML). Engine registration order documented inline.
- v0.5: ocr-platform feature (macOS Vision). VisionOcrExtractor for PNG / JPG / TIFF / BMP / GIF / HEIC. Apple Vision is neural-network-based and Apple Neural Engine-accelerated on Apple Silicon. Windows.Media.Ocr lands in v0.5.x; Linux ONNX backend in v0.6.
- v0.6: ocr-onnx feature (Surya + ONNX runtime fallback)
- v0.7: Audit pass + first stable trait release (1.0 candidate)
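The v0.3 entry mentions auto-registration when pandoc is on PATH. One plausible way to perform such a check with only the standard library is sketched below. The function names `find_in_path` and `find_pandoc` are hypothetical, this is not mdkit's actual detection logic, and the sketch does not verify that the file is executable.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Search the directories of a PATH-style value for a binary called `name`.
/// The platform's executable suffix (".exe" on Windows) is appended.
pub fn find_in_path(name: &str, path_var: &OsStr) -> Option<PathBuf> {
    let file_name = format!("{name}{}", env::consts::EXE_SUFFIX);
    env::split_paths(path_var)
        .map(|dir| dir.join(&file_name))
        .find(|candidate| candidate.is_file())
}

/// Convenience wrapper that reads the real PATH of the current process.
pub fn find_pandoc() -> Option<PathBuf> {
    let path_var = env::var_os("PATH")?;
    find_in_path("pandoc", &path_var)
}
```

Taking the PATH value as a parameter keeps the lookup easy to test with a temporary directory. An application that ships its own pandoc can skip this lookup entirely and pass an absolute path through with_binary, as noted in the v0.3 entry.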
Issues, PRs, and design discussion welcome at https://github.com/mdkit-project/mdkit/issues.
Used by
mdkit was extracted from the document-extraction layer of Sery
Link, a privacy-respecting data network for the files on your
machines. If you use mdkit in your project, please open a PR to
add yourself here.
Acknowledgements
mdkit would not exist without:
- markitdown: Microsoft's Python implementation, the prior art and quality benchmark for "any-doc-to-markdown."
- markitdown-rs: uhobnil's Rust port, which proved the Rust-side feasibility and inspired the dispatch design.
- Pandoc: John MacFarlane's universal document converter, the standard the academic publishing world is built on.
- Pdfium: Google's PDF engine, free for everyone to use.
- calamine: tafia's industry-standard Rust XLSX parser.