mdkit

A Rust toolkit for getting markdown out of any document. Built for Tauri / Iced / native desktop apps that want best-in-class document extraction without a 350 MB Python sidecar.

Status: v0.5. Supported today: PDF (Pdfium), Pandoc-backed formats (DOCX/PPTX/EPUB/RTF/ODT/LaTeX), spreadsheets (calamine), CSV/TSV, HTML (html2md / Pandoc), macOS image OCR via Apple's Vision framework, and Windows image OCR via Windows.Media.Ocr. The trait surface and dispatch engine are stable. Linux OCR (ONNX-based) lands in v0.6. Feature coverage is at parity with Python markitdown for non-audio formats on macOS and Windows. Watch / star the repo to follow along.

Why this exists

Most Rust desktop apps that need to read DOCX / PDF / PPTX today have two choices, both bad:

Bundle a Python sidecar with markitdown: works well, but costs ~350 MB on disk and ~1 second of cold start per parse, and each process is limited by the GIL. Fine for hobby projects, painful at scale.
Use markitdown-rs: pure Rust and much smaller, but PDF extraction is basic (no layout, no OCR) and DOCX support drops headings, lists, hyperlinks, and images.

mdkit is the third option: dispatch to the best tool per format, prefer in-process Rust crates and OS-native APIs, and fall back to a single Pandoc binary for the formats where Pandoc is the gold standard.

The composition we ship by default:

Format	Backend	Why
DOCX, PPTX, EPUB, RTF, ODT, LaTeX	Pandoc sidecar	Best-in-world conversion quality. ~150 MB but you bundle it once.
PDF (text)	`pdfium-render`	Google's Pdfium engine, in-process, layout-aware. ~5 MB.
PDF (scanned + mixed-content)	Pdfium renders pages → platform OCR (Vision / Windows.Media.Ocr)	Per-page composition — pages with text pass through pdfium, empty pages get OCR'd. Auto-wired by `Engine::with_defaults` when both `pdf` and `ocr-platform` features are on.
Standalone images	Platform-native OCR: Vision.framework on macOS, Windows.Media.Ocr on Windows, ONNX-based (Surya) on Linux (planned for v0.6)	OS-quality OCR on macOS/Windows at no extra size cost. ONNX models on Linux.
XLSX, XLS, ODS	`calamine`	Already the Rust ecosystem standard.
CSV, TSV	`csv`	Stdlib-quality.
HTML	`html2md` (or Pandoc, configurable)	Default cheap, optional best.

Total binary size with all backends: ~50-200 MB depending on which optional features you enable, vs ~350 MB for a Python markitdown sidecar.
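
The per-page composition described in the table for scanned and mixed-content PDFs can be sketched as follows. This is a minimal, std-only illustration and not mdkit's actual code: `Page`, `Ocr`, `ByteCountOcr`, and `compose_markdown` are hypothetical names, and emitting a `## Page N` heading for every page (not only the OCR'd ones) is an assumption made to keep the output uniform.

```rust
/// Hypothetical view of one PDF page: its embedded text layer (possibly
/// empty for scans) and the page rendered to PNG bytes.
pub struct Page {
    pub text_layer: String,
    pub rendered_png: Vec<u8>,
}

/// Hypothetical OCR interface; a real backend would wrap Vision or
/// Windows.Media.Ocr behind something like this.
pub trait Ocr {
    fn recognize(&self, png: &[u8]) -> String;
}

/// Stand-in OCR used for demonstration: it only reports the input size.
pub struct ByteCountOcr;

impl Ocr for ByteCountOcr {
    fn recognize(&self, png: &[u8]) -> String {
        format!("ocr({} bytes)", png.len())
    }
}

/// Pages with an embedded text layer pass through unchanged; pages whose
/// text layer is empty (or whitespace only) are sent to OCR instead.
pub fn compose_markdown(pages: &[Page], ocr: &dyn Ocr) -> String {
    let mut out = String::new();
    for (index, page) in pages.iter().enumerate() {
        let embedded = page.text_layer.trim();
        let text = if embedded.is_empty() {
            ocr.recognize(&page.rendered_png)
        } else {
            embedded.to_string()
        };
        // Assumption: every page gets a heading, numbered from 1.
        out.push_str(&format!("## Page {}\n\n{}\n\n", index + 1, text.trim()));
    }
    out.trim_end().to_string()
}
```

The decision is made page by page, so a mostly digital PDF with a few scanned inserts only pays the OCR cost for those inserts.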

Design principles

Best output per format, not uniform mediocrity. A single Rust crate that handles 20 formats poorly is worse than a dispatcher that uses Pandoc for what Pandoc is best at and Pdfium for what Pdfium is best at.
OS-native first. macOS PDFKit + Vision.framework, Windows Windows.Data.Pdf + Windows.Media.Ocr — these are battle-tested parsers Apple and Microsoft already paid for. We use them.
In-process where possible, sidecar where necessary. Process spawn is ~50-100 ms per file. For a folder of 1,000 files, that adds 50-100 seconds of spawn overhead alone. Pandoc is the one sidecar we accept; everything else is a Rust crate or OS-native FFI.
Privacy-respecting. Every extractor runs entirely on-device. No telemetry, no cloud round-trips, no analytics. (LLM-based image description is opt-in and uses the caller's own provider key.)
Graceful degradation. A bad PDF doesn't crash the process; it returns a typed error. Missing optional dependencies don't break the build; they disable specific extractors via feature flags.
Small, stable, public surface. The Extractor trait + Engine dispatcher are the API. Backends are implementation details that can be swapped without breaking callers.
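
To make the "trait plus dispatcher" idea concrete, here is a small self-contained sketch of extension-based dispatch with a typed error. It mirrors the method names used in this README (`extensions`, `extract`, `register`) but everything else is an assumption for illustration: the real `Document`, `Error`, and `Engine` types in mdkit may differ, and `Stub` is a toy backend invented here. The first-registered-wins rule is modelled on the roadmap note that Html2mdExtractor is registered before Pandoc so it wins for HTML.

```rust
use std::path::Path;

/// Hypothetical stand-in for mdkit's `Document`.
#[derive(Debug, PartialEq)]
pub struct Document {
    pub markdown: String,
}

/// Hypothetical typed error; only `UnsupportedFormat` is named in this README.
#[derive(Debug, PartialEq)]
pub enum Error {
    UnsupportedFormat(String),
}

pub trait Extractor {
    fn extensions(&self) -> &[&'static str];
    fn extract(&self, path: &Path) -> Result<Document, Error>;
}

/// Toy backend that claims some extensions and reports which backend ran.
pub struct Stub {
    pub name: &'static str,
    pub exts: &'static [&'static str],
}

impl Extractor for Stub {
    fn extensions(&self) -> &[&'static str] {
        self.exts
    }
    fn extract(&self, _path: &Path) -> Result<Document, Error> {
        Ok(Document { markdown: format!("converted by {}", self.name) })
    }
}

#[derive(Default)]
pub struct Engine {
    extractors: Vec<Box<dyn Extractor>>,
}

impl Engine {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn register(&mut self, extractor: Box<dyn Extractor>) {
        self.extractors.push(extractor);
    }

    /// Picks the first registered extractor that claims the file's
    /// (lowercased) extension; returns a typed error when none does.
    pub fn extract(&self, path: &Path) -> Result<Document, Error> {
        let ext = path
            .extension()
            .and_then(|e| e.to_str())
            .map(|e| e.to_ascii_lowercase())
            .unwrap_or_default();
        let backend = self
            .extractors
            .iter()
            .find(|b| b.extensions().iter().any(|known| *known == ext.as_str()));
        match backend {
            Some(b) => b.extract(path),
            None => Err(Error::UnsupportedFormat(ext)),
        }
    }
}
```

Because callers only see the trait and the engine, a backend can be replaced by registering a different extractor for the same extensions, without touching call sites.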

Quick start

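use mdkit::Engine;
use std::path::Path;

fn main() -> mdkit::Result<()> {
    // Build an engine with every extractor enabled by the active feature flags.
    let engine = Engine::with_defaults();
    // Convert the file; failures surface as typed errors via `?`.
    let doc = engine.extract(Path::new("report.pdf"))?;
    println!("{}", doc.markdown);
    Ok(())
}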

To register your own extractor for a custom format:

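use mdkit::{Engine, Extractor, Document, Result};
use std::path::Path;

struct MyParser;

impl Extractor for MyParser {
    // File extensions this extractor claims.
    fn extensions(&self) -> &[&'static str] { &["custom"] }
    // Treat the file's contents as already being markdown.
    fn extract(&self, path: &Path) -> Result<Document> {
        Ok(Document::new(std::fs::read_to_string(path)?))
    }
}

fn main() {
    // Start from an empty engine and register the custom extractor.
    let mut engine = Engine::new();
    engine.register(Box::new(MyParser));
}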

Feature flags

mdkit ships with backends behind feature flags so you only pay for what you use:

[dependencies]
mdkit = { version = "0.5", features = ["pdf", "pandoc", "ocr-platform", "calamine"] }

Feature	Adds	Approx. size cost
`pdf`	`pdfium-render` for PDF text extraction	~5 MB
`pandoc`	Pandoc sidecar wrapper for DOCX/PPTX/EPUB/RTF/ODT/LaTeX	~150 MB sidecar to bundle separately
`ocr-platform`	macOS Vision.framework + Windows.Media.Ocr (Linux falls back to `ocr-onnx`)	0 on macOS/Win
`ocr-onnx`	ONNX-based OCR with Surya model — works on all platforms incl. Linux	~50 MB model
`calamine`	XLSX / XLS / ODS via `calamine`	~1 MB
`csv`	CSV / TSV	<1 MB
`html`	HTML via `html2md`	<1 MB
`default`	`pdf`, `calamine`, `csv`, `html` (the in-process Rust ones)	~7 MB

Not enabling pandoc or ocr-platform is fine — extractors for those formats simply won't be registered, and Engine::extract will return Error::UnsupportedFormat for them.
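
On the caller side, this means an unsupported file can be treated as "skip" instead of "fail". The sketch below shows one way to process a batch under that policy. It is self-contained and hedged: `Error::Other`, `Summary`, `convert_all`, and `stub_extract` are hypothetical names invented for the example, and `stub_extract` merely stands in for a call such as `Engine::extract`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical error type; only `UnsupportedFormat` is named in this README.
#[derive(Debug, PartialEq)]
pub enum Error {
    UnsupportedFormat(String),
    Other(String),
}

/// Outcome of a batch run, split by what happened to each file.
#[derive(Debug, Default, PartialEq)]
pub struct Summary {
    pub converted: Vec<PathBuf>,
    pub skipped: Vec<PathBuf>,
    pub failed: Vec<(PathBuf, String)>,
}

/// Runs `extract` over every path. Unsupported formats are skipped quietly;
/// any other error is recorded without aborting the batch.
pub fn convert_all<F>(paths: &[PathBuf], extract: F) -> Summary
where
    F: Fn(&Path) -> Result<String, Error>,
{
    let mut summary = Summary::default();
    for path in paths {
        match extract(path.as_path()) {
            Ok(_markdown) => summary.converted.push(path.clone()),
            Err(Error::UnsupportedFormat(_)) => summary.skipped.push(path.clone()),
            Err(Error::Other(msg)) => summary.failed.push((path.clone(), msg)),
        }
    }
    summary
}

/// Stand-in for a real engine: supports CSV, fails on PDF, and reports
/// everything else as unsupported (as if `pandoc` were not enabled).
pub fn stub_extract(path: &Path) -> Result<String, Error> {
    match path.extension().and_then(|e| e.to_str()) {
        Some("csv") => Ok("| a | b |".to_string()),
        Some("pdf") => Err(Error::Other("corrupt file".to_string())),
        other => Err(Error::UnsupportedFormat(other.unwrap_or("").to_string())),
    }
}
```

Keeping "skipped" separate from "failed" lets an app tell users which files need an extra feature enabled and which are genuinely broken.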

License

Dual-licensed under MIT OR Apache 2.0 at your option. SPDX: MIT OR Apache-2.0.

Status & roadmap

This is a young project. v0.1 shipped the trait surface, dispatch engine, and a no-op test extractor. Real backends land release by release: v0.2 through v0.5 have shipped, and v0.6 and v0.7 are planned.

v0.2 — pdf feature (pdfium-render integration). PdfiumExtractor registers automatically in Engine::with_defaults(); falls back gracefully when libpdfium isn't installed. See src/pdf.rs for libpdfium installation notes.
v0.3 — pandoc feature. PandocExtractor covers DOCX, PPTX, EPUB, RTF, ODT, LaTeX, HTML via the pandoc binary. Auto-registers when pandoc is on PATH; supports with_binary(absolute_path) for shipping pandoc next to your app. CHANGELOG.md added.
v0.4 — calamine + csv + html features (all in-process). CalamineExtractor (XLSX/XLS/XLSB/XLSM/ODS), CsvExtractor (CSV/TSV with auto-delimiter), Html2mdExtractor (HTML/HTM, registered before Pandoc so it wins by default for HTML). Engine registration order documented inline.
v0.5 — ocr-platform feature (macOS Vision + Windows.Media.Ocr). VisionOcrExtractor (macOS) handles PNG / JPG / TIFF / BMP / GIF / HEIC via Apple Vision (neural-network-based, Apple Neural Engine-accelerated on Apple Silicon). WindowsOcrExtractor (Windows) handles PNG / JPG / TIFF / BMP / GIF via the OS-built-in Windows.Media.Ocr engine. v0.5.3 adds scanned-PDF → OCR composition: PdfiumExtractor accepts an OCR fallback at construction; when text extraction yields empty markdown, each page is rendered to a PNG and routed through OCR with ## Page N headings. Linux ONNX backend lands in v0.6.
v0.6 — ocr-onnx feature (Surya + ONNX runtime fallback)
v0.7 — Audit pass + first stable trait release (1.0 candidate)
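
The v0.3 entry mentions auto-registration when pandoc is on PATH, with an explicit binary path as the alternative. How mdkit performs that check is not shown here; the following is one std-only way such a lookup could be written. All function names (`find_in_dirs`, `find_on_path`, `resolve_binary`) are hypothetical, and the sketch ignores details such as executable permission bits.

```rust
use std::env;
use std::path::{Path, PathBuf};

/// Returns the first `dir/name` that exists as a regular file.
pub fn find_in_dirs<I>(name: &str, dirs: I) -> Option<PathBuf>
where
    I: IntoIterator<Item = PathBuf>,
{
    dirs.into_iter()
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

/// Searches the directories listed in the PATH environment variable,
/// appending the platform's executable suffix (".exe" on Windows).
pub fn find_on_path(name: &str) -> Option<PathBuf> {
    let path_var = env::var_os("PATH")?;
    let file_name = format!("{name}{}", env::consts::EXE_SUFFIX);
    find_in_dirs(&file_name, env::split_paths(&path_var))
}

/// An explicit path (for a binary shipped next to the app) takes priority;
/// otherwise fall back to searching PATH.
pub fn resolve_binary(explicit: Option<&Path>, name: &str) -> Option<PathBuf> {
    match explicit {
        Some(p) if p.is_file() => Some(p.to_path_buf()),
        Some(_) => None,
        None => find_on_path(name),
    }
}
```

Returning `None` when an explicit path is given but missing, instead of silently falling back to PATH, is a design choice in this sketch: a bundled binary that has gone missing is usually a packaging bug worth surfacing.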

Issues, PRs, and design discussion welcome at https://github.com/mdkit-project/mdkit/issues.

Used by

mdkit was extracted from the document-extraction layer of Sery Link, a privacy-respecting data network for the files on your machines. If you use mdkit in your project, please open a PR to add yourself here.

Acknowledgements

mdkit would not exist without:

markitdown — Microsoft's Python implementation, the prior art and quality benchmark for "any-doc-to-markdown."
markitdown-rs — uhobnil's Rust port, which proved the Rust-side feasibility and inspired the dispatch design.
Pandoc — John MacFarlane's universal document converter, the standard the academic publishing world is built on.
Pdfium — Google's PDF engine, free for everyone to use.
calamine — tafia's industry-standard Rust XLSX parser.

mdkit 0.5.6