
Crate mdkit


§mdkit — get markdown out of any document.

See the README for the full design rationale. In short, mdkit dispatches on file extension to the best backend for each format: Pandoc for DOCX/PPTX/EPUB/RTF/ODT/LaTeX, Pdfium for PDF, OS-native APIs for OCR, and calamine for spreadsheets.
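As a rough illustration of extension-based dispatch, the sketch below maps a path to the backend family named above. It is not mdkit's implementation: `backend_for` is a hypothetical helper, and the exact extension strings (for example `tex` for LaTeX, or `xlsx`/`xls`/`ods` for spreadsheets) and the case-insensitive comparison are assumptions made for the example.

```rust
use std::path::Path;

/// Hypothetical helper: names the backend family a path would route to.
/// The extension lists are assumptions based on the formats listed above.
fn backend_for(path: &Path) -> Option<&'static str> {
    // Lowercase the extension so "REPORT.PDF" and "report.pdf" agree.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "docx" | "pptx" | "epub" | "rtf" | "odt" | "tex" => Some("pandoc"),
        "pdf" => Some("pdfium"),
        "xlsx" | "xls" | "ods" => Some("calamine"),
        // No extension match means no backend in this sketch.
        _ => None,
    }
}

fn main() {
    for name in ["report.pdf", "slides.PPTX", "budget.xlsx", "README"] {
        println!("{name} -> {:?}", backend_for(Path::new(name)));
    }
}
```

A path without an extension, such as `README`, yields `None` here; how mdkit itself reports an unsupported file is defined by its `Error` enum, not by this sketch.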

§Quick start

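use mdkit::Engine;
use std::path::Path;

fn main() -> mdkit::Result<()> {
    // Build an engine with the backends enabled by the crate's feature flags.
    let engine = Engine::with_defaults();
    // The ".pdf" extension selects the backend that handles this file.
    let doc = engine.extract(Path::new("report.pdf"))?;
    println!("{}", doc.markdown);
    Ok(())
}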

§Custom extractor

Implement Extractor for your own format and register it on an Engine:

use mdkit::{Document, Engine, Extractor, Result};
use std::path::Path;

struct MyParser;

impl Extractor for MyParser {
    // File extensions this backend claims.
    fn extensions(&self) -> &[&'static str] { &["custom"] }
    // Treat the file's contents as markdown verbatim.
    fn extract(&self, path: &Path) -> Result<Document> {
        Ok(Document::new(std::fs::read_to_string(path)?))
    }
}

let mut engine = Engine::new();
engine.register(Box::new(MyParser));

§Stability commitment (v0.7+)

v0.7 is the API stability candidate for 1.0. The following surface is committed and will change only with a major version bump:

  • The Extractor trait shape — required methods, default implementations, Send + Sync bound.
  • Engine construction and dispatch methods — new, with_defaults, with_defaults_diagnostic, register, extract, extract_bytes, len, is_empty.
  • Document field set + Document::new. Marked #[non_exhaustive] so we can add fields (page count, language, confidence) without major bumps.
  • Error enum semantics. Marked #[non_exhaustive] so we can add variants (e.g. encrypted-document) without major bumps. Pattern-matchers must include a wildcard arm.
  • Feature flag names: pdf, pandoc, calamine, csv, html, ocr-platform, ocr-onnx, ocr-onnx-download, full.
  • Backend name() strings — used by callers for filtering / logging. Stable per release line.
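The wildcard-arm requirement for `#[non_exhaustive]` enums can be sketched with a stand-in type. `ExtractError` and its variants below are hypothetical and are not mdkit's `Error`; the point is only the shape of the `match`. When the enum comes from another crate, the compiler rejects a `match` without the `_` arm, which is what allows new variants to arrive in a minor release.

```rust
// Stand-in for a #[non_exhaustive] error enum. The type and variant names
// are hypothetical; mdkit's real Error variants may differ.
#[non_exhaustive]
#[derive(Debug)]
enum ExtractError {
    UnsupportedFormat(String),
    Timeout,
}

fn describe(err: &ExtractError) -> String {
    match err {
        ExtractError::UnsupportedFormat(ext) => format!("no backend for .{ext}"),
        // The wildcard arm absorbs every variant not named above,
        // including variants added in a later release.
        _ => "extraction failed".to_string(),
    }
}

fn main() {
    println!("{}", describe(&ExtractError::UnsupportedFormat("xyz".to_string())));
    println!("{}", describe(&ExtractError::Timeout));
}
```

Within the defining crate the attribute has no effect, so this file compiles either way. The wildcard only becomes mandatory for downstream callers.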

The following are implementation details and may change in minor versions:

  • The internal layout of any specific extractor (private fields, helper methods).
  • The exact set of Document.metadata keys per backend (new keys may appear; existing documented keys stay).
  • Auto-registration order in Engine::with_defaults (when multiple backends claim overlapping extensions; documented priority stays).
  • Internal sidecar / FFI details (Pandoc’s --server mode, ONNX runtime version).
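Because the set of metadata keys can grow between minor versions, callers are probably safest reading keys individually and treating each one as optional. The sketch below assumes a `HashMap<String, String>` as a stand-in; the actual type of `Document.metadata` is not stated here, and the `page_count` key is hypothetical.

```rust
use std::collections::HashMap;

/// Reads an optional page count from a metadata map.
/// Assumption: metadata is a string-to-string map and "page_count" is a
/// hypothetical key; neither is confirmed by the mdkit documentation above.
fn page_count(metadata: &HashMap<String, String>) -> Option<u32> {
    // A missing key or an unparsable value both yield None.
    metadata.get("page_count")?.parse().ok()
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert("page_count".to_string(), "12".to_string());
    // Unknown keys are simply ignored by the lookup.
    metadata.insert("producer".to_string(), "example".to_string());
    println!("{:?}", page_count(&metadata));
}
```

Looking up only the keys you need, rather than asserting on the full key set, keeps such code working when a backend starts emitting additional keys.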

Modules§

calamine
Spreadsheet text extraction via calamine.
csv
CSV / TSV extraction via the csv crate.
html
HTML extraction via html2md.
ipynb
Jupyter notebook (.ipynb) extraction.
pdf
PDF text extraction via Google’s Pdfium engine.

Structs§

Document
The result of extracting one document. Markdown is always present; title and metadata are best-effort and may be empty depending on the backend.
Engine
Dispatches extract calls to the registered Extractor for the file’s extension. Construct with Engine::new for an empty engine, or Engine::with_defaults to populate the defaults that match enabled feature flags.
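To make the register-then-dispatch idea concrete, here is a toy registry built on trait objects. It is a sketch under stated assumptions, not mdkit's `Engine`: `Backend`, `Registry`, `lookup`, `PdfStub`, and `TextStub` are hypothetical names, and the "first registered match wins" rule is chosen only for the example (mdkit's real priority rules are documented separately).

```rust
use std::path::Path;

// Stand-in trait; it borrows the name()/extensions() idea and the
// Send + Sync bound, but it is not mdkit's Extractor definition.
trait Backend: Send + Sync {
    fn name(&self) -> &'static str;
    fn extensions(&self) -> &[&'static str];
}

struct PdfStub;
impl Backend for PdfStub {
    fn name(&self) -> &'static str { "pdf-stub" }
    fn extensions(&self) -> &[&'static str] { &["pdf"] }
}

struct TextStub;
impl Backend for TextStub {
    fn name(&self) -> &'static str { "text-stub" }
    fn extensions(&self) -> &[&'static str] { &["txt", "md"] }
}

// Hypothetical registry: an ordered list of boxed backends.
struct Registry {
    backends: Vec<Box<dyn Backend>>,
}

impl Registry {
    fn new() -> Self { Self { backends: Vec::new() } }
    fn register(&mut self, backend: Box<dyn Backend>) { self.backends.push(backend); }
    fn len(&self) -> usize { self.backends.len() }
    fn is_empty(&self) -> bool { self.backends.is_empty() }

    /// Returns the name of the first registered backend claiming the
    /// path's extension (assumption: first match wins).
    fn lookup(&self, path: &Path) -> Option<&'static str> {
        let ext = path.extension()?.to_str()?.to_ascii_lowercase();
        self.backends
            .iter()
            .find(|b| b.extensions().iter().any(|e| *e == ext))
            .map(|b| b.name())
    }
}

fn main() {
    let mut registry = Registry::new();
    println!("empty: {}", registry.is_empty());
    registry.register(Box::new(PdfStub));
    registry.register(Box::new(TextStub));
    println!("backends: {}", registry.len());
    println!("{:?}", registry.lookup(Path::new("report.pdf")));
    println!("{:?}", registry.lookup(Path::new("image.png")));
}
```

Storing backends as `Box<dyn Backend>` is what lets user-defined types sit alongside built-in ones, and the `Send + Sync` bound lets a registry of this shape be shared across threads.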

Enums§

Error
Errors that can arise during extraction.

Traits§

Extractor
A backend that knows how to convert one or more file formats to markdown. Implementors register themselves with an Engine.

Type Aliases§

Result
Result alias used across the crate.