Crate mdkit

§mdkit — get markdown out of any document.

See the README for the full design rationale. In short, mdkit dispatches on file extension to the best backend for each format: Pandoc for DOCX/PPTX/EPUB/RTF/ODT/LaTeX, Pdfium for PDF, OS-native APIs for OCR, and calamine for spreadsheets.
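
As an illustration of extension-based dispatch, here is a minimal, self-contained sketch. It does not use mdkit and is not the crate's implementation: the `Backend` enum, the `backend_for` function, the exact extension lists (for example `tex` for LaTeX and the image extensions for OCR), and the case-insensitive matching are all assumptions made for this example.

```rust
use std::path::Path;

/// Hypothetical backend labels mirroring the formats listed above.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Backend {
    Pandoc,
    Pdfium,
    Ocr,
    Calamine,
}

/// Pick a backend from the file extension (compared case-insensitively).
/// Returns `None` when the extension is missing or not recognized.
fn backend_for(path: &Path) -> Option<Backend> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        // Assumed extension lists; the real crate may accept others.
        "docx" | "pptx" | "epub" | "rtf" | "odt" | "tex" => Some(Backend::Pandoc),
        "pdf" => Some(Backend::Pdfium),
        "png" | "jpg" | "jpeg" => Some(Backend::Ocr),
        "xlsx" | "xls" | "ods" => Some(Backend::Calamine),
        _ => None,
    }
}
```

Under these assumptions, `backend_for(Path::new("report.pdf"))` selects the Pdfium variant, and a path with no extension yields `None`, which a caller could turn into an "unsupported format" error.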

§Quick start

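use mdkit::Engine;
use std::path::Path;

// Build an engine with the default extractors for the enabled feature flags.
let engine = Engine::with_defaults();
// Dispatch on the ".pdf" extension; `?` propagates any extraction error.
let doc = engine.extract(Path::new("report.pdf"))?;
println!("{}", doc.markdown);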

§Custom extractor

Implement Extractor for your own format and register it on an Engine:

use mdkit::{Document, Engine, Extractor, Result};
use std::path::Path;

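struct MyParser;

impl Extractor for MyParser {
    // File extensions this extractor handles.
    fn extensions(&self) -> &[&'static str] { &["custom"] }
    // Treat the file's text content as markdown, unchanged.
    fn extract(&self, path: &Path) -> Result<Document> {
        Ok(Document::new(std::fs::read_to_string(path)?))
    }
}

// Start from an empty engine and register only the custom extractor.
let mut engine = Engine::new();
engine.register(Box::new(MyParser));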

Modules§

calamine
Spreadsheet text extraction via calamine.
csv
CSV / TSV extraction via the csv crate.
html
HTML extraction via html2md.
pdf
PDF text extraction via Google’s Pdfium engine.

Structs§

Document
The result of extracting one document. Markdown is always present; title and metadata are best-effort and may be empty depending on the backend.
Engine
Dispatches extract calls to the registered Extractor for the file’s extension. Construct with Engine::new for an empty engine, or Engine::with_defaults to populate the defaults that match enabled feature flags.
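
The relationship between an empty engine, a pre-populated one, and a best-effort document can be sketched without mdkit as follows. Every name here (`Doc`, `Backend`, `PlainText`, `Registry`) is a hypothetical stand-in, the backend works on in-memory text rather than a file path, and the "title from a leading `# ` heading" rule is an assumption for illustration only. This is not how the crate itself is implemented.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `Document`: markdown is always present,
/// the title is best-effort and may be absent.
struct Doc {
    markdown: String,
    title: Option<String>,
}

/// Hypothetical stand-in for `Extractor`, taking text instead of a path.
trait Backend {
    fn extensions(&self) -> &[&'static str];
    fn extract(&self, input: &str) -> Doc;
}

struct PlainText;

impl Backend for PlainText {
    fn extensions(&self) -> &[&'static str] {
        &["txt"]
    }
    fn extract(&self, input: &str) -> Doc {
        // Best-effort title: only set when the first line is a "# " heading.
        let title = input
            .lines()
            .next()
            .and_then(|line| line.strip_prefix("# "))
            .map(str::to_string);
        Doc { markdown: input.to_string(), title }
    }
}

/// Hypothetical stand-in for `Engine`: maps extensions to backends.
struct Registry {
    by_ext: HashMap<&'static str, usize>,
    backends: Vec<Box<dyn Backend>>,
}

impl Registry {
    /// Empty registry: every lookup fails until something is registered.
    fn new() -> Self {
        Registry { by_ext: HashMap::new(), backends: Vec::new() }
    }

    /// Registry with built-ins. A real crate would typically gate each
    /// built-in behind `#[cfg(feature = "...")]`; omitted here for brevity.
    fn with_defaults() -> Self {
        let mut registry = Self::new();
        registry.register(Box::new(PlainText));
        registry
    }

    /// Later registrations win when two backends claim the same extension.
    fn register(&mut self, backend: Box<dyn Backend>) {
        let index = self.backends.len();
        for ext in backend.extensions() {
            self.by_ext.insert(*ext, index);
        }
        self.backends.push(backend);
    }

    /// Returns `None` when no backend is registered for `ext`.
    fn extract(&self, ext: &str, input: &str) -> Option<Doc> {
        let index = *self.by_ext.get(ext)?;
        Some(self.backends[index].extract(input))
    }
}
```

In this sketch, `Registry::new().extract("txt", "x")` returns `None`, while `Registry::with_defaults()` handles `txt` input and fills `title` only when the text opens with a heading. Callers should therefore read `markdown` unconditionally and treat `title` as optional.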

Enums§

Error
Errors that can arise during extraction.

Traits§

Extractor
A backend that knows how to convert one or more file formats to markdown. Implementors register themselves with an Engine.

Type Aliases§

Result
Result alias used across the crate.