§mdkit — get markdown out of any document.
See the README for the full
design rationale. The short version: dispatch by file extension to
the best backend per format, with Pandoc for DOCX/PPTX/EPUB/RTF/ODT/LaTeX,
Pdfium for PDF, OS-native APIs for OCR, and calamine for spreadsheets.
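The dispatch idea can be sketched in a few lines of standalone Rust. This is only an illustration of extension-based lookup, not mdkit's actual internals: `Backend`, `Registry`, `backend_for`, and `default_registry` are hypothetical names, and the case-insensitive extension matching is an assumption.

```rust
use std::collections::HashMap;
use std::path::Path;

// Hypothetical stand-in for a backend; this is NOT mdkit's `Extractor` trait.
trait Backend {
    fn name(&self) -> &'static str;
}

struct Pandoc;
struct Pdfium;

impl Backend for Pandoc {
    fn name(&self) -> &'static str { "pandoc" }
}

impl Backend for Pdfium {
    fn name(&self) -> &'static str { "pdfium" }
}

// Maps a lowercase file extension to the backend that handles it.
struct Registry {
    by_ext: HashMap<String, Box<dyn Backend>>,
}

impl Registry {
    fn new() -> Self {
        Registry { by_ext: HashMap::new() }
    }

    fn register(&mut self, ext: &str, backend: Box<dyn Backend>) {
        // Assumption: extensions are matched case-insensitively.
        self.by_ext.insert(ext.to_ascii_lowercase(), backend);
    }

    // Returns the name of the backend registered for the path's extension,
    // or None if the path has no extension or no backend is registered.
    fn backend_for(&self, path: &Path) -> Option<&'static str> {
        let ext = path.extension()?.to_str()?.to_ascii_lowercase();
        self.by_ext.get(&ext).map(|b| b.name())
    }
}

// Hypothetical default set, covering only a few of the formats named above.
fn default_registry() -> Registry {
    let mut reg = Registry::new();
    for ext in ["docx", "pptx", "epub", "rtf", "odt"] {
        reg.register(ext, Box::new(Pandoc));
    }
    reg.register("pdf", Box::new(Pdfium));
    reg
}
```

With this sketch, `default_registry().backend_for(Path::new("report.pdf"))` selects the Pdfium stand-in, while a path with an unregistered or missing extension yields `None`, which a real engine would presumably surface as an error.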
§Quick start
use mdkit::Engine;
use std::path::Path;
let engine = Engine::with_defaults();
let doc = engine.extract(Path::new("report.pdf"))?;
println!("{}", doc.markdown);
§Custom extractor
Implement Extractor for your own format and register it on an
Engine:
use mdkit::{Document, Engine, Extractor, Result};
use std::path::Path;
struct MyParser;
impl Extractor for MyParser {
    fn extensions(&self) -> &[&'static str] { &["custom"] }
    fn extract(&self, path: &Path) -> Result<Document> {
        Ok(Document::new(std::fs::read_to_string(path)?))
    }
}
let mut engine = Engine::new();
engine.register(Box::new(MyParser));
Structs§
- Document
- The result of extracting one document. Markdown is always present; title and metadata are best-effort and may be empty depending on the backend.
- Engine
- Dispatches extract calls to the registered Extractor for the file’s extension. Construct with Engine::new for an empty engine, or Engine::with_defaults to populate the defaults that match enabled feature flags.
Enums§
- Error
- Errors that can arise during extraction.
Traits§
- Extractor
- A backend that knows how to convert one or more file formats to
markdown. Implementors register themselves with an
Engine.
Type Aliases§
- Result
- Result alias used across the crate.