# Changelog
All notable changes to mdkit are documented here. The format follows
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and mdkit
adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
mdkit is pre-1.0 — the public API surface (`Extractor`, `Engine`,
`Document`, `Error`) is intended to stay stable, but minor versions
may introduce additive changes to backends, feature flags, and
auxiliary types until 1.0 lands.
## [Unreleased]
## [0.7.4] — 2026-04-27
### Added
- **`examples/batch.rs`** — non-recursive folder → `.md` files
batch converter. Silently skips files mdkit can't extract
(no noisy log per non-document file), reports a final
`N extracted, M skipped, K failed` summary, exits non-zero
on any failure for CI-friendly use. Pair with
[`scankit`](https://crates.io/crates/scankit) for recursive
walks with exclude-glob support.
- **`examples/custom_extractor.rs`** — implements `Extractor`
for a deliberately silly ROT13 format to show the trait shape
without distracting from the registration pattern. Documents
the four mechanics that ARE realistic (claim extensions,
implement `extract`, optionally implement `extract_bytes`,
register with `Engine::register` — order matters for
overlapping extensions).
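As a rough illustration of the trait shape described above, here is a minimal std-only sketch. The `Document` struct and `Extractor` trait below are local stand-ins written for this example (mdkit's real signatures and error type may differ), and `Rot13Extractor` is a hypothetical name, not a copy of `examples/custom_extractor.rs`.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::path::Path;

/// Local stand-in for mdkit's `Document` (assumed shape for this sketch).
#[derive(Debug, Default)]
pub struct Document {
    pub markdown: String,
    pub title: Option<String>,
    pub metadata: HashMap<String, String>,
}

/// Local stand-in for mdkit's `Extractor` trait; real signatures may differ.
pub trait Extractor {
    fn extensions(&self) -> &[&'static str];
    fn name(&self) -> &'static str;
    fn extract(&self, path: &Path) -> io::Result<Document>;
    /// Optional in-memory path; the default reports it as unsupported.
    fn extract_bytes(&self, _bytes: &[u8], _ext: &str) -> io::Result<Document> {
        Err(io::Error::new(io::ErrorKind::Unsupported, "bytes input not supported"))
    }
}

/// Rotate ASCII letters by 13 places; everything else passes through.
fn rot13(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'a'..='z' => (((c as u8 - b'a' + 13) % 26) + b'a') as char,
            'A'..='Z' => (((c as u8 - b'A' + 13) % 26) + b'A') as char,
            _ => c,
        })
        .collect()
}

pub struct Rot13Extractor;

impl Extractor for Rot13Extractor {
    // 1. Claim extensions.
    fn extensions(&self) -> &[&'static str] {
        &["rot13"]
    }
    fn name(&self) -> &'static str {
        "rot13"
    }
    // 2. Implement the file-path entry point.
    fn extract(&self, path: &Path) -> io::Result<Document> {
        let bytes = fs::read(path)?;
        self.extract_bytes(&bytes, "rot13")
    }
    // 3. Optionally implement the in-memory entry point.
    fn extract_bytes(&self, bytes: &[u8], _ext: &str) -> io::Result<Document> {
        let text = String::from_utf8_lossy(bytes);
        Ok(Document { markdown: rot13(&text), ..Document::default() })
    }
}
```

The fourth mechanic, registration, would go through `Engine::register` in the real crate; it is left out here because the sketch does not depend on mdkit.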
### Notes
- Two examples per release line up with the v0.7.x cadence —
small, additive, no API changes. Future v0.7.5+ might add
a streaming-extraction example once we have a real use case
for it.
## [0.7.3] — 2026-04-27
### Added
- **`examples/extract.rs`** — runnable CLI that takes a path and
prints the extracted markdown. Calls
`Engine::with_defaults_diagnostic` and surfaces any backend
registration failures (libpdfium not on the search path,
pandoc not in PATH, etc.) so missing runtime deps are debuggable
without reading mdkit source. Run with:
```bash
cargo run --example extract -- /path/to/document.pdf
cargo run --example extract --features pandoc -- /path/to/report.docx
cargo run --example extract --features ocr-platform -- /path/to/scan.pdf
```
Document-level metadata (title, OCR `extractor_chain`, etc.)
surfaces as HTML comment headers above the markdown body so
callers can see exactly which backend produced the output.
- README "Examples" section pointing at the new `examples/`
directory.
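To make the "HTML comment headers" idea concrete, the following is a small sketch of how metadata could be printed above a markdown body. The exact `<!-- key: value -->` layout and the `with_comment_headers` helper are assumptions made for illustration; the changelog does not specify the format `examples/extract.rs` actually emits.

```rust
use std::collections::BTreeMap;

/// Keep a stray `-->` inside a value from closing the comment early.
fn sanitize(value: &str) -> String {
    value.replace("-->", "-- >")
}

/// Render the title and metadata as HTML comments above the markdown body.
/// The `<!-- key: value -->` layout is an assumption for this sketch.
pub fn with_comment_headers(
    title: Option<&str>,
    metadata: &BTreeMap<String, String>,
    body: &str,
) -> String {
    let mut out = String::new();
    if let Some(t) = title {
        out.push_str(&format!("<!-- title: {} -->\n", sanitize(t)));
    }
    // BTreeMap iterates in key order, so the header block is deterministic.
    for (key, value) in metadata {
        out.push_str(&format!("<!-- {}: {} -->\n", key, sanitize(value)));
    }
    if !out.is_empty() {
        out.push('\n'); // blank line between headers and body
    }
    out.push_str(body);
    out
}
```

Because HTML comments are ignored by markdown renderers, output in this style stays readable as plain markdown while still showing which backend produced it.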
### Notes
- v0.7.3 is the first of v0.7.x's planned "examples + cookbook"
iteration. Future releases will add a `batch.rs` example
(folder of files → directory of markdown files) and a
`custom_extractor.rs` example (showing how to implement the
`Extractor` trait for a new format).
## [0.7.2] — 2026-04-27
### Added
- **`IpynbExtractor`** — Jupyter notebook (`.ipynb`) extraction.
Pure-Rust JSON parse via `serde_json`; no external runtime
dependency. Walks notebook cells in order:
- `markdown` cells → emitted verbatim
- `code` cells → wrapped in fenced code blocks; language hint
pulled from `metadata.kernelspec.language` (preferred) or
`metadata.language_info.name` (fallback)
- `raw` cells → emitted verbatim per spec §"Raw NBConvert cells"
- Unknown cell types → emitted as opaque text (forward-compat
for future nbformat versions)
- Whitespace-only / empty cells → skipped
- New `ipynb` feature flag (in the default set — pure-Rust, no
binary deps, ~50 lines + serde_json transitive).
- `Document.metadata` keys when extracting notebooks:
`kernel_language` (e.g. `"python"`), `kernel_display_name`
(e.g. `"Python 3"`). `Document.title` populated from
`metadata.title` when set.
- Cell `source` field handles both shapes the spec allows: a single
string, OR an array of strings (one per line, joined with no
separator since each line keeps its trailing `\n` — the
diff-friendly on-disk form).
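The cell-walking rules above can be sketched without any JSON dependency by modelling the two `source` shapes as an enum. This is an illustrative reconstruction, not `IpynbExtractor`'s code: the `Source`, `Cell`, and `render_cells` names are hypothetical, and trimming trailing whitespace and joining blocks with a blank line are assumptions.

```rust
/// The two on-disk shapes of a cell's `source` field.
pub enum Source {
    Single(String),
    Lines(Vec<String>),
}

impl Source {
    /// Join with no separator: each array entry keeps its own trailing `\n`.
    pub fn text(&self) -> String {
        match self {
            Source::Single(s) => s.clone(),
            Source::Lines(lines) => lines.concat(),
        }
    }
}

pub struct Cell {
    pub cell_type: String,
    pub source: Source,
}

/// Render cells in order; `language` is the fence hint for code cells.
pub fn render_cells(cells: &[Cell], language: &str) -> String {
    let fence = "`".repeat(3); // built at runtime to keep this listing fence-safe
    let mut blocks = Vec::new();
    for cell in cells {
        let text = cell.source.text();
        if text.trim().is_empty() {
            continue; // whitespace-only and empty cells are skipped
        }
        let block = match cell.cell_type.as_str() {
            "code" => format!("{f}{language}\n{}\n{f}", text.trim_end(), f = fence),
            // markdown, raw, and unknown cell types are emitted as plain text
            _ => text.trim_end().to_string(),
        };
        blocks.push(block);
    }
    blocks.join("\n\n")
}
```

Note that outputs never enter the picture: only `cell_type` and `source` are read, which matches the decision to leave cell outputs out of the markdown body.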
### Notes
- Cell **outputs** are intentionally NOT included in the markdown
body. They're typically large (image data URLs, repr blobs) and
not what callers indexing notebooks for search / RAG want. A
future "rich extraction" trait could expose them.
- Closes the last format-coverage gap that Sery Link's integration
needed (DOCX, PPTX, PDF, HTML, IPYNB) — v0.7.2 unblocks the
Python-sidecar replacement work.
## [0.7.1] — 2026-04-27
### Changed
- **Repo moved from `mdkit-project/mdkit` to `seryai/mdkit`** on
GitHub. Old URLs auto-redirect (web + git), so existing clones
and links keep working — but `Cargo.toml` `homepage` /
`repository`, the README issues link, the SECURITY.md advisory
link, the `src/lib.rs` README pointer, and every `[Unreleased]` /
per-version compare URL in this CHANGELOG now point at the new
canonical location. This is a metadata-only release; no code or
API changes.
### Why the move
mdkit is the document-extraction layer of [Sery Link][sery]; the
project is consolidating under the `seryai` GitHub organisation
alongside the rest of the Sery codebase. The vendor-neutral
`mdkit-project` org served its purpose for the v0.1–v0.7.0
bootstrap; v0.7.1 onwards lives at `seryai/mdkit`.
### Migration
For most callers: bump the dep, rebuild, ship — GitHub redirects
handle all old URLs transparently. If you've cached an old git
remote URL, update it:
`git remote set-url origin git@github.com:seryai/mdkit.git`
(the redirect works but explicit is cleaner).
[sery]: https://sery.ai
## [0.7.0] — 2026-04-27
### API stability candidate (1.0 prep)
v0.7 is the **API stability candidate** for 1.0. The format-coverage
roadmap closed in v0.6 — every major desktop format and OCR backend
ships. v0.7 freezes the public surface ahead of 1.0 and locks in
SemVer commitments. v0.7.x can iterate on examples, docs polish, and
niche backend additions without changing the public API shape.
### Added
- **Stability section in `lib.rs` module docs** explicitly enumerates
what's covered by the API freeze (Extractor trait, Engine
dispatch surface, Document fields, Error semantics, feature flag
names, backend `name()` strings) and what stays implementation
detail (private extractor layout, exact metadata key sets,
registration order, sidecar / FFI internals).
- `#[non_exhaustive]` on [`Error`] — pre-1.0 we may add variants
(e.g. `EncryptedDocument` when password-protected PDFs land), and
this attribute lets us do that without a major version bump.
**Pattern-matchers must include a wildcard arm.**
- `#[non_exhaustive]` on [`Document`] — same forward-compat
rationale. Future fields (page count, language detection,
confidence, structured metadata) can land in minor versions
without breaking downstream struct-literal construction.
Construct via [`Document::new`] and then mutate fields; public
field access still works.
- `#[must_use]` annotations on `Document::new`, `Engine::new`, and
`Engine::with_defaults` — catches the "constructed but never used"
bug at compile time. Most existing extractor constructors already
had `#[must_use]`; this completes the audit.
### Changed
- No API-shape changes. v0.7.0 is intentionally an attribute-only
release — it's safe to bump from v0.6.x with no code changes
required, except where downstream code constructs `Document` /
matches on `Error` via struct/enum literals (in which case the
compiler will guide the migration).
### Migration from v0.6.x
For most callers: bump the dependency, rebuild, ship. No code
changes needed.
For callers constructing `Document` via struct literal in another
crate:
```rust
// Before
Document {
markdown: "...".into(),
title: None,
metadata: HashMap::new(),
}
// After
let mut doc = Document::new("...");
doc.title = None; // optional — these are the defaults
doc.metadata.clear(); // optional
```
For callers exhaustively matching on `Error`:
```rust
// Before
match err {
Error::Io(_) => ...,
Error::ParseError(_) => ...,
// ... every variant
}
// After
match err {
Error::Io(_) => ...,
Error::ParseError(_) => ...,
// ... every variant
_ => panic!("new mdkit error variant — check the changelog"),
}
```
### Notes
- v0.7.x will iterate on **examples** (`examples/` directory),
**cookbook**-style docs ("how to register a custom extractor",
"extending the Engine"), and any **niche backend polish** that
doesn't change the public surface (Pandoc `--server` mode,
Windows OCR auto-downscale via `BitmapTransform`).
- 1.0 will be cut once the API is exercised by at least one
downstream production user (Sery Link is the canonical
integration target).
## [0.6.0] — 2026-04-27
### Added
- **`OnnxOcrExtractor`** — cross-platform OCR via ONNX-runtime
PaddleOCR models, backed by the `oar-ocr` crate (which wraps
PaddleOCR's detection + recognition ONNX exports through `ort`).
Works on Linux, macOS, Windows, and WebAssembly — making this the
recommended OCR backend on Linux, where the `ocr-platform` feature
has no native engine to offer. Closes the last format-coverage gap
on the v0.x roadmap.
- New `ocr-onnx` feature flag — adds `oar-ocr` (default-features
off) and `image` as direct deps. The feature gate makes
`OnnxOcrExtractor` available; runtime model setup is the caller's
responsibility (see "Model setup" below).
- New `ocr-onnx-download` feature flag — opt-in convenience that
enables `oar-ocr/download-binaries`. With it, `oar-ocr` fetches
the ONNX Runtime native library at build / first-use. Without it,
consumers ship their own `libonnxruntime` (system package or
alongside the binary), same shape as the libpdfium runtime
requirement for the `pdf` feature.
- New public API:
- `OnnxOcrExtractor::with_models(detection, recognition, dict)`
— construct from caller-provided ONNX model paths. Returns
`Error::MissingDependency` for missing files,
`Error::ParseError` for corrupt models or libonnxruntime
issues.
- `OnnxOcrExtractor::detection_model_path() -> &Path` — diagnostic
accessor for the detection-model path the extractor was built
with.
### Changed
- `[features] full` now includes `ocr-onnx-download` instead of the
prior `ocr-onnx` placeholder, so `cargo test --all-features` (the
CI matrix) actually exercises the ONNX path with a real
libonnxruntime.
### Notes
- **Model setup.** `OnnxOcrExtractor::with_models` requires three
files supplied by the caller — download from
<https://github.com/GreatV/oar-ocr/releases>. For English-only
recognition: `pp-ocrv5_mobile_det.onnx` (~4.6 MB) +
`en_pp-ocrv5_mobile_rec.onnx` (~7.5 MB) +
`ppocrv5_en_dict.txt`. Other languages: swap the recognition
model + dict. Detection is language-independent.
- **Why no auto-registration in `Engine::with_defaults`.**
Construction requires model paths, and there's no portable default
location to look. Callers wire it in explicitly:
```rust
let ocr = OnnxOcrExtractor::with_models(det, rec, dict)?;
let mut engine = Engine::with_defaults();
engine.register(Box::new(ocr));
```
- **Why the Linux + cross-platform framing.** macOS and Windows
already get a higher-quality, zero-setup OCR via `ocr-platform`
(Vision / `Windows.Media.Ocr`). `OnnxOcrExtractor` is most
valuable on Linux, but registering it on macOS / Windows works
fine too — useful when you want consistent OCR output across
platforms or need a language pack the native engines don't ship.
### v0.6 milestone
This release closes the v0.6 roadmap line ("`ocr-onnx` feature"). The
v0.x milestones for the trait-stable engine + dispatch surface are
now all shipped: PDF (Pdfium), Pandoc formats
(DOCX/PPTX/EPUB/RTF/ODT/LaTeX), spreadsheets (calamine), CSV/TSV, HTML (html2md or Pandoc),
plus OCR backends for every major desktop platform. v0.7 will be a
docs / API audit pass, then 1.0.
## [0.5.6] — 2026-04-27
### Added
- **`extract_bytes` now engages the OCR fallback** for scanned and
mixed-content PDFs. v0.5.0–v0.5.5 documented this gap explicitly
("OCR fallback isn't wired for the bytes path — left for a future
release if real callers ask"). v0.5.6 closes it: when
`PdfiumExtractor` has an OCR fallback configured, `extract_bytes`
spools the bytes to a tempfile and routes through the file-path
`extract`, picking up the per-page OCR logic for free. Cost is one
disk write + read of the PDF bytes (sub-ms for typical sizes),
traded for code simplicity.
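The spool-and-delegate approach described above can be sketched with the standard library alone. This is only an illustration of the pattern: `via_spool` and `SpoolGuard` are hypothetical names, and the temp-file naming scheme is an assumption (the crate itself depends on `tempfile`, which handles naming and cleanup more robustly).

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::sync::atomic::{AtomicU64, Ordering};

static SPOOL_COUNTER: AtomicU64 = AtomicU64::new(0);

/// Removes the spooled file when dropped, including on early return.
struct SpoolGuard(PathBuf);

impl Drop for SpoolGuard {
    fn drop(&mut self) {
        let _ = fs::remove_file(&self.0);
    }
}

/// Write `bytes` to a uniquely named temp file, run the path-based
/// function `f` on it, then clean up.
pub fn via_spool<T>(
    bytes: &[u8],
    ext: &str,
    f: impl FnOnce(&Path) -> io::Result<T>,
) -> io::Result<T> {
    let n = SPOOL_COUNTER.fetch_add(1, Ordering::Relaxed);
    let name = format!("spool-{}-{}.{}", std::process::id(), n, ext);
    let guard = SpoolGuard(std::env::temp_dir().join(name));
    fs::write(&guard.0, bytes)?;
    f(&guard.0) // guard is dropped after `f` returns, deleting the file
}
```

The appeal of this shape is that the bytes path reuses whatever the file-path function already does, at the cost of one extra write and read.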
### Changed
- **Behavior change:** `extract_bytes` on a scanned PDF previously
returned empty markdown silently. With both `pdf` and
`ocr-platform` features enabled (which `Engine::with_defaults`
wires together), it now returns OCR'd markdown — matching the
file-path `extract` contract introduced in v0.5.3 / refined in
v0.5.5. Pure text-only PDFs over the bytes path keep the original
in-memory fast path with no disk roundtrip.
### Notes
- A future optimization could add an in-memory `render_pages_subset`
that takes a pre-loaded `PdfDocument` (skipping the spool) — left
for now since the spool overhead is sub-ms even for large PDFs and
the simplicity-of-implementation win is real.
- This is the last v0.5.x feature ship targeted before either v0.6
(Linux ONNX OCR) or v0.7 (audit + 1.0 candidate). Remaining v0.5.x
candidates are increasingly niche: Pandoc `--server` mode (batch
performance), Windows OCR auto-downscale via `BitmapTransform`
(large-image edge case).
## [0.5.5] — 2026-04-27
### Added
- **Mixed-content PDF OCR.** v0.5.0–v0.5.4 only triggered the OCR
fallback when the *whole* PDF returned empty text. Real-world PDFs
often mix text-bearing and scanned pages — a text body with a
scanned cover page, a mostly-scanned doc with a couple of typed
pages, etc. v0.5.5 switches to **per-page detection**: pages whose
pdfium text extraction comes back empty get rendered + OCR'd
individually, pages with text pass through unchanged.
- **`PdfiumExtractor::render_pages_subset_to_pngs(path, indices,
out_dir)`** — render only the listed page indices (0-based) to
PNGs, sorted+deduped internally so output order is predictable
ascending-by-page-number. Used by the per-page OCR path so we
don't burn render time on pages that already have clean text.
- New extracted-`Document` metadata key `pages_ocred` reports the
count of pages that went through OCR (1+ for mixed-content,
total-page-count for fully-scanned, absent for pure-text PDFs).
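A compact sketch of the per-page selection logic follows. It assumes each page's extracted text is already available as a string; the function names are hypothetical and the code is not taken from `PdfiumExtractor`.

```rust
/// Indices (0-based) of pages whose extracted text is empty after trimming.
pub fn pages_needing_ocr(page_texts: &[&str]) -> Vec<usize> {
    page_texts
        .iter()
        .enumerate()
        .filter(|(_, text)| text.trim().is_empty())
        .map(|(i, _)| i)
        .collect()
}

/// Sort ascending and drop duplicates so output order is predictable.
pub fn normalize_indices(mut indices: Vec<usize>) -> Vec<usize> {
    indices.sort_unstable();
    indices.dedup();
    indices
}

/// Value for a `pages_ocred`-style metadata key: absent when nothing was OCR'd.
pub fn pages_ocred(ocr_pages: &[usize]) -> Option<String> {
    if ocr_pages.is_empty() {
        None
    } else {
        Some(ocr_pages.len().to_string())
    }
}
```

For a four-page document whose first and third pages are scans, `pages_needing_ocr` yields `[0, 2]`, so only those two pages would be rendered and sent to OCR.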
### Changed
- **`## Page N` heading layout activation rule.** Whenever any page
went through OCR (mixed-content or fully-scanned), the output is
now per-page-headed so OCR'd pages are visually distinguishable
and downstream readers can cite by page. Pure text-only PDFs keep
the simpler blank-line-between-pages layout that v0.2–v0.5.4 used,
preserving backward compat for callers whose snapshots / tests pin
that shape.
- The internal `PdfiumExtractor::extract_via_ocr` helper from v0.5.3,
which folded the whole document into OCR, has been removed. Its behavior is now
subsumed by the new per-page logic in `extract` (a fully-scanned
PDF has all pages flagged empty and follows the same path).
Externally observable behavior is unchanged for fully-scanned PDFs.
### Notes
- New ignored test `mixed_content_pdf_ocrs_only_empty_pages`
validates the contract end-to-end. Run with:
`cargo test --features "pdf ocr-platform" -- --ignored
mixed_content_pdf_ocrs_only_empty_pages` after dropping a 2+ page
PDF with at least one text page and at least one scanned page at
`tests/fixtures/mixed-content.pdf`.
- The detection threshold remains `text.trim().is_empty()` per page.
Pages with even a single non-whitespace character are treated as
text-bearing — partial OCR-overlay (e.g. mixing pdfium text with
OCR'd text on the same page) is intentionally not done, since it
tends to produce duplicate or garbled output.
## [0.5.4] — 2026-04-27
### Added
- **PDF document metadata extraction** (deferred since v0.2).
`PdfiumExtractor` now reads the PDF document-information dictionary
(`/Title /Author /Subject /Keywords /Creator /Producer /CreationDate
/ModDate`, per ISO 32000-2 §14.3.3) and surfaces it on the returned
`Document`:
- `Document.title` is populated from `/Title` when present.
- `Document.metadata` carries the full set under stable lowercase
keys: `title`, `author`, `subject`, `keywords`, `creator`,
`producer`, `created_at`, `modified_at`. (`title` mirrors
`Document.title` so callers consuming metadata uniformly don't
have to special-case it.)
- Tags with empty values are skipped, so callers don't have to
distinguish "absent" from "present-but-empty."
- Date values are passed through verbatim — Pdfium hands us PDF-spec
date strings (e.g. `D:20240115120000Z`) and parsing them into
RFC 3339 stays out of scope for the extractor surface; downstream
code that cares can parse `metadata["created_at"]` itself.
- **Metadata + title survive the OCR fallback path.** Scanned PDFs
often have populated `/Info` dicts even when `/Contents` is
image-only — the v0.5.3 fallback dropped both. v0.5.4 merges
pdfium-extracted metadata + title into the OCR-result `Document`,
with `extractor_chain` and `pages_ocred` added on top.
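Since date parsing is left to downstream code, here is one way a caller might convert a value such as `metadata["created_at"]` to RFC 3339. This is a deliberately narrow sketch: `pdf_date_to_rfc3339` is a hypothetical helper, it handles only the full 14-digit form with an explicit zone, and it does not range-check the fields.

```rust
/// Convert a PDF date such as `D:20240115120000Z` to RFC 3339.
/// Handles only the full 14-digit form followed by `Z`, `+HH'mm'`, or
/// `-HH'mm'`; anything else returns `None` rather than guessing.
pub fn pdf_date_to_rfc3339(raw: &str) -> Option<String> {
    let s = raw.strip_prefix("D:").unwrap_or(raw);
    if s.len() < 15 || !s.is_ascii() {
        return None;
    }
    let (digits, zone) = s.split_at(14);
    if !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let offset = match zone.as_bytes()[0] {
        b'Z' => "Z".to_string(),
        sign @ (b'+' | b'-') => {
            // Zone looks like +HH'mm' (the trailing apostrophe is optional).
            let parts: Vec<&str> = zone[1..].split('\'').filter(|p| !p.is_empty()).collect();
            let hh = parts.first().copied()?;
            let mm = parts.get(1).copied().unwrap_or("00");
            if hh.len() != 2 || mm.len() != 2 {
                return None;
            }
            if !hh.bytes().chain(mm.bytes()).all(|b| b.is_ascii_digit()) {
                return None;
            }
            format!("{}{}:{}", sign as char, hh, mm)
        }
        _ => return None,
    };
    Some(format!(
        "{}-{}-{}T{}:{}:{}{}",
        &digits[0..4],
        &digits[4..6],
        &digits[6..8],
        &digits[8..10],
        &digits[10..12],
        &digits[12..14],
        offset
    ))
}
```

Real-world PDFs also contain truncated dates (year only, no zone), so production code would likely need more cases or a dedicated date library.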
### Notes
- New ignored test `surfaces_pdf_metadata_and_title` validates the
contract end-to-end. Run with:
`cargo test --features pdf -- --ignored
surfaces_pdf_metadata_and_title` after dropping a PDF with `/Title`
set into `tests/fixtures/with-metadata.pdf`.
- The metadata API uses pdfium-render 0.9's
`PdfDocument::metadata()` + `PdfDocumentMetadataTagType`. Dropping
this in for v0.5.4 (rather than back in v0.2) means we have a
single API to integrate against — the v0.2-era pdfium-render
metadata surface had churn that's since settled.
## [0.5.3] — 2026-04-27
### Added
- **Scanned-PDF → OCR composition.** `PdfiumExtractor` now takes an
optional OCR fallback at construction via
`with_ocr_fallback(Box<dyn Extractor>)`. When primary text
extraction yields empty markdown (the typical signature of an
image-only scanned PDF), each page is rendered to a temporary PNG
and routed through the fallback extractor. Per-page output is
joined with `## Page N` headings so downstream readers — humans and
LLMs — can cite by page. Fully closes the most-reported gap in
PdfiumExtractor's v0.2–v0.5.2 surface: scanned PDFs no longer
silently return empty markdown.
- `Engine::with_defaults` wires the platform OCR backend into
`PdfiumExtractor` automatically when both `pdf` and `ocr-platform`
features are enabled and the target OS has a native OCR engine
(macOS Vision in v0.5.0; Windows.Media.Ocr in v0.5.2). Two OCR
extractor instances are constructed: one stays in PdfiumExtractor's
fallback slot for PDFs, the other registers normally for
PNG/JPG/etc. — both are stateless so duplication is free.
- New public API on `PdfiumExtractor`:
- `with_ocr_fallback(Box<dyn Extractor>) -> Self` — install the
fallback (builder-style).
- `with_ocr_render_scale(f32) -> Self` — override the per-page
render scale (default 2.0, ~144 DPI). Higher values improve OCR
accuracy on small text but risk exceeding Windows
`MaxImageDimension` (~2600 px).
- `render_pages_to_pngs(path, out_dir) -> Result<Vec<PathBuf>>` —
render all pages of a PDF to PNG files in `out_dir`. Used
internally by the OCR-fallback path; exposed publicly so callers
building richer pipelines can reuse it.
- New extracted-document metadata keys when the OCR fallback fires:
`extractor_chain` (e.g. `"pdfium-render → vision-macos"`) and
`pages_ocred` (page count). Stable across the v0.5.x line.
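The relationship between render scale, DPI, and the Windows size cap can be worked through with a little arithmetic. The sketch below assumes page sizes are given in PDF points (72 per inch) and takes the ~2600 px cap from the note above; the helper names are hypothetical, and mdkit is not claimed to clamp the scale this way.

```rust
/// PDF user space is 72 points per inch, so scale 2.0 gives about 144 DPI.
pub fn scale_to_dpi(scale: f32) -> f32 {
    72.0 * scale
}

/// Approximate pixel size of a page rendered at `scale`.
pub fn rendered_size(width_pt: f32, height_pt: f32, scale: f32) -> (u32, u32) {
    (
        (width_pt * scale).round() as u32,
        (height_pt * scale).round() as u32,
    )
}

/// Largest scale not exceeding `requested` that keeps both sides within `max_px`.
pub fn clamp_scale(width_pt: f32, height_pt: f32, requested: f32, max_px: u32) -> f32 {
    let longest = width_pt.max(height_pt);
    if longest <= 0.0 {
        return requested;
    }
    requested.min(max_px as f32 / longest)
}
```

A US Letter page (612 × 792 pt) at the default scale of 2.0 renders to 1224 × 1584 px, comfortably under the cap, whereas scale 4.0 would give a 3168 px long side and exceed it.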
### Changed
- `pdfium-render` feature set now includes `image_latest` (in
addition to `thread_safe` and `pdfium_latest`). This pulls in the
`image` crate as a transitive dep for PNG encoding of rendered
pages — adds ~10 MB compiled to the `pdf` feature, which is the
acceptable tradeoff for closing the scanned-PDF gap. Callers that
only want raw text extraction (no OCR fallback) still get the
smaller v0.5.2 footprint by skipping the `pdf` feature, or by
constructing `PdfiumExtractor` without an OCR fallback.
- `tempfile = "3"` moves from dev-only to an optional regular dep
gated by the `pdf` feature, since the OCR-fallback path uses
`tempfile::tempdir` to spool rendered pages.
### Notes
- The `extract_bytes` path on `PdfiumExtractor` does NOT engage the
OCR fallback (it would need to spool the byte slice to a tempfile
first). The file-path API covers the dominant use case. If a real
caller needs bytes-to-OCR for scanned PDFs, open an issue.
- The mixed-content case (some text-bearing pages, some scanned
pages within the same PDF) is intentionally NOT handled by the
v0.5.3 fallback — pdfium returns the text-bearing pages, fallback
doesn't engage, scanned pages stay missing. Detecting and OCRing
individual pages within an otherwise text-bearing PDF is left to
a future release; the trigger today is whole-document
`markdown.trim().is_empty()`.
- `tests/fixtures/scanned.pdf` end-to-end test added (gated behind
`#[ignore]`) for local validation. To run on macOS:
`cargo test --features "pdf ocr-platform" -- --ignored
scanned_pdf_routes_through_ocr_fallback`.
## [0.5.2] — 2026-04-27
### Added
- **`WindowsOcrExtractor`** — Windows OCR via the `Windows.Media.Ocr`
API (the `windows` crate, Microsoft's official windows-rs binding).
Uses the user's installed profile languages where possible, falls
back to en-US if no profile language has an OCR pack installed, and
surfaces a clear typed error pointing the user at *Settings → Time
& Language → Language → Optional features → OCR* when no language
pack is OCR-capable.
- Handles standalone image files: PNG, JPG/JPEG, TIFF/TIF, BMP, GIF.
HEIC/HEIF are intentionally omitted — the Windows imaging stack
doesn't include them in the base OS, unlike macOS.
- Auto-registration in `Engine::with_defaults` when both the
`ocr-platform` feature is enabled and the target is Windows.
Construction is infallible; per-call init may still surface as a
`ParseError` (no installed OCR language, image too large, STA
thread, etc.).
- Windows OCR initialisation is per-thread MTA. The first `extract`
call on a thread runs `RoInitialize(MTA)`; if the thread is locked
into STA (typical UI/main threads), `extract` returns a typed
`ParseError` telling the caller to dispatch to a worker thread
(e.g. `tauri::async_runtime::spawn_blocking`).
- `OcrEngine::MaxImageDimension` is checked up-front. Images
exceeding the cap (~2600 px on shipping Windows) return
`ParseError` with a clear message rather than a deep WinRT error.
Auto-downscale via `BitmapTransform` is planned for a follow-up.
### Notes
- The `windows` crate (`0.62`) is target-conditional and only pulled
in on Windows. `--features ocr-platform` builds on macOS / Linux
succeed with no-op behavior on those platforms (Linux gets ONNX
via `ocr-onnx` in v0.6).
- README "platform-native OCR" line updated to reflect macOS + Windows
parity for v0.5.2.
- This release was developed on macOS without a Windows host —
Windows compile-and-test validation happens via CI
(`ubuntu-latest`, `macos-latest`, `windows-latest` matrix builds
with `cargo test --all-features`).
## [0.5.1] — 2026-04-27
### Fixed
- **`VisionOcrExtractor` returned empty markdown for valid images.**
v0.5.0 loaded each image via `NSImage::initWithContentsOfURL`, then
rasterized to a `CGImage` through
`CGImageForProposedRect_context_hints`, then handed the `CGImage` to
`VNImageRequestHandler::initWithCGImage_options`. The pipeline ran
without error, but Vision found zero text observations on every
input — the multi-step conversion was silently producing a CGImage
Vision couldn't read. v0.5.1 switches to
`VNImageRequestHandler::initWithURL_options`, which lets Vision
load the file directly via Image I/O. Confirmed end-to-end on a
PNG: Vision now returns the expected text at confidence 1.0.
### Changed
- Dropped the NSImage / CGImage extraction step from
`src/ocr_macos.rs`. The `objc2-app-kit` and `objc2-core-graphics`
crates are still pulled in by the `ocr-platform` feature for
forward compatibility, but the OCR backend itself no longer touches
either — the `VNImageRequestHandler::initWithURL_options` path goes
straight from filesystem URL to Vision request.
## [0.5.0] — 2026-04-27
### Added
- **`VisionOcrExtractor`** — macOS OCR via Apple's Vision framework
(`VNRecognizeTextRequest`). Neural-network-based, accelerated on
the Apple Neural Engine on Apple Silicon, handles handwriting and
mixed languages well, and ships free with every macOS install.
Gated by the `ocr-platform` feature; only present on macOS targets
(Windows + Linux are no-ops in v0.5; Windows lands in v0.5.x via
`Windows.Media.Ocr`, Linux in v0.6 via `ocr-onnx`).
- Handles standalone image files: PNG, JPG/JPEG, TIFF/TIF, BMP, GIF,
HEIC/HEIF.
- Auto-registration in `Engine::with_defaults` when both the
`ocr-platform` feature is enabled and the target is macOS.
- Output is one line of markdown per Vision text observation, in
reading order. Confidence scores and bounding boxes are not
surfaced today (the `Extractor` trait surface stays simple); a
future "rich extraction" trait could expose them.
- Runs inside an `autoreleasepool` so autoreleased Cocoa objects get
cleaned up promptly.
### Changed
- `[lints.rust] unsafe_code` downgraded from `forbid` to `deny`.
Backends with legitimate FFI needs (the macOS Vision module is
the first) can now opt in via per-module `#![allow(unsafe_code)]`
with a clear safety comment. Core dispatch and trait-only
backends remain unsafe-free.
- Scanned-PDF OCR is **not** wired in v0.5 — `PdfiumExtractor`
still returns empty markdown for image-only PDFs. A future
release will detect the empty-result case and route through OCR
automatically.
### Notes
- `objc2-vision` is already partially safe in v0.6 — most calls to
Vision APIs don't require `unsafe`. The remaining `unsafe` block
is for `CGImageForProposedRect_context_hints`, which takes raw
`*mut NSRect` for the optional out-rect parameter.
- `objc2-vision`, `objc2-app-kit`, `objc2-foundation`, and
`objc2-core-graphics` are pulled in only on macOS via a
target-specific dependency block, gated additionally by the
`ocr-platform` feature. Builds on Windows/Linux with
`--features ocr-platform` succeed but register no OCR extractor.
## [0.4.0] — 2026-04-27
### Added
- **`CalamineExtractor`** — XLSX, XLS, XLSB, XLSM, ODS spreadsheet
extraction via the [`calamine`](https://crates.io/crates/calamine)
crate (gated by the `calamine` feature). Each worksheet renders as
a markdown table preceded by an `## ` heading with the sheet name;
ragged rows pad/truncate to the header column count to keep the
table well-formed.
- **`CsvExtractor`** — CSV and TSV extraction via the
[`csv`](https://crates.io/crates/csv) crate (gated by the `csv`
feature). Auto-selects tab delimiter for `.tsv` files. First row
treated as the header row; data rendered as a markdown table.
- **`Html2mdExtractor`** — HTML and HTM extraction via the
[`html2md`](https://crates.io/crates/html2md) crate (gated by the
`html` feature). Lighter and faster than the Pandoc HTML reader;
registered before Pandoc in `Engine::with_defaults` so it wins for
HTML files when both features are enabled.
- All three extractors implement the new pattern: `Default + new()`
infallible constructors (no runtime dependency to verify), so they
register unconditionally in `Engine::with_defaults` when their
feature flag is on.
### Changed
- **Backend registration order in `Engine::with_defaults`** — cheap
in-process Rust backends (PDF, calamine, csv, html2md) register
before the Pandoc sidecar, so when format coverage overlaps (HTML
is the only one today, but future formats may too), the in-process
backend wins. Documented inline in `src/lib.rs`.
- README + roadmap reflect v0.4 ship.
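The "first registered backend wins" rule for overlapping extensions can be illustrated with a toy dispatcher. The `Backend` and `Engine` types here are simplified stand-ins for this sketch, the backend name strings are made up, and lowercasing the extension before matching is an assumption rather than documented mdkit behavior.

```rust
use std::path::Path;

/// Minimal stand-in for a backend: a name plus the extensions it claims.
pub struct Backend {
    pub name: &'static str,
    pub extensions: &'static [&'static str],
}

#[derive(Default)]
pub struct Engine {
    backends: Vec<Backend>,
}

impl Engine {
    /// Backends are kept in registration order.
    pub fn register(&mut self, backend: Backend) {
        self.backends.push(backend);
    }

    /// Return the first registered backend that claims the path's extension.
    pub fn resolve(&self, path: &Path) -> Option<&'static str> {
        let ext = path.extension()?.to_str()?.to_ascii_lowercase();
        self.backends
            .iter()
            .find(|b| b.extensions.iter().any(|e| *e == ext.as_str()))
            .map(|b| b.name)
    }
}
```

Registering the in-process HTML backend before the sidecar means an `.html` path resolves to the former, while `.docx` still falls through to the sidecar.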
### Notes
- Pipe characters (`|`) in spreadsheet/CSV cell values are escaped to
`\|` to keep markdown tables well-formed; embedded newlines
collapse to a single space for the same reason.
- The `csv` crate is referenced via `::csv::` in `src/csv.rs` to
disambiguate from the local module of the same name. Module name
matches the feature name for consistency with the other backends.
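The table-shaping rules mentioned in this release (escape pipes, flatten newlines, pad or truncate ragged rows to the header width) can be sketched as follows. The function names are hypothetical and the exact spacing of the emitted table is an assumption, not a copy of the `CalamineExtractor` or `CsvExtractor` output.

```rust
/// Escape a cell for a markdown table: pipes become `\|`, line breaks a space.
pub fn escape_cell(cell: &str) -> String {
    cell.replace('|', "\\|")
        .replace("\r\n", " ")
        .replace(['\n', '\r'], " ")
}

/// Pad or truncate `row` to exactly `width` escaped cells.
pub fn fit_row(row: &[String], width: usize) -> Vec<String> {
    let mut out: Vec<String> = row.iter().take(width).map(|c| escape_cell(c)).collect();
    out.resize(width, String::new());
    out
}

/// Render a header row plus data rows as a markdown table.
pub fn render_table(header: &[String], rows: &[Vec<String>]) -> String {
    let width = header.len();
    let line = |cells: Vec<String>| format!("| {} |", cells.join(" | "));
    let mut out = vec![
        line(fit_row(header, width)),
        format!("|{}", " --- |".repeat(width)),
    ];
    out.extend(rows.iter().map(|r| line(fit_row(r, width))));
    out.join("\n")
}
```

Sizing every row to the header's column count is what keeps the table well-formed even when the source data is ragged.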
## [0.3.0] — 2026-04-27
### Added
- **`PandocExtractor`** for DOCX, PPTX, EPUB, RTF, ODT, LaTeX (`tex`,
`latex`), and HTML (`html`, `htm`). Spawns the `pandoc` binary per
file with a stdin/stdout protocol; outputs GitHub-Flavored Markdown
(`gfm`). Gated by the `pandoc` feature.
- **`PandocExtractor::new`** locates `pandoc` on the system PATH and
verifies it responds to `--version` before declaring success.
- **`PandocExtractor::with_binary`** uses an explicit binary path —
preferred when shipping a static `pandoc` binary alongside your
application (Tauri / Iced / similar).
- **`PandocExtractor::pandoc_from`** (associated function) maps file
extensions to Pandoc reader names; exposed publicly so callers can
pre-check whether a given file is supported.
- Auto-registration in `Engine::with_defaults()` when the `pandoc`
feature is enabled. Falls back gracefully (logs via
`with_defaults_diagnostic`, silently skips otherwise) when the
pandoc binary isn't found.
- `CHANGELOG.md` (this file) — retroactive entries for v0.1.0 and
v0.2.0 included for completeness.
### Changed
- README roadmap reflects v0.3 ship.
### Notes
- No persistent server mode yet (each `extract` call spawns a fresh
`pandoc` process, ~50 ms cold-start). Pandoc's `--server` mode
amortizes that across calls; will land as an opt-in optimization
in a later release.
- No PDF input via Pandoc by design — Pandoc deliberately doesn't
read PDFs; mdkit's `Engine` dispatches PDF to the `pdf` backend
(`PdfiumExtractor`) automatically when both features are enabled.
## [0.2.0] — 2026-04-27
### Added
- **`PdfiumExtractor`** for PDF text extraction via Google's Pdfium
engine through the `pdfium-render` crate (gated by the `pdf`
feature). In-process, layout-aware, no sidecar.
- **`PdfiumExtractor::new`** binds to libpdfium on the system library
path; **`PdfiumExtractor::with_library_path`** binds from an
explicit directory (Tauri-style "ship libpdfium next to the
binary"). Both return `Error::MissingDependency` when libpdfium
isn't found.
- **`Engine::with_defaults_diagnostic`** — new method alongside
`Engine::with_defaults` that returns the engine plus a list of
`(backend_name, Error)` for each backend that failed to register.
Lets callers log "PDF support disabled: libpdfium not found"
rather than silently shipping a degraded experience.
- Auto-registration of the PDF extractor in `Engine::with_defaults()`
when libpdfium is available; engine still constructs successfully
when libpdfium is missing (the PDF extractor is just absent).
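The "return the engine plus a list of failures" pattern can be sketched in isolation. Everything below is a simplified stand-in written for this example: the `Error` enum has a single variant, backends are represented by their names only, and `with_diagnostic` is a hypothetical function rather than the real `Engine::with_defaults_diagnostic` signature.

```rust
/// Stand-in error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Error {
    MissingDependency(String),
}

/// Stand-in engine that only records which backends registered.
#[derive(Default)]
pub struct Engine {
    pub backends: Vec<&'static str>,
}

/// A fallible backend constructor; here it just returns the backend's name.
pub type Constructor = fn() -> Result<&'static str, Error>;

impl Engine {
    /// Try each constructor; keep the ones that succeed and report the rest.
    pub fn with_diagnostic(
        candidates: &[(&'static str, Constructor)],
    ) -> (Engine, Vec<(&'static str, Error)>) {
        let mut engine = Engine::default();
        let mut failures = Vec::new();
        for (name, construct) in candidates {
            match construct() {
                Ok(backend) => engine.backends.push(backend),
                Err(err) => failures.push((*name, err)),
            }
        }
        (engine, failures)
    }
}

/// Example constructors: one succeeds, one simulates a missing library.
pub fn csv_backend() -> Result<&'static str, Error> {
    Ok("csv")
}

pub fn pdf_backend() -> Result<&'static str, Error> {
    Err(Error::MissingDependency("libpdfium not found".into()))
}
```

The engine is always returned, so a caller can log each `(backend_name, Error)` pair and carry on with whatever backends did register.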
### Changed
- `pdfium-render` dependency configured with `default-features =
false` and only the `thread_safe` + `pdfium_latest` features —
avoids pulling in the `image` crate weight since mdkit doesn't
render PDF pages, only extracts text.
## [0.1.0] — 2026-04-27
### Added
- Initial release. Establishes the crate name on crates.io and the
public API surface that backends will target.
- **`Engine`** — the dispatcher. Routes `extract(path)` calls to the
registered `Extractor` for the file's extension.
- **`Extractor` trait** — the per-format integration point.
Implementors declare `extensions()`, `name()`, `extract(path)`,
and optionally `extract_bytes(bytes, ext)`.
- **`Document`** — the unit of output. `markdown` is always present;
`title` and `metadata` are best-effort and may be empty.
- **Typed `Error` enum** — `Io`, `UnsupportedFormat`,
`UnsupportedOperation`, `ParseError`, `MissingDependency`,
`SidecarFailure`, `Other`. Coarse-grained on purpose; backends
distinguished via `Extractor::name` when needed.
- **Feature flags pre-declared** (no-op placeholders so the public
feature surface is stable from v0.1): `pdf`, `pandoc`,
`ocr-platform`, `ocr-onnx`, `calamine`, `csv`, `html`, `full`,
`default = ["pdf", "calamine", "csv", "html"]`.
- Dual-licensed under MIT OR Apache-2.0 (Rust ecosystem convention).
- CI workflow on Ubuntu + macOS + Windows (stable Rust + MSRV 1.75
+ clippy + rustfmt + cargo-audit gates).
- `CONTRIBUTING.md`, `SECURITY.md` for repo hygiene.
[Unreleased]: https://github.com/seryai/mdkit/compare/v0.7.4...HEAD
[0.7.4]: https://github.com/seryai/mdkit/compare/v0.7.3...v0.7.4
[0.7.3]: https://github.com/seryai/mdkit/compare/v0.7.2...v0.7.3
[0.7.2]: https://github.com/seryai/mdkit/compare/v0.7.1...v0.7.2
[0.7.1]: https://github.com/seryai/mdkit/compare/v0.7.0...v0.7.1
[0.7.0]: https://github.com/seryai/mdkit/compare/v0.6.0...v0.7.0
[0.6.0]: https://github.com/seryai/mdkit/compare/v0.5.6...v0.6.0
[0.5.6]: https://github.com/seryai/mdkit/compare/v0.5.5...v0.5.6
[0.5.5]: https://github.com/seryai/mdkit/compare/v0.5.4...v0.5.5
[0.5.4]: https://github.com/seryai/mdkit/compare/v0.5.3...v0.5.4
[0.5.3]: https://github.com/seryai/mdkit/compare/v0.5.2...v0.5.3
[0.5.2]: https://github.com/seryai/mdkit/compare/v0.5.1...v0.5.2
[0.5.1]: https://github.com/seryai/mdkit/compare/v0.5.0...v0.5.1
[0.5.0]: https://github.com/seryai/mdkit/compare/v0.4.0...v0.5.0
[0.4.0]: https://github.com/seryai/mdkit/compare/v0.3.0...v0.4.0
[0.3.0]: https://github.com/seryai/mdkit/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/seryai/mdkit/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/seryai/mdkit/releases/tag/v0.1.0