pdf2md
PDF → Markdown extractor for Rust. Wraps pdfium-render with a layout-analysis pipeline that recovers tables, columns, headings, and figures from arbitrary PDF documents and emits clean Markdown.
What it does
Given a PDF on disk, pdf2md returns:
- Markdown with ATX headings, paragraphs, and inline image references
- Figures — embedded raster images and rasterized vector regions, each deduplicated by SHA-256 hash
- Heading depth — the maximum ATX level produced (0..=6)
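To make the heading-depth value concrete, here is a minimal sketch of how the maximum ATX level could be derived from the emitted Markdown. The function name `max_heading_depth` is hypothetical and not part of the crate's API. The sketch ignores fenced code blocks and other CommonMark edge cases.

```rust
/// Returns the deepest ATX heading level (1..=6) found in `markdown`,
/// or 0 when the text contains no headings.
/// Hypothetical helper for illustration; not the crate's implementation.
fn max_heading_depth(markdown: &str) -> u8 {
    let mut depth = 0u8;
    for line in markdown.lines() {
        let trimmed = line.trim_start();
        // Count the run of leading '#' characters.
        let hashes = trimmed.chars().take_while(|&c| c == '#').count();
        // '#' is one byte in UTF-8, so the count is also a valid byte offset.
        let rest = &trimmed[hashes..];
        // An ATX heading is 1 to 6 '#' followed by a space or end of line.
        let is_heading =
            (1..=6).contains(&hashes) && (rest.is_empty() || rest.starts_with(' '));
        if is_heading {
            depth = depth.max(hashes as u8);
        }
    }
    depth
}
```

A document with no headings yields 0, which is why the range starts at 0 rather than 1.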
The pipeline includes:
- Recursive XY-cut zone segmentation with table promotion
- Banner detection — strips repeating page headers, footers, and stray page-number digits
- Border/line detection — recovers table grids and clusters path segments into figure regions
- Heading classification from font size, weight, and italic flags
- Noise stripping for junk headings and stray glyphs
- Vector-figure rasterization at ~180 DPI for diagrams without embedded bitmaps
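The recursive XY-cut step can be illustrated with a small, self-contained sketch. It is a textbook version of the technique under stated assumptions, not the crate's actual code: `Rect`, `widest_gap`, and `xy_cut` are hypothetical names, coordinates are assumed to grow rightwards and downwards, `min_gap` is assumed positive, and table promotion is left out entirely.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

fn x_span(r: &Rect) -> (f32, f32) {
    (r.x0, r.x1)
}

fn y_span(r: &Rect) -> (f32, f32) {
    (r.y0, r.y1)
}

/// Finds the widest empty gap in the projection of `boxes` onto one axis.
/// Returns (gap_width, cut_position), or None if the projection has no gap.
fn widest_gap(boxes: &[Rect], span: fn(&Rect) -> (f32, f32)) -> Option<(f32, f32)> {
    let mut spans: Vec<(f32, f32)> = boxes.iter().map(span).collect();
    spans.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f32, f32)> = None;
    let mut covered_end = spans.first()?.1;
    for &(start, end) in &spans[1..] {
        let gap = start - covered_end;
        if gap > 0.0 && best.map_or(true, |(w, _)| gap > w) {
            // Cut in the middle of the empty band.
            best = Some((gap, covered_end + gap / 2.0));
        }
        covered_end = covered_end.max(end);
    }
    best
}

/// Recursively splits `boxes` at the widest gap (on either axis) that is at
/// least `min_gap` wide, and pushes each indivisible group onto `zones`.
fn xy_cut(boxes: Vec<Rect>, min_gap: f32, zones: &mut Vec<Vec<Rect>>) {
    if boxes.is_empty() {
        return;
    }
    let x = widest_gap(&boxes, x_span).filter(|&(w, _)| w >= min_gap);
    let y = widest_gap(&boxes, y_span).filter(|&(w, _)| w >= min_gap);
    // Pick the wider gap; on a tie, prefer the horizontal cut (along y).
    let cut = match (x, y) {
        (Some(xg), Some(yg)) if xg.0 > yg.0 => Some((true, xg.1)),
        (_, Some(yg)) => Some((false, yg.1)),
        (Some(xg), None) => Some((true, xg.1)),
        (None, None) => None,
    };
    match cut {
        None => zones.push(boxes),
        Some((vertical, pos)) => {
            let (first, second): (Vec<Rect>, Vec<Rect>) = boxes
                .into_iter()
                .partition(|r| if vertical { r.x1 <= pos } else { r.y1 <= pos });
            xy_cut(first, min_gap, zones);
            xy_cut(second, min_gap, zones);
        }
    }
}
```

For a page with a full-width title above two columns, this yields three zones in reading order: the title, the left column, then the right column. Line spacing inside a column stays below `min_gap`, so the columns are not split further.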
Requirements
libpdfium must be available at runtime. The crate uses dynamic binding via
pdfium-render's bind_to_system_library().
| OS | Install |
|---|---|
| Arch | yay -S pdfium-binaries |
| Debian / Ubuntu | sudo apt install libpdfium-dev |
| macOS | brew install pdfium |
| Windows | grab pdfium.dll from bblanchon/pdfium-binaries and place it next to your binary |
Minimum supported Rust version: 1.85 (edition 2024).
Library use
The default feature set enables cli, which pulls in clap for the
binary. Library-only consumers should disable default features:
[dependencies]
pdf2md = { version = "0.1", default-features = false }
Basic extraction:
use std::path::Path;
async
Custom image-directive emission (e.g. a custom Markdown extension):
use std::sync::Arc;
use ExtractConfig;
let cfg = ExtractConfig ;
let doc = extract_with_config.await?;
CLI
--figures-dir writes each extracted figure as <sha256>.png into the
target directory; without it, the CLI just reports the figure count to stderr.
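As an illustration of the `<sha256>.png` naming scheme, the sketch below builds the output path for one figure from its digest. The helper `figure_path` is hypothetical and not taken from the crate. It also assumes the digest is rendered as 64 lowercase hex characters, which the text above does not state.

```rust
use std::path::{Path, PathBuf};

/// Builds the output path `<dir>/<sha256>.png` for one figure.
/// Returns None unless `digest` is a 64-character lowercase hex string
/// (an assumption about how the hash is rendered).
fn figure_path(dir: &Path, digest: &str) -> Option<PathBuf> {
    let is_hex = digest.len() == 64
        && digest.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'));
    is_hex.then(|| dir.join(format!("{digest}.png")))
}
```

Because the file name is derived from the content hash, two identical figures map to the same path, which matches the deduplication described earlier.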
License
MIT — see LICENSE.