Deterministic PDF/DOCX parser for RAG / LLMs — one Rust core, with Python, Node & WASM bindings that produce byte-identical output.
pdfmuse is a precision pre-layer for AI/RAG: it extracts everything a file actually contains — text with exact coordinates, fonts, vector rules, tables, links — fast, robustly, and identically across every binding. It stops cleanly at the ML boundary: OCR and visual layout inference are left to a pluggable backend, so the core stays deterministic with zero ML dependencies. It is not another probabilistic vision model.
Why pdfmuse
| Property | What it provides |
|---|---|
| Complete | Keeps the finest-grained chars + coordinates; never silently drops content. |
| Fast | Zero-copy streaming Rust core with a custom O(1) object parser + content tokenizer and per-page parallelism. |
| Robust | A broken page/object never sinks the doc — returns structured errors, never panics (fuzz-tested). |
| Deterministic | Same input → same output. No probabilistic models, no time/RNG in the core path. |
| Consistent | Python / Node / WASM call one Rust core; output is byte-identical (CI-enforced). |
| CJK first-class | CID/Type0 fonts + CMap/ToUnicode in the main path; compatibility codepoints NFKC-normalized for clean search. |
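To make the NFKC row concrete, here is a minimal sketch of what compatibility normalization does to text, using Python's standard `unicodedata` module. It is an illustration of the Unicode operation only, not pdfmuse's internal code; the helper name `normalize_for_search` is hypothetical.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Fold Unicode compatibility characters (fullwidth forms, ligatures,
    enclosed CJK symbols) into their plain equivalents via NFKC."""
    return unicodedata.normalize("NFKC", text)


# Hypothetical sample strings: fullwidth Latin/digits, an "fi" ligature,
# and the enclosed "stock company" symbol common in Japanese documents.
samples = ["ＰＤＦ１２３", "ﬁle", "㈱"]
for sample in samples:
    print(repr(sample), "->", repr(normalize_for_search(sample)))
```

After normalization, a query typed as `PDF123` matches text that was stored in the file as fullwidth characters.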
Performance
Two things matter for a RAG pre-layer: speed, and whether it keeps the content. Both are measured on a public, reproducible corpus of 65 arXiv papers across 8 fields (large, dense PDFs, a deliberately hard case), so you can rerun the exact benchmark:
Text extraction (to_text, median of 7 runs after warm-up; PyMuPDF 1.28 / MuPDF 1.29, pdfplumber 0.11, macOS arm64, 65 papers):
| vs | speedup (geomean) | win rate | worst case |
|---|---|---|---|
| PyMuPDF | ~7.7× faster | 65 / 65 (100%) | still 2.5× faster |
| pdfplumber | ~150× faster | 65 / 65 (100%) | 69× |
pdfmuse is faster on every file in this corpus — including a 22 MB paper (9× faster) and a plot-heavy one that draws 18k marker glyphs. Content is preserved: median 100% of PyMuPDF's non-whitespace characters (n=65).
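The columns in the table above (geomean speedup, win rate, worst case) can be derived from per-file timings roughly as follows. This is a sketch of the arithmetic under the stated methodology (median of 7 runs after warm-up), not the project's actual benchmark harness; the function names and the timing values are hypothetical.

```python
import statistics
import time


def median_runtime(fn, runs=7, warmup=1):
    """Call fn `warmup` times untimed, then return the median of `runs` timed calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def summarize(baseline_times, candidate_times):
    """Per-file speedup = baseline / candidate; values above 1.0 mean the candidate is faster."""
    speedups = [b / c for b, c in zip(baseline_times, candidate_times)]
    return {
        "geomean": statistics.geometric_mean(speedups),
        "wins": sum(1 for s in speedups if s > 1.0),
        "files": len(speedups),
        "worst": min(speedups),
    }


# Hypothetical per-file medians in seconds (not measured data).
baseline = [0.080, 0.200, 0.050]
candidate = [0.010, 0.050, 0.020]
summary = summarize(baseline, candidate)
print(summary)
```

The geometric mean is used for ratios because it treats a 2× speedup and a 2× slowdown symmetrically, which an arithmetic mean does not.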
to_text() / to_markdown() return a string straight from the Rust core (no full-IR deserialization). The full parse(), which returns chars + bboxes + tables and so far more than text, costs only ~2.3× the to_text time, still under PyMuPDF on most files. The native Node binding is about as fast as the Rust core; WASM takes ~1.7× as long.
Honest limit — reading order: extraction is complete (100% of chars) and deterministic, but flattening a 2-D page to 1-D text is where the hard cases live. Single-column, tables, and clean two-column read correctly; dense two-column academic PDFs with very tight gutters can still interleave the columns (a known geometric edge — see docs/ / issue tracker). Eyeball any file with examples/visual_check.py.
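To illustrate why tight gutters are the hard case, the sketch below contrasts a naive top-to-bottom sort with a simple gutter-based column split. It is a toy model under stated assumptions (lines as boxes, one vertical gutter found as the widest uncovered horizontal gap) and is not pdfmuse's layout algorithm; every name in it is hypothetical.

```python
from typing import List, NamedTuple, Optional


class Line(NamedTuple):
    x0: float   # left edge
    x1: float   # right edge
    y: float    # top coordinate, increasing down the page
    text: str


def naive_order(lines: List[Line]) -> List[str]:
    """Sort by vertical position, then left edge. Interleaves side-by-side columns."""
    return [l.text for l in sorted(lines, key=lambda l: (l.y, l.x0))]


def find_gutter(lines: List[Line], min_width: float) -> Optional[float]:
    """Return the centre of the widest horizontal gap no line covers,
    or None if that gap is narrower than min_width."""
    if not lines:
        return None
    spans = sorted((l.x0, l.x1) for l in lines)
    best_width, best_mid = 0.0, None
    covered_to = spans[0][1]
    for x0, x1 in spans[1:]:
        gap = x0 - covered_to
        if gap > best_width:
            best_width, best_mid = gap, (covered_to + x0) / 2
        covered_to = max(covered_to, x1)
    return best_mid if best_width >= min_width else None


def column_order(lines: List[Line], min_gutter: float = 10.0) -> List[str]:
    """Read the left column fully, then the right, when a gutter is found."""
    gutter = find_gutter(lines, min_gutter)
    if gutter is None:
        return naive_order(lines)  # no reliable gutter: fall back
    left = [l for l in lines if (l.x0 + l.x1) / 2 < gutter]
    right = [l for l in lines if (l.x0 + l.x1) / 2 >= gutter]
    return naive_order(left) + naive_order(right)


# Hypothetical two-column page with a 20-unit gutter.
page = [
    Line(50, 280, 100, "L1"), Line(300, 530, 100, "R1"),
    Line(50, 280, 115, "L2"), Line(300, 530, 115, "R2"),
]
print(naive_order(page))
print(column_order(page))
print(column_order(page, min_gutter=25.0))
```

With the default threshold the columns are read one after the other; raising `min_gutter` above the actual gap makes the split undetectable and the output falls back to the interleaved order, which is the failure mode described above.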
Install
# Rust
# Python (abi3 wheels)
# Node
# WASM (browser)
Usage
CLI (debug/inspection):
Rust:
let data = read?;
let doc = parse?; // auto-detect PDF/DOCX
for page in &doc.pages
let md = to_markdown;
let chunks = chunk; // RAG chunks + {page, bbox, heading_path}
Python:
=
= # plain text — fast path (~1.3ms, no full-IR json.loads)
= # structured Markdown — headings (PDF & DOCX) + tables
= # full IR: doc.pages[i].chars/blocks with bboxes
= # strip running headers/footers
Node:
const = require;
const data = fs.;
const text = ; // plain text — fast path
const clean = ; // strip running headers/footers
const doc = ; // full IR (typed Document)
WASM (browser — digital PDFs; scanned pages return a NeedsOcr warning to hand off server-side):
import init from "@pdfmuse/core";
await ;
const text = ; // plain text
const doc = JSON.; // full IR
Integrations
- LangChain: `langchain-pdfmuse`, a `PdfmuseLoader` with `single` / `page` / `elements` modes. In `elements` mode each chunk carries section-aware metadata (`heading_path`, `bbox`, `category`), giving reproducible chunks for RAG.
- LlamaIndex: `llama-index-readers-pdfmuse`, a `PdfmuseReader` with the same modes and section-aware metadata.
- Haystack: `pdfmuse-haystack`, a `PdfmuseConverter` component (text/markdown) for Haystack 2.x pipelines.
Scope boundary
In the core (deterministic): text + coordinates/font/size/color · vector rules & rects · line/paragraph/column clustering · heading detection (font-size + numbering) · running header/footer detection + opt-in removal · ruled & whitespace-aligned table reconstruction · full DOCX structure · JSON / Markdown / RAG-chunk output.
Out of the core (pluggable VisionBackend): scanned-page OCR · borderless-table structure recognition · heading/body/caption classification. Text-less (scanned/image) pages are flagged NeedsOcr and left for a backend — see docs/adr/0001-pdf-engine-strategy.md.
Guarding this boundary is what keeps pdfmuse fast, stable, and distinct from vision models.
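As an example of the kind of rule that fits inside the deterministic core, running header/footer detection can be approximated with plain counting: a first or last line that repeats on most pages, once page numbers are masked, is treated as running text. The sketch below is an assumption-laden illustration of that idea, not pdfmuse's implementation; the function names and the `min_fraction` threshold are hypothetical.

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Mask digit runs so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def find_running_lines(pages, min_fraction=0.6):
    """pages: list of pages, each a list of text lines from top to bottom.
    Returns normalized first/last lines seen on at least min_fraction of pages."""
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set avoids double-counting when a page has a single line.
        counts.update({_normalize(lines[0]), _normalize(lines[-1])})
    threshold = max(2, min_fraction * len(pages))
    return {text for text, n in counts.items() if n >= threshold}


def strip_running_lines(pages, min_fraction=0.6):
    """Opt-in removal: drop only first/last lines identified as running text."""
    running = find_running_lines(pages, min_fraction)
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append([
            line for i, line in enumerate(lines)
            if not (i in (0, last) and _normalize(line) in running)
        ])
    return cleaned


# Hypothetical three-page document.
pages = [
    ["My Paper", "Intro text", "Page 1"],
    ["My Paper", "More text", "Page 2"],
    ["My Paper", "End text", "Page 3"],
]
print(strip_running_lines(pages))
```

No model and no randomness is involved, so the same pages always yield the same result, which is the property the boundary is meant to protect.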
Layout
crates/
pdfmuse-core/ pure-Rust core: PDF/DOCX → unified IR (parser, tokenizer, layout, output)
pdfmuse-python/ PyO3 (abi3) binding
pdfmuse-node/ napi-rs binding
pdfmuse-wasm/ wasm-bindgen binding
pdfmuse-cli/ debug CLI (`pdfmuse`)
tests/{corpus,snapshots} golden corpus + insta snapshots
tests/parity/ cross-binding byte-identical gate (Python == Node == WASM)
examples/visual_check.py render original ↔ coordinate reconstruction for QA
fuzz/ cargo-fuzz targets (never-panic)
Testing gates
- Snapshot tests (`insta` + `tests/corpus`)
- Cross-binding parity CI: Python/Node/WASM output byte-identical (a red gate blocks merge)
- Robustness: mutated/garbage input never panics (`tests/robustness.rs` + `fuzz/`)
- CJK correctness suite
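A byte-identical parity gate of the kind listed above can be reduced to comparing digests of each binding's serialized output. The following is a minimal sketch of that comparison, assuming the outputs have already been collected as bytes; it is not the contents of `tests/parity/`, and the function names and sample payloads are hypothetical.

```python
import hashlib


def digest(output: bytes) -> str:
    """SHA-256 hex digest of one binding's serialized output."""
    return hashlib.sha256(output).hexdigest()


def parity_report(outputs: dict) -> dict:
    """Group binding names by output digest. One group means all bindings agree."""
    groups = {}
    for name, data in outputs.items():
        groups.setdefault(digest(data), []).append(name)
    return groups


def is_byte_identical(outputs: dict) -> bool:
    return len(parity_report(outputs)) <= 1


# Hypothetical outputs; in a real gate these would come from each binding.
same = {"python": b'{"pages":[]}', "node": b'{"pages":[]}', "wasm": b'{"pages":[]}'}
drift = dict(same, wasm=b'{"pages":[]}\n')  # a single trailing newline differs

print(is_byte_identical(same))
print(is_byte_identical(drift))
```

Comparing raw bytes rather than parsed JSON is deliberate in this sketch: differences in key order, whitespace, or float formatting would pass a structural comparison but fail a byte-level one.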
Status
Core is feature-complete (milestones M0–M4 + real-world hardening M4.5): PDF + DOCX → unified IR → JSON / Markdown / RAG chunks, three byte-identical bindings, encryption, CJK. Currently in M5 · polish & release. Roadmap and tasks live in Linear (project pdfmuse).
License
Dual-licensed under MIT or Apache-2.0, at your option.