spdf
Fast, spatial PDF parsing in Rust.
Extract text with preserved columns, tables, and layout — plus optional OCR for scans, format conversion for Office docs, and a single self-contained binary.
Why spdf
Most PDF-to-text tools collapse whitespace, shuffle columns, and emit one giant line salad — fine for search indexing, useless for anything that cares about where things appear on the page (invoices, tax bills, property records, scientific tables, legal forms).
spdf keeps the geometry:
- Column-aware projection: tables, two-column layouts, sidebars, and indented blocks come back in reading order with their spatial structure intact.
- Faux-bold & shadow dedup: PDFs that "draw text twice" to simulate bold no longer produce "TaTax Infofo"; you get "Tax Info".
- Word reconstruction: PDFium-style per-glyph extraction is stitched back into words (1 8 3 6 → 1836) using a liteparse-compatible merge heuristic.
- QR / barcode / microprint filtering: the hundreds of tiny numeric glyphs that encode a QR code are auto-dropped so they don't destroy the surrounding table.
- Optional OCR: Tesseract locally, or any HTTP OCR server (PaddleOCR, EasyOCR, etc.) for image-only pages. One flag to turn it off when you know the PDF is born-digital.
- Format conversion: Office docs via LibreOffice shell-out, images via ImageMagick, all behind the same CLI.
- One static binary. Install PDFium once, ship spdf anywhere.
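To make the word-reconstruction idea concrete, here is a minimal sketch of gap-based glyph merging. It is not spdf's or LiteParse's actual heuristic: the Glyph struct, the merge_glyphs function, and the gap_ratio parameter are hypothetical names chosen for illustration, and the sketch assumes all glyphs sit on a single text line.

```rust
/// One extracted glyph on a single text line (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Glyph {
    pub ch: char,
    pub x0: f64, // left edge, in PDF points
    pub x1: f64, // right edge, in PDF points
}

/// Stitch per-glyph output back into words.
/// Adjacent glyphs stay in the same word while the horizontal gap between them
/// is at most `gap_ratio` times the wider of the two glyphs; a larger gap
/// starts a new word.
pub fn merge_glyphs(glyphs: &[Glyph], gap_ratio: f64) -> Vec<String> {
    // Sort left to right so extraction order does not matter.
    let mut sorted: Vec<&Glyph> = glyphs.iter().collect();
    sorted.sort_by(|a, b| a.x0.total_cmp(&b.x0));

    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev: Option<&Glyph> = None;

    for g in sorted {
        if let Some(p) = prev {
            let gap = g.x0 - p.x1;
            let width = (p.x1 - p.x0).max(g.x1 - g.x0);
            if gap > gap_ratio * width && !current.is_empty() {
                // Gap is too wide: close the current word.
                words.push(std::mem::take(&mut current));
            }
        }
        current.push(g.ch);
        prev = Some(g);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With a gap_ratio of about 0.3, four digit glyphs spaced half a point apart come back as one word, while a gap several points wide splits the line into separate words. A real implementation would also need to handle font-size changes, rotated text, and explicit space glyphs.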
Comparison
Benchmarked on two real-world U.S. county tax documents (TAX_APPEAL_MOM
set) with --no-ocr. Token-level F1 measured against the same documents
parsed through LiteParse using the
provided reference outputs in tests/parity/.
| Feature | spdf (this project) | LiteParse | pdftotext | pypdfium2 |
|---|---|---|---|---|
| Language | Rust | TypeScript | C++ | C++/Python |
| Single static binary | ✅ | ❌ (Node) | ✅ | ❌ |
| Column-aware text projection | ✅ | ✅ | partial | ❌ |
| Faux-bold shadow dedup | ✅ | ✅ | ❌ | ❌ |
| QR / microprint filter | ✅ | ✅ | ❌ | ❌ |
| OCR fallback (Tesseract + HTTP) | ✅ | ✅ | ❌ | ❌ |
| Office-format conversion | ✅ | ✅ | ❌ | ❌ |
| Batch mode | ✅ | ✅ | ❌ | ❌ |
| JSON output with per-item bboxes | ✅ | ✅ | ❌ | partial |
| C ABI FFI crate | ✅ | ❌ | ✅ | ✅ |
| Token F1 vs LiteParse (tax bill) | 0.990 | 1.000 | ~0.82 | ~0.80 |
| Token F1 vs LiteParse (PRC) | 0.922 | 1.000 | ~0.75 | ~0.78 |
| Startup time (cold) | ~25 ms | ~450 ms | ~10 ms | ~120 ms |
Parity harness and golden outputs live in tests/parity/;
run python3 tests/parity/compare.py to reproduce.
Benchmark
Reproducible head-to-head against LiteParse
on the fixtures in example/. Ground truth is raw
tesseract <image> - -l eng (PDFs rendered with pdftoppm -r 150 first);
tokens are compared case-insensitively as a multiset.
This is a like-for-like comparison, not a knock on LiteParse. LiteParse is the project spdf was designed against: we owe its authors for the reference implementation of the spatial projection algorithm, and the goal of publishing these numbers is transparency about where spdf stands, not to make LiteParse look bad.
| fixture | engine | wall-clock | tokens | recall | precision | F1 |
|---|---|---|---|---|---|---|
| irs-f1040.pdf | spdf | 268 ms | 1094 | 63.8% | 86.5% | 73.4% |
| irs-f1040.pdf | liteparse | 541 ms | 1575 | 81.8% | 77.0% | 79.4% |
| irs-fw9-p1-2.pdf | spdf | 76 ms | 2253 | 99.1% | 98.4% | 98.7% |
| irs-fw9-p1-2.pdf | liteparse | 465 ms | 2253 | 99.1% | 98.4% | 98.8% |
| nist-sp-800-53r5-p1-2.pdf | spdf | 17 ms | 96 | 82.5% | 97.9% | 89.5% |
| nist-sp-800-53r5-p1-2.pdf | liteparse | 461 ms | 94 | 82.5% | 100.0% | 90.4% |
| nist-sp-800-63b-p1-2.pdf | spdf | 969 ms | 222 | 93.5% | 96.8% | 95.1% |
| nist-sp-800-63b-p1-2.pdf | liteparse | 5530 ms | 226 | 95.2% | 96.9% | 96.1% |
| rfc8446-p1-2.pdf | spdf | 20 ms | 399 | 99.5% | 99.7% | 99.6% |
| rfc8446-p1-2.pdf | liteparse | 375 ms | 399 | 99.5% | 99.7% | 99.6% |
| rfc9110-p1-2.pdf | spdf | 235 ms | 8 | 0.0% | 0.0% | 0.0% |
| rfc9110-p1-2.pdf | liteparse | 3101 ms | 8 | 0.0% | 0.0% | 0.0% |
| example-1.jpg | spdf | 1067 ms | 231 | 82.0% | 96.5% | 88.7% |
| example-1.jpg | liteparse | 7070 ms | 146 | 42.3% | 78.8% | 55.0% |
| test-ocr.pdf | spdf | 274 ms | 20 | 100.0% | 100.0% | 100.0% |
| test-ocr.pdf | liteparse | 3212 ms | 20 | 100.0% | 100.0% | 100.0% |
Mean over fixtures: spdf F1 80.6% in 366 ms; liteparse F1 77.4% in 2594 ms.
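For readers who want to check the scoring, the case-insensitive multiset comparison can be sketched as follows. This is an illustration of the metric as described above, not the benchmark script itself; TokenScore and score_tokens are hypothetical names, and tokenising on whitespace is an assumption.

```rust
use std::collections::HashMap;

/// Precision, recall, and F1 for one fixture, each in 0.0..=1.0.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TokenScore {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Count whitespace-separated tokens, lowercased for case-insensitive matching.
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Compare engine output against ground truth as token multisets.
pub fn score_tokens(predicted: &str, truth: &str) -> TokenScore {
    let pred = token_counts(predicted);
    let gold = token_counts(truth);

    // Multiset intersection: a token matches at most as many times
    // as it occurs on both sides.
    let matched: usize = pred
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    let pred_total: usize = pred.values().sum();
    let gold_total: usize = gold.values().sum();

    let precision = if pred_total == 0 { 0.0 } else { matched as f64 / pred_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { matched as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    TokenScore { precision, recall, f1 }
}
```

Because this is a multiset comparison, an engine that emits a word twice when the ground truth has it once is penalised on precision, which is how faux-bold duplication would show up in the table. When no tokens match at all, every score is 0, as in the rfc9110-p1-2.pdf rows.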
Spatial precision
Token accuracy alone doesn't tell you whether an engine put each word in the right place on the page — which is the whole point of a spatial parser. We also compare every matched word's bounding box against the raw-tesseract ground truth and report mean IoU, the fraction of matches that clear IoU ≥ 0.5 (the COCO-style "well localised" bar), and the mean centroid error in PDF points.
| fixture | engine | matched | mean IoU | IoU≥0.5 | centroid err |
|---|---|---|---|---|---|
| example-1.jpg | spdf | 212 | 0.976 | 97.6% | 4.50 pt |
| example-1.jpg | liteparse | 109 | 0.667 | 67.9% | 28.03 pt |
| test-ocr.pdf | spdf | 5 | 0.952 | 100.0% | 0.64 pt |
| test-ocr.pdf | liteparse | 4 | 0.957 | 100.0% | 0.54 pt |
| irs-f1040.pdf | spdf | 115 | 0.476 | 55.7% | 97.90 pt |
| irs-f1040.pdf | liteparse | 84 | 0.351 | 52.4% | 135.73 pt |
| irs-fw9-p1-2.pdf | spdf | 29 | 0.517 | 58.6% | 169.21 pt |
| irs-fw9-p1-2.pdf | liteparse | 28 | 0.348 | 53.6% | 175.61 pt |
| nist-sp-800-53r5-p1-2.pdf | spdf | 3 | 0.964 | 100.0% | 0.35 pt |
| nist-sp-800-53r5-p1-2.pdf | liteparse | 1 | 0.634 | 100.0% | 2.01 pt |
| nist-sp-800-63b-p1-2.pdf | spdf | 14 | 0.678 | 78.6% | 84.12 pt |
| nist-sp-800-63b-p1-2.pdf | liteparse | 20 | 0.471 | 65.0% | 103.50 pt |
| rfc8446-p1-2.pdf | spdf | 1 | 0.869 | 100.0% | 0.44 pt |
| rfc8446-p1-2.pdf | liteparse | 4 | 0.427 | 50.0% | 171.02 pt |
| rfc9110-p1-2.pdf | spdf | 0 | 0.000 | 0.0% | 0.00 pt |
| rfc9110-p1-2.pdf | liteparse | 0 | 0.000 | 0.0% | 0.00 pt |
Mean over fixtures: spdf mean IoU 0.679, 73.8% of matches ≥ 0.5, centroid error 44.64 pt; liteparse 0.482 / 61.1% / 77.05 pt.
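The three spatial metrics can be sketched in a few lines. This is an illustrative reading of the definitions above rather than the benchmark code; BBox, iou, centroid_error, and summarise are hypothetical names, and boxes are assumed to be axis-aligned with coordinates in PDF points.

```rust
/// Axis-aligned bounding box in PDF points (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
    fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Intersection over union of two boxes, in 0.0..=1.0.
pub fn iou(a: &BBox, b: &BBox) -> f64 {
    let w = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let h = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = w * h;
    let union = a.area() + b.area() - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Euclidean distance between the two box centres, in points.
pub fn centroid_error(a: &BBox, b: &BBox) -> f64 {
    let (ax, ay) = a.centroid();
    let (bx, by) = b.centroid();
    (ax - bx).hypot(ay - by)
}

/// (mean IoU, fraction of pairs with IoU >= 0.5, mean centroid error).
/// Returns zeros when there are no matched pairs.
pub fn summarise(pairs: &[(BBox, BBox)]) -> (f64, f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    let n = pairs.len() as f64;
    let ious: Vec<f64> = pairs.iter().map(|(a, b)| iou(a, b)).collect();
    let mean_iou = ious.iter().sum::<f64>() / n;
    let well_localised = ious.iter().filter(|&&v| v >= 0.5).count() as f64 / n;
    let mean_err = pairs.iter().map(|(a, b)| centroid_error(a, b)).sum::<f64>() / n;
    (mean_iou, well_localised, mean_err)
}
```

Returning zeros for an empty match set mirrors the rfc9110-p1-2.pdf rows, where zero matched words yield 0.000 / 0.0% / 0.00 pt. Those zero rows pull the per-fixture means down for both engines.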
Per-fixture raw outputs are committed under benchmark/outputs/ so the numbers are auditable. Reproduce on your own machine:
LITEPARSE_DIR=/path/to/liteparse
Production-readiness
spdf is pre-1.0. The table below tracks what we've hardened so you can decide whether it fits your threat model; see CHANGELOG.md for per-release detail.
| Area | Status |
|---|---|
| JSON output schema (byte-compatible with LiteParse) | stable (covered by parity harness) |
| Typed error enum at the public API (SpdfError) | stable |
| Benchmark corpus (public-domain: IRS, NIST, RFC, scanned image) | 5 fixtures in example/ |
| Property tests (cargo test -p spdf-projection proptests) | panic-freedom + shuffle-stability |
| Fuzz harness (cargo +nightly fuzz run parse_pdf) | in fuzz/; run before exposing to untrusted input |
| Cross-platform CI | Linux + macOS + Windows; MSRV 1.85; rustdoc warnings gated |
| Resource guards (timeout_secs, max_input_bytes, max_pages) | available via SpdfParser::builder |
| Security policy | see SECURITY.md |
| CLI / Rust library API | best-effort stable, breaks noted in CHANGELOG.md |
| spdf-ffi C ABI | unstable; symbols may change across 0.x releases |
| crates.io / npm publication | not yet; install from source |
Recommended posture when parsing untrusted PDFs today:
let parser = SpdfParser::builder()
    .timeout_secs(30)                   // defensive deadline (illustrative value)
    .max_input_bytes(50 * 1024 * 1024)  // 50 MiB input cap
    .max_pages(500)                     // reject page-tree bombs (illustrative value)
    .build();
Then wrap the process in a resource-capped sandbox (systemd-run --property=MemoryMax=1G, Firejail, Docker --memory=). Follow the
full hardening checklist in SECURITY.md.
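If you call a parser that has no built-in deadline, a defensive timeout can be approximated in plain std. The sketch below is not how spdf implements timeout_secs; run_with_deadline is a hypothetical helper that runs a closure on a worker thread and gives up waiting after the deadline.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a worker thread and wait at most `deadline` for its result.
/// Returns `None` if the deadline passes or the worker panics.
/// Note: the worker is NOT killed on timeout; it keeps running in the background.
pub fn run_with_deadline<T, F>(work: F, deadline: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the deadline passed;
        // ignore the send error in that case.
        let _ = tx.send(work());
    });
    rx.recv_timeout(deadline).ok()
}
```

Because a timed-out worker thread cannot be cancelled from safe Rust, an in-process deadline only bounds how long the caller waits. It does not bound CPU or memory use, which is why the process-level sandbox above still matters.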
Install
Prebuilt binaries
Self-contained tarballs (bundled libpdfium, no runtime deps) are
attached to each GitHub release.
Download, extract, and run. OCR is not compiled into the prebuilt
binaries — use --ocr-server-url for HTTP OCR or cargo install
with --features spdf-cli/tesseract for a local Tesseract build.
| Target | Tarball | Status |
|---|---|---|
| x86_64-unknown-linux-gnu | spdf-<version>-x86_64-unknown-linux-gnu.tar.gz | ✅ attached to v0.2.0-alpha.1 |
| aarch64-unknown-linux-gnu | spdf-<version>-aarch64-unknown-linux-gnu.tar.gz | ⬜ TODO |
| x86_64-apple-darwin | spdf-<version>-x86_64-apple-darwin.tar.gz | ⬜ TODO: build on macOS Intel |
| aarch64-apple-darwin | spdf-<version>-aarch64-apple-darwin.tar.gz | ⬜ TODO: build on Apple Silicon |
| x86_64-pc-windows-msvc | spdf-<version>-x86_64-pc-windows-msvc.zip | ⬜ TODO: build on Windows |
To produce a release tarball on a new host:
VER=0.2.0-alpha.1
TARGET=
DIR="spdf--"
From source
# from source (requires Rust 1.85+)
# or build locally
Runtime dependency: PDFium
spdf dynamically loads a PDFium shared library. On macOS:
Or download a prebuilt binary from
bblanchon/pdfium-binaries
and point PDFIUM_LIB_PATH at it.
Platform support matrix
| Platform | Core parsing | OCR (Tesseract) | Notes |
|---|---|---|---|
| Linux x86_64 | ✅ | ✅ | primary development target |
| macOS (Intel + Apple Silicon) | ✅ | ✅ | requires brew install tesseract |
| Windows x86_64 | ✅ | ⚠️ source-build only | see below |
Windows OCR caveat. The tesseract Rust crate used by spdf-ocr
links against libtesseract + libleptonica via bindgen, which needs
a working C toolchain (clang) and a vcpkg or manually-installed
Tesseract/Leptonica. The CI matrix builds spdf on Windows without
the tesseract feature; the Linux/macOS jobs cover OCR. If you need
Windows OCR in production today, install Tesseract via vcpkg install tesseract leptonica --triplet x64-windows, set LIBCLANG_PATH, and
build with cargo build --release -p spdf-cli --features spdf-cli/tesseract. The HTTP OCR backend (--ocr-server-url)
works on every platform and is the recommended option for Windows until
we cut a proper MSVC-native build.
Quick start
# Plain text with preserved layout
# Structured JSON with per-glyph bounding boxes
# OCR-only mode for scanned PDFs
# Use an external OCR server (PaddleOCR, EasyOCR, etc.)
# Render specific pages
# Dump pages as PNGs
# Batch-convert a directory of PDFs
Library usage
use LiteParse;
use ParseConfig;
let parser = new;
let result = parser.parse_path?;
for page in &result.pages
Architecture
crates/
spdf-types/ public schema
spdf-processing/ text / geometry / markup helpers
spdf-projection/ spatial reconstruction (the crown jewel)
spdf-pdf/ PdfEngine trait + PDFium impl
spdf-ocr/ OcrEngine trait + Tesseract + HTTP impls
spdf-convert/ LibreOffice / ImageMagick shell-outs
spdf-output/ JSON + text formatters
spdf-core/ orchestrator
spdf-cli/ spdf binary
spdf-ffi/ C ABI cdylib
xtask/ parity harness, benches, pdfium fetcher
See AGENTS.md for the full crate map and CONTRIBUTING.md for development workflow.
Roadmap
- Node bindings (@spdf/node) on top of spdf-ffi
- Python bindings via PyO3
- spdf serve: a local HTTP parsing service
- Optional ML-based reading-order classifier (opt-in, burn feature flag)
Acknowledgements
spdf is an independent Rust project authored by Fanaperana. The spatial projection algorithm was inspired by (and is benchmarked against) LiteParse, but spdf is not a port or rewrite — it's its own implementation, with its own engine choices (PDFium + Tesseract), its own data model, and its own hardening work. Rendering is powered by PDFium; OCR uses Tesseract.
License
MIT © 2026 spdf contributors.