spdf

Fast, spatial PDF parsing in Rust.

Extract text with preserved columns, tables, and layout — plus optional OCR for scans, format conversion for Office docs, and a single self-contained binary.

Why spdf

Most PDF-to-text tools collapse whitespace, shuffle columns, and emit one giant line salad — fine for search indexing, useless for anything that cares about where things appear on the page (invoices, tax bills, property records, scientific tables, legal forms).

spdf keeps the geometry:

Column-aware projection — tables, two-column layouts, sidebars, and indented blocks come back in reading order with their spatial structure intact.
Faux-bold & shadow dedup — PDFs that "draw text twice" to simulate bold no longer produce TaTax Infofo; you get Tax Info.
Word reconstruction — PDFium-style per-glyph extraction is stitched back into words (1 8 3 6 → 1836) using a liteparse-compatible merge heuristic.
QR / barcode / microprint filtering — the hundreds of tiny numeric glyphs that encode a QR code are auto-dropped so they don't destroy the surrounding table.
Optional OCR — Tesseract locally, or any HTTP OCR server (PaddleOCR, EasyOCR, etc.) for image-only pages. One flag to turn it off when you know the PDF is born-digital.
Format conversion — Office docs via LibreOffice shell-out, images via ImageMagick, all behind the same CLI.
One static binary. Install PDFium once, ship spdf anywhere.
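To make the faux-bold dedup and word reconstruction bullets concrete, here is a minimal single-line sketch in Rust. It is not spdf's actual code: the `Glyph` struct, the `glyph` helper, the 0.5 pt position tolerance, and the 0.3 gap ratio are all assumptions chosen for illustration, and the real merge heuristic is likely more involved.

```rust
/// One extracted glyph: character, left edge, baseline, and width in PDF points.
/// (Hypothetical type for this sketch, not spdf's data model.)
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
    width: f32,
}

/// Convenience constructor: every glyph sits on the same baseline, 5 pt wide.
fn glyph(ch: char, x: f32) -> Glyph {
    Glyph { ch, x, y: 100.0, width: 5.0 }
}

/// Drop a glyph when an already-kept glyph has the same character at almost
/// the same position (a "drawn twice" faux-bold or shadow copy).
fn dedup_shadows(glyphs: &[Glyph], tolerance: f32) -> Vec<Glyph> {
    let mut kept: Vec<Glyph> = Vec::new();
    for g in glyphs {
        let is_shadow = kept.iter().any(|k| {
            k.ch == g.ch && (k.x - g.x).abs() <= tolerance && (k.y - g.y).abs() <= tolerance
        });
        if !is_shadow {
            kept.push(g.clone());
        }
    }
    kept
}

/// Join the glyphs of one line into words. A new word starts when the gap to
/// the previous glyph exceeds `gap_ratio` times that glyph's width.
/// The y coordinate is ignored: the input is assumed to be a single line.
fn merge_words(glyphs: &[Glyph], gap_ratio: f32) -> Vec<String> {
    let mut sorted = glyphs.to_vec();
    sorted.sort_by(|a, b| a.x.total_cmp(&b.x));
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev: Option<&Glyph> = None;
    for g in &sorted {
        if let Some(p) = prev {
            let gap = g.x - (p.x + p.width);
            if gap > gap_ratio * p.width {
                words.push(std::mem::take(&mut current));
            }
        }
        current.push(g.ch);
        prev = Some(g);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

fn main() {
    // "Tax" drawn twice with a 0.3 pt offset, followed by per-glyph digits.
    let raw = vec![
        glyph('T', 0.0), glyph('T', 0.3),
        glyph('a', 6.0), glyph('a', 6.3),
        glyph('x', 11.0), glyph('x', 11.3),
        glyph('1', 30.0), glyph('8', 36.0), glyph('3', 42.0), glyph('6', 48.0),
    ];
    let clean = dedup_shadows(&raw, 0.5);
    println!("{:?}", merge_words(&clean, 0.3));
}
```

Running the dedup before the merge matters in this sketch: without it, the doubled glyphs would be stitched into the same word and the garbling would survive.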

Comparison

Benchmarked on two real-world U.S. county tax documents (TAX_APPEAL_MOM set) with --no-ocr. Token-level F1 is measured against LiteParse's output for the same documents, using the reference outputs in tests/parity/, so the LiteParse column is 1.000 by construction.

Feature	spdf (this project)	LiteParse	pdftotext	pypdfium2
Language	Rust	TypeScript	C++	C++/Python
Single static binary	✅	❌ (Node)	✅	❌
Column-aware text projection	✅	✅	partial	❌
Faux-bold shadow dedup	✅	✅	❌	❌
QR / microprint filter	✅	✅	❌	❌
OCR fallback (Tesseract + HTTP)	✅	✅	❌	❌
Office-format conversion	✅	✅	❌	❌
Batch mode	✅	✅	❌	❌
JSON output with per-item bboxes	✅	✅	❌	partial
C ABI FFI crate	✅	❌	✅	✅
Token F1 vs LiteParse (tax bill)	0.990	1.000	~0.82	~0.80
Token F1 vs LiteParse (PRC)	0.922	1.000	~0.75	~0.78
Startup time (cold)	~25 ms	~450 ms	~10 ms	~120 ms

Parity harness and golden outputs live in tests/parity/; run python3 tests/parity/compare.py to reproduce.

Benchmark

Reproducible head-to-head against LiteParse on the fixtures in example/. Ground truth is the raw output of tesseract <image> - -l eng (PDFs are first rendered to images with pdftoppm -r 150); tokens are compared case-insensitively as multisets.

This is a like-for-like comparison, not a knock on LiteParse. LiteParse is the project spdf was designed against, and its spatial projection algorithm served as our reference. We publish these numbers to be transparent about where spdf stands, not to make LiteParse look bad.

fixture	engine	wall-clock	tokens	recall	precision	F1
irs-f1040.pdf	spdf	268 ms	1094	63.8%	86.5%	73.4%
irs-f1040.pdf	liteparse	541 ms	1575	81.8%	77.0%	79.4%
irs-fw9-p1-2.pdf	spdf	76 ms	2253	99.1%	98.4%	98.7%
irs-fw9-p1-2.pdf	liteparse	465 ms	2253	99.1%	98.4%	98.8%
nist-sp-800-53r5-p1-2.pdf	spdf	17 ms	96	82.5%	97.9%	89.5%
nist-sp-800-53r5-p1-2.pdf	liteparse	461 ms	94	82.5%	100.0%	90.4%
nist-sp-800-63b-p1-2.pdf	spdf	969 ms	222	93.5%	96.8%	95.1%
nist-sp-800-63b-p1-2.pdf	liteparse	5530 ms	226	95.2%	96.9%	96.1%
rfc8446-p1-2.pdf	spdf	20 ms	399	99.5%	99.7%	99.6%
rfc8446-p1-2.pdf	liteparse	375 ms	399	99.5%	99.7%	99.6%
rfc9110-p1-2.pdf	spdf	235 ms	8	0.0%	0.0%	0.0%
rfc9110-p1-2.pdf	liteparse	3101 ms	8	0.0%	0.0%	0.0%
example-1.jpg	spdf	1067 ms	231	82.0%	96.5%	88.7%
example-1.jpg	liteparse	7070 ms	146	42.3%	78.8%	55.0%
test-ocr.pdf	spdf	274 ms	20	100.0%	100.0%	100.0%
test-ocr.pdf	liteparse	3212 ms	20	100.0%	100.0%	100.0%

Mean over fixtures: spdf F1 80.6% in 366 ms; liteparse F1 77.4% in 2594 ms.
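The recall, precision, and F1 columns above follow from a case-insensitive multiset comparison of tokens. A minimal sketch of that calculation is below. It assumes tokens are split on whitespace, which the benchmark description does not specify, and the function names are illustrative rather than taken from the benchmark harness.

```rust
use std::collections::HashMap;

/// Count lowercased whitespace-separated tokens (a multiset).
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Returns (recall, precision, f1) of `predicted` against `truth`.
fn score(predicted: &str, truth: &str) -> (f64, f64, f64) {
    let pred = token_counts(predicted);
    let gold = token_counts(truth);

    // Multiset intersection: each token matches at most as often as it
    // appears on both sides.
    let matched: usize = pred
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    let pred_total: usize = pred.values().sum();
    let gold_total: usize = gold.values().sum();

    let precision = if pred_total == 0 { 0.0 } else { matched as f64 / pred_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { matched as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (recall, precision, f1)
}

fn main() {
    // The duplicated "tax" in the prediction counts once, because the
    // ground truth only contains it once.
    let (recall, precision, f1) = score("Tax Info tax 1836", "TAX INFO 1836 total");
    println!("recall={:.3} precision={:.3} f1={:.3}", recall, precision, f1);
}
```

Because matching is by multiset, an engine that emits a word twice is penalised on precision, and one that drops a word is penalised on recall, independent of word order.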

Spatial precision

Token accuracy alone doesn't tell you whether an engine put each word in the right place on the page — which is the whole point of a spatial parser. We also compare every matched word's bounding box against the raw-tesseract ground truth and report mean IoU, the fraction of matches that clear IoU ≥ 0.5 (the COCO-style "well localised" bar), and the mean centroid error in PDF points.

fixture	engine	matched	mean IoU	IoU≥0.5	centroid err
example-1.jpg	spdf	212	0.976	97.6%	4.50 pt
example-1.jpg	liteparse	109	0.667	67.9%	28.03 pt
test-ocr.pdf	spdf	5	0.952	100.0%	0.64 pt
test-ocr.pdf	liteparse	4	0.957	100.0%	0.54 pt
irs-f1040.pdf	spdf	115	0.476	55.7%	97.90 pt
irs-f1040.pdf	liteparse	84	0.351	52.4%	135.73 pt
irs-fw9-p1-2.pdf	spdf	29	0.517	58.6%	169.21 pt
irs-fw9-p1-2.pdf	liteparse	28	0.348	53.6%	175.61 pt
nist-sp-800-53r5-p1-2.pdf	spdf	3	0.964	100.0%	0.35 pt
nist-sp-800-53r5-p1-2.pdf	liteparse	1	0.634	100.0%	2.01 pt
nist-sp-800-63b-p1-2.pdf	spdf	14	0.678	78.6%	84.12 pt
nist-sp-800-63b-p1-2.pdf	liteparse	20	0.471	65.0%	103.50 pt
rfc8446-p1-2.pdf	spdf	1	0.869	100.0%	0.44 pt
rfc8446-p1-2.pdf	liteparse	4	0.427	50.0%	171.02 pt
rfc9110-p1-2.pdf	spdf	0	0.000	0.0%	0.00 pt
rfc9110-p1-2.pdf	liteparse	0	0.000	0.0%	0.00 pt

Mean over fixtures: spdf mean IoU 0.679, 73.8% of matches ≥ 0.5, centroid error 44.64 pt; liteparse 0.482 / 61.1% / 77.05 pt.
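For readers unfamiliar with the metrics in this table, the sketch below computes IoU, the share of matches with IoU ≥ 0.5, and the mean centroid error for a list of matched box pairs. The `BBox` type and `summarize` function are hypothetical names for this example, and returning zeros for an empty match list is an assumption that mirrors how the zero-match rows are reported above.

```rust
/// Axis-aligned box in PDF points, with (x0, y0) and (x1, y1) as opposite corners.
#[derive(Debug, Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Intersection over union: overlap area divided by combined area.
fn iou(a: &BBox, b: &BBox) -> f64 {
    let iw = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let ih = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = iw * ih;
    let union = a.area() + b.area() - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Euclidean distance between the two box centres.
fn centroid_error(a: &BBox, b: &BBox) -> f64 {
    let (ax, ay) = a.centroid();
    let (bx, by) = b.centroid();
    (ax - bx).hypot(ay - by)
}

/// Returns (mean IoU, fraction with IoU >= 0.5, mean centroid error).
fn summarize(pairs: &[(BBox, BBox)]) -> (f64, f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    let n = pairs.len() as f64;
    let mean_iou = pairs.iter().map(|(a, b)| iou(a, b)).sum::<f64>() / n;
    let well_localised = pairs.iter().filter(|(a, b)| iou(a, b) >= 0.5).count() as f64 / n;
    let mean_err = pairs.iter().map(|(a, b)| centroid_error(a, b)).sum::<f64>() / n;
    (mean_iou, well_localised, mean_err)
}

fn main() {
    let truth = BBox { x0: 0.0, y0: 0.0, x1: 10.0, y1: 10.0 };
    let shifted = BBox { x0: 5.0, y0: 0.0, x1: 15.0, y1: 10.0 };
    // One perfect match and one box shifted right by half its width.
    let (mean_iou, frac, err) = summarize(&[(truth, truth), (truth, shifted)]);
    println!("mean IoU {:.3}, IoU>=0.5 {:.1}%, centroid err {:.2} pt", mean_iou, frac * 100.0, err);
}
```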

Per-fixture raw outputs are committed under benchmark/outputs/ so the numbers are auditable. Reproduce on your own machine:

make build-ocr   # or `make install-ocr`
LITEPARSE_DIR=/path/to/liteparse make benchmark-update

Production-readiness

spdf is pre-1.0. The table below tracks what we've hardened so you can decide whether it fits your threat model; see CHANGELOG.md for per-release detail.

Area	Status
JSON output schema (byte-compatible with LiteParse)	stable (covered by parity harness)
Typed error enum at the public API (`SpdfError`)	stable
Benchmark corpus (public-domain: IRS, NIST, RFC, scanned image)	8 fixtures in example/
Property tests (`cargo test -p spdf-projection proptests`)	panic-freedom + shuffle-stability
Fuzz harness (`cargo +nightly fuzz run parse_pdf`)	fuzz/ — run before exposing to untrusted input
Cross-platform CI	Linux + macOS + Windows; MSRV 1.85; rustdoc warnings gated
Resource guards (`timeout_secs`, `max_input_bytes`, `max_pages`)	available via `SpdfParser::builder`
Security policy	see SECURITY.md
CLI / Rust library API	best-effort stable, breaks noted in CHANGELOG.md
`spdf-ffi` C ABI	unstable — symbols may change across 0.x releases
crates.io / npm publication	not yet; install from source

Recommended posture when parsing untrusted PDFs today:

let parser = SpdfParser::builder()
    .timeout_secs(30)           // defensive deadline
    .max_input_bytes(50 << 20)  // 50 MiB input cap
    .max_pages(500)             // reject page-tree bombs
    .build();

Then wrap the process in a resource-capped sandbox (systemd-run --property=MemoryMax=1G, Firejail, Docker --memory=). Follow the full hardening checklist in SECURITY.md.
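As defense in depth, a caller can also put its own deadline around any blocking parse call. The sketch below uses only the standard library and a stand-in closure instead of the real parser; `run_with_deadline` is a hypothetical helper and says nothing about how spdf implements `timeout_secs` internally.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and wait at most `deadline` for its result.
/// Returns `None` on timeout or if the worker panics. The helper thread is
/// not killed on timeout; it is detached and keeps running.
fn run_with_deadline<T, F>(work: F, deadline: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Ignore the send error: the receiver is gone if we already timed out.
        let _ = tx.send(work());
    });
    rx.recv_timeout(deadline).ok()
}

fn main() {
    // Stand-in for a parse call; a real caller would invoke the parser here.
    let fast = run_with_deadline(|| "parsed", Duration::from_secs(1));
    println!("fast job: {:?}", fast);

    // A job that outlives its deadline is reported as None.
    let slow = run_with_deadline(
        || {
            thread::sleep(Duration::from_millis(300));
            "parsed"
        },
        Duration::from_millis(20),
    );
    println!("slow job: {:?}", slow);
}
```

Rust cannot forcibly stop a thread, so a timed-out worker may continue to consume CPU and memory. That is why the process-level sandbox is still needed: only the operating system can reliably reclaim those resources.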

Install

Prebuilt binaries

Self-contained tarballs (bundled libpdfium, no runtime deps) are attached to each GitHub release. Download, extract, and run. OCR is not compiled into the prebuilt binaries — use --ocr-server-url for HTTP OCR or cargo install with --features spdf-cli/tesseract for a local Tesseract build.

Target	Tarball	Status
`x86_64-unknown-linux-gnu`	`spdf-<version>-x86_64-unknown-linux-gnu.tar.gz`	✅ attached to v0.2.0-alpha.1
`aarch64-unknown-linux-gnu`	`spdf-<version>-aarch64-unknown-linux-gnu.tar.gz`	⬜ TODO
`x86_64-apple-darwin`	`spdf-<version>-x86_64-apple-darwin.tar.gz`	⬜ TODO — build on macOS Intel
`aarch64-apple-darwin`	`spdf-<version>-aarch64-apple-darwin.tar.gz`	⬜ TODO — build on Apple Silicon
`x86_64-pc-windows-msvc`	`spdf-<version>-x86_64-pc-windows-msvc.zip`	⬜ TODO — build on Windows

To produce a release tarball on a new host:

cargo build --release -p spdf-cli
VER=0.2.0-alpha.1
TARGET=$(rustc -vV | awk '/^host:/ {print $2}')
DIR="spdf-${VER}-${TARGET}"
mkdir -p "dist/${DIR}"
cp target/release/spdf "dist/${DIR}/"          # use spdf.exe on Windows
cp LICENSE README.md CHANGELOG.md "dist/${DIR}/"
tar czf "dist/${DIR}.tar.gz" -C dist "${DIR}"   # or zip on Windows
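# Hash from inside dist/ so the checksum file records the bare file name
# and `sha256sum -c` works from the download directory.
# If sha256sum is unavailable (e.g. stock macOS), use `shasum -a 256`.
(cd dist && sha256sum "${DIR}.tar.gz" > "${DIR}.tar.gz.sha256")
gh release upload "v${VER}" "dist/${DIR}.tar.gz" "dist/${DIR}.tar.gz.sha256"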

From source

# install from a checkout (requires Rust 1.85+)
cargo install --path crates/spdf-cli

# or build locally
cargo build --release -p spdf-cli
./target/release/spdf --help

Runtime dependency: PDFium

spdf dynamically loads a PDFium shared library. On macOS:

brew install pdfium

Or download a prebuilt binary from bblanchon/pdfium-binaries and point PDFIUM_LIB_PATH at it.

Platform support matrix

Platform	Core parsing	OCR (Tesseract)	Notes
Linux x86_64	✅	✅	primary development target
macOS (Intel + Apple Silicon)	✅	✅	requires `brew install tesseract`
Windows x86_64	✅	⚠️ source-build only	see below

Windows OCR caveat. The tesseract Rust crate used by spdf-ocr links against libtesseract + libleptonica via bindgen, which needs a working C toolchain (clang) and a vcpkg or manually-installed Tesseract/Leptonica. The CI matrix builds spdf on Windows without the tesseract feature; the Linux/macOS jobs cover OCR. If you need Windows OCR in production today, install Tesseract via vcpkg install tesseract leptonica --triplet x64-windows, set LIBCLANG_PATH, and build with cargo build --release -p spdf-cli --features spdf-cli/tesseract. The HTTP OCR backend (--ocr-server-url) works on every platform and is the recommended option for Windows until we cut a proper MSVC-native build.

Quick start

# Plain text with preserved layout
spdf parse invoice.pdf --no-ocr --format text

# Structured JSON with per-item bounding boxes
spdf parse invoice.pdf --no-ocr --format json > out.json

# Scanned PDF with OCR, recognition language set to English
spdf parse scan.pdf --ocr-language eng

# Use an external OCR server (PaddleOCR, EasyOCR, etc.)
spdf parse scan.pdf --ocr-server-url http://localhost:8000

# Parse only specific pages
spdf parse book.pdf --target-pages 1-3,7,12-15

# Dump pages as PNGs
spdf screenshot report.pdf -o ./pages --dpi 200

# Batch-convert a directory of PDFs
spdf batch-parse ./inputs ./outputs --format text
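
The --target-pages value above mixes single pages and ranges. As an illustration of how such a spec can be expanded, here is a small Rust sketch. It is not spdf's parser: `parse_page_spec` is a hypothetical function, and the choices to treat pages as 1-based, ranges as inclusive, and duplicates as collapsed are assumptions.

```rust
/// Parse one page number, rejecting zero and non-numeric input.
fn parse_page(s: &str) -> Result<u32, String> {
    match s.trim().parse::<u32>() {
        Ok(0) => Err("page numbers are assumed to start at 1".to_string()),
        Ok(n) => Ok(n),
        Err(_) => Err(format!("not a page number: {:?}", s)),
    }
}

/// Expand a spec such as "1-3,7,12-15" into a sorted, de-duplicated page list.
fn parse_page_spec(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err("empty entry in page spec".to_string());
        }
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range: {}", part));
        }
        pages.extend(start..=end);
    }
    pages.sort_unstable();
    pages.dedup();
    Ok(pages)
}

fn main() {
    match parse_page_spec("1-3,7,12-15") {
        Ok(pages) => println!("{:?}", pages),
        Err(e) => eprintln!("invalid page spec: {}", e),
    }
}
```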

Library usage

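use spdf_core::LiteParse;
use spdf_types::ParseConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Born-digital PDF: skip OCR and read the embedded text layer.
    let parser = LiteParse::new(ParseConfig {
        ocr_enabled: false,
        ..Default::default()
    });
    let result = parser.parse_path("invoice.pdf")?;
    // Each page carries its number and its layout-preserving text.
    for page in &result.pages {
        println!("--- page {} ---\n{}", page.page_num, page.text);
    }
    Ok(())
}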

Architecture

crates/
  spdf-types/        public schema
  spdf-processing/   text / geometry / markup helpers
  spdf-projection/   spatial reconstruction (the crown jewel)
  spdf-pdf/          PdfEngine trait + PDFium impl
  spdf-ocr/          OcrEngine trait + Tesseract + HTTP impls
  spdf-convert/      LibreOffice / ImageMagick shell-outs
  spdf-output/       JSON + text formatters
  spdf-core/         orchestrator
  spdf-cli/          spdf binary
  spdf-ffi/          C ABI cdylib
xtask/               parity harness, benches, pdfium fetcher

See AGENTS.md for the full crate map and CONTRIBUTING.md for development workflow.
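
To give a feel for what "spatial reconstruction" in spdf-projection means, the sketch below projects positioned text items onto a monospace character grid so that columns line up in the text output. It is a deliberately naive illustration, not the crate's algorithm: the `Item` type, the fixed `char_width` and `line_height`, and the top-left origin with y growing downward are all assumptions.

```rust
use std::collections::BTreeMap;

/// A piece of text with the position of its left edge and baseline, in points.
/// (Hypothetical type for this sketch.)
struct Item {
    text: String,
    x: f32,
    y: f32,
}

fn item(text: &str, x: f32, y: f32) -> Item {
    Item { text: text.to_string(), x, y }
}

/// Snap each item to a (row, column) cell and render rows top to bottom.
fn project(items: &[Item], char_width: f32, line_height: f32) -> String {
    // Group items into rows by quantising y; BTreeMap keeps rows ordered.
    let mut rows: BTreeMap<i64, Vec<&Item>> = BTreeMap::new();
    for it in items {
        let row = (it.y / line_height).round() as i64;
        rows.entry(row).or_default().push(it);
    }

    let mut lines = Vec::new();
    for (_, mut row_items) in rows {
        row_items.sort_by(|a, b| a.x.total_cmp(&b.x));
        let mut line = String::new();
        for it in row_items {
            let col = (it.x / char_width).round().max(0.0) as usize;
            let len = line.chars().count();
            if col > len {
                // Pad with spaces so the item starts at its column.
                line.push_str(&" ".repeat(col - len));
            } else if len > 0 {
                // Items would collide: keep them apart with a single space.
                line.push(' ');
            }
            line.push_str(&it.text);
        }
        lines.push(line);
    }
    lines.join("\n")
}

fn main() {
    // Items arrive out of reading order, as they often do in a PDF stream.
    let items = vec![
        item("$42.00", 120.0, 12.0),
        item("Parcel", 0.0, 0.0),
        item("1836", 0.0, 12.0),
        item("Amount", 120.0, 0.0),
    ];
    println!("{}", project(&items, 6.0, 12.0));
}
```

With a 6 pt character width, both right-hand items land in column 20, so the two-column structure survives in plain text regardless of the order in which the items were extracted.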

Roadmap

Node bindings (@spdf/node) on top of spdf-ffi
Python bindings via PyO3
spdf serve — a local HTTP parsing service
Optional ML-based reading-order classifier (opt-in, burn feature flag)

Acknowledgements

spdf is an independent Rust project authored by Fanaperana. The spatial projection algorithm was inspired by (and is benchmarked against) LiteParse, but spdf is not a port or rewrite — it's its own implementation, with its own engine choices (PDFium + Tesseract), its own data model, and its own hardening work. Rendering is powered by PDFium; OCR uses Tesseract.
