<div align="center">
# spdf
**Fast, spatial PDF parsing in Rust.**
Extract text with preserved columns, tables, and layout — plus optional OCR
for scans, format conversion for Office docs, and a single self-contained
binary.
</div>
---
## Why spdf
Most PDF-to-text tools collapse whitespace, shuffle columns, and emit one
giant line salad — fine for search indexing, useless for anything that cares
about *where* things appear on the page (invoices, tax bills, property
records, scientific tables, legal forms).
`spdf` keeps the geometry:
- **Column-aware projection** — tables, two-column layouts, sidebars, and
indented blocks come back in reading order with their spatial structure
intact.
- **Faux-bold & shadow dedup** — PDFs that "draw text twice" to simulate
bold no longer produce `TaTax Infofo`; you get `Tax Info`.
- **Word reconstruction** — PDFium-style per-glyph extraction is stitched
back into words (`1 8 3 6` → `1836`) using a liteparse-compatible merge
heuristic.
- **QR / barcode / microprint filtering** — the hundreds of tiny numeric
glyphs that encode a QR code are auto-dropped so they don't destroy the
surrounding table.
- **Optional OCR** — Tesseract locally, or any HTTP OCR server (PaddleOCR,
EasyOCR, etc.) for image-only pages. One flag to turn it off when you
know the PDF is born-digital.
- **Format conversion** — Office docs via LibreOffice shell-out, images via
ImageMagick, all behind the same CLI.
- **One static binary.** Install PDFium once, ship `spdf` anywhere.
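The shadow-dedup and word-reconstruction bullets above can be illustrated with a minimal sketch. It is not spdf's implementation: the `Glyph` type, the function names, and the `eps` / `max_gap` thresholds are all hypothetical, and the glyphs are assumed to sit on one baseline, already sorted left to right.

```rust
/// One extracted glyph and its horizontal extent, in PDF points.
/// Hypothetical type for illustration; not spdf's data model.
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    ch: char,
    x0: f32, // left edge
    x1: f32, // right edge
}

/// Drop a glyph that repeats the previously kept glyph's character at almost
/// the same position (the "draw it twice" faux-bold trick).
fn dedup_shadows(glyphs: &[Glyph], eps: f32) -> Vec<Glyph> {
    let mut out: Vec<Glyph> = Vec::new();
    for g in glyphs {
        let is_shadow = out
            .last()
            .is_some_and(|p| p.ch == g.ch && (p.x0 - g.x0).abs() <= eps);
        if !is_shadow {
            out.push(g.clone());
        }
    }
    out
}

/// Stitch glyphs into words: start a new word whenever the horizontal gap
/// to the previous glyph exceeds `max_gap`.
fn merge_words(glyphs: &[Glyph], max_gap: f32) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_x1: Option<f32> = None;
    for g in glyphs {
        if let Some(x1) = prev_x1 {
            if g.x0 - x1 > max_gap && !current.is_empty() {
                words.push(std::mem::take(&mut current));
            }
        }
        current.push(g.ch);
        prev_x1 = Some(g.x1);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

Running `dedup_shadows` before `merge_words` matters: a shadow glyph overlaps its twin, so merging first would duplicate the letter instead of dropping it.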
## Comparison
Benchmarked on two real-world U.S. county tax documents (the TAX_APPEAL_MOM
set) with `--no-ocr`. Token-level F1 is measured against the same documents
parsed through [LiteParse](https://github.com/run-llama/liteparse), using the
reference outputs provided in [tests/parity/](tests/parity/).
| Feature | spdf | LiteParse | | |
|---|---|---|---|---|
| Language | Rust | TypeScript | C++ | C++/Python |
| Single static binary | ✅ | ❌ (Node) | ✅ | ❌ |
| Column-aware text projection | ✅ | ✅ | partial | ❌ |
| Faux-bold shadow dedup | ✅ | ✅ | ❌ | ❌ |
| QR / microprint filter | ✅ | ✅ | ❌ | ❌ |
| OCR fallback (Tesseract + HTTP) | ✅ | ✅ | ❌ | ❌ |
| Office-format conversion | ✅ | ✅ | ❌ | ❌ |
| Batch mode | ✅ | ✅ | ❌ | ❌ |
| JSON output with per-item bboxes | ✅ | ✅ | ❌ | partial |
| C ABI FFI crate | ✅ | ❌ | ✅ | ✅ |
| **Token F1 vs LiteParse (tax bill)** | **0.990** | 1.000 | ~0.82 | ~0.80 |
| **Token F1 vs LiteParse (PRC)** | **0.922** | 1.000 | ~0.75 | ~0.78 |
| **Startup time (cold)** | ~25 ms | ~450 ms | ~10 ms | ~120 ms |
Parity harness and golden outputs live in [tests/parity/](tests/parity/);
run `python3 tests/parity/compare.py` to reproduce.
## Benchmark
Reproducible head-to-head against [LiteParse](https://github.com/run-llama/liteparse)
on the fixtures in [example/](example/). Ground truth is raw
`tesseract <image> - -l eng` (PDFs rendered with `pdftoppm -r 150` first);
tokens are compared case-insensitively as a multiset.
> This is a like-for-like comparison, not a knock on LiteParse. LiteParse
> is the project spdf was designed against, and we owe its authors for the
> reference implementation of the spatial projection algorithm. The goal
> of publishing these numbers is transparency about where spdf stands,
> not to make LiteParse look bad.
| Fixture | Engine | Time | Tokens | Recall | Precision | F1 |
|---|---|---|---|---|---|---|
| irs-f1040.pdf | spdf | 268 ms | 1094 | 63.8% | 86.5% | 73.4% |
| irs-f1040.pdf | **liteparse** | **541 ms** | **1575** | **81.8%** | **77.0%** | **79.4%** |
| irs-fw9-p1-2.pdf | spdf | 76 ms | 2253 | 99.1% | 98.4% | 98.7% |
| irs-fw9-p1-2.pdf | **liteparse** | **465 ms** | **2253** | **99.1%** | **98.4%** | **98.8%** |
| nist-sp-800-53r5-p1-2.pdf | spdf | 17 ms | 96 | 82.5% | 97.9% | 89.5% |
| nist-sp-800-53r5-p1-2.pdf | **liteparse** | **461 ms** | **94** | **82.5%** | **100.0%** | **90.4%** |
| nist-sp-800-63b-p1-2.pdf | spdf | 969 ms | 222 | 93.5% | 96.8% | 95.1% |
| nist-sp-800-63b-p1-2.pdf | **liteparse** | **5530 ms** | **226** | **95.2%** | **96.9%** | **96.1%** |
| rfc8446-p1-2.pdf | **spdf** | **20 ms** | **399** | **99.5%** | **99.7%** | **99.6%** |
| rfc8446-p1-2.pdf | liteparse | 375 ms | 399 | 99.5% | 99.7% | 99.6% |
| rfc9110-p1-2.pdf | **spdf** | **235 ms** | **8** | **0.0%** | **0.0%** | **0.0%** |
| rfc9110-p1-2.pdf | liteparse | 3101 ms | 8 | 0.0% | 0.0% | 0.0% |
| example-1.jpg | **spdf** | **1067 ms** | **231** | **82.0%** | **96.5%** | **88.7%** |
| example-1.jpg | liteparse | 7070 ms | 146 | 42.3% | 78.8% | 55.0% |
| test-ocr.pdf | **spdf** | **274 ms** | **20** | **100.0%** | **100.0%** | **100.0%** |
| test-ocr.pdf | liteparse | 3212 ms | 20 | 100.0% | 100.0% | 100.0% |
**Mean over fixtures:** spdf **F1 80.6%** in **366 ms**; liteparse F1 77.4% in 2594 ms.
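For readers who want to sanity-check the scores, here is a minimal sketch of a case-insensitive multiset token comparison. It is not the benchmark harness itself: whitespace tokenization and the function names are assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Count whitespace-separated tokens, lowercased, as a multiset.
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Returns (recall, precision, f1) of `candidate` against `truth`.
/// A token matches at most as many times as it occurs in the ground truth.
fn token_scores(candidate: &str, truth: &str) -> (f64, f64, f64) {
    let cand = token_counts(candidate);
    let gold = token_counts(truth);
    let matched: usize = cand
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    let cand_total: usize = cand.values().sum();
    let gold_total: usize = gold.values().sum();
    let precision = if cand_total == 0 { 0.0 } else { matched as f64 / cand_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { matched as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (recall, precision, f1)
}
```

Because tokens are counted as a multiset, emitting the same word twice when the ground truth has it once costs precision without improving recall.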
### Spatial precision
Token accuracy alone doesn't tell you whether an engine put each word in
the *right place on the page* — which is the whole point of a spatial
parser. We also compare every matched word's bounding box against the
raw-tesseract ground truth and report mean IoU, the fraction of matches
that clear IoU ≥ 0.5 (the COCO-style "well localised" bar), and the
mean centroid error in PDF points.
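| Fixture | Engine | Matched words | Mean IoU | IoU ≥ 0.5 | Mean centroid error |
|---|---|---|---|---|---|
| example-1.jpg | **spdf** | **212** | **0.976** | **97.6%** | **4.50 pt** |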
| example-1.jpg | liteparse | 109 | 0.667 | 67.9% | 28.03 pt |
| test-ocr.pdf | spdf | 5 | 0.952 | 100.0% | 0.64 pt |
| test-ocr.pdf | **liteparse** | **4** | **0.957** | **100.0%** | **0.54 pt** |
| irs-f1040.pdf | **spdf** | **115** | **0.476** | **55.7%** | **97.90 pt** |
| irs-f1040.pdf | liteparse | 84 | 0.351 | 52.4% | 135.73 pt |
| irs-fw9-p1-2.pdf | **spdf** | **29** | **0.517** | **58.6%** | **169.21 pt** |
| irs-fw9-p1-2.pdf | liteparse | 28 | 0.348 | 53.6% | 175.61 pt |
| nist-sp-800-53r5-p1-2.pdf | **spdf** | **3** | **0.964** | **100.0%** | **0.35 pt** |
| nist-sp-800-53r5-p1-2.pdf | liteparse | 1 | 0.634 | 100.0% | 2.01 pt |
| nist-sp-800-63b-p1-2.pdf | **spdf** | **14** | **0.678** | **78.6%** | **84.12 pt** |
| nist-sp-800-63b-p1-2.pdf | liteparse | 20 | 0.471 | 65.0% | 103.50 pt |
| rfc8446-p1-2.pdf | **spdf** | **1** | **0.869** | **100.0%** | **0.44 pt** |
| rfc8446-p1-2.pdf | liteparse | 4 | 0.427 | 50.0% | 171.02 pt |
| rfc9110-p1-2.pdf | **spdf** | **0** | **0.000** | **0.0%** | **0.00 pt** |
| rfc9110-p1-2.pdf | liteparse | 0 | 0.000 | 0.0% | 0.00 pt |
**Mean over fixtures:** spdf **mean IoU 0.679**, **73.8%** of matches ≥ 0.5, centroid error **44.64 pt**; liteparse 0.482 / 61.1% / 77.05 pt.
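The three spatial metrics can be sketched as follows. This is an illustration under assumptions, not the benchmark code: `BBox` and the function names are hypothetical, boxes are axis-aligned and measured in PDF points, and an empty match set is reported as all zeros.

```rust
/// Axis-aligned bounding box in PDF points (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }
    fn center(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Intersection over union of two boxes; 0.0 when they do not overlap.
fn iou(a: &BBox, b: &BBox) -> f64 {
    let iw = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let ih = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = iw * ih;
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Euclidean distance between the two box centers.
fn centroid_error(a: &BBox, b: &BBox) -> f64 {
    let (ax, ay) = a.center();
    let (bx, by) = b.center();
    (ax - bx).hypot(ay - by)
}

/// Returns (mean IoU, fraction of pairs with IoU >= 0.5, mean centroid error)
/// over (predicted, ground-truth) pairs of matched words.
fn summarize(pairs: &[(BBox, BBox)]) -> (f64, f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    let n = pairs.len() as f64;
    let (mut sum_iou, mut hits, mut sum_err) = (0.0, 0usize, 0.0);
    for (pred, gold) in pairs {
        let v = iou(pred, gold);
        sum_iou += v;
        if v >= 0.5 {
            hits += 1;
        }
        sum_err += centroid_error(pred, gold);
    }
    (sum_iou / n, hits as f64 / n, sum_err / n)
}
```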
Per-fixture raw outputs are committed under [benchmark/outputs/](benchmark/outputs/)
so the numbers are auditable. Reproduce on your own machine:
```sh
make build-ocr # or `make install-ocr`
LITEPARSE_DIR=/path/to/liteparse make benchmark-update
```
## Production-readiness
spdf is pre-1.0. The table below tracks what we've hardened so you can
decide whether it fits your threat model; see [CHANGELOG.md](CHANGELOG.md)
for per-release detail.
| Area | Status |
|---|---|
| JSON output schema (byte-compatible with LiteParse) | stable (covered by parity harness) |
| Typed error enum at the public API (`SpdfError`) | stable |
| Benchmark corpus (public-domain: IRS, NIST, RFC, scanned image) | 8 fixtures in [example/](example/) |
| Property tests (`cargo test -p spdf-projection proptests`) | panic-freedom + shuffle-stability |
| Fuzz harness (`cargo +nightly fuzz run parse_pdf`) | [fuzz/](fuzz/) — run before exposing to untrusted input |
| Cross-platform CI | Linux + macOS + Windows; MSRV 1.85; rustdoc warnings gated |
| Resource guards (`timeout_secs`, `max_input_bytes`, `max_pages`) | available via [`SpdfParser::builder`](crates/spdf-core/src/lib.rs) |
| Security policy | see [SECURITY.md](SECURITY.md) |
| CLI / Rust library API | best-effort stable, breaks noted in [CHANGELOG.md](CHANGELOG.md) |
| `spdf-ffi` C ABI | **unstable** — symbols may change across 0.x releases |
| crates.io / npm publication | not yet; install from source |
**Recommended posture** when parsing untrusted PDFs today:
```rust
let parser = SpdfParser::builder()
    .timeout_secs(30)            // defensive deadline
    .max_input_bytes(50 << 20)   // 50 MiB input cap
    .max_pages(500)              // reject page-tree bombs
    .build();
```
Then wrap the process in a resource-capped sandbox (`systemd-run
--property=MemoryMax=1G`, Firejail, Docker `--memory=`). Follow the
full hardening checklist in [SECURITY.md](SECURITY.md).
## Install
### Prebuilt binaries
Self-contained tarballs (bundled `libpdfium`, no runtime deps) are
attached to each [GitHub release](https://github.com/Fanaperana/spdf/releases).
Download, extract, and run. OCR is not compiled into the prebuilt
binaries — use `--ocr-server-url` for HTTP OCR or `cargo install`
with `--features spdf-cli/tesseract` for a local Tesseract build.
| Target | Asset | Status |
|---|---|---|
| `x86_64-unknown-linux-gnu` | `spdf-<version>-x86_64-unknown-linux-gnu.tar.gz` | ✅ attached to v0.2.0-alpha.1 |
| `aarch64-unknown-linux-gnu` | `spdf-<version>-aarch64-unknown-linux-gnu.tar.gz` | ⬜ TODO |
| `x86_64-apple-darwin` | `spdf-<version>-x86_64-apple-darwin.tar.gz` | ⬜ TODO — build on macOS Intel |
| `aarch64-apple-darwin` | `spdf-<version>-aarch64-apple-darwin.tar.gz` | ⬜ TODO — build on Apple Silicon |
| `x86_64-pc-windows-msvc` | `spdf-<version>-x86_64-pc-windows-msvc.zip` | ⬜ TODO — build on Windows |
To produce a release tarball on a new host:
```sh
cargo build --release -p spdf-cli
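VER=0.2.0-alpha.1
TARGET="$(rustc -vV | sed -n 's/^host: //p')"   # host triple, e.g. x86_64-unknown-linux-gnu
DIR="spdf-${VER}-${TARGET}"                     # matches the asset names above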
mkdir -p "dist/${DIR}"
cp target/release/spdf "dist/${DIR}/" # use spdf.exe on Windows
cp LICENSE README.md CHANGELOG.md "dist/${DIR}/"
tar czf "dist/${DIR}.tar.gz" -C dist "${DIR}" # or zip on Windows
sha256sum "dist/${DIR}.tar.gz" > "dist/${DIR}.tar.gz.sha256"
gh release upload v${VER} "dist/${DIR}.tar.gz" "dist/${DIR}.tar.gz.sha256"
```
### From source
```sh
# from source (requires Rust 1.85+)
cargo install --path crates/spdf-cli
# or build locally
cargo build --release -p spdf-cli
./target/release/spdf --help
```
### Runtime dependency: PDFium
`spdf` dynamically loads a PDFium shared library. On macOS:
```sh
brew install pdfium
```
Or download a prebuilt binary from
[bblanchon/pdfium-binaries](https://github.com/bblanchon/pdfium-binaries/releases)
and point `PDFIUM_LIB_PATH` at it.
### Platform support matrix
| Platform | Build | Tesseract OCR | Notes |
|---|---|---|---|
| Linux x86_64 | ✅ | ✅ | primary development target |
| macOS (Intel + Apple Silicon) | ✅ | ✅ | requires `brew install tesseract` |
| Windows x86_64 | ✅ | ⚠️ source-build only | see below |
**Windows OCR caveat.** The `tesseract` Rust crate used by `spdf-ocr`
links against `libtesseract` + `libleptonica` via `bindgen`, which needs
a working C toolchain (clang) and a `vcpkg` or manually-installed
Tesseract/Leptonica. The CI matrix builds spdf on Windows **without**
the `tesseract` feature; the Linux/macOS jobs cover OCR. If you need
Windows OCR in production today, install Tesseract via `vcpkg install
tesseract leptonica --triplet x64-windows`, set `LIBCLANG_PATH`, and
build with `cargo build --release -p spdf-cli --features
spdf-cli/tesseract`. The [HTTP OCR backend](#cli) (`--ocr-server-url`)
works on every platform and is the recommended option for Windows until
we cut a proper MSVC-native build.
## Quick start
```sh
# Plain text with preserved layout
spdf parse invoice.pdf --no-ocr --format text
# Structured JSON with per-item bounding boxes
spdf parse invoice.pdf --no-ocr --format json > out.json
# OCR-only mode for scanned PDFs
spdf parse scan.pdf --ocr-language eng
# Use an external OCR server (PaddleOCR, EasyOCR, etc.)
spdf parse scan.pdf --ocr-server-url http://localhost:8000
# Parse specific pages only
spdf parse book.pdf --target-pages 1-3,7,12-15
# Dump pages as PNGs
spdf screenshot report.pdf -o ./pages --dpi 200
# Batch-convert a directory of PDFs
spdf batch-parse ./inputs ./outputs --format text
```
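A page selector such as `1-3,7,12-15` expands to a set of page numbers. The sketch below shows one way to parse that syntax. It is illustrative only: `parse_page_spec` and `parse_page` are hypothetical names, and spdf's own parser may treat edge cases (descending ranges, page 0, duplicates) differently.

```rust
use std::collections::BTreeSet;

/// Expand a spec like "1-3,7,12-15" into a sorted, de-duplicated set of
/// 1-based page numbers. Rejects empty parts, page 0, and descending ranges.
fn parse_page_spec(spec: &str) -> Result<BTreeSet<u32>, String> {
    let mut pages = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range: {part}"));
        }
        pages.extend(start..=end);
    }
    Ok(pages)
}

/// Parse a single 1-based page number.
fn parse_page(s: &str) -> Result<u32, String> {
    match s.trim().parse::<u32>() {
        Ok(0) | Err(_) => Err(format!("invalid page number: {s:?}")),
        Ok(n) => Ok(n),
    }
}
```

A `BTreeSet` keeps the pages sorted and collapses overlaps, so `1-3,2-4` yields pages 1 through 4 once each.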
## Library usage
```rust
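use spdf_core::LiteParse;
use spdf_types::ParseConfig;

// Assumes the parser's error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let parser = LiteParse::new(ParseConfig {
        ocr_enabled: false, // born-digital PDF, skip OCR
        ..Default::default()
    });
    let result = parser.parse_path("invoice.pdf")?;
    for page in &result.pages {
        println!("--- page {} ---\n{}", page.page_num, page.text);
    }
    Ok(())
}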
```
## Architecture
```
crates/
  spdf-types/        public schema
  spdf-processing/   text / geometry / markup helpers
  spdf-projection/   spatial reconstruction (the crown jewel)
  spdf-pdf/          PdfEngine trait + PDFium impl
  spdf-ocr/          OcrEngine trait + Tesseract + HTTP impls
  spdf-convert/      LibreOffice / ImageMagick shell-outs
  spdf-output/       JSON + text formatters
  spdf-core/         orchestrator
  spdf-cli/          spdf binary
  spdf-ffi/          C ABI cdylib
xtask/               parity harness, benches, pdfium fetcher
```
See [AGENTS.md](AGENTS.md) for the full crate map and
[CONTRIBUTING.md](CONTRIBUTING.md) for development workflow.
## Roadmap
- Node bindings (`@spdf/node`) on top of `spdf-ffi`
- Python bindings via PyO3
- `spdf serve` — a local HTTP parsing service
- Optional ML-based reading-order classifier (opt-in, `burn` feature flag)
## Acknowledgements
spdf is an independent Rust project authored by
[Fanaperana](https://github.com/Fanaperana). The spatial projection
algorithm was inspired by (and is benchmarked against)
[LiteParse](https://github.com/run-llama/liteparse), but spdf is not a
port or rewrite — it's its own implementation, with its own engine
choices (PDFium + Tesseract), its own data model, and its own hardening
work. Rendering is powered by
[PDFium](https://pdfium.googlesource.com/pdfium/); OCR uses
[Tesseract](https://github.com/tesseract-ocr/tesseract).
## License
[MIT](LICENSE) © 2026 spdf contributors.