spdf-types 0.2.0-alpha.1

Core types for the spdf workspace: TextItem, ParsedPage, ParseResult, ParseConfig.
<div align="center">

# spdf

**Fast, spatial PDF parsing in Rust.**

Extract text with preserved columns, tables, and layout — plus optional OCR
for scans, format conversion for Office docs, and a single self-contained
binary.

[![CI](https://github.com/Fanaperana/spdf/actions/workflows/ci.yml/badge.svg)](https://github.com/Fanaperana/spdf/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.85%2B-orange.svg)](rust-toolchain.toml)
[![Platforms](https://img.shields.io/badge/platforms-macOS%20%7C%20Linux%20%7C%20Windows-lightgrey.svg)](#install)

</div>

---

## Why spdf

Most PDF-to-text tools collapse whitespace, shuffle columns, and emit one
giant line salad — fine for search indexing, useless for anything that cares
about *where* things appear on the page (invoices, tax bills, property
records, scientific tables, legal forms).

`spdf` keeps the geometry:

- **Column-aware projection** — tables, two-column layouts, sidebars, and
  indented blocks come back in reading order with their spatial structure
  intact.
- **Faux-bold & shadow dedup** — PDFs that "draw text twice" to simulate
  bold no longer produce `TaTax Infofo`; you get `Tax Info`.
- **Word reconstruction** — PDFium-style per-glyph extraction is stitched
  back into words (`1 8 3 6` → `1836`) using a liteparse-compatible merge
  heuristic.
- **QR / barcode / microprint filtering** — the hundreds of tiny numeric
  glyphs that encode a QR code are auto-dropped so they don't destroy the
  surrounding table.
- **Optional OCR** — Tesseract locally, or any HTTP OCR server (PaddleOCR,
  EasyOCR, etc.) for image-only pages. One flag to turn it off when you
  know the PDF is born-digital.
- **Format conversion** — Office docs via LibreOffice shell-out, images via
  ImageMagick, all behind the same CLI.
- **One static binary.** Install PDFium once, ship `spdf` anywhere.
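
Two of the bullets above (shadow dedup and word reconstruction) can be sketched in a few lines. This is only an illustration under stated assumptions, not spdf's actual heuristic: the `Fragment` type, the `frag` helper, the function names, and the tolerance parameters are all hypothetical, and the input is assumed to arrive already in left-to-right reading order.

```rust
/// A positioned text fragment (hypothetical, simplified stand-in type).
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    x: f32,     // left edge, in PDF points
    y: f32,     // baseline, in PDF points
    width: f32, // horizontal extent, in PDF points
}

/// Small constructor to keep call sites short.
fn frag(text: &str, x: f32, y: f32, width: f32) -> Fragment {
    Fragment { text: text.to_string(), x, y, width }
}

/// Drop a fragment when an already-kept fragment has the same text and
/// starts within `tol` points of it (the "drawn twice" faux-bold pattern).
fn dedup_shadows(items: &[Fragment], tol: f32) -> Vec<Fragment> {
    let mut kept: Vec<Fragment> = Vec::new();
    for item in items {
        let is_shadow = kept.iter().any(|k| {
            k.text == item.text
                && (k.x - item.x).abs() <= tol
                && (k.y - item.y).abs() <= tol
        });
        if !is_shadow {
            kept.push(item.clone());
        }
    }
    kept
}

/// Append a fragment to the previous word when both share a baseline
/// (within `line_tol`) and the horizontal gap is at most `max_gap` points.
/// Assumes `items` is already sorted in reading order.
fn merge_words(items: &[Fragment], max_gap: f32, line_tol: f32) -> Vec<Fragment> {
    let mut words: Vec<Fragment> = Vec::new();
    for item in items {
        if let Some(last) = words.last_mut() {
            let gap = item.x - (last.x + last.width);
            let same_line = (item.y - last.y).abs() <= line_tol;
            if same_line && gap.abs() <= max_gap {
                last.text.push_str(&item.text);
                // Stretch the word so it ends where the new fragment ends.
                last.width = item.x + item.width - last.x;
                continue;
            }
        }
        words.push(item.clone());
    }
    words
}
```

With glyphs `1`, `8`, `3`, `6` placed 1 pt apart and `max_gap = 2.0`, `merge_words` joins them into the single word `1836`, while a glyph 17 pt further right stays separate.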

## Comparison

Benchmarked on two real-world U.S. county tax documents (TAX_APPEAL_MOM
set) with `--no-ocr`. Token-level F1 measured against the same documents
parsed through [LiteParse](https://github.com/run-llama/liteparse) using the
provided reference outputs in [tests/parity/](tests/parity/).

| Feature                              | spdf (this project) | LiteParse  | pdftotext  | pypdfium2  |
| ------------------------------------ | :-----------------: | :--------: | :--------: | :--------: |
| Language                             | Rust                | TypeScript | C++        | C++/Python |
| Single static binary                 || ❌ (Node)  |||
| Column-aware text projection         ||| partial    ||
| Faux-bold shadow dedup               |||||
| QR / microprint filter               |||||
| OCR fallback (Tesseract + HTTP)      |||||
| Office-format conversion             |||||
| Batch mode                           |||||
| JSON output with per-item bboxes     |||| partial    |
| C ABI FFI crate                      |||||
| **Token F1 vs LiteParse (tax bill)** | **0.990**           | 1.000      | ~0.82      | ~0.80      |
| **Token F1 vs LiteParse (PRC)**      | **0.922**           | 1.000      | ~0.75      | ~0.78      |
| **Startup time (cold)**              | ~25 ms              | ~450 ms    | ~10 ms     | ~120 ms    |

Parity harness and golden outputs live in [tests/parity/](tests/parity/);
run `python3 tests/parity/compare.py` to reproduce.

## Benchmark

Reproducible head-to-head against [LiteParse](https://github.com/run-llama/liteparse)
on the fixtures in [example/](example/). Ground truth is raw
`tesseract <image> - -l eng` (PDFs rendered with `pdftoppm -r 150` first);
tokens are compared case-insensitively as a multiset.
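
For readers who want to sanity-check the scoring, here is a minimal sketch of case-insensitive multiset precision, recall, and F1. It assumes tokens are split on whitespace, which may differ from what `benchmark/run.sh` actually does; the function names are hypothetical.

```rust
use std::collections::HashMap;

/// Count lower-cased, whitespace-separated tokens (a multiset).
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Returns `(precision, recall, f1)` for `predicted` against `truth`.
/// A token occurring n times in one text and m times in the other
/// contributes min(n, m) to the overlap.
fn token_prf(predicted: &str, truth: &str) -> (f64, f64, f64) {
    let pred = token_counts(predicted);
    let gold = token_counts(truth);

    let overlap: usize = pred
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    let pred_total: usize = pred.values().sum();
    let gold_total: usize = gold.values().sum();

    let precision = if pred_total == 0 { 0.0 } else { overlap as f64 / pred_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { overlap as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

For example, scoring `"Tax Info tax"` against `"TAX info total"` gives an overlap of 2 out of 3 tokens on each side, so precision, recall, and F1 all come out at 2/3.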

> This is a like-for-like comparison, not a knock on LiteParse. LiteParse
> is the project spdf was designed against — we owe the original authors
> for the reference implementation of the spatial projection algorithm,
> and the goal of publishing these numbers is transparency about where
> spdf stands, not to make the TypeScript original look bad.

<!-- BENCHMARK:BEGIN -->
<!-- generated by benchmark/run.sh — do not edit by hand -->

| fixture | engine | wall-clock | tokens | recall | precision | F1 |
|---|---|---:|---:|---:|---:|---:|
| irs-f1040.pdf | spdf | 268 ms | 1094 | 63.8% | 86.5% | 73.4% |
| irs-f1040.pdf | **liteparse** | **541 ms** | **1575** | **81.8%** | **77.0%** | **79.4%** |
| irs-fw9-p1-2.pdf | spdf | 76 ms | 2253 | 99.1% | 98.4% | 98.7% |
| irs-fw9-p1-2.pdf | **liteparse** | **465 ms** | **2253** | **99.1%** | **98.4%** | **98.8%** |
| nist-sp-800-53r5-p1-2.pdf | spdf | 17 ms | 96 | 82.5% | 97.9% | 89.5% |
| nist-sp-800-53r5-p1-2.pdf | **liteparse** | **461 ms** | **94** | **82.5%** | **100.0%** | **90.4%** |
| nist-sp-800-63b-p1-2.pdf | spdf | 969 ms | 222 | 93.5% | 96.8% | 95.1% |
| nist-sp-800-63b-p1-2.pdf | **liteparse** | **5530 ms** | **226** | **95.2%** | **96.9%** | **96.1%** |
| rfc8446-p1-2.pdf | **spdf** | **20 ms** | **399** | **99.5%** | **99.7%** | **99.6%** |
| rfc8446-p1-2.pdf | liteparse | 375 ms | 399 | 99.5% | 99.7% | 99.6% |
| rfc9110-p1-2.pdf | **spdf** | **235 ms** | **8** | **0.0%** | **0.0%** | **0.0%** |
| rfc9110-p1-2.pdf | liteparse | 3101 ms | 8 | 0.0% | 0.0% | 0.0% |
| example-1.jpg | **spdf** | **1067 ms** | **231** | **82.0%** | **96.5%** | **88.7%** |
| example-1.jpg | liteparse | 7070 ms | 146 | 42.3% | 78.8% | 55.0% |
| test-ocr.pdf | **spdf** | **274 ms** | **20** | **100.0%** | **100.0%** | **100.0%** |
| test-ocr.pdf | liteparse | 3212 ms | 20 | 100.0% | 100.0% | 100.0% |

**Mean over fixtures:** spdf **F1 80.6%** in **366 ms**; liteparse F1 77.4% in 2594 ms.

<!-- BENCHMARK:END -->

### Spatial precision

Token accuracy alone doesn't tell you whether an engine put each word in
the *right place on the page* — which is the whole point of a spatial
parser. We also compare every matched word's bounding box against the
raw-tesseract ground truth and report mean IoU, the fraction of matches
that clear IoU ≥ 0.5 (the COCO-style "well localised" bar), and the
mean centroid error in PDF points.
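
The three metrics are straightforward to compute once words have been matched. The sketch below is illustrative only and is not taken from `benchmark/spatial.py`; the `BBox` type and function names are hypothetical, boxes are assumed to be axis-aligned with `x0 <= x1` and `y0 <= y1`, and an empty input is reported as all zeros.

```rust
/// Axis-aligned bounding box in PDF points (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0) * (self.y1 - self.y0)
    }

    fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Intersection over union: overlap area divided by combined area.
fn iou(a: &BBox, b: &BBox) -> f64 {
    let iw = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let ih = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = iw * ih;
    let union = a.area() + b.area() - inter;
    if union <= 0.0 { 0.0 } else { inter / union }
}

/// Euclidean distance between the two box centres.
fn centroid_error(a: &BBox, b: &BBox) -> f64 {
    let (ax, ay) = a.centroid();
    let (bx, by) = b.centroid();
    (ax - bx).hypot(ay - by)
}

/// Returns `(mean IoU, fraction with IoU >= 0.5, mean centroid error)`
/// over matched `(predicted, ground truth)` pairs.
fn summarise(pairs: &[(BBox, BBox)]) -> (f64, f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    let n = pairs.len() as f64;
    let mut iou_sum = 0.0;
    let mut hits = 0usize;
    let mut err_sum = 0.0;
    for (pred, truth) in pairs {
        let v = iou(pred, truth);
        iou_sum += v;
        if v >= 0.5 {
            hits += 1;
        }
        err_sum += centroid_error(pred, truth);
    }
    (iou_sum / n, hits as f64 / n, err_sum / n)
}
```

Two 10×10 boxes offset horizontally by 5 pt overlap in a 5×10 strip, so their IoU is 50 / 150 = 1/3 and their centroid error is 5 pt; that pair would fall below the 0.5 bar.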

<!-- SPATIAL:BEGIN -->
<!-- generated by benchmark/spatial.py — do not edit by hand -->

| fixture | engine | matched | mean IoU | IoU≥0.5 | centroid err |
|---|---|---:|---:|---:|---:|
| example-1.jpg | **spdf** | **212** | **0.976** | **97.6%** | **4.50 pt** |
| example-1.jpg | liteparse | 109 | 0.667 | 67.9% | 28.03 pt |
| test-ocr.pdf | spdf | 5 | 0.952 | 100.0% | 0.64 pt |
| test-ocr.pdf | **liteparse** | **4** | **0.957** | **100.0%** | **0.54 pt** |
| irs-f1040.pdf | **spdf** | **115** | **0.476** | **55.7%** | **97.90 pt** |
| irs-f1040.pdf | liteparse | 84 | 0.351 | 52.4% | 135.73 pt |
| irs-fw9-p1-2.pdf | **spdf** | **29** | **0.517** | **58.6%** | **169.21 pt** |
| irs-fw9-p1-2.pdf | liteparse | 28 | 0.348 | 53.6% | 175.61 pt |
| nist-sp-800-53r5-p1-2.pdf | **spdf** | **3** | **0.964** | **100.0%** | **0.35 pt** |
| nist-sp-800-53r5-p1-2.pdf | liteparse | 1 | 0.634 | 100.0% | 2.01 pt |
| nist-sp-800-63b-p1-2.pdf | **spdf** | **14** | **0.678** | **78.6%** | **84.12 pt** |
| nist-sp-800-63b-p1-2.pdf | liteparse | 20 | 0.471 | 65.0% | 103.50 pt |
| rfc8446-p1-2.pdf | **spdf** | **1** | **0.869** | **100.0%** | **0.44 pt** |
| rfc8446-p1-2.pdf | liteparse | 4 | 0.427 | 50.0% | 171.02 pt |
| rfc9110-p1-2.pdf | **spdf** | **0** | **0.000** | **0.0%** | **0.00 pt** |
| rfc9110-p1-2.pdf | liteparse | 0 | 0.000 | 0.0% | 0.00 pt |

**Mean over fixtures:** spdf **mean IoU 0.679**, **73.8%** of matches ≥ 0.5, centroid error **44.64 pt**; liteparse 0.482 / 61.1% / 77.05 pt.

<!-- SPATIAL:END -->

Per-fixture raw outputs are committed under [benchmark/outputs/](benchmark/outputs/)
so the numbers are auditable. Reproduce on your own machine:

```sh
make build-ocr   # or `make install-ocr`
LITEPARSE_DIR=/path/to/liteparse make benchmark-update
```

## Production-readiness

spdf is pre-1.0. The table below tracks what we've hardened so you can
decide whether it fits your threat model; see [CHANGELOG.md](CHANGELOG.md)
for per-release detail.

| Area | Status |
| --- | --- |
| JSON output schema (byte-compatible with LiteParse) | stable (covered by parity harness) |
| Typed error enum at the public API (`SpdfError`) | stable |
| Benchmark corpus (public-domain: IRS, NIST, RFC, scanned image) | 5 fixtures in [example/](example/) |
| Property tests (`cargo test -p spdf-projection proptests`) | panic-freedom + shuffle-stability |
| Fuzz harness (`cargo +nightly fuzz run parse_pdf`) | [fuzz/](fuzz/); run before exposing to untrusted input |
| Cross-platform CI | Linux + macOS + Windows; MSRV 1.85; rustdoc warnings gated |
| Resource guards (`timeout_secs`, `max_input_bytes`, `max_pages`) | available via [`SpdfParser::builder`](crates/spdf-core/src/lib.rs) |
| Security policy | see [SECURITY.md](SECURITY.md) |
| CLI / Rust library API | best-effort stable, breaks noted in [CHANGELOG.md](CHANGELOG.md) |
| `spdf-ffi` C ABI | **unstable** — symbols may change across 0.x releases |
| crates.io / npm publication | not yet; install from source |

**Recommended posture** when parsing untrusted PDFs today:

```rust
use spdf_core::SpdfParser; // builder lives in crates/spdf-core/src/lib.rs

let parser = SpdfParser::builder()
    .timeout_secs(30)           // defensive deadline
    .max_input_bytes(50 << 20)  // 50 MiB input cap
    .max_pages(500)             // reject page-tree bombs
    .build();
```

Then wrap the process in a resource-capped sandbox (`systemd-run
--property=MemoryMax=1G`, Firejail, Docker `--memory=`). Follow the
full hardening checklist in [SECURITY.md](SECURITY.md).

## Install

### Prebuilt binaries

Self-contained tarballs (bundled `libpdfium`, no runtime deps) are
attached to each [GitHub release](https://github.com/Fanaperana/spdf/releases).
Download, extract, and run. OCR is not compiled into the prebuilt
binaries — use `--ocr-server-url` for HTTP OCR or `cargo install`
with `--features spdf-cli/tesseract` for a local Tesseract build.

| Target | Tarball | Status |
| --- | --- | :---: |
| `x86_64-unknown-linux-gnu` | `spdf-<version>-x86_64-unknown-linux-gnu.tar.gz` | ✅ attached to v0.2.0-alpha.1 |
| `aarch64-unknown-linux-gnu` | `spdf-<version>-aarch64-unknown-linux-gnu.tar.gz` | ⬜ TODO |
| `x86_64-apple-darwin` | `spdf-<version>-x86_64-apple-darwin.tar.gz` | ⬜ TODO — build on macOS Intel |
| `aarch64-apple-darwin` | `spdf-<version>-aarch64-apple-darwin.tar.gz` | ⬜ TODO — build on Apple Silicon |
| `x86_64-pc-windows-msvc` | `spdf-<version>-x86_64-pc-windows-msvc.zip` | ⬜ TODO — build on Windows |

To produce a release tarball on a new host:

```sh
cargo build --release -p spdf-cli
VER=0.2.0-alpha.1
TARGET=$(rustc -vV | awk '/^host:/ {print $2}')
DIR="spdf-${VER}-${TARGET}"
mkdir -p "dist/${DIR}"
cp target/release/spdf "dist/${DIR}/"          # use spdf.exe on Windows
cp LICENSE README.md CHANGELOG.md "dist/${DIR}/"
tar czf "dist/${DIR}.tar.gz" -C dist "${DIR}"   # or zip on Windows
sha256sum "dist/${DIR}.tar.gz" > "dist/${DIR}.tar.gz.sha256"   # macOS: shasum -a 256
gh release upload "v${VER}" "dist/${DIR}.tar.gz" "dist/${DIR}.tar.gz.sha256"
```

### From source

```sh
# install the CLI (requires Rust 1.85+)
cargo install --path crates/spdf-cli

# or build locally
cargo build --release -p spdf-cli
./target/release/spdf --help
```

### Runtime dependency: PDFium

`spdf` dynamically loads a PDFium shared library. On macOS:

```sh
brew install pdfium
```

Or download a prebuilt binary from
[bblanchon/pdfium-binaries](https://github.com/bblanchon/pdfium-binaries/releases)
and point `PDFIUM_LIB_PATH` at it.

### Platform support matrix

| Platform | Core parsing | OCR (Tesseract) | Notes |
| --- | :---: | :---: | --- |
| Linux x86_64 ||| primary development target |
| macOS (Intel + Apple Silicon) ||| requires `brew install tesseract` |
| Windows x86_64 || ⚠️ source-build only | see below |

**Windows OCR caveat.** The `tesseract` Rust crate used by `spdf-ocr`
links against `libtesseract` + `libleptonica` via `bindgen`, which needs
a working C toolchain (clang) and a `vcpkg` or manually-installed
Tesseract/Leptonica. The CI matrix builds spdf on Windows **without**
the `tesseract` feature; the Linux/macOS jobs cover OCR. If you need
Windows OCR in production today, install Tesseract via `vcpkg install
tesseract leptonica --triplet x64-windows`, set `LIBCLANG_PATH`, and
build with `cargo build --release -p spdf-cli --features
spdf-cli/tesseract`. The [HTTP OCR backend](#cli) (`--ocr-server-url`)
works on every platform and is the recommended option for Windows until
we cut a proper MSVC-native build.

## Quick start

```sh
# Plain text with preserved layout
spdf parse invoice.pdf --no-ocr --format text

# Structured JSON with per-item bounding boxes
spdf parse invoice.pdf --no-ocr --format json > out.json

# OCR scanned PDFs with the English language model
spdf parse scan.pdf --ocr-language eng

# Use an external OCR server (PaddleOCR, EasyOCR, etc.)
spdf parse scan.pdf --ocr-server-url http://localhost:8000

# Parse specific pages only
spdf parse book.pdf --target-pages 1-3,7,12-15

# Dump pages as PNGs
spdf screenshot report.pdf -o ./pages --dpi 200

# Batch-convert a directory of PDFs
spdf batch-parse ./inputs ./outputs --format text
```
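
The `--target-pages 1-3,7,12-15` syntax is a comma-separated list of single pages and ranges. As a rough illustration of how such a spec can be expanded, here is a sketch that assumes 1-based page numbers and inclusive ranges; spdf's real parser may validate or order pages differently, and both function names are hypothetical.

```rust
/// Parse one page number; rejects non-numeric input and page 0.
fn parse_page(s: &str) -> Result<u32, String> {
    let n: u32 = s
        .trim()
        .parse()
        .map_err(|_| format!("invalid page number: {s:?}"))?;
    if n == 0 {
        Err("page numbers are 1-based".to_string())
    } else {
        Ok(n)
    }
}

/// Expand a spec such as "1-3,7,12-15" into a sorted, de-duplicated
/// list of page numbers. Ranges are inclusive on both ends.
fn parse_target_pages(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        if part.is_empty() {
            return Err("empty page entry".to_string());
        }
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range: {part}"));
        }
        pages.extend(start..=end);
    }
    pages.sort_unstable();
    pages.dedup();
    Ok(pages)
}
```

Under these assumptions, `parse_target_pages("1-3,7,12-15")` yields pages 1, 2, 3, 7, 12, 13, 14, and 15, while `"3-1"` and `"0"` are rejected. A production version would also want to cap the range size against the document's page count.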

## Library usage

```rust
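use spdf_core::LiteParse;
use spdf_types::ParseConfig;

// `?` needs a fallible context; boxing the error assumes the parser's
// error type implements `std::error::Error`.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Skip OCR for born-digital PDFs; keep every other setting at its default.
    let parser = LiteParse::new(ParseConfig {
        ocr_enabled: false,
        ..Default::default()
    });
    let result = parser.parse_path("invoice.pdf")?;
    for page in &result.pages {
        println!("--- page {} ---\n{}", page.page_num, page.text);
    }
    Ok(())
}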
```

## Architecture

```
crates/
  spdf-types/        public schema
  spdf-processing/   text / geometry / markup helpers
  spdf-projection/   spatial reconstruction (the crown jewel)
  spdf-pdf/          PdfEngine trait + PDFium impl
  spdf-ocr/          OcrEngine trait + Tesseract + HTTP impls
  spdf-convert/      LibreOffice / ImageMagick shell-outs
  spdf-output/       JSON + text formatters
  spdf-core/         orchestrator
  spdf-cli/          spdf binary
  spdf-ffi/          C ABI cdylib
xtask/               parity harness, benches, pdfium fetcher
```

See [AGENTS.md](AGENTS.md) for the full crate map and
[CONTRIBUTING.md](CONTRIBUTING.md) for development workflow.

## Roadmap

- Node bindings (`@spdf/node`) on top of `spdf-ffi`
- Python bindings via PyO3
- `spdf serve` — a local HTTP parsing service
- Optional ML-based reading-order classifier (opt-in, `burn` feature flag)

## Acknowledgements

spdf is an independent Rust project authored by
[Fanaperana](https://github.com/Fanaperana). The spatial projection
algorithm was inspired by (and is benchmarked against)
[LiteParse](https://github.com/run-llama/liteparse), but spdf is not a
port or rewrite — it's its own implementation, with its own engine
choices (PDFium + Tesseract), its own data model, and its own hardening
work. Rendering is powered by
[PDFium](https://pdfium.googlesource.com/pdfium/); OCR uses
[Tesseract](https://github.com/tesseract-ocr/tesseract).

## License

[MIT](LICENSE) © 2026 spdf contributors.