spdf-types 0.2.0-alpha.2

Core types for the spdf workspace: TextItem, ParsedPage, ParseResult, ParseConfig.
<div align="center">

# spdf

**Fast, spatial PDF parsing in Rust.**

Extract text with preserved columns, tables, and layout — plus optional OCR
for scans, format conversion for Office docs, and a single self-contained
binary.

[![CI](https://github.com/Fanaperana/spdf/actions/workflows/ci.yml/badge.svg)](https://github.com/Fanaperana/spdf/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.85%2B-orange.svg)](rust-toolchain.toml)
[![Platforms](https://img.shields.io/badge/platforms-macOS%20%7C%20Linux%20%7C%20Windows-lightgrey.svg)](#install)

</div>

---

## Why spdf

Most PDF-to-text tools collapse whitespace, shuffle columns, and emit one
giant line salad — fine for search indexing, useless for anything that cares
about *where* things appear on the page (invoices, tax bills, property
records, scientific tables, legal forms).

`spdf` keeps the geometry:

- **Column-aware projection** — tables, two-column layouts, sidebars, and
  indented blocks come back in reading order with their spatial structure
  intact.
- **Faux-bold & shadow dedup** — PDFs that "draw text twice" to simulate
  bold no longer produce `TaTax Infofo`; you get `Tax Info`.
- **Word reconstruction** — PDFium-style per-glyph extraction is stitched
  back into words (`1 8 3 6` → `1836`) using a liteparse-compatible merge
  heuristic.
- **QR / barcode / microprint filtering** — the hundreds of tiny numeric
  glyphs that encode a QR code are auto-dropped so they don't destroy the
  surrounding table.
- **Optional OCR** — Tesseract locally, or any HTTP OCR server (PaddleOCR,
  EasyOCR, etc.) for image-only pages. One flag to turn it off when you
  know the PDF is born-digital.
- **Format conversion** — Office docs via LibreOffice shell-out, images via
  ImageMagick, all behind the same CLI.
- **One static binary.** Install PDFium once, ship `spdf` anywhere.
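
To make the faux-bold dedup and word-reconstruction bullets concrete, here is a minimal sketch of both ideas. The `Glyph` struct, the `dedup_shadows` and `merge_glyphs` functions, and both thresholds are assumptions invented for this illustration; they are not spdf's actual types or its real merge heuristic.

```rust
/// Hypothetical glyph record: one character, its left edge, baseline, and width (points).
#[derive(Debug, Clone, PartialEq)]
struct Glyph {
    ch: char,
    x: f32,
    y: f32,
    width: f32,
}

/// Drop a glyph when the same character was already kept at (almost) the same
/// position, which is how "draw text twice" faux-bold shows up in a glyph stream.
fn dedup_shadows(glyphs: &[Glyph], tolerance: f32) -> Vec<Glyph> {
    let mut kept: Vec<Glyph> = Vec::new();
    for g in glyphs {
        let is_shadow = kept.iter().any(|k| {
            k.ch == g.ch && (k.x - g.x).abs() <= tolerance && (k.y - g.y).abs() <= tolerance
        });
        if !is_shadow {
            kept.push(g.clone());
        }
    }
    kept
}

/// Join the glyphs of one text line into words: a new word starts when the gap
/// to the previous glyph exceeds `gap_ratio` times that glyph's width.
fn merge_glyphs(glyphs: &[Glyph], gap_ratio: f32) -> Vec<String> {
    let mut sorted = glyphs.to_vec();
    sorted.sort_by(|a, b| a.x.total_cmp(&b.x));

    let mut words: Vec<String> = Vec::new();
    let mut prev: Option<&Glyph> = None;
    for g in &sorted {
        let starts_word = match prev {
            Some(p) => g.x - (p.x + p.width) > gap_ratio * p.width,
            None => true,
        };
        if starts_word {
            words.push(String::new());
        }
        if let Some(word) = words.last_mut() {
            word.push(g.ch);
        }
        prev = Some(g);
    }
    words
}
```

Running `dedup_shadows` before `merge_glyphs` matters in this sketch: a shadow glyph sits almost on top of its original, so merging first would duplicate the character inside the word.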

## Comparison

Benchmarked on two real-world U.S. county tax documents (TAX_APPEAL_MOM
set) with `--no-ocr`. Token-level F1 measured against the same documents
parsed through [LiteParse](https://github.com/run-llama/liteparse) using the
provided reference outputs in [tests/parity/](tests/parity/).

| Feature                              | spdf (this project) | LiteParse  | pdftotext  | pypdfium2  |
| ------------------------------------ | :-----------------: | :--------: | :--------: | :--------: |
| Language                             | Rust                | TypeScript | C++        | C++/Python |
| Single static binary                 || ❌ (Node)  |||
| Column-aware text projection         ||| partial    ||
| Faux-bold shadow dedup               |||||
| QR / microprint filter               |||||
| OCR fallback (Tesseract + HTTP)      |||||
| Office-format conversion             |||||
| Batch mode                           |||||
| JSON output with per-item bboxes     |||| partial    |
| C ABI FFI crate                      |||||
| **Token F1 vs LiteParse (tax bill)** | **0.990**           | 1.000      | ~0.82      | ~0.80      |
| **Token F1 vs LiteParse (PRC)**      | **0.922**           | 1.000      | ~0.75      | ~0.78      |
| **Startup time (cold)**              | ~25 ms              | ~450 ms    | ~10 ms     | ~120 ms    |

Parity harness and golden outputs live in [tests/parity/](tests/parity/);
run `python3 tests/parity/compare.py` to reproduce.

## Benchmark

Reproducible head-to-head against [LiteParse](https://github.com/run-llama/liteparse)
on the fixtures in [example/](example/). Ground truth is raw
`tesseract <image> - -l eng` (PDFs rendered with `pdftoppm -r 150` first);
tokens are compared case-insensitively as a multiset.
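
For readers who want to check the recall, precision, and F1 columns by hand, the case-insensitive multiset comparison can be sketched as follows. This is an illustration only: the function names are made up here, and splitting on whitespace is an assumption about tokenization, not a description of what `benchmark/run.sh` does.

```rust
use std::collections::HashMap;

/// Count lowercased, whitespace-separated tokens (a multiset).
fn token_counts(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for tok in text.split_whitespace() {
        *counts.entry(tok.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

/// Returns (precision, recall, F1) of `predicted` against `truth`.
fn token_prf(predicted: &str, truth: &str) -> (f64, f64, f64) {
    let pred = token_counts(predicted);
    let gold = token_counts(truth);

    // Multiset intersection: a token matches at most min(count) times.
    let matched: usize = pred
        .iter()
        .map(|(tok, &n)| n.min(gold.get(tok).copied().unwrap_or(0)))
        .sum();
    let pred_total: usize = pred.values().sum();
    let gold_total: usize = gold.values().sum();

    let precision = if pred_total == 0 { 0.0 } else { matched as f64 / pred_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { matched as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

Because the comparison is a multiset, a token that appears twice in the ground truth only counts twice if the engine also emitted it twice; word order is ignored entirely.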

> This is a like-for-like comparison, not a knock on LiteParse. LiteParse
> is the project spdf was designed against — we owe the original authors
> for the reference implementation of the spatial projection algorithm,
> and the goal of publishing these numbers is transparency about where
> the Rust port stands, not to make the TypeScript original look bad.

<!-- BENCHMARK:BEGIN -->
<!-- generated by benchmark/run.sh — do not edit by hand -->

| fixture | engine | wall-clock | tokens | recall | precision | F1 |
|---|---|---:|---:|---:|---:|---:|
| irs-f1040.pdf | spdf | 268 ms | 1094 | 63.8% | 86.5% | 73.4% |
| irs-f1040.pdf | **liteparse** | **541 ms** | **1575** | **81.8%** | **77.0%** | **79.4%** |
| irs-fw9-p1-2.pdf | spdf | 76 ms | 2253 | 99.1% | 98.4% | 98.7% |
| irs-fw9-p1-2.pdf | **liteparse** | **465 ms** | **2253** | **99.1%** | **98.4%** | **98.8%** |
| nist-sp-800-53r5-p1-2.pdf | spdf | 17 ms | 96 | 82.5% | 97.9% | 89.5% |
| nist-sp-800-53r5-p1-2.pdf | **liteparse** | **461 ms** | **94** | **82.5%** | **100.0%** | **90.4%** |
| nist-sp-800-63b-p1-2.pdf | spdf | 969 ms | 222 | 93.5% | 96.8% | 95.1% |
| nist-sp-800-63b-p1-2.pdf | **liteparse** | **5530 ms** | **226** | **95.2%** | **96.9%** | **96.1%** |
| rfc8446-p1-2.pdf | **spdf** | **20 ms** | **399** | **99.5%** | **99.7%** | **99.6%** |
| rfc8446-p1-2.pdf | liteparse | 375 ms | 399 | 99.5% | 99.7% | 99.6% |
| rfc9110-p1-2.pdf | **spdf** | **235 ms** | **8** | **0.0%** | **0.0%** | **0.0%** |
| rfc9110-p1-2.pdf | liteparse | 3101 ms | 8 | 0.0% | 0.0% | 0.0% |
| example-1.jpg | **spdf** | **1067 ms** | **231** | **82.0%** | **96.5%** | **88.7%** |
| example-1.jpg | liteparse | 7070 ms | 146 | 42.3% | 78.8% | 55.0% |
| test-ocr.pdf | **spdf** | **274 ms** | **20** | **100.0%** | **100.0%** | **100.0%** |
| test-ocr.pdf | liteparse | 3212 ms | 20 | 100.0% | 100.0% | 100.0% |

**Mean over fixtures:** spdf **F1 80.6%** in **366 ms**; liteparse F1 77.4% in 2594 ms.

<!-- BENCHMARK:END -->

### Spatial precision

Token accuracy alone doesn't tell you whether an engine put each word in
the *right place on the page* — which is the whole point of a spatial
parser. We also compare every matched word's bounding box against the
raw-tesseract ground truth and report mean IoU, the fraction of matches
that clear IoU ≥ 0.5 (the COCO-style "well localised" bar), and the
mean centroid error in PDF points.
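
The three spatial metrics can be sketched in a few lines. The `BBox` type and function names below are assumptions for illustration and are not taken from `benchmark/spatial.py`; the sketch also assumes that matching words between engine output and ground truth has already happened.

```rust
/// Axis-aligned box in PDF points, with x0 <= x1 and y0 <= y1.
#[derive(Debug, Clone, Copy)]
struct BBox {
    x0: f64,
    y0: f64,
    x1: f64,
    y1: f64,
}

impl BBox {
    fn area(&self) -> f64 {
        (self.x1 - self.x0).max(0.0) * (self.y1 - self.y0).max(0.0)
    }

    fn centroid(&self) -> (f64, f64) {
        ((self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0)
    }
}

/// Intersection over union of two boxes; 0.0 when they do not overlap.
fn iou(a: &BBox, b: &BBox) -> f64 {
    let w = (a.x1.min(b.x1) - a.x0.max(b.x0)).max(0.0);
    let h = (a.y1.min(b.y1) - a.y0.max(b.y0)).max(0.0);
    let inter = w * h;
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Euclidean distance between the two box centres, in points.
fn centroid_error(a: &BBox, b: &BBox) -> f64 {
    let (ax, ay) = a.centroid();
    let (bx, by) = b.centroid();
    ((ax - bx).powi(2) + (ay - by).powi(2)).sqrt()
}

/// Summarise matched (engine, ground-truth) pairs as
/// (mean IoU, share of pairs with IoU >= 0.5, mean centroid error).
fn summarise(pairs: &[(BBox, BBox)]) -> (f64, f64, f64) {
    if pairs.is_empty() {
        return (0.0, 0.0, 0.0);
    }
    let n = pairs.len() as f64;
    let mean_iou = pairs.iter().map(|(a, b)| iou(a, b)).sum::<f64>() / n;
    let well_localised = pairs.iter().filter(|(a, b)| iou(a, b) >= 0.5).count() as f64 / n;
    let mean_err = pairs.iter().map(|(a, b)| centroid_error(a, b)).sum::<f64>() / n;
    (mean_iou, well_localised, mean_err)
}
```

Returning zeros for an empty match set is a choice made in this sketch; a fixture with no matched words then reports 0 for every metric rather than dividing by zero.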

<!-- SPATIAL:BEGIN -->
<!-- generated by benchmark/spatial.py — do not edit by hand -->

| fixture | engine | matched | mean IoU | IoU≥0.5 | centroid err |
|---|---|---:|---:|---:|---:|
| example-1.jpg | **spdf** | **212** | **0.976** | **97.6%** | **4.50 pt** |
| example-1.jpg | liteparse | 109 | 0.667 | 67.9% | 28.03 pt |
| test-ocr.pdf | spdf | 5 | 0.952 | 100.0% | 0.64 pt |
| test-ocr.pdf | **liteparse** | **4** | **0.957** | **100.0%** | **0.54 pt** |
| irs-f1040.pdf | **spdf** | **115** | **0.476** | **55.7%** | **97.90 pt** |
| irs-f1040.pdf | liteparse | 84 | 0.351 | 52.4% | 135.73 pt |
| irs-fw9-p1-2.pdf | **spdf** | **29** | **0.517** | **58.6%** | **169.21 pt** |
| irs-fw9-p1-2.pdf | liteparse | 28 | 0.348 | 53.6% | 175.61 pt |
| nist-sp-800-53r5-p1-2.pdf | **spdf** | **3** | **0.964** | **100.0%** | **0.35 pt** |
| nist-sp-800-53r5-p1-2.pdf | liteparse | 1 | 0.634 | 100.0% | 2.01 pt |
| nist-sp-800-63b-p1-2.pdf | **spdf** | **14** | **0.678** | **78.6%** | **84.12 pt** |
| nist-sp-800-63b-p1-2.pdf | liteparse | 20 | 0.471 | 65.0% | 103.50 pt |
| rfc8446-p1-2.pdf | **spdf** | **1** | **0.869** | **100.0%** | **0.44 pt** |
| rfc8446-p1-2.pdf | liteparse | 4 | 0.427 | 50.0% | 171.02 pt |
| rfc9110-p1-2.pdf | **spdf** | **0** | **0.000** | **0.0%** | **0.00 pt** |
| rfc9110-p1-2.pdf | liteparse | 0 | 0.000 | 0.0% | 0.00 pt |

**Mean over fixtures:** spdf **mean IoU 0.679**, **73.8%** of matches ≥ 0.5, centroid error **44.64 pt**; liteparse 0.482 / 61.1% / 77.05 pt.

<!-- SPATIAL:END -->

Per-fixture raw outputs are committed under [benchmark/outputs/](benchmark/outputs/)
so the numbers are auditable. Reproduce on your own machine:

```sh
make build-ocr   # or `make install-ocr`
LITEPARSE_DIR=/path/to/liteparse make benchmark-update
```

## Production-readiness

spdf is pre-1.0. The table below tracks what we've hardened so you can
decide whether it fits your threat model; see [CHANGELOG.md](CHANGELOG.md)
for per-release detail.

| Area | Status |
| --- | --- |
| JSON output schema (byte-compatible with LiteParse) | stable (covered by parity harness) |
| Typed error enum at the public API (`SpdfError`) | stable |
| Benchmark corpus (public-domain: IRS, NIST, RFC, scanned image) | 9 fixtures in [example/](example/) |
| Property tests (`cargo test -p spdf-projection proptests`) | panic-freedom + shuffle-stability |
| Fuzz harness (`cargo +nightly fuzz run parse_pdf`) | [fuzz/](fuzz/); run before exposing to untrusted input |
| Cross-platform CI | Linux + macOS + Windows; MSRV 1.85; rustdoc warnings gated |
| Resource guards (`timeout_secs`, `max_input_bytes`, `max_pages`) | available via [`SpdfParser::builder`](crates/spdf-core/src/lib.rs) |
| Security policy | see [SECURITY.md](SECURITY.md) |
| CLI / Rust library API | best-effort stable, breaks noted in [CHANGELOG.md](CHANGELOG.md) |
| `spdf-ffi` C ABI | **unstable** — symbols may change across 0.x releases |
| crates.io publication | [`spdf-core`](https://crates.io/crates/spdf-core), [`spdf-cli`](https://crates.io/crates/spdf-cli), and all sub-crates are on crates.io from 0.2.0-alpha.1 |

**Recommended posture** when parsing untrusted PDFs today:

```rust
use spdf_core::SpdfParser;

let parser = SpdfParser::builder()
    .timeout_secs(30)           // defensive deadline
    .max_input_bytes(50 << 20)  // 50 MiB input cap
    .max_pages(500)             // reject page-tree bombs
    .build();
```

Then wrap the process in a resource-capped sandbox (`systemd-run
--property=MemoryMax=1G`, Firejail, Docker `--memory=`). Follow the
full hardening checklist in [SECURITY.md](SECURITY.md).

## Install

### From crates.io

```sh
cargo install spdf-cli --version 0.2.0-alpha.2
# installs the `spdf` binary. `libpdfium` is downloaded and bundled
# at build time (no extra runtime deps). Add `--features tesseract`
# to compile in local Tesseract OCR.
```

### Prebuilt binaries

Self-contained tarballs (bundled `libpdfium`, no runtime deps) are
attached to each [GitHub release](https://github.com/Fanaperana/spdf/releases).
Download, extract, and run. OCR is not compiled into the prebuilt
binaries — use `--ocr-server-url` for HTTP OCR or `cargo install`
with `--features spdf-cli/tesseract` for a local Tesseract build.

| Target | Tarball | Status |
| --- | --- | :---: |
| `x86_64-unknown-linux-gnu` | `spdf-<version>-x86_64-unknown-linux-gnu.tar.gz` | ✅ attached to v0.2.0-alpha.2 |
| `aarch64-unknown-linux-gnu` | `spdf-<version>-aarch64-unknown-linux-gnu.tar.gz` | ⬜ TODO |
| `x86_64-apple-darwin` | `spdf-<version>-x86_64-apple-darwin.tar.gz` | ⬜ TODO — build on macOS Intel |
| `aarch64-apple-darwin` | `spdf-<version>-aarch64-apple-darwin.tar.gz` | ⬜ TODO — build on Apple Silicon |
| `x86_64-pc-windows-msvc` | `spdf-<version>-x86_64-pc-windows-msvc.zip` | ⬜ TODO — build on Windows |

To produce a release tarball on a new host:

```sh
cargo build --release -p spdf-cli
VER=0.2.0-alpha.2
TARGET=$(rustc -vV | awk '/^host:/ {print $2}')
DIR="spdf-${VER}-${TARGET}"
mkdir -p "dist/${DIR}"
cp target/release/spdf "dist/${DIR}/"          # use spdf.exe on Windows
cp LICENSE README.md CHANGELOG.md "dist/${DIR}/"
tar czf "dist/${DIR}.tar.gz" -C dist "${DIR}"   # or zip on Windows
sha256sum "dist/${DIR}.tar.gz" > "dist/${DIR}.tar.gz.sha256"
gh release upload v${VER} "dist/${DIR}.tar.gz" "dist/${DIR}.tar.gz.sha256"
```

### From source

```sh
# from source (requires Rust 1.85+)
cargo install --path crates/spdf-cli

# or build locally
cargo build --release -p spdf-cli
./target/release/spdf --help
```

### Runtime dependency: PDFium

`spdf` dynamically loads a PDFium shared library. On macOS:

```sh
brew install pdfium
```

Or download a prebuilt binary from
[bblanchon/pdfium-binaries](https://github.com/bblanchon/pdfium-binaries/releases)
and point `PDFIUM_LIB_PATH` at it.

### Platform support matrix

| Platform | Core parsing | OCR (Tesseract) | Notes |
| --- | :---: | :---: | --- |
| Linux x86_64 | ✅ | ✅ | primary development target |
| macOS (Intel + Apple Silicon) | ✅ | ✅ | requires `brew install tesseract` |
| Windows x86_64 | ✅ | ⚠️ source-build only | see below |

**Windows OCR caveat.** The `tesseract` Rust crate used by `spdf-ocr`
links against `libtesseract` + `libleptonica` via `bindgen`, which needs
a working C toolchain (clang) and a `vcpkg` or manually-installed
Tesseract/Leptonica. The CI matrix builds spdf on Windows **without**
the `tesseract` feature; the Linux/macOS jobs cover OCR. If you need
Windows OCR in production today, install Tesseract via `vcpkg install
tesseract leptonica --triplet x64-windows`, set `LIBCLANG_PATH`, and
build with `cargo build --release -p spdf-cli --features
spdf-cli/tesseract`. The [HTTP OCR backend](#cli) (`--ocr-server-url`)
works on every platform and is the recommended option for Windows until
we cut a proper MSVC-native build.

## Quick start

```sh
# Confirm the install
spdf --version

# Plain text with preserved layout
spdf parse invoice.pdf --no-ocr --format text

# Read from stdin (handy for piping through curl / aws s3 cp)
cat invoice.pdf | spdf parse - --no-ocr

# Structured JSON with per-glyph bounding boxes
spdf parse invoice.pdf --no-ocr --format json > out.json

# Password-protected PDFs
spdf parse confidential.pdf --password 'hunter2'

# Keep very-small glyphs (<2 pt) — useful for IRS-style form field labels
spdf parse irs-fw9.pdf --preserve-small-text

# OCR-only mode for scanned PDFs (local Tesseract; needs --features tesseract)
spdf parse scan.pdf --ocr-language eng

# Use an external OCR server (PaddleOCR, EasyOCR, etc.) — works on every platform
spdf parse scan.pdf --ocr-server-url http://localhost:8000

# Parse specific pages
spdf parse book.pdf --target-pages 1-3,7,12-15

# Dump pages as PNGs
spdf screenshot report.pdf -o ./pages --dpi 200

# Batch-convert a directory of PDFs
spdf batch-parse ./inputs ./outputs --format text
```
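
The `--target-pages 1-3,7,12-15` value above reads as a comma-separated list of pages and inclusive ranges. If you build such a spec programmatically, a sketch like the following can validate it first. `expand_pages` is a hypothetical helper, and the rules it enforces (1-based pages, inclusive ranges, no descending ranges) are assumptions inferred from the example, not a statement of how spdf parses the flag.

```rust
/// Expand a spec such as "1-3,7,12-15" into a sorted, de-duplicated page list.
/// Assumes 1-based page numbers and inclusive ranges.
fn expand_pages(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = Vec::new();
    for part in spec.split(',') {
        let part = part.trim();
        // "12-15" is a range; a bare "7" is treated as the range 7-7.
        let (start, end) = match part.split_once('-') {
            Some((a, b)) => (a.trim(), b.trim()),
            None => (part, part),
        };
        let start: u32 = start
            .parse()
            .map_err(|_| format!("bad page number in {part:?}"))?;
        let end: u32 = end
            .parse()
            .map_err(|_| format!("bad page number in {part:?}"))?;
        if start == 0 || end < start {
            return Err(format!("invalid range {part:?}"));
        }
        pages.extend(start..=end);
    }
    pages.sort_unstable();
    pages.dedup();
    Ok(pages)
}
```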

Exit codes: `0` success, `1` parse/OCR error (with `Error: …` on stderr),
`2` invalid CLI arguments (standard clap).
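
If you drive the CLI from another Rust program, the exit codes above map naturally onto an enum. The sketch below uses only `std::process::Command`; the `SpdfOutcome` enum and both function names are hypothetical, and the sketch assumes the `spdf` binary is on `PATH`.

```rust
use std::process::Command;

/// Hypothetical classification of the documented exit codes.
#[derive(Debug, PartialEq)]
enum SpdfOutcome {
    Success,               // exit code 0
    ParseOrOcrError,       // exit code 1, details on stderr
    InvalidArguments,      // exit code 2, emitted by clap
    Other(Option<i32>),    // anything else, or no code (e.g. killed by a signal)
}

fn classify(code: Option<i32>) -> SpdfOutcome {
    match code {
        Some(0) => SpdfOutcome::Success,
        Some(1) => SpdfOutcome::ParseOrOcrError,
        Some(2) => SpdfOutcome::InvalidArguments,
        other => SpdfOutcome::Other(other),
    }
}

/// Run `spdf parse <path> --no-ocr --format text` and return stdout on success.
fn parse_to_text(path: &str) -> Result<String, SpdfOutcome> {
    let output = Command::new("spdf")
        .args(["parse", path, "--no-ocr", "--format", "text"])
        .output()
        // Failing to spawn the process at all (e.g. binary not found) has no exit code.
        .map_err(|_| SpdfOutcome::Other(None))?;

    match classify(output.status.code()) {
        SpdfOutcome::Success => Ok(String::from_utf8_lossy(&output.stdout).into_owned()),
        failure => Err(failure),
    }
}
```

Keeping `classify` separate from the process call makes the mapping testable without a PDF or an installed binary.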

## Library usage

```rust
use spdf_core::SpdfParser;
use spdf_types::ParseConfig;

let parser = SpdfParser::new(ParseConfig {
    ocr_enabled: false,
    ..Default::default()
});
let result = parser.parse(std::path::Path::new("invoice.pdf"))?;
for page in &result.pages {
    println!("--- page {} ---\n{}", page.page_num, page.text);
}
# Ok::<(), spdf_types::SpdfError>(())
```

Or use the ergonomic builder if you only want to tweak a few knobs:

```rust
use spdf_core::SpdfParser;

let parser = SpdfParser::builder()
    .timeout_secs(30)
    .max_input_bytes(50 << 20)
    .ocr_enabled(false)
    .build();
let result = parser.parse(std::path::Path::new("invoice.pdf"))?;
# Ok::<(), spdf_types::SpdfError>(())
```

## Architecture

```
crates/
  spdf-types/        public schema
  spdf-processing/   text / geometry / markup helpers
  spdf-projection/   spatial reconstruction (the crown jewel)
  spdf-pdf/          PdfEngine trait + PDFium impl
  spdf-ocr/          OcrEngine trait + Tesseract + HTTP impls
  spdf-convert/      LibreOffice / ImageMagick shell-outs
  spdf-output/       JSON + text formatters
  spdf-core/         orchestrator
  spdf-cli/          spdf binary
  spdf-ffi/          C ABI cdylib
xtask/               parity harness, benches, pdfium fetcher
```

See [AGENTS.md](AGENTS.md) for the full crate map and
[CONTRIBUTING.md](CONTRIBUTING.md) for development workflow.

## Roadmap

- Node bindings (`@spdf/node`) on top of `spdf-ffi`
- Python bindings via PyO3
- `spdf serve` — a local HTTP parsing service
- Optional ML-based reading-order classifier (opt-in, `burn` feature flag)

## Release process (for maintainers)

Releases are fully automated by [`.github/workflows/release.yml`](.github/workflows/release.yml):

1. Bump `version` in the workspace `Cargo.toml` and update [CHANGELOG.md](CHANGELOG.md).
2. `git commit -am "release 0.x.y" && git push`
3. `git tag -a v0.x.y -m "spdf 0.x.y" && git push origin v0.x.y`

The workflow then:
- builds the `x86_64-unknown-linux-gnu` tarball and attaches it to the
  GitHub release (creating one if missing),
- runs `cargo publish` for all 9 crates in topological order using the
  `CRATES_IO_TOKEN` repository secret, with retry-on-429 for the
  crates.io new-crate rate limit.

To smoke-test the pipeline without consuming a real version number, run
the workflow manually from the **Actions** tab with `dry_run = true` —
it skips uploads and runs `cargo publish --dry-run` for every crate.

## Acknowledgements

spdf is an independent Rust project authored by
[Fanaperana](https://github.com/Fanaperana). The spatial projection
algorithm was inspired by (and is benchmarked against)
[LiteParse](https://github.com/run-llama/liteparse), but spdf is not a
port or rewrite — it's its own implementation, with its own engine
choices (PDFium + Tesseract), its own data model, and its own hardening
work. Rendering is powered by
[PDFium](https://pdfium.googlesource.com/pdfium/); OCR uses
[Tesseract](https://github.com/tesseract-ocr/tesseract).

## License

[MIT](LICENSE) © 2026 spdf contributors.