pdfsink-rs
A native pure-Rust PDF extraction library inspired by pdfplumber. Drop-in conceptual replacement — same capabilities, ~10-50x faster.
Benchmarks: pdfsink-rs vs pdfplumber
Tested against pdfplumber 0.11.8 on real-world government PDFs.
Speed
| PDF | Pages | Size | pdfsink-rs | pdfplumber | Speedup |
|---|---|---|---|---|---|
| US Budget FY2025 | 188 | 2.4 MB | 775 ms | 11.1 s | 14x |
| NIST SP 800-53 | 492 | 6.1 MB | 4.07 s | 33.9 s | 8x |
| IRS W-9 | 6 | 138 KB | 31 ms | 509 ms | 17x |
| UN Charter | 54 | 3.0 MB | 130 ms | 1.95 s | 15x |
| Census Table | 1 | 58 KB | 4.0 ms | 40 ms | 10x |
| EPA Guide | 10 | — | 8 ms | 335 ms | 42x |
| Total (13 PDFs) | | | 5.0 s | 48.4 s | 9.6x |
Text extraction is ~34x faster. Table extraction is ~253x faster. Gracefully handles malformed pages that crash other parsers.
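The Speedup column in the table above is the ratio of pdfplumber time to pdfsink-rs time. The sketch below recomputes it for three rows. The `speedup` helper is hypothetical and is not part of the pdfsink-rs API. Because the table lists rounded timings, a recomputed ratio can differ slightly from the published figure.

```rust
/// Hypothetical helper (not part of pdfsink-rs): how many times faster the
/// candidate is than the baseline, given both timings in milliseconds.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // (name, pdfsink-rs ms, pdfplumber ms), copied from the table above.
    let rows = [
        ("US Budget FY2025", 775.0, 11_100.0),
        ("UN Charter", 130.0, 1_950.0),
        ("Census Table", 4.0, 40.0),
    ];
    for &(name, fast_ms, slow_ms) in rows.iter() {
        println!("{}: {:.1}x", name, speedup(slow_ms, fast_ms));
    }
}
```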
Accuracy
| Metric | Result |
|---|---|
| Text similarity vs pdfplumber | 99.7% |
| Word count match | 21/21 pages |
| Character count match | exact on all PDFs |
| Page dimensions | exact on all PDFs |
| Line/rect object counts | exact on matching PDFs |
| Table detection (simple PDFs) | 1:1 match |
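The table does not say how text similarity is measured. One common choice is a normalized edit-distance ratio, sketched below. This is an assumption for illustration, not necessarily the metric behind the 99.7% figure, and the function names `edit_distance` and `similarity` are hypothetical.

```rust
/// Levenshtein edit distance, counted in Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] holds the distance between the first i-1 chars of `a`
    // and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // cur[0] = i: deleting i chars turns a prefix of `a` into "".
        let mut cur = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            cur[j] = (prev[j] + 1) // deletion
                .min(cur[j - 1] + 1) // insertion
                .min(prev[j - 1] + cost); // substitution or match
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]: 1.0 means identical text.
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - edit_distance(a, b) as f64 / longest as f64
}

fn main() {
    // Two extractions of the same line that differ by one extra space.
    let ours = "hello world";
    let theirs = "hello  world";
    println!("similarity: {:.3}", similarity(ours, theirs));
}
```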
Features
- Open a PDF and access pages with full metadata
- Resilient parsing — malformed pages recovered gracefully (geometry preserved, content skipped)
- Inspect page objects (chars, lines, rects, curves, images, annots, hyperlinks)
- Crop / within-bbox / outside-bbox filtering
- Text extraction, word extraction, line extraction, regex search
- Table finding and table extraction (lines, lines_strict, text, explicit strategies)
- Layout analysis — textlines, textboxes, hierarchical layout tree
- Serialization — JSON (with precision/filtering), CSV, dictionary export
- Image rendering — rasterize pages to PNG/JPEG with drawing primitives
- Document metadata — mediabox, cropbox, trimbox, bleedbox, artbox
- Structure tree — tagged PDF structure element access
- Document aggregates: chars(), lines(), rects(), edges(), etc. across all pages
- CLI for inspection, debugging, and export
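To make the crop / within-bbox / outside-bbox distinction concrete, here is a self-contained sketch of the three filters. It assumes the pdfplumber semantics: crop keeps and clips anything that overlaps the region, within-bbox keeps only objects fully inside it, and outside-bbox keeps only objects that do not overlap it. The `Bbox` type and the function names are hypothetical and do not reflect the actual pdfsink-rs API.

```rust
/// Hypothetical bounding box using pdfplumber's (x0, top, x1, bottom)
/// convention, with `top` measured downward from the top of the page.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bbox {
    x0: f64,
    top: f64,
    x1: f64,
    bottom: f64,
}

impl Bbox {
    /// True if `other` lies entirely inside `self`.
    fn contains(&self, other: &Bbox) -> bool {
        other.x0 >= self.x0
            && other.top >= self.top
            && other.x1 <= self.x1
            && other.bottom <= self.bottom
    }

    /// True if the two boxes share a region of non-zero area.
    fn overlaps(&self, other: &Bbox) -> bool {
        self.x0 < other.x1
            && other.x0 < self.x1
            && self.top < other.bottom
            && other.top < self.bottom
    }
}

/// Keep objects fully inside `region`.
fn within_bbox(objects: &[Bbox], region: &Bbox) -> Vec<Bbox> {
    objects.iter().copied().filter(|o| region.contains(o)).collect()
}

/// Keep objects that do not overlap `region` at all.
fn outside_bbox(objects: &[Bbox], region: &Bbox) -> Vec<Bbox> {
    objects.iter().copied().filter(|o| !region.overlaps(o)).collect()
}

/// Keep objects that overlap `region`, clipped to its edges.
fn crop(objects: &[Bbox], region: &Bbox) -> Vec<Bbox> {
    objects
        .iter()
        .filter(|o| region.overlaps(o))
        .map(|o| Bbox {
            x0: o.x0.max(region.x0),
            top: o.top.max(region.top),
            x1: o.x1.min(region.x1),
            bottom: o.bottom.min(region.bottom),
        })
        .collect()
}

fn main() {
    let region = Bbox { x0: 0.0, top: 0.0, x1: 100.0, bottom: 100.0 };
    let objects = [
        Bbox { x0: 10.0, top: 10.0, x1: 20.0, bottom: 20.0 },     // fully inside
        Bbox { x0: 90.0, top: 90.0, x1: 110.0, bottom: 110.0 },   // straddles the edge
        Bbox { x0: 200.0, top: 200.0, x1: 210.0, bottom: 210.0 }, // fully outside
    ];
    println!("within:  {:?}", within_bbox(&objects, &region));
    println!("outside: {:?}", outside_bbox(&objects, &region));
    println!("crop:    {:?}", crop(&objects, &region));
}
```

The straddling object appears only in the crop result, clipped to the region.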
Example
use PdfDocument;
CLI
pdfsink-rs info <file.pdf>
pdfsink-rs text <file.pdf> [page]
pdfsink-rs words <file.pdf> [page]
pdfsink-rs search <file.pdf> [page] [pattern]
pdfsink-rs objects <file.pdf> [page]
pdfsink-rs json <file.pdf> [page]
pdfsink-rs csv <file.pdf> [page]
pdfsink-rs links <file.pdf> [page]
pdfsink-rs table <file.pdf> [page] [lines|lines_strict|text|explicit]
pdfsink-rs svg <file.pdf> [page] [output.svg]
pdfsink-rs render <file.pdf> [page] [output.png]
Architecture
Built on lopdf for PDF parsing and pdf-extract for content stream processing. No Python runtime dependency.
| File | Purpose |
|---|---|
| src/lib.rs | Public API (PdfDocument, Page methods) |
| src/parse.rs | PDF parsing, page-object extraction, metadata |
| src/text.rs | Text/word extraction, search, layout |
| src/table.rs | Table detection and extraction |
| src/layout.rs | Layout analysis (textlines, textboxes, layout tree) |
| src/container_api.rs | Serialization (JSON, CSV, dict export) |
| src/display.rs | Image rendering, drawing primitives |
| src/geometry.rs | Bbox operations, cropping, filtering |
| src/clustering.rs | Value clustering for layout analysis |
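As an illustration of what value clustering means in layout analysis, the sketch below groups nearby coordinates (for example, the top positions of words on the same text line) by sorting them and starting a new cluster whenever the gap to the previous value exceeds a tolerance. This is a generic technique; whether src/clustering.rs works this way is an assumption, and `cluster_values` is a hypothetical name.

```rust
/// Sort `values`, then group neighbours whose gap to the previous value is
/// at most `tolerance`. Returns the clusters in ascending order.
fn cluster_values(values: &[f64], tolerance: f64) -> Vec<Vec<f64>> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mut clusters: Vec<Vec<f64>> = Vec::new();
    for v in sorted {
        // A value starts a new cluster if there is no cluster yet or if it
        // is too far from the last value of the current cluster.
        let starts_new = match clusters.last() {
            Some(cluster) => v - cluster[cluster.len() - 1] > tolerance,
            None => true,
        };
        if starts_new {
            clusters.push(vec![v]);
        } else if let Some(cluster) = clusters.last_mut() {
            cluster.push(v);
        }
    }
    clusters
}

fn main() {
    // Illustrative `top` coordinates of five words on two text lines.
    let tops = [100.0, 100.4, 112.0, 112.3, 99.8];
    for (i, cluster) in cluster_values(&tops, 1.0).iter().enumerate() {
        println!("line {}: {:?}", i, cluster);
    }
}
```

Comparing each value against its predecessor rather than the cluster's first value lets a cluster drift gradually, which tolerates slightly skewed baselines but can merge lines when the tolerance is set too high.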
Running Tests
Running Benchmarks
# Rust
# pdfplumber (requires: pip install pdfplumber)
# Compare
License
MIT