pdfsink-rs

A native, pure-Rust PDF extraction library inspired by pdfplumber. It is a conceptual drop-in replacement: same capabilities, ~10-50x faster.

Benchmarks: pdfsink-rs vs pdfplumber

Tested against pdfplumber 0.11.8 on real-world government PDFs.

Speed

PDF	Pages	Size	pdfsink-rs	pdfplumber	Speedup
US Budget FY2025	188	2.4 MB	775 ms	11.1 s	14x
NIST SP 800-53	492	6.1 MB	4.07 s	33.9 s	8x
IRS W-9	6	138 KB	31 ms	509 ms	17x
UN Charter	54	3.0 MB	130 ms	1.95 s	15x
Census Table	1	58 KB	4.0 ms	40 ms	10x
EPA Guide	10	—	8 ms	335 ms	42x
Total (13 PDFs)			5.0 s	48.4 s	9.6x

Broken down by operation, text extraction is ~34x faster and table extraction is ~253x faster than pdfplumber. pdfsink-rs also handles malformed pages gracefully where other parsers crash.
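The Speedup column is the pdfplumber time divided by the pdfsink-rs time. A minimal sketch of that arithmetic, using a few rows from the table above; the `Row` struct and `speedup` function are illustrative names, not part of the pdfsink-rs API. Because the published timings are already rounded, recomputed ratios can differ slightly from the table.

```rust
/// One benchmark row: document name and wall-clock times in milliseconds.
/// (Illustrative type, not part of pdfsink-rs.)
struct Row {
    name: &'static str,
    pdfsink_ms: f64,
    pdfplumber_ms: f64,
}

/// Speedup is the baseline time divided by the new time.
fn speedup(baseline_ms: f64, new_ms: f64) -> f64 {
    baseline_ms / new_ms
}

fn main() {
    // Timings copied from the table above (rounded as published).
    let rows = [
        Row { name: "US Budget FY2025", pdfsink_ms: 775.0, pdfplumber_ms: 11_100.0 },
        Row { name: "NIST SP 800-53", pdfsink_ms: 4_070.0, pdfplumber_ms: 33_900.0 },
        Row { name: "UN Charter", pdfsink_ms: 130.0, pdfplumber_ms: 1_950.0 },
        Row { name: "Census Table", pdfsink_ms: 4.0, pdfplumber_ms: 40.0 },
    ];

    for row in &rows {
        let ratio = speedup(row.pdfplumber_ms, row.pdfsink_ms);
        println!("{:<18} {:>5.1}x", row.name, ratio);
    }
}
```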

Accuracy

Metric	Result
Text similarity vs pdfplumber	99.7%
Word count match	21/21 pages
Character count match	exact on all PDFs
Page dimensions	exact on all PDFs
Line/rect object counts	exact on matching PDFs
Table detection (simple PDFs)	1:1 match
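The table does not say how text similarity is measured. One common choice, shown here purely as an assumption, is a normalized edit distance: 1 minus the Levenshtein distance divided by the length of the longer string. The function names below are hypothetical and are not taken from the benchmark scripts.

```rust
/// Levenshtein edit distance, counted in Unicode scalar values.
fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // prev[j] = distance between the first i-1 chars of `a` and the first j chars of `b`.
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for i in 1..=a.len() {
        // curr[0] = i: deleting i chars from `a` yields the empty string.
        let mut curr = vec![i; b.len() + 1];
        for j in 1..=b.len() {
            let substitution = prev[j - 1] + usize::from(a[i - 1] != b[j - 1]);
            let deletion = prev[j] + 1;
            let insertion = curr[j - 1] + 1;
            curr[j] = substitution.min(deletion).min(insertion);
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]; 1.0 means the two strings are identical.
/// (Assumed metric: the README does not define its similarity score.)
fn similarity(a: &str, b: &str) -> f64 {
    let longest = a.chars().count().max(b.chars().count());
    if longest == 0 {
        return 1.0; // two empty strings are identical
    }
    1.0 - edit_distance(a, b) as f64 / longest as f64
}

fn main() {
    let from_pdfsink = "Total revenue 2025";
    let from_pdfplumber = "Total  revenue 2025"; // one extra space
    println!("similarity: {:.3}", similarity(from_pdfsink, from_pdfplumber));
}
```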

Features

Open a PDF and access pages with full metadata
Resilient parsing — malformed pages recovered gracefully (geometry preserved, content skipped)
Inspect page objects (chars, lines, rects, curves, images, annots, hyperlinks)
Crop / within-bbox / outside-bbox filtering
Text extraction, word extraction, line extraction, regex search
Table finding and table extraction (lines, lines_strict, text, explicit strategies)
Layout analysis — textlines, textboxes, hierarchical layout tree
Serialization — JSON (with precision/filtering), CSV, dictionary export
Image rendering — rasterize pages to PNG/JPEG with drawing primitives
Document metadata — mediabox, cropbox, trimbox, bleedbox, artbox
Structure tree — tagged PDF structure element access
Document aggregates — chars(), lines(), rects(), edges(), etc. across all pages
CLI for inspection, debugging, and export
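The three bbox filters in the list above differ in how they treat objects that straddle the region's edge. The sketch below follows pdfplumber's documented semantics (crop keeps and clips partly overlapping objects, within-bbox keeps only fully contained ones, outside-bbox keeps only fully external ones). It assumes pdfsink-rs behaves the same way; the `BBox` type and free functions are hypothetical stand-ins, not the library's API.

```rust
/// Axis-aligned bounding box in page coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f64,
    top: f64,
    x1: f64,
    bottom: f64,
}

impl BBox {
    /// True when `self` lies entirely inside `outer` (edges may touch).
    fn is_within(&self, outer: &BBox) -> bool {
        self.x0 >= outer.x0
            && self.top >= outer.top
            && self.x1 <= outer.x1
            && self.bottom <= outer.bottom
    }

    /// True when the two boxes share a region of non-zero area.
    fn overlaps(&self, other: &BBox) -> bool {
        self.x0 < other.x1 && other.x0 < self.x1 && self.top < other.bottom && other.top < self.bottom
    }

    /// The part of `self` that falls inside `region`, if any.
    fn clip_to(&self, region: &BBox) -> Option<BBox> {
        if !self.overlaps(region) {
            return None;
        }
        Some(BBox {
            x0: self.x0.max(region.x0),
            top: self.top.max(region.top),
            x1: self.x1.min(region.x1),
            bottom: self.bottom.min(region.bottom),
        })
    }
}

/// Keep objects at least partly inside `region`, clipped to fit.
fn crop(objects: &[BBox], region: &BBox) -> Vec<BBox> {
    objects.iter().filter_map(|o| o.clip_to(region)).collect()
}

/// Keep only objects entirely inside `region`.
fn within_bbox(objects: &[BBox], region: &BBox) -> Vec<BBox> {
    objects.iter().copied().filter(|o| o.is_within(region)).collect()
}

/// Keep only objects entirely outside `region`.
fn outside_bbox(objects: &[BBox], region: &BBox) -> Vec<BBox> {
    objects.iter().copied().filter(|o| !o.overlaps(region)).collect()
}

fn main() {
    let region = BBox { x0: 0.0, top: 0.0, x1: 100.0, bottom: 100.0 };
    let objects = [
        BBox { x0: 10.0, top: 10.0, x1: 20.0, bottom: 20.0 },     // fully inside
        BBox { x0: 90.0, top: 90.0, x1: 110.0, bottom: 110.0 },   // straddles the edge
        BBox { x0: 200.0, top: 200.0, x1: 210.0, bottom: 210.0 }, // fully outside
    ];
    println!("crop:    {:?}", crop(&objects, &region));
    println!("within:  {:?}", within_bbox(&objects, &region));
    println!("outside: {:?}", outside_bbox(&objects, &region));
}
```

The straddling object is the one that distinguishes the three filters: crop returns it clipped to the region, while the other two drop it.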

Example

use pdfsink_rs::{PdfDocument, TableSettings};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pdf = PdfDocument::open("document.pdf")?;
    let page = pdf.page(1)?;

    // Extract text
    println!("{}", page.extract_text());

    // Extract words with positions
    for word in page.extract_words() {
        println!("{} @ ({}, {})", word.text, word.x0, word.top);
    }

    // Extract tables
    if let Some(table) = page.extract_table(TableSettings::default())? {
        for row in &table {
            println!("{:?}", row);
        }
    }

    // Layout analysis
    for tl in page.textlinehorizontals() {
        println!("line: {:?} @ ({}, {})", tl.text, tl.x0, tl.top);
    }

    // Serialize to JSON with precision control
    let json = page.to_json::<Vec<u8>>(None, None, None, None, Some(2), Some(2))?;
    println!("{}", json.unwrap_or_default());

    // Render page to PNG (`image::ImageFormat` comes from the `image` crate)
    let rendered = page.to_image(Some(150.0), None, None, false, false)?;
    rendered.save("page.png", Some(image::ImageFormat::Png), false, 256, 8)?;

    Ok(())
}

CLI

pdfsink-rs info <file.pdf>
pdfsink-rs text <file.pdf> [page]
pdfsink-rs words <file.pdf> [page]
pdfsink-rs search <file.pdf> [page] [pattern]
pdfsink-rs objects <file.pdf> [page]
pdfsink-rs json <file.pdf> [page]
pdfsink-rs csv <file.pdf> [page]
pdfsink-rs links <file.pdf> [page]
pdfsink-rs table <file.pdf> [page] [lines|lines_strict|text|explicit]
pdfsink-rs svg <file.pdf> [page] [output.svg]
pdfsink-rs render <file.pdf> [page] [output.png]

Architecture

Built on lopdf for PDF parsing and pdf-extract for content stream processing. No Python runtime dependency.

File	Purpose
`src/lib.rs`	Public API (PdfDocument, Page methods)
`src/parse.rs`	PDF parsing, page-object extraction, metadata
`src/text.rs`	Text/word extraction, search, layout
`src/table.rs`	Table detection and extraction
`src/layout.rs`	Layout analysis (textlines, textboxes, layout tree)
`src/container_api.rs`	Serialization (JSON, CSV, dict export)
`src/display.rs`	Image rendering, drawing primitives
`src/geometry.rs`	Bbox operations, cropping, filtering
`src/clustering.rs`	Value clustering for layout analysis
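Value clustering groups nearly equal coordinates, for example the `top` values of characters that sit on the same text line. The README does not show the algorithm in `src/clustering.rs`; the sketch below uses the sort-and-chain approach found in pdfplumber's clustering utilities, on the assumption that pdfsink-rs does something similar. `cluster_values` is a hypothetical name.

```rust
/// Group values so that each value is within `tolerance` of the previous
/// value in its (sorted) cluster. Hypothetical helper, not the crate's API.
fn cluster_values(values: &[f64], tolerance: f64) -> Vec<Vec<f64>> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    let mut clusters: Vec<Vec<f64>> = Vec::new();
    for value in sorted {
        // Compare against the last value of the last cluster (chaining).
        let joins_last = clusters
            .last()
            .and_then(|cluster| cluster.last())
            .map_or(false, |&last| value - last <= tolerance);

        if joins_last {
            if let Some(cluster) = clusters.last_mut() {
                cluster.push(value);
            }
        } else {
            clusters.push(vec![value]);
        }
    }
    clusters
}

fn main() {
    // `top` coordinates of five characters on two text lines.
    let tops = [10.0, 10.4, 50.0, 10.2, 50.3];
    for (i, cluster) in cluster_values(&tops, 0.5).iter().enumerate() {
        println!("line {}: {:?}", i, cluster);
    }
}
```

Because each value is compared with its immediate predecessor rather than with the cluster's first value, a long run of closely spaced values can chain into one cluster wider than `tolerance`. That suits text lines with gradual baseline drift, but it is a design choice to be aware of.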

Running Tests

cargo test

Running Benchmarks

# Rust
cargo run --release --example bench_pdfsink

# pdfplumber (requires: pip install pdfplumber)
python bench/bench_pdfplumber.py

# Compare
python bench/compare.py

License

MIT

pdfsink-rs 0.2.3