pdfsink-rs 0.2.3

Native pure-Rust PDF extraction crate inspired by pdfplumber — ~10-50x faster text, word, table, and object extraction from PDFs
Documentation

pdfsink-rs

A native pure-Rust PDF extraction library inspired by pdfplumber. Drop-in conceptual replacement — same capabilities, ~10-50x faster.

Benchmarks: pdfsink-rs vs pdfplumber

Tested against pdfplumber 0.11.8 on real-world government PDFs.

Speed

PDF Pages Size pdfsink-rs pdfplumber Speedup
US Budget FY2025 188 2.4 MB 775 ms 11.1 s 14x
NIST SP 800-53 492 6.1 MB 4.07 s 33.9 s 8x
IRS W-9 6 138 KB 31 ms 509 ms 17x
UN Charter 54 3.0 MB 130 ms 1.95 s 15x
Census Table 1 58 KB 4.0 ms 40 ms 10x
EPA Guide 10 8 ms 335 ms 42x
Total (13 PDFs) 5.0 s 48.4 s 9.6x

Text extraction is ~34x faster. Table extraction is ~253x faster. Gracefully handles malformed pages that crash other parsers.

Accuracy

Metric Result
Text similarity vs pdfplumber 99.7%
Word count match 21/21 pages
Character count match exact on all PDFs
Page dimensions exact on all PDFs
Line/rect object counts exact on matching PDFs
Table detection (simple PDFs) 1:1 match

Features

  • Open a PDF and access pages with full metadata
  • Resilient parsing — malformed pages recovered gracefully (geometry preserved, content skipped)
  • Inspect page objects (chars, lines, rects, curves, images, annots, hyperlinks)
  • Crop / within-bbox / outside-bbox filtering
  • Text extraction, word extraction, line extraction, regex search
  • Table finding and table extraction (lines, lines_strict, text, explicit strategies)
  • Layout analysis — textlines, textboxes, hierarchical layout tree
  • Serialization — JSON (with precision/filtering), CSV, dictionary export
  • Image rendering — rasterize pages to PNG/JPEG with drawing primitives
  • Document metadata — mediabox, cropbox, trimbox, bleedbox, artbox
  • Structure tree — tagged PDF structure element access
  • Document aggregateschars(), lines(), rects(), edges(), etc. across all pages
  • CLI for inspection, debugging, and export

Example

use pdfsink_rs::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pdf = PdfDocument::open("document.pdf")?;
    let page = pdf.page(1)?;

    // Extract text
    println!("{}", page.extract_text());

    // Extract words with positions
    for word in page.extract_words() {
        println!("{} @ ({}, {})", word.text, word.x0, word.top);
    }

    // Extract tables
    use pdfsink_rs::TableSettings;
    if let Some(table) = page.extract_table(TableSettings::default())? {
        for row in &table {
            println!("{:?}", row);
        }
    }

    // Layout analysis
    for tl in page.textlinehorizontals() {
        println!("line: {:?} @ ({}, {})", tl.text, tl.x0, tl.top);
    }

    // Serialize to JSON with precision control
    let json = page.to_json::<Vec<u8>>(None, None, None, None, Some(2), Some(2))?;
    println!("{}", json.unwrap_or_default());

    // Render page to PNG
    let image = page.to_image(Some(150.0), None, None, false, false)?;
    image.save("page.png", Some(image::ImageFormat::Png), false, 256, 8)?;

    Ok(())
}

CLI

pdfsink-rs info <file.pdf>
pdfsink-rs text <file.pdf> [page]
pdfsink-rs words <file.pdf> [page]
pdfsink-rs search <file.pdf> [page] [pattern]
pdfsink-rs objects <file.pdf> [page]
pdfsink-rs json <file.pdf> [page]
pdfsink-rs csv <file.pdf> [page]
pdfsink-rs links <file.pdf> [page]
pdfsink-rs table <file.pdf> [page] [lines|lines_strict|text|explicit]
pdfsink-rs svg <file.pdf> [page] [output.svg]
pdfsink-rs render <file.pdf> [page] [output.png]

Architecture

Built on lopdf for PDF parsing and pdf-extract for content stream processing. No Python runtime dependency.

File Purpose
src/lib.rs Public API (PdfDocument, Page methods)
src/parse.rs PDF parsing, page-object extraction, metadata
src/text.rs Text/word extraction, search, layout
src/table.rs Table detection and extraction
src/layout.rs Layout analysis (textlines, textboxes, layout tree)
src/container_api.rs Serialization (JSON, CSV, dict export)
src/display.rs Image rendering, drawing primitives
src/geometry.rs Bbox operations, cropping, filtering
src/clustering.rs Value clustering for layout analysis

Running Tests

cargo test

Running Benchmarks

# Rust
cargo run --release --example bench_pdfsink

# pdfplumber (requires: pip install pdfplumber)
python bench/bench_pdfplumber.py

# Compare
python bench/compare.py

License

MIT