pdfsink-rs 0.2.5

Fast pure-Rust PDF extraction library and CLI — ~10-50x faster than pdfplumber for text, word, table, layout, image, and metadata extraction from PDFs. By Clark Labs Inc.
Documentation

pdfsink-rs — Fast Pure-Rust PDF Text, Table & Layout Extraction

Crates.io Docs.rs License: MIT Rust

pdfsink-rs is a fast, native, pure-Rust PDF extraction library and CLI — a drop-in conceptual replacement for Python's pdfplumber that runs ~10–50x faster with zero Python runtime. Use it to parse PDFs, extract text, words, tables, layout, images, and metadata from Rust — with resilient handling of malformed documents.

Keywords: rust pdf library · rust pdf parser · pdf text extraction · pdf table extraction · pdfplumber alternative · rust pdf crate · extract text from pdf rust · extract tables from pdf rust · pdf layout analysis · pdf to json

Built and maintained by Clark Labs Inc.

Why pdfsink-rs?

  • Fastest pure-Rust PDF extraction — 10–50x faster than pdfplumber on real-world PDFs, with no Python interpreter, no native C bindings, no shell-outs.
  • Drop-in conceptual API — mirrors pdfplumber's mental model (chars, lines, rects, words, tables, extract_text, extract_table, crop/within_bbox), so migrating a pipeline from Python to Rust is mostly renaming method calls.
  • Resilient — malformed or damaged pages are recovered gracefully instead of crashing the whole document.
  • Batteries included — text, word, line, table, layout, image, annotation, hyperlink, metadata, and tagged-structure extraction in a single crate.
  • Production-ready output formats — JSON (with precision and filter control), CSV, and dictionary export for pipelines and LLM ingestion.
  • Ships a CLIpdfsink-rs binary for inspecting, debugging, and exporting PDFs from the shell without writing any Rust.

Use cases

  • Data pipelines extracting text and tables from government, financial, legal, and scientific PDFs
  • LLM / RAG document ingestion requiring fast, structured PDF parsing in Rust
  • PDF-to-JSON / PDF-to-CSV conversion at scale
  • Layout-aware document understanding (textlines, textboxes, hierarchical layout trees)
  • Rendering PDF pages to PNG / JPEG for previews or OCR pre-processing
  • Migrating Python pdfplumber codebases to a faster native Rust stack

Benchmarks: pdfsink-rs vs pdfplumber

Tested against pdfplumber 0.11.8 on real-world government PDFs.

Speed

PDF Pages Size pdfsink-rs pdfplumber Speedup
US Budget FY2025 188 2.4 MB 775 ms 11.1 s 14x
NIST SP 800-53 492 6.1 MB 4.07 s 33.9 s 8x
IRS W-9 6 138 KB 31 ms 509 ms 17x
UN Charter 54 3.0 MB 130 ms 1.95 s 15x
Census Table 1 58 KB 4.0 ms 40 ms 10x
EPA Guide 10 8 ms 335 ms 42x
Total (13 PDFs) 5.0 s 48.4 s 9.6x

Text extraction is ~34x faster. Table extraction is ~253x faster. Gracefully handles malformed pages that crash other parsers.

Accuracy

Metric Result
Text similarity vs pdfplumber 99.7%
Word count match 21/21 pages
Character count match exact on all PDFs
Page dimensions exact on all PDFs
Line/rect object counts exact on matching PDFs
Table detection (simple PDFs) 1:1 match

Features

  • Open a PDF and access pages with full metadata
  • Resilient parsing — malformed pages recovered gracefully (geometry preserved, content skipped)
  • Inspect page objects (chars, lines, rects, curves, images, annots, hyperlinks)
  • Crop / within-bbox / outside-bbox filtering
  • Text extraction, word extraction, line extraction, regex search
  • Table finding and table extraction (lines, lines_strict, text, explicit strategies)
  • Layout analysis — textlines, textboxes, hierarchical layout tree
  • Serialization — JSON (with precision/filtering), CSV, dictionary export
  • Image rendering — rasterize pages to PNG/JPEG with drawing primitives
  • Document metadata — mediabox, cropbox, trimbox, bleedbox, artbox
  • Structure tree — tagged PDF structure element access
  • Document aggregateschars(), lines(), rects(), edges(), etc. across all pages
  • CLI for inspection, debugging, and export

Installation

Add the crate to your Cargo.toml:

cargo add pdfsink-rs

Or in Cargo.toml:

[dependencies]
pdfsink-rs = "0.2"

Quick Start — Extract Text and Tables from a PDF in Rust

use pdfsink_rs::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let pdf = PdfDocument::open("document.pdf")?;
    let page = pdf.page(1)?;

    // Extract text
    println!("{}", page.extract_text());

    // Extract words with positions
    for word in page.extract_words() {
        println!("{} @ ({}, {})", word.text, word.x0, word.top);
    }

    // Extract tables
    use pdfsink_rs::TableSettings;
    if let Some(table) = page.extract_table(TableSettings::default())? {
        for row in &table {
            println!("{:?}", row);
        }
    }

    // Layout analysis
    for tl in page.textlinehorizontals() {
        println!("line: {:?} @ ({}, {})", tl.text, tl.x0, tl.top);
    }

    // Serialize to JSON with precision control
    let json = page.to_json::<Vec<u8>>(None, None, None, None, Some(2), Some(2))?;
    println!("{}", json.unwrap_or_default());

    // Render page to PNG
    let image = page.to_image(Some(150.0), None, None, false, false)?;
    image.save("page.png", Some(image::ImageFormat::Png), false, 256, 8)?;

    Ok(())
}

CLI

pdfsink-rs info <file.pdf>
pdfsink-rs text <file.pdf> [page]
pdfsink-rs words <file.pdf> [page]
pdfsink-rs search <file.pdf> [page] [pattern]
pdfsink-rs objects <file.pdf> [page]
pdfsink-rs json <file.pdf> [page]
pdfsink-rs csv <file.pdf> [page]
pdfsink-rs links <file.pdf> [page]
pdfsink-rs table <file.pdf> [page] [lines|lines_strict|text|explicit]
pdfsink-rs svg <file.pdf> [page] [output.svg]
pdfsink-rs render <file.pdf> [page] [output.png]

Architecture

Built on lopdf for PDF parsing and pdf-extract for content stream processing. No Python runtime dependency.

File Purpose
src/lib.rs Public API (PdfDocument, Page methods)
src/parse.rs PDF parsing, page-object extraction, metadata
src/text.rs Text/word extraction, search, layout
src/table.rs Table detection and extraction
src/layout.rs Layout analysis (textlines, textboxes, layout tree)
src/container_api.rs Serialization (JSON, CSV, dict export)
src/display.rs Image rendering, drawing primitives
src/geometry.rs Bbox operations, cropping, filtering
src/clustering.rs Value clustering for layout analysis

Running Tests

cargo test

Running Benchmarks

# Rust
cargo run --release --example bench_pdfsink

# pdfplumber (requires: pip install pdfplumber)
python bench/bench_pdfplumber.py

# Compare
python bench/compare.py

FAQ

Is pdfsink-rs a pure-Rust PDF library? Yes. Zero Python runtime, zero C bindings outside of standard Rust crates. It builds on lopdf (PDF object parser) and pdf-extract (content stream decoder) — both pure Rust.

How does pdfsink-rs compare to pdfplumber? Same mental model, much faster. Benchmarked at ~10–50x faster across a 13-PDF real-world corpus, with 99.7% text similarity and exact character/word counts on matched pages.

Can I extract tables from a PDF in Rust with pdfsink-rs? Yes. page.extract_table(TableSettings::default()) and find_tables implement pdfplumber-style lines, lines_strict, text, and explicit strategies.

Does it render PDFs to images? Yes. page.to_image() rasterizes pages to PNG or JPEG at any DPI.

Does it handle malformed PDFs? Yes. Pages that fail content-stream parsing are reported but don't abort document-level extraction — geometry is preserved and bad content is skipped.

License

MIT © Clark Labs Inc.


Built by Clark Labs Inc. — high-performance Rust tooling for document AI pipelines. If pdfsink-rs saves you time, a ⭐ on GitHub helps others discover it.