pdfplumber 0.2.0

Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates

pdfplumber is a Rust library for extracting structured content from PDF files. It is a Rust port of Python's pdfplumber, providing the same coordinate-accurate extraction of characters, words, lines, rectangles, curves, images, and tables.
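
To make "coordinate-accurate" concrete, here is a minimal sketch of how positioned characters can be merged into words. It is not this crate's implementation: the `Char` and `Word` structs and the `group_words` function are hypothetical, their field names are borrowed from Python pdfplumber's x0/x1/top/bottom keys, and the gap rule is a simplified take on the `x_tolerance` idea.

```rust
/// Hypothetical character record: a glyph plus its bounding box in page
/// coordinates (not the crate's actual type).
#[derive(Debug, Clone, PartialEq)]
struct Char {
    text: char,
    x0: f64,
    x1: f64,
    top: f64,
    bottom: f64,
}

/// Hypothetical word record: merged text and the union of its chars' boxes.
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    x0: f64,
    x1: f64,
    top: f64,
    bottom: f64,
}

/// Group the chars of one text line into words. A new word starts at
/// whitespace or when the horizontal gap to the previous char exceeds
/// `x_tolerance`. Assumes `chars` is already sorted left to right.
fn group_words(chars: &[Char], x_tolerance: f64) -> Vec<Word> {
    let mut words: Vec<Word> = Vec::new();
    let mut current: Option<Word> = None;

    for c in chars {
        if c.text.is_whitespace() {
            // Whitespace ends the current word and is not emitted itself.
            words.extend(current.take());
            continue;
        }
        match current.as_mut() {
            // Close enough to the previous char: grow the current word
            // and widen its bounding box.
            Some(w) if c.x0 - w.x1 <= x_tolerance => {
                w.text.push(c.text);
                w.x1 = w.x1.max(c.x1);
                w.top = w.top.min(c.top);
                w.bottom = w.bottom.max(c.bottom);
            }
            // Otherwise flush the current word (if any) and start a new one.
            _ => {
                words.extend(current.take());
                current = Some(Word {
                    text: c.text.to_string(),
                    x0: c.x0,
                    x1: c.x1,
                    top: c.top,
                    bottom: c.bottom,
                });
            }
        }
    }
    // Flush the trailing word.
    words.extend(current);
    words
}
```

Each resulting word carries the union of its characters' bounding boxes, which is the kind of positional data that lets callers filter or crop content by page region.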

Quick Start

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
for page_result in pdf.pages_iter() {
    let page = page_result.unwrap();
    let text = page.extract_text(&TextOptions::default());
    println!("Page {}: {}", page.page_number(), text);
}

Architecture

The library is split into three crates:

  • pdfplumber-core: Backend-independent data types and algorithms
  • pdfplumber-parse: PDF parsing (Layer 1) and content stream interpreter (Layer 2)
  • pdfplumber (this crate): Public API facade that ties everything together

Feature Flags

Feature    Default   Description
std        Yes       Enables file-path APIs (Pdf::open_file). Disable for WASM.
serde      No        Adds Serialize/Deserialize to all public data types.
parallel   No        Enables Pdf::pages_parallel() via rayon. Not WASM-compatible.
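
Opt-in features are enabled through Cargo's standard `features` key. A sketch, assuming the 0.2 release line named in the title; adjust the version requirement to the release you actually depend on:

```toml
[dependencies]
# Keeps the default `std` feature and adds the two opt-in features.
pdfplumber = { version = "0.2", features = ["serde", "parallel"] }
```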

Extracting Text

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();

// Simple text extraction
let text = page.extract_text(&TextOptions::default());

// Layout-preserving text extraction
let text = page.extract_text(&TextOptions { layout: true, ..Default::default() });

Extracting Tables

use pdfplumber::{Pdf, TableSettings};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
let tables = page.find_tables(&TableSettings::default());
for table in &tables {
    for row in &table.rows {
        let cells: Vec<&str> = row.iter()
            .map(|c| c.text.as_deref().unwrap_or(""))
            .collect();
        println!("{:?}", cells);
    }
}
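
The row loop above prints each row with debug formatting; the same pattern can produce CSV lines instead. This is a minimal sketch under stated assumptions: `Cell` is a hypothetical stand-in that models only the optional `text` field used above (not the crate's own type), and `row_to_csv` and `escape_field` are illustrative helpers, not part of the pdfplumber API.

```rust
/// Hypothetical stand-in for a table cell; only the optional `text`
/// field from the snippet above is modelled.
struct Cell {
    text: Option<String>,
}

/// Quote a field if it contains a comma, double quote, or line break,
/// doubling any embedded quotes (RFC 4180 style).
fn escape_field(field: &str) -> String {
    if field.contains(|ch: char| matches!(ch, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Join one row into a CSV line, treating missing text as an empty field.
fn row_to_csv(row: &[Cell]) -> String {
    row.iter()
        .map(|c| escape_field(c.text.as_deref().unwrap_or("")))
        .collect::<Vec<_>>()
        .join(",")
}
```

Calling `row_to_csv` once per row and writing each result on its own line gives a CSV export; empty cells become empty fields rather than being dropped, so column positions stay aligned.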

WASM Support

This crate compiles for wasm32-unknown-unknown. For WASM builds, disable the default std feature and use the bytes-based API:

[dependencies]
pdfplumber = { version = "0.2", default-features = false }

Then use Pdf::open with a byte slice:

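use pdfplumber::{Pdf, TextOptions};

// Assumes an enclosing function that returns a Result, with `pdf_bytes` holding the PDF as a byte slice.
let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());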

The parallel feature is not available for WASM targets (rayon requires OS threads).