Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates.
pdfplumber is a Rust library for extracting structured content from PDF files. It is a Rust port of Python's pdfplumber, providing the same coordinate-accurate extraction of characters, words, lines, rectangles, curves, images, and tables.
Quick Start
use pdfplumber::{Pdf, TextOptions};
let pdf = Pdf::open_file("document.pdf", None).unwrap();
for page_result in pdf.pages_iter() {
    let page = page_result.unwrap();
    let text = page.extract_text(&TextOptions::default());
    println!("Page {}: {}", page.page_number(), text);
}
Architecture
The library is split into three crates:
- pdfplumber-core: Backend-independent data types and algorithms
- pdfplumber-parse: PDF parsing (Layer 1) and content stream interpreter (Layer 2)
- pdfplumber (this crate): Public API facade that ties everything together
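To make the layering concrete, here is a minimal, std-only sketch of how a backend-independent algorithm, a parsing backend, and a facade might fit together. Every name in it (`Char`, `Backend`, `FixedBackend`, `Document`, `group_words`, the `3.0` tolerance) is hypothetical and chosen for illustration; none of it is the actual API of the three crates.

```rust
/// Hypothetical "core" data type: one character with its horizontal extent.
#[derive(Debug, Clone, PartialEq)]
pub struct Char {
    pub text: char,
    pub x0: f64,
    pub x1: f64,
}

/// Hypothetical "core" algorithm: split a run of characters into words
/// whenever a whitespace character or a horizontal gap wider than
/// `x_tolerance` is found. It knows nothing about PDF parsing.
pub fn group_words(chars: &[Char], x_tolerance: f64) -> Vec<String> {
    let mut words = Vec::new();
    let mut current = String::new();
    let mut prev_x1: Option<f64> = None;
    for c in chars {
        let gap_too_wide = prev_x1.map_or(false, |x1| c.x0 - x1 > x_tolerance);
        if (c.text.is_whitespace() || gap_too_wide) && !current.is_empty() {
            words.push(std::mem::take(&mut current));
        }
        if !c.text.is_whitespace() {
            current.push(c.text);
        }
        prev_x1 = Some(c.x1);
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}

/// Hypothetical "parse" layer: anything that can produce characters for a page.
pub trait Backend {
    fn page_chars(&self, page_index: usize) -> Vec<Char>;
}

/// In-memory backend used only for this illustration; it returns the same
/// characters for every page index.
pub struct FixedBackend {
    pub chars: Vec<Char>,
}

impl Backend for FixedBackend {
    fn page_chars(&self, _page_index: usize) -> Vec<Char> {
        self.chars.clone()
    }
}

/// Hypothetical facade: ties a backend to the core algorithm.
pub struct Document<B: Backend> {
    backend: B,
}

impl<B: Backend> Document<B> {
    pub fn new(backend: B) -> Self {
        Self { backend }
    }

    /// Words on a page, using an assumed gap tolerance of 3.0 points.
    pub fn words(&self, page_index: usize) -> Vec<String> {
        group_words(&self.backend.page_chars(page_index), 3.0)
    }
}
```

The point of the split in this sketch is that `group_words` can be tested against plain in-memory data, while a different `Backend` could be swapped in without touching the algorithm.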
Feature Flags
| Feature | Default | Description |
|---|---|---|
| std | Yes | Enables file-path APIs ([Pdf::open_file]). Disable for WASM. |
| serde | No | Adds Serialize/Deserialize to all public data types. |
| parallel | No | Enables Pdf::pages_parallel() via rayon. Not WASM-compatible. |
Extracting Text
use pdfplumber::{Pdf, TextOptions};
let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
// Simple text extraction
let text = page.extract_text(&TextOptions::default());
// Layout-preserving text extraction
let text = page.extract_text(&TextOptions { layout: true, ..Default::default() });
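As a rough picture of what layout preservation can mean, the std-only sketch below places words on a character grid according to their coordinates. It illustrates the general idea only and is not this crate's implementation; the `Word` struct, the `render_layout` function, and the `x_density` / `y_density` parameters (points per character column and per text row) are assumptions made for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical word with page coordinates in points (origin at top-left).
pub struct Word {
    pub text: String,
    pub x0: f64,
    pub top: f64,
}

/// Render words onto a text grid: `x_density` points map to one column,
/// `y_density` points map to one row. Vertical gaps become blank lines,
/// horizontal gaps become runs of spaces.
pub fn render_layout(words: &[Word], x_density: f64, y_density: f64) -> String {
    // Row index -> list of (column, text), kept sorted by row via BTreeMap.
    let mut rows: BTreeMap<usize, Vec<(usize, &str)>> = BTreeMap::new();
    for w in words {
        let row = (w.top / y_density).floor() as usize;
        let col = (w.x0 / x_density).floor() as usize;
        rows.entry(row).or_default().push((col, w.text.as_str()));
    }

    let mut out = String::new();
    let mut next_row = 0;
    for (row, mut cells) in rows {
        // Emit empty lines for rows that contain no words.
        while next_row < row {
            out.push('\n');
            next_row += 1;
        }
        cells.sort_by_key(|&(col, _)| col);
        let mut line = String::new();
        for (col, text) in cells {
            let len = line.chars().count();
            if len < col {
                // Pad with spaces up to the word's target column.
                line.extend(std::iter::repeat(' ').take(col - len));
            } else if len > 0 {
                // Target column already occupied: keep at least one space.
                line.push(' ');
            }
            line.push_str(text);
        }
        out.push_str(&line);
        out.push('\n');
        next_row = row + 1;
    }
    out
}
```

With two words on the first row and one word two rows further down, the output keeps the column gap as spaces and the vertical gap as an empty line, which is the property that plain reading-order extraction discards.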
Extracting Tables
use pdfplumber::{Pdf, TableSettings};
let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
let tables = page.find_tables(&TableSettings::default());
for table in &tables {
    for row in &table.rows {
        let cells: Vec<&str> = row.iter()
            .map(|c| c.text.as_deref().unwrap_or(""))
            .collect();
        println!("{:?}", cells);
    }
}
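Once rows are available, a common next step is writing them out as CSV. The sketch below is std-only and uses a hypothetical `Cell` stand-in whose single `text: Option<String>` field mirrors how cells are accessed in the loop above; the real cell type may carry more fields. `csv_field` and `rows_to_csv` are illustrative helpers, not part of the crate.

```rust
/// Hypothetical stand-in for a table cell; only the optional `text`
/// field is modeled here.
pub struct Cell {
    pub text: Option<String>,
}

/// Quote a field following the usual CSV convention: wrap it in double
/// quotes if it contains a comma, quote, or line break, and double any
/// embedded quotes.
pub fn csv_field(s: &str) -> String {
    if s.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}

/// Serialize rows of cells as CSV, one line per row. Empty cells
/// (`text == None`) become empty fields.
pub fn rows_to_csv(rows: &[Vec<Cell>]) -> String {
    let mut out = String::new();
    for row in rows {
        let fields: Vec<String> = row
            .iter()
            .map(|c| csv_field(c.text.as_deref().unwrap_or("")))
            .collect();
        out.push_str(&fields.join(","));
        out.push('\n');
    }
    out
}
```

Treating a missing cell as an empty field keeps every row at the same column count, so the table shape survives the export.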
WASM Support
This crate compiles for wasm32-unknown-unknown. For WASM builds, disable
the default std feature and use the bytes-based API:
[dependencies]
pdfplumber = { version = "0.1", default-features = false }
Then use [Pdf::open] with a byte slice:
let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());
The parallel feature is not available for WASM targets (rayon requires OS threads).