Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates.
pdfplumber is a Rust library for extracting structured content from PDF files. It is a Rust port of Python’s pdfplumber, providing the same coordinate-accurate extraction of characters, words, lines, rectangles, curves, images, and tables.
§Quick Start
use pdfplumber::{Pdf, TextOptions};
let pdf = Pdf::open_file("document.pdf", None).unwrap();
for page_result in pdf.pages_iter() {
    let page = page_result.unwrap();
    let text = page.extract_text(&TextOptions::default());
    println!("Page {}: {}", page.page_number(), text);
}
§Architecture
The library is split into three crates:
- pdfplumber-core: Backend-independent data types and algorithms
- pdfplumber-parse: PDF parsing (Layer 1) and content stream interpreter (Layer 2)
- pdfplumber (this crate): Public API facade that ties everything together
§Feature Flags
| Feature | Default | Description |
|---|---|---|
| std | Yes | Enables file-path APIs (Pdf::open_file). Disable for WASM. |
| serde | No | Adds Serialize/Deserialize to all public data types. |
| parallel | No | Enables Pdf::pages_parallel() via rayon. Not WASM-compatible. |
§Extracting Text
use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
// Simple text extraction
let text = page.extract_text(&TextOptions::default());
// Layout-preserving text extraction
let text = page.extract_text(&TextOptions { layout: true, ..Default::default() });
§Extracting Tables
use pdfplumber::{Pdf, TableSettings};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
let tables = page.find_tables(&TableSettings::default());
for table in &tables {
    for row in &table.rows {
        // Cells without text are printed as empty strings
        let cells: Vec<&str> = row.iter()
            .map(|c| c.text.as_deref().unwrap_or(""))
            .collect();
        println!("{:?}", cells);
    }
}
§WASM Support
This crate compiles for wasm32-unknown-unknown. For WASM builds, disable
the default std feature and use the bytes-based API:
[dependencies]
pdfplumber = { version = "0.1", default-features = false }
Then use Pdf::open with a byte slice:
let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());
The parallel feature is not available for WASM targets (rayon requires OS threads).
Re-exports§
pub use pdfplumber_parse;
Structs§
- Annotation
- A PDF annotation extracted from a page.
- BBox
- Bounding box with top-left origin coordinate system.
- Bookmark
- A single entry in the PDF document outline (bookmark / table of contents).
- Cell
- A detected table cell.
- Char
- A single character extracted from a PDF page.
- CharEvent
- Information about a rendered character glyph.
- CroppedPage
- A spatially filtered view of a PDF page.
- Ctm
- Current Transformation Matrix (CTM) — affine transform.
- Curve
- A curve extracted from a painted path (cubic Bezier segment).
- DashPattern
- Dash pattern for stroking operations.
- DedupeOptions
- Options for duplicate character detection and removal.
- DocumentMetadata
- Document-level metadata extracted from the PDF /Info dictionary.
- DrawStyle
- Style options for drawing overlays on the SVG page.
- Edge
- A line segment edge for table detection.
- EncodingResolver
- Resolved encoding for a font, following PDF encoding resolution order.
- ExplicitLines
- User-provided line coordinates for Explicit strategy.
- ExtGState
- Extended Graphics State parameters (from gs operator).
- ExtractOptions
- Options controlling extraction behavior and resource limits.
- ExtractResult
- Result wrapper that pairs a value with collected warnings.
- ExtractWarning
- A non-fatal warning encountered during extraction.
- FontEncoding
- An encoding table that may be a standard encoding modified by a Differences array.
- FormField
- A PDF form field extracted from the document’s AcroForm dictionary.
- GraphicsState
- Graphics state relevant to path painting.
- HtmlOptions
- Options for HTML rendering.
- HtmlRenderer
- Renders PDF page content as semantic HTML.
- Hyperlink
- A resolved hyperlink extracted from a PDF page.
- Image
- An image extracted from a PDF page via the Do operator.
- ImageContent
- Extracted image content (raw bytes) from a PDF image XObject.
- ImageEvent
- Information about a placed image.
- ImageMetadata
- Metadata about an image XObject from the PDF resource dictionary.
- Intersection
- An intersection point between horizontal and vertical edges.
- Line
- A line segment extracted from a painted path.
- LopdfBackend
- The lopdf-based PDF backend.
- LopdfDocument
- A parsed PDF document backed by lopdf.
- LopdfPage
- A reference to a single page within a LopdfDocument.
- MarkdownOptions
- Options for Markdown rendering.
- MarkdownRenderer
- Renders PDF page content as Markdown.
- Page
- A single page from a PDF document.
- PageGeometry
- Page coordinate normalization configuration.
- PagesIter
- Iterator over pages of a PDF document, yielding each page on demand.
- PaintedPath
- A painted path: the result of a painting operator applied to a constructed path.
- Path
- A complete path consisting of segments.
- PathBuilder
- Builder for constructing paths from PDF path operators.
- PathEvent
- Information about a painted path.
- Pdf
- A PDF document opened for extraction.
- Point
- A 2D point.
- Rect
- A rectangle extracted from a painted path.
- RepairOptions
- Options for controlling which PDF repairs to attempt.
- RepairResult
- Result of a PDF repair operation.
- SearchMatch
- A single text search match with its bounding box and position information.
- SearchOptions
- Options controlling text search behavior.
- SignatureInfo
- Digital signature metadata extracted from a PDF signature field.
- StructElement
- A node in the PDF structure tree.
- SvgDebugOptions
- Options for the debug_tablefinder SVG output.
- SvgOptions
- Options for SVG generation.
- SvgRenderer
- Renders PDF page content as SVG markup for visual debugging.
- Table
- A detected table.
- TableFinder
- Orchestrator for the table detection pipeline.
- TableFinderDebug
- Intermediate results from the table detection pipeline.
- TableQuality
- Quality metrics for a detected table.
- TableSettings
- Configuration for table detection.
- TextBlock
- A text block: a group of lines forming a coherent paragraph or section.
- TextLine
- A text line: a sequence of words on the same y-level.
- TextOptions
- Options for layout-aware text extraction.
- ValidationIssue
- A validation issue found in a PDF document.
- Word
- A word extracted from a PDF page.
- WordExtractor
- Extracts words from a sequence of characters based on spatial proximity.
- WordOptions
- Options for word extraction, matching pdfplumber defaults.
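The BBox entry above mentions a top-left origin, whereas native PDF coordinates place the origin at the bottom-left with y growing upward. The following is a minimal sketch of that conversion, not the crate's actual implementation: `TopLeftBox`, its fields, and `from_pdf_coords` are hypothetical names chosen for illustration, and the sketch assumes an unrotated page whose height is known.

```rust
/// Hypothetical bounding box in top-left-origin coordinates (y grows downward).
/// This is an illustration only, not the crate's `BBox` definition.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TopLeftBox {
    pub x0: f64,
    pub top: f64,
    pub x1: f64,
    pub bottom: f64,
}

impl TopLeftBox {
    /// Convert a rectangle given in native PDF coordinates (origin at the
    /// bottom-left, y grows upward) into top-left-origin coordinates.
    /// Assumes an unrotated page of height `page_height`.
    pub fn from_pdf_coords(x0: f64, y0: f64, x1: f64, y1: f64, page_height: f64) -> Self {
        // Normalize so the corners may be supplied in any order.
        let (left, right) = if x0 <= x1 { (x0, x1) } else { (x1, x0) };
        let (low, high) = if y0 <= y1 { (y0, y1) } else { (y1, y0) };
        TopLeftBox {
            x0: left,
            // The highest PDF y value is the edge closest to the top of the page.
            top: page_height - high,
            x1: right,
            bottom: page_height - low,
        }
    }

    pub fn width(&self) -> f64 {
        self.x1 - self.x0
    }

    pub fn height(&self) -> f64 {
        self.bottom - self.top
    }
}
```

Under these assumptions, a rectangle spanning y = 700 to y = 720 on a 792-point-tall page ends up with `top = 72` and `bottom = 92`, so `top` is always the smaller of the two values.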
Enums§
- AnnotationType
- Common PDF annotation subtypes.
- Color
- Color value from a PDF color space.
- EdgeSource
- Source of an edge, tracking which geometric primitive it came from.
- FieldType
- The type of a PDF form field.
- FillRule
- Fill rule for path painting.
- ImageFormat
- Format of extracted image data.
- Orientation
- Orientation of a geometric element.
- PageObject
- An enum wrapping references to different page object types.
- PaintOp
- The type of paint operation applied to a path.
- PathSegment
- A segment of a PDF path.
- PdfError
- Fatal error types for PDF processing.
- Severity
- Severity of a validation issue.
- StandardEncoding
- A named standard PDF encoding.
- Strategy
- Strategy for table detection.
- TextDirection
- Text flow direction.
- UnicodeNorm
- Unicode normalization form to apply to extracted text.
Traits§
- ContentHandler
- Callback handler for content stream interpretation.
- PdfBackend
- Trait abstracting PDF parsing operations.
Functions§
- blocks_to_text
- Convert text blocks into a string.
- cells_to_tables
- Group adjacent cells into distinct tables.
- cluster_lines_into_blocks
- Cluster text line segments into text blocks based on x-overlap and vertical proximity.
- cluster_words_into_lines
- Cluster words into text lines based on y-proximity.
- derive_edges
- Derive all edges from collections of lines, rects, and curves.
- edge_from_curve
- Derive an Edge from a Curve using chord approximation (start to end).
- edge_from_line
- Derive an Edge from a Line (direct conversion).
- edges_from_rect
- Derive 4 Edges from a Rect (top, bottom, left, right).
- edges_to_intersections
- Find all intersection points between horizontal and vertical edges.
- explicit_lines_to_edges
- Convert user-provided explicit line coordinates into edges.
- extract_shapes
- Extract Line, Rect, and Curve objects from a painted path.
- extract_text_for_cells
- Extract text content for each cell by finding characters within the cell bbox.
- image_from_ctm
- Extract an Image from the CTM active during a Do operator invocation.
- intersections_to_cells
- Construct rectangular cells from a grid of intersection points.
- is_cjk
- Returns true if the character is a CJK ideograph, syllable, or kana.
- is_cjk_text
- Returns true if the first character of the text is CJK.
- join_edge_group
- Merge overlapping or adjacent collinear edge segments.
- snap_edges
- Snap nearby parallel edges to aligned positions.
- sort_blocks_reading_order
- Sort text blocks in natural reading order.
- split_lines_at_columns
- Split text lines at large horizontal gaps to detect column boundaries.
- words_to_edges_stream
- Generate synthetic edges from text alignment patterns for the Stream strategy.
- words_to_text
- Simple (non-layout) text extraction from words.
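Several of the functions above group items by spatial proximity. As a rough illustration of the idea behind clustering words into lines by y-proximity, here is a minimal standalone sketch. It does not reproduce the crate's algorithm or signatures: `SimpleWord`, `group_words_into_lines`, and the `y_tolerance` parameter are hypothetical, and comparing each word against the top of the first word in the current line is just one possible design choice.

```rust
/// Hypothetical, simplified word record used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleWord {
    pub text: String,
    pub x0: f64,
    pub top: f64,
}

/// Group words into lines: words whose `top` lies within `y_tolerance` of the
/// first word in the current line are treated as sharing that line.
/// Lines come back top-to-bottom, and words within a line left-to-right.
pub fn group_words_into_lines(
    mut words: Vec<SimpleWord>,
    y_tolerance: f64,
) -> Vec<Vec<SimpleWord>> {
    // Process words from the top of the page downward.
    words.sort_by(|a, b| a.top.total_cmp(&b.top));

    let mut lines: Vec<Vec<SimpleWord>> = Vec::new();
    // `top` of the first word in the line currently being built.
    let mut current_top: Option<f64> = None;

    for word in words {
        let starts_new_line = match current_top {
            Some(t) => (word.top - t).abs() > y_tolerance,
            None => true,
        };
        if starts_new_line {
            current_top = Some(word.top);
            lines.push(Vec::new());
        }
        // A line always exists here because the first word starts one.
        if let Some(line) = lines.last_mut() {
            line.push(word);
        }
    }

    // Restore left-to-right order inside each line.
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}
```

A larger `y_tolerance` merges slightly misaligned baselines (for example superscripts) into one line, at the cost of occasionally merging tightly spaced rows.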
Type Aliases§
- FilteredPage
- A page view produced by Page::filter or CroppedPage::filter.
- LineOrientation
- Type alias preserving backward compatibility.