Crate pdfplumber

Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates.

pdfplumber is a Rust port of Python’s pdfplumber. It extracts structured content from PDF files with the same coordinate-accurate handling of characters, words, lines, rectangles, curves, images, and tables.

§Quick Start

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
for page_result in pdf.pages_iter() {
    let page = page_result.unwrap();
    let text = page.extract_text(&TextOptions::default());
    println!("Page {}: {}", page.page_number(), text);
}

§Architecture

The library is split into three crates:

pdfplumber-core: Backend-independent data types and algorithms
pdfplumber-parse: PDF parsing (Layer 1) and content stream interpreter (Layer 2)
pdfplumber (this crate): Public API facade that ties everything together

§Feature Flags

Feature	Default	Description
`std`	Yes	Enables file-path APIs (`Pdf::open_file`). Disable for WASM.
`serde`	No	Adds `Serialize`/`Deserialize` to all public data types.
`parallel`	No	Enables `Pdf::pages_parallel()` via rayon. Not WASM-compatible.

§Extracting Text

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();

// Simple text extraction
let text = page.extract_text(&TextOptions::default());

// Layout-preserving text extraction
let layout_text = page.extract_text(&TextOptions { layout: true, ..Default::default() });

§Extracting Tables

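use pdfplumber::{Pdf, TableSettings};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();

// Detect tables with the default settings, then print each row as a list of cell strings.
let tables = page.find_tables(&TableSettings::default());
for table in &tables {
    for row in &table.rows {
        // Cells without text are printed as empty strings.
        let cells: Vec<&str> = row.iter()
            .map(|c| c.text.as_deref().unwrap_or(""))
            .collect();
        println!("{:?}", cells);
    }
}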
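
The function index on this page (edges_from_rect, edges_to_intersections, intersections_to_cells) suggests a lattice-style pipeline: shapes become edges, edges become intersection points, and points become cells. The following is a minimal, self-contained sketch of the first two steps only. It does not use the crate; `Dir`, `EdgeSketch`, `rect_edges`, and `intersections` are hypothetical names chosen for illustration, and the tolerance handling is an assumption rather than the crate's actual behavior.

```rust
/// Hypothetical orientation tag (not the crate's `Orientation`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Dir {
    Horizontal,
    Vertical,
}

/// Hypothetical axis-aligned edge in top-left-origin coordinates.
#[derive(Debug, Clone, Copy, PartialEq)]
struct EdgeSketch {
    x0: f64,
    top: f64,
    x1: f64,
    bottom: f64,
    dir: Dir,
}

/// Derive the four edges of a rectangle: top, bottom, left, right.
fn rect_edges(x0: f64, top: f64, x1: f64, bottom: f64) -> [EdgeSketch; 4] {
    [
        EdgeSketch { x0, top, x1, bottom: top, dir: Dir::Horizontal },
        EdgeSketch { x0, top: bottom, x1, bottom, dir: Dir::Horizontal },
        EdgeSketch { x0, top, x1: x0, bottom, dir: Dir::Vertical },
        EdgeSketch { x0: x1, top, x1, bottom, dir: Dir::Vertical },
    ]
}

/// Collect the points where a horizontal edge meets a vertical edge,
/// allowing each edge to fall short of the other by up to `tol`.
/// The result is sorted by y, then x, with exact duplicates removed.
fn intersections(edges: &[EdgeSketch], tol: f64) -> Vec<(f64, f64)> {
    let mut points = Vec::new();
    for h in edges.iter().filter(|e| e.dir == Dir::Horizontal) {
        for v in edges.iter().filter(|e| e.dir == Dir::Vertical) {
            let x_hit = v.x0 >= h.x0 - tol && v.x0 <= h.x1 + tol;
            let y_hit = h.top >= v.top - tol && h.top <= v.bottom + tol;
            if x_hit && y_hit {
                points.push((v.x0, h.top));
            }
        }
    }
    points.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.total_cmp(&b.0)));
    points.dedup();
    points
}
```

Feeding the edges of two side-by-side rectangles that share a border into `intersections` yields the six corner points of a one-row, two-column grid, which is the input a cell-construction step would need.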

§WASM Support

This crate compiles for wasm32-unknown-unknown. For WASM builds, disable the default std feature and use the bytes-based API:

[dependencies]
pdfplumber = { version = "0.1", default-features = false }

Then use Pdf::open with a byte slice:

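use pdfplumber::{Pdf, TextOptions};

// `pdf_bytes` is the byte slice holding the PDF; `?` assumes the enclosing function returns a Result.
let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());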

The parallel feature is not available for WASM targets (rayon requires OS threads).

Re-exports§

pub use pdfplumber_parse;

Structs§

Annotation: A PDF annotation extracted from a page.
BBox: Bounding box with top-left origin coordinate system.
Bookmark: A single entry in the PDF document outline (bookmark / table of contents).
Cell: A detected table cell.
Char: A single character extracted from a PDF page.
CharEvent: Information about a rendered character glyph.
CroppedPage: A spatially filtered view of a PDF page.
Ctm: Current Transformation Matrix (CTM) — affine transform.
Curve: A curve extracted from a painted path (cubic Bezier segment).
DashPattern: Dash pattern for stroking operations.
DedupeOptions: Options for duplicate character detection and removal.
DocumentMetadata: Document-level metadata extracted from the PDF /Info dictionary.
DrawStyle: Style options for drawing overlays on the SVG page.
Edge: A line segment edge for table detection.
EncodingResolver: Resolved encoding for a font, following PDF encoding resolution order.
ExplicitLines: User-provided line coordinates for Explicit strategy.
ExtGState: Extended Graphics State parameters (from gs operator).
ExtractOptions: Options controlling extraction behavior and resource limits.
ExtractResult: Result wrapper that pairs a value with collected warnings.
ExtractWarning: A non-fatal warning encountered during extraction.
FontEncoding: An encoding table that may be a standard encoding modified by a Differences array.
FormField: A PDF form field extracted from the document’s AcroForm dictionary.
GraphicsState: Graphics state relevant to path painting.
HtmlOptions: Options for HTML rendering.
HtmlRenderer: Renders PDF page content as semantic HTML.
Hyperlink: A resolved hyperlink extracted from a PDF page.
Image: An image extracted from a PDF page via the Do operator.
ImageContent: Extracted image content (raw bytes) from a PDF image XObject.
ImageEvent: Information about a placed image.
ImageMetadata: Metadata about an image XObject from the PDF resource dictionary.
Intersection: An intersection point between horizontal and vertical edges.
Line: A line segment extracted from a painted path.
LopdfBackend: The lopdf-based PDF backend.
LopdfDocument: A parsed PDF document backed by lopdf.
LopdfPage: A reference to a single page within a LopdfDocument.
MarkdownOptions: Options for Markdown rendering.
MarkdownRenderer: Renders PDF page content as Markdown.
Page: A single page from a PDF document.
PageGeometry: Page coordinate normalization configuration.
PagesIter: Iterator over pages of a PDF document, yielding each page on demand.
PaintedPath: A painted path — the result of a painting operator applied to a constructed path.
Path: A complete path consisting of segments.
PathBuilder: Builder for constructing paths from PDF path operators.
PathEvent: Information about a painted path.
Pdf: A PDF document opened for extraction.
Point: A 2D point.
Rect: A rectangle extracted from a painted path.
RepairOptions: Options for controlling which PDF repairs to attempt.
RepairResult: Result of a PDF repair operation.
SearchMatch: A single text search match with its bounding box and position information.
SearchOptions: Options controlling text search behavior.
SignatureInfo: Digital signature metadata extracted from a PDF signature field.
StructElement: A node in the PDF structure tree.
SvgDebugOptions: Options for the debug_tablefinder SVG output.
SvgOptions: Options for SVG generation.
SvgRenderer: Renders PDF page content as SVG markup for visual debugging.
Table: A detected table.
TableFinder: Orchestrator for the table detection pipeline.
TableFinderDebug: Intermediate results from the table detection pipeline.
TableQuality: Quality metrics for a detected table.
TableSettings: Configuration for table detection.
TextBlock: A text block: a group of lines forming a coherent paragraph or section.
TextLine: A text line: a sequence of words on the same y-level.
TextOptions: Options for layout-aware text extraction.
ValidationIssue: A validation issue found in a PDF document.
Word: A word extracted from a PDF page.
WordExtractor: Extracts words from a sequence of characters based on spatial proximity.
WordOptions: Options for word extraction, matching the defaults of Python’s pdfplumber.
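
The BBox entry above mentions a top-left origin, whereas PDF's default user space places the origin at the bottom-left with y increasing upward. As a rough illustration of what that normalization involves, here is a self-contained sketch that flips a rectangle given the page height. `BBoxSketch` and `from_pdf_rect` are hypothetical names, not the crate's API, and the sketch assumes an unrotated page whose lower edge sits at y = 0.

```rust
/// Hypothetical stand-in for a top-left-origin bounding box
/// (not the crate's `BBox`).
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBoxSketch {
    x0: f64,     // left edge
    top: f64,    // distance from the top of the page to the upper edge
    x1: f64,     // right edge
    bottom: f64, // distance from the top of the page to the lower edge
}

impl BBoxSketch {
    /// Convert a rectangle in PDF user space (origin bottom-left, y up)
    /// into top-left-origin coordinates for a page of `page_height`.
    /// Corner order does not matter; the coordinates are normalized.
    fn from_pdf_rect(x0: f64, y0: f64, x1: f64, y1: f64, page_height: f64) -> Self {
        Self {
            x0: x0.min(x1),
            top: page_height - y0.max(y1),
            x1: x0.max(x1),
            bottom: page_height - y0.min(y1),
        }
    }

    fn width(&self) -> f64 {
        self.x1 - self.x0
    }

    fn height(&self) -> f64 {
        self.bottom - self.top
    }
}
```

With this convention `top` is always less than or equal to `bottom`, so objects can be sorted top-to-bottom by `top` in natural reading order.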

Enums§

AnnotationType: Common PDF annotation subtypes.
Color: Color value from a PDF color space.
EdgeSource: Source of an edge, tracking which geometric primitive it came from.
FieldType: The type of a PDF form field.
FillRule: Fill rule for path painting.
ImageFormat: Format of extracted image data.
Orientation: Orientation of a geometric element.
PageObject: An enum wrapping references to different page object types.
PaintOp: The type of paint operation applied to a path.
PathSegment: A segment of a PDF path.
PdfError: Fatal error types for PDF processing.
Severity: Severity of a validation issue.
StandardEncoding: A named standard PDF encoding.
Strategy: Strategy for table detection.
TextDirection: Text flow direction.
UnicodeNorm: Unicode normalization form to apply to extracted text.

Traits§

ContentHandler: Callback handler for content stream interpretation.
PdfBackend: Trait abstracting PDF parsing operations.

Functions§

blocks_to_text: Convert text blocks into a string.
cells_to_tables: Group adjacent cells into distinct tables.
cluster_lines_into_blocks: Cluster text line segments into text blocks based on x-overlap and vertical proximity.
cluster_words_into_lines: Cluster words into text lines based on y-proximity.
derive_edges: Derive all edges from collections of lines, rects, and curves.
edge_from_curve: Derive an Edge from a Curve using chord approximation (start to end).
edge_from_line: Derive an Edge from a Line (direct conversion).
edges_from_rect: Derive 4 Edges from a Rect (top, bottom, left, right).
edges_to_intersections: Find all intersection points between horizontal and vertical edges.
explicit_lines_to_edges: Convert user-provided explicit line coordinates into edges.
extract_shapes: Extract Line, Rect, and Curve objects from a painted path.
extract_text_for_cells: Extract text content for each cell by finding characters within the cell bbox.
image_from_ctm: Extract an Image from the CTM active during a Do operator invocation.
intersections_to_cells: Construct rectangular cells from a grid of intersection points.
is_cjk: Returns true if the character is a CJK ideograph, syllable, or kana.
is_cjk_text: Returns true if the first character of the text is CJK.
join_edge_group: Merge overlapping or adjacent collinear edge segments.
snap_edges: Snap nearby parallel edges to aligned positions.
sort_blocks_reading_order: Sort text blocks in natural reading order.
split_lines_at_columns: Split text lines at large horizontal gaps to detect column boundaries.
words_to_edges_stream: Generate synthetic edges from text alignment patterns for the Stream strategy.
words_to_text: Simple (non-layout) text extraction from words.
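
Several of the functions above describe grouping by spatial proximity, for example cluster_words_into_lines ("based on y-proximity") followed by words_to_text. The sketch below shows one simple way such grouping can work. It is self-contained and does not call the crate; `WordSketch`, `cluster_into_lines`, `lines_to_text`, and the `y_tolerance` parameter are hypothetical, and the crate's actual algorithm may differ.

```rust
/// Hypothetical minimal word record (not the crate's `Word`).
#[derive(Debug, Clone, PartialEq)]
struct WordSketch {
    text: String,
    x0: f64,  // left edge
    top: f64, // distance from the top of the page
}

/// Group words into lines: after sorting by `top`, a word joins the current
/// line if its `top` is within `y_tolerance` of that line's topmost word;
/// otherwise it starts a new line. Each line is then ordered left to right.
fn cluster_into_lines(mut words: Vec<WordSketch>, y_tolerance: f64) -> Vec<Vec<WordSketch>> {
    words.sort_by(|a, b| a.top.total_cmp(&b.top));
    let mut lines: Vec<Vec<WordSketch>> = Vec::new();
    for word in words {
        let starts_new_line = match lines.last() {
            Some(line) => (word.top - line[0].top).abs() > y_tolerance,
            None => true,
        };
        if starts_new_line {
            lines.push(vec![word]);
        } else if let Some(line) = lines.last_mut() {
            line.push(word);
        }
    }
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}

/// Join the words of each line with spaces and the lines with newlines.
fn lines_to_text(lines: &[Vec<WordSketch>]) -> String {
    lines
        .iter()
        .map(|line| {
            line.iter()
                .map(|w| w.text.as_str())
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Comparing against the line's topmost word keeps the sketch short, but it means a slowly drifting baseline can split one visual line in two; comparing against a running average is one alternative.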

Type Aliases§

FilteredPage: A page view produced by Page::filter or CroppedPage::filter.
LineOrientation: Type alias preserving backward compatibility.