Crate pdfplumber

Extract chars, words, lines, rects, and tables from PDF documents with precise coordinates.

pdfplumber is a Rust library for extracting structured content from PDF files. It is a Rust port of Python’s pdfplumber, providing the same coordinate-accurate extraction of characters, words, lines, rectangles, curves, images, and tables.

§Quick Start

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
for page_result in pdf.pages_iter() {
    let page = page_result.unwrap();
    let text = page.extract_text(&TextOptions::default());
    println!("Page {}: {}", page.page_number(), text);
}

§Architecture

The library is split into three crates:

  • pdfplumber-core: Backend-independent data types and algorithms
  • pdfplumber-parse: PDF parsing (Layer 1) and content stream interpreter (Layer 2)
  • pdfplumber (this crate): Public API facade that ties everything together

§Feature Flags

Feature   Default  Description
std       Yes      Enables file-path APIs (Pdf::open_file). Disable for WASM.
serde     No       Adds Serialize/Deserialize to all public data types.
parallel  No       Enables Pdf::pages_parallel() via rayon. Not WASM-compatible.

§Extracting Text

use pdfplumber::{Pdf, TextOptions};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();

// Simple text extraction
let text = page.extract_text(&TextOptions::default());

// Layout-preserving text extraction
let text = page.extract_text(&TextOptions { layout: true, ..Default::default() });

§Extracting Tables

use pdfplumber::{Pdf, TableSettings};

let pdf = Pdf::open_file("document.pdf", None).unwrap();
let page = pdf.page(0).unwrap();
let tables = page.find_tables(&TableSettings::default());
for table in &tables {
    for row in &table.rows {
        let cells: Vec<&str> = row.iter()
            .map(|c| c.text.as_deref().unwrap_or(""))
            .collect();
        println!("{:?}", cells);
    }
}
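
The function index on this page names the stages of a line-based table pipeline: edges are derived, intersections are found, and cells are built from them. As a rough illustration of the intersection stage only, the sketch below finds crossing points between axis-aligned edges. The `HEdge` and `VEdge` types, the `intersections` function, and the `tol` parameter are hypothetical stand-ins written for this example; they are not the crate's `Edge` type or its `edges_to_intersections` implementation.

```rust
/// Hypothetical horizontal edge (not the crate's `Edge` type).
#[derive(Debug, Clone, Copy)]
struct HEdge {
    y: f64,
    x0: f64,
    x1: f64,
}

/// Hypothetical vertical edge, in top-left-origin coordinates.
#[derive(Debug, Clone, Copy)]
struct VEdge {
    x: f64,
    top: f64,
    bottom: f64,
}

/// Return every (x, y) point where a vertical edge crosses a horizontal edge.
/// `tol` allows a little slack so edges that stop just short still count.
fn intersections(h: &[HEdge], v: &[VEdge], tol: f64) -> Vec<(f64, f64)> {
    let mut points = Vec::new();
    for ve in v {
        for he in h {
            // The vertical edge's x must fall within the horizontal span.
            let x_hits = ve.x >= he.x0 - tol && ve.x <= he.x1 + tol;
            // The horizontal edge's y must fall within the vertical span.
            let y_hits = he.y >= ve.top - tol && he.y <= ve.bottom + tol;
            if x_hits && y_hits {
                points.push((ve.x, he.y));
            }
        }
    }
    points
}
```

A 2-row by 3-column arrangement of ruling lines would yield six points here, which a later stage could pair up into rectangular cells. The nested loop is quadratic in the number of edges; that is acceptable for a sketch but a real implementation might sort or index the edges first.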

§WASM Support

This crate compiles for wasm32-unknown-unknown. For WASM builds, disable the default std feature and use the bytes-based API:

[dependencies]
pdfplumber = { version = "0.1", default-features = false }

Then use Pdf::open with a byte slice:

let pdf = Pdf::open(pdf_bytes, None)?;
let page = pdf.page(0)?;
let text = page.extract_text(&TextOptions::default());

The parallel feature is not available for WASM targets (rayon requires OS threads).

Re-exports§

pub use pdfplumber_parse;

Structs§

Annotation
A PDF annotation extracted from a page.
BBox
Bounding box with top-left origin coordinate system.
Bookmark
A single entry in the PDF document outline (bookmark / table of contents).
Cell
A detected table cell.
Char
A single character extracted from a PDF page.
CharEvent
Information about a rendered character glyph.
CroppedPage
A spatially filtered view of a PDF page.
Ctm
Current Transformation Matrix (CTM) — affine transform.
Curve
A curve extracted from a painted path (cubic Bezier segment).
DashPattern
Dash pattern for stroking operations.
DedupeOptions
Options for duplicate character detection and removal.
DocumentMetadata
Document-level metadata extracted from the PDF /Info dictionary.
DrawStyle
Style options for drawing overlays on the SVG page.
Edge
A line segment edge for table detection.
EncodingResolver
Resolved encoding for a font, following PDF encoding resolution order.
ExplicitLines
User-provided line coordinates for Explicit strategy.
ExtGState
Extended Graphics State parameters (from gs operator).
ExtractOptions
Options controlling extraction behavior and resource limits.
ExtractResult
Result wrapper that pairs a value with collected warnings.
ExtractWarning
A non-fatal warning encountered during extraction.
FontEncoding
An encoding table that may be a standard encoding modified by a Differences array.
FormField
A PDF form field extracted from the document’s AcroForm dictionary.
GraphicsState
Graphics state relevant to path painting.
HtmlOptions
Options for HTML rendering.
HtmlRenderer
Renders PDF page content as semantic HTML.
Hyperlink
A resolved hyperlink extracted from a PDF page.
Image
An image extracted from a PDF page via the Do operator.
ImageContent
Extracted image content (raw bytes) from a PDF image XObject.
ImageEvent
Information about a placed image.
ImageMetadata
Metadata about an image XObject from the PDF resource dictionary.
Intersection
An intersection point between horizontal and vertical edges.
Line
A line segment extracted from a painted path.
LopdfBackend
The lopdf-based PDF backend.
LopdfDocument
A parsed PDF document backed by lopdf.
LopdfPage
A reference to a single page within a LopdfDocument.
MarkdownOptions
Options for Markdown rendering.
MarkdownRenderer
Renders PDF page content as Markdown.
Page
A single page from a PDF document.
PageGeometry
Page coordinate normalization configuration.
PagesIter
Iterator over pages of a PDF document, yielding each page on demand.
PaintedPath
A painted path — the result of a painting operator applied to a constructed path.
Path
A complete path consisting of segments.
PathBuilder
Builder for constructing paths from PDF path operators.
PathEvent
Information about a painted path.
Pdf
A PDF document opened for extraction.
Point
A 2D point.
Rect
A rectangle extracted from a painted path.
RepairOptions
Options for controlling which PDF repairs to attempt.
RepairResult
Result of a PDF repair operation.
SearchMatch
A single text search match with its bounding box and position information.
SearchOptions
Options controlling text search behavior.
SignatureInfo
Digital signature metadata extracted from a PDF signature field.
StructElement
A node in the PDF structure tree.
SvgDebugOptions
Options for the debug_tablefinder SVG output.
SvgOptions
Options for SVG generation.
SvgRenderer
Renders PDF page content as SVG markup for visual debugging.
Table
A detected table.
TableFinder
Orchestrator for the table detection pipeline.
TableFinderDebug
Intermediate results from the table detection pipeline.
TableQuality
Quality metrics for a detected table.
TableSettings
Configuration for table detection.
TextBlock
A text block: a group of lines forming a coherent paragraph or section.
TextLine
A text line: a sequence of words on the same y-level.
TextOptions
Options for layout-aware text extraction.
ValidationIssue
A validation issue found in a PDF document.
Word
A word extracted from a PDF page.
WordExtractor
Extracts words from a sequence of characters based on spatial proximity.
WordOptions
Options for word extraction, matching the defaults of Python’s pdfplumber.
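
BBox is described above as using a top-left origin, whereas native PDF user space places the origin at the bottom-left with y increasing upward. The sketch below shows the basic flip between the two for an unrotated page. `TopLeftBox`, its field names, and `to_top_left` are hypothetical and written for this example; they are not the crate's `BBox` or `PageGeometry` definitions, and the sketch ignores page rotation and crop-box offsets.

```rust
/// Hypothetical stand-in for a top-left-origin bounding box
/// (not the crate's `BBox` definition).
#[derive(Debug, Clone, Copy, PartialEq)]
struct TopLeftBox {
    x0: f64,
    top: f64,
    x1: f64,
    bottom: f64,
}

/// Convert a rectangle given in PDF user space (origin at bottom-left,
/// y grows upward) into top-left-origin coordinates (y grows downward).
/// Assumes an unrotated page whose lower-left corner is at (0, 0).
fn to_top_left(x0: f64, y0: f64, x1: f64, y1: f64, page_height: f64) -> TopLeftBox {
    TopLeftBox {
        x0: x0.min(x1),
        // The larger native y is closer to the top of the page.
        top: page_height - y0.max(y1),
        x1: x0.max(x1),
        bottom: page_height - y0.min(y1),
    }
}
```

With this convention `top` is always less than or equal to `bottom`, so sorting objects by `top` walks the page from its first line downward.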

Enums§

AnnotationType
Common PDF annotation subtypes.
Color
Color value from a PDF color space.
EdgeSource
Source of an edge, tracking which geometric primitive it came from.
FieldType
The type of a PDF form field.
FillRule
Fill rule for path painting.
ImageFormat
Format of extracted image data.
Orientation
Orientation of a geometric element.
PageObject
An enum wrapping references to different page object types.
PaintOp
The type of paint operation applied to a path.
PathSegment
A segment of a PDF path.
PdfError
Fatal error types for PDF processing.
Severity
Severity of a validation issue.
StandardEncoding
A named standard PDF encoding.
Strategy
Strategy for table detection.
TextDirection
Text flow direction.
UnicodeNorm
Unicode normalization form to apply to extracted text.

Traits§

ContentHandler
Callback handler for content stream interpretation.
PdfBackend
Trait abstracting PDF parsing operations.

Functions§

blocks_to_text
Convert text blocks into a string.
cells_to_tables
Group adjacent cells into distinct tables.
cluster_lines_into_blocks
Cluster text line segments into text blocks based on x-overlap and vertical proximity.
cluster_words_into_lines
Cluster words into text lines based on y-proximity.
derive_edges
Derive all edges from collections of lines, rects, and curves.
edge_from_curve
Derive an Edge from a Curve using chord approximation (start to end).
edge_from_line
Derive an Edge from a Line (direct conversion).
edges_from_rect
Derive 4 Edges from a Rect (top, bottom, left, right).
edges_to_intersections
Find all intersection points between horizontal and vertical edges.
explicit_lines_to_edges
Convert user-provided explicit line coordinates into edges.
extract_shapes
Extract Line, Rect, and Curve objects from a painted path.
extract_text_for_cells
Extract text content for each cell by finding characters within the cell bbox.
image_from_ctm
Extract an Image from the CTM active during a Do operator invocation.
intersections_to_cells
Construct rectangular cells from a grid of intersection points.
is_cjk
Returns true if the character is a CJK ideograph, syllable, or kana.
is_cjk_text
Returns true if the first character of the text is CJK.
join_edge_group
Merge overlapping or adjacent collinear edge segments.
snap_edges
Snap nearby parallel edges to aligned positions.
sort_blocks_reading_order
Sort text blocks in natural reading order.
split_lines_at_columns
Split text lines at large horizontal gaps to detect column boundaries.
words_to_edges_stream
Generate synthetic edges from text alignment patterns for the Stream strategy.
words_to_text
Simple (non-layout) text extraction from words.
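
Several of the functions above group words by position, for example clustering words into lines based on y-proximity. A minimal sketch of that idea follows. `SimpleWord`, `cluster_by_top`, `lines_to_text`, and the `y_tolerance` parameter are hypothetical names chosen for this example; the crate's `Word` type and its `cluster_words_into_lines` function may differ in both signature and algorithm.

```rust
/// Hypothetical minimal word record (not the crate's `Word` type).
#[derive(Debug, Clone, PartialEq)]
struct SimpleWord {
    text: String,
    x0: f64,
    top: f64,
}

/// Group words into lines: sort by `top`, start a new line whenever a word's
/// `top` is more than `y_tolerance` away from the first word of the current
/// line, then order each line left to right.
fn cluster_by_top(mut words: Vec<SimpleWord>, y_tolerance: f64) -> Vec<Vec<SimpleWord>> {
    words.sort_by(|a, b| a.top.total_cmp(&b.top));
    let mut lines: Vec<Vec<SimpleWord>> = Vec::new();
    for word in words {
        let starts_new_line = match lines.last() {
            Some(line) => (word.top - line[0].top).abs() > y_tolerance,
            None => true,
        };
        if starts_new_line {
            lines.push(vec![word]);
        } else if let Some(line) = lines.last_mut() {
            line.push(word);
        }
    }
    for line in &mut lines {
        line.sort_by(|a, b| a.x0.total_cmp(&b.x0));
    }
    lines
}

/// Join each line's words with spaces and the lines with newlines.
fn lines_to_text(lines: &[Vec<SimpleWord>]) -> String {
    lines
        .iter()
        .map(|line| {
            line.iter()
                .map(|w| w.text.as_str())
                .collect::<Vec<_>>()
                .join(" ")
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Comparing against the first word of the line, rather than the previous word, keeps a slowly drifting baseline from merging distinct lines; comparing against a running average would be another reasonable choice.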

Type Aliases§

FilteredPage
A page view produced by Page::filter or CroppedPage::filter.
LineOrientation
Type alias preserving backward compatibility.