Backend-independent data types and algorithms for pdfplumber-rs.
This crate provides the foundational types (BBox, Char, Word,
Line, Rect, Table, etc.) and algorithms (text grouping, table
detection) used by pdfplumber-rs. It has no required external dependencies —
all functionality is pure Rust.
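To give a sense of what a dependency-free geometric type can look like, here is a minimal sketch of an axis-aligned bounding box with a union operation. `SketchBBox` and its field names are hypothetical and chosen for illustration only; the crate's actual `BBox` definition is not shown in this section and may differ.

```rust
/// Hypothetical axis-aligned bounding box (NOT the crate's `BBox`).
/// Assumes a top-left origin, so `top <= bottom`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SketchBBox {
    x0: f64,
    top: f64,
    x1: f64,
    bottom: f64,
}

impl SketchBBox {
    /// Horizontal extent of the box.
    fn width(&self) -> f64 {
        self.x1 - self.x0
    }

    /// Vertical extent of the box.
    fn height(&self) -> f64 {
        self.bottom - self.top
    }

    /// Smallest box that contains both `self` and `other`.
    fn union(&self, other: &SketchBBox) -> SketchBBox {
        SketchBBox {
            x0: self.x0.min(other.x0),
            top: self.top.min(other.top),
            x1: self.x1.max(other.x1),
            bottom: self.bottom.max(other.bottom),
        }
    }
}
```

A union like this is the usual way to grow a word's box as characters are appended, or a line's box as words are appended.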
Modules
- geometry — Geometric primitives: Point, BBox, Ctm, Orientation
- text — Character data: Char, TextDirection, CJK detection
- words — Word extraction: Word, WordExtractor, WordOptions
- layout — Text layout: TextLine, TextBlock, TextOptions
- shapes — Shapes from painted paths: Line, Rect, Curve
- edges — Edge derivation for table detection: Edge, EdgeSource
- table — Table detection: Table, TableFinder, TableSettings
- images — Image extraction: Image, ImageMetadata
- painting — Graphics state: Color, GraphicsState, PaintedPath
- path — Path construction: Path, PathBuilder, PathSegment
- encoding — Font encoding: FontEncoding, EncodingResolver
- error — Errors and warnings: PdfError, ExtractWarning, ExtractOptions
- search — Text search: SearchMatch, SearchOptions, search_chars
- unicode_norm — Unicode normalization: UnicodeNorm, normalize_chars
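The edge and table modules suggest a pipeline in which edges derived from shapes are intersected to locate candidate cell corners. The sketch below illustrates that one step under stated assumptions: `HEdge`, `VEdge`, `find_intersections`, and the `tol` parameter are hypothetical names, and this is not the crate's `edges_to_intersections`, whose signature is not shown here.

```rust
/// Hypothetical horizontal edge at height `y`, spanning `x0..=x1`.
#[derive(Debug, Clone, Copy)]
struct HEdge {
    y: f64,
    x0: f64,
    x1: f64,
}

/// Hypothetical vertical edge at position `x`, spanning `top..=bottom`.
#[derive(Debug, Clone, Copy)]
struct VEdge {
    x: f64,
    top: f64,
    bottom: f64,
}

/// Returns the (x, y) points where a vertical edge crosses a horizontal
/// edge. `tol` allows edges that stop slightly short of each other to
/// still count as touching.
fn find_intersections(h: &[HEdge], v: &[VEdge], tol: f64) -> Vec<(f64, f64)> {
    let mut points = Vec::new();
    for ve in v {
        for he in h {
            // The vertical edge's x must fall within the horizontal span.
            let x_in = ve.x >= he.x0 - tol && ve.x <= he.x1 + tol;
            // The horizontal edge's y must fall within the vertical span.
            let y_in = he.y >= ve.top - tol && he.y <= ve.bottom + tol;
            if x_in && y_in {
                points.push((ve.x, he.y));
            }
        }
    }
    points
}
```

For a single ruled rectangle (two horizontal and two vertical edges) this yields its four corners. A real implementation would likely deduplicate nearby points and record which edges meet at each one.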
Re-exports
- pub use annotation::Annotation;
- pub use annotation::AnnotationType;
- pub use bookmark::Bookmark;
- pub use dedupe::DedupeOptions;
- pub use dedupe::dedupe_chars;
- pub use edges::Edge;
- pub use edges::EdgeSource;
- pub use edges::derive_edges;
- pub use edges::edge_from_curve;
- pub use edges::edge_from_line;
- pub use edges::edges_from_rect;
- pub use encoding::EncodingResolver;
- pub use encoding::FontEncoding;
- pub use encoding::StandardEncoding;
- pub use error::ExtractOptions;
- pub use error::ExtractResult;
- pub use error::ExtractWarning;
- pub use error::PdfError;
- pub use form_field::FieldType;
- pub use form_field::FormField;
- pub use geometry::BBox;
- pub use geometry::Ctm;
- pub use geometry::Orientation;
- pub use geometry::Point;
- pub use html::HtmlOptions;
- pub use html::HtmlRenderer;
- pub use hyperlink::Hyperlink;
- pub use images::Image;
- pub use images::ImageContent;
- pub use images::ImageFormat;
- pub use images::ImageMetadata;
- pub use images::image_from_ctm;
- pub use layout::TextBlock;
- pub use layout::TextLine;
- pub use layout::TextOptions;
- pub use layout::blocks_to_text;
- pub use layout::cluster_lines_into_blocks;
- pub use layout::cluster_words_into_lines;
- pub use layout::sort_blocks_reading_order;
- pub use layout::split_lines_at_columns;
- pub use layout::words_to_text;
- pub use markdown::MarkdownOptions;
- pub use markdown::MarkdownRenderer;
- pub use metadata::DocumentMetadata;
- pub use page_object::PageObject;
- pub use painting::Color;
- pub use painting::DashPattern;
- pub use painting::ExtGState;
- pub use painting::FillRule;
- pub use painting::GraphicsState;
- pub use painting::PaintedPath;
- pub use path::Path;
- pub use path::PathBuilder;
- pub use path::PathSegment;
- pub use repair::RepairOptions;
- pub use repair::RepairResult;
- pub use search::SearchMatch;
- pub use search::SearchOptions;
- pub use search::search_chars;
- pub use shapes::Curve;
- pub use shapes::Line;
- pub use shapes::LineOrientation;
- pub use shapes::Rect;
- pub use shapes::extract_shapes;
- pub use signature::SignatureInfo;
- pub use struct_tree::StructElement;
- pub use svg::DrawStyle;
- pub use svg::SvgDebugOptions;
- pub use svg::SvgOptions;
- pub use svg::SvgRenderer;
- pub use table::Cell;
- pub use table::ExplicitLines;
- pub use table::Intersection;
- pub use table::Strategy;
- pub use table::Table;
- pub use table::TableFinder;
- pub use table::TableFinderDebug;
- pub use table::TableQuality;
- pub use table::TableSettings;
- pub use table::cells_to_tables;
- pub use table::edges_to_intersections;
- pub use table::explicit_lines_to_edges;
- pub use table::extract_text_for_cells;
- pub use table::intersections_to_cells;
- pub use table::join_edge_group;
- pub use table::snap_edges;
- pub use table::words_to_edges_stream;
- pub use text::Char;
- pub use text::TextDirection;
- pub use text::is_cjk;
- pub use text::is_cjk_text;
- pub use unicode_norm::UnicodeNorm;
- pub use unicode_norm::normalize_chars;
- pub use validation::Severity;
- pub use validation::ValidationIssue;
- pub use words::Word;
- pub use words::WordExtractor;
- pub use words::WordOptions;
Modules
- annotation
  PDF annotation types.
- bookmark
  PDF bookmark / outline / table of contents types.
- dedupe
  Duplicate character deduplication.
- edges
  Edge derivation from geometric primitives for table detection.
- encoding
  Font encoding mapping (Standard, Windows, Mac, Custom). Standard PDF text encodings and encoding resolution.
- error
  Error and warning types for PDF processing.
- form_field
  PDF form field types for AcroForm extraction.
- geometry
  Geometric primitives: Point, BBox, CTM, Orientation.
- html
  HTML rendering for PDF page content.
- hyperlink
  PDF hyperlink types.
- images
  Image extraction and metadata. Image extraction from XObject Do operator.
- layout
  Text layout: words → lines → blocks, reading order, text output.
- markdown
  Markdown rendering for PDF page content.
- metadata
  Document-level metadata types.
- page_object
  PageObject enum for custom object filtering.
- painting
  Graphics state, colors, dash patterns, and painted paths. Path painting operators, graphics state, and ExtGState types.
- path
  PDF path construction (MoveTo, LineTo, CurveTo, ClosePath).
- repair
  PDF repair types for best-effort fixing of common PDF issues.
- search
  Text search with position — find text patterns and return matches with bounding boxes.
- shapes
  Shape extraction: Lines, Rects, Curves from painted paths.
- signature
  PDF digital signature information types.
- struct_tree
  PDF structure tree types for tagged PDF access.
- svg
  SVG rendering for visual debugging of PDF pages.
- table
  Table detection: lattice, stream, and explicit strategies. Table detection types and pipeline.
- text
  Character data types and CJK detection.
- unicode_norm
  Unicode normalization for extracted text.
- validation
  PDF validation types for detecting specification violations.
- words
  Word extraction from characters based on spatial proximity.
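Word extraction by spatial proximity can be sketched as a single left-to-right pass over the characters of one line: a character joins the current word if the horizontal gap to it is small enough, and otherwise starts a new word. The types `SketchChar` and `SketchWord`, the function `group_into_words`, and the `x_tolerance` parameter are hypothetical names used for illustration. They are not the crate's `Char`, `Word`, `WordExtractor`, or `WordOptions`, whose definitions are not shown in this section.

```rust
/// Hypothetical character record (NOT the crate's `Char`).
#[derive(Debug, Clone)]
struct SketchChar {
    text: char,
    x0: f64,
    x1: f64,
}

/// Hypothetical word record (NOT the crate's `Word`).
#[derive(Debug, Clone, PartialEq)]
struct SketchWord {
    text: String,
    x0: f64,
    x1: f64,
}

/// Groups the characters of a single text line into words.
/// Assumes `chars` is already sorted left to right. A whitespace
/// character, or a horizontal gap wider than `x_tolerance`, ends the
/// current word.
fn group_into_words(chars: &[SketchChar], x_tolerance: f64) -> Vec<SketchWord> {
    let mut words: Vec<SketchWord> = Vec::new();
    // True while the last word in `words` may still be extended.
    let mut open = false;

    for c in chars {
        if c.text.is_whitespace() {
            open = false;
            continue;
        }
        if open {
            let last = words.last_mut().expect("open implies a word exists");
            if c.x0 - last.x1 <= x_tolerance {
                // Close enough: extend the current word.
                last.text.push(c.text);
                last.x1 = c.x1;
                continue;
            }
        }
        // Gap too wide, or no open word: start a new one.
        words.push(SketchWord {
            text: c.text.to_string(),
            x0: c.x0,
            x1: c.x1,
        });
        open = true;
    }
    words
}
```

This sketch ignores vertical position, text direction, and CJK handling, all of which a full extractor would need to consider.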