§PDF Oxide
Production-grade PDF toolkit in Rust: 47.9× faster than PyMuPDF4LLM with PDF spec compliance.
§Core Features
§Reading & Extraction
- PDF Spec Compliance: ISO 32000-1:2008 sections 9, 14.7-14.8
- Text Extraction: 5-level character-to-Unicode priority (§9.10.2)
- Reading Order: 4 pluggable strategies (XY-Cut, Structure Tree, Geometric, Simple)
- Font Support: 70-80% character recovery with CID-to-GID mapping
- OCR Support: DBNet++ detection + SVTR recognition with smart auto-detection
- Complex Scripts: RTL (Arabic/Hebrew), CJK (Japanese/Korean/Chinese), Devanagari, Thai
- Format Conversion: Markdown, HTML, PlainText, TOC
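To make the XY-Cut reading-order strategy named above concrete, here is a minimal, generic sketch of the recursive idea: project the boxes onto each axis, cut at the widest empty gap, and recurse into the two halves. It is not pdf_oxide's `XYCutStrategy` implementation. `BBox`, `widest_gap`, `emit_sorted`, `recurse`, and `xy_cut` are hypothetical names invented for this illustration, and the sketch assumes a top-left origin with y growing downward and a positive `min_gap`.

```rust
/// Axis-aligned box in page space. Hypothetical type for this sketch,
/// not a pdf_oxide type. Assumes a top-left origin with y growing downward.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Given (start, end) projections onto one axis, return the widest empty gap
/// of at least `min_gap` as (gap width, cut position at the gap midpoint).
fn widest_gap(mut intervals: Vec<(f32, f32)>, min_gap: f32) -> Option<(f32, f32)> {
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = intervals.first()?.1; // furthest end covered so far
    let mut best: Option<(f32, f32)> = None;
    for &(start, end) in &intervals[1..] {
        let gap = start - reach;
        if gap > 0.0 && gap >= min_gap && best.map_or(true, |(g, _)| gap > g) {
            best = Some((gap, (start + reach) / 2.0));
        }
        reach = reach.max(end);
    }
    best
}

/// Fallback when no usable cut exists: top-to-bottom, then left-to-right.
fn emit_sorted(boxes: &[BBox], mut idx: Vec<usize>, out: &mut Vec<usize>) {
    idx.sort_by(|&a, &b| {
        boxes[a].y0.total_cmp(&boxes[b].y0).then(boxes[a].x0.total_cmp(&boxes[b].x0))
    });
    out.extend(idx);
}

fn recurse(boxes: &[BBox], idx: Vec<usize>, min_gap: f32, out: &mut Vec<usize>) {
    if idx.len() <= 1 {
        out.extend(idx);
        return;
    }
    let xs = widest_gap(idx.iter().map(|&i| (boxes[i].x0, boxes[i].x1)).collect(), min_gap);
    let ys = widest_gap(idx.iter().map(|&i| (boxes[i].y0, boxes[i].y1)).collect(), min_gap);
    // Cut at the wider gap; ties go to the horizontal cut (a split along y).
    let cut = match (xs, ys) {
        (Some((gx, px)), Some((gy, _))) if gx > gy => Some((true, px)),
        (_, Some((_, py))) => Some((false, py)),
        (Some((_, px)), None) => Some((true, px)),
        (None, None) => None,
    };
    let Some((vertical, pos)) = cut else {
        emit_sorted(boxes, idx, out);
        return;
    };
    // Boxes whose center lies before the cut are read first.
    let (mut first, second): (Vec<usize>, Vec<usize>) = idx.into_iter().partition(|&i| {
        let b = boxes[i];
        let center = if vertical { (b.x0 + b.x1) / 2.0 } else { (b.y0 + b.y1) / 2.0 };
        center < pos
    });
    // Guard against a degenerate split so the recursion always terminates.
    if first.is_empty() || second.is_empty() {
        first.extend(second);
        emit_sorted(boxes, first, out);
        return;
    }
    recurse(boxes, first, min_gap, out);
    recurse(boxes, second, min_gap, out);
}

/// Return the indices of `boxes` in estimated reading order.
fn xy_cut(boxes: &[BBox], min_gap: f32) -> Vec<usize> {
    let mut order = Vec::with_capacity(boxes.len());
    recurse(boxes, (0..boxes.len()).collect(), min_gap, &mut order);
    order
}
```

Calling `xy_cut(&boxes, 5.0)` on a page with a full-width title above two columns yields the title first, then the left column top to bottom, then the right column. Choosing the widest gap on either axis, rather than strictly alternating axes, is one way to keep aligned text lines in adjacent columns from being cut across the gutter; other variants of the algorithm make different choices.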
§Writing & Creation (v0.3.0)
- PDF Generation: Fluent DocumentBuilder API for programmatic PDF creation
- Format Conversion: Markdown → PDF, HTML → PDF, Plain Text → PDF
- Advanced Graphics: Path operations, image embedding, table generation
- Font Embedding: Automatic font subsetting for compact output
- Interactive Forms: Fillable forms with text fields, checkboxes, radio buttons, dropdowns
§Editing (v0.3.0)
- DOM-like API: Query and modify PDF content with strongly-typed wrappers
- Element Modification: Find and replace text, modify images, paths, tables
- Page Operations: Add, remove, reorder, merge pages
- Metadata Editing: Title, author, subject, keywords
- Incremental Saves: Efficient appending without full rewrite
§Architecture
- Pluggable Design: Trait-based extensibility for strategies and converters
- Python Bindings: Full API via PyO3
- Symmetric Read/Write: Unified ContentElement model for extraction and generation
§Planned for v0.4.0+
- Digital Signatures: Full signing and verification (foundation in v0.3.0)
- Advanced: Figures, citations, annotations, accessibility (v0.5.0+)
§Quick Start - Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::pipeline::converters::MarkdownOutputConverter;
// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;
// Extract text with reading order (multi-column support)
let spans = doc.extract_spans(0)?;
let config = TextPipelineConfig::default();
let pipeline = TextPipeline::with_config(config.clone());
let ordered_spans = pipeline.process(spans, Default::default())?;
// Convert to Markdown
let converter = MarkdownOutputConverter::new();
let markdown = converter.convert(&ordered_spans, &config)?;
println!("{}", markdown);
§Quick Start - Python
from pdf_oxide import PdfDocument
# Open and extract with automatic reading order
doc = PdfDocument("paper.pdf")
markdown = doc.to_markdown(0)
print(markdown)
§License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Re-exports§
pub use pipeline::XYCutStrategy;
pub use annotation_types::AnnotationBorderStyle;
pub use annotation_types::AnnotationColor;
pub use annotation_types::AnnotationFlags;
pub use annotation_types::AnnotationSubtype;
pub use annotation_types::BorderEffectStyle;
pub use annotation_types::BorderStyleType;
pub use annotation_types::CaretSymbol;
pub use annotation_types::FileAttachmentIcon;
pub use annotation_types::FreeTextIntent;
pub use annotation_types::HighlightMode;
pub use annotation_types::LineEndingStyle;
pub use annotation_types::QuadPoint;
pub use annotation_types::ReplyType;
pub use annotation_types::StampType;
pub use annotation_types::TextAlignment;
pub use annotation_types::TextAnnotationIcon;
pub use annotation_types::TextMarkupType;
pub use annotation_types::WidgetFieldType;
pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::DocumentType;
pub use config::ExtractionProfile;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;
Modules§
- annotation_types
- Core annotation types and enums per PDF spec ISO 32000-1:2008, Section 12.5.
- annotations
- PDF annotations support.
- api
- High-level PDF API for simple document creation and manipulation.
- compliance
- PDF compliance validation and conversion module.
- config
- Configuration module for PDF text extraction.
- content
- PDF content stream parsing and execution.
- converters
- Format converters for PDF documents.
- decoders
- Stream decoder implementations for PDF filters.
- document
- PDF document model.
- editor
- PDF editing module for modifying existing PDF documents.
- elements
- Unified content elements for PDF generation and read/write operations.
- encryption
- PDF encryption support.
- error
- Error types for the PDF library.
- extractors
- Text and content extraction from PDF documents.
- fdf
- Forms Data Format (FDF) support for exporting/importing form field values.
- fonts
- Font handling and encoding.
- geometry
- Geometric primitives for layout analysis.
- hybrid
- Hybrid classical + ML architecture.
- images
- Image extraction from PDFs.
- layout
- Layout analysis algorithms for PDF documents.
- lexer
- PDF lexer (tokenizer).
- object
- PDF object types.
- objstm
- Object stream parsing (PDF 1.5+).
- outline
- PDF document outline (bookmarks) support.
- parser
- PDF object parser.
- parser_config
- Parser configuration options.
- pipeline
- PDF text extraction pipeline with clean abstraction layers.
- search
- Text search functionality for PDF documents.
- structure
- PDF logical structure (Tagged PDF) support.
- text
- Text processing and analysis module.
- writer
- PDF writing module for generating PDF files.
- xfa
- XFA (XML Forms Architecture) support.
- xref
- Cross-reference table parser.
- xref_reconstruction
- Cross-reference table reconstruction for damaged PDFs.
Macros§
- extract_log_debug
- Log a DEBUG level message.
- extract_log_error
- Log an ERROR level message.
- extract_log_info
- Log an INFO level message.
- extract_log_trace
- Log a TRACE level message.
- extract_log_warn
- Log a WARN level message.