Crate pdf_oxide

§PDF Oxide

Production-grade PDF toolkit in Rust: 47.9× faster than PyMuPDF4LLM with PDF spec compliance.

§Core Features

§Reading & Extraction

  • PDF Spec Compliance: ISO 32000-1:2008 sections 9, 14.7-14.8
  • Text Extraction: 5-level character-to-Unicode priority (§9.10.2)
  • Reading Order: 4 pluggable strategies (XY-Cut, Structure Tree, Geometric, Simple)
  • Font Support: 70-80% character recovery with CID-to-GID mapping
  • OCR Support: DBNet++ detection + SVTR recognition with smart auto-detection
  • Complex Scripts: RTL (Arabic/Hebrew), CJK (Japanese/Korean/Chinese), Devanagari, Thai
  • Format Conversion: Markdown, HTML, PlainText, TOC
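The XY-Cut reading-order strategy named above can be illustrated with a generic recursive sketch. This is not pdf_oxide's implementation: the `Rect` type, the `widest_gap` and `xy_cut` functions, and the `min_gap` parameter are all hypothetical, and the sketch assumes y grows downward (PDF user space would need a flip). It tries to split a region into columns at the widest vertical whitespace gap, then into rows at the widest horizontal gap, and recurses.

```rust
/// Bounding box of a text block (hypothetical type, not pdf_oxide's).
/// Assumes y grows downward, as in image coordinates.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

/// Project `rects` onto one axis and return the midpoint of the widest
/// empty gap that is at least `min_gap` wide, if any.
/// `span` maps a rect to its (start, end) interval on that axis.
fn widest_gap(rects: &[Rect], span: fn(&Rect) -> (f32, f32), min_gap: f32) -> Option<f32> {
    let mut intervals: Vec<(f32, f32)> = rects.iter().map(span).collect();
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f32, f32)> = None; // (gap width, cut position)
    let mut covered_end = intervals.first()?.1;
    for &(start, end) in &intervals[1..] {
        let gap = start - covered_end;
        if gap >= min_gap && best.map_or(true, |(width, _)| gap > width) {
            best = Some((gap, covered_end + gap / 2.0));
        }
        covered_end = covered_end.max(end);
    }
    best.map(|(_, cut)| cut)
}

/// Recursive XY-cut. `min_gap` should be positive.
fn xy_cut(mut rects: Vec<Rect>, min_gap: f32) -> Vec<Rect> {
    if rects.len() <= 1 {
        return rects;
    }
    // Vertical cut: everything left of the gap is read before the right side.
    if let Some(cut) = widest_gap(&rects, |r: &Rect| (r.x0, r.x1), min_gap) {
        let (left, right): (Vec<Rect>, Vec<Rect>) =
            rects.into_iter().partition(|r| r.x1 <= cut);
        let mut out = xy_cut(left, min_gap);
        out.extend(xy_cut(right, min_gap));
        return out;
    }
    // Horizontal cut: everything above the gap is read before what is below.
    if let Some(cut) = widest_gap(&rects, |r: &Rect| (r.y0, r.y1), min_gap) {
        let (top, bottom): (Vec<Rect>, Vec<Rect>) =
            rects.into_iter().partition(|r| r.y1 <= cut);
        let mut out = xy_cut(top, min_gap);
        out.extend(xy_cut(bottom, min_gap));
        return out;
    }
    // No clean cut: fall back to top-to-bottom, then left-to-right.
    rects.sort_by(|a, b| a.y0.total_cmp(&b.y0).then(a.x0.total_cmp(&b.x0)));
    rects
}
```

On a page with a full-width title above two columns, no vertical cut exists at the top level, so the title is split off first; the remaining body is then cut into a left and a right column, and each column is read top to bottom.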

§Writing & Creation (v0.3.0)

  • PDF Generation: Fluent DocumentBuilder API for programmatic PDF creation
  • Format Conversion: Markdown → PDF, HTML → PDF, Plain Text → PDF
  • Advanced Graphics: Path operations, image embedding, table generation
  • Font Embedding: Automatic font subsetting for compact output
  • Interactive Forms: Fillable forms with text fields, checkboxes, radio buttons, dropdowns

§Editing (v0.3.0)

  • DOM-like API: Query and modify PDF content with strongly-typed wrappers
  • Element Modification: Find and replace text, modify images, paths, tables
  • Page Operations: Add, remove, reorder, merge pages
  • Metadata Editing: Title, author, subject, keywords
  • Incremental Saves: Efficient appending without full rewrite

§Architecture

  • Pluggable Design: Trait-based extensibility for strategies and converters
  • Python Bindings: Full API via PyO3
  • Symmetric Read/Write: Unified ContentElement model for extraction and generation
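To make the trait-based, pluggable design concrete, here is a minimal sketch of how a reading-order strategy could sit behind a trait. Every name in it (`Span`, `ReadingOrder`, `Simple`, `TwoColumn`, `to_text`) is hypothetical and much simpler than whatever the crate actually defines; it only illustrates the pattern of depending on a trait object so that strategies can be swapped without touching the caller.

```rust
/// A positioned piece of text (hypothetical, not pdf_oxide's span type).
/// `y` is the distance from the top of the page.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    x: f32,
    y: f32,
}

/// Hypothetical extension point: anything that can put spans in reading order.
trait ReadingOrder {
    fn order(&self, spans: Vec<Span>) -> Vec<Span>;
}

/// Top-to-bottom, then left-to-right.
struct Simple;

impl ReadingOrder for Simple {
    fn order(&self, mut spans: Vec<Span>) -> Vec<Span> {
        spans.sort_by(|a, b| a.y.total_cmp(&b.y).then(a.x.total_cmp(&b.x)));
        spans
    }
}

/// Two fixed columns split at `split_x`: the left column is read fully
/// before the right column.
struct TwoColumn {
    split_x: f32,
}

impl ReadingOrder for TwoColumn {
    fn order(&self, spans: Vec<Span>) -> Vec<Span> {
        let (left, right): (Vec<Span>, Vec<Span>) =
            spans.into_iter().partition(|s| s.x < self.split_x);
        let mut out = Simple.order(left);
        out.extend(Simple.order(right));
        out
    }
}

/// The caller depends only on the trait, so any strategy can be plugged in.
fn to_text(strategy: &dyn ReadingOrder, spans: Vec<Span>) -> String {
    strategy
        .order(spans)
        .into_iter()
        .map(|s| s.text)
        .collect::<Vec<_>>()
        .join(" ")
}
```

Passing `&Simple` or `&TwoColumn { split_x: 25.0 }` to `to_text` changes the output order without any change to `to_text` itself, which is the property a pluggable design is after.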

§Planned for v0.4.0+

  • Digital Signatures: Full signing and verification (foundation in v0.3.0)
  • Advanced: Figures, citations, annotations, accessibility (v0.5.0+)

§Quick Start - Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::pipeline::converters::MarkdownOutputConverter;

// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;

// Extract text with reading order (multi-column support)
let spans = doc.extract_spans(0)?;
let config = TextPipelineConfig::default();
let pipeline = TextPipeline::with_config(config.clone());
let ordered_spans = pipeline.process(spans, Default::default())?;

// Convert to Markdown
let converter = MarkdownOutputConverter::new();
let markdown = converter.convert(&ordered_spans, &config)?;
println!("{}", markdown);

§Quick Start - Python

from pdf_oxide import PdfDocument

# Open and extract with automatic reading order
doc = PdfDocument("paper.pdf")
markdown = doc.to_markdown(0)
print(markdown)

§License

Licensed under either of:

at your option.

Re-exports§

pub use pipeline::XYCutStrategy;
pub use annotation_types::AnnotationBorderStyle;
pub use annotation_types::AnnotationColor;
pub use annotation_types::AnnotationFlags;
pub use annotation_types::AnnotationSubtype;
pub use annotation_types::BorderEffectStyle;
pub use annotation_types::BorderStyleType;
pub use annotation_types::CaretSymbol;
pub use annotation_types::FileAttachmentIcon;
pub use annotation_types::FreeTextIntent;
pub use annotation_types::HighlightMode;
pub use annotation_types::LineEndingStyle;
pub use annotation_types::QuadPoint;
pub use annotation_types::ReplyType;
pub use annotation_types::StampType;
pub use annotation_types::TextAlignment;
pub use annotation_types::TextAnnotationIcon;
pub use annotation_types::TextMarkupType;
pub use annotation_types::WidgetFieldType;
pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::DocumentType;
pub use config::ExtractionProfile;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;

Modules§

annotation_types
Core annotation types and enums per PDF spec ISO 32000-1:2008, Section 12.5.
annotations
PDF annotations support.
api
High-level PDF API for simple document creation and manipulation.
compliance
PDF compliance validation and conversion module.
config
Configuration module for PDF text extraction.
content
PDF content stream parsing and execution.
converters
Format converters for PDF documents.
decoders
Stream decoder implementations for PDF filters.
document
PDF document model.
editor
PDF editing module for modifying existing PDF documents.
elements
Unified content elements for read/write operations, including PDF generation.
encryption
PDF encryption support.
error
Error types for the PDF library.
extractors
Text and content extraction from PDF documents.
fdf
Forms Data Format (FDF) support for exporting/importing form field values.
fonts
Font handling and encoding.
geometry
Geometric primitives for layout analysis.
hybrid
Hybrid classical + ML architecture.
images
Image extraction from PDFs.
layout
Layout analysis algorithms for PDF documents.
lexer
PDF lexer (tokenizer).
object
PDF object types.
objstm
Object stream parsing (PDF 1.5+).
outline
PDF document outline (bookmarks) support.
parser
PDF object parser.
parser_config
Parser configuration options.
pipeline
PDF text extraction pipeline with clean abstraction layers.
search
Text search functionality for PDF documents.
structure
PDF logical structure (Tagged PDF) support.
text
Text processing and analysis module.
writer
PDF writing module for generating PDF files.
xfa
XFA (XML Forms Architecture) support.
xref
Cross-reference table parser.
xref_reconstruction
Cross-reference table reconstruction for damaged PDFs.

Macros§

extract_log_debug
Log a DEBUG level message.
extract_log_error
Log an ERROR level message.
extract_log_info
Log an INFO level message.
extract_log_trace
Log a TRACE level message.
extract_log_warn
Log a WARN level message.

Constants§

NAME
Library name.
VERSION
Library version.