Crate pdf_oxide

§PDF Oxide

The fastest PDF library for Python and Rust. 0.8ms mean text extraction — 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber. 100% pass rate on 3,830 real-world PDFs. MIT licensed. A drop-in PyMuPDF alternative with no AGPL restrictions.

§Performance (v0.3.9)

Benchmarked against 18 libraries on 3,830 PDFs from 3 public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Single-thread, 60s timeout, no warm-up.

§Python PDF Libraries

Library        Mean      Pass Rate  License
pdf_oxide      0.8ms     100%       MIT
unstructured   478.4ms   99.6%      Apache-2.0
PyMuPDF        4.6ms     99.3%      AGPL-3.0
pypdfium2      4.1ms     99.2%      Apache-2.0
kreuzberg      7.2ms     99.1%      MIT
pymupdf4llm    55.5ms    99.1%      AGPL-3.0
pdftext        7.3ms     99.0%      GPL-3.0
extractous     112.0ms   98.9%      Apache-2.0
pdfminer       16.8ms    98.8%      MIT
pdfplumber     23.2ms    98.8%      MIT
markitdown     108.8ms   98.6%      MIT
pypdf          12.1ms    98.4%      BSD-3
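
The headline speedup figures can be checked directly against the mean times in the table above. The sketch below is only illustrative arithmetic: the names `PDF_OXIDE_MS`, `speedup`, and `describe` are made up for this example and are not part of the crate.

```rust
/// Mean extraction time of pdf_oxide in milliseconds, copied from the table above.
const PDF_OXIDE_MS: f64 = 0.8;

/// Ratio of another library's mean time to the pdf_oxide mean time.
/// (Hypothetical helper, not a pdf_oxide API.)
fn speedup(other_ms: f64) -> f64 {
    other_ms / PDF_OXIDE_MS
}

/// Formats one comparison line with the ratio rounded to one decimal place.
fn describe(name: &str, other_ms: f64) -> String {
    format!("{}: {:.1}x", name, speedup(other_ms))
}
```

With the table values, PyMuPDF (4.6ms) comes out at about 5.75, pypdf (12.1ms) at about 15.1, and pdfplumber (23.2ms) at 29.0, so the quoted 5×, 15×, and 29× are the ratios rounded down.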

§Rust PDF Libraries

Library        Mean      Pass Rate  Text Extraction
pdf_oxide      0.8ms     100%       Built-in
oxidize_pdf    13.5ms    99.1%      Basic
unpdf          2.8ms     95.1%      Basic
pdf_extract    4.08ms    91.5%      Basic
lopdf          0.3ms     80.2%      No built-in extraction

99.5% text quality parity vs PyMuPDF, pypdfium2, and kreuzberg across the full corpus. Full benchmark details: https://pdf.oxide.fyi/docs/performance

§Core Features

§Reading & Extraction

  • Text Extraction: Character, span, and page-level with font metadata and bounding boxes
  • Reading Order: 4 pluggable strategies (XY-Cut, Structure Tree, Geometric, Simple)
  • Complex Scripts: RTL (Arabic/Hebrew), CJK (Japanese/Korean/Chinese), Devanagari, Thai
  • Format Conversion: PDF → Markdown, HTML, plain text
  • Image Extraction: Content streams, Form XObjects, inline images
  • Forms & Annotations: Read/write form fields, all annotation types, bookmarks
  • Text Search: Regex and case-insensitive search with page-level results
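
To give a feel for what an XY-Cut reading-order strategy does, here is a minimal, self-contained sketch of the textbook recursive XY-cut idea. It is not the crate's `XYCutStrategy`. The `Rect` type, the `min_gap` threshold, and the convention that y grows downward are all assumptions made for this example.

```rust
/// Axis-aligned bounding box. Assumption: y grows downward, as on a screen.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect { x0: f64, y0: f64, x1: f64, y1: f64 }

/// Finds the widest empty gap in the 1-D projection of `intervals`.
/// Returns `(gap_width, split_position)`, or `None` if there is no gap.
fn widest_gap(mut intervals: Vec<(f64, f64)>) -> Option<(f64, f64)> {
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f64, f64)> = None;
    let mut reach = intervals.first()?.1; // furthest end covered so far
    for &(start, end) in &intervals[1..] {
        let gap = start - reach;
        if gap > 0.0 && best.map_or(true, |(g, _)| gap > g) {
            best = Some((gap, reach + gap / 2.0)); // split in the middle of the gap
        }
        reach = reach.max(end);
    }
    best
}

/// Recursively splits `boxes` at the widest gap (at least `min_gap` wide)
/// and appends them to `out` top-to-bottom, then left-to-right.
fn xy_cut(boxes: Vec<Rect>, min_gap: f64, out: &mut Vec<Rect>) {
    if boxes.len() <= 1 {
        out.extend(boxes);
        return;
    }
    let x = widest_gap(boxes.iter().map(|b| (b.x0, b.x1)).collect())
        .filter(|g| g.0 >= min_gap);
    let y = widest_gap(boxes.iter().map(|b| (b.y0, b.y1)).collect())
        .filter(|g| g.0 >= min_gap);
    let (first, second): (Vec<Rect>, Vec<Rect>) = match (x, y) {
        // Horizontal cut: everything above the gap is read first.
        (Some((gx, _)), Some((gy, at))) if gy >= gx => {
            boxes.into_iter().partition(|b| b.y1 <= at)
        }
        // Vertical cut: the left part is read first.
        (Some((_, at)), _) => boxes.into_iter().partition(|b| b.x1 <= at),
        (None, Some((_, at))) => boxes.into_iter().partition(|b| b.y1 <= at),
        // No usable gap: treat the group as one block and sort it line by line.
        (None, None) => {
            let mut leaf = boxes;
            leaf.sort_by(|a, b| a.y0.total_cmp(&b.y0).then(a.x0.total_cmp(&b.x0)));
            out.extend(leaf);
            return;
        }
    };
    xy_cut(first, min_gap, out);
    xy_cut(second, min_gap, out);
}

/// Convenience wrapper returning the boxes in reading order.
fn reading_order(boxes: Vec<Rect>, min_gap: f64) -> Vec<Rect> {
    let mut out = Vec::with_capacity(boxes.len());
    xy_cut(boxes, min_gap, &mut out);
    out
}
```

For a page with a full-width title above two columns, the first cut separates the title from the body and the second cut separates the columns, so the left column is emitted in full before the right one. Real implementations usually add noise handling and smarter thresholds; this sketch keeps only the core recursion.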

§Writing & Creation

  • PDF Generation: Fluent DocumentBuilder API for programmatic PDF creation
  • Format Conversion: Markdown → PDF, HTML → PDF, Plain Text → PDF, Image → PDF
  • Advanced Graphics: Path operations, image embedding, table generation
  • Font Embedding: Automatic font subsetting for compact output
  • Interactive Forms: Fillable forms with text fields, checkboxes, radio buttons, dropdowns
  • QR Codes & Barcodes: Code128, EAN-13, UPC-A (feature flag: barcodes)
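
As background for the EAN-13 and UPC-A symbologies listed above, the following sketch shows the standard check-digit calculation that any EAN-13 generator has to perform. The function names `ean13_check_digit` and `is_valid_ean13` are hypothetical and are not taken from the crate's barcode API.

```rust
/// Computes the EAN-13 check digit for the first 12 digits of a code.
/// Returns `None` if `payload` is not exactly 12 ASCII digits.
/// (Hypothetical helper, not a pdf_oxide API.)
fn ean13_check_digit(payload: &str) -> Option<u32> {
    if payload.len() != 12 || !payload.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let sum: u32 = payload
        .bytes()
        .enumerate()
        .map(|(i, b)| {
            let digit = u32::from(b - b'0');
            // Counting from 1, odd positions have weight 1 and even positions weight 3.
            if i % 2 == 0 { digit } else { digit * 3 }
        })
        .sum();
    // The check digit brings the weighted sum up to the next multiple of 10.
    Some((10 - sum % 10) % 10)
}

/// Returns true if `code` is 13 ASCII digits whose last digit is the correct check digit.
fn is_valid_ean13(code: &str) -> bool {
    if code.len() != 13 || !code.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    let expected = ean13_check_digit(&code[..12]);
    let actual = u32::from(code.as_bytes()[12] - b'0');
    expected == Some(actual)
}
```

A 12-digit UPC-A code can be validated with the same routine by prefixing it with a zero, since UPC-A is the subset of EAN-13 whose first digit is 0.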

§Editing

  • DOM-like API: Query and modify PDF content with strongly-typed wrappers
  • Element Modification: Find and replace text, modify images, paths, tables
  • Page Operations: Add, remove, reorder, merge, rotate, crop pages
  • Encryption: AES-256, password protection
  • Incremental Saves: Efficient appending without full rewrite

§Compliance

  • PDF/A: Validation and conversion
  • PDF/UA: Accessibility checks
  • PDF/X: Print production validation

§Quick Start - Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::pipeline::converters::MarkdownOutputConverter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a PDF
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract the spans of page 0, then put them in reading order (multi-column support)
    let spans = doc.extract_spans(0)?;
    let config = TextPipelineConfig::default();
    let pipeline = TextPipeline::with_config(config.clone());
    let ordered_spans = pipeline.process(spans, Default::default())?;

    // Convert the ordered spans to Markdown and print the result
    let converter = MarkdownOutputConverter::new();
    let markdown = converter.convert(&ordered_spans, &config)?;
    println!("{}", markdown);
    Ok(())
}

§Quick Start - Python

from pdf_oxide import PdfDocument

# Open and extract with automatic reading order
doc = PdfDocument("paper.pdf")
markdown = doc.to_markdown(0)
print(markdown)

§License

Licensed under either of:

at your option.

Re-exports§

pub use pipeline::XYCutStrategy;
pub use annotation_types::AnnotationBorderStyle;
pub use annotation_types::AnnotationColor;
pub use annotation_types::AnnotationFlags;
pub use annotation_types::AnnotationSubtype;
pub use annotation_types::BorderEffectStyle;
pub use annotation_types::BorderStyleType;
pub use annotation_types::CaretSymbol;
pub use annotation_types::FileAttachmentIcon;
pub use annotation_types::FreeTextIntent;
pub use annotation_types::HighlightMode;
pub use annotation_types::LineEndingStyle;
pub use annotation_types::QuadPoint;
pub use annotation_types::ReplyType;
pub use annotation_types::StampType;
pub use annotation_types::TextAlignment;
pub use annotation_types::TextAnnotationIcon;
pub use annotation_types::TextMarkupType;
pub use annotation_types::WidgetFieldType;
pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::DocumentType;
pub use config::ExtractionProfile;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;

Modules§

annotation_types
Core annotation types and enums per PDF spec ISO 32000-1:2008, Section 12.5.
annotations
PDF annotations support.
api
High-level PDF API for simple document creation and manipulation.
compliance
PDF compliance validation and conversion module.
config
Configuration module for PDF text extraction.
content
PDF content stream parsing and execution.
converters
Format converters for PDF documents.
decoders
Stream decoder implementations for PDF filters.
document
PDF document model.
editor
PDF editing module for modifying existing PDF documents.
elements
Content elements for PDF generation. Unified content elements for read/write operations.
encryption
PDF encryption support.
error
Error types for the PDF library.
extractors
Text and content extraction from PDF documents.
fdf
Forms Data Format (FDF) support for exporting/importing form field values.
fonts
Font handling and encoding.
geometry
Geometric primitives for layout analysis.
hybrid
Hybrid classical + ML architecture.
layout
Layout analysis algorithms for PDF documents.
lexer
PDF lexer (tokenizer).
object
PDF object types.
objstm
Object stream parsing (PDF 1.5+).
outline
PDF document outline (bookmarks) support.
parser
PDF object parser.
parser_config
Parser configuration options.
pipeline
PDF text extraction pipeline with clean abstraction layers.
search
Text search functionality for PDF documents.
structure
PDF logical structure (Tagged PDF) support.
text
Text processing and analysis module.
writer
PDF writing module for generating PDF files.
xfa
XFA (XML Forms Architecture) support.
xref
Cross-reference table parser.
xref_reconstruction
Cross-reference table reconstruction for damaged PDFs.
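
The idea behind cross-reference reconstruction can be sketched generically: when the xref table of a file is unusable, a parser can scan the raw bytes for `<number> <generation> obj` headers and rebuild the object-to-offset map from them. The code below is a simplified illustration of that technique under stated assumptions, not the contents of the `xref_reconstruction` module. The names `rebuild_xref` and `parse_obj_header` are hypothetical, and the sketch only recognizes headers that sit alone on a line.

```rust
use std::collections::BTreeMap;

/// Parses a line of the form `<number> <generation> obj`.
/// Returns `None` for anything else, including non-UTF-8 stream data.
fn parse_obj_header(line: &[u8]) -> Option<(u32, u16)> {
    let text = std::str::from_utf8(line).ok()?;
    let mut parts = text.split_ascii_whitespace();
    let number: u32 = parts.next()?.parse().ok()?;
    let generation: u16 = parts.next()?.parse().ok()?;
    let keyword = parts.next()?;
    // Reject lines with trailing tokens such as `1 0 obj <<`.
    (keyword == "obj" && parts.next().is_none()).then_some((number, generation))
}

/// Scans `data` line by line and maps each object number to
/// `(generation, byte offset of the line holding its header)`.
/// (Hypothetical helper, not a pdf_oxide API.)
fn rebuild_xref(data: &[u8]) -> BTreeMap<u32, (u16, usize)> {
    let mut table = BTreeMap::new();
    let mut offset = 0;
    for line in data.split(|&b| b == b'\n' || b == b'\r') {
        if let Some((number, generation)) = parse_obj_header(line) {
            // A later definition replaces an earlier one, which matches how
            // incremental updates append newer versions of an object.
            table.insert(number, (generation, offset));
        }
        offset += line.len() + 1; // +1 for the line terminator consumed by `split`
    }
    table
}
```

A production-grade recovery pass would also need to handle headers that share a line with other tokens, objects stored inside object streams, and byte sequences inside binary streams that merely look like headers.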

Macros§

extract_log_debug
Log a DEBUG level message.
extract_log_error
Log an ERROR level message.
extract_log_info
Log an INFO level message.
extract_log_trace
Log a TRACE level message.
extract_log_warn
Log a WARN level message.

Constants§

NAME
Library name
VERSION
Library version