Crate pdf_oxide

§PDF Oxide

The fastest PDF library for Python and Rust. 0.8ms mean text extraction — 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber. 100% pass rate on 3,830 real-world PDFs. MIT licensed. A drop-in PyMuPDF alternative with no AGPL restrictions.

§Performance (v0.3.9)

Benchmarked against 18 libraries on 3,830 PDFs from 3 public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Single-thread, 60s timeout, no warm-up.

§Python PDF Libraries

Library	Mean	Pass Rate	License
pdf_oxide	0.8ms	100%	MIT
unstructured	478.4ms	99.6%	Apache-2.0
PyMuPDF	4.6ms	99.3%	AGPL-3.0
pypdfium2	4.1ms	99.2%	Apache-2.0
kreuzberg	7.2ms	99.1%	MIT
pymupdf4llm	55.5ms	99.1%	AGPL-3.0
pdftext	7.3ms	99.0%	GPL-3.0
extractous	112.0ms	98.9%	Apache-2.0
pdfminer	16.8ms	98.8%	MIT
pdfplumber	23.2ms	98.8%	MIT
markitdown	108.8ms	98.6%	MIT
pypdf	12.1ms	98.4%	BSD-3
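
The headline speedup figures in the summary can be checked against the Mean column above. The following is a minimal sketch, with the means copied from the table and expressed in whole microseconds so the division is exact. The `speedup` helper and the constant names are illustrative and not part of pdf_oxide.

```rust
/// Mean extraction times copied from the table above, in microseconds.
const PDF_OXIDE_US: u32 = 800;
const OTHERS_US: [(&str, u32); 3] = [
    ("PyMuPDF", 4_600),
    ("pypdf", 12_100),
    ("pdfplumber", 23_200),
];

/// How many times longer `other_us` is than `baseline_us`.
fn speedup(baseline_us: u32, other_us: u32) -> f64 {
    f64::from(other_us) / f64::from(baseline_us)
}

fn main() {
    for (name, mean_us) in OTHERS_US {
        println!("{}: {:.3}x", name, speedup(PDF_OXIDE_US, mean_us));
    }
}
```

The ratios come out at 5.75, 15.125 and 29, which the summary rounds down to 5×, 15× and 29×.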

§Rust PDF Libraries

Library	Mean	Pass Rate	Text Extraction
pdf_oxide	0.8ms	100%	Built-in
oxidize_pdf	13.5ms	99.1%	Basic
unpdf	2.8ms	95.1%	Basic
pdf_extract	4.08ms	91.5%	Basic
lopdf	0.3ms	80.2%	No built-in extraction

99.5% text quality parity vs PyMuPDF, pypdfium2, and kreuzberg across the full corpus. Full benchmark details: https://pdf.oxide.fyi/docs/performance

§Core Features

§Reading & Extraction

Text Extraction: Character, span, and page-level with font metadata and bounding boxes
Reading Order: 4 pluggable strategies (XY-Cut, Structure Tree, Geometric, Simple)
Complex Scripts: RTL (Arabic/Hebrew), CJK (Japanese/Korean/Chinese), Devanagari, Thai
Format Conversion: PDF → Markdown, HTML, PlainText
Image Extraction: Content streams, Form XObjects, inline images
Forms & Annotations: Read/write form fields, all annotation types, bookmarks
Text Search: Regex and case-insensitive search with page-level results
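
To give a feel for what an XY-Cut reading-order strategy does, here is a minimal, self-contained sketch of the general technique: recursively split the page at the widest empty gap, columns first, then rows. It is not pdf_oxide's implementation. The `Block` type, the function names, the `min_gap` threshold, and the top-left coordinate origin are all assumptions made for illustration.

```rust
/// A text block with an axis-aligned bounding box.
/// Assumption for this sketch: origin at the top-left, y grows downward.
#[derive(Debug, Clone, PartialEq)]
struct Block {
    label: &'static str,
    x0: f32,
    y0: f32,
    x1: f32,
    y1: f32,
}

fn block(label: &'static str, x0: f32, y0: f32, x1: f32, y1: f32) -> Block {
    Block { label, x0, y0, x1, y1 }
}

/// Given (start, end) projections on one axis, return the midpoint of the
/// widest empty gap that is at least `min_gap` wide, if any.
fn widest_gap(mut intervals: Vec<(f32, f32)>, min_gap: f32) -> Option<f32> {
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut reach = intervals.first()?.1; // furthest end seen so far
    let mut best: Option<(f32, f32)> = None; // (gap width, cut position)
    for &(start, end) in &intervals[1..] {
        let gap = start - reach;
        if gap >= min_gap && best.map_or(true, |(width, _)| gap > width) {
            best = Some((gap, reach + gap / 2.0));
        }
        reach = reach.max(end);
    }
    best.map(|(_, cut)| cut)
}

/// Recursively split the blocks and append them to `out` in reading order.
fn xy_cut(blocks: Vec<Block>, min_gap: f32, out: &mut Vec<Block>) {
    // Try a vertical cut first so that columns are read one after another.
    let xs: Vec<(f32, f32)> = blocks.iter().map(|b| (b.x0, b.x1)).collect();
    if let Some(cut) = widest_gap(xs, min_gap) {
        let (left, right): (Vec<Block>, Vec<Block>) =
            blocks.into_iter().partition(|b| b.x1 <= cut);
        xy_cut(left, min_gap, out);
        xy_cut(right, min_gap, out);
        return;
    }
    // Otherwise try a horizontal cut, reading the upper part first.
    let ys: Vec<(f32, f32)> = blocks.iter().map(|b| (b.y0, b.y1)).collect();
    if let Some(cut) = widest_gap(ys, min_gap) {
        let (top, bottom): (Vec<Block>, Vec<Block>) =
            blocks.into_iter().partition(|b| b.y1 <= cut);
        xy_cut(top, min_gap, out);
        xy_cut(bottom, min_gap, out);
        return;
    }
    // No usable gap: emit top-to-bottom, then left-to-right.
    let mut rest = blocks;
    rest.sort_by(|a, b| a.y0.total_cmp(&b.y0).then(a.x0.total_cmp(&b.x0)));
    out.extend(rest);
}

fn reading_order(blocks: Vec<Block>, min_gap: f32) -> Vec<Block> {
    let mut out = Vec::with_capacity(blocks.len());
    xy_cut(blocks, min_gap, &mut out);
    out
}

/// A full-width title above two columns of two lines each, in scrambled order.
fn sample_page() -> Vec<Block> {
    vec![
        block("right-1", 310.0, 100.0, 550.0, 115.0),
        block("left-2", 50.0, 120.0, 290.0, 135.0),
        block("title", 50.0, 40.0, 550.0, 60.0),
        block("left-1", 50.0, 100.0, 290.0, 115.0),
        block("right-2", 310.0, 120.0, 550.0, 135.0),
    ]
}

fn main() {
    for b in reading_order(sample_page(), 10.0) {
        println!("{}", b.label);
    }
}
```

On the sample page, the full-width title blocks any vertical cut at the top level, so the first split is horizontal; the remaining blocks then split into a left and a right column, and each column is read top to bottom.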

§Writing & Creation

PDF Generation: Fluent DocumentBuilder API for programmatic PDF creation
Format Conversion: Markdown → PDF, HTML → PDF, Plain Text → PDF, Image → PDF
Advanced Graphics: Path operations, image embedding, table generation
Font Embedding: Automatic font subsetting for compact output
Interactive Forms: Fillable forms with text fields, checkboxes, radio buttons, dropdowns
QR Codes & Barcodes: Code128, EAN-13, UPC-A (feature flag: barcodes)
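
EAN-13, one of the barcode symbologies listed above, ends in a check digit computed from the first twelve digits. The sketch below shows the standard weighting rule (weights 1 and 3 alternating from the left). It is independent of pdf_oxide, and the function name `ean13_check_digit` is hypothetical.

```rust
/// Compute the EAN-13 check digit for the first 12 digits of a code.
/// Returns `None` if the input is not exactly 12 ASCII digits.
fn ean13_check_digit(first12: &str) -> Option<u32> {
    if first12.len() != 12 || !first12.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let sum: u32 = first12
        .bytes()
        .enumerate()
        .map(|(i, b)| {
            let digit = u32::from(b - b'0');
            // Counting from 1 at the left: odd positions weigh 1, even positions weigh 3.
            if i % 2 == 0 { digit } else { digit * 3 }
        })
        .sum();
    // The check digit brings the weighted sum up to the next multiple of 10.
    Some((10 - sum % 10) % 10)
}

fn main() {
    let payload = "400638133393";
    if let Some(check) = ean13_check_digit(payload) {
        println!("{}{}", payload, check);
    }
}
```

For the payload `400638133393` the weighted sum is 89, so the check digit is 1.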

§Editing

DOM-like API: Query and modify PDF content with strongly-typed wrappers
Element Modification: Find and replace text, modify images, paths, tables
Page Operations: Add, remove, reorder, merge, rotate, crop pages
Encryption: AES-256, password protection
Incremental Saves: Efficient appending without full rewrite
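
Of the page operations listed above, rotation has a simple invariant worth spelling out: the PDF specification requires a page's `Rotate` entry to be a multiple of 90 degrees. The following is a minimal sketch of normalising a rotation request under that rule. `apply_rotation` is an illustrative helper, not a pdf_oxide function.

```rust
/// Combine a page's current rotation with a requested change, in degrees.
/// Returns the new rotation in the range 0..360, or `None` if either value
/// is not a multiple of 90.
fn apply_rotation(current: i32, delta: i32) -> Option<i32> {
    if current % 90 != 0 || delta % 90 != 0 {
        return None;
    }
    // Reduce each operand first so the addition cannot overflow.
    Some((current.rem_euclid(360) + delta.rem_euclid(360)) % 360)
}

fn main() {
    // Rotating a 270-degree page a further quarter turn clockwise.
    println!("{:?}", apply_rotation(270, 90));
    // A counter-clockwise quarter turn on an unrotated page.
    println!("{:?}", apply_rotation(0, -90));
}
```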

§Compliance

PDF/A: Validation and conversion
PDF/UA: Accessibility checks
PDF/X: Print production validation

§Quick Start - Rust

use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::pipeline::converters::MarkdownOutputConverter;

// The `?` operator needs a function that returns a Result.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a PDF
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text with reading order (multi-column support)
    let spans = doc.extract_spans(0)?;
    let config = TextPipelineConfig::default();
    let pipeline = TextPipeline::with_config(config.clone());
    let ordered_spans = pipeline.process(spans, Default::default())?;

    // Convert to Markdown
    let converter = MarkdownOutputConverter::new();
    let markdown = converter.convert(&ordered_spans, &config)?;
    println!("{}", markdown);
    Ok(())
}

§Quick Start - Python

from pdf_oxide import PdfDocument

# Open and extract with automatic reading order
doc = PdfDocument("paper.pdf")
markdown = doc.to_markdown(0)
print(markdown)

§License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Re-exports§

pub use pipeline::XYCutStrategy;
pub use annotation_types::AnnotationBorderStyle;
pub use annotation_types::AnnotationColor;
pub use annotation_types::AnnotationFlags;
pub use annotation_types::AnnotationSubtype;
pub use annotation_types::BorderEffectStyle;
pub use annotation_types::BorderStyleType;
pub use annotation_types::CaretSymbol;
pub use annotation_types::FileAttachmentIcon;
pub use annotation_types::FreeTextIntent;
pub use annotation_types::HighlightMode;
pub use annotation_types::LineEndingStyle;
pub use annotation_types::QuadPoint;
pub use annotation_types::ReplyType;
pub use annotation_types::StampType;
pub use annotation_types::TextAlignment;
pub use annotation_types::TextAnnotationIcon;
pub use annotation_types::TextMarkupType;
pub use annotation_types::WidgetFieldType;
pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::DocumentType;
pub use config::ExtractionProfile;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;

Modules§

annotation_types: Core annotation types and enums per PDF spec ISO 32000-1:2008, Section 12.5.
annotations: PDF annotations support.
api: High-level PDF API for simple document creation and manipulation.
compliance: PDF compliance validation and conversion module.
config: Configuration module for PDF text extraction.
content: PDF content stream parsing and execution.
converters: Format converters for PDF documents.
decoders: Stream decoder implementations for PDF filters.
document: PDF document model.
editor: PDF editing module for modifying existing PDF documents.
elements: Unified content elements for PDF generation and read/write operations.
encryption: PDF encryption support.
error: Error types for the PDF library.
extractors: Text and content extraction from PDF documents.
fdf: Forms Data Format (FDF) support for exporting/importing form field values.
fonts: Font handling and encoding.
geometry: Geometric primitives for layout analysis.
hybrid: Hybrid classical + ML architecture.
layout: Layout analysis algorithms for PDF documents.
lexer: PDF lexer (tokenizer).
object: PDF object types.
objstm: Object stream parsing (PDF 1.5+).
outline: PDF document outline (bookmarks) support.
parser: PDF object parser.
parser_config: Parser configuration options.
pipeline: PDF text extraction pipeline with clean abstraction layers.
search: Text search functionality for PDF documents.
structure: PDF logical structure (Tagged PDF) support.
text: Text processing and analysis module.
writer: PDF writing module for generating PDF files.
xfa: XFA (XML Forms Architecture) support.
xref: Cross-reference table parser.
xref_reconstruction: Cross-reference table reconstruction for damaged PDFs.
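
For orientation on what the `xref` module listed above deals with: in a classic cross-reference table, each entry is a fixed-width line holding a 10-digit number, a 5-digit generation number, and a keyword `n` (in use) or `f` (free). The sketch below parses one such entry using only the standard library. The `XrefEntry` type and `parse_xref_entry` function are hypothetical names for illustration and do not reflect pdf_oxide's actual API.

```rust
/// One entry of a classic PDF cross-reference table.
#[derive(Debug, PartialEq)]
enum XrefEntry {
    /// Object stored at `offset` bytes from the start of the file.
    InUse { offset: u64, generation: u16 },
    /// Free object; the first field is the next free object number.
    Free { next_free: u64, generation: u16 },
}

fn all_digits(s: &str) -> bool {
    s.bytes().all(|b| b.is_ascii_digit())
}

/// Parse a line such as "0000000017 00000 n".
/// Returns `None` for anything that does not match the fixed-width layout.
fn parse_xref_entry(line: &str) -> Option<XrefEntry> {
    let mut parts = line.split_whitespace();
    let first = parts.next()?;
    let second = parts.next()?;
    let kind = parts.next()?;
    if parts.next().is_some() || first.len() != 10 || second.len() != 5 {
        return None;
    }
    if !all_digits(first) || !all_digits(second) {
        return None;
    }
    let number: u64 = first.parse().ok()?;
    // Generation numbers above 65535 are rejected by the u16 parse.
    let generation: u16 = second.parse().ok()?;
    match kind {
        "n" => Some(XrefEntry::InUse { offset: number, generation }),
        "f" => Some(XrefEntry::Free { next_free: number, generation }),
        _ => None,
    }
}

fn main() {
    for line in ["0000000000 65535 f", "0000000017 00000 n"] {
        println!("{:?}", parse_xref_entry(line));
    }
}
```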

Macros§

extract_log_debug: Log a DEBUG level message.
extract_log_error: Log an ERROR level message.
extract_log_info: Log an INFO level message.
extract_log_trace: Log a TRACE level message.
extract_log_warn: Log a WARN level message.

Constants§

NAME: Library name
VERSION: Library version
