§PDF Oxide
The fastest PDF library for Python and Rust. 0.8ms mean text extraction — 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber. 100% pass rate on 3,830 real-world PDFs. MIT licensed. A drop-in PyMuPDF alternative with no AGPL restrictions.
§Performance (v0.3.9)
Benchmarked against 18 libraries on 3,830 PDFs from 3 public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Single-thread, 60s timeout, no warm-up.
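To make the methodology concrete, here is a minimal sketch of how a mean time and pass rate could be computed under those conditions (one single-threaded run per file, no warm-up, 60 s limit). It is not the project's actual harness: `benchmark`, `extract`, and `timeout_s` are hypothetical names. The timeout is checked after the call returns rather than enforced preemptively, and averaging only over passing files is an assumption.

```python
import time
from statistics import mean


def benchmark(extract, paths, timeout_s=60.0):
    """Time one extraction per file, sequentially, with no warm-up run.

    `extract` is any callable that takes a path (hypothetical stand-in for a
    library's text-extraction call). A raised exception or a run longer than
    `timeout_s` counts as a failure.
    """
    timings_ms = []
    passed = 0
    for path in paths:
        start = time.perf_counter()
        try:
            extract(path)
        except Exception:
            continue  # crash or parse error: counted as a failure
        elapsed = time.perf_counter() - start
        if elapsed > timeout_s:
            continue  # too slow: counted as a failure (checked after the fact)
        passed += 1
        timings_ms.append(elapsed * 1000.0)
    return {
        # Assumption: the mean is taken over passing files only.
        "mean_ms": mean(timings_ms) if timings_ms else None,
        "pass_rate": passed / len(paths) if paths else 0.0,
    }
```

Any extraction function can be passed as `extract`, so the same loop can be pointed at different libraries for a like-for-like comparison.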
§Python PDF Libraries
| Library | Mean | Pass Rate | License |
|---|---|---|---|
| pdf_oxide | 0.8ms | 100% | MIT |
| unstructured | 478.4ms | 99.6% | Apache-2.0 |
| PyMuPDF | 4.6ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 99.2% | Apache-2.0 |
| kreuzberg | 7.2ms | 99.1% | MIT |
| pymupdf4llm | 55.5ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 99.0% | GPL-3.0 |
| extractous | 112.0ms | 98.9% | Apache-2.0 |
| pdfminer | 16.8ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 98.8% | MIT |
| markitdown | 108.8ms | 98.6% | MIT |
| pypdf | 12.1ms | 98.4% | BSD-3 |
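The headline speedup figures can be checked directly from the table above. The short sketch below divides each library's mean by the pdf_oxide mean. The ratios come out to about 5.75, 15.1, and 29, which matches the quoted 5×, 15×, and 29× once rounded down. The names `mean_ms` and `speedup` are illustrative only.

```python
# Mean extraction times in milliseconds, copied from the table above.
mean_ms = {
    "pdf_oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def speedup(library, baseline="pdf_oxide"):
    """Return how many times slower `library` is than `baseline`."""
    return mean_ms[library] / mean_ms[baseline]


for name in ("PyMuPDF", "pypdf", "pdfplumber"):
    print(f"{name}: {speedup(name):.2f}x the pdf_oxide mean")
```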
§Rust PDF Libraries
| Library | Mean | Pass Rate | Text Extraction |
|---|---|---|---|
| pdf_oxide | 0.8ms | 100% | Built-in |
| oxidize_pdf | 13.5ms | 99.1% | Basic |
| unpdf | 2.8ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 91.5% | Basic |
| lopdf | 0.3ms | 80.2% | No built-in extraction |
99.5% text quality parity vs PyMuPDF, pypdfium2, and kreuzberg across the full corpus. Full benchmark details: https://pdf.oxide.fyi/docs/performance
§Core Features
§Reading & Extraction
- Text Extraction: Character, span, and page-level with font metadata and bounding boxes
- Reading Order: 4 pluggable strategies (XY-Cut, Structure Tree, Geometric, Simple)
- Complex Scripts: RTL (Arabic/Hebrew), CJK (Japanese/Korean/Chinese), Devanagari, Thai
- Format Conversion: PDF → Markdown, HTML, PlainText
- Image Extraction: Content streams, Form XObjects, inline images
- Forms & Annotations: Read/write form fields, all annotation types, bookmarks
- Text Search: Regex and case-insensitive search with page-level results
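To illustrate what an XY-Cut reading-order strategy does, here is a simplified, self-contained sketch of the general technique. It is not pdf_oxide's `XYCutStrategy`. The names `xy_cut`, `_split`, and `min_gap` are hypothetical, and boxes are assumed to be `(x0, y0, x1, y1, text)` tuples with the origin at the top-left (y grows downward). The idea is to split the page recursively at whitespace gaps: first into columns, then into rows.

```python
def _split(boxes, axis, min_gap):
    """Split boxes at the widest whitespace gap along one axis.

    axis=0 looks for a vertical cut (left/right), axis=1 for a horizontal
    cut (top/bottom). Returns (first, second) or None if no gap >= min_gap.
    """
    lo, hi = axis, axis + 2  # indices of the start and end coordinate
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, best_index = min_gap, None
    reach = ordered[0][hi]  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap >= best_gap:
            best_gap, best_index = gap, i
        reach = max(reach, ordered[i][hi])
    if best_index is None:
        return None
    return ordered[:best_index], ordered[best_index:]


def xy_cut(boxes, min_gap=10.0):
    """Return boxes in reading order using a recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # Prefer a column split; fall back to a row split.
    parts = _split(boxes, 0, min_gap) or _split(boxes, 1, min_gap)
    if parts is None:
        # No usable gap: order top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    first, second = parts
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# A full-width title above two columns, given in scrambled order.
page = [
    (310, 130, 550, 150, "R2"),
    (50, 100, 290, 120, "L1"),
    (50, 50, 550, 80, "Title"),
    (310, 100, 550, 120, "R1"),
    (50, 130, 290, 150, "L2"),
]
print([box[4] for box in xy_cut(page)])  # ['Title', 'L1', 'L2', 'R1', 'R2']
```

Trying the column split first keeps each column together on a two-column page. The full-width title blocks that split at the top level, so it is separated by a row cut before the columns are found.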
§Writing & Creation
- PDF Generation: Fluent DocumentBuilder API for programmatic PDF creation
- Format Conversion: Markdown → PDF, HTML → PDF, Plain Text → PDF, Image → PDF
- Advanced Graphics: Path operations, image embedding, table generation
- Font Embedding: Automatic font subsetting for compact output
- Interactive Forms: Fillable forms with text fields, checkboxes, radio buttons, dropdowns
- QR Codes & Barcodes: Code128, EAN-13, UPC-A (feature flag: barcodes)
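As background on one of the barcode formats listed above, an EAN-13 code ends in a check digit derived from its first 12 digits. The sketch below shows the standard calculation in plain Python. It does not use or describe pdf_oxide's barcode API, and the function name `ean13_check_digit` is hypothetical.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit for the first 12 digits of a code."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    # Digits in odd positions (1st, 3rd, ...) weigh 1; even positions weigh 3.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


print(ean13_check_digit("400638133393"))  # 1, giving the full code 4006381333931
```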
§Editing
- DOM-like API: Query and modify PDF content with strongly-typed wrappers
- Element Modification: Find and replace text, modify images, paths, tables
- Page Operations: Add, remove, reorder, merge, rotate, crop pages
- Encryption: AES-256, password protection
- Incremental Saves: Efficient appending without full rewrite
§Compliance
- PDF/A: Validation and conversion
- PDF/UA: Accessibility checks
- PDF/X: Print production validation
§Quick Start - Rust
use pdf_oxide::PdfDocument;
use pdf_oxide::pipeline::{TextPipeline, TextPipelineConfig};
use pdf_oxide::pipeline::converters::MarkdownOutputConverter;
// Open a PDF
let mut doc = PdfDocument::open("paper.pdf")?;
// Extract text with reading order (multi-column support)
let spans = doc.extract_spans(0)?;
let config = TextPipelineConfig::default();
let pipeline = TextPipeline::with_config(config.clone());
let ordered_spans = pipeline.process(spans, Default::default())?;
// Convert to Markdown
let converter = MarkdownOutputConverter::new();
let markdown = converter.convert(&ordered_spans, &config)?;
println!("{}", markdown);
§Quick Start - Python
from pdf_oxide import PdfDocument
# Open and extract with automatic reading order
doc = PdfDocument("paper.pdf")
markdown = doc.to_markdown(0)
print(markdown)
§License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
Re-exports§
pub use pipeline::XYCutStrategy;
pub use annotation_types::AnnotationBorderStyle;
pub use annotation_types::AnnotationColor;
pub use annotation_types::AnnotationFlags;
pub use annotation_types::AnnotationSubtype;
pub use annotation_types::BorderEffectStyle;
pub use annotation_types::BorderStyleType;
pub use annotation_types::CaretSymbol;
pub use annotation_types::FileAttachmentIcon;
pub use annotation_types::FreeTextIntent;
pub use annotation_types::HighlightMode;
pub use annotation_types::LineEndingStyle;
pub use annotation_types::QuadPoint;
pub use annotation_types::ReplyType;
pub use annotation_types::StampType;
pub use annotation_types::TextAlignment;
pub use annotation_types::TextAnnotationIcon;
pub use annotation_types::TextMarkupType;
pub use annotation_types::WidgetFieldType;
pub use annotations::Annotation;
pub use annotations::LinkAction;
pub use annotations::LinkDestination;
pub use config::DocumentType;
pub use config::ExtractionProfile;
pub use document::ExtractedImageRef;
pub use document::ImageFormat;
pub use document::PdfDocument;
pub use error::Error;
pub use error::Result;
pub use outline::Destination;
pub use outline::OutlineItem;
Modules§
- annotation_types
- Core annotation types and enums per PDF spec ISO 32000-1:2008, Section 12.5.
- annotations
- PDF annotations support.
- api
- High-level PDF API for simple document creation and manipulation.
- compliance
- PDF compliance validation and conversion module.
- config
- Configuration module for PDF text extraction.
- content
- PDF content stream parsing and execution.
- converters
- Format converters for PDF documents.
- decoders
- Stream decoder implementations for PDF filters.
- document
- PDF document model.
- editor
- PDF editing module for modifying existing PDF documents.
- elements
- Unified content elements for PDF generation and read/write operations.
- encryption
- PDF encryption support.
- error
- Error types for the PDF library.
- extractors
- Text and content extraction from PDF documents.
- fdf
- Forms Data Format (FDF) support for exporting/importing form field values.
- fonts
- Font handling and encoding.
- geometry
- Geometric primitives for layout analysis.
- hybrid
- Hybrid classical + ML architecture.
- layout
- Layout analysis algorithms for PDF documents.
- lexer
- PDF lexer (tokenizer).
- object
- PDF object types.
- objstm
- Object stream parsing (PDF 1.5+).
- outline
- PDF document outline (bookmarks) support.
- parser
- PDF object parser.
- parser_config
- Parser configuration options.
- pipeline
- PDF text extraction pipeline with clean abstraction layers.
- search
- Text search functionality for PDF documents.
- structure
- PDF logical structure (Tagged PDF) support.
- text
- Text processing and analysis module.
- writer
- PDF writing module for generating PDF files.
- xfa
- XFA (XML Forms Architecture) support.
- xref
- Cross-reference table parser.
- xref_reconstruction
- Cross-reference table reconstruction for damaged PDFs.
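For readers unfamiliar with cross-reference reconstruction, a common recovery approach is to ignore the damaged table and scan the raw bytes for `N G obj` headers, recording where each object starts. The sketch below shows that general idea only. It is not the `xref_reconstruction` module's implementation, and `OBJ_HEADER` and `scan_object_offsets` are hypothetical names. A real implementation would also need to avoid false matches inside stream data.

```python
import re

# Matches an indirect object header such as b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")


def scan_object_offsets(data: bytes) -> dict:
    """Map (object number, generation) to the byte offset of its header."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        # A later definition replaces an earlier one, mirroring how
        # incremental updates append newer versions of an object.
        offsets[key] = match.start()
    return offsets


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
for (number, generation), offset in sorted(scan_object_offsets(sample).items()):
    print(number, generation, offset)
```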
Macros§
- extract_log_debug
- Log a DEBUG level message.
- extract_log_error
- Log an ERROR level message.
- extract_log_info
- Log an INFO level message.
- extract_log_trace
- Log a TRACE level message.
- extract_log_warn
- Log a WARN level message.