pdf_oxide 0.3.7

The Complete PDF Toolkit: extract, create, and edit PDFs. Rust core with bindings for Python, Node, WASM, Go, and more.

PDF Oxide - The Fastest PDF Library for Python and Rust

The fastest Python PDF library for text extraction, image extraction, and markdown conversion. Built on a Rust core for reliability and speed — mean 2.1ms per document, 100% pass rate on 3,830 real-world PDFs.

Quick Start

Python

from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)

Install from PyPI:

pip install pdf_oxide

Rust

use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;

Add to Cargo.toml:

[dependencies]
pdf_oxide = "0.3"

Why pdf_oxide?

  • Fast — Rust core, mean 2.1ms per document, p99 = 18ms, 98.4% under 10ms
  • Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero slow (>5s) PDFs
  • Complete — Text extraction, image extraction, PDF creation, and editing in one library
  • Dual-language — First-class Rust API and Python bindings via PyO3
  • Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects

Features

| Extract | Create | Edit |
| --- | --- | --- |
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |

Python API

from pdf_oxide import PdfDocument

doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count}")
print(f"Version: {doc.version}")

# Extract text from each page
for i in range(doc.page_count):
    text = doc.extract_text(i)
    print(f"Page {i}: {len(text)} chars")

# Character-level extraction with positions
chars = doc.extract_chars(0)
for ch in chars:
    print(f"'{ch.char}' at ({ch.x:.1f}, {ch.y:.1f})")

# Password-protected PDFs
doc = PdfDocument("encrypted.pdf")
doc.authenticate("password")
text = doc.extract_text(0)
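
The character positions returned by extract_chars can be regrouped into text lines. The following is a minimal sketch and not part of pdf_oxide: group_chars_into_lines and y_tolerance are hypothetical names, and the code assumes only the char, x, and y attributes shown above, with y increasing up the page as in default PDF user space.

```python
def group_chars_into_lines(chars, y_tolerance=2.0):
    """Group positioned characters into lines of text, top to bottom.

    `chars` is any iterable of objects with `char`, `x`, and `y` attributes.
    Assumption: characters on one line share (almost) the same y, and a
    larger y means higher on the page.
    """
    lines = []  # each entry: (y of the first character seen, [characters])
    for ch in sorted(chars, key=lambda c: -c.y):  # highest y first
        if lines and abs(lines[-1][0] - ch.y) <= y_tolerance:
            lines[-1][1].append(ch)  # close enough: same line
        else:
            lines.append((ch.y, [ch]))  # start a new line
    # Within each line, order characters left to right before joining.
    return [
        "".join(c.char for c in sorted(members, key=lambda c: c.x))
        for _, members in lines
    ]
```

With a real document this would be called as group_chars_into_lines(doc.extract_chars(0)). The default tolerance of 2.0 units is a guess; small fonts or superscripts may need a different value.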

Rust API

use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}

Performance

Verified against 3,830 PDFs from three independent test suites:

| Corpus | PDFs | Pass Rate |
| --- | --- | --- |
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 100% |

| Metric | v0.3.5 | v0.3.6 |
| --- | --- | --- |
| Mean latency | 23.3ms | 2.1ms (-91%) |
| p50 latency | 0.6ms | 0.6ms |
| p90 latency | 3.0ms | 2.6ms (-13%) |
| p99 latency | 33ms | 18ms (-46%) |
| Max latency | 68,722ms | 625ms (-99%) |
| Under 10ms | 97.4% | 98.4% |
| Slow (>5s) | 2 | 0 |
| Timeouts | 0 | 0 |
| Panics | 0 | 0 |

v0.3.6 eliminated two O(n) bottlenecks: page tree traversal (168× faster on 10,000-page PDFs) and xref miss scanning (146× faster on tagged PDFs). The pass rate is 100% on all valid PDFs: the 7 non-passing files, which account for the 99.2% figure on the Mozilla pdf.js corpus, are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
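
Summary figures like the ones reported here can be derived from raw per-document timings. The sketch below is illustrative only: summarize_latencies and its parameters are hypothetical names, and the nearest-rank percentile is an assumption, since the README does not say which percentile definition the benchmark uses.

```python
import math
from statistics import mean


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms, fast_ms=10.0, slow_ms=5000.0):
    """Summarize per-document latencies given in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    values = sorted(latencies_ms)
    n = len(values)
    return {
        "mean": mean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "max": values[-1],
        # Share of documents that finished in under `fast_ms`, as a percentage.
        "under_fast_pct": 100 * sum(v < fast_ms for v in values) / n,
        # Number of documents that took longer than `slow_ms`.
        "slow_count": sum(v > slow_ms for v in values),
    }
```

A single very slow document moves the mean far more than the median, which is consistent with the mean dropping sharply between versions while p50 stays at 0.6ms.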

Installation

Python

pip install pdf_oxide

Prebuilt wheels are available for Linux, macOS, and Windows, covering Python 3.8–3.14.

Rust

[dependencies]
pdf_oxide = "0.3"

Building from Source

# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

Use Cases

  • RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
  • Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
  • Data extraction — Pull structured data from forms, tables, and layouts
  • Academic research — Parse papers, extract citations, and process large corpora
  • PDF generation — Create invoices, reports, certificates, and templated documents programmatically
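
For the RAG use case above, a pipeline typically converts each page to Markdown and splits it into chunks before indexing. The following is a rough sketch under stated assumptions: pdf_to_chunks, split_into_chunks, and max_chars are hypothetical names, and the only library calls assumed are page_count and to_markdown(i, detect_headings=True) as shown in the Quick Start.

```python
def split_into_chunks(markdown, max_chars=1000):
    """Pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        para = para.strip()
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate  # paragraph still fits (or is the first one)
        else:
            chunks.append(current)  # flush the full chunk, start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks


def pdf_to_chunks(doc, max_chars=1000):
    """Return a list of {"page", "text"} dicts for every page of `doc`.

    `doc` is any object exposing `page_count` and `to_markdown`, such as a
    pdf_oxide PdfDocument.
    """
    records = []
    for i in range(doc.page_count):
        markdown = doc.to_markdown(i, detect_headings=True)
        for text in split_into_chunks(markdown, max_chars):
            records.append({"page": i, "text": text})
    return records
```

Usage would look like pdf_to_chunks(PdfDocument("paper.pdf")). Keeping the page index with each chunk lets a retriever cite where a passage came from; frameworks such as LangChain or LlamaIndex ship their own splitters that could replace split_into_chunks.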

License

Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings

Citation

@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust and Python},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}

Rust + Python | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | mean 2.1ms/doc | v0.3.6