PDF Oxide - The Fastest PDF Library for Python and Rust
The fastest Python PDF library for text extraction, image extraction, and Markdown conversion. Built on a Rust core for reliability and speed: mean 1.8ms per document, 3.5× faster than leading industry libraries, and a 100% pass rate on every valid PDF in a 3,830-file real-world corpus.
Quick Start
Python
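from pdf_oxide import PdfDocument
doc = PdfDocument("paper.pdf")  # placeholder path
text = doc.extract_text(0)  # page index 0 is the first page
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)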
Rust
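use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("paper.pdf")?;  // placeholder path
let text = doc.extract_text(0)?;  // page index 0 is the first page
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;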
[dependencies]
pdf_oxide = "0.3"
Why pdf_oxide?
- Fast: Rust core, mean 1.8ms per document, 3.5× faster than leading industry libraries, 98.4% of documents under 10ms
- Reliable: every valid PDF in the 3,830-file test corpus passes, with zero panics and zero slow (>5s) PDFs
- Complete: text extraction, image extraction, PDF creation, and editing in one library
- Dual-language: first-class Rust API and Python bindings via PyO3
- Permissive license: MIT / Apache-2.0, free to use in commercial and open-source projects
Features
| Extract | Create | Edit |
|---|---|---|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |
Python API
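from pdf_oxide import PdfDocument
doc = PdfDocument("report.pdf")  # placeholder path
# Extract text from each page
for i in range(doc.page_count()):
    text = doc.extract_text(i)
# Character-level extraction with positions
chars = doc.extract_chars(0)
# Password-protected PDFs
doc = PdfDocument("encrypted.pdf")  # placeholder path
doc.authenticate("password")
text = doc.extract_text(0)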
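Character-level output with positions is usually the raw material for layout reconstruction. As a minimal sketch, assuming each extracted character carries its glyph plus `x` and `y` coordinates, characters can be grouped into text lines by baseline. The `Char` record and `group_into_lines` helper below are hypothetical stand-ins for illustration, not types or functions provided by pdf_oxide.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for one extracted character (not pdf_oxide's type)."""
    char: str
    x: float
    y: float


def group_into_lines(chars, y_tolerance=2.0):
    """Group characters into text lines by similar y, top of page first.

    Assumes PDF-style coordinates, where y grows upward.
    """
    lines = []  # each entry: [reference_y, [Char, ...]]
    for ch in sorted(chars, key=lambda c: -c.y):
        if lines and abs(lines[-1][0] - ch.y) <= y_tolerance:
            lines[-1][1].append(ch)
        else:
            lines.append([ch.y, [ch]])
    # Within each line, read left to right.
    return ["".join(c.char for c in sorted(row, key=lambda c: c.x)) for _, row in lines]


sample = [
    Char("i", 15.0, 700.2), Char("H", 10.0, 700.0),
    Char("k", 15.0, 680.0), Char("o", 10.0, 680.1),
]
lines = group_into_lines(sample)
```

The tolerance absorbs small baseline jitter between glyphs on the same line. Real documents with rotated text or multiple columns would need more than this single pass.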
Rust API
use pdf_oxide::PdfDocument;
Performance
Verified against 3,830 PDFs from three independent test suites:
| Corpus | PDFs | Pass Rate |
|---|---|---|
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 99.8% (100% of valid PDFs) |

Latency and stability:

| Metric | Value |
|---|---|
| Mean latency | 1.8ms |
| p50 latency | 0.6ms |
| p90 latency | 2.6ms |
| p99 latency | 18ms |
| Max latency | 625ms |
| Under 10ms | 98.4% |
| Slow (>5s) | 0 |
| Timeouts | 0 |
| Panics | 0 |
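For readers who want to produce comparable figures on their own corpus, the following is a rough sketch of how per-document timings can be reduced to the metrics in the table. It is not the project's benchmark harness: the `summarize_latencies` helper, the nearest-rank percentile method, and the sample timings are all assumptions made for illustration.

```python
import statistics


def summarize_latencies(latencies_ms, fast_threshold_ms=10.0):
    """Reduce per-document latencies (milliseconds) to table-style metrics."""
    if not latencies_ms:
        raise ValueError("need at least one measurement")
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank percentile: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(1, (len(ordered) * p + 99) // 100)
        return ordered[rank - 1]

    under = sum(1 for value in ordered if value < fast_threshold_ms)
    return {
        "mean": statistics.mean(ordered),
        "p50": percentile(50),
        "p90": percentile(90),
        "p99": percentile(99),
        "max": ordered[-1],
        "under_threshold_pct": 100.0 * under / len(ordered),
    }


# Made-up timings, only to exercise the function.
summary = summarize_latencies([0.5, 0.6, 0.7, 1.0, 2.0, 2.5, 3.0, 4.0, 12.0, 20.0])
```

A mean well above the median, as in the table (1.8ms against 0.6ms), indicates a long tail: a small number of slow documents dominate the average, which is why the percentiles are reported alongside it.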
Every valid PDF passes. The 7 non-passing files, all in the Mozilla pdf.js corpus (890 of 897 pass, 99.2%), are intentionally broken test fixtures: missing PDF headers, fuzz-corrupted catalogs, and invalid xref streams. Counting those fixtures, the raw pass rate is 3,823 of 3,830 (99.8%). v0.3.8 adds a text-only content stream parser that skips graphics operators at the byte level, further reducing parse time on graphics-heavy pages.
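The pass-rate figures can be checked with a few lines of arithmetic. This sketch uses the corpus sizes from the table above and attributes all 7 failures to the pdf.js corpus, which follows from the other two corpora reporting 100%. The variable and function names are illustrative only.

```python
# (total PDFs, non-passing PDFs) per corpus, from the table above.
corpora = {
    "veraPDF": (2907, 0),
    "Mozilla pdf.js": (897, 7),
    "SafeDocs": (26, 0),
}


def pass_rate(total, failed):
    """Percentage of files that pass."""
    return 100.0 * (total - failed) / total


total_pdfs = sum(total for total, _ in corpora.values())
total_failed = sum(failed for _, failed in corpora.values())

pdfjs_rate = pass_rate(*corpora["Mozilla pdf.js"])
raw_rate = pass_rate(total_pdfs, total_failed)
# Excluding the intentionally broken fixtures leaves only valid PDFs, all of which pass.
valid_rate = pass_rate(total_pdfs - total_failed, 0)
```

Rounded to one decimal place, `pdfjs_rate` is 99.2 and `raw_rate` is 99.8, while `valid_rate` is exactly 100.0. The headline 100% therefore refers to valid PDFs only.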
Installation
Python
Install from PyPI with `pip install pdf_oxide`. Wheels are available for Linux, macOS, and Windows, for Python 3.8–3.14.
Rust
[dependencies]
pdf_oxide = "0.3"
Building from Source
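# Clone and build
git clone <repository-url> && cd pdf_oxide && cargo build --release
# Run tests
cargo test
# Build Python bindings (assumes maturin, the usual build tool for PyO3 projects)
maturin develop --release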
Documentation
- Getting Started (Rust) - Complete Rust guide
- Getting Started (Python) - Complete Python guide
- API Docs - Full Rust API reference
- PDF Spec Reference - ISO 32000-1:2008
Use Cases
- RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
- Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
- Data extraction — Pull structured data from forms, tables, and layouts
- Academic research — Parse papers, extract citations, and process large corpora
- PDF generation — Create invoices, reports, certificates, and templated documents programmatically
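For the RAG use case, the step after Markdown conversion is usually chunking. The sketch below shows one simple approach: split at headings and keep each heading as chunk metadata. It assumes the Markdown has already been produced (a literal string is used here so the example runs without a PDF); `chunk_markdown` is a hypothetical helper, not part of pdf_oxide, LangChain, or LlamaIndex.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown, max_chars=1000):
    """Split Markdown at headings, then by size, keeping the heading as metadata."""
    chunks = []
    title = ""
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        # Hard-split oversized sections so every chunk fits the size budget.
        for start in range(0, len(text), max_chars):
            chunks.append({"heading": title, "text": text[start:start + max_chars]})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading
            title = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks


sample = (
    "# Intro\nPDF parsing is hard.\n\n"
    "## Method\nWe parse content streams.\nThen we rebuild layout.\n"
)
chunks = chunk_markdown(sample)
```

Carrying the heading with each chunk gives the retriever some context about where a passage came from. A production pipeline would more likely split on sentence or token boundaries than on a raw character count.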
License
Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings
Citation
Rust + Python | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | mean 1.8ms/doc | v0.3.8