PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, WASM, CLI & AI
More language bindings coming in May 2026. Java, Ruby, PHP, Swift, and Kotlin are on the roadmap. Want another language? Open an issue and tell us.
The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. MIT licensed.
New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET, alongside the existing Python, Rust, and WASM bindings. Same Rust core, same 0.8 ms extraction speed, same 100% pass rate. See the language guides: Python · Go · JavaScript / TypeScript · C# / .NET · WASM
Quick Start
Python
from pdf_oxide import PdfDocument

# path can be str or pathlib.Path; use `with` for scoped access
doc = PdfDocument("paper.pdf")
# or: with PdfDocument("paper.pdf") as doc: ...
=
=
=
Rust
use pdf_oxide::PdfDocument;
let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text?;
let images = doc.extract_images?;
let markdown = doc.to_markdown?;
[dependencies]
pdf_oxide = "0.3"
CLI
MCP Server (for AI assistants)
# Install
# Configure in Claude Desktop / Claude Code / Cursor
{
}
Why pdf_oxide?
- Fast — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
- Reliable — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
- Complete — Text extraction, image extraction, PDF creation, and editing in one library
- Multi-platform — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, WASM, CLI, and MCP server for AI assistants
- Permissive license — MIT / Apache-2.0 — use freely in commercial and open-source projects
Performance
Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.
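The three summary columns used in the tables (mean, p99, pass rate) can be reproduced with a small harness. The following is a minimal sketch under stated assumptions, not the project's actual benchmark code: `extract` is a hypothetical stand-in for whichever library call is being timed, a document passes if the call returns without raising, and the timeout is applied after the fact instead of interrupting the call.

```python
import math
import time
from typing import Callable, Iterable


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list sorted in ascending order."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def benchmark(extract: Callable[[str], object], paths: Iterable[str],
              timeout_s: float = 60.0) -> dict:
    """Time `extract` on each path: single thread, no warm-up run."""
    timings_ms: list[float] = []
    total = 0
    for path in paths:
        total += 1
        start = time.perf_counter()
        try:
            extract(path)  # hypothetical extractor; any callable works
        except Exception:
            continue  # a raised error counts as a failure
        elapsed = time.perf_counter() - start
        if elapsed > timeout_s:
            continue  # too slow: counted as a failure
        timings_ms.append(elapsed * 1000)
    timings_ms.sort()
    nan = float("nan")
    return {
        "mean_ms": sum(timings_ms) / len(timings_ms) if timings_ms else nan,
        "p99_ms": percentile(timings_ms, 99) if timings_ms else nan,
        "pass_rate": len(timings_ms) / total if total else nan,
    }
```

A real harness would run each document in a subprocess so that a hang can be killed at the 60s mark; this sketch only measures elapsed time.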
Python Libraries
| Library | Mean | p99 | Pass Rate | License |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | MIT |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |
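
The headline speedups follow directly from the Mean column. As a quick check using only figures copied from the table above:

```python
# Mean extraction time per document in milliseconds, copied from the table.
mean_ms = {
    "PDF Oxide": 0.8,
    "PyMuPDF": 4.6,
    "pypdf": 12.1,
    "pdfplumber": 23.2,
}


def speedup(library: str, baseline: str = "PDF Oxide") -> float:
    """How many times slower `library` is than `baseline`, by mean time."""
    return mean_ms[library] / mean_ms[baseline]


for name in ("PyMuPDF", "pypdf", "pdfplumber"):
    print(f"{name}: {speedup(name):.2f}x")
```

The ratios come out at roughly 5.75, 15.1, and 29, which matches the rounded 5×, 15×, and 29× figures quoted elsewhere in this README.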
Rust Libraries
| Library | Mean | p99 | Pass Rate | Text Extraction |
|---|---|---|---|---|
| PDF Oxide | 0.8ms | 9ms | 100% | Built-in |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |
Text Quality
99.5% text parity with PyMuPDF and pypdfium2 across the full corpus. On "hard" files where the results differ, PDF Oxide extracts text that a competitor misses 7–10× more often than the reverse, against any competitor.
Corpus
| Suite | PDFs | Pass Rate |
|---|---|---|
| veraPDF (PDF/A compliance) | 2,907 | 100% |
| Mozilla pdf.js | 897 | 99.2% |
| SafeDocs (targeted edge cases) | 26 | 100% |
| Total | 3,830 | 100% |
100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).
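The relationship between the per-suite rates and the 100% total can be checked with a few lines of arithmetic. This sketch assumes all 7 non-passing files belong to the Mozilla pdf.js suite, which is an inference from that suite being the only one below 100%:

```python
# (suite, total PDFs, non-passing PDFs). Placing all 7 failures in the
# pdf.js suite is an assumption inferred from the table, not a stated fact.
suites = [
    ("veraPDF", 2907, 0),
    ("Mozilla pdf.js", 897, 7),
    ("SafeDocs", 26, 0),
]


def pass_rate(n: int, failed: int) -> float:
    """Pass rate in percent for n files of which `failed` did not pass."""
    return 100 * (n - failed) / n


total = sum(n for _, n, _ in suites)
failed = sum(f for _, _, f in suites)

raw_rate = pass_rate(total, failed)        # every file counted
valid_rate = pass_rate(total - failed, 0)  # broken fixtures excluded

print(total, failed, round(raw_rate, 2), valid_rate)
```

Counting every file gives about 99.82% overall; the 100% figure applies once the 7 intentionally broken fixtures are excluded.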
Features
| Extract | Create | Edit |
|---|---|---|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |
Python API
# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
=
# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
=
# 2. Word-level extraction (v0.3.14)
=
# Access individual characters in the word
# print(w.chars[0].font_name)
# Optional: override the adaptive word gap threshold (in PDF points)
=
# 3. Line-level extraction (v0.3.14)
=
# Optional: override word and/or line gap thresholds (in PDF points)
=
# Inspect the adaptive thresholds before overriding
=
# Use a pre-tuned extraction profile for specific document types
=
=
# 4. Table extraction (v0.3.14)
=
# 5. Traditional extraction
=
=
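To make the word gap threshold concrete, here is a minimal sketch of gap-based word grouping. It illustrates the general technique only and is not pdf_oxide's implementation: the `Char` type and `group_words` function are hypothetical, the characters are assumed to sit on one line sorted left to right, and the threshold is a fixed value in PDF points, not an adaptive one.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical character record: text plus horizontal extent in points."""
    text: str
    x0: float  # left edge
    x1: float  # right edge


def group_words(chars: list[Char], word_gap_threshold: float) -> list[str]:
    """Start a new word whenever the gap to the previous char exceeds the threshold."""
    words: list[str] = []
    current = ""
    prev_x1 = None
    for ch in chars:
        if prev_x1 is not None and ch.x0 - prev_x1 > word_gap_threshold:
            words.append(current)
            current = ""
        current += ch.text
        prev_x1 = ch.x1
    if current:
        words.append(current)
    return words


line = [Char("H", 0, 5), Char("i", 5, 7), Char("y", 10, 15), Char("o", 15, 20)]
print(group_words(line, word_gap_threshold=2.5))
print(group_words(line, word_gap_threshold=4.0))
```

With a 3-point gap between "i" and "y", a threshold of 2.5 splits the line into two words while a threshold of 4.0 keeps it as one, which is the kind of behavior an override lets you tune.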
Form Fields
# Extract form fields
=
# Fill and save
Rust API
use pdf_oxide::PdfDocument;
Form Fields (Rust)
use ;
use FormFieldValue;
let mut editor = open?;
editor.set_form_field_value?;
editor.save_with_options?;
Installation
Python
Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.
Rust
[dependencies]
pdf_oxide = "0.3"
JavaScript/WASM
const = require;
CLI
MCP Server
Other languages
- Go: `go get github.com/yfedoseev/pdf_oxide/go` (see go/README.md)
- JavaScript / TypeScript (Node.js): `npm install pdf-oxide` (see js/README.md)
- C# / .NET: `dotnet add package PdfOxide` (see csharp/README.md)
All three share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.
CLI
22 commands for PDF processing directly from your terminal:
Run `pdf-oxide` with no arguments for interactive REPL mode. Use `--pages 1-5` to process specific pages and `--json` for machine-readable output.
MCP Server
pdf-oxide-mcp lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the Model Context Protocol.
Add to your MCP client configuration:
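The exact entry depends on your client and on how you installed the server. As a rough sketch, assuming the `pdf-oxide-mcp` binary is on your PATH and that your client reads the `mcpServers` layout used by Claude Desktop, an entry could be generated like this. The server key `pdf-oxide` is an arbitrary label chosen for illustration.

```python
import json


def mcp_config(command: str = "pdf-oxide-mcp") -> str:
    """Build an MCP client config entry that launches the server over stdio."""
    config = {
        "mcpServers": {
            # "pdf-oxide" is a hypothetical label; any unique key works.
            "pdf-oxide": {"command": command, "args": []},
        }
    }
    return json.dumps(config, indent=2)


print(mcp_config())
```

Merge the printed object into your client's existing configuration file; do not overwrite the file, since other servers may already be registered there.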
The server exposes an extract tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.
Building from Source
# Clone and build
# Run tests
# Build Python bindings
# Build the shared library for Go, JS/TS, and C# bindings
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll
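The shared library name differs by platform, so a script that loads or packages it has to pick the right file. A minimal sketch of that selection, where `shared_library_path` is a hypothetical helper and the default directory is the build output path noted above:

```python
import sys
from pathlib import Path


def shared_library_path(target_dir: str = "target/release",
                        platform: str = sys.platform) -> Path:
    """Return the expected shared library path for the given platform."""
    if platform.startswith("win"):
        name = "pdf_oxide.dll"        # Windows: no "lib" prefix
    elif platform == "darwin":
        name = "libpdf_oxide.dylib"   # macOS
    else:
        name = "libpdf_oxide.so"      # Linux and other Unix-like systems
    return Path(target_dir) / name


print(shared_library_path())
```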
Documentation
- Full Documentation — Complete documentation site
- Getting Started (Rust) — Rust guide
- Getting Started (Python) — Python guide
- Getting Started (Go) — Go guide
- Getting Started (JavaScript / TypeScript) — Node.js guide
- Getting Started (C# / .NET) — .NET guide
- Getting Started (WASM) — Browser and Node.js WASM guide
- API Docs — Full Rust API reference
- Performance Benchmarks — Full benchmark methodology and results
Use Cases
- RAG / LLM pipelines — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
- AI assistants — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
- Document processing at scale — Extract text, images, and metadata from thousands of PDFs in seconds
- Data extraction — Pull structured data from forms, tables, and layouts
- Academic research — Parse papers, extract citations, and process large corpora
- PDF generation — Create invoices, reports, certificates, and templated documents programmatically
- PyMuPDF alternative — MIT licensed, 5× faster, no AGPL restrictions
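For the RAG use case, the Markdown output usually needs to be split into retrievable chunks before indexing. The following is a minimal sketch of heading-based chunking that takes a Markdown string from any source. `chunk_markdown` is a hypothetical helper, not part of pdf_oxide, and it does not handle `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading, keeping the heading as title."""
    chunks: list[dict] = []
    title = ""
    lines: list[str] = []

    def flush() -> None:
        # Emit the current chunk unless it has neither a title nor any text.
        text = "\n".join(lines).strip()
        if title or text:
            chunks.append({"title": title, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "intro\n# Methods\nWe parse PDFs.\n\n## Results\nIt works.\n"
for chunk in chunk_markdown(sample):
    print(chunk)
```

Each chunk can then be embedded and stored with its title as metadata; frameworks such as LangChain and LlamaIndex ship their own splitters if you prefer not to write one.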
Why I built this
I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.
If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, open an issue — I read all of them.
— Yury
License
Dual-licensed under MIT or Apache-2.0 at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Citation
Rust + Python + Go + JS/TS + C# + WASM + CLI + MCP | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders