pdf_oxide 0.3.59

The fastest Rust PDF library with text extraction: 0.8ms mean, 100% pass rate on 3,830 PDFs. 5× faster than pdf_extract, 17× faster than oxidize_pdf. Extract, create, and edit PDFs.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
# PDF Oxide - The Fastest PDF Toolkit for Python, Rust, Go, JS/TS, C#, Java, WASM, CLI & AI

> **New in v0.3.54 — text-extraction fidelity pass** (Hebrew / RTL visual-vs-logical detection, ToUnicode CMap fallback for bullet & ligature decode, multi-column prose reading order, reference-style two-column reading order). **Java is the 8th binding** (`fyi.oxide:pdf-oxide:0.3.54` on Maven Central, JDK 11+, free Kotlin interop via the same JAR). **Ruby, PHP, and Swift are next on the roadmap.** Want another language? [Open an issue]https://github.com/yfedoseev/pdf_oxide/issues/new and tell us.

The fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with bindings for Python, Go, JavaScript / TypeScript, C# / .NET, **Java (JDK 11+, Kotlin-compatible)**, and WASM, plus a CLI tool and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. MIT licensed.

[![Crates.io](https://img.shields.io/crates/v/pdf_oxide.svg)](https://crates.io/crates/pdf_oxide)
[![PyPI](https://img.shields.io/pypi/v/pdf_oxide.svg)](https://pypi.org/project/pdf_oxide/)
[![PyPI Downloads](https://img.shields.io/pypi/dm/pdf-oxide)](https://pypi.org/project/pdf-oxide/)
[![npm](https://img.shields.io/npm/v/pdf-oxide-wasm)](https://www.npmjs.com/package/pdf-oxide-wasm)
[![Documentation](https://docs.rs/pdf_oxide/badge.svg)](https://docs.rs/pdf_oxide)
[![Build Status](https://github.com/yfedoseev/pdf_oxide/workflows/CI/badge.svg)](https://github.com/yfedoseev/pdf_oxide/actions)
[![License: MIT OR Apache-2.0](https://img.shields.io/badge/License-MIT%20OR%20Apache--2.0-blue.svg)](https://opensource.org/licenses)
<!-- [![OpenSSF Scorecard](https://api.scorecard.dev/projects/github.com/yfedoseev/pdf_oxide/badge)](https://scorecard.dev/viewer/?uri=github.com/yfedoseev/pdf_oxide) -->
<!-- [![OpenSSF Best Practices](https://www.bestpractices.dev/projects/NNNN/badge)](https://www.bestpractices.dev/projects/NNNN) -->

> **New in v0.3.24 — now available in Go, JavaScript / TypeScript, and C# / .NET**, alongside the existing Python, Rust, and WASM bindings.
> Same Rust core, same 0.8 ms extraction speed, same 100% pass rate.
> See the language guides: [Python]python/README.md · [Go]go/README.md · [JavaScript / TypeScript]js/README.md · [C# / .NET]csharp/README.md · [Java / Kotlin]java/README.md · [WASM]wasm-pkg/README.md

## Quick Start

### Python
```python
from pdf_oxide import PdfDocument

# path can be str or pathlib.Path; use with for scoped access
doc = PdfDocument("paper.pdf")
# or: with PdfDocument("paper.pdf") as doc: ...
text = doc.extract_text(0)
chars = doc.extract_chars(0)
markdown = doc.to_markdown(0, detect_headings=True)
```

```bash
pip install pdf_oxide
```

### Rust
```rust
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;
let text = doc.extract_text(0)?;
let images = doc.extract_images(0)?;
let markdown = doc.to_markdown(0, Default::default())?;
```

```toml
[dependencies]
pdf_oxide = "0.3"
```

### CLI
```bash
pdf-oxide text document.pdf
pdf-oxide markdown document.pdf -o output.md
pdf-oxide search document.pdf "pattern"
pdf-oxide merge a.pdf b.pdf -o combined.pdf
```

```bash
brew install yfedoseev/tap/pdf-oxide
```

### MCP Server (for AI assistants)
```bash
# Install
brew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp

# Configure in Claude Desktop / Claude Code / Cursor
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}
```

## Why pdf_oxide?

- **Fast** — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber
- **Reliable** — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts
- **Complete** — Text extraction, image extraction, PDF creation, and editing in one library
- **Multi-platform** — Rust, Python, Go, JavaScript/TypeScript, C#/.NET, Java/Kotlin, WASM, CLI, and MCP server for AI assistants
- **Permissive license** — MIT / Apache-2.0 — use freely in commercial and open-source projects

## Performance

Benchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.

### Python Libraries

| Library | Mean | p99 | Pass Rate | License |
|---------|------|-----|-----------|---------|
| **PDF Oxide** | **0.8ms** | **9ms** | **100%** | **MIT** |
| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |
| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |
| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |
| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |
| pdfminer | 16.8ms | 124ms | 98.8% | MIT |
| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |
| markitdown | 108.8ms | 378ms | 98.6% | MIT |
| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |

### Rust Libraries

| Library | Mean | p99 | Pass Rate | Text Extraction |
|---------|------|-----|-----------|-----------------|
| **PDF Oxide** | **0.8ms** | **9ms** | **100%** | **Built-in** |
| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |
| unpdf | 2.8ms | 10ms | 95.1% | Basic |
| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |
| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |

### Text Quality

99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. PDF Oxide extracts text from 7–10× more "hard" files than it misses vs any competitor.

### Corpus

| Suite | PDFs | Pass Rate |
|-------|-----:|----------:|
| [veraPDF]https://github.com/veraPDF/veraPDF-corpus (PDF/A compliance) | 2,907 | 100% |
| [Mozilla pdf.js]https://github.com/mozilla/pdf.js/tree/master/test/pdfs | 897 | 99.2% |
| [SafeDocs]https://github.com/pdf-association/safedocs (targeted edge cases) | 26 | 100% |
| **Total** | **3,830** | **100%** |

100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).

## Features

| Extract | Create | Edit |
|---------|--------|------|
| Text & Layout | Documents | Annotations |
| Images | Tables | Form Fields |
| Forms | Graphics | Bookmarks |
| Annotations | Templates | Links |
| Bookmarks | Images | Content |

## Python API

```python
from pdf_oxide import PdfDocument

# Path can be str or pathlib.Path; use "with PdfDocument(...) as doc" for context manager
doc = PdfDocument("report.pdf")
print(f"Pages: {doc.page_count()}")
print(f"Version: {doc.version()}")

# 1. Scoped extraction (v0.3.14)
# Extract only from a specific area: (x, y, width, height)
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"{w.text} at {w.bbox}")
    # Access individual characters in the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points)
words = doc.extract_words(0, word_gap_threshold=2.5)

# 3. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Inspect the adaptive thresholds before overriding
params = doc.page_layout_params(0)
print(f"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}")

# Use a pre-tuned extraction profile for specific document types
from pdf_oxide import ExtractionProfile
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# 4. Table extraction (v0.3.14)
tables = doc.extract_tables(0)
for table in tables:
    print(f"Table with {table.row_count} rows")

# 5. Traditional extraction
text = doc.extract_text(0)
chars = doc.extract_chars(0)
```

### Form Fields

```python
# Extract form fields
fields = doc.get_form_fields()
for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")

# Fill and save
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.save("filled.pdf")
```

## Rust API

```rust
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;

    // Extract text
    let text = doc.extract_text(0)?;

    // Character-level extraction
    let chars = doc.extract_chars(0)?;

    // Extract images
    let images = doc.extract_images(0)?;

    // Vector graphics
    let paths = doc.extract_paths(0)?;

    Ok(())
}
```

### Form Fields (Rust)

```rust
use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.save_with_options("filled.pdf", SaveOptions::incremental())?;
```

## Installation

### Python

```bash
pip install pdf_oxide
```

Wheels available for Linux, macOS, and Windows. Python 3.8–3.14.

### Rust

```toml
[dependencies]
pdf_oxide = "0.3"
```

### JavaScript/WASM

```bash
npm install pdf-oxide-wasm
```

```javascript
const { WasmPdfDocument } = require("pdf-oxide-wasm");
```

### CLI

```bash
brew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)
cargo install pdf_oxide_cli             # Cargo
cargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall
```

### MCP Server

```bash
brew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew
cargo install pdf_oxide_mcp             # Cargo
```

### Other languages

- **Go**`go get github.com/yfedoseev/pdf_oxide/go` — see [go/README.md]go/README.md
- **JavaScript / TypeScript (Node.js)**`npm install pdf-oxide` — see [js/README.md]js/README.md
- **C# / .NET**`dotnet add package PdfOxide` — see [csharp/README.md]csharp/README.md
- **Java / Kotlin (JDK 11+)** — Maven coords `fyi.oxide:pdf-oxide:0.3.59` — see [java/README.md]java/README.md

  ```xml
  <dependency>
    <groupId>fyi.oxide</groupId>
    <artifactId>pdf-oxide</artifactId>
    <version>0.3.59</version>
  </dependency>
  ```

  ```gradle
  // Gradle (Kotlin DSL)
  implementation("fyi.oxide:pdf-oxide:0.3.59")
  ```

All four share the same Rust core as the Python and WASM bindings, so everything you read in this README applies to them as well — just with each language's native naming conventions.

## CLI

22 commands for PDF processing directly from your terminal:

```bash
pdf-oxide text report.pdf                      # Extract text
pdf-oxide markdown report.pdf -o report.md     # Convert to Markdown
pdf-oxide html report.pdf -o report.html       # Convert to HTML
pdf-oxide info report.pdf                      # Show metadata
pdf-oxide search report.pdf "neural.?network"  # Search (regex)
pdf-oxide images report.pdf -o ./images/       # Extract images
pdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs
pdf-oxide split report.pdf -o ./pages/         # Split into pages
pdf-oxide watermark doc.pdf "DRAFT"            # Add watermark
pdf-oxide forms w2.pdf --fill "name=Jane"      # Fill form fields
```

Run `pdf-oxide` with no arguments for interactive REPL mode. Use `--pages 1-5` to process specific pages, `--json` for machine-readable output.

## MCP Server

`pdf-oxide-mcp` lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the [Model Context Protocol](https://modelcontextprotocol.io/).

Add to your MCP client configuration:

```json
{
  "mcpServers": {
    "pdf-oxide": { "command": "crgx", "args": ["pdf_oxide_mcp@latest"] }
  }
}
```

The server exposes an `extract` tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.

## Building from Source

```bash
# Clone and build
git clone https://github.com/yfedoseev/pdf_oxide
cd pdf_oxide
cargo build --release

# Run tests
cargo test

# Build Python bindings
maturin develop

# Build the shared library for Go, JS/TS, and C# bindings
cargo build --release --lib
# Output: target/release/libpdf_oxide.{so,dylib} or pdf_oxide.dll
```

## Documentation

- **[Full Documentation]https://pdf.oxide.fyi** — Complete documentation site
- **[Getting Started (Rust)]docs/getting-started-rust.md** — Rust guide
- **[Getting Started (Python)]docs/getting-started-python.md** — Python guide
- **[Getting Started (Go)]go/README.md** — Go guide
- **[Getting Started (JavaScript / TypeScript)]js/README.md** — Node.js guide
- **[Getting Started (C# / .NET)]csharp/README.md** — .NET guide
- **[Getting Started (WASM)]docs/getting-started-wasm.md** — Browser and Node.js WASM guide
- **[API Docs]https://docs.rs/pdf_oxide** — Full Rust API reference
- **[Performance Benchmarks]https://pdf.oxide.fyi/docs/performance** — Full benchmark methodology and results

## Use Cases

- **RAG / LLM pipelines** — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework
- **AI assistants** — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server
- **Document processing at scale** — Extract text, images, and metadata from thousands of PDFs in seconds
- **Data extraction** — Pull structured data from forms, tables, and layouts
- **Academic research** — Parse papers, extract citations, and process large corpora
- **PDF generation** — Create invoices, reports, certificates, and templated documents programmatically
- **PyMuPDF alternative** — MIT licensed, 5× faster, no AGPL restrictions

## Why I built this

I needed PyMuPDF's speed without its AGPL license, and I needed it in more than one language. Nothing existed that ticked all three boxes — fast, MIT, multi-language — so I wrote it. The Rust core is what does the real work; the bindings for Python, Go, JS/TS, C#, and WASM are thin shells around the same code, so a bug fix in one lands in all of them. It now passes 100% of the veraPDF + Mozilla pdf.js + DARPA SafeDocs test corpora (3,830 PDFs) on every platform I've tested.

If it's useful to you, a star on GitHub genuinely helps. If something's broken or missing, [open an issue](https://github.com/yfedoseev/pdf_oxide/issues) — I read all of them.

— Yury

## License

Dual-licensed under [MIT](LICENSE-MIT) or [Apache-2.0](LICENSE-APACHE) at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

```bash
cargo build && cargo test && cargo fmt && cargo clippy -- -D warnings
```

## Citation

```bibtex
@software{pdf_oxide,
  title = {PDF Oxide: Fast PDF Toolkit for Rust, Python, Go, JavaScript, and C#},
  author = {Yury Fedoseev},
  year = {2025},
  url = {https://github.com/yfedoseev/pdf_oxide}
}
```

---

**Rust** + **Python** + **Go** + **JS/TS** + **C#** + **WASM** + **CLI** + **MCP** | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders