pdf_oxide 0.3.23

The fastest Rust PDF library with text extraction: 0.8ms mean, 100% pass rate on 3,830 PDFs. 5× faster than pdf_extract, 17× faster than oxidize_pdf. Extract, create, and edit PDFs.
# Getting Started with PDFOxide (Python)

PDFOxide is a PDF toolkit: one library for extracting, creating, and editing PDFs through a unified API, built on a Rust core for performance.

## Installation

```bash
pip install pdf_oxide
```

## Quick Start - The Unified `Pdf` API

The `Pdf` class is your main entry point for all PDF operations:

```python
from pdf_oxide import Pdf

# Create from Markdown
pdf = Pdf.from_markdown("# Hello World\n\nThis is a PDF.")
pdf.save("output.pdf")
```

## Creating PDFs

### From Markdown

````python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("""
# Report Title

## Introduction

This is **bold** and *italic* text.

- Item 1
- Item 2
- Item 3

## Code Example

```python
print("Hello, World!")
```
""")
pdf.save("report.pdf")
````

### From HTML

```python
from pdf_oxide import Pdf

pdf = Pdf.from_html("""
<h1>Invoice</h1>
<p>Thank you for your purchase.</p>
<table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$10.00</td></tr>
</table>
""")
pdf.save("invoice.pdf")
```

### From Plain Text

```python
from pdf_oxide import Pdf

pdf = Pdf.from_text("Simple plain text document.\n\nWith paragraphs.")
pdf.save("notes.pdf")
```

### From Images

```python
from pdf_oxide import Pdf

# Single image
pdf = Pdf.from_image("photo.jpg")
pdf.save("photo.pdf")

# Multiple images (one per page)
album = Pdf.from_images(["page1.jpg", "page2.png", "page3.jpg"])
album.save("album.pdf")
```
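When the page images come from a directory listing, plain string sorting puts `page10.jpg` before `page2.png`. The sketch below shows one way to order file names numerically before handing them to `Pdf.from_images`. The helper names `natural_key` and `natural_sorted` are illustrative and not part of pdf_oxide.

```python
import re


def natural_key(name):
    """Split a file name into text and integer chunks so 'page10' sorts after 'page2'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def natural_sorted(names):
    """Return the names ordered by their embedded numbers rather than character by character."""
    return sorted(names, key=natural_key)


pages = natural_sorted(["page10.jpg", "page2.png", "page1.jpg"])
print(pages)

# The ordered list can then be passed to the call shown above:
# album = Pdf.from_images(pages)
```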

## Opening and Reading PDFs

```python
from pdf_oxide import PdfDocument

# Open an existing PDF (path can be a str or pathlib.Path)
doc = PdfDocument("document.pdf")

# Extract text from page 0 (pages are zero-indexed)
text = doc.extract_text(0)
print(f"Text: {text}")

# Convert page 0 to Markdown
markdown = doc.to_markdown(0)
print(f"Markdown:\n{markdown}")

# Get page count
print(f"Pages: {doc.page_count()}")

# Or use a context manager; keep all work on the document inside the block
with PdfDocument("document.pdf") as doc:
    text = doc.extract_text(0)
    print(f"Text: {text}")
    print(f"Pages: {doc.page_count()}")
```
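A common next step is to walk every page. The following is a minimal sketch built only on `page_count()` and `extract_text()` as shown above. The helpers `iter_page_text` and `first_page_containing` are hypothetical names, and they accept any object that exposes those two methods.

```python
def iter_page_text(doc):
    """Yield (page_index, text) for every page of an opened document."""
    for index in range(doc.page_count()):
        yield index, doc.extract_text(index)


def first_page_containing(doc, needle):
    """Return the zero-based index of the first page whose text contains needle, or None."""
    wanted = needle.lower()
    for index, text in iter_page_text(doc):
        if wanted in text.lower():
            return index
    return None


# Intended use with pdf_oxide:
# from pdf_oxide import PdfDocument
# with PdfDocument("document.pdf") as doc:
#     print(first_page_containing(doc, "invoice"))
```

Because the generator extracts one page at a time, a search can stop early without reading the rest of the document.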

## Builder Pattern for Advanced Creation

For full control over PDF creation, use `PdfBuilder`:

```python
from pdf_oxide import PdfBuilder, PageSize

pdf = (PdfBuilder()
    .title("Annual Report 2025")
    .author("Company Inc.")
    .subject("Financial Summary")
    .page_size(PageSize.A4)
    .margins(72.0, 72.0, 72.0, 72.0)  # 1 inch margins
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n..."))

pdf.save("annual-report.pdf")
```
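The margin values above are in PDF points, where 72 points equal one inch. If you think in inches or millimetres, a small conversion helper keeps the builder call readable. This is a sketch; `inches_to_points` and `mm_to_points` are illustrative helpers, not pdf_oxide functions.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def inches_to_points(inches):
    """Convert inches to PDF points."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


print(inches_to_points(1.0))        # 72.0
print(round(mm_to_points(20), 2))   # 56.69

# Intended use with the builder shown above (same margin on all four sides):
# margin = mm_to_points(20)
# PdfBuilder().margins(margin, margin, margin, margin)
```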

## Encryption and Security

### Password Protection

```python
from pdf_oxide import Pdf

pdf = Pdf.from_markdown("# Confidential Document")

# Simple password protection (AES-256)
pdf.save_encrypted("secure.pdf", "user-password", "owner-password")
```
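Hard-coded passwords tend to end up in version control. One option is to read them from the environment before calling `save_encrypted`. The sketch below assumes two environment variables, `PDF_USER_PASSWORD` and `PDF_OWNER_PASSWORD`; both names and the helper `passwords_from_env` are placeholders chosen for this example.

```python
import os


def passwords_from_env(user_var="PDF_USER_PASSWORD", owner_var="PDF_OWNER_PASSWORD"):
    """Return (user_password, owner_password) read from environment variables.

    The variable names are placeholders for this sketch.
    """
    missing = [name for name in (user_var, owner_var) if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return os.environ[user_var], os.environ[owner_var]


# Intended use with the call shown above:
# user_password, owner_password = passwords_from_env()
# pdf.save_encrypted("secure.pdf", user_password, owner_password)
```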

## Text Extraction Options

### Basic Extraction

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("paper.pdf")
text = doc.extract_text(0)
```

### With Options

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(
    detect_headings=True,
    detect_lists=True,
    embed_images=True
)
markdown = doc.to_markdown(0, options)
```

### Extract All Pages

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("book.pdf")

# Extract text from all pages
all_text = doc.extract_text_all()

# Convert entire document to Markdown
all_markdown = doc.to_markdown_all()
```

## Office Document Conversion

Convert DOCX, XLSX, and PPTX files to PDF:

```python
from pdf_oxide import OfficeConverter

# Auto-detect format
converter = OfficeConverter()

# Convert Word document
converter.convert("report.docx", "report.pdf")

# Convert Excel spreadsheet
converter.convert("data.xlsx", "data.pdf")

# Convert PowerPoint presentation
converter.convert("slides.pptx", "slides.pdf")
```

## Working with Images

### Extract Images from PDF

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")
images = doc.extract_images(0)

for i, img in enumerate(images):
    img.save(f"image_{i}.png")
```

### Embed Images in Output

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("paper.pdf")
options = ConversionOptions(embed_images=True)

# Images embedded as base64 data URIs
html = doc.to_html(0, options)
```

## OCR - Extracting Text from Scanned PDFs

> For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the [OCR Guide](OCR_GUIDE.md).

### Setup

The OCR feature requires heavy machine learning dependencies (ONNX Runtime) and is optional.

```bash
# Recommended: Install with OCR support
# (quotes keep shells such as zsh from expanding the brackets)
pip install "pdf_oxide[ocr]"

# Or build from source with OCR
maturin develop --features python,ocr
```

> **Troubleshooting:** If you see `RuntimeError: OCR feature not enabled`, it means the library was installed without OCR support. Re-install using the `[ocr]` extra above.

**Quick start** — download the recommended models:

```bash
./scripts/setup_ocr_models.sh
```

#### Model Selection Guide

PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions.

| Combination | Detection | Recognition | English Accuracy | Total Size |
|---|---|---|---|---|
| **V4 det + V5 rec (recommended)** | ch_PP-OCRv4_det | en_PP-OCRv5_mobile_rec | Best | ~12.5 MB |
| V4 det + V4 rec | ch_PP-OCRv4_det | en_PP-OCRv4_rec | Good | ~12.4 MB |
| V5 det + V5 rec | PP-OCRv5_server_det | en_PP-OCRv5_mobile_rec | Good (different errors) | ~96 MB |
| V3 det + V3 rec | en_PP-OCRv3_det | en_PP-OCRv3_rec | Fair | ~11 MB |

The **V4 detection + V5 recognition** combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy.

**Manual download:**

```bash
# Create the target directory used by the commands below
mkdir -p .models

# Recommended: V4 detection + V5 recognition
# Detection (4.7 MB):
curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx

# Recognition (7.8 MB):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx

# Dictionary (must include space as last entry):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt
echo " " >> .models/en_dict.txt
```
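The `echo " " >>` line appends a space entry every time it runs, so repeating the download steps leaves duplicate entries at the end of the dictionary. A guarded version in Python could look like the sketch below. `ensure_space_entry` is a hypothetical helper; it only relies on the requirement stated above that the last entry is a single space.

```python
from pathlib import Path


def ensure_space_entry(dict_path):
    """Append a single-space entry unless the dictionary already ends with one.

    Returns True if the file was modified, False if it was already correct.
    """
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8")
    entries = text.split("\n")
    if entries and entries[-1] == "":
        entries.pop()  # drop the empty piece that follows a final newline
    if entries and entries[-1] == " ":
        return False
    # Terminate the previous entry first if the file lacks a trailing newline.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    path.write_text(text + separator + " \n", encoding="utf-8")
    return True


# Intended use after downloading the dictionary:
# ensure_space_entry(".models/en_dict.txt")
```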

### Basic OCR Usage

```python
from pdf_oxide import PdfDocument, OcrEngine, OcrConfig

# Create OCR engine (default config works with recommended V4 det + V5 rec models)
engine = OcrEngine(
    det_model_path=".models/det.onnx",
    rec_model_path=".models/rec.onnx",
    dict_path=".models/en_dict.txt",
)

# Extract text using OCR
doc = PdfDocument("scanned.pdf")
text = doc.extract_text_ocr(page=0, engine=engine)
print(text)
```

### Processing Multiple Pages

```python
# Reuses `engine` from Basic OCR Usage above
from pdf_oxide import PdfDocument

doc = PdfDocument("scanned.pdf")
for page in range(doc.page_count()):
    text = doc.extract_text_ocr(page=page, engine=engine)
    if text.strip():
        print(f"--- Page {page + 1} ---")
        print(text)
```
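Many documents mix pages that carry an embedded text layer with scanned pages that do not, and running OCR on every page is slower than it needs to be. The sketch below tries `extract_text` first and falls back to `extract_text_ocr` only when little or no text comes back. The helper name and the `min_chars` threshold are assumptions for illustration; tune the threshold for your documents.

```python
def text_with_ocr_fallback(doc, page, engine, min_chars=20):
    """Return (text, source) where source is 'native' or 'ocr'.

    min_chars is an arbitrary cut-off for "this page has a usable text layer".
    """
    text = doc.extract_text(page)
    if len(text.strip()) >= min_chars:
        return text, "native"
    return doc.extract_text_ocr(page=page, engine=engine), "ocr"


# Intended use with the objects created above:
# for page in range(doc.page_count()):
#     text, source = text_with_ocr_fallback(doc, page, engine)
#     print(f"--- Page {page + 1} ({source}) ---")
#     print(text)
```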

### Using PP-OCRv5 Detection

If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), pass `use_v5=True` to `OcrConfig`. This preserves the original image resolution instead of downscaling to 960px, which the larger v5 detection model needs:

```python
from pdf_oxide import OcrConfig, OcrEngine

config = OcrConfig(use_v5=True)
engine = OcrEngine(
    det_model_path="v5_det.onnx",
    rec_model_path="v5_rec.onnx",
    dict_path="v5_dict.txt",
    config=config,
)
```

> **Note:** The `OcrEngine` is reusable — create it once and pass it to multiple `extract_text_ocr` calls. ONNX Runtime requires `libonnxruntime.so` (v1.23+) to be available at runtime (via `LD_LIBRARY_PATH` or system install).

## Structured Extraction

Beyond plain text, PDFOxide can extract structured content from pages:

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("document.pdf")

# 1. Scoped extraction from a specific area (v0.3.14)
# Area: (x, y, width, height) in points
header = doc.within(0, (0, 700, 612, 92)).extract_text()

# 2. Text spans with font info, position, and style
spans = doc.extract_spans(0)
for span in spans:
    print(f"{span.text} — {span.font_name} {span.font_size}pt")

# 3. Word-level extraction (v0.3.14)
words = doc.extract_words(0)
for w in words:
    print(f"Word: {w.text} at {w.bbox}")
    # Access character metadata for the word
    # print(w.chars[0].font_name)

# Optional: override the adaptive word gap threshold (in PDF points).
# Smaller values split more aggressively; useful for dense forms.
words = doc.extract_words(0, word_gap_threshold=2.5)

# 4. Line-level extraction (v0.3.14)
lines = doc.extract_text_lines(0)
for line in lines:
    print(f"Line: {line.text}")

# Optional: override word and/or line gap thresholds (in PDF points).
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# 5. Inspect computed layout params before overriding
params = doc.page_layout_params(0)
print(f"Adaptive word gap: {params.word_gap_threshold:.1f}pt")
print(f"Adaptive line gap: {params.line_gap_threshold:.1f}pt")

# 6. Pre-tuned extraction profiles for different document types
from pdf_oxide import ExtractionProfile
profile = ExtractionProfile.form()
print(f"Profile: {profile.name}, word_margin_ratio={profile.word_margin_ratio}")

# Pass a profile to extraction methods to control how raw text is parsed
words = doc.extract_words(0, profile=ExtractionProfile.form())
lines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())

# Combine profile with threshold overrides (profile controls span parsing,
# thresholds control word/line clustering)
words = doc.extract_words(0, word_gap_threshold=1.5, profile=ExtractionProfile.aggressive())

# 7. Image metadata
images = doc.extract_images(0)
for img in images:
    print(f"{img['width']}x{img['height']} {img['color_space']}")

# 8. Bookmarks / table of contents
outline = doc.get_outline()  # None if no outline
if outline:
    for item in outline:
        print(f"{item['title']} -> page {item.get('page')}")

# 9. Vector paths (lines, curves, shapes)
paths = doc.extract_paths(0)
for path in paths:
    print(f"bbox={path['bbox']}, stroke={path.get('stroke_color')}")
```
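Span-level font information is enough for simple layout heuristics. As an illustration, the sketch below estimates the body font size of a page and lists spans that are noticeably larger, which are often headings. It reads only the `text` and `font_size` attributes shown above; the helper names and the `ratio` of 1.2 are assumptions, not library behavior.

```python
from collections import Counter


def body_font_size(spans):
    """Return the font size that covers the most characters, or None for no spans."""
    weights = Counter()
    for span in spans:
        weights[round(span.font_size, 1)] += len(span.text)
    return weights.most_common(1)[0][0] if weights else None


def heading_candidates(spans, ratio=1.2):
    """Return the text of spans at least `ratio` times larger than the body size."""
    body = body_font_size(spans)
    if body is None:
        return []
    return [span.text for span in spans
            if span.font_size >= body * ratio and span.text.strip()]


# Intended use with the spans extracted above:
# print(heading_candidates(doc.extract_spans(0)))
```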

## Working with Form Fields

PDFOxide can extract, read, fill, and export PDF form field data (AcroForm fields).

### List All Form Fields

```python
from pdf_oxide import PdfDocument

doc = PdfDocument("tax-form.pdf")
fields = doc.get_form_fields()

for f in fields:
    print(f"{f.name} ({f.field_type}) = {f.value}")
```

Each `FormField` has:
- `name`: fully qualified field name (e.g. `"topmostSubform[0].Page1[0].f1_01[0]"`)
- `field_type`: one of `"text"`, `"button"`, `"choice"`, `"signature"`
- `value`: current value (`str`, `bool`, or `None`)
- `flags`: field flags (read-only, required, etc.)
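For downstream processing it is often easier to work with a plain dictionary than with a list of field objects. The sketch below uses only the `name`, `field_type`, and `value` attributes listed above; `fields_to_dict` and `unfilled` are illustrative helper names.

```python
def fields_to_dict(fields, field_type=None):
    """Map field names to values, optionally keeping only one field type."""
    return {field.name: field.value
            for field in fields
            if field_type is None or field.field_type == field_type}


def unfilled(fields):
    """Return the names of fields whose value is None or an empty string."""
    return [field.name for field in fields
            if field.value is None or field.value == ""]


# Intended use with the fields listed above:
# fields = doc.get_form_fields()
# print(fields_to_dict(fields, field_type="text"))
# print(unfilled(fields))
```

An unchecked checkbox with the value `False` is not reported as unfilled, because only `None` and the empty string count as missing here.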

### Read and Set Field Values

```python
doc = PdfDocument("w2.pdf")

# Read a field value
ssn = doc.get_form_field_value("topmostSubform[0].CopyA[0].f1_01[0]")
print(f"SSN: {ssn}")

# Fill fields
doc.set_form_field_value("employee_name", "Jane Doe")
doc.set_form_field_value("wages", "85000.00")
doc.set_form_field_value("retirement_plan", True)  # checkbox

# Save (values are persisted via incremental save)
doc.save("filled_w2.pdf")
```
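Fully qualified field names are long and easy to mistype, so it can help to check names against the document before filling. The following sketch fills many fields from a dictionary and reports names that do not exist. `fill_form` is a hypothetical helper that relies only on `get_form_fields()` and `set_form_field_value()` as shown in this guide.

```python
def fill_form(doc, values):
    """Set every known field from `values` and return the sorted names that were not found."""
    known = {field.name for field in doc.get_form_fields()}
    missing = sorted(name for name in values if name not in known)
    for name, value in values.items():
        if name in known:
            doc.set_form_field_value(name, value)
    return missing


# Intended use with the document opened above:
# missing = fill_form(doc, {"employee_name": "Jane Doe", "wages": "85000.00"})
# if missing:
#     print(f"Unknown fields: {missing}")
# doc.save("filled_w2.pdf")
```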

### Extract Text with Form Field Values

Filled form field values appear inline in `extract_text` and `to_markdown`:

```python
doc = PdfDocument("filled_w2.pdf")

# Form values appear inline in extracted text
text = doc.extract_text(0)
print(text)  # "Jane Doe" appears where the name field is

# to_markdown includes form fields by default
md = doc.to_markdown(0, include_form_fields=True)

# Exclude form field values
md_clean = doc.to_markdown(0, include_form_fields=False)
```

### Export Form Data

```python
doc = PdfDocument("filled-form.pdf")

# Export as FDF
doc.export_form_data("form_data.fdf")

# Export as XFDF
doc.export_form_data("form_data.xfdf", format="xfdf")
```
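If the output path is supplied by a user, the export format can be derived from the file extension. This is a small sketch around the two calls shown above; `export_form` is an illustrative helper, and rejecting other extensions is a design choice of the example rather than library behavior.

```python
from pathlib import Path


def export_form(doc, output_path):
    """Export form data as FDF or XFDF based on the file extension; return the format used."""
    suffix = Path(output_path).suffix.lower()
    if suffix == ".xfdf":
        doc.export_form_data(str(output_path), format="xfdf")
    elif suffix == ".fdf":
        doc.export_form_data(str(output_path))
    else:
        raise ValueError(f"unsupported form data extension: {suffix!r}")
    return suffix.lstrip(".")


# Intended use with the document opened above:
# export_form(doc, "form_data.xfdf")
```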

## Performance Tips

1. **Reuse document objects** - Opening a PDF has overhead, so reuse the object for multiple operations
2. **Use specific page extraction** - `extract_text(page_num)` is faster than `extract_text_all()` if you only need some pages
3. **Disable features you don't need** - Use `ConversionOptions` to skip heading detection, list detection, and image embedding

```python
from pdf_oxide import PdfDocument, ConversionOptions

doc = PdfDocument("large.pdf")

# Fast extraction - minimal processing
options = ConversionOptions(
    detect_headings=False,
    detect_lists=False,
    embed_images=False
)
markdown = doc.to_markdown(0, options)
```
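To apply the second tip, it helps to turn a human-friendly page selection into the zero-based indexes that `extract_text` expects. The sketch below parses a 1-based specification such as `"1-3,7"`. The specification format and the helper names are assumptions made for this example.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec such as '1-3,7' into sorted zero-based page indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, dash, last = part.partition("-")
        start = int(first)
        end = int(last) if dash else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range {part!r} for {page_count} pages")
        indexes.update(range(start - 1, end))
    return sorted(indexes)


def extract_selected(doc, spec):
    """Extract text only for the pages named in `spec`, keyed by zero-based index."""
    return {index: doc.extract_text(index)
            for index in parse_page_ranges(spec, doc.page_count())}


print(parse_page_ranges("1-3,7", page_count=10))  # [0, 1, 2, 6]
```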

## Error Handling

```python
from pdf_oxide import PdfDocument, PdfError

try:
    doc = PdfDocument("document.pdf")
    text = doc.extract_text(0)
except PdfError as e:
    print(f"PDF error: {e}")
except FileNotFoundError:
    print("File not found")
```
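In batch jobs a single unreadable file should usually not stop the whole run. The sketch below collects failures instead of raising. It takes the document opener and the exception types as parameters, so the same pattern works with `PdfDocument` and `PdfError`; the helper name `extract_first_pages` is illustrative.

```python
def extract_first_pages(paths, open_doc, handled=(OSError,)):
    """Extract page 0 from each path; return (texts, failures) keyed by path string."""
    texts, failures = {}, {}
    for path in paths:
        try:
            doc = open_doc(path)
            texts[str(path)] = doc.extract_text(0)
        except handled as error:
            failures[str(path)] = f"{type(error).__name__}: {error}"
    return texts, failures


# Intended use with pdf_oxide (FileNotFoundError is a subclass of OSError):
# from pdf_oxide import PdfDocument, PdfError
# texts, failures = extract_first_pages(
#     ["a.pdf", "b.pdf"], PdfDocument, handled=(PdfError, OSError))
# for path, reason in failures.items():
#     print(f"Skipped {path}: {reason}")
```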

## Examples

See the [examples/](../examples/) directory for complete working examples.

### Quick Script Examples

**Extract text from all PDFs in a folder:**

```python
from pdf_oxide import PdfDocument
from pathlib import Path

for pdf_path in Path("documents").glob("*.pdf"):
    # PdfDocument accepts pathlib.Path directly
    with PdfDocument(pdf_path) as doc:
        text = doc.extract_text_all()

    output_path = pdf_path.with_suffix(".txt")
    output_path.write_text(text, encoding="utf-8")
    print(f"Extracted: {pdf_path.name}")
```

**Batch convert Markdown to PDF:**

```python
from pdf_oxide import Pdf
from pathlib import Path

for md_path in Path("notes").glob("*.md"):
    content = md_path.read_text(encoding="utf-8")
    pdf = Pdf.from_markdown(content)

    output_path = md_path.with_suffix(".pdf")
    pdf.save(str(output_path))
    print(f"Created: {output_path.name}")
```

## Next Steps

- [API Reference](https://docs.rs/pdf_oxide) - Full API documentation
- [PDF Creation Guide](PDF_CREATION_GUIDE.md) - Advanced creation options
- [GitHub Issues](https://github.com/yfedoseev/pdf_oxide/issues) - Report bugs or request features