pdf_oxide 0.3.22

The fastest Rust PDF library with text extraction: 0.8ms mean, 100% pass rate on 3,830 PDFs. 5× faster than pdf_extract, 17× faster than oxidize_pdf. Extract, create, and edit PDFs.
# Getting Started with PDFOxide (Rust)

PDFOxide is the complete PDF toolkit for Rust. One library for extracting, creating, and editing PDFs with a unified API.

## Installation

Add to your `Cargo.toml`:

```toml
[dependencies]
pdf_oxide = "0.3"
```

### Feature Flags

Enable only the features you need. The entries below are alternatives, so keep a single `pdf_oxide` line in your `Cargo.toml`:

```toml
[dependencies]
# Default - text extraction, creation, editing
pdf_oxide = "0.3"

# With barcode generation
pdf_oxide = { version = "0.3", features = ["barcodes"] }

# With Office document conversion (DOCX, XLSX, PPTX)
pdf_oxide = { version = "0.3", features = ["office"] }

# With digital signatures
pdf_oxide = { version = "0.3", features = ["signatures"] }

# With OCR for scanned PDFs (PaddleOCR via ONNX Runtime)
pdf_oxide = { version = "0.3", features = ["ocr"] }

# With page rendering to images
pdf_oxide = { version = "0.3", features = ["rendering"] }

# All features
pdf_oxide = { version = "0.3", features = ["full"] }
```

## Quick Start - The Unified `Pdf` API

The `Pdf` type is the main entry point for all PDF operations:

```rust
use pdf_oxide::api::Pdf;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create from Markdown
    let mut pdf = Pdf::from_markdown("# Hello World\n\nThis is a PDF.")?;
    pdf.save("output.pdf")?;

    Ok(())
}
```

## Creating PDFs

### From Markdown

````rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_markdown(r#"
# Report Title

## Introduction

This is **bold** and *italic* text.

- Item 1
- Item 2
- Item 3

## Code Example

```python
print("Hello, World!")
```
"#)?;
pdf.save("report.pdf")?;
````

### From HTML

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_html(r#"
<h1>Invoice</h1>
<p>Thank you for your purchase.</p>
<table>
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Widget</td><td>$10.00</td></tr>
</table>
"#)?;
pdf.save("invoice.pdf")?;
```

### From Plain Text

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_text("Simple plain text document.\n\nWith paragraphs.")?;
pdf.save("notes.pdf")?;
```

### From Images

```rust
use pdf_oxide::api::Pdf;

// Single image
let mut pdf = Pdf::from_image("photo.jpg")?;
pdf.save("photo.pdf")?;

// Multiple images (one per page)
let mut album = Pdf::from_images(&["page1.jpg", "page2.png", "page3.jpg"])?;
album.save("album.pdf")?;
```

## Opening and Reading PDFs

```rust
use pdf_oxide::api::Pdf;

// Open existing PDF
let mut pdf = Pdf::open("document.pdf")?;

// Extract text from page 0
let text = pdf.extract_text(0)?;
println!("Text: {}", text);

// Convert to Markdown
let markdown = pdf.to_markdown(0)?;
println!("Markdown:\n{}", markdown);

// Get page count
println!("Pages: {}", pdf.page_count());
```

## Editing PDFs

### DOM-like Navigation

```rust
use pdf_oxide::api::{Pdf, PdfElement};

let mut pdf = Pdf::open("document.pdf")?;

// Get a page for DOM-like access
let page = pdf.page(0)?;

// Find text elements
for text in page.find_text_containing("Hello") {
    println!("Found '{}' at {:?}", text.text(), text.bbox());
}

// Iterate through all elements
for element in page.children() {
    match element {
        PdfElement::Text(t) => println!("Text: {}", t.text()),
        PdfElement::Image(i) => println!("Image: {}x{}", i.width(), i.height()),
        PdfElement::Path(p) => println!("Path at {:?}", p.bbox()),
        _ => {}
    }
}
```

### Modifying Content

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("document.pdf")?;

// Get mutable page
let mut page = pdf.page(0)?;

// Find and replace text
let texts = page.find_text_containing("old");
for t in &texts {
    page.set_text(t.id(), "new")?;
}

// Save changes back
pdf.save_page(page)?;
pdf.save("modified.pdf")?;
```

### Adding Annotations

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("document.pdf")?;

// Add highlight
pdf.add_highlight(0, [100.0, 700.0, 300.0, 720.0], None)?;

// Add sticky note
pdf.add_sticky_note(0, 500.0, 750.0, "Review this section")?;

// Add link
pdf.add_link(0, [100.0, 600.0, 200.0, 620.0], "https://example.com")?;

pdf.save("annotated.pdf")?;
```
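
The rectangles in these calls are given in PDF points. PDF user space places the origin at the bottom-left corner of the page by default, so coordinates measured from the top of the page need to be flipped. The helper below is a minimal sketch that assumes the rectangle arrays are ordered `[x1, y1, x2, y2]` (lower-left corner, then upper-right corner), which the values above suggest but the text does not state. `rect_from_top_left` is a hypothetical name, not part of pdf_oxide.

```rust
/// Hypothetical helper (not part of pdf_oxide): converts a box measured from
/// the top-left corner of the page into `[x1, y1, x2, y2]` in PDF user space,
/// whose origin is the bottom-left corner.
/// Assumption: the annotation calls expect lower-left then upper-right corners.
fn rect_from_top_left(x: f32, top: f32, width: f32, height: f32, page_height: f32) -> [f32; 4] {
    let y2 = page_height - top; // upper edge in PDF coordinates
    let y1 = y2 - height; // lower edge in PDF coordinates
    [x, y1, x + width, y2]
}
```

On a US Letter page (792 points tall), a 200 × 20 point box placed 100 points from the left and 72 points from the top maps to `[100.0, 700.0, 300.0, 720.0]`, the same rectangle as the highlight above.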

### Working with Form Fields

Extract, read, fill, and save PDF form fields (AcroForm):

```rust
use pdf_oxide::PdfDocument;
use pdf_oxide::extractors::forms::FormExtractor;

let mut doc = PdfDocument::open("tax-form.pdf")?;

// List all form fields
let fields = FormExtractor::extract_fields(&mut doc)?;
for f in &fields {
    println!("{} ({:?}) = {:?}", f.full_name, f.field_type, f.value);
}
```

#### Fill Form Fields and Save

```rust
use pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};
use pdf_oxide::editor::form_fields::FormFieldValue;

let mut editor = DocumentEditor::open("w2.pdf")?;

// Set text values
editor.set_form_field_value("employee_name", FormFieldValue::Text("Jane Doe".into()))?;
editor.set_form_field_value("wages", FormFieldValue::Text("85000.00".into()))?;

// Set checkbox
editor.set_form_field_value("retirement_plan", FormFieldValue::Boolean(true))?;

// Save with incremental update (preserves original, appends changes)
editor.save_with_options("filled_w2.pdf", SaveOptions::incremental())?;
```
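
An incremental update, as the PDF format defines it, leaves the original bytes untouched and appends the new objects after them. One quick sanity check, sketched below under the assumption that `SaveOptions::incremental()` follows that convention, is to confirm that the saved file begins with the original file's bytes. `is_incremental_update` is a hypothetical helper, not a pdf_oxide function.

```rust
/// Hypothetical check (not part of pdf_oxide): returns true when `updated`
/// consists of the `original` bytes followed by at least one appended byte.
fn is_incremental_update(original: &[u8], updated: &[u8]) -> bool {
    updated.len() > original.len() && updated.starts_with(original)
}
```

In practice the two slices would come from `std::fs::read("w2.pdf")` and `std::fs::read("filled_w2.pdf")`.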

#### Extract Text with Filled Values

Filled values appear inline in `extract_text` and `to_markdown`:

```rust
use pdf_oxide::PdfDocument;
use pdf_oxide::converters::ConversionOptions;

let mut doc = PdfDocument::open("filled_w2.pdf")?;

// Form values appear where the fields are positioned
let text = doc.extract_text(0)?;
println!("{}", text); // "Jane Doe" appears inline

// Include form fields in Markdown (default)
let opts = ConversionOptions { include_form_fields: true, ..Default::default() };
let md = doc.to_markdown(0, &opts)?;

// Exclude form fields
let opts_off = ConversionOptions { include_form_fields: false, ..Default::default() };
let md_clean = doc.to_markdown(0, &opts_off)?;
```

#### Adding New Form Fields

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::open("form-template.pdf")?;

// Add text field
pdf.add_text_field("name", [100.0, 700.0, 300.0, 720.0])?;

// Add checkbox
pdf.add_checkbox("agree", [100.0, 650.0, 120.0, 670.0], false)?;

pdf.save("form.pdf")?;
```

## Builder Pattern for Advanced Creation

For full control over PDF creation, use `PdfBuilder`:

```rust
use pdf_oxide::api::PdfBuilder;
use pdf_oxide::writer::PageSize;

let mut pdf = PdfBuilder::new()
    .title("Annual Report 2025")
    .author("Company Inc.")
    .subject("Financial Summary")
    .page_size(PageSize::A4)
    .margins(72.0, 72.0, 72.0, 72.0)  // 1 inch margins
    .font_size(11.0)
    .from_markdown("# Annual Report\n\n...")?;

pdf.save("annual-report.pdf")?;
```
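
The margins above are expressed in PDF points, where 72 points equal one inch. When a layout is specified in inches or millimetres, a small conversion helper avoids magic numbers. This is only a sketch: `inches_to_points` and `mm_to_points` are hypothetical names, not pdf_oxide functions.

```rust
/// PDF user-space units per inch.
const POINTS_PER_INCH: f32 = 72.0;
/// Millimetres per inch.
const MM_PER_INCH: f32 = 25.4;

/// Converts a length in inches to PDF points.
fn inches_to_points(inches: f32) -> f32 {
    inches * POINTS_PER_INCH
}

/// Converts a length in millimetres to PDF points.
fn mm_to_points(mm: f32) -> f32 {
    mm / MM_PER_INCH * POINTS_PER_INCH
}
```

For example, `mm_to_points(210.0)` is roughly 595.3 and `mm_to_points(297.0)` is roughly 841.9, the dimensions of an A4 page in points.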

## Encryption and Security

### Password Protection

```rust
use pdf_oxide::api::Pdf;

let mut pdf = Pdf::from_markdown("# Confidential Document")?;

// Simple password protection (AES-256)
pdf.save_encrypted("secure.pdf", "user-password", Some("owner-password"))?;
```

### Advanced Encryption Options

```rust
use pdf_oxide::api::Pdf;
use pdf_oxide::editor::{EncryptionConfig, EncryptionAlgorithm, Permissions};

let mut pdf = Pdf::from_markdown("# Protected")?;

let config = EncryptionConfig::new("user", Some("owner"))
    .algorithm(EncryptionAlgorithm::Aes256)
    .permissions(Permissions::PRINT | Permissions::COPY);

pdf.save_with_encryption("protected.pdf", config)?;
```
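
The `|` operator combines permissions as bit flags. The sketch below illustrates the idea with plain integers, using the bit positions the PDF specification assigns to printing (bit 3), modifying (bit 4), and content copying (bit 5) in the encryption dictionary's `/P` entry. The constants are illustrative stand-ins, not pdf_oxide's `Permissions` type, whose internal representation is not described here.

```rust
// Illustrative constants (not pdf_oxide's `Permissions` type).
// The PDF specification numbers bits from 1, so bit 3 is `1 << 2`.
const PRINT: u32 = 1 << 2; // bit 3: print the document
const MODIFY: u32 = 1 << 3; // bit 4: modify the contents
const COPY: u32 = 1 << 4; // bit 5: copy or extract content

/// Returns true when every bit in `flag` is set in `granted`.
fn allows(granted: u32, flag: u32) -> bool {
    (granted & flag) == flag
}
```

With `PRINT | COPY` granted, `allows` returns true for `PRINT` and `COPY` and false for `MODIFY`.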

## PDF Compliance

### PDF/A Validation and Conversion

```rust
use pdf_oxide::compliance::{PdfAValidator, PdfALevel, PdfAConverter};

// Validate
let validator = PdfAValidator::new();
let result = validator.validate_file("document.pdf", PdfALevel::PdfA2b)?;
if result.is_compliant {
    println!("PDF/A-2b compliant!");
} else {
    for error in result.errors {
        println!("Error: {:?}", error);
    }
}

// Convert to PDF/A
let converter = PdfAConverter::new(PdfALevel::PdfA2b);
converter.convert("input.pdf", "archive.pdf")?;
```

## OCR - Extracting Text from Scanned PDFs

> For a comprehensive guide covering model selection, configuration reference, resize strategies, and troubleshooting, see the [OCR Guide](OCR_GUIDE.md).

PDFOxide can extract text from scanned PDFs using PaddleOCR models via ONNX Runtime. Enable the `ocr` feature:

```toml
[dependencies]
pdf_oxide = { version = "0.3", features = ["ocr"] }
```

### Model Setup

PDFOxide supports PaddleOCR v3, v4, and v5 models. You can mix detection and recognition models from different versions.

**Quick start** — download the recommended models:

```bash
./scripts/setup_ocr_models.sh
```

#### Model Selection Guide

| Combination | Detection | Recognition | English Accuracy | Total Size |
|---|---|---|---|---|
| **V4 det + V5 rec (recommended)** | ch_PP-OCRv4_det | en_PP-OCRv5_mobile_rec | Best | ~12.5 MB |
| V4 det + V4 rec | ch_PP-OCRv4_det | en_PP-OCRv4_rec | Good | ~12.4 MB |
| V5 det + V5 rec | PP-OCRv5_server_det | en_PP-OCRv5_mobile_rec | Good (different errors) | ~96 MB |
| V3 det + V3 rec | en_PP-OCRv3_det | en_PP-OCRv3_rec | Fair | ~11 MB |

The **V4 detection + V5 recognition** combination gives the best results for English documents: V4 detection reliably segments text lines, while V5 recognition has the highest character-level accuracy.

**Manual download:**

```bash
# Recommended: V4 detection + V5 recognition
mkdir -p .models  # create the target directory if it does not exist yet
# Detection (4.7 MB):
curl -L https://huggingface.co/deepghs/paddleocr/resolve/main/det/ch_PP-OCRv4_det/model.onnx -o .models/det.onnx

# Recognition (7.8 MB):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/rec.onnx -o .models/rec.onnx

# Dictionary (must include space as last entry):
curl -L https://huggingface.co/monkt/paddleocr-onnx/resolve/main/languages/english/dict.txt -o .models/en_dict.txt
echo " " >> .models/en_dict.txt
```
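
Because the recognition dictionary must end with a space entry, it can be worth verifying the file before constructing the engine. The following is a minimal sketch using only the standard library. `dict_ends_with_space` is a hypothetical helper, and it assumes the dictionary stores one entry per line.

```rust
/// Hypothetical helper (not part of pdf_oxide): true when the last line of
/// the dictionary text is exactly one space character.
fn dict_ends_with_space(contents: &str) -> bool {
    contents.lines().last() == Some(" ")
}
```

The input would typically come from `std::fs::read_to_string(".models/en_dict.txt")`. If the downloaded file lacks a trailing newline, `echo " " >>` merges the space into the previous entry instead of adding a new line, and this check returns false in that case.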

### Basic OCR Usage

```rust
use pdf_oxide::PdfDocument;
use pdf_oxide::ocr::{OcrEngine, OcrConfig, OcrExtractOptions, needs_ocr};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Create OCR engine (reuse across pages)
    let engine = OcrEngine::new(
        ".models/det.onnx",
        ".models/rec.onnx",
        ".models/en_dict.txt",
        OcrConfig::default(),
    )?;

    let mut doc = PdfDocument::open("scanned.pdf")?;
    let options = OcrExtractOptions::with_dpi(300.0);

    for page in 0..doc.page_count()? {
        if needs_ocr(&mut doc, page)? {
            let text = pdf_oxide::ocr::ocr_page(&mut doc, page, &engine, &options)?;
            println!("Page {} (OCR): {}", page + 1, text);
        } else {
            let text = doc.extract_text(page)?;
            println!("Page {} (native): {}", page + 1, text);
        }
    }

    Ok(())
}
```
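
The DPI passed to `OcrExtractOptions::with_dpi` determines how many pixels the OCR models receive. Since a PDF point is 1/72 inch, the rendered size follows directly from the page size. The sketch below shows that arithmetic. `pixels_at_dpi` is a hypothetical helper, and the exact rounding pdf_oxide applies when rendering is an assumption.

```rust
/// Hypothetical helper (not part of pdf_oxide): pixel length of a page side
/// given in PDF points (1 point = 1/72 inch) when rendered at `dpi`.
fn pixels_at_dpi(points: f64, dpi: f64) -> u32 {
    (points / 72.0 * dpi).round() as u32
}
```

A US Letter page (612 × 792 points) rendered at 300 DPI comes out at 2550 × 3300 pixels.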

### Using PP-OCRv5 Detection

If you use the full PP-OCRv5 stack (v5 detection + v5 recognition), use `OcrConfig::v5()`, which preserves the original image resolution instead of downscaling to 960 px:

```rust
// For PP-OCRv5 server detection model (88 MB)
let config = OcrConfig::v5();
let engine = OcrEngine::new("v5_det.onnx", "v5_rec.onnx", "v5_dict.txt", config)?;
```
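
For context, limiting the longest image side to 960 px shrinks a high-DPI page considerably. The sketch below shows one way such a limit can be computed. It is an assumption about how a limit of this kind is typically applied (uniform scaling, never upscaling), not a description of pdf_oxide's actual preprocessing, and `fit_longest_side` is a hypothetical name.

```rust
/// Hypothetical helper (not part of pdf_oxide): scales `width` x `height`
/// uniformly so the longest side is at most `limit`. Smaller images are
/// returned unchanged.
fn fit_longest_side(width: u32, height: u32, limit: u32) -> (u32, u32) {
    let longest = width.max(height);
    if longest <= limit {
        return (width, height);
    }
    let scale = limit as f64 / longest as f64;
    (
        (width as f64 * scale).round() as u32,
        (height as f64 * scale).round() as u32,
    )
}
```

Under this rule a 2550 × 3300 pixel page becomes 742 × 960, a reduction of roughly 3.4× per side.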

> **Note:** ONNX Runtime (`libonnxruntime.so` v1.23+) must be available at runtime. Set `ORT_LIB_LOCATION` to the directory containing the shared library during build, or install the ONNX Runtime system package. You can also set `ORT_PREFER_DYNAMIC_LINK=1` to link dynamically.

## Lower-Level APIs

For specialized use cases, PDFOxide provides lower-level APIs:

| API | Use Case |
|-----|----------|
| `PdfDocument` | Direct PDF parsing and text extraction |
| `DocumentBuilder` | Low-level PDF generation with full control |
| `DocumentEditor` | Direct editing without the `Pdf` wrapper |

### Using PdfDocument Directly

```rust
use pdf_oxide::PdfDocument;

let mut doc = PdfDocument::open("paper.pdf")?;

// Low-level text extraction with spans
let spans = doc.extract_spans(0)?;
for span in spans {
    println!("{} at ({}, {})", span.text, span.x, span.y);
}

// Access raw PDF objects
let page = doc.get_page(0)?;
let media_box = page.get("MediaBox");
```
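
If spans need to be arranged in reading order, they can be sorted by position. The sketch below assumes `span.x` and `span.y` are PDF user-space coordinates with the origin at the bottom-left, which the text does not state. `Span` here is a local stand-in with only the three fields used above, not the pdf_oxide type, and `sort_reading_order` is a hypothetical name.

```rust
/// Local stand-in for illustration; not the pdf_oxide span type.
#[derive(Debug, Clone, PartialEq)]
struct Span {
    text: String,
    x: f32,
    y: f32,
}

/// Sorts top-to-bottom (larger y first), then left-to-right (smaller x first).
/// Assumes a bottom-left origin, so a larger y is higher on the page.
fn sort_reading_order(spans: &mut [Span]) {
    spans.sort_by(|a, b| b.y.total_cmp(&a.y).then(a.x.total_cmp(&b.x)));
}
```

Real pages often need a tolerance when grouping spans into lines, since baselines on the same visual line can differ slightly. This sketch compares `y` values exactly.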

### Using DocumentBuilder Directly

```rust
use pdf_oxide::writer::DocumentBuilder;

let mut builder = DocumentBuilder::new();
builder.add_page(612.0, 792.0)  // Letter size in points
    .text("Custom positioned text", 72.0, 720.0, 12.0)
    .rect(100.0, 600.0, 200.0, 50.0)
    .image_at("logo.png", 400.0, 700.0, 100.0, 50.0)?;

builder.save("custom.pdf")?;
```

## Examples

See the [examples/](../examples/) directory for complete working examples:

- `create_pdf_from_markdown.rs` - Creating PDFs from Markdown
- `edit_existing_pdf.rs` - Opening and modifying PDFs
- `edit_text_content.rs` - In-place text editing
- `add_form_fields.rs` - Interactive form creation
- `encrypt_pdf.rs` - Password protection

Run an example:

```bash
cargo run --example create_pdf_from_markdown
```

## Next Steps

- [PDF Creation Guide](PDF_CREATION_GUIDE.md) - Advanced creation options
- [Architecture](ARCHITECTURE.md) - Understanding the library structure
- [API Documentation](https://docs.rs/pdf_oxide) - Full API reference