pdf_oxide 0.3.23

The fastest Rust PDF library with text extraction: 0.8ms mean, 100% pass rate on 3,830 PDFs. 5× faster than pdf_extract, 17× faster than oxidize_pdf. Extract, create, and edit PDFs.
# Getting Started with PDFOxide (WebAssembly)

PDFOxide compiles to WebAssembly for use in browsers and Node.js. The same Rust core that powers the Python and Rust APIs runs directly in JavaScript/TypeScript with near-native performance.

## Installation

### From npm (recommended)

```bash
npm install pdf-oxide-wasm
```

```javascript
const { WasmPdfDocument } = require("pdf-oxide-wasm");
// or ESM:
// import { WasmPdfDocument } from "pdf-oxide-wasm";
```

### Building from Source

#### Prerequisites

- Rust toolchain with `wasm32-unknown-unknown` target
- `wasm-bindgen-cli` (must match the `wasm-bindgen` version in Cargo.toml)

```bash
# Install the WASM target
rustup target add wasm32-unknown-unknown

# Install wasm-bindgen CLI (check Cargo.toml for the exact version)
cargo install wasm-bindgen-cli --version 0.2.106
```

#### Build for Node.js

```bash
# Build the WASM binary
cargo build --lib --target wasm32-unknown-unknown --features wasm --release

# Generate Node.js bindings
wasm-bindgen --target nodejs --out-dir pkg \
  target/wasm32-unknown-unknown/release/pdf_oxide.wasm
```

This produces four files in `pkg/`:
- `pdf_oxide.js`: JS glue code (import this)
- `pdf_oxide_bg.wasm`: compiled WASM binary
- `pdf_oxide.d.ts`: TypeScript type definitions
- `pdf_oxide_bg.wasm.d.ts`: WASM type definitions

#### Build for Browser

```bash
wasm-bindgen --target web --out-dir pkg \
  target/wasm32-unknown-unknown/release/pdf_oxide.wasm
```

#### Size-Optimized Build

For smaller WASM binaries, use the `release-small` profile:

```bash
cargo build --lib --target wasm32-unknown-unknown --features wasm \
  --profile release-small
```

## Quick Start

### Node.js (ESM)

```javascript
import { readFileSync } from "fs";
import { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";

// Open a PDF file
const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);

// Basic info
console.log(`Pages: ${doc.pageCount()}`);
console.log(`Version: ${doc.version().join(".")}`); // version() returns [major, minor]

// Extract text
const text = doc.extractText(0);
console.log(text);

// Clean up
doc.free();
```

### Browser

```html
<script type="module">
import init, { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";

await init();

// Load PDF from fetch
const response = await fetch("document.pdf");
const bytes = new Uint8Array(await response.arrayBuffer());
const doc = new WasmPdfDocument(bytes);

console.log(`Pages: ${doc.pageCount()}`);
console.log(doc.extractText(0));
doc.free();
</script>
```

### Browser with File Input

```html
<input type="file" id="pdfInput" accept=".pdf" />
<pre id="output"></pre>

<script type="module">
import init, { WasmPdfDocument } from "./pkg/pdf_oxide.js";
await init();

document.getElementById("pdfInput").addEventListener("change", async (e) => {
  const file = e.target.files[0];
  const bytes = new Uint8Array(await file.arrayBuffer());
  const doc = new WasmPdfDocument(bytes);

  let result = `Pages: ${doc.pageCount()}\n\n`;
  for (let i = 0; i < doc.pageCount(); i++) {
    result += `--- Page ${i + 1} ---\n`;
    result += doc.extractText(i) + "\n\n";
  }

  document.getElementById("output").textContent = result;
  doc.free();
});
</script>
```

## Creating PDFs

Create new PDFs from Markdown, HTML, or plain text using `WasmPdf`:

```javascript
import { WasmPdf, WasmPdfDocument } from "./pkg/pdf_oxide.js";

// From Markdown
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.", "My Title", "Author");
const bytes = pdf.toBytes(); // Uint8Array
console.log(`PDF size: ${pdf.size} bytes`);

// From HTML
const invoice = WasmPdf.fromHtml(
  "<h1>Invoice</h1><p>Thank you for your purchase.</p>",
  "Invoice #123"
);

// From plain text
const notes = WasmPdf.fromText("Meeting notes\n\nAction items:\n- Review PR\n- Update docs");

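// Save to file (Node.js only; in a real module, move this import to the top)
import { writeFileSync } from "fs";
writeFileSync("output.pdf", pdf.toBytes());

// Download in browser (browser only; use in place of the Node.js lines above)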
const blob = new Blob([pdf.toBytes()], { type: "application/pdf" });
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = url;
a.download = "output.pdf";
a.click();
```

## Text Extraction

### Single Page

```javascript
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0); // page 0
```

### All Pages

```javascript
const allText = doc.extractAllText(); // pages separated by form feed
```
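
Because `extractAllText()` separates pages with a form feed, the combined string can be split back into per-page text. The following is a minimal sketch, assuming the separator is the single character `\f` (U+000C) and that page text contains no form feeds of its own. `splitPages` is a hypothetical helper, not part of the library, and the sample string stands in for real output.

```javascript
// Hypothetical helper: split the output of extractAllText() into pages.
// Assumes pages are joined with a single form feed character ("\f").
function splitPages(allText) {
  return allText.split("\f");
}

// Stand-in string shaped like extractAllText() output for a 3-page file.
const sample = "Page one text\fPage two text\fPage three text";
const pages = splitPages(sample);

pages.forEach((text, i) => {
  console.log(`--- Page ${i + 1} ---`);
  console.log(text);
});
```

With a real document, pass `doc.extractAllText()` instead of `sample`.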

### Convert to Markdown

```javascript
// Single page
const markdown = doc.toMarkdown(0);

// With options
const md = doc.toMarkdown(0, true, true); // detect_headings, include_images

// All pages
const allMarkdown = doc.toMarkdownAll();
```

### Convert to HTML

```javascript
const html = doc.toHtml(0);

// With layout preservation
const layoutHtml = doc.toHtml(0, true, true); // preserve_layout, detect_headings

// All pages
const allHtml = doc.toHtmlAll();
```

### Convert to Plain Text

```javascript
const plain = doc.toPlainText(0);
const allPlain = doc.toPlainTextAll();
```

## Structured Extraction

Get character-, word-, line-, and span-level data with positions and font metadata, or scope extraction to a region of the page:

```javascript
// 1. Scoped extraction (v0.3.14)
// Area: [x, y, width, height]
const headerRegion = doc.within(0, [0, 700, 612, 92]);
const headerText = headerRegion.extractText();

// 2. Character-level data
const chars = doc.extractChars(0);
// Returns: [{ char, bbox: {x, y, width, height}, font_name, font_size, font_weight, is_italic, color: {r, g, b} }, ...]

for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y}) font=${c.font_name}`);
}

// 3. Word-level extraction (v0.3.14)
const words = doc.extractWords(0);
for (const w of words) {
  console.log(`Word: ${w.text} at ${w.bbox.x},${w.bbox.y}`);
}

// 4. Line-level extraction (v0.3.14)
const lines = doc.extractTextLines(0);
for (const line of lines) {
  console.log(`Line: ${line.text}`);
}

// 5. Span-level data
const spans = doc.extractSpans(0);
// Returns: [{ text, bbox, font_name, font_size, font_weight, is_italic, color }, ...]

for (const span of spans) {
  console.log(`"${span.text}" size=${span.font_size}`);
}
```
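
Span metadata can drive simple layout heuristics. The sketch below treats the most common font size as body text and flags larger spans as heading candidates. It assumes only the `text` and `font_size` fields shown above. `fontSizeHistogram` and `headingCandidates` are hypothetical helpers, and the sample spans are made-up data rather than real output.

```javascript
// Hypothetical helper: count characters of text per font size.
function fontSizeHistogram(spans) {
  const histogram = new Map();
  for (const span of spans) {
    const count = histogram.get(span.font_size) ?? 0;
    histogram.set(span.font_size, count + span.text.length);
  }
  return histogram;
}

// Hypothetical helper: spans larger than the most used font size.
function headingCandidates(spans) {
  let bodySize = 0;
  let bodyCount = -1;
  for (const [size, count] of fontSizeHistogram(spans)) {
    if (count > bodyCount) {
      bodySize = size;
      bodyCount = count;
    }
  }
  return spans.filter((span) => span.font_size > bodySize);
}

// Stand-in data shaped like extractSpans() output (values are made up).
const sampleSpans = [
  { text: "Quarterly Report", font_size: 18 },
  { text: "Revenue grew in the fourth quarter.", font_size: 11 },
  { text: "Costs were flat.", font_size: 11 },
];

console.log(headingCandidates(sampleSpans).map((s) => s.text));
```

This is a rough heuristic. The `detect_headings` option of `toMarkdown` is the built-in route when Markdown output is acceptable.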

## Working with Form Fields

Extract form field data and export filled values:

```javascript
const doc = new WasmPdfDocument(bytes);

// Get all form fields
const fields = doc.getFormFields();
// Returns: [{ name, field_type, value, flags }, ...]

for (const f of fields) {
  console.log(`${f.name} (${f.field_type}) = ${f.value}`);
}

// Export form data as FDF or XFDF
const fdfBytes = doc.exportFormData();       // FDF format (default)
const xfdfBytes = doc.exportFormData("xfdf"); // XFDF format
```
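
For application code it is often handier to look fields up by name. This is a minimal sketch, assuming each entry has the `name` and `value` properties listed above. `fieldsToObject` is a hypothetical helper, and the field names, types, and values in the sample are made up.

```javascript
// Hypothetical helper: turn the array from getFormFields() into a
// plain name -> value lookup object.
function fieldsToObject(fields) {
  const values = Object.create(null); // no inherited keys to collide with
  for (const field of fields) {
    values[field.name] = field.value;
  }
  return values;
}

// Stand-in data shaped like getFormFields() output (contents are made up).
const sampleFields = [
  { name: "full_name", field_type: "text", value: "Jane Doe", flags: 0 },
  { name: "subscribe", field_type: "checkbox", value: "Yes", flags: 0 },
];

const formValues = fieldsToObject(sampleFields);
console.log(JSON.stringify(formValues, null, 2));
```

If two fields share a name, the later one wins in this sketch.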

### Form Fields in Text Extraction

Filled form field values appear inline in `toMarkdown` and `toHtml`:

```javascript
// Include form field values (default)
const md = doc.toMarkdown(0, true, true, true); // ..., include_form_fields=true
const html = doc.toHtml(0, true, true, true);

// Exclude form field values
const mdClean = doc.toMarkdown(0, true, true, false); // include_form_fields=false
```

## Text Search

Search across all pages or within a specific page:

```javascript
// Search all pages
const results = doc.search("hello", true); // case_insensitive=true
// Returns: [{ page, text, bbox, start_index, end_index, span_boxes }, ...]

for (const r of results) {
  console.log(`Found "${r.text}" on page ${r.page}`);
}

// Search single page
const pageResults = doc.searchPage(0, "hello", true, true); // case_insensitive, literal

// Regex search
const regexResults = doc.search("\\d{4}-\\d{2}-\\d{2}"); // find dates

// Whole word match
const wordResults = doc.search("test", false, true, true); // case_insensitive=false, literal=true, whole_word=true
```
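
Search results arrive as one flat array, so a per-page summary needs a grouping step. The sketch below relies only on the `page` field listed above. `groupByPage` is a hypothetical helper, and the sample results are made-up data.

```javascript
// Hypothetical helper: group search results by their page index.
function groupByPage(results) {
  const byPage = new Map();
  for (const result of results) {
    if (!byPage.has(result.page)) {
      byPage.set(result.page, []);
    }
    byPage.get(result.page).push(result);
  }
  return byPage;
}

// Stand-in data shaped like search() output (values are made up).
const sampleResults = [
  { page: 0, text: "hello" },
  { page: 2, text: "Hello" },
  { page: 0, text: "HELLO" },
];

for (const [page, hits] of groupByPage(sampleResults)) {
  console.log(`Page ${page}: ${hits.length} match(es)`);
}
```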

## Image Metadata

```javascript
// Get image metadata (does NOT return raw bytes)
const images = doc.extractImages(0);
// Returns: [{ width, height, color_space, bits_per_component, bbox }, ...]

for (const img of images) {
  console.log(`Image: ${img.width}x${img.height} ${img.color_space}`);
}
```

## Editing PDFs

### Metadata

```javascript
const doc = new WasmPdfDocument(bytes);

doc.setTitle("Updated Title");
doc.setAuthor("Jane Doe");
doc.setSubject("Quarterly Report");
doc.setKeywords("finance, Q4, 2025");

const edited = doc.saveToBytes(); // Uint8Array with changes applied
```

### Page Rotation

```javascript
// Get current rotation
const rotation = doc.pageRotation(0); // 0, 90, 180, or 270

// Set absolute rotation
doc.setPageRotation(0, 90);

// Add to current rotation
doc.rotatePage(0, 90); // if was 90, now 180

// Rotate all pages
doc.rotateAllPages(180);
```
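
The difference between setting and adding a rotation comes down to modular arithmetic. The sketch below shows that arithmetic on plain numbers, assuming rotations are multiples of 90 and wrap around at 360. `addRotation` is a hypothetical helper, and how the library treats other values is not covered here.

```javascript
// Hypothetical helper: the result of adding `delta` degrees to `current`,
// wrapped into the range 0..359.
function addRotation(current, delta) {
  return (((current + delta) % 360) + 360) % 360;
}

console.log(addRotation(90, 90));   // 180
console.log(addRotation(270, 180)); // 90
console.log(addRotation(0, -90));   // 270
```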

### Page Dimensions

```javascript
// Get MediaBox [llx, lly, urx, ury]
const mediaBox = doc.pageMediaBox(0);
const [llx, lly, urx, ury] = mediaBox;
console.log(`Page size: ${urx - llx}x${ury - lly} points`); // width and height, not the upper-right corner

// Set MediaBox
doc.setPageMediaBox(0, 0, 0, 612, 792); // US Letter

// Get CropBox (may be null if not set)
const cropBox = doc.pageCropBox(0);

// Set CropBox
doc.setPageCropBox(0, 50, 50, 562, 742);

// Crop margins from all pages (points)
doc.cropMargins(36, 36, 36, 36); // 0.5 inch margins
```
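
Box coordinates are in PDF points (72 per inch), so converting to physical units is simple arithmetic. The following sketch assumes a box in the `[llx, lly, urx, ury]` layout shown above. `boxSize` is a hypothetical helper that works on any four-element array.

```javascript
const POINTS_PER_INCH = 72;
const MM_PER_INCH = 25.4;

// Hypothetical helper: width and height of a [llx, lly, urx, ury] box
// in points, inches, and millimetres.
function boxSize([llx, lly, urx, ury]) {
  const widthPt = urx - llx;
  const heightPt = ury - lly;
  return {
    widthPt,
    heightPt,
    widthIn: widthPt / POINTS_PER_INCH,
    heightIn: heightPt / POINTS_PER_INCH,
    widthMm: (widthPt / POINTS_PER_INCH) * MM_PER_INCH,
    heightMm: (heightPt / POINTS_PER_INCH) * MM_PER_INCH,
  };
}

// US Letter, as in the setPageMediaBox example above.
const letter = boxSize([0, 0, 612, 792]);
console.log(`${letter.widthIn} x ${letter.heightIn} in`); // 8.5 x 11 in
```

With a real document, pass `doc.pageMediaBox(0)` instead of the literal array.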

### Erase / Whiteout

```javascript
// Erase a single region
doc.eraseRegion(0, 100, 700, 300, 720); // llx, lly, urx, ury

// Erase multiple regions at once
const rects = new Float32Array([
  100, 700, 300, 720,  // region 1
  100, 650, 300, 670,  // region 2
]);
doc.eraseRegions(0, rects);

// Clear pending erase operations
doc.clearEraseRegions(0);
```
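
When the regions to erase start out as `{x, y, width, height}` boxes, they need flattening into the `Float32Array` layout that `eraseRegions` takes. The sketch below assumes `x`/`y` mark the lower-left corner of each box, which is worth checking against real data. `packRects` is a hypothetical helper.

```javascript
// Hypothetical helper: convert {x, y, width, height} boxes into a flat
// [llx, lly, urx, ury, ...] Float32Array.
// Assumes (x, y) is the lower-left corner of each box.
function packRects(boxes) {
  const flat = new Float32Array(boxes.length * 4);
  boxes.forEach((box, i) => {
    flat[i * 4] = box.x;
    flat[i * 4 + 1] = box.y;
    flat[i * 4 + 2] = box.x + box.width;
    flat[i * 4 + 3] = box.y + box.height;
  });
  return flat;
}

// The same two regions as the manual example above, expressed as boxes.
const packed = packRects([
  { x: 100, y: 700, width: 200, height: 20 },
  { x: 100, y: 650, width: 200, height: 20 },
]);
console.log(Array.from(packed));
```

The result can then be passed as the second argument, as in `doc.eraseRegions(0, packed)`.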

### Annotations

```javascript
// Flatten annotations into page content (makes them permanent)
doc.flattenPageAnnotations(0);

// Flatten all pages
doc.flattenAllAnnotations();
```

### Redaction

```javascript
// Apply redactions on a page (permanently removes content)
doc.applyPageRedactions(0);

// Apply redactions on all pages
doc.applyAllRedactions();
```

### Image Manipulation

```javascript
// List images on a page
const images = doc.pageImages(0);
// Returns: [{ name, bounds: [x, y, width, height], matrix: [a, b, c, d, e, f] }, ...]

// Reposition an image
doc.repositionImage(0, images[0].name, 100, 500);

// Resize an image
doc.resizeImage(0, images[0].name, 200, 150);

// Set full bounds
doc.setImageBounds(0, images[0].name, 100, 500, 200, 150);
```

## Saving

```javascript
// Save with edits
const output = doc.saveToBytes(); // Uint8Array

// Save with encryption (AES-256)
const encrypted = doc.saveEncryptedToBytes(
  "user-password",
  "owner-password",  // optional, defaults to user password
  true,   // allow_print
  true,   // allow_copy
  false,  // allow_modify
  true    // allow_annotate
);
```

## Encrypted PDFs

```javascript
const doc = new WasmPdfDocument(encryptedBytes);

// Authenticate before accessing content
const success = doc.authenticate("password");
if (success) {
  const text = doc.extractText(0);
  console.log(text);
}
```

## Document Info

```javascript
const doc = new WasmPdfDocument(bytes);

const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);
console.log(`Pages: ${doc.pageCount()}`);
console.log(`Tagged PDF: ${doc.hasStructureTree()}`);
```

## Memory Management

WASM objects hold Rust memory that must be freed explicitly:

```javascript
const doc = new WasmPdfDocument(bytes);
try {
  // ... work with doc
} finally {
  doc.free();
}

// Or with the `using` declaration (TC39 Explicit Resource Management),
// in runtimes that support it and when the bindings expose Symbol.dispose:
{
  using scopedDoc = new WasmPdfDocument(bytes);
  // automatically freed when scopedDoc goes out of scope
}
```
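
The `try`/`finally` pattern can be wrapped once and reused. This is a minimal sketch that depends only on the object having a `free()` method. `withDocument` is a hypothetical helper, and the stand-in object exists only so the example runs without the WASM package.

```javascript
// Hypothetical helper: run `work` with a document and always call free().
// `open` is any function returning an object with a free() method,
// for example () => new WasmPdfDocument(bytes).
function withDocument(open, work) {
  const doc = open();
  try {
    return work(doc);
  } finally {
    doc.free();
  }
}

// Stand-in object so the sketch runs without the WASM package.
let freed = false;
const fakeOpen = () => ({
  pageCount: () => 3,
  free: () => {
    freed = true;
  },
});

const count = withDocument(fakeOpen, (doc) => doc.pageCount());
console.log(count, freed); // 3 true
```

The callback must finish its work synchronously. If it returns a promise, `free()` runs before the promise settles.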

## TypeScript

Type definitions are generated alongside the JS bindings. Import directly:

```typescript
import { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";

const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const markdown: string = doc.toMarkdown(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");
const size: number = pdf.size;
```

## Error Handling

All methods that can fail throw JavaScript `Error` objects:

```javascript
try {
  const bad = new WasmPdfDocument(new Uint8Array([0, 1, 2])); // invalid PDF
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}

const doc = new WasmPdfDocument(bytes); // a valid PDF
try {
  doc.extractText(999); // invalid page index
} catch (e) {
  console.error(`Extraction failed: ${e.message}`);
} finally {
  doc.free();
}
```
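
Some failures can be avoided by validating input before calling into WASM. The sketch below checks the page index against `pageCount()` and returns a result object instead of throwing. `extractTextSafe` is a hypothetical helper, and the stand-in document only mimics the two methods it uses.

```javascript
// Hypothetical helper: validate the page index, then extract text,
// returning { ok, text } or { ok, error } instead of throwing.
function extractTextSafe(doc, pageIndex) {
  const total = doc.pageCount();
  if (!Number.isInteger(pageIndex) || pageIndex < 0 || pageIndex >= total) {
    return { ok: false, error: `page ${pageIndex} is out of range (${total} pages)` };
  }
  try {
    return { ok: true, text: doc.extractText(pageIndex) };
  } catch (e) {
    return { ok: false, error: e.message };
  }
}

// Stand-in document so the sketch runs without the WASM package.
const fakeDoc = {
  pageCount: () => 2,
  extractText: (i) => `text of page ${i}`,
};

console.log(extractTextSafe(fakeDoc, 1));
console.log(extractTextSafe(fakeDoc, 999));
```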

## API Reference

### WasmPdf (PDF Creation)

| Method | Returns | Description |
|--------|---------|-------------|
| `WasmPdf.fromMarkdown(content, title?, author?)` | `WasmPdf` | Create PDF from Markdown |
| `WasmPdf.fromHtml(content, title?, author?)` | `WasmPdf` | Create PDF from HTML |
| `WasmPdf.fromText(content, title?, author?)` | `WasmPdf` | Create PDF from plain text |
| `.toBytes()` | `Uint8Array` | Get PDF as bytes |
| `.size` | `number` | PDF size in bytes (readonly) |

### WasmPdfDocument (Read, Extract, Edit)

**Read-Only:**

| Method | Returns | Description |
|--------|---------|-------------|
| `new WasmPdfDocument(data)` | `WasmPdfDocument` | Load PDF from `Uint8Array` |
| `.pageCount()` | `number` | Number of pages |
| `.version()` | `Uint8Array` | PDF version as `[major, minor]` |
| `.authenticate(password)` | `boolean` | Decrypt an encrypted PDF |
| `.hasStructureTree()` | `boolean` | Check if Tagged PDF |

**Text Extraction:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.extractText(page)` | `string` | Plain text from one page |
| `.extractAllText()` | `string` | Plain text from all pages |
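| `.extractChars(page)` | `Array` | Character-level data with positions |
| `.extractWords(page)` | `Array` | Word-level data with positions (v0.3.14) |
| `.extractTextLines(page)` | `Array` | Line-level data (v0.3.14) |
| `.extractSpans(page)` | `Array` | Span-level data with font info |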

**Format Conversion:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.toMarkdown(page, headings?, images?, formFields?)` | `string` | Convert page to Markdown |
| `.toMarkdownAll(headings?, images?)` | `string` | Convert all pages to Markdown |
| `.toHtml(page, layout?, headings?, formFields?)` | `string` | Convert page to HTML |
| `.toHtmlAll(layout?, headings?)` | `string` | Convert all pages to HTML |
| `.toPlainText(page)` | `string` | Convert page to plain text |
| `.toPlainTextAll()` | `string` | Convert all pages to plain text |

**Search:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.search(pattern, case?, literal?, word?, max?)` | `Array` | Search all pages |
| `.searchPage(page, pattern, case?, literal?, word?, max?)` | `Array` | Search one page |

**Image Info:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.extractImages(page)` | `Array` | Image metadata (no raw bytes) |
| `.pageImages(page)` | `Array` | Image names and bounds |

**Document Structure:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.getOutline()` | `Array\|null` | Document bookmarks / table of contents |
| `.getAnnotations(page)` | `Array` | Annotation metadata (type, rect, contents, etc.) |
| `.extractPaths(page)` | `Array` | Vector paths (lines, curves, shapes) |

**Form Fields:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.getFormFields()` | `Array` | All form fields with name, type, value, flags |
| `.exportFormData(format?)` | `Uint8Array` | Export form data as FDF (default) or XFDF |

**Editing:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.setTitle(title)` | `void` | Set document title |
| `.setAuthor(author)` | `void` | Set document author |
| `.setSubject(subject)` | `void` | Set document subject |
| `.setKeywords(keywords)` | `void` | Set document keywords |
| `.setPageRotation(page, degrees)` | `void` | Set page rotation |
| `.rotatePage(page, degrees)` | `void` | Add to page rotation |
| `.rotateAllPages(degrees)` | `void` | Rotate all pages |
| `.setPageMediaBox(page, llx, lly, urx, ury)` | `void` | Set MediaBox |
| `.setPageCropBox(page, llx, lly, urx, ury)` | `void` | Set CropBox |
| `.cropMargins(left, right, top, bottom)` | `void` | Crop all page margins |
| `.eraseRegion(page, llx, lly, urx, ury)` | `void` | Whiteout a region |
| `.eraseRegions(page, rects)` | `void` | Whiteout multiple regions |
| `.clearEraseRegions(page)` | `void` | Clear pending erases |
| `.flattenPageAnnotations(page)` | `void` | Flatten annotations on page |
| `.flattenAllAnnotations()` | `void` | Flatten all annotations |
| `.applyPageRedactions(page)` | `void` | Apply redactions on page |
| `.applyAllRedactions()` | `void` | Apply all redactions |
| `.repositionImage(page, name, x, y)` | `void` | Move image |
| `.resizeImage(page, name, w, h)` | `void` | Resize image |
| `.setImageBounds(page, name, x, y, w, h)` | `void` | Set image bounds |

**Save:**

| Method | Returns | Description |
|--------|---------|-------------|
| `.saveToBytes()` | `Uint8Array` | Save edited PDF |
| `.saveEncryptedToBytes(pass, owner?, print?, copy?, modify?, annotate?)` | `Uint8Array` | Save with AES-256 encryption |
| `.free()` | `void` | Release WASM memory |

## Feature Availability

Some features require native dependencies and are **not available** in WebAssembly builds:

| Feature | WASM | Notes |
|---------|------|-------|
| Text extraction | Yes | Full support |
| PDF creation | Yes | Markdown, HTML, text, images |
| PDF editing | Yes | Full support |
| Encryption | Yes | AES-256 |
| OCR | **No** | Requires ONNX Runtime (native only) |
| Digital signatures | **No** | Requires native crypto libraries |
| Page rendering | **No** | Requires tiny-skia (native only) |

For OCR support, use the [Rust](getting-started-rust.md) or [Python](getting-started-python.md) bindings. See the [OCR Guide](OCR_GUIDE.md) for details.

## Next Steps

- [TypeScript Definitions](../examples/wasm_node/pdf_oxide.d.ts): Full type signatures
- [Node.js Example](../examples/wasm_node/extract_text.mjs): Working demo script
- [API Reference](https://docs.rs/pdf_oxide): Full Rust API documentation
- [GitHub Issues](https://github.com/yfedoseev/pdf_oxide/issues): Report bugs or request features