# Getting Started with PDFOxide (WebAssembly)
PDFOxide compiles to WebAssembly for use in browsers and Node.js. The same Rust core that powers the Python and Rust APIs runs directly in JavaScript/TypeScript with near-native performance.
## Installation
### From npm (recommended)
```bash
npm install pdf-oxide-wasm
```
```javascript
const { WasmPdfDocument } = require("pdf-oxide-wasm");
// or ESM:
// import { WasmPdfDocument } from "pdf-oxide-wasm";
```
### Building from Source
#### Prerequisites
- Rust toolchain with `wasm32-unknown-unknown` target
- `wasm-bindgen-cli` (must match the `wasm-bindgen` version in Cargo.toml)
```bash
# Install the WASM target
rustup target add wasm32-unknown-unknown
# Install wasm-bindgen CLI (check Cargo.toml for the exact version)
cargo install wasm-bindgen-cli --version 0.2.106
```
#### Build for Node.js
```bash
# Build the WASM binary
cargo build --lib --target wasm32-unknown-unknown --features wasm --release
# Generate Node.js bindings
wasm-bindgen --target nodejs --out-dir pkg \
  target/wasm32-unknown-unknown/release/pdf_oxide.wasm
```
This produces four files in `pkg/`:
- `pdf_oxide.js` — JS glue code (import this)
- `pdf_oxide_bg.wasm` — compiled WASM binary
- `pdf_oxide.d.ts` — TypeScript type definitions
- `pdf_oxide_bg.wasm.d.ts` — WASM type definitions
#### Build for Browser
```bash
wasm-bindgen --target web --out-dir pkg \
  target/wasm32-unknown-unknown/release/pdf_oxide.wasm
```
#### Size-Optimized Build
For smaller WASM binaries, use the `release-small` profile:
```bash
cargo build --lib --target wasm32-unknown-unknown --features wasm \
  --profile release-small
```
## Quick Start
### Node.js (ESM)
```javascript
import { readFileSync } from "fs";
import { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";
// Open a PDF file
const bytes = new Uint8Array(readFileSync("document.pdf"));
const doc = new WasmPdfDocument(bytes);
// Basic info
console.log(`Pages: ${doc.pageCount()}`);
console.log(`Version: ${doc.version().join(".")}`); // version() returns [major, minor]
// Extract text
const text = doc.extractText(0);
console.log(text);
// Clean up
doc.free();
```
### Browser
```html
<script type="module">
  import init, { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";
  await init();
  const response = await fetch("document.pdf");
  const bytes = new Uint8Array(await response.arrayBuffer());
  const doc = new WasmPdfDocument(bytes);
  console.log(`Pages: ${doc.pageCount()}`);
  console.log(doc.extractText(0));
  doc.free();
</script>
```
### Browser with File Input
```html
<input type="file" id="pdfInput" accept=".pdf" />
<pre id="output"></pre>
<script type="module">
  import init, { WasmPdfDocument } from "./pkg/pdf_oxide.js";
  await init();
  document.getElementById("pdfInput").addEventListener("change", async (e) => {
    const file = e.target.files[0];
    if (!file) return; // selection was cancelled
    const bytes = new Uint8Array(await file.arrayBuffer());
    const doc = new WasmPdfDocument(bytes);
    let result = `Pages: ${doc.pageCount()}\n\n`;
    for (let i = 0; i < doc.pageCount(); i++) {
      result += `--- Page ${i + 1} ---\n`;
      result += doc.extractText(i) + "\n\n";
    }
    document.getElementById("output").textContent = result;
    doc.free();
  });
</script>
```
## Creating PDFs
Create new PDFs from Markdown, HTML, or plain text using `WasmPdf`:
```javascript
import { WasmPdf, WasmPdfDocument } from "./pkg/pdf_oxide.js";
// From Markdown
const pdf = WasmPdf.fromMarkdown("# Hello World\n\nThis is a PDF.", "My Title", "Author");
const bytes = pdf.toBytes(); // Uint8Array
console.log(`PDF size: ${pdf.size} bytes`);
// From HTML
const invoice = WasmPdf.fromHtml(
  "<h1>Invoice</h1><p>Thank you for your purchase.</p>",
  "Invoice #123"
);
// From plain text
const notes = WasmPdf.fromText("Meeting notes\n\nAction items:\n- Review PR\n- Update docs");
// Save to file (Node.js)
import { writeFileSync } from "fs";
writeFileSync("output.pdf", pdf.toBytes());
// Download in browser
const blob = new Blob([pdf.toBytes()], { type: "application/pdf" });
const url = URL.createObjectURL(blob);
const a = document.createElement("a");
a.href = url;
a.download = "output.pdf";
a.click();
```
## Text Extraction
### Single Page
```javascript
const doc = new WasmPdfDocument(bytes);
const text = doc.extractText(0); // page 0
```
### All Pages
```javascript
const allText = doc.extractAllText(); // pages separated by form feed
```
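Because `extractAllText()` separates pages with a form feed, its result can be split back into per-page strings. The sketch below assumes the separator is the single character `\f` (U+000C) and that it does not otherwise occur inside page text. `splitPages` is a hypothetical helper, not part of the library, and the sample string stands in for real output:

```javascript
// Hypothetical helper: split form-feed-separated text into per-page strings.
// Assumes pages are joined by a single "\f" (U+000C) character.
function splitPages(allText) {
  return allText.split("\f");
}

// Stand-in for the string returned by doc.extractAllText()
const sample = "Page one text\fPage two text\fPage three text";
const pages = splitPages(sample);
pages.forEach((text, i) => {
  console.log(`--- Page ${i + 1} ---\n${text}`);
});
```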
### Convert to Markdown
```javascript
// Single page
const markdown = doc.toMarkdown(0);
// With options
const md = doc.toMarkdown(0, true, true); // detect_headings, include_images
// All pages
const allMarkdown = doc.toMarkdownAll();
```
### Convert to HTML
```javascript
const html = doc.toHtml(0);
// With layout preservation
const layoutHtml = doc.toHtml(0, true, true); // preserve_layout, detect_headings
// All pages
const allHtml = doc.toHtmlAll();
```
### Convert to Plain Text
```javascript
const plain = doc.toPlainText(0);
const allPlain = doc.toPlainTextAll();
```
## Structured Extraction
Get character-, word-, line-, and span-level data with positions and font metadata, or scope extraction to a region of the page:
```javascript
// 1. Scoped extraction (v0.3.14)
// Area: [x, y, width, height]
const headerRegion = doc.within(0, [0, 700, 612, 92]);
const headerText = headerRegion.extractText();
// 2. Character-level data
const chars = doc.extractChars(0);
// Returns: [{ char, bbox: {x, y, width, height}, font_name, font_size, font_weight, is_italic, color: {r, g, b} }, ...]
for (const c of chars) {
  console.log(`'${c.char}' at (${c.bbox.x}, ${c.bbox.y}) font=${c.font_name}`);
}
// 3. Word-level extraction (v0.3.14)
const words = doc.extractWords(0);
for (const w of words) {
  console.log(`Word: ${w.text} at ${w.bbox.x},${w.bbox.y}`);
}
// 4. Line-level extraction (v0.3.14)
const lines = doc.extractTextLines(0);
for (const line of lines) {
  console.log(`Line: ${line.text}`);
}
// 5. Span-level data
const spans = doc.extractSpans(0);
// Returns: [{ text, bbox, font_name, font_size, font_weight, is_italic, color }, ...]
for (const span of spans) {
  console.log(`"${span.text}" size=${span.font_size}`);
}
```
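One possible use of the span metadata is a rough heading heuristic: pick the spans set in the largest font on a page. This is only a sketch. `largestFontSpans` is a hypothetical helper, the sample array stands in for `doc.extractSpans(0)`, and it relies only on the `text` and `font_size` fields listed above:

```javascript
// Hypothetical helper: return the text of spans set in the largest font size.
// Expects objects shaped like extractSpans() output: { text, font_size, ... }.
function largestFontSpans(spans) {
  if (spans.length === 0) return [];
  const maxSize = spans.reduce((max, s) => Math.max(max, s.font_size), -Infinity);
  return spans.filter((s) => s.font_size === maxSize).map((s) => s.text);
}

// Stand-in data; in real use: const spans = doc.extractSpans(0);
const sampleSpans = [
  { text: "Annual Report", font_size: 24 },
  { text: "Revenue grew in Q4.", font_size: 11 },
  { text: "Summary", font_size: 24 },
];
console.log(largestFontSpans(sampleSpans));
```

Largest-font matching is a crude signal; real documents may need font weight or position as well.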
## Working with Form Fields
Extract form field data and export filled values:
```javascript
const doc = new WasmPdfDocument(bytes);
// Get all form fields
const fields = doc.getFormFields();
// Returns: [{ name, field_type, value, flags }, ...]
for (const f of fields) {
  console.log(`${f.name} (${f.field_type}) = ${f.value}`);
}
// Export form data as FDF or XFDF
const fdfBytes = doc.exportFormData(); // FDF format (default)
const xfdfBytes = doc.exportFormData("xfdf"); // XFDF format
```
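For application code it is often handier to look up field values by name. The sketch below converts a field list into a plain object. `fieldsToObject` is a hypothetical helper, and the sample field names, types, and values are made up for illustration, since the exact value types are not specified here:

```javascript
// Hypothetical helper: map form field names to their values.
// Expects objects shaped like getFormFields() output: { name, field_type, value, flags }.
function fieldsToObject(fields) {
  return Object.fromEntries(fields.map((f) => [f.name, f.value]));
}

// Stand-in data; in real use: const fields = doc.getFormFields();
const sampleFields = [
  { name: "full_name", field_type: "text", value: "Jane Doe", flags: 0 },
  { name: "subscribe", field_type: "checkbox", value: "Yes", flags: 0 },
];
const formValues = fieldsToObject(sampleFields);
console.log(formValues.full_name);
```

If two fields share a name, the later one wins in this sketch.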
### Form Fields in Text Extraction
Filled form field values appear inline in `toMarkdown` and `toHtml`:
```javascript
// Include form field values (default)
const md = doc.toMarkdown(0, true, true, true); // ..., include_form_fields=true
const html = doc.toHtml(0, true, true, true);
// Exclude form field values
const mdClean = doc.toMarkdown(0, true, true, false); // include_form_fields=false
```
## Text Search
Search across all pages or within a specific page:
```javascript
// Search all pages
const results = doc.search("hello", true); // case_insensitive=true
// Returns: [{ page, text, bbox, start_index, end_index, span_boxes }, ...]
for (const r of results) {
  console.log(`Found "${r.text}" on page ${r.page}`);
}
// Search single page
const pageResults = doc.searchPage(0, "hello", true, true); // case_insensitive, literal
// Regex search
const regexResults = doc.search("\\d{4}-\\d{2}-\\d{2}"); // find dates
// Whole word match
const wordResults = doc.search("test", false, true, true); // literal, whole_word
```
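Search results arrive as a flat list, so a common follow-up step is grouping matches by page. A minimal sketch, using only the `page` and `text` fields listed above; `groupByPage` is a hypothetical helper and the sample array stands in for the return value of `doc.search(...)`:

```javascript
// Hypothetical helper: group matched text by page index.
// Expects objects shaped like search() output: { page, text, ... }.
function groupByPage(results) {
  const byPage = new Map();
  for (const r of results) {
    if (!byPage.has(r.page)) byPage.set(r.page, []);
    byPage.get(r.page).push(r.text);
  }
  return byPage;
}

// Stand-in data; in real use: const results = doc.search("hello", true);
const sampleResults = [
  { page: 0, text: "hello" },
  { page: 2, text: "Hello" },
  { page: 0, text: "HELLO" },
];
for (const [page, matches] of groupByPage(sampleResults)) {
  console.log(`Page ${page}: ${matches.length} match(es)`);
}
```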
## Image Metadata
```javascript
// Get image metadata (does NOT return raw bytes)
const images = doc.extractImages(0);
// Returns: [{ width, height, color_space, bits_per_component, bbox }, ...]
for (const img of images) {
  console.log(`Image: ${img.width}x${img.height} ${img.color_space}`);
}
```
## Editing PDFs
### Metadata
```javascript
const doc = new WasmPdfDocument(bytes);
doc.setTitle("Updated Title");
doc.setAuthor("Jane Doe");
doc.setSubject("Quarterly Report");
doc.setKeywords("finance, Q4, 2025");
const edited = doc.saveToBytes(); // Uint8Array with changes applied
```
### Page Rotation
```javascript
// Get current rotation
const rotation = doc.pageRotation(0); // 0, 90, 180, or 270
// Set absolute rotation
doc.setPageRotation(0, 90);
// Add to current rotation
doc.rotatePage(0, 90); // if was 90, now 180
// Rotate all pages
doc.rotateAllPages(180);
```
### Page Dimensions
```javascript
// Get MediaBox [llx, lly, urx, ury]
const mediaBox = doc.pageMediaBox(0);
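// Width and height are corner differences; the lower-left corner is not always (0, 0)
console.log(`Page size: ${mediaBox[2] - mediaBox[0]}x${mediaBox[3] - mediaBox[1]} points`);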
// Set MediaBox
doc.setPageMediaBox(0, 0, 0, 612, 792); // US Letter
// Get CropBox (may be null if not set)
const cropBox = doc.pageCropBox(0);
// Set CropBox
doc.setPageCropBox(0, 50, 50, 562, 742);
// Crop margins from all pages (points)
doc.cropMargins(36, 36, 36, 36); // 0.5 inch margins
```
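Box coordinates are in PDF points (1/72 inch by default), and width and height come from the difference between the corners. The sketch below converts a `[llx, lly, urx, ury]` box into a size in points and inches. `boxSize` is a hypothetical helper, and the sample array stands in for the value returned by `doc.pageMediaBox(0)`:

```javascript
const POINTS_PER_INCH = 72; // default PDF user unit

// Hypothetical helper: size of a [llx, lly, urx, ury] box in points and inches.
function boxSize([llx, lly, urx, ury]) {
  const width = urx - llx;
  const height = ury - lly;
  return {
    width,
    height,
    widthIn: width / POINTS_PER_INCH,
    heightIn: height / POINTS_PER_INCH,
  };
}

// Stand-in for doc.pageMediaBox(0) on a US Letter page
const letter = boxSize([0, 0, 612, 792]);
console.log(`${letter.widthIn} x ${letter.heightIn} in`); // 8.5 x 11 in
```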
### Erase / Whiteout
```javascript
// Erase a single region
doc.eraseRegion(0, 100, 700, 300, 720); // llx, lly, urx, ury
// Erase multiple regions at once
const rects = new Float32Array([
  100, 700, 300, 720, // region 1
  100, 650, 300, 670, // region 2
]);
doc.eraseRegions(0, rects);
// Clear pending erase operations
doc.clearEraseRegions(0);
```
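When regions come from application data rather than literals, they need to be flattened into the four-values-per-region layout shown above. A minimal sketch; `packRects` and the `{ llx, lly, urx, ury }` object shape are hypothetical conveniences, not part of the library:

```javascript
// Hypothetical helper: flatten rectangle objects into the
// [llx, lly, urx, ury, llx, lly, urx, ury, ...] layout used with eraseRegions().
function packRects(rects) {
  const packed = new Float32Array(rects.length * 4);
  rects.forEach((r, i) => {
    packed.set([r.llx, r.lly, r.urx, r.ury], i * 4);
  });
  return packed;
}

const regions = [
  { llx: 100, lly: 700, urx: 300, ury: 720 },
  { llx: 100, lly: 650, urx: 300, ury: 670 },
];
const packedRects = packRects(regions);
console.log(packedRects.length); // 8
// In real use: doc.eraseRegions(0, packedRects);
```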
### Annotations
```javascript
// Flatten annotations into page content (makes them permanent)
doc.flattenPageAnnotations(0);
// Flatten all pages
doc.flattenAllAnnotations();
```
### Redaction
```javascript
// Apply redactions on a page (permanently removes content)
doc.applyPageRedactions(0);
// Apply redactions on all pages
doc.applyAllRedactions();
```
### Image Manipulation
```javascript
// List images on a page
const images = doc.pageImages(0);
// Returns: [{ name, bounds: [x, y, width, height], matrix: [a, b, c, d, e, f] }, ...]
// Reposition an image
doc.repositionImage(0, images[0].name, 100, 500);
// Resize an image
doc.resizeImage(0, images[0].name, 200, 150);
// Set full bounds
doc.setImageBounds(0, images[0].name, 100, 500, 200, 150);
```
## Saving
```javascript
// Save with edits
const output = doc.saveToBytes(); // Uint8Array
// Save with encryption (AES-256)
const encrypted = doc.saveEncryptedToBytes(
  "user-password",
  "owner-password", // optional, defaults to user password
  true,  // allow_print
  true,  // allow_copy
  false, // allow_modify
  true   // allow_annotate
);
```
## Encrypted PDFs
```javascript
const doc = new WasmPdfDocument(encryptedBytes);
// Authenticate before accessing content
const success = doc.authenticate("password");
if (success) {
  const text = doc.extractText(0);
  console.log(text);
}
```
## Document Info
```javascript
const doc = new WasmPdfDocument(bytes);
const [major, minor] = doc.version();
console.log(`PDF ${major}.${minor}`);
console.log(`Pages: ${doc.pageCount()}`);
console.log(`Tagged PDF: ${doc.hasStructureTree()}`);
```
## Memory Management
WASM objects hold Rust memory that must be freed explicitly:
```javascript
const doc = new WasmPdfDocument(bytes);
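try {
  // ... work with doc
} finally {
  doc.free();
}
// Or with the `using` declaration (TC39 Explicit Resource Management),
// if your runtime and the generated bindings support it.
// A separate block scope avoids redeclaring `doc`:
{
  using scopedDoc = new WasmPdfDocument(bytes);
  // automatically freed when the block exits
}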
```
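The `try`/`finally` pattern can be wrapped once and reused. The sketch below is one way to do that. `withDocument` is a hypothetical helper, and the stub object only imitates the `pageCount()` and `free()` methods so the example runs without a PDF:

```javascript
// Hypothetical helper: run a synchronous callback and always free the document,
// even if the callback throws.
function withDocument(doc, fn) {
  try {
    return fn(doc);
  } finally {
    doc.free();
  }
}

// Stub standing in for a WasmPdfDocument
const stubDoc = {
  freed: false,
  pageCount() { return 3; },
  free() { this.freed = true; },
};
const count = withDocument(stubDoc, (d) => d.pageCount());
console.log(count, stubDoc.freed); // 3 true
// In real use: withDocument(new WasmPdfDocument(bytes), (d) => d.extractText(0));
```

This version frees the document as soon as the callback returns, so it suits synchronous work only; an async callback would need an `await` inside the `try`.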
## TypeScript
Type definitions are generated alongside the JS bindings. Import directly:
```typescript
import { WasmPdfDocument, WasmPdf } from "./pkg/pdf_oxide.js";
const doc: WasmPdfDocument = new WasmPdfDocument(bytes);
const text: string = doc.extractText(0);
const markdown: string = doc.toMarkdown(0);
const pdf: WasmPdf = WasmPdf.fromMarkdown("# Hello");
const size: number = pdf.size;
```
## Error Handling
All methods that can fail throw JavaScript `Error` objects:
```javascript
try {
  const doc = new WasmPdfDocument(new Uint8Array([0, 1, 2])); // invalid PDF
} catch (e) {
  console.error(`Failed to open: ${e.message}`);
}
const doc = new WasmPdfDocument(bytes); // a valid PDF
try {
  doc.extractText(999); // invalid page index
} catch (e) {
  console.error(`Extraction failed: ${e.message}`);
} finally {
  doc.free();
}
```
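Some failures can be caught before they reach the WASM boundary. The sketch below validates a page index against `pageCount()` and throws a descriptive `RangeError` before calling `extractText`. `extractTextChecked` is a hypothetical helper, and the stub object imitates only the two methods it needs:

```javascript
// Hypothetical helper: validate the page index before extracting text.
function extractTextChecked(doc, page) {
  const count = doc.pageCount();
  if (!Number.isInteger(page) || page < 0 || page >= count) {
    throw new RangeError(`Page index ${page} is out of range (document has ${count} pages)`);
  }
  return doc.extractText(page);
}

// Stub standing in for a WasmPdfDocument
const twoPageDoc = {
  pageCount() { return 2; },
  extractText(i) { return `text of page ${i}`; },
};

console.log(extractTextChecked(twoPageDoc, 1));
try {
  extractTextChecked(twoPageDoc, 999);
} catch (e) {
  console.error(`${e.name}: ${e.message}`);
}
```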
## API Reference
### WasmPdf (PDF Creation)
| Method / Property | Returns | Description |
|-------------------|---------|-------------|
| `WasmPdf.fromMarkdown(content, title?, author?)` | `WasmPdf` | Create PDF from Markdown |
| `WasmPdf.fromHtml(content, title?, author?)` | `WasmPdf` | Create PDF from HTML |
| `WasmPdf.fromText(content, title?, author?)` | `WasmPdf` | Create PDF from plain text |
| `.toBytes()` | `Uint8Array` | Get PDF as bytes |
| `.size` | `number` | PDF size in bytes (readonly) |
### WasmPdfDocument (Read, Extract, Edit)
**Read-Only:**
| Method | Returns | Description |
|--------|---------|-------------|
| `new WasmPdfDocument(data)` | `WasmPdfDocument` | Load PDF from `Uint8Array` |
| `.pageCount()` | `number` | Number of pages |
| `.version()` | `Uint8Array` | PDF version as `[major, minor]` |
| `.authenticate(password)` | `boolean` | Decrypt an encrypted PDF |
| `.hasStructureTree()` | `boolean` | Check if Tagged PDF |
**Text Extraction:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.extractText(page)` | `string` | Plain text from one page |
| `.extractAllText()` | `string` | Plain text from all pages |
| `.extractChars(page)` | `Array` | Character-level data with positions |
| `.extractSpans(page)` | `Array` | Span-level data with font info |
| `.extractWords(page)` | `Array` | Word-level data with positions (v0.3.14) |
| `.extractTextLines(page)` | `Array` | Line-level text (v0.3.14) |
**Format Conversion:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.toMarkdown(page, headings?, images?, formFields?)` | `string` | Convert page to Markdown |
| `.toMarkdownAll(headings?, images?)` | `string` | Convert all pages to Markdown |
| `.toHtml(page, layout?, headings?, formFields?)` | `string` | Convert page to HTML |
| `.toHtmlAll(layout?, headings?)` | `string` | Convert all pages to HTML |
| `.toPlainText(page)` | `string` | Convert page to plain text |
| `.toPlainTextAll()` | `string` | Convert all pages to plain text |
**Search:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.search(pattern, case?, literal?, word?, max?)` | `Array` | Search all pages |
| `.searchPage(page, pattern, case?, literal?, word?, max?)` | `Array` | Search one page |
**Image Info:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.extractImages(page)` | `Array` | Image metadata (no raw bytes) |
| `.pageImages(page)` | `Array` | Image names and bounds |
**Document Structure:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.getOutline()` | `Array\|null` | Document bookmarks / table of contents |
| `.getAnnotations(page)` | `Array` | Annotation metadata (type, rect, contents, etc.) |
| `.extractPaths(page)` | `Array` | Vector paths (lines, curves, shapes) |
**Form Fields:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.getFormFields()` | `Array` | All form fields with name, type, value, flags |
| `.exportFormData(format?)` | `Uint8Array` | Export form data as FDF (default) or XFDF |
**Editing:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.setTitle(title)` | `void` | Set document title |
| `.setAuthor(author)` | `void` | Set document author |
| `.setSubject(subject)` | `void` | Set document subject |
| `.setKeywords(keywords)` | `void` | Set document keywords |
| `.setPageRotation(page, degrees)` | `void` | Set page rotation |
| `.rotatePage(page, degrees)` | `void` | Add to page rotation |
| `.rotateAllPages(degrees)` | `void` | Rotate all pages |
| `.setPageMediaBox(page, llx, lly, urx, ury)` | `void` | Set MediaBox |
| `.setPageCropBox(page, llx, lly, urx, ury)` | `void` | Set CropBox |
| `.cropMargins(left, right, top, bottom)` | `void` | Crop all page margins |
| `.eraseRegion(page, llx, lly, urx, ury)` | `void` | Whiteout a region |
| `.eraseRegions(page, rects)` | `void` | Whiteout multiple regions |
| `.clearEraseRegions(page)` | `void` | Clear pending erases |
| `.flattenPageAnnotations(page)` | `void` | Flatten annotations on page |
| `.flattenAllAnnotations()` | `void` | Flatten all annotations |
| `.applyPageRedactions(page)` | `void` | Apply redactions on page |
| `.applyAllRedactions()` | `void` | Apply all redactions |
| `.repositionImage(page, name, x, y)` | `void` | Move image |
| `.resizeImage(page, name, w, h)` | `void` | Resize image |
| `.setImageBounds(page, name, x, y, w, h)` | `void` | Set image bounds |
**Save:**
| Method | Returns | Description |
|--------|---------|-------------|
| `.saveToBytes()` | `Uint8Array` | Save edited PDF |
| `.saveEncryptedToBytes(pass, owner?, print?, copy?, modify?, annotate?)` | `Uint8Array` | Save with AES-256 encryption |
| `.free()` | `void` | Release WASM memory |
## Feature Availability
Some features require native dependencies and are **not available** in WebAssembly builds:
| Feature | Available in WASM | Notes |
|---------|-------------------|-------|
| Text extraction | Yes | Full support |
| PDF creation | Yes | Markdown, HTML, text, images |
| PDF editing | Yes | Full support |
| Encryption | Yes | AES-256 |
| OCR | **No** | Requires ONNX Runtime (native only) |
| Digital signatures | **No** | Requires native crypto libraries |
| Page rendering | **No** | Requires tiny-skia (native only) |
For OCR support, use the [Rust](getting-started-rust.md) or [Python](getting-started-python.md) bindings. See the [OCR Guide](OCR_GUIDE.md) for details.
## Next Steps
- [TypeScript Definitions](../examples/wasm_node/pdf_oxide.d.ts) — Full type signatures
- [Node.js Example](../examples/wasm_node/extract_text.mjs) — Working demo script
- [API Reference](https://docs.rs/pdf_oxide) — Full Rust API documentation
- [GitHub Issues](https://github.com/yfedoseev/pdf_oxide/issues) — Report bugs or request features