PDFPurr
A pure-Rust PDF library
PDFPurr reads, writes, edits, renders, OCRs, and validates PDF documents in Rust. It handles PDF 1.0 through 2.0 with PDF/UA (accessibility), PDF/A (archival), and PDF/X (print production) standards.
1150+ tests. CI on Ubuntu, macOS, and Windows with nightly clippy and libFuzzer smoke tests.
Quick Start
[dependencies]
pdfpurr = "0.4"
use Document;
let mut doc = new;
doc.add_page.unwrap; // US Letter
let bytes = doc.to_bytes.unwrap;
let doc = from_bytes.unwrap;
assert_eq!;
Feature Guide
Reading and Parsing
Parses any PDF from 1.0 to 2.0, including encrypted and malformed files.
use Document;
let doc = open.unwrap;
let doc = from_bytes.unwrap;
let doc = from_bytes_with_password.unwrap;
let doc = from_bytes_lazy.unwrap; // parse on demand
let doc = open_mmap.unwrap; // memory-mapped
Encryption: R2 through R6 (RC4 40/128-bit, AES-128, AES-256). Constant-time password comparison.
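To illustrate what "constant-time comparison" means, here is a minimal sketch. The function name `constant_time_eq` is hypothetical and this is not necessarily how PDFPurr implements it. The idea is to inspect every byte regardless of where the first mismatch occurs, so timing does not reveal how much of a password hash matched.

```rust
/// Compare two byte slices without exiting early on the first mismatch.
/// Hypothetical helper for illustration, not PDFPurr's actual function.
fn constant_time_eq(a: &[u8], b: &[u8]) -> bool {
    // The length of a fixed-size password hash is not secret, so returning
    // early on a length mismatch leaks nothing useful.
    if a.len() != b.len() {
        return false;
    }
    // OR together the XOR of every byte pair; any difference sets a bit.
    let mut diff: u8 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}
```

An optimizing compiler is free to transform a loop like this, so production code often relies on a vetted crate such as `subtle` rather than a hand-written loop.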
Repair: Rebuilds corrupt xref tables by scanning for object definitions. Recovers streams with wrong /Length via endstream scanning. Falls back to raw deflate when zlib headers are corrupt.
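The xref rebuild described above can be sketched roughly as follows. This is an illustrative approximation, not PDFPurr's code: all function names are hypothetical, and a real implementation would also need to avoid false matches inside binary stream data and respect the xref rebuild cap.

```rust
use std::collections::BTreeMap;

/// Scan raw bytes for `N G obj` headers and return
/// object number -> (generation, byte offset of the header).
fn scan_object_offsets(data: &[u8]) -> BTreeMap<u32, (u16, usize)> {
    let mut found = BTreeMap::new();
    let mut i = 0;
    while i + 3 <= data.len() {
        if &data[i..i + 3] == b"obj" {
            if let Some((num, generation, start)) = parse_header_before(data, i) {
                // A later definition replaces an earlier one, as with
                // incremental updates appended to the file.
                found.insert(num, (generation, start));
            }
            i += 3;
        } else {
            i += 1;
        }
    }
    found
}

/// Walk backwards from the `obj` keyword at `kw`, expecting
/// `<digits> <whitespace> <digits> <whitespace>` immediately before it.
/// `endobj` is rejected because `obj` is not preceded by whitespace there.
fn parse_header_before(data: &[u8], kw: usize) -> Option<(u32, u16, usize)> {
    let (gen_start, gen_end) = digits_before(data, skip_ws_back(data, kw)?)?;
    let (num_start, num_end) = digits_before(data, skip_ws_back(data, gen_start)?)?;
    // The object number must begin the file or follow whitespace.
    if num_start > 0 && !data[num_start - 1].is_ascii_whitespace() {
        return None;
    }
    let num = std::str::from_utf8(&data[num_start..num_end]).ok()?.parse().ok()?;
    let generation = std::str::from_utf8(&data[gen_start..gen_end]).ok()?.parse().ok()?;
    Some((num, generation, num_start))
}

/// Skip whitespace backwards from `end`; None if there was none to skip.
fn skip_ws_back(data: &[u8], end: usize) -> Option<usize> {
    let mut j = end;
    while j > 0 && data[j - 1].is_ascii_whitespace() {
        j -= 1;
    }
    if j == end { None } else { Some(j) }
}

/// Return the [start, end) range of the run of ASCII digits ending at `end`.
fn digits_before(data: &[u8], end: usize) -> Option<(usize, usize)> {
    let mut j = end;
    while j > 0 && data[j - 1].is_ascii_digit() {
        j -= 1;
    }
    if j == end { None } else { Some((j, end)) }
}
```

The resulting map plays the role of a reconstructed cross-reference table: each object number points at the byte offset where its definition starts.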
Text Extraction
let text = doc.extract_page_text.unwrap;
let all_text = doc.extract_all_text.unwrap;
Uses font encoding tables (WinAnsi, MacRoman, PDFDoc) and ToUnicode CMaps. CJK scripts supported.
Structure Detection
Detects headings, paragraphs, lists, tables, and code blocks from font metrics and text positions — no OCR needed for native text.
use BlockRole;
let blocks = doc.analyze_page_structure.unwrap;
for block in &blocks
Also detects tables (column alignment), headers/footers (repeated across pages), form field labels (proximity), and figure captions.
Auto-Tagging
Adds a PDF/UA structure tree to untagged PDFs based on detected content structure.
let blocks_tagged = doc.auto_tag.unwrap;
println!;
Accessibility Checking
Reports issues by comparing detected structure against existing tags.
let issues = doc.check_accessibility;
for issue in &issues
Detects: untagged documents, missing language, heading count mismatches, missing alt text on figures, heading level skips (H1 → H3 without H2).
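As an illustration of the heading level skip check, here is a small sketch. The function name and the tuple layout are hypothetical and do not reflect PDFPurr's issue type; it only flags a heading that is more than one level deeper than the heading before it.

```rust
/// Return (previous_level, level, index) for every heading that jumps more
/// than one level deeper than the heading before it, e.g. H1 followed by H3.
/// Moving back up to a shallower level (H3 followed by H1) is not a skip.
fn heading_level_skips(levels: &[u8]) -> Vec<(u8, u8, usize)> {
    let mut skips = Vec::new();
    let mut prev: Option<u8> = None;
    for (index, &level) in levels.iter().enumerate() {
        if let Some(p) = prev {
            if level > p.saturating_add(1) {
                skips.push((p, level, index));
            }
        }
        prev = Some(level);
    }
    skips
}
```

For the heading sequence H1, H3, H2, H3, H5, H1 this reports two skips: H1 to H3 at index 1 and H3 to H5 at index 4.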
OCR for Image-Only PDFs
Three engines. Windows OCR and Tesseract need no feature flags.
// Windows OCR (~95% accuracy, zero dependencies)
use WindowsOcrEngine;
let engine = new;
doc.ocr_all_pages.unwrap;
// Tesseract (~85-89%, requires tesseract CLI)
use TesseractEngine;
let engine = new;
// ocrs (pure Rust, Latin only, requires "ocr" feature)
use OcrsEngine;
let engine = new.unwrap;
Invisible text overlay (rendering mode 3) with tagged structure for screen readers. Full Unicode via ToUnicode CMap — CJK, emoji, and symbols preserved.
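For readers unfamiliar with text rendering mode 3, the sketch below builds the content stream operators for one invisible line of text. It is a simplified illustration, not PDFPurr's text layer writer: the function names and the `/F1` font resource name are assumptions, and it handles only ASCII literal strings, whereas the Unicode path described above goes through a ToUnicode CMap.

```rust
/// Escape the characters that are special inside a PDF literal string.
fn escape_pdf_literal(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        match ch {
            '\\' | '(' | ')' => {
                out.push('\\');
                out.push(ch);
            }
            _ => out.push(ch),
        }
    }
    out
}

/// Build content stream operators that place `text` at (x, y) invisibly.
/// `3 Tr` selects rendering mode 3: the glyphs are neither filled nor
/// stroked, but the text remains selectable and searchable.
/// Assumes ASCII input and a font registered as /F1 (hypothetical name).
fn invisible_text_ops(text: &str, x: f64, y: f64, font_size: f64) -> String {
    format!(
        "BT\n/F1 {size:.2} Tf\n3 Tr\n1 0 0 1 {x:.2} {y:.2} Tm\n({body}) Tj\nET\n",
        size = font_size,
        x = x,
        y = y,
        body = escape_pdf_literal(text),
    )
}
```

Placing each recognized line at the coordinates reported by the OCR engine keeps the invisible text aligned with the scanned image beneath it.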
Hybrid OCR + Text Comparison
Compares content stream text against OCR output. When they disagree, presents both to screen readers.
let result = doc.hybrid_ocr_page.unwrap;
match result.source
Text Run Analysis
Extracts positioned text runs with font name, size, position, color, and style flags.
let runs = doc.extract_text_runs.unwrap;
for run in &runs
Image Extraction
let images = doc.extract_all_images.unwrap;
for in &images
Extracts both XObject images and inline images (BI/ID/EI). Filters: FlateDecode, DCT, JPX, CCITT, LZW, ASCII85, ASCIIHex, RunLength.
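Of the filters listed, RunLength is the simplest to show in full. The sketch below follows the RunLengthDecode rules from the PDF specification; the function name and error strings are hypothetical and this is not PDFPurr's decoder.

```rust
/// Decode RunLengthDecode data as defined in the PDF specification:
/// - length byte 0..=127: copy the next (length + 1) bytes literally
/// - length byte 129..=255: repeat the next byte (257 - length) times
/// - length byte 128: end of data
fn run_length_decode(input: &[u8]) -> Result<Vec<u8>, &'static str> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let len = input[i];
        i += 1;
        match len {
            128 => return Ok(out),
            0..=127 => {
                let n = len as usize + 1;
                let end = i
                    .checked_add(n)
                    .filter(|&e| e <= input.len())
                    .ok_or("truncated literal run")?;
                out.extend_from_slice(&input[i..end]);
                i = end;
            }
            129..=255 => {
                let byte = *input.get(i).ok_or("truncated repeat run")?;
                let n = 257 - len as usize;
                out.extend(std::iter::repeat(byte).take(n));
                i += 1;
            }
        }
    }
    // Tolerate a missing end-of-data marker, as lenient readers often do.
    Ok(out)
}
```

A production decoder would additionally cap the output size, since a short input can expand to many repeated bytes.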
Metadata
let meta = doc.metadata;
if let Some = &meta.title
Reads Info dictionary and XMP streams.
Outlines, Annotations, Form Fields, Signatures
let outlines = doc.outlines;
let annots = doc.page_annotations;
let fields = doc.form_fields;
let sigs = doc.signatures;
Page Rendering
use ;
let renderer = new;
let pixmap = renderer.render_page.unwrap;
Creating and Manipulating PDFs
let mut doc = new;
doc.add_page.unwrap;
doc.rotate_page.unwrap;
doc.merge.unwrap;
doc.save.unwrap;
Font Embedding
TTF, OTF/CFF, and variable fonts with automatic subsetting.
let font = from_ttf.unwrap;
let subset = font.subset.unwrap;
let otf = from_otf.unwrap; // CFF outlines
let bold = from_ttf_with_axes.unwrap;
Standards Validation
let report = doc.validate_pdfa;
let a11y = doc.accessibility_report;
Linearized Writing and Incremental Updates
let linearized = doc.to_linearized_bytes.unwrap;
let incremental = doc.to_incremental_update.unwrap;
Feature Flags
| Feature | Default | Description |
|---|---|---|
| `jpeg2000` | Yes | JPEG2000 decoding (C dependency via openjpeg) |
| `ocr` | No | Adds ocrs engine (pure Rust, Latin text) |
| `ocr-windows-native` | No | Native Windows OCR via WinRT (Windows 10+) |
Windows OCR (subprocess) and Tesseract are always available.
pdfpurr = "0.4"
pdfpurr = { version = "0.4", features = ["ocr"] }
Architecture
src/
core/ Object model, filters, compression
parser/ Lexer, object parser, xref, repair
content/ Tokenizer, builder, text analysis, structure detection
fonts/ Parsing, encoding, embedding, subsetting
images/ Extraction and embedding (XObject + inline)
rendering/ Page-to-pixel (tiny-skia), annotation rendering
encryption/ R2-R6, RC4, AES-128/256, constant-time comparison
forms/ AcroForms (read/write)
signatures/ Digital signature parsing
standards/ PDF/A, PDF/X validation
accessibility/ PDF/UA, structure tree, auto-tagging, quality checks
structure/ Outlines, annotations, metadata
ocr/ OCR engines, text layer, hybrid comparison
document.rs High-level API
Testing
CI: Ubuntu/Windows/macOS (stable) + nightly + fuzz + benchmarks. On-demand: 1800+ external PDFs from veraPDF, qpdf, BFO.
Performance
| Operation | Time |
|---|---|
| Parse 14-page PDF (1 MB) | 26 ms |
| Text extraction per page | 4.7 ms |
| Render page at 150 DPI | 210 ms |
Security
- Zero `panic!` or `unreachable!()` in production code
- Constant-time password hash comparison
- Checked arithmetic on image dimensions and buffer allocations
- Resource limits: 256 MB decoded stream max, 1M xref rebuild cap
- Depth limits: page tree (64), outlines (32), inherited properties (32)
- Cycle detection on outline /Next chains
- Raw deflate fallback for corrupt zlib headers
- Fuzz-tested: 3 cargo-fuzz + 18 proptest + 22 adversarial tests
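Two of the guards listed above, checked arithmetic on image dimensions and cycle detection on `/Next` chains, can be sketched as follows. The function names, the constant name, and the slice-based outline representation are all hypothetical. Only the 256 MB figure comes from the list above, and this is not PDFPurr's actual code.

```rust
use std::collections::HashSet;

/// Assumed cap, mirroring the 256 MB decoded-stream limit listed above.
const MAX_DECODED_BYTES: usize = 256 * 1024 * 1024;

/// Compute the decoded size of an image with overflow checks at every step.
/// Returns None on arithmetic overflow or when the size exceeds the cap.
fn image_buffer_len(
    width: u32,
    height: u32,
    components: u32,
    bits_per_component: u32,
) -> Option<usize> {
    let bits_per_row = (width as u64)
        .checked_mul(components as u64)?
        .checked_mul(bits_per_component as u64)?;
    // Image rows are padded to a whole number of bytes.
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    let total = bytes_per_row.checked_mul(height as u64)?;
    let total = usize::try_from(total).ok()?;
    (total <= MAX_DECODED_BYTES).then_some(total)
}

/// Follow a chain of sibling links, where `next[i]` is the `/Next` target of
/// node `i`. Returns the visit order, or an error if a node repeats.
/// An index outside `next` is treated as the end of the chain.
fn walk_next_chain(next: &[Option<usize>], start: usize) -> Result<Vec<usize>, &'static str> {
    let mut seen = HashSet::new();
    let mut order = Vec::new();
    let mut current = Some(start);
    while let Some(id) = current {
        if !seen.insert(id) {
            return Err("cycle in /Next chain");
        }
        order.push(id);
        current = next.get(id).copied().flatten();
    }
    Ok(order)
}
```

Both functions reject hostile input by returning `None` or `Err` rather than panicking, which is consistent with the no-`panic!` rule at the top of the list.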
License
Dual-licensed under MIT or Apache-2.0 at your option.