PDFPurr
A pure-Rust PDF library
PDFPurr reads, writes, edits, renders, OCRs, and validates PDF documents in Rust. It supports PDF 1.0 through 2.0, along with the accessibility (PDF/UA), archival (PDF/A), and print production (PDF/X) standards.
The test suite has 1000+ tests spanning unit, integration, adversarial, property-based, and fuzz testing. CI runs on Ubuntu, macOS, and Windows, with nightly clippy and libFuzzer smoke tests.
Quick Start
Add to your Cargo.toml:
[dependencies]
pdfpurr = "0.1"
use Document;
// Create a new PDF
let mut doc = new;
doc.add_page.unwrap; // US Letter
let bytes = doc.to_bytes.unwrap;
// Parse it back
let doc = from_bytes.unwrap;
assert_eq!;
Feature Guide
Reading and Parsing
PDFPurr parses PDF versions 1.0 through 2.0, including encrypted files and malformed files that need repair.
use Document;
// Open from disk
let doc = open.unwrap;
// Parse from bytes
let data = read.unwrap;
let doc = from_bytes.unwrap;
// Open an encrypted PDF
let doc = from_bytes_with_password.unwrap;
// Lazy loading (parse objects on demand — fast for large files)
let doc = from_bytes_lazy.unwrap;
// Memory-mapped file (OS pages in data on demand)
let doc = open_mmap.unwrap;
// Basic document info
println!;
println!;
if let Some = doc.title
Supported encryption: Standard security handler revisions R2 through R6 (RC4 40/128-bit, AES-128, AES-256).
Repair: If the cross-reference table is corrupt, PDFPurr automatically rebuilds it by scanning for object definitions. Streams with incorrect /Length values are recovered by scanning for endstream.
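The stream recovery described above can be illustrated with a minimal sketch. It is not PDFPurr's implementation; the function names `find_from` and `recover_stream_len` are hypothetical, and the sketch assumes the caller already knows the offset of the first data byte after the `stream` keyword. It ignores the declared /Length and measures up to the next `endstream` keyword instead.

```rust
/// Find the first occurrence of `needle` in `haystack` at or after `from`.
/// (Hypothetical helper, not part of PDFPurr's API.)
fn find_from(haystack: &[u8], needle: &[u8], from: usize) -> Option<usize> {
    if needle.is_empty() || from > haystack.len() {
        return None;
    }
    haystack[from..]
        .windows(needle.len())
        .position(|w| w == needle)
        .map(|p| p + from)
}

/// Recover the length of a stream's data by scanning for `endstream`,
/// ignoring whatever the /Length entry claims.
/// `start` is the offset of the first data byte.
fn recover_stream_len(data: &[u8], start: usize) -> Option<usize> {
    let end = find_from(data, b"endstream", start)?;
    let body = &data[start..end];
    // The end-of-line marker right before `endstream` is not stream data.
    let trailer = if body.ends_with(b"\r\n") {
        2
    } else if body.ends_with(b"\n") || body.ends_with(b"\r") {
        1
    } else {
        0
    };
    Some(body.len() - trailer)
}
```

A naive scan like this can stop early if binary stream data happens to contain the bytes `endstream`, so a real parser would likely cross-check the result, for example by confirming that `endobj` follows.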
Text Extraction
Extract text from individual pages or the entire document.
use Document;
let doc = open.unwrap;
// Single page
let text = doc.extract_page_text.unwrap;
println!;
// All pages
let all_text = doc.extract_all_text.unwrap;
println!;
Text extraction uses font encoding tables (WinAnsiEncoding, MacRomanEncoding, PDFDocEncoding) and ToUnicode CMaps for accurate character mapping, including CJK scripts.
OCR for Image-Only PDFs
Make scanned documents searchable and accessible. Three engines are available. Windows OCR and Tesseract need no feature flag (Tesseract still requires the tesseract CLI to be installed); ocrs requires the ocr feature.
Windows OCR (recommended on Windows, ~95% accuracy, zero dependencies):
use Document;
use ;
use WindowsOcrEngine;
let engine = new;
let mut doc = open.unwrap;
doc.ocr_all_pages.unwrap;
doc.save.unwrap;
Tesseract (~85-89% accuracy, requires tesseract CLI):
use TesseractEngine;
let engine = new;
if engine.is_available
ocrs (pure Rust, Latin only, requires ocr feature):
use OcrsEngine; // requires "ocr" feature
let engine = new.unwrap;
doc.ocr_all_pages.unwrap;
All engines overlay invisible text (rendering mode 3) with tagged PDF structure (<Document>, <H1>–<H6>, <P>) for screen reader accessibility.
Image Extraction
Extract images from any page.
use Document;
let doc = open.unwrap;
// All images across all pages
let images = doc.extract_all_images.unwrap;
for in &images
// Images from a specific page
let page = doc.get_page.unwrap;
let page_images = doc.page_images;
Supported filters: FlateDecode, DCTDecode (JPEG), JPXDecode (JPEG2000), CCITTFaxDecode, LZWDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode.
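To give a feel for what one of these filters does, here is a minimal sketch of RunLengthDecode, following the encoding described in the PDF specification: a length byte of 0 to 127 copies the next length + 1 bytes literally, 129 to 255 repeats the next byte 257 - length times, and 128 marks end of data. The function name is hypothetical and this is not PDFPurr's decoder.

```rust
/// Decode RunLengthDecode data (illustrative sketch, hypothetical name).
fn run_length_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < input.len() {
        let n = input[i] as usize;
        i += 1;
        match n {
            // Literal run: copy the next n + 1 bytes unchanged.
            0..=127 => {
                let end = i + n + 1;
                let chunk = input.get(i..end).ok_or("truncated literal run")?;
                out.extend_from_slice(chunk);
                i = end;
            }
            // End-of-data marker.
            128 => break,
            // Repeat run: emit the next byte 257 - n times.
            _ => {
                let byte = *input.get(i).ok_or("truncated repeat run")?;
                out.extend(std::iter::repeat(byte).take(257 - n));
                i += 1;
            }
        }
    }
    Ok(out)
}
```

Truncated input is reported as an error here rather than causing a panic, in keeping with the library's stated handling of malformed files.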
Metadata
Read document metadata from both the Info dictionary and XMP streams.
use Document;
let doc = open.unwrap;
let meta = doc.metadata;
if let Some = &meta.title
if let Some = &meta.author
if let Some = &meta.subject
if let Some = &meta.creator
if let Some = &meta.producer
if let Some = &meta.creation_date
XMP metadata is parsed with namespace-aware XML processing, supporting rdf:Alt/rdf:Seq containers and rdf:Description attribute forms.
Outlines (Bookmarks)
Read the document outline tree, including actions and styling.
use Document;
let doc = open.unwrap;
let outlines = doc.outlines;
for outline in &outlines
Annotations
Extract annotations from pages with full metadata.
use Document;
let doc = open.unwrap;
let page = doc.get_page.unwrap;
let annotations = doc.page_annotations;
for annot in &annotations
Page Rendering
Render PDF pages to pixel images using the tiny-skia backend.
use ;
let doc = open.unwrap;
let renderer = new;
let pixmap = renderer.render_page.unwrap;
pixmap.save_png.unwrap;
The renderer supports the full ISO 32000-2 content stream operator set including annotation appearance streams.
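When rendering to pixels, the output size follows from the page size and the chosen resolution. PDF's default user-space unit is 1/72 inch, so a page dimension in points scales by dpi / 72. The sketch below shows that arithmetic only; `pixel_size` is a hypothetical helper, not part of PDFPurr's renderer API, and rounding up is an assumption.

```rust
/// Convert a page size in PDF points (1/72 inch) to a pixel size at `dpi`.
/// Rounds up so the whole page fits. Hypothetical helper for illustration.
fn pixel_size(width_pt: f64, height_pt: f64, dpi: f64) -> (u32, u32) {
    let w = (width_pt * dpi / 72.0).ceil();
    let h = (height_pt * dpi / 72.0).ceil();
    (w as u32, h as u32)
}
```

For example, a US Letter page (612 x 792 points) at 150 DPI comes out at 1275 x 1650 pixels.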
Creating PDFs
Build new PDF documents from scratch.
use Document;
let mut doc = new;
doc.add_page.unwrap; // US Letter
doc.add_page.unwrap; // A4
doc.save.unwrap;
Page Manipulation
Merge, split, reorder, rotate, and remove pages.
use Document;
let mut doc = open.unwrap;
doc.rotate_page.unwrap;
doc.remove_page.unwrap;
doc.reorder_pages.unwrap;
let other = open.unwrap;
doc.merge.unwrap;
doc.save.unwrap;
Font Embedding
Embed TrueType and OpenType fonts with automatic subsetting.
use EmbeddedFont;
let font_data = read.unwrap;
let font = from_ttf.unwrap;
let subset = font.subset.unwrap;
// Variable font support
let bold = from_ttf_with_axes.unwrap;
// CJK text via CidFont
use CidFont;
let cjk = from_ttf.unwrap;
Form Fields
Read and fill AcroForm fields.
use Document;
let mut doc = open.unwrap;
for field in doc.form_fields
doc.set_form_field.unwrap;
doc.save.unwrap;
Digital Signatures
Parse and verify digital signature integrity.
use Document;
let doc = open.unwrap;
for sig in doc.signatures
Standards Validation
Validate documents against PDF/A, PDF/X, and PDF/UA standards.
use ;
let doc = open.unwrap;
let report = doc.validate_pdfa;
println!;
let a11y = doc.accessibility_report;
println!;
Linearized PDF Writing
Write PDFs optimized for progressive web display (Fast Web View).
use Document;
let mut doc = new;
doc.add_page.unwrap;
let linearized = doc.to_linearized_bytes.unwrap;
Incremental Updates
Append changes without rewriting the original file, preserving digital signatures.
use Document;
let original = read.unwrap;
let mut doc = from_bytes.unwrap;
doc.set_form_field.unwrap;
let updated = doc.to_incremental_update.unwrap;
write.unwrap;
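The reason an incremental update can preserve signatures is that the original bytes are left untouched and new objects are only appended. A minimal sketch of checking that property is shown below; `appended_section` is a hypothetical helper, not a PDFPurr function.

```rust
/// If `updated` is `original` plus appended bytes, return the appended part.
/// Returns None if the original bytes were modified or nothing was added.
/// (Hypothetical helper for illustration.)
fn appended_section<'a>(original: &[u8], updated: &'a [u8]) -> Option<&'a [u8]> {
    if updated.len() > original.len() && updated.starts_with(original) {
        Some(&updated[original.len()..])
    } else {
        None
    }
}
```

A check like this only confirms the append-only property; it says nothing about whether the appended section is itself a valid update.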
Feature Flags
| Feature | Default | Description |
|---|---|---|
| jpeg2000 | Yes | JPEG2000 decoding via openjpeg (C dependency) |
| ocr | No | Adds ocrs engine (pure Rust, Latin text) |
Windows OCR and Tesseract engines are always available — no feature flag needed.
# Default (no ocrs engine)
pdfpurr = "0.2"
# With ocrs pure-Rust engine
pdfpurr = { version = "0.2", features = ["ocr"] }
Architecture
pdfpurr/
  src/
    core/            PDF object model, filters, compression
    parser/          Lexer, object parser, xref, repair
    content/         Content stream tokenizer and builder
    fonts/           Font parsing, encoding, embedding, subsetting, variable fonts
    images/          Image extraction and embedding
    rendering/       Page-to-pixel engine (tiny-skia), annotation rendering
    encryption/      Standard security handler (R2-R6, RC4, AES-128/256)
    forms/           AcroForms (read/write)
    signatures/      Digital signature parsing and verification
    standards/       PDF/A, PDF/X validation
    accessibility/   PDF/UA, tagged PDF, structure tree, structure builder
    structure/       Outlines, annotations, metadata
    ocr/             OCR engines, text layer, layout analysis, preprocessing
    document.rs      High-level API (open, parse, lazy, mmap, write, OCR)
    page_builder.rs  Page creation API
    error.rs         Error types
  tests/
    adversarial_tests.rs      Edge-case and malformed PDF tests (22)
    corpus_tests.rs           Real-world PDF corpus (33 files)
    external_corpus_tests.rs  veraPDF, qpdf fuzz, BFO tests
    integration.rs            Write-path roundtrip tests (25)
    proptest_fuzz.rs          Property-based fuzzing (18 targets)
  fuzz/              cargo-fuzz targets (document, content stream, object)
  benches/           Criterion benchmarks (parser, tokenizer, renderer)
Testing
CI runs automatically on every push: Ubuntu, Windows, and macOS on stable, Ubuntu on nightly, Criterion benchmarks, and 120 seconds of fuzz testing across 3 targets. On-demand corpus testing against 1800+ external PDFs is available via GitHub Actions.
Performance
| Operation | Time |
|---|---|
| Parse 14-page PDF (1 MB) | 26 ms |
| Text extraction per page | 4.7 ms |
| Render page at 150 DPI | 210 ms |
| Create new Document | < 1 µs |
Security
- Zero panic! or unreachable!() in production code
- Checked arithmetic on all image dimensions and buffer allocations
- Resource limits: 256 MB decoded stream max, 1M xref rebuild cap
- Depth limits on recursive structures (page tree, outlines, inherited properties)
- Fuzz-tested: 3 cargo-fuzz + 18 proptest + 22 adversarial tests
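As an illustration of the checked-arithmetic and resource-limit points above, the following sketch computes an image buffer size without risking integer overflow and rejects anything over the decoded-stream cap. It is an assumption-laden example, not PDFPurr's code: the function name is hypothetical, and the 256 MB limit is taken here to mean 256 * 1024 * 1024 bytes.

```rust
/// Decoded-size cap, assuming "256 MB" means 256 MiB (an assumption).
const MAX_DECODED_BYTES: u64 = 256 * 1024 * 1024;

/// Compute the byte length of an uncompressed image buffer, with rows
/// padded to whole bytes. Returns None on overflow or if the result
/// exceeds the cap. (Hypothetical helper for illustration.)
fn image_buffer_len(
    width: u32,
    height: u32,
    components: u32,
    bits_per_component: u32,
) -> Option<usize> {
    let bits_per_row = u64::from(width)
        .checked_mul(u64::from(components))?
        .checked_mul(u64::from(bits_per_component))?;
    // Round each row up to a whole number of bytes.
    let bytes_per_row = bits_per_row.checked_add(7)? / 8;
    let total = bytes_per_row.checked_mul(u64::from(height))?;
    if total <= MAX_DECODED_BYTES {
        Some(total as usize)
    } else {
        None
    }
}
```

Because every multiplication is checked, hostile dimensions in a malformed file yield None instead of a wrapped-around size and an undersized allocation.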
License
Dual-licensed under MIT or Apache-2.0 at your option.