pdfpurr 0.3.0

A comprehensive pure-Rust PDF library for reading, writing, rendering, and validating PDF documents
Documentation

PDFPurr

A pure-Rust PDF library

CI crates.io docs.rs License: MIT OR Apache-2.0

PDFPurr reads, writes, edits, renders, OCRs, and validates PDF documents in Rust. It supports PDF 1.0 through 2.0 with accessibility (PDF/UA), archival (PDF/A), and print production (PDF/X) standards.

1000+ tests across unit, integration, adversarial, property-based, and fuzz testing. CI runs on Ubuntu, macOS, and Windows with nightly clippy and libFuzzer smoke tests.

Quick Start

Add to your Cargo.toml:

[dependencies]
pdfpurr = "0.1"
use pdfpurr::Document;

// Create a new PDF
let mut doc = Document::new();
doc.add_page(612.0, 792.0).unwrap(); // US Letter
let bytes = doc.to_bytes().unwrap();

// Parse it back
let doc = Document::from_bytes(&bytes).unwrap();
assert_eq!(doc.page_count().unwrap(), 1);

Feature Guide

Reading and Parsing

PDFPurr parses any PDF from 1.0 to 2.0, including encrypted and malformed files.

use pdfpurr::Document;

// Open from disk
let doc = Document::open("report.pdf").unwrap();

// Parse from bytes
let data = std::fs::read("report.pdf").unwrap();
let doc = Document::from_bytes(&data).unwrap();

// Open an encrypted PDF
let doc = Document::from_bytes_with_password(&data, b"secret").unwrap();

// Lazy loading (parse objects on demand — fast for large files)
let doc = Document::from_bytes_lazy(&data).unwrap();

// Memory-mapped file (OS pages in data on demand)
let doc = Document::open_mmap("large_report.pdf").unwrap();

// Basic document info
println!("Pages: {}", doc.page_count().unwrap());
println!("Objects: {}", doc.object_count());
if let Some(title) = doc.title() {
    println!("Title: {}", title);
}

Supported encryption: Standard security handler revisions R2 through R6 (RC4 40/128-bit, AES-128, AES-256).

Repair: If the cross-reference table is corrupt, PDFPurr automatically rebuilds it by scanning for object definitions. Streams with incorrect /Length values are recovered by scanning for endstream.

Text Extraction

Extract text from individual pages or the entire document.

use pdfpurr::Document;

let doc = Document::open("paper.pdf").unwrap();

// Single page
let text = doc.extract_page_text(0).unwrap();
println!("Page 1: {}", text);

// All pages
let all_text = doc.extract_all_text().unwrap();
println!("{}", all_text);

Text extraction uses font encoding tables (WinAnsiEncoding, MacRomanEncoding, PDFDocEncoding) and ToUnicode CMaps for accurate character mapping, including CJK scripts.

OCR for Image-Only PDFs

Make scanned documents searchable and accessible. Three engines available — Windows OCR and Tesseract work out of the box with no feature flags.

Windows OCR (recommended on Windows, ~95% accuracy, zero dependencies):

use pdfpurr::Document;
use pdfpurr::ocr::{OcrConfig, OcrEngine};
use pdfpurr::ocr::windows_engine::WindowsOcrEngine;

let engine = WindowsOcrEngine::new();
let mut doc = Document::open("scanned.pdf").unwrap();
doc.ocr_all_pages(&engine, &OcrConfig::default()).unwrap();
doc.save("searchable.pdf").unwrap();

Tesseract (~85-89% accuracy, requires tesseract CLI):

use pdfpurr::ocr::tesseract_engine::TesseractEngine;

let engine = TesseractEngine::new("eng");
if engine.is_available() {
    doc.ocr_all_pages(&engine, &OcrConfig::default()).unwrap();
}

ocrs (pure Rust, Latin only, requires ocr feature):

use pdfpurr::ocr::ocrs_engine::OcrsEngine;  // requires "ocr" feature

let engine = OcrsEngine::new("text-detection.rten", "text-recognition.rten").unwrap();
doc.ocr_all_pages(&engine, &OcrConfig::default()).unwrap();

All engines overlay invisible text (rendering mode 3) with tagged PDF structure (<Document>, <H1><H6>, <P>) for screen reader accessibility.

Image Extraction

Extract images from any page.

use pdfpurr::Document;

let doc = Document::open("brochure.pdf").unwrap();

// All images across all pages
let images = doc.extract_all_images().unwrap();
for (page_idx, name, image) in &images {
    println!(
        "Page {}: {} ({}x{}, {:?})",
        page_idx, name, image.width, image.height, image.color_space
    );
}

// Images from a specific page
let page = doc.get_page(0).unwrap();
let page_images = doc.page_images(page);

Supported filters: FlateDecode, DCTDecode (JPEG), JPXDecode (JPEG2000), CCITTFaxDecode, LZWDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode.

Metadata

Read document metadata from both the Info dictionary and XMP streams.

use pdfpurr::Document;

let doc = Document::open("report.pdf").unwrap();
let meta = doc.metadata();

if let Some(title) = &meta.title { println!("Title: {}", title); }
if let Some(author) = &meta.author { println!("Author: {}", author); }
if let Some(subject) = &meta.subject { println!("Subject: {}", subject); }
if let Some(creator) = &meta.creator { println!("Creator: {}", creator); }
if let Some(producer) = &meta.producer { println!("Producer: {}", producer); }
if let Some(date) = &meta.creation_date { println!("Created: {}", date); }

XMP metadata is parsed with namespace-aware XML processing, supporting rdf:Alt/rdf:Seq containers and rdf:Description attribute forms.

Outlines (Bookmarks)

Read the document outline tree, including actions and styling.

use pdfpurr::Document;

let doc = Document::open("book.pdf").unwrap();
let outlines = doc.outlines();

for outline in &outlines {
    println!("{}", outline.title);
    if let Some(uri) = &outline.uri {
        println!("  Link: {}", uri);
    }
    if outline.is_bold() { println!("  [bold]"); }
    for child in &outline.children {
        println!("  - {}", child.title);
    }
}

Annotations

Extract annotations from pages with full metadata.

use pdfpurr::Document;

let doc = Document::open("annotated.pdf").unwrap();
let page = doc.get_page(0).unwrap();
let annotations = doc.page_annotations(page);

for annot in &annotations {
    println!("{} at {:?}", annot.subtype, annot.rect);
    if let Some(uri) = &annot.uri { println!("  URI: {}", uri); }
    if let Some(author) = &annot.author { println!("  Author: {}", author); }
    if !annot.quad_points.is_empty() {
        println!("  QuadPoints: {} values", annot.quad_points.len());
    }
}

Page Rendering

Render PDF pages to pixel images using the tiny-skia backend.

use pdfpurr::{Document, Renderer, RenderOptions};

let doc = Document::open("slides.pdf").unwrap();
let renderer = Renderer::new(&doc, RenderOptions {
    dpi: 150.0,
    background: [255, 255, 255, 255],
});

let pixmap = renderer.render_page(0).unwrap();
pixmap.save_png("page1.png").unwrap();

The renderer supports the full ISO 32000-2 content stream operator set including annotation appearance streams.

Creating PDFs

Build new PDF documents from scratch.

use pdfpurr::Document;

let mut doc = Document::new();
doc.add_page(612.0, 792.0).unwrap(); // US Letter
doc.add_page(595.0, 842.0).unwrap(); // A4
doc.save("output.pdf").unwrap();

Page Manipulation

Merge, split, reorder, rotate, and remove pages.

use pdfpurr::Document;

let mut doc = Document::open("report.pdf").unwrap();
doc.rotate_page(0, 90).unwrap();
doc.remove_page(2).unwrap();
doc.reorder_pages(&[2, 0, 1]).unwrap();

let other = Document::open("appendix.pdf").unwrap();
doc.merge(&other).unwrap();
doc.save("combined.pdf").unwrap();

Font Embedding

Embed TrueType and OpenType fonts with automatic subsetting.

use pdfpurr::EmbeddedFont;

let font_data = std::fs::read("fonts/Roboto-Regular.ttf").unwrap();
let font = EmbeddedFont::from_ttf(&font_data).unwrap();
let subset = font.subset(&['H', 'e', 'l', 'o']).unwrap();

// Variable font support
let bold = EmbeddedFont::from_ttf_with_axes(&font_data, &[("wght", 700.0)]).unwrap();

// CJK text via CidFont
use pdfpurr::CidFont;
let cjk = CidFont::from_ttf(&std::fs::read("NotoSansCJK.ttf").unwrap()).unwrap();

Form Fields

Read and fill AcroForm fields.

use pdfpurr::Document;

let mut doc = Document::open("form.pdf").unwrap();
for field in doc.form_fields() {
    println!("{}: {:?} = {:?}", field.name, field.field_type, field.value);
}
doc.set_form_field("username", "alice").unwrap();
doc.save("filled_form.pdf").unwrap();

Digital Signatures

Parse and verify digital signature integrity.

use pdfpurr::Document;

let doc = Document::open("signed.pdf").unwrap();
for sig in doc.signatures() {
    println!("Filter: {}", sig.filter);
    if let Some(name) = &sig.name { println!("Signer: {}", name); }
}

Standards Validation

Validate documents against PDF/A, PDF/X, and PDF/UA standards.

use pdfpurr::{Document, PdfALevel};

let doc = Document::open("archive.pdf").unwrap();
let report = doc.validate_pdfa(PdfALevel::A2b);
println!("PDF/A-2b compliant: {}", report.is_compliant());

let a11y = doc.accessibility_report();
println!("Accessible: {}", a11y.is_compliant());

Linearized PDF Writing

Write PDFs optimized for progressive web display (Fast Web View).

use pdfpurr::Document;

let mut doc = Document::new();
doc.add_page(612.0, 792.0).unwrap();
let linearized = doc.to_linearized_bytes().unwrap();

Incremental Updates

Append changes without rewriting the original file, preserving digital signatures.

use pdfpurr::Document;

let original = std::fs::read("signed_contract.pdf").unwrap();
let mut doc = Document::from_bytes(&original).unwrap();
doc.set_form_field("status", "approved").unwrap();
let updated = doc.to_incremental_update(&original).unwrap();
std::fs::write("approved.pdf", &updated).unwrap();

Feature Flags

Feature Default Description
jpeg2000 Yes JPEG2000 decoding via openjpeg (C dependency)
ocr No Adds ocrs engine (pure Rust, Latin text)

Windows OCR and Tesseract engines are always available — no feature flag needed.

# Default (no ocrs engine)
pdfpurr = "0.2"

# With ocrs pure-Rust engine
pdfpurr = { version = "0.2", features = ["ocr"] }

Architecture

pdfpurr/
  src/
    core/             PDF object model, filters, compression
    parser/           Lexer, object parser, xref, repair
    content/          Content stream tokenizer and builder
    fonts/            Font parsing, encoding, embedding, subsetting, variable fonts
    images/           Image extraction and embedding
    rendering/        Page-to-pixel engine (tiny-skia), annotation rendering
    encryption/       Standard security handler (R2-R6, RC4, AES-128/256)
    forms/            AcroForms (read/write)
    signatures/       Digital signature parsing and verification
    standards/        PDF/A, PDF/X validation
    accessibility/    PDF/UA, tagged PDF, structure tree, structure builder
    structure/        Outlines, annotations, metadata
    ocr/              OCR engines, text layer, layout analysis, preprocessing
    document.rs       High-level API (open, parse, lazy, mmap, write, OCR)
    page_builder.rs   Page creation API
    error.rs          Error types
  tests/
    adversarial_tests.rs  Edge-case and malformed PDF tests (22)
    corpus_tests.rs       Real-world PDF corpus (33 files)
    external_corpus_tests.rs  veraPDF, qpdf fuzz, BFO tests
    integration.rs        Write-path roundtrip tests (25)
    proptest_fuzz.rs      Property-based fuzzing (18 targets)
  fuzz/               cargo-fuzz targets (document, content stream, object)
  benches/            Criterion benchmarks (parser, tokenizer, renderer)

Testing

cargo test                       # all tests
cargo bench                      # Criterion benchmarks
cargo clippy -- -D warnings      # Lint check (clean on stable and nightly)
cargo fmt --check                # Format check
cargo doc --no-deps              # Build documentation (zero warnings)

CI runs automatically on every push: Ubuntu/Windows/macOS (stable), Ubuntu nightly, Criterion benchmarks, and 120 seconds of fuzz testing across 3 targets. On-demand corpus testing against 1800+ external PDFs available via GitHub Actions.

Performance

Operation Time
Parse 14-page PDF (1 MB) 26 ms
Text extraction per page 4.7 ms
Render page at 150 DPI 210 ms
Create new Document < 1 us

Security

  • Zero panic! or unreachable!() in production code
  • Checked arithmetic on all image dimensions and buffer allocations
  • Resource limits: 256MB decoded stream max, 1M xref rebuild cap
  • Depth limits on recursive structures (page tree, outlines, inherited properties)
  • Fuzz-tested: 3 cargo-fuzz + 18 proptest + 22 adversarial tests

License

Dual-licensed under MIT or Apache-2.0 at your option.