pdfpurr 0.4.0

A comprehensive pure-Rust PDF library for reading, writing, rendering, and validating PDF documents
Documentation

PDFPurr

A pure-Rust PDF library

CI crates.io docs.rs License: MIT OR Apache-2.0

PDFPurr reads, writes, edits, renders, OCRs, and validates PDF documents in Rust. It handles PDF 1.0 through 2.0 with PDF/UA (accessibility), PDF/A (archival), and PDF/X (print production) standards.

1150+ tests. CI on Ubuntu, macOS, and Windows with nightly clippy and libFuzzer smoke tests.

Quick Start

[dependencies]
pdfpurr = "0.4"
use pdfpurr::Document;

let mut doc = Document::new();
doc.add_page(612.0, 792.0).unwrap(); // US Letter
let bytes = doc.to_bytes().unwrap();

let doc = Document::from_bytes(&bytes).unwrap();
assert_eq!(doc.page_count().unwrap(), 1);

Feature Guide

Reading and Parsing

Parses any PDF from 1.0 to 2.0, including encrypted and malformed files.

use pdfpurr::Document;

let doc = Document::open("report.pdf").unwrap();
let doc = Document::from_bytes(&data).unwrap();
let doc = Document::from_bytes_with_password(&data, b"secret").unwrap();
let doc = Document::from_bytes_lazy(&data).unwrap();     // parse on demand
let doc = Document::open_mmap("large.pdf").unwrap();      // memory-mapped

Encryption: R2 through R6 (RC4 40/128-bit, AES-128, AES-256). Constant-time password comparison.

Repair: Rebuilds corrupt xref tables by scanning for object definitions. Recovers streams with wrong /Length via endstream scanning. Falls back to raw deflate when zlib headers are corrupt.

Text Extraction

let text = doc.extract_page_text(0).unwrap();
let all_text = doc.extract_all_text().unwrap();

Uses font encoding tables (WinAnsi, MacRoman, PDFDoc) and ToUnicode CMaps. CJK scripts supported.

Structure Detection

Detects headings, paragraphs, lists, tables, and code blocks from font metrics and text positions — no OCR needed for native text.

use pdfpurr::content::structure_detection::BlockRole;

let blocks = doc.analyze_page_structure(0).unwrap();
for block in &blocks {
    match &block.role {
        BlockRole::Heading(level) => println!("H{level}: {}", block.runs[0].text),
        BlockRole::Paragraph => println!("P: {} chars", block.runs.len()),
        BlockRole::ListItem => println!("LI: {}", block.runs[0].text),
        BlockRole::Code => println!("Code block"),
        _ => {}
    }
}

Also detects tables (column alignment), headers/footers (repeated across pages), form field labels (proximity), and figure captions.

Auto-Tagging

Adds a PDF/UA structure tree to untagged PDFs based on detected content structure.

let blocks_tagged = doc.auto_tag("en-US").unwrap();
println!("Tagged {} blocks", blocks_tagged);

Accessibility Checking

Reports issues by comparing detected structure against existing tags.

let issues = doc.check_accessibility();
for issue in &issues {
    println!("[{}] {}: {}", issue.severity, issue.description, issue.suggestion);
}

Detects: untagged documents, missing language, heading count mismatches, missing alt text on figures, heading level skips (H1 → H3 without H2).

OCR for Image-Only PDFs

Three engines. Windows OCR and Tesseract need no feature flags.

// Windows OCR (~95% accuracy, zero dependencies)
use pdfpurr::ocr::windows_engine::WindowsOcrEngine;
let engine = WindowsOcrEngine::new();
doc.ocr_all_pages(&engine, &OcrConfig::default()).unwrap();

// Tesseract (~85-89%, requires tesseract CLI)
use pdfpurr::ocr::tesseract_engine::TesseractEngine;
let engine = TesseractEngine::new("eng");

// ocrs (pure Rust, Latin only, requires "ocr" feature)
use pdfpurr::ocr::ocrs_engine::OcrsEngine;
let engine = OcrsEngine::new("det.rten", "rec.rten").unwrap();

Invisible text overlay (rendering mode 3) with tagged structure for screen readers. Full Unicode via ToUnicode CMap — CJK, emoji, and symbols preserved.

Hybrid OCR + Text Comparison

Compares content stream text against OCR output. When they disagree, presents both to screen readers.

let result = doc.hybrid_ocr_page(0, &engine, &config).unwrap();
match result.source {
    TextSource::ContentStream => println!("Text is reliable"),
    TextSource::Ocr => println!("OCR is better"),
    TextSource::Both => println!("Disagreement: {}", result.accessible_text),
    TextSource::Neither => println!("No text found"),
}

Text Run Analysis

Extracts positioned text runs with font name, size, position, color, and style flags.

let runs = doc.extract_text_runs(0).unwrap();
for run in &runs {
    println!("{} at ({:.0}, {:.0}) {}pt {}{}",
        run.text, run.x, run.y, run.font_size,
        if run.is_bold { "bold " } else { "" },
        if run.is_italic { "italic" } else { "" },
    );
}

Image Extraction

let images = doc.extract_all_images().unwrap();
for (page, name, img) in &images {
    println!("Page {page}: {name} ({}x{})", img.width, img.height);
}

Extracts both XObject images and inline images (BI/ID/EI). Filters: FlateDecode, DCT, JPX, CCITT, LZW, ASCII85, ASCIIHex, RunLength.

Metadata

let meta = doc.metadata();
if let Some(title) = &meta.title { println!("Title: {title}"); }

Reads Info dictionary and XMP streams.

Outlines, Annotations, Form Fields, Signatures

let outlines = doc.outlines();
let annots = doc.page_annotations(page);
let fields = doc.form_fields();
let sigs = doc.signatures();

Page Rendering

use pdfpurr::{Renderer, RenderOptions};
let renderer = Renderer::new(&doc, RenderOptions { dpi: 150.0, ..Default::default() });
let pixmap = renderer.render_page(0).unwrap();

Creating and Manipulating PDFs

let mut doc = Document::new();
doc.add_page(612.0, 792.0).unwrap();
doc.rotate_page(0, 90).unwrap();
doc.merge(&other_doc).unwrap();
doc.save("output.pdf").unwrap();

Font Embedding

TTF, OTF/CFF, and variable fonts with automatic subsetting.

let font = EmbeddedFont::from_ttf(&data).unwrap();
let subset = font.subset(&['H', 'e', 'l', 'o']).unwrap();

let otf = EmbeddedFont::from_otf(&otf_data).unwrap(); // CFF outlines
let bold = EmbeddedFont::from_ttf_with_axes(&data, &[("wght", 700.0)]).unwrap();

Standards Validation

let report = doc.validate_pdfa(PdfALevel::A2b);
let a11y = doc.accessibility_report();

Linearized Writing and Incremental Updates

let linearized = doc.to_linearized_bytes().unwrap();
let incremental = doc.to_incremental_update(&original_bytes).unwrap();

Feature Flags

Feature Default Description
jpeg2000 Yes JPEG2000 decoding (C dependency via openjpeg)
ocr No Adds ocrs engine (pure Rust, Latin text)
ocr-windows-native No Native Windows OCR via WinRT (Windows 10+)

Windows OCR (subprocess) and Tesseract are always available.

pdfpurr = "0.4"
pdfpurr = { version = "0.4", features = ["ocr"] }

Architecture

src/
  core/               Object model, filters, compression
  parser/             Lexer, object parser, xref, repair
  content/            Tokenizer, builder, text analysis, structure detection
  fonts/              Parsing, encoding, embedding, subsetting
  images/             Extraction and embedding (XObject + inline)
  rendering/          Page-to-pixel (tiny-skia), annotation rendering
  encryption/         R2-R6, RC4, AES-128/256, constant-time comparison
  forms/              AcroForms (read/write)
  signatures/         Digital signature parsing
  standards/          PDF/A, PDF/X validation
  accessibility/      PDF/UA, structure tree, auto-tagging, quality checks
  structure/          Outlines, annotations, metadata
  ocr/                OCR engines, text layer, hybrid comparison
  document.rs         High-level API

Testing

cargo test            # 1150+ tests
cargo bench           # Criterion benchmarks
cargo clippy          # Lint (clean on stable + nightly)
cargo doc --no-deps   # Docs (zero warnings)

CI: Ubuntu/Windows/macOS (stable) + nightly + fuzz + benchmarks. On-demand: 1800+ external PDFs from veraPDF, qpdf, BFO.

Performance

Operation Time
Parse 14-page PDF (1 MB) 26 ms
Text extraction per page 4.7 ms
Render page at 150 DPI 210 ms

Security

  • Zero panic! or unreachable!() in production code
  • Constant-time password hash comparison
  • Checked arithmetic on image dimensions and buffer allocations
  • Resource limits: 256 MB decoded stream max, 1M xref rebuild cap
  • Depth limits: page tree (64), outlines (32), inherited properties (32)
  • Cycle detection on outline /Next chains
  • Raw deflate fallback for corrupt zlib headers
  • Fuzz-tested: 3 cargo-fuzz + 18 proptest + 22 adversarial tests

License

Dual-licensed under MIT or Apache-2.0 at your option.