oxidize-pdf
A pure Rust PDF generation and manipulation library with zero external PDF dependencies. Battle-tested against 9,000+ real-world PDFs with a 99.3% success rate, 6,400+ tests, and validated performance of 3,000-4,000 pages/second for realistic business documents.
Features
- Pure Rust Core - No C dependencies for PDF operations (OCR feature requires Tesseract)
- PDF Generation - Create multi-page documents with text, graphics, and images
- PDF Parsing - Read and extract content from existing PDFs (tested on 9,000+ real-world PDFs)
- Corruption Recovery - Robust error recovery for damaged or malformed PDFs (99.3% success rate)
- PDF Operations - Split, merge, and rotate PDFs while preserving basic content
- Image Support - Embed JPEG and PNG images with automatic compression
- Transparency & Blending - Full alpha channel, SMask, blend modes for watermarking and overlays
- CJK Text Support - Chinese, Japanese, and Korean text rendering and extraction with ToUnicode CMap
- Rich Graphics - Vector graphics with shapes, paths, colors (RGB/CMYK/Gray)
- Advanced Text - Custom TTF/OTF fonts, standard fonts, text flow with automatic wrapping, alignment
- Custom Fonts - Load and embed TrueType/OpenType fonts with full Unicode support
- OCR Support - Extract text from scanned PDFs using Tesseract OCR
- AI/RAG Integration - Document chunking for LLM pipelines with sentence boundaries and metadata
- Invoice Extraction - Automatic structured data extraction from invoice PDFs with multi-language support
- Compression - FlateDecode, LZWDecode, CCITTFaxDecode, JBIG2Decode, and more
- Encryption - RC4, AES-128, AES-256 (R5/R6) with full permission support
- Digital Signatures - Detection, PKCS#7 verification, and certificate validation (Mozilla CA roots)
- PDF/A Validation - 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- Type Safe - Leverage Rust's type system for safe PDF manipulation
What's New in v2.0.0
- MIT License - Consolidated across all project files
- 9,000+ PDF Corpus - 7-tier test infrastructure (T0-T6) with 99.3% success rate
- JBIG2 Decoder - Full pure Rust implementation (ITU-T T.88, 9 modules, 416 tests)
- Digital Signature Verification - PKCS#7 + Mozilla CA root certificates
- PDF/A Validation - 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- CCITTFaxDecode - Group 3/4 fax compression support
- AES-256 R5/R6 Encryption - RustCrypto, Algorithm 2.B, qpdf compatible
- 6,400+ Tests - Unit, integration, doc tests, and property-based testing
See CHANGELOG.md for previous releases.
Why oxidize-pdf?
Performance & Efficiency
- Production-ready performance - 3,000-4,000 pages/second generation, 35.9 PDFs/second parsing
- 5.2 MB binary - 3x smaller than PDFSharp, 40x smaller than IronPDF
- Zero dependencies - No runtime, no Chrome, just a single binary
- Low memory usage - Efficient streaming for large PDFs
Safety & Reliability
- Memory safe - Guaranteed by Rust compiler (no null pointers, no buffer overflows)
- Type safe API - Catch errors at compile time
- 6,400+ tests - Comprehensive test suite with 9,000+ real-world PDFs
- Fewer vulnerability classes - Memory safety rules out whole categories of bugs, such as buffer overflows and use-after-free, in safe Rust code
Developer Experience
- Modern API - Designed in 2024, not ported from 2005
- True cross-platform - Single binary runs on Linux, macOS, Windows, ARM
- Easy deployment - One file to ship, no dependencies to manage
- Fast compilation - Incremental builds in seconds
Quick Start
Add oxidize-pdf to your Cargo.toml:
[dependencies]
oxidize-pdf = "2.0.0"
# For OCR support (optional)
oxidize-pdf = { version = "2.0.0", features = ["ocr-tesseract"] }
Basic PDF Generation
AI/RAG Document Chunking (v1.3.0+)
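As a rough illustration of what chunking on sentence boundaries involves, here is a standard-library-only sketch. It does not use or reproduce the crate's chunker; the function names `split_sentences` and `chunk_by_sentences` and the character-based size limit are assumptions made for this example.

```rust
/// Split `text` into sentences, treating '.', '!' and '?' followed by
/// whitespace (or end of input) as a boundary. Deliberately naive:
/// abbreviations such as "e.g." are split too.
fn split_sentences(text: &str) -> Vec<&str> {
    let mut sentences = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if matches!(c, '.' | '!' | '?') {
            let at_boundary = match chars.peek() {
                Some(&(_, next)) => next.is_whitespace(),
                None => true,
            };
            if at_boundary {
                let end = i + c.len_utf8();
                let sentence = text[start..end].trim();
                if !sentence.is_empty() {
                    sentences.push(sentence);
                }
                start = end;
            }
        }
    }
    // Whatever follows the last terminator is kept as a final sentence.
    let tail = text[start..].trim();
    if !tail.is_empty() {
        sentences.push(tail);
    }
    sentences
}

/// Greedily pack whole sentences into chunks of at most `max_chars`
/// characters. A single sentence longer than the limit becomes its own
/// oversized chunk rather than being cut mid-sentence.
fn chunk_by_sentences(text: &str, max_chars: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for sentence in split_sentences(text) {
        // +1 accounts for the joining space.
        let needed = current.chars().count() + 1 + sentence.chars().count();
        if !current.is_empty() && needed > max_chars {
            chunks.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(sentence);
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```

Calling `chunk_by_sentences(text, 512)` on extracted page text yields chunks that never end mid-sentence. A production chunker would also need to handle abbreviations and carry page or position metadata.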
Invoice Data Extraction (v1.6.2+)
Supported Languages: Spanish (ES), English (EN), German (DE), Italian (IT)
Extracted Fields: Invoice number, dates, amounts (total/tax/net), VAT numbers, supplier/customer names, currency, line items
See docs/INVOICE_EXTRACTION_GUIDE.md for complete documentation.
Custom Fonts Example
Parse Existing PDF
Working with Images & Transparency
Advanced Text Flow
PDF Operations
OCR Text Extraction
OCR Installation
Before using OCR features, install Tesseract on your system:
macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr
Windows: Download the installer from https://github.com/UB-Mannheim/tesseract/wiki
More Examples
Explore comprehensive examples in the examples/ directory:
- recovery_corrupted_pdf.rs - Handle damaged or malformed PDFs with robust error recovery
- png_transparency_watermark.rs - Create watermarks, blend modes, and transparent overlays
- cjk_text_extraction.rs - Work with Chinese, Japanese, and Korean text
- basic_chunking.rs - Document chunking for AI/RAG pipelines
- rag_pipeline.rs - Complete RAG workflow with embeddings
Run any example:
cargo run --example recovery_corrupted_pdf
Supported Features
PDF Generation
- Multi-page documents
- Vector graphics (rectangles, circles, paths, lines)
- Text rendering with standard fonts (Helvetica, Times, Courier)
- JPEG and PNG image embedding with transparency
- Transparency groups, blend modes, and opacity control
- RGB, CMYK, and Grayscale colors
- Graphics transformations (translate, rotate, scale)
- Text flow with automatic line wrapping
- FlateDecode compression
PDF Parsing
- PDF 1.0 - 1.7 basic structure support
- Cross-reference table parsing with automatic recovery
- XRef streams (PDF 1.5+) and object streams
- Object and stream parsing with corruption tolerance
- Page tree navigation with circular reference detection
- Content stream parsing (basic operators)
- Text extraction with CJK (Chinese, Japanese, Korean) support
- CMap and ToUnicode parsing for complex encodings
- Document metadata extraction
- Filter support (FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode)
- Lenient parsing with multiple error recovery strategies
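To make "page tree navigation with circular reference detection" concrete, the sketch below walks a simplified page tree with a visited set. It is an illustration of the general technique only, not the parser's actual code: the `Node` type, the `collect_pages` function, and the choice to return an error (rather than skip the offending node) are all assumptions.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for a page tree object: either an intermediate
/// `/Pages` node holding child object ids, or a leaf `/Page`.
enum Node {
    Pages(Vec<u32>),
    Page,
}

/// Collect leaf page ids in document order, failing on the first object
/// id that is reached twice (a cycle, or a subtree shared by two parents).
fn collect_pages(objects: &HashMap<u32, Node>, root: u32) -> Result<Vec<u32>, String> {
    let mut visited = HashSet::new();
    let mut pages = Vec::new();
    // Explicit stack instead of recursion, so a deep or hostile tree
    // cannot overflow the call stack.
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        if !visited.insert(id) {
            return Err(format!("object {} reached twice in page tree", id));
        }
        match objects.get(&id) {
            Some(Node::Page) => pages.push(id),
            // Push kids in reverse so they are popped in document order.
            Some(Node::Pages(kids)) => stack.extend(kids.iter().rev().copied()),
            None => return Err(format!("missing object {}", id)),
        }
    }
    Ok(pages)
}
```

A lenient parser might log the repeated reference and continue instead of failing; which behaviour is appropriate depends on how much recovery is wanted.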
PDF Operations
- Split by pages, ranges, or size
- Merge multiple PDFs
- Rotate pages (90°, 180°, 270°)
- Basic content preservation
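The bookkeeping behind splitting and rotating can be sketched without any PDF library. The helpers below are hypothetical (`split_ranges` and `rotate` are not oxidize-pdf functions); they only show how page ranges for a fixed-size split might be computed and how a rotation is normalised, given that the PDF format requires a page's `/Rotate` value to be a multiple of 90.

```rust
use std::ops::RangeInclusive;

/// Partition pages 1..=total_pages into consecutive 1-based ranges of at
/// most `pages_per_file` pages each. Returns an empty list when either
/// argument is zero.
fn split_ranges(total_pages: u32, pages_per_file: u32) -> Vec<RangeInclusive<u32>> {
    if total_pages == 0 || pages_per_file == 0 {
        return Vec::new();
    }
    let mut ranges = Vec::new();
    let mut start: u32 = 1;
    while start <= total_pages {
        let end = total_pages.min(start.saturating_add(pages_per_file - 1));
        ranges.push(start..=end);
        if end == u32::MAX {
            break; // avoid overflow when computing the next start
        }
        start = end + 1;
    }
    ranges
}

/// Combine a page's current rotation with an additional rotation, both in
/// degrees. Values that are not multiples of 90 are rejected; the result
/// is normalised to 0, 90, 180 or 270.
fn rotate(current: i32, delta: i32) -> Option<i32> {
    if current % 90 != 0 || delta % 90 != 0 {
        return None;
    }
    // Widen to i64 so extreme inputs cannot overflow the addition.
    Some((i64::from(current) + i64::from(delta)).rem_euclid(360) as i32)
}
```

For a 10-page document split into files of 4 pages, `split_ranges(10, 4)` yields the ranges 1-4, 5-8 and 9-10; rotating a page that is already at 270° by another 180° normalises to 90°.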
OCR Support (v0.1.3+)
- Tesseract OCR integration with feature flag
- Multi-language support (50+ languages)
- Page analysis and scanned page detection
- Configurable preprocessing (denoise, deskew, contrast)
- Layout preservation with position information
- Confidence scoring and filtering
- Multiple page segmentation modes (PSM)
- Character whitelisting/blacklisting
- Mock OCR provider for testing
- Parallel and batch processing
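Confidence filtering, in its simplest form, means discarding recognised fragments whose score falls below a threshold. The sketch below shows that idea with a hypothetical `Fragment` type and a 0.0 to 1.0 confidence scale; both are assumptions for illustration and do not mirror the crate's OCR result types.

```rust
/// Hypothetical OCR result fragment; not the crate's actual type.
#[derive(Debug, Clone, PartialEq)]
struct Fragment {
    text: String,
    /// Assumed scale: 0.0 (no confidence) to 1.0 (certain).
    confidence: f32,
}

/// Keep fragments at or above `min_confidence` and join their text
/// with single spaces, preserving the original order.
fn filter_by_confidence(fragments: &[Fragment], min_confidence: f32) -> String {
    fragments
        .iter()
        .filter(|f| f.confidence >= min_confidence)
        .map(|f| f.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Choosing the threshold is a trade-off: a high value drops noise from poor scans but may also drop correct text in low-contrast regions.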
Performance
Validated Metrics (based on comprehensive benchmarking):
- PDF Generation: 3,000-4,000 pages/second for realistic business documents
- Complex Content: 670 pages/second for dense analytics dashboards
- PDF Parsing: 35.9 PDFs/second (99.3% success rate on 9,000+ real-world PDFs)
- Memory Efficient: Streaming operations available for large documents
- Pure Rust: No external C dependencies for PDF operations
See PERFORMANCE_HONEST_REPORT.md for detailed benchmarking methodology and results.
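To translate the throughput figures above into wall-clock estimates, a small helper like the following can be used. It is plain arithmetic under the assumption of a steady rate; `estimated_duration` is a name made up for this sketch.

```rust
use std::time::Duration;

/// Time needed to process `items` at a steady `items_per_second` rate.
/// Returns `None` when the rate is zero, negative, or not finite.
fn estimated_duration(items: u64, items_per_second: f64) -> Option<Duration> {
    if !items_per_second.is_finite() || items_per_second <= 0.0 {
        return None;
    }
    Some(Duration::from_secs_f64(items as f64 / items_per_second))
}
```

With the quoted numbers, generating 10,000 business-document pages takes roughly 2.5 to 3.3 seconds (at 4,000 and 3,000 pages/second), the same page count of dense dashboards takes about 15 seconds at 670 pages/second, and parsing 9,000 PDFs at 35.9 PDFs/second takes a little over four minutes. Actual times depend on hardware and content.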
Examples
Check out the examples directory for more usage patterns:
- hello_world.rs - Basic PDF creation
- graphics_demo.rs - Vector graphics showcase
- text_formatting.rs - Advanced text features
- custom_fonts.rs - TTF/OTF font loading and embedding
- jpeg_image.rs - Image embedding
- parse_pdf.rs - PDF parsing and text extraction
- comprehensive_demo.rs - All features demonstration
- tesseract_ocr_demo.rs - OCR text extraction (requires --features ocr-tesseract)
- scanned_pdf_analysis.rs - Analyze PDFs for scanned content
- extract_images.rs - Extract embedded images from PDFs
- create_pdf_with_images.rs - Advanced image embedding examples
Run examples with:
cargo run --example hello_world
# For OCR examples
cargo run --example tesseract_ocr_demo --features ocr-tesseract
License
This project is licensed under the MIT License - see the LICENSE file for details.
Known Limitations
We prioritize transparency about what works and what doesn't.
Working Features
- Compression: FlateDecode, ASCIIHexDecode, ASCII85Decode, RunLengthDecode, LZWDecode, DCTDecode, CCITTFaxDecode, JBIG2Decode
- Color Spaces: DeviceRGB, DeviceCMYK, DeviceGray
- Fonts: Standard 14 fonts + TTF/OTF custom font loading and embedding
- Images: JPEG embedding, raw RGB/Gray data, PNG with transparency
- Operations: Split, merge, rotate, page extraction, text extraction
- Graphics: Vector operations, clipping paths, transparency (CA/ca)
- Encryption: RC4 40/128-bit, AES-128/256, AES-256 R5/R6
- Forms: Basic text fields, checkboxes, radio buttons, combo boxes, list boxes
- Digital Signatures: Detection + PKCS#7 verification + certificate validation (signing not yet supported)
- PDF/A Validation: 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- JBIG2 Decoding: Full pure Rust decoder (ITU-T T.88)
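For readers unfamiliar with PDF stream filters, ASCIIHexDecode is the simplest of those listed. The sketch below follows the filter's rules as defined by the PDF format (whitespace is ignored, `>` marks end of data, an odd trailing digit is padded with `0`); it is an independent illustration and not the decoder shipped in the crate, and the function name `ascii_hex_decode` is an assumption.

```rust
/// Decode an ASCIIHexDecode stream. PDF whitespace is ignored, `>` marks
/// end of data, and an odd trailing digit is treated as if followed by `0`.
/// Any other non-hex byte is reported as an error.
fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(input.len() / 2);
    let mut high: Option<u8> = None; // pending high nibble, if any
    for &byte in input {
        if byte == b'>' {
            break; // end-of-data marker; anything after it is ignored
        }
        // PDF whitespace: NUL, tab, line feed, form feed, carriage return, space.
        if byte == 0 || byte.is_ascii_whitespace() {
            continue;
        }
        let nibble = (byte as char)
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex digit 0x{:02X}", byte))? as u8;
        match high.take() {
            Some(h) => out.push((h << 4) | nibble),
            None => high = Some(nibble),
        }
    }
    if let Some(h) = high {
        out.push(h << 4);
    }
    Ok(out)
}
```

A lenient parser might skip invalid bytes instead of returning an error; the strict behaviour here is a choice made for the example.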
Missing Features
- Form Interactions (partial): Forms can be created but not edited interactively
- Tagged PDFs (partial): Structure tree API only; no marked content operators
- Rendering: No PDF to image conversion
- JPXDecode: JPEG 2000 compression not supported
- Advanced Graphics: Complex patterns, shadings, gradients
- Digital Signing: Signature creation (verification works, signing does not)
- Advanced Color: ICC profiles, spot colors, Lab color space
- JavaScript: No form calculations or validation scripts
- Multimedia: No sound, video, or 3D content support
Important Notes
- Parsing success doesn't mean full feature support
- Many PDFs will parse but advanced features will be ignored
Project Structure
oxidize-pdf/
├── oxidize-pdf-core/ # Core PDF library (MIT)
├── oxidize-pdf-api/ # REST API server
├── oxidize-pdf-cli/ # CLI interface
├── test-corpus/ # 9,000+ PDFs across 7 tiers (T0-T6)
├── docs/ # Documentation
├── dev-tools/ # Development utilities
├── benches/ # Benchmarks
└── lints/ # Custom Clippy lints
See REPOSITORY_ARCHITECTURE.md for detailed information.
Testing
oxidize-pdf uses a 7-tier corpus (T0-T6) with 9,000+ PDFs and 6,400+ tests:
| Tier | Description | PDFs | Purpose |
|---|---|---|---|
| T0 | Synthetic | Generated | Unit tests, CI/CD |
| T1 | Reference | ~1,300 | pdf.js, pdfium, poppler suites |
| T2 | Real-world | ~7,000 | GovDocs, academic, corporate |
| T3 | Stress | ~200 | Malformed, edge cases |
| T4 | Performance | ~100 | Benchmarking targets |
| T5 | Quality | ~300 | Text extraction accuracy |
| T6 | Adversarial | ~100 | Security, fuzzing |
# Run standard test suite (T0: synthetic PDFs, runs in CI)
# Run corpus tests (requires downloaded corpus)
# Run OCR tests (requires Tesseract installation)
The T0 tier runs in CI with zero external dependencies. T1-T6 tiers require downloading the corpus (~15 GB); see test-corpus/ for setup instructions.
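As a quick sanity check of the tier table, the approximate per-tier counts can be tallied and combined with the 99.3% success rate. The sketch below is back-of-the-envelope arithmetic on the rounded figures from the table, not project tooling; the constant and function names are made up for the example.

```rust
/// Approximate PDF counts per downloadable tier, copied from the tier
/// table. T0 is generated synthetically and therefore not counted.
const TIER_COUNTS: [(&str, u32); 6] = [
    ("T1", 1_300),
    ("T2", 7_000),
    ("T3", 200),
    ("T4", 100),
    ("T5", 300),
    ("T6", 100),
];

/// Sum of the approximate tier sizes.
fn corpus_total() -> u32 {
    TIER_COUNTS.iter().map(|&(_, count)| count).sum()
}

/// Number of files expected to fail at a given success rate (0.0 to 1.0),
/// rounded to the nearest whole file.
fn expected_failures(total: u32, success_rate: f64) -> u32 {
    (f64::from(total) * (1.0 - success_rate)).round() as u32
}
```

The rounded tier sizes add up to about 9,000 PDFs, so a 99.3% success rate corresponds to roughly 63 files that fail to parse.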
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Roadmap
oxidize-pdf is under active development. Our focus areas include:
Current Focus
- Parsing & Compatibility: Improving support for diverse PDF structures
- Core Operations: Enhancing split, merge, and manipulation capabilities
- Performance: Optimizing memory usage and processing speed
- Stability: Addressing edge cases and error handling
Upcoming Areas
- Extended Format Support: Additional image formats and encodings
- Advanced Text Processing: Improved text extraction and layout analysis
- Enterprise Features: Features designed for production use at scale
- Developer Experience: Better APIs, documentation, and tooling
Long-term Vision
- Comprehensive PDF standard compliance for common use cases
- Production-ready reliability and performance
- Rich ecosystem of tools and integrations
- Sustainable open source development model
We prioritize features based on community feedback and real-world usage. Have a specific need? Open an issue to discuss!
Support
- Documentation
- Issue Tracker
- Discussions
Acknowledgments
Built with ❤️ using Rust. Special thanks to the Rust community and all contributors.