# oxidize-pdf

The Rust PDF library built for AI. Parse any PDF into structure-aware, embedding-ready chunks with one line of code. Pure Rust, zero C dependencies, 99.3% success rate on 9,000+ real-world PDFs.

```rust
let chunks = PdfDocument::open("document.pdf")?.rag_chunks()?;
// Each chunk: text, pages, bounding boxes, element types, heading context, token estimate
```
## Why oxidize-pdf for RAG?

Most PDF libraries give you a wall of text. oxidize-pdf gives you structured, metadata-rich chunks ready for your vector store:
| What you get | Why it matters |
|---|---|
| `chunk.full_text` | Heading context prepended -- better embeddings |
| `chunk.page_numbers` | Citation back to source pages |
| `chunk.bounding_boxes` | Spatial position for visual grounding |
| `chunk.element_types` | Filter by `"table"`, `"title"`, `"paragraph"` |
| `chunk.token_estimate` | Right-size chunks for your model's context window |
| `chunk.heading_context` | Section awareness without post-processing |
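These fields map naturally onto vector-store payloads. A minimal sketch of that mapping, using a local `Chunk` struct that mirrors the fields above (a stand-in for illustration, not the crate's `RagChunk` type):

```rust
// Local stand-in mirroring the chunk fields in the table above;
// the real type is oxidize-pdf's RagChunk, whose exact shape may differ.
struct Chunk {
    full_text: String,
    page_numbers: Vec<u32>,
    element_types: Vec<String>,
    heading_context: Vec<String>,
    token_estimate: usize,
}

// Build one JSON-ish payload string per chunk for vector-store ingestion.
fn to_payload(chunk: &Chunk) -> String {
    format!(
        "{{\"text\":{:?},\"pages\":{:?},\"types\":{:?},\"section\":{:?},\"tokens\":{}}}",
        chunk.full_text,
        chunk.page_numbers,
        chunk.element_types,
        chunk.heading_context.join(" > "),
        chunk.token_estimate
    )
}

fn main() {
    let chunk = Chunk {
        full_text: "Intro > Scope\nThis section defines...".into(),
        page_numbers: vec![1, 2],
        element_types: vec!["title".into(), "paragraph".into()],
        heading_context: vec!["Intro".into(), "Scope".into()],
        token_estimate: 12,
    };
    println!("{}", to_payload(&chunk));
}
```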
**Performance:** Pure Rust, 3,000-4,000 pages/sec generation, 85ms full-text extraction for a 930KB PDF.
## Quick Start

```toml
[dependencies]
oxidize-pdf = "2.3"
```
### RAG Pipeline -- One-Liner

```rust
use oxidize_pdf::PdfDocument;

let chunks = PdfDocument::open("document.pdf")?.rag_chunks()?;
```
### Custom Chunk Size

```rust
use oxidize_pdf::HybridChunkConfig;

// Smaller chunks for more precise retrieval
// (field name shown is illustrative -- see the HybridChunkConfig docs
// for the exact configuration fields)
let config = HybridChunkConfig {
    target_tokens: 256,
    ..Default::default()
};
let chunks = doc.rag_chunks_with(config)?;
```
### JSON for Vector Store Ingestion

```rust
// Serialize all chunks to JSON (requires `semantic` feature)
let json = doc.rag_chunks_json()?;
std::fs::write("chunks.json", json)?;
```
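Assuming a straightforward serialization of the chunk fields listed earlier, each record in the output might look roughly like this (illustrative shape only, not the crate's exact schema):

```json
{
  "full_text": "Introduction > Background\nPDF parsing is ...",
  "page_numbers": [3, 4],
  "bounding_boxes": [{ "page": 3, "x": 72.0, "y": 540.0, "width": 451.0, "height": 120.0 }],
  "element_types": ["title", "paragraph"],
  "heading_context": ["Introduction", "Background"],
  "token_estimate": 142
}
```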
## Element Partitioning

For fine-grained control, access the typed element pipeline directly:

```rust
use oxidize_pdf::{ExtractionProfile, PdfDocument};

let doc = PdfDocument::open("document.pdf")?;

// Partition into typed elements
let elements = doc.partition()?;
for el in &elements {
    println!("{el:?}");
}

// Or with a pre-configured profile
let elements = doc.partition_with_profile(ExtractionProfile::Academic)?;

// Build a relationship graph (parent/child sections)
let graph = doc.partition_graph()?;
for section in graph.top_level_sections {
    println!("{section:?}");
}
```
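To illustrate what a parent/child section structure enables, here is a self-contained sketch that walks such a tree to produce an indented outline, using a local `Section` stand-in (not the crate's `ElementGraph` type):

```rust
// Local stand-in for a section node; oxidize-pdf's ElementGraph exposes
// comparable parent/child relationships between partitioned elements.
struct Section {
    title: String,
    children: Vec<Section>,
}

// Depth-first walk producing an indented outline string.
fn outline(section: &Section, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&section.title);
    out.push('\n');
    for child in &section.children {
        outline(child, depth + 1, out);
    }
}

fn main() {
    let root = Section {
        title: "1 Introduction".into(),
        children: vec![
            Section { title: "1.1 Motivation".into(), children: vec![] },
            Section { title: "1.2 Scope".into(), children: vec![] },
        ],
    };
    let mut out = String::new();
    outline(&root, 0, &mut out);
    print!("{out}");
}
```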
## Beyond RAG

oxidize-pdf is a full-featured PDF library. Everything below works alongside the RAG pipeline.
### PDF Generation

```rust
use oxidize_pdf::{Document, Font, Page};

let mut doc = Document::new();
let mut page = Page::a4();
page.text()
    .set_font(Font::Helvetica, 14.0)
    .at(72.0, 720.0)
    .write("Hello from oxidize-pdf")?;
doc.add_page(page);
doc.save("hello.pdf")?;
```
### PDF Parsing

```rust
use oxidize_pdf::PdfDocument;

let doc = PdfDocument::open("document.pdf")?;
let text = doc.extract_text()?;
for (page_num, page_text) in text.iter().enumerate() {
    println!("page {page_num}: {page_text:?}");
}
```
### Encryption (Read + Write)

```rust
use oxidize_pdf::{Document, Page};

// Write encrypted PDFs (encryption settings and the reader type are shown
// schematically -- see the crate docs for the exact configuration API)
let mut doc = Document::new();
doc.add_page(Page::a4());
doc.set_encryption(encryption_settings);
doc.save("encrypted.pdf")?;

// Read encrypted PDFs
let mut reader = PdfReader::open("encrypted.pdf")?;
reader.unlock("user-password")?;
```
### Invoice Extraction

```rust
use oxidize_pdf::InvoiceExtractor;

let doc = PdfDocument::open("invoice.pdf")?;
let text = doc.extract_text()?;

// Language selection shown schematically; ES, EN, DE, and IT are supported
let extractor = InvoiceExtractor::builder()
    .with_language("es")
    .build();
let invoice = extractor.extract(&text)?;
// invoice.fields: invoice number, dates, amounts, VAT, line items
```
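As a rough illustration of what labelled-field extraction involves (a self-contained toy, not `InvoiceExtractor`'s actual implementation), here is a keyword-based lookup that pulls the token following a field label out of raw text:

```rust
// Toy extractor: find a labelled field and return the token after it.
// Real invoice extraction also handles dates, amounts, VAT, and line items.
fn field_after<'a>(text: &'a str, label: &str) -> Option<&'a str> {
    let start = text.find(label)? + label.len();
    // Skip the separator between label and value.
    let rest = text[start..].trim_start_matches(|c| c == ':' || c == ' ');
    let end = rest.find(char::is_whitespace).unwrap_or(rest.len());
    Some(&rest[..end])
}

fn main() {
    let text = "ACME S.L.\nInvoice No: 2024-0042\nTotal: 118.00 EUR";
    assert_eq!(field_after(text, "Invoice No"), Some("2024-0042"));
    assert_eq!(field_after(text, "Total"), Some("118.00"));
    println!("ok");
}
```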
### PDF Operations

```rust
// `PdfSplitter`/`PdfMerger` names and arguments are shown schematically --
// see the crate docs for the exact operations API
use oxidize_pdf::{PdfMerger, PdfSplitter};

// Split
PdfSplitter::new("input.pdf")?.split_by_pages(10)?;

// Merge
let mut merger = PdfMerger::new();
merger.add_pdf("first.pdf")?;
merger.add_pdf("second.pdf")?;
merger.save("merged.pdf")?;
```
## Full Feature Set

### AI/RAG Pipeline

- Structure-aware chunking with `RagChunk` metadata (pages, bboxes, types, headings)
- Element partitioning: Title, Paragraph, Table, ListItem, Image, CodeBlock, KeyValue
- `ElementGraph` for parent/child section relationships
- 6 extraction profiles (Standard, Academic, Form, Government, Dense, Presentation)
- Reading order strategies (Simple, XYCut)
- LLM-optimized export formats (Markdown, Contextual, JSON)
- Invoice data extraction (ES, EN, DE, IT)
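To make the reading-order item above concrete, here is a self-contained sketch of a top-to-bottom, left-to-right ordering over element bounding boxes -- a toy analogue of a "Simple" strategy, not the crate's implementation:

```rust
// (x, y) is the top-left corner of each element's bounding box.
#[derive(Debug, Clone)]
struct Element {
    text: String,
    x: f32,
    y: f32,
}

// Group elements into lines by y-proximity, then sort each line by x.
fn simple_reading_order(mut elements: Vec<Element>, line_tolerance: f32) -> Vec<Element> {
    elements.sort_by(|a, b| a.y.partial_cmp(&b.y).unwrap());
    let mut lines: Vec<Vec<Element>> = Vec::new();
    for el in elements {
        let start_new_line = match lines.last() {
            Some(line) => (el.y - line[0].y).abs() > line_tolerance,
            None => true,
        };
        if start_new_line {
            lines.push(vec![el]);
        } else {
            lines.last_mut().unwrap().push(el);
        }
    }
    for line in &mut lines {
        line.sort_by(|a, b| a.x.partial_cmp(&b.x).unwrap());
    }
    lines.into_iter().flatten().collect()
}

fn main() {
    let els = vec![
        Element { text: "world".into(), x: 120.0, y: 100.5 },
        Element { text: "Title".into(), x: 72.0, y: 40.0 },
        Element { text: "Hello".into(), x: 72.0, y: 100.0 },
    ];
    let ordered = simple_reading_order(els, 2.0);
    let words: Vec<&str> = ordered.iter().map(|e| e.text.as_str()).collect();
    println!("{}", words.join(" ")); // Title Hello world
}
```

XY-cut additionally splits the page recursively at whitespace gaps, which handles multi-column layouts that a flat line-sort cannot.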
### PDF Processing
- Parse PDF 1.0-1.7 with 99.3% success rate (9,000+ PDFs tested)
- Generate multi-page documents with text, graphics, images
- Encryption: RC4-40/128, AES-128, AES-256 (R5/R6) -- read and write
- Digital signatures: detection, PKCS#7 verification, certificate validation
- PDF/A validation: 8 conformance levels (1a/b, 2a/b/u, 3a/b/u)
- JBIG2 decoder: pure Rust (ITU-T T.88)
- OCR via Tesseract (optional feature)
- Split, merge, rotate operations
- CJK text support (Chinese, Japanese, Korean)
- Corruption recovery and lenient parsing
- Decompression bomb protection
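Decompression-bomb protection amounts to bounding how many bytes a stream may expand to. A generic sketch of that idea using `std::io::Read::take` (an illustration of the concept, not the crate's actual mechanism):

```rust
use std::io::{self, Read};

/// Read at most `limit` bytes from a (decompressed) stream;
/// error out if the stream would exceed the limit.
fn read_bounded<R: Read>(reader: R, limit: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Allow one extra byte so overflow past the limit is detectable.
    reader.take(limit + 1).read_to_end(&mut buf)?;
    if buf.len() as u64 > limit {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "decompressed stream exceeds size limit",
        ));
    }
    Ok(buf)
}

fn main() {
    let small = io::Cursor::new(vec![0u8; 100]);
    assert!(read_bounded(small, 1024).is_ok());

    let bomb = io::Cursor::new(vec![0u8; 2048]);
    assert!(read_bounded(bomb, 1024).is_err());

    println!("bounded reads ok");
}
```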
### Performance
| Operation | Speed |
|---|---|
| PDF generation | 3,000-4,000 pages/sec |
| Full text extraction (930KB) | 85 ms |
| Page text extraction | 546 µs |
| File loading | 738 µs |
Benchmarked with Criterion. Baseline: v2.0.0-profiling.
## Testing
7,993 tests across unit, integration, and doc tests. 7-tier corpus (T0-T6) with 9,000+ PDFs.
## License
MIT -- see LICENSE.