PDF Parser Module - Complete PDF parsing and rendering support
This module provides a comprehensive, 100% native Rust implementation for parsing PDF files according to the ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0) specifications.
§Overview
The parser is designed to support building PDF renderers, content extractors, and analysis tools. It provides multiple levels of API access:
- High-level: PdfDocument for easy document manipulation
- Mid-level: ParsedPage, content streams, and resources
- Low-level: Direct access to PDF objects and streams
§Quick Start
use oxidize_pdf::parser::{PdfDocument, PdfReader};
use oxidize_pdf::parser::content::ContentParser;
// Open a PDF document
let reader = PdfReader::open("document.pdf")?;
let document = PdfDocument::new(reader);
// Get document information
println!("Pages: {}", document.page_count()?);
println!("Version: {}", document.version()?);
// Process first page
let page = document.get_page(0)?;
println!("Page size: {}x{} points", page.width(), page.height());
// Parse content streams
let streams = page.content_streams_with_document(&document)?;
for stream in streams {
    let operations = ContentParser::parse(&stream)?;
    println!("Operations: {}", operations.len());
}
// Extract text
let text = document.extract_text_from_page(0)?;
println!("Text: {}", text.text);
§Architecture
┌─────────────────────────────────────────────────┐
│ PdfDocument │ ← High-level API
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │PdfReader │ │PageTree │ │ResourceManager │ │
│ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
│ │ │
↓ ↓ ↓
┌─────────────────────────────────────────────────┐
│ ParsedPage │ ← Page API
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │Properties│ │Resources │ │Content Streams │ │
│ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
│ │ │
↓ ↓ ↓
┌─────────────────────────────────────────────────┐
│ ContentParser & PdfObject │ ← Low-level API
│ ┌──────────┐ ┌──────────┐ ┌────────────────┐ │
│ │Tokenizer │ │Operators │ │Object Types │ │
│ └──────────┘ └──────────┘ └────────────────┘ │
└─────────────────────────────────────────────────┘
§Features
- Complete PDF Object Model: All PDF object types supported
- Content Stream Parsing: Full operator support for rendering
- Resource Management: Fonts, images, color spaces, patterns
- Text Extraction: With position and formatting information
- Page Navigation: Efficient page tree traversal
- Stream Filters: Decompression support (FlateDecode, ASCIIHex, etc.)
- Reference Resolution: Automatic handling of indirect objects
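To make the content stream feature more concrete: PDF content streams use postfix notation, so operands come first and the operator follows. The sketch below is a standalone illustration in plain std Rust, not this crate's `ContentParser`. The `Op` enum and `parse_path_ops` function are hypothetical names, and only numeric operands with the `m`, `l`, and `S` operators are handled.

```rust
/// Hypothetical, simplified operation type (illustration only,
/// not the crate's `ContentOperation`).
#[derive(Debug, PartialEq)]
enum Op {
    MoveTo(f32, f32),
    LineTo(f32, f32),
    Stroke,
}

/// Parse a whitespace-separated content stream fragment.
/// Numbers are collected as operands until an operator consumes them.
fn parse_path_ops(input: &str) -> Result<Vec<Op>, String> {
    let mut operands: Vec<f32> = Vec::new();
    let mut ops = Vec::new();
    for token in input.split_whitespace() {
        // Anything that parses as a number is treated as an operand.
        if let Ok(n) = token.parse::<f32>() {
            operands.push(n);
            continue;
        }
        // Otherwise the token is an operator that consumes the operands.
        match (token, operands.as_slice()) {
            ("m", [x, y]) => ops.push(Op::MoveTo(*x, *y)),
            ("l", [x, y]) => ops.push(Op::LineTo(*x, *y)),
            ("S", []) => ops.push(Op::Stroke),
            _ => {
                return Err(format!(
                    "unsupported operator or operand count: {}",
                    token
                ))
            }
        }
        operands.clear();
    }
    Ok(ops)
}

fn main() {
    let ops = parse_path_ops("100 200 m 300 400 l S").unwrap();
    println!("{:?}", ops);
}
```

A real parser also has to handle names, strings, arrays, dictionaries, and inline images, which is why a dedicated tokenizer layer sits below the operator layer in the architecture diagram.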
§Example: Building a Simple Renderer
use oxidize_pdf::parser::{PdfDocument, PdfReader};
use oxidize_pdf::parser::content::{ContentParser, ContentOperation};
struct SimpleRenderer {
    // Points of the path currently being built
    current_path: Vec<(f32, f32)>,
}
impl SimpleRenderer {
    fn render_page(document: &PdfDocument<std::fs::File>, page_idx: u32) -> Result<(), Box<dyn std::error::Error>> {
        let page = document.get_page(page_idx)?;
        // `document` is already a reference, so it is passed through as-is
        let streams = page.content_streams_with_document(document)?;
        let mut renderer = SimpleRenderer {
            current_path: Vec::new(),
        };
        for stream in streams {
            let operations = ContentParser::parse(&stream)?;
            for op in operations {
                match op {
                    // MoveTo starts a new path
                    ContentOperation::MoveTo(x, y) => {
                        renderer.current_path.clear();
                        renderer.current_path.push((x, y));
                    }
                    ContentOperation::LineTo(x, y) => {
                        renderer.current_path.push((x, y));
                    }
                    // Stroke draws the accumulated path and resets it
                    ContentOperation::Stroke => {
                        println!("Draw path with {} points", renderer.current_path.len());
                        renderer.current_path.clear();
                    }
                    ContentOperation::ShowText(text) => {
                        println!("Draw text: {:?}", String::from_utf8_lossy(&text));
                    }
                    _ => {} // Handle other operations
                }
            }
        }
        Ok(())
    }
}
Re-exports§
pub use self::content::ContentOperation;
pub use self::content::ContentParser;
pub use self::content::TextElement;
pub use self::document::PdfDocument;
pub use self::document::ResourceManager;
pub use self::encoding::CharacterDecoder;
pub use self::encoding::EncodingOptions;
pub use self::encoding::EncodingResult;
pub use self::encoding::EncodingType;
pub use self::encoding::EnhancedDecoder;
pub use self::encryption_handler::ConsolePasswordProvider;
pub use self::encryption_handler::EncryptionHandler;
pub use self::encryption_handler::EncryptionInfo;
pub use self::encryption_handler::InteractiveDecryption;
pub use self::encryption_handler::PasswordProvider;
pub use self::encryption_handler::PasswordResult;
pub use self::objects::PdfArray;
pub use self::objects::PdfDictionary;
pub use self::objects::PdfName;
pub use self::objects::PdfObject;
pub use self::objects::PdfStream;
pub use self::objects::PdfString;
pub use self::optimized_reader::OptimizedPdfReader;
pub use self::page_tree::ParsedPage;
pub use self::reader::DocumentMetadata;
pub use self::reader::PdfReader;
Modules§
- content: PDF Content Stream Parser - Complete support for PDF graphics operators
- document: PDF Document wrapper - High-level interface for PDF parsing and manipulation
- encoding: Character Encoding Detection and Conversion Module
- encryption_handler: PDF encryption detection and password handling
- filter_impls: PDF stream filter implementations
- filters: PDF Stream Filters
- header: PDF Header Parser
- lexer: PDF Lexer
- object_stream: PDF Object Stream Parser
- objects: PDF Object Parser - Core PDF data types and parsing
- optimized_reader: Optimized PDF Reader with LRU caching
- page_tree: PDF Page Tree Parser
- reader: High-level PDF Reader API
- stack_safe: Stack-safe parsing utilities
- stack_safe_tests: Comprehensive tests for stack-safe parsing implementations
- trailer: PDF Trailer Parser
- xref: PDF Cross-Reference Table Parser
- xref_stream: Cross-reference stream support for PDF 1.5+
- xref_types: XRef Entry Type Definitions
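As a small illustration of what a stream filter does, the sketch below decodes ASCIIHexDecode data as described in ISO 32000-1: pairs of hexadecimal digits form bytes, whitespace is ignored, `>` marks the end of data, and a trailing odd digit is treated as if followed by `0`. It is a standalone std-only sketch, not the code in `filters` or `filter_impls`; the function name `ascii_hex_decode` is hypothetical.

```rust
/// Decode ASCIIHexDecode data (hypothetical helper, std only).
fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(input.len() / 2);
    let mut high: Option<u8> = None; // pending high nibble, if any
    for &b in input {
        if b == b'>' {
            break; // end-of-data marker
        }
        // Skip ASCII whitespace; PDF also treats NUL as whitespace.
        if b.is_ascii_whitespace() || b == 0 {
            continue;
        }
        let nibble = (b as char)
            .to_digit(16)
            .ok_or_else(|| format!("invalid hex byte: 0x{:02X}", b))? as u8;
        match high.take() {
            // Second digit of a pair: emit the combined byte.
            Some(h) => out.push((h << 4) | nibble),
            // First digit of a pair: remember it.
            None => high = Some(nibble),
        }
    }
    // An odd number of digits behaves as if a final 0 followed.
    if let Some(h) = high {
        out.push(h << 4);
    }
    Ok(out)
}

fn main() {
    let decoded = ascii_hex_decode(b"48 65 6C6c 6F>").unwrap();
    println!("{}", String::from_utf8_lossy(&decoded));
}
```

Filters such as FlateDecode follow the same shape (encoded bytes in, decoded bytes out) but need a real decompression implementation.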
Structs§
- ParseOptions: Options for parsing PDF files with different levels of strictness
Enums§
- ParseError: PDF Parser errors covering all failure modes during parsing.
- ParseWarning: Warnings that can be collected during lenient parsing
Type Aliases§
- ParseResult: Result type for parser operations
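The options, error, warning, and result items above suggest a strict-versus-lenient parsing pattern: strict mode fails on a recoverable problem, while lenient mode records a warning and continues. The sketch below shows that pattern on a PDF header check. All types and names here (`SketchOptions`, `SketchWarning`, `SketchError`, `SketchResult`, `parse_version`) are hypothetical stand-ins, not the fields or variants of this crate's `ParseOptions`, `ParseWarning`, `ParseError`, or `ParseResult`.

```rust
/// Hypothetical warning type: recoverable issues seen in lenient mode.
#[derive(Debug, PartialEq)]
enum SketchWarning {
    HeaderOffset(usize),
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingHeader,
    HeaderNotAtStart(usize),
    BadVersion,
}

/// Hypothetical options: a single strictness switch.
struct SketchOptions {
    strict: bool,
}

/// Result alias in the style of a parser-wide result type.
type SketchResult<T> = Result<T, SketchError>;

/// Find `%PDF-M.m` and return (major, minor).
/// Bytes before the header are an error in strict mode
/// and a collected warning in lenient mode.
fn parse_version(
    data: &[u8],
    opts: &SketchOptions,
    warnings: &mut Vec<SketchWarning>,
) -> SketchResult<(u8, u8)> {
    const MARKER: &[u8] = b"%PDF-";
    let pos = data
        .windows(MARKER.len())
        .position(|w| w == MARKER)
        .ok_or(SketchError::MissingHeader)?;
    if pos != 0 {
        if opts.strict {
            return Err(SketchError::HeaderNotAtStart(pos));
        }
        warnings.push(SketchWarning::HeaderOffset(pos));
    }
    match &data[pos + MARKER.len()..] {
        [major, b'.', minor, ..] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Ok((major - b'0', minor - b'0'))
        }
        _ => Err(SketchError::BadVersion),
    }
}

fn main() {
    let mut warnings = Vec::new();
    let lenient = SketchOptions { strict: false };
    let version = parse_version(b"junk%PDF-1.7\n", &lenient, &mut warnings);
    println!("{:?} with warnings {:?}", version, warnings);
}
```

Collecting warnings in a separate vector keeps the success path typed as a plain result while still letting callers inspect what was tolerated.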