Module parser


PDF Parser Module - Complete PDF parsing and rendering support

This module provides a comprehensive, 100% native Rust implementation for parsing PDF files according to the ISO 32000-1 (PDF 1.7) and ISO 32000-2 (PDF 2.0) specifications.

§Overview

The parser is designed to support building PDF renderers, content extractors, and analysis tools. It provides multiple levels of API access:

  • High-level: PdfDocument for easy document manipulation
  • Mid-level: ParsedPage, content streams, and resources
  • Low-level: Direct access to PDF objects and streams

§Quick Start

use oxidize_pdf::parser::{PdfDocument, PdfReader};
use oxidize_pdf::parser::content::ContentParser;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a PDF document
    let reader = PdfReader::open("document.pdf")?;
    let document = PdfDocument::new(reader);

    // Get document information
    println!("Pages: {}", document.page_count()?);
    println!("Version: {}", document.version()?);

    // Process the first page (page indices are zero-based)
    let page = document.get_page(0)?;
    println!("Page size: {}x{} points", page.width(), page.height());

    // Parse each content stream into a list of operations
    let streams = page.content_streams_with_document(&document)?;
    for stream in streams {
        let operations = ContentParser::parse(&stream)?;
        println!("Operations: {}", operations.len());
    }

    // Extract text from the first page
    let text = document.extract_text_from_page(0)?;
    println!("Text: {}", text.text);

    Ok(())
}

§Architecture

┌─────────────────────────────────────────────────┐
│                 PdfDocument                     │ ← High-level API
│  ┌──────────┐ ┌──────────┐ ┌────────────────┐  │
│  │PdfReader │ │PageTree  │ │ResourceManager │  │
│  └──────────┘ └──────────┘ └────────────────┘  │
└─────────────────────────────────────────────────┘
           │              │              │
           ↓              ↓              ↓
┌─────────────────────────────────────────────────┐
│              ParsedPage                         │ ← Page API
│  ┌──────────┐ ┌──────────┐ ┌────────────────┐  │
│  │Properties│ │Resources │ │Content Streams │  │
│  └──────────┘ └──────────┘ └────────────────┘  │
└─────────────────────────────────────────────────┘
           │              │              │
           ↓              ↓              ↓
┌─────────────────────────────────────────────────┐
│         ContentParser & PdfObject               │ ← Low-level API
│  ┌──────────┐ ┌──────────┐ ┌────────────────┐  │
│  │Tokenizer │ │Operators │ │Object Types    │  │
│  └──────────┘ └──────────┘ └────────────────┘  │
└─────────────────────────────────────────────────┘

§Features

  • Complete PDF Object Model: All PDF object types supported
  • Content Stream Parsing: Full operator support for rendering
  • Resource Management: Fonts, images, color spaces, patterns
  • Text Extraction: With position and formatting information
  • Page Navigation: Efficient page tree traversal
  • Stream Filters: Decoding support (FlateDecode, ASCIIHexDecode, etc.)
  • Reference Resolution: Automatic handling of indirect objects
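
To make the stream filter item concrete, here is a minimal, self-contained sketch of what an ASCIIHexDecode filter does according to the PDF specification: hex digit pairs become bytes, white space is skipped, `>` ends the data, and a trailing unpaired digit is treated as if followed by `0`. The function name `ascii_hex_decode` and its `String` error type are assumptions made for illustration; they are not taken from this crate's `filters` module.

```rust
/// Decode ASCIIHexDecode data.
/// Hypothetical helper for illustration; not part of oxidize_pdf's API.
fn ascii_hex_decode(input: &[u8]) -> Result<Vec<u8>, String> {
    let mut out = Vec::with_capacity(input.len() / 2);
    // High nibble that is still waiting for its partner digit.
    let mut pending: Option<u8> = None;

    for &byte in input {
        let nibble = match byte {
            // `>` is the end-of-data marker.
            b'>' => break,
            // PDF white-space characters are ignored.
            0x00 | b'\t' | b'\n' | 0x0C | b'\r' | b' ' => continue,
            b'0'..=b'9' => byte - b'0',
            b'a'..=b'f' => byte - b'a' + 10,
            b'A'..=b'F' => byte - b'A' + 10,
            other => return Err(format!("invalid hex byte 0x{other:02X}")),
        };
        match pending.take() {
            Some(high) => out.push((high << 4) | nibble),
            None => pending = Some(nibble),
        }
    }

    // An odd number of digits behaves as if a 0 followed the last digit.
    if let Some(high) = pending {
        out.push(high << 4);
    }
    Ok(out)
}
```

Calling `ascii_hex_decode(b"48 65 6C6c 6F>")` yields the bytes of `Hello`. A real filter would typically report errors through the parser's own error type rather than a `String`.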

§Example: Building a Simple Renderer

use oxidize_pdf::parser::{PdfDocument, PdfReader};
use oxidize_pdf::parser::content::{ContentParser, ContentOperation};

struct SimpleRenderer {
    current_path: Vec<(f32, f32)>,
}

impl SimpleRenderer {
    fn render_page(document: &PdfDocument<std::fs::File>, page_idx: u32) -> Result<(), Box<dyn std::error::Error>> {
        let page = document.get_page(page_idx)?;
        // `document` is already a reference, so it is passed through as-is.
        let streams = page.content_streams_with_document(document)?;

        let mut renderer = SimpleRenderer {
            current_path: Vec::new(),
        };

        for stream in streams {
            let operations = ContentParser::parse(&stream)?;
            for op in operations {
                match op {
                    ContentOperation::MoveTo(x, y) => {
                        renderer.current_path.clear();
                        renderer.current_path.push((x, y));
                    }
                    ContentOperation::LineTo(x, y) => {
                        renderer.current_path.push((x, y));
                    }
                    ContentOperation::Stroke => {
                        println!("Draw path with {} points", renderer.current_path.len());
                        renderer.current_path.clear();
                    }
                    ContentOperation::ShowText(text) => {
                        println!("Draw text: {:?}", String::from_utf8_lossy(&text));
                    }
                    _ => {} // Handle other operations
                }
            }
        }
        Ok(())
    }
}
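
The path handling in the example above can be tried without a PDF file. The following sketch isolates the same state machine over a small hypothetical `Op` enum (a stand-in with three variants, not the crate's `ContentOperation`) and returns the stroked paths instead of printing them. Like the example, it simplifies: it restarts the path on every `MoveTo`, whereas in PDF a move-to begins a new subpath of the current path.

```rust
/// Hypothetical stand-in for a few path operators; the crate's
/// `ContentOperation` has many more variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    MoveTo(f32, f32),
    LineTo(f32, f32),
    Stroke,
}

/// Collect one point list per stroked path.
fn stroked_paths(ops: &[Op]) -> Vec<Vec<(f32, f32)>> {
    let mut current = Vec::new();
    let mut finished = Vec::new();
    for &op in ops {
        match op {
            Op::MoveTo(x, y) => {
                // Simplification: start over instead of adding a subpath.
                current.clear();
                current.push((x, y));
            }
            Op::LineTo(x, y) => current.push((x, y)),
            // Stroking consumes the current path and leaves it empty.
            Op::Stroke => finished.push(std::mem::take(&mut current)),
        }
    }
    finished
}
```

Returning the paths rather than printing them keeps the logic testable; a renderer would hand each finished path to its drawing backend at the `Stroke` arm.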

Re-exports§

pub use self::content::ContentOperation;
pub use self::content::ContentParser;
pub use self::content::TextElement;
pub use self::document::PdfDocument;
pub use self::document::ResourceManager;
pub use self::encoding::CharacterDecoder;
pub use self::encoding::EncodingOptions;
pub use self::encoding::EncodingResult;
pub use self::encoding::EncodingType;
pub use self::encoding::EnhancedDecoder;
pub use self::encryption_handler::ConsolePasswordProvider;
pub use self::encryption_handler::EncryptionHandler;
pub use self::encryption_handler::EncryptionInfo;
pub use self::encryption_handler::InteractiveDecryption;
pub use self::encryption_handler::PasswordProvider;
pub use self::encryption_handler::PasswordResult;
pub use self::objects::PdfArray;
pub use self::objects::PdfDictionary;
pub use self::objects::PdfName;
pub use self::objects::PdfObject;
pub use self::objects::PdfStream;
pub use self::objects::PdfString;
pub use self::optimized_reader::OptimizedPdfReader;
pub use self::page_tree::ParsedPage;
pub use self::reader::DocumentMetadata;
pub use self::reader::PdfReader;

Modules§

  • content: PDF Content Stream Parser - Complete support for PDF graphics operators
  • document: PDF Document wrapper - High-level interface for PDF parsing and manipulation
  • encoding: Character Encoding Detection and Conversion Module
  • encryption_handler: PDF encryption detection and password handling
  • filter_impls: PDF stream filter implementations
  • filters: PDF Stream Filters
  • header: PDF Header Parser
  • lexer: PDF Lexer
  • object_stream: PDF Object Stream Parser
  • objects: PDF Object Parser - Core PDF data types and parsing
  • optimized_reader: Optimized PDF Reader with LRU caching
  • page_tree: PDF Page Tree Parser
  • reader: High-level PDF Reader API
  • stack_safe: Stack-safe parsing utilities
  • stack_safe_tests: Comprehensive tests for stack-safe parsing implementations
  • trailer: PDF Trailer Parser
  • xref: PDF Cross-Reference Table Parser
  • xref_stream: Cross-reference stream support for PDF 1.5+
  • xref_types: XRef Entry Type Definitions
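
As background for the `xref` and `xref_types` modules: each entry in a classic cross-reference table is a fixed-width line holding a 10-digit number, a 5-digit generation number, and a keyword `n` (in use) or `f` (free). The sketch below parses one such line. `ClassicXrefEntry` and `parse_xref_entry` are hypothetical names chosen for this illustration and do not describe the crate's actual types.

```rust
/// One entry of a classic cross-reference table.
/// Hypothetical type for illustration; not the crate's own definition.
#[derive(Debug, PartialEq)]
enum ClassicXrefEntry {
    /// Object stored at `offset` bytes from the start of the file.
    InUse { offset: u64, generation: u16 },
    /// Free entry pointing at the next free object number.
    Free { next_free: u64, generation: u16 },
}

/// Parse a line such as `0000000017 00000 n`.
fn parse_xref_entry(line: &str) -> Option<ClassicXrefEntry> {
    let mut fields = line.split_ascii_whitespace();
    let first = fields.next()?;
    let second = fields.next()?;
    let kind = fields.next()?;

    // Enforce the fixed field widths and reject trailing fields.
    let all_digits = |s: &str| s.bytes().all(|b| b.is_ascii_digit());
    if first.len() != 10 || second.len() != 5 || fields.next().is_some() {
        return None;
    }
    if !all_digits(first) || !all_digits(second) {
        return None;
    }

    let number: u64 = first.parse().ok()?;
    let generation: u16 = second.parse().ok()?;
    match kind {
        "n" => Some(ClassicXrefEntry::InUse { offset: number, generation }),
        "f" => Some(ClassicXrefEntry::Free { next_free: number, generation }),
        _ => None,
    }
}
```

The sketch is deliberately strict about field widths; a parser meant for damaged real-world files would likely accept more variation. Cross-reference streams (PDF 1.5+) use a binary layout and are not covered here.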

Structs§

  • ParseOptions: Options for parsing PDF files with different levels of strictness

Enums§

  • ParseError: PDF Parser errors covering all failure modes during parsing
  • ParseWarning: Warnings that can be collected during lenient parsing

Type Aliases§

  • ParseResult: Result type for parser operations
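
The items above suggest a common pattern: options select a strictness level, strict mode turns a problem into an error, and lenient mode records a warning and continues. The sketch below shows that pattern on a small task, reading the version from a `%PDF-x.y` header that may be preceded by stray bytes. Every name in it (`SketchOptions`, `SketchError`, `SketchWarning`, `SketchResult`, `parse_version`) is an assumption made for illustration; the fields and variants of the crate's `ParseOptions`, `ParseError`, and `ParseWarning` are not shown on this page and may differ.

```rust
/// Hypothetical warning type; not the crate's `ParseWarning`.
#[derive(Debug, PartialEq)]
enum SketchWarning {
    HeaderNotAtStart { offset: usize },
}

/// Hypothetical error type; not the crate's `ParseError`.
#[derive(Debug, PartialEq)]
enum SketchError {
    MissingHeader,
    HeaderNotAtStart { offset: usize },
    BadVersion,
}

/// Hypothetical options type; not the crate's `ParseOptions`.
struct SketchOptions {
    lenient: bool,
}

type SketchResult<T> = Result<T, SketchError>;

/// Return `(major, minor)` from a `%PDF-x.y` header.
fn parse_version(
    data: &[u8],
    opts: &SketchOptions,
    warnings: &mut Vec<SketchWarning>,
) -> SketchResult<(u8, u8)> {
    const MARKER: &[u8] = b"%PDF-";
    let offset = data
        .windows(MARKER.len())
        .position(|window| window == MARKER)
        .ok_or(SketchError::MissingHeader)?;

    if offset != 0 {
        if !opts.lenient {
            // Strict mode: junk before the header is fatal.
            return Err(SketchError::HeaderNotAtStart { offset });
        }
        // Lenient mode: note the problem and keep going.
        warnings.push(SketchWarning::HeaderNotAtStart { offset });
    }

    let start = offset + MARKER.len();
    match data.get(start..start + 3) {
        Some(&[major, b'.', minor]) if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Ok((major - b'0', minor - b'0'))
        }
        _ => Err(SketchError::BadVersion),
    }
}
```

With `lenient: false`, the input `junk%PDF-2.0` fails with `HeaderNotAtStart { offset: 4 }`; with `lenient: true`, the same input returns `(2, 0)` and leaves one warning in the vector. Collecting warnings in a caller-supplied vector is only one possible design.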