# Architecture Overview
System design and implementation details for pdf_oxide.
## Table of Contents
- [High-Level Architecture](#high-level-architecture)
- [Core Components](#core-components)
- [Data Flow](#data-flow)
- [Module Organization](#module-organization)
- [Algorithm Details](#algorithm-details)
- [Performance Design](#performance-design)
- [Extension Points](#extension-points)
- [Security Considerations](#security-considerations)
- [Future Architecture](#future-architecture)
- [References](#references)
## High-Level Architecture
```
┌──────────────────────────────────────────────────────────────────┐
│                         Public API Layer                          │
│             (PdfDocument, Exporters, Configuration)               │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
┌────────────────────────────────┴─────────────────────────────────┐
│                       Processing Pipeline                         │
│                                                                   │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐   ┌─────────────┐     │
│  │  Lexer   │──▶│ Parser  │──▶│  Stream  │──▶│   Layout    │     │
│  │          │   │         │   │  Decoder │   │  Analysis   │     │
│  └──────────┘   └─────────┘   └──────────┘   └─────────────┘     │
│                                                     │             │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐          │             │
│  │   Text   │◀──│  Font   │◀──│  Object  │◀─────────┘             │
│  │Extractor │   │ Parser  │   │ Resolver │                        │
│  └──────────┘   └─────────┘   └──────────┘                        │
│       │                                                           │
│       ▼                                                           │
│  ┌──────────┐   ┌─────────┐   ┌──────────┐                        │
│  │ Markdown │   │  HTML   │   │   Text   │                        │
│  │ Exporter │   │Exporter │   │ Exporter │                        │
│  └──────────┘   └─────────┘   └──────────┘                        │
└────────────────────────────────┬─────────────────────────────────┘
                                 │
┌────────────────────────────────┴─────────────────────────────────┐
│                          Storage Layer                            │
│         (Memory-mapped files, Object cache, Font cache)           │
└──────────────────────────────────────────────────────────────────┘
```
## Core Components
### 1. Lexer (`src/lexer.rs`)
**Purpose:** Tokenize PDF byte stream into lexical tokens.
**Key Features:**
- Zero-copy tokenization using byte slices
- Handles all PDF token types (numbers, strings, names, operators)
- Efficient whitespace and comment skipping
- Position tracking for error reporting
**Performance:**
- ~1-2ms for typical PDFs
- O(n) complexity where n = file size
- Minimal allocations (uses `&[u8]` slices)
**Example:**
```rust
pub struct Lexer<'a> {
    data: &'a [u8],
    position: usize,
}

impl<'a> Lexer<'a> {
    pub fn next_token(&mut self) -> Result<Token<'a>> {
        // Zero-copy token extraction
    }
}
```
### 2. Parser (`src/parser.rs`)
**Purpose:** Parse PDF objects from token stream.
**Key Features:**
- Recursive descent parser
- Cycle detection for circular references
- Recursion depth limiting (max 100 levels)
- Proper error context propagation
**Object Types:**
- Null, Boolean, Integer, Real
- String, Name
- Array, Dictionary
- Stream, Reference
**Example:**
```rust
pub struct Parser<'a> {
    lexer: Lexer<'a>,
    objects: HashMap<ObjectId, Object>,
    recursion_depth: u32,
}

impl<'a> Parser<'a> {
    pub fn parse_object(&mut self) -> Result<Object> {
        // Recursive object parsing with cycle detection
    }
}
```
### 3. Stream Decoder (`src/stream_decoder.rs`)
**Purpose:** Decompress PDF streams using various filters.
**Supported Filters:**
- **FlateDecode**: Zlib/Deflate (most common)
- **LZWDecode**: LZW compression
- **ASCII85Decode**: ASCII85 encoding
- **ASCIIHexDecode**: Hex encoding
- **RunLengthDecode**: Run-length encoding
- **DCTDecode**: JPEG (passthrough)
- **CCITTFaxDecode**: CCITT Group 3/4
**Design:**
```rust
pub trait StreamFilter {
    fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}

pub struct StreamDecoder {
    filters: Vec<Box<dyn StreamFilter>>,
}
```
**Performance:**
- Streaming decompression where possible
- Decompression bomb protection (10× ratio limit)
- Reusable buffers to minimize allocations
### 4. Layout Analysis (`src/layout/`)
**Purpose:** Understand document spatial structure.
**Submodules:**
#### 4.1 DBSCAN Clustering (`layout/dbscan.rs`)
**Algorithm:** Density-Based Spatial Clustering of Applications with Noise
**Parameters:**
- Epsilon (ε): 1.5× median character height
- MinPts: 2 for chars→words, 3 for words→lines
**Data Structure:** R*-tree for O(log n) neighborhood queries
**Performance:** O(n log n) instead of naive O(n²)
```rust
pub struct DbscanClusterer {
    spatial_index: RTree<Character>,
    epsilon: f32,
    min_points: usize,
}

impl DbscanClusterer {
    pub fn cluster(&self, chars: &[Character]) -> Vec<Cluster> {
        // DBSCAN with spatial indexing
    }
}
```
#### 4.2 XY-Cut (`layout/xy_cut.rs`)
**Algorithm:** Recursive projection-based page segmentation
**Steps:**
1. Project content onto X-axis (horizontal)
2. Find gaps in projection (valleys)
3. Split at largest gap
4. Recurse on sub-regions
5. Repeat for Y-axis
**Configuration:**
- Split threshold: 0.05 × page dimension (5%)
- Max recursion: 10 levels
- Gaussian smoothing: σ=2.0
```rust
pub struct XyCutSegmenter {
    threshold: f32,
    max_depth: u32,
}

impl XyCutSegmenter {
    pub fn segment(&self, page: &Page) -> Vec<Region> {
        self.segment_recursive(page, 0)
    }

    fn segment_recursive(&self, region: &Region, depth: u32) -> Vec<Region> {
        // Recursive XY-Cut with projection analysis
    }
}
```
#### 4.3 Font Clustering (`layout/font_clustering.rs`)
**Purpose:** Group text by font characteristics for formatting detection.
**Features:**
- Size tolerance: ±1pt
- Family exact matching
- Outlier rejection: fonts used in <2% of characters
**Use Cases:**
- Bold/italic detection
- Heading identification
- Style consistency checking
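**Example (sketch):** as an illustration of the grouping rule above, the sketch below buckets glyphs by font family and whole-point size, then drops clusters below the 2% outlier threshold. The `FontKey` and `Glyph` types are illustrative, not the crate's actual API.

```rust
use std::collections::HashMap;

/// Illustrative grouping key: sizes are bucketed to whole points,
/// approximating the ±1pt tolerance described above.
#[derive(Hash, PartialEq, Eq)]
struct FontKey {
    family: String,
    size_pt: i32,
}

struct Glyph {
    family: String,
    size: f32,
}

/// Group glyphs by (family, rounded size); clusters covering fewer than 2%
/// of all glyphs are rejected as outliers.
fn cluster_by_font(glyphs: &[Glyph]) -> HashMap<FontKey, Vec<&Glyph>> {
    let mut clusters: HashMap<FontKey, Vec<&Glyph>> = HashMap::new();
    for g in glyphs {
        let key = FontKey {
            family: g.family.clone(),
            size_pt: g.size.round() as i32,
        };
        clusters.entry(key).or_default().push(g);
    }
    let min_count = (glyphs.len() as f32 * 0.02).ceil() as usize;
    clusters.retain(|_, members| members.len() >= min_count);
    clusters
}
```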
### 5. Text Extraction (`src/text.rs`)
**Purpose:** Extract text with proper spacing and formatting.
**Key Algorithms:**
- **Word spacing detection**: 0.25× character width threshold
- **Line spacing detection**: 1.2× character height threshold
- **Bold detection**: Font weight analysis
- **Reading order**: Top-to-bottom, left-to-right (with column detection)
**Quality:**
- 100% word spacing accuracy (tested on 103 PDFs)
- 100% whitespace preservation
- 37% more bold sections detected vs. reference
```rust
pub struct TextExtractor {
    config: TextConfig,
    font_cache: HashMap<String, Font>,
}

impl TextExtractor {
    pub fn extract(&mut self, page: &Page) -> Result<String> {
        // Character clustering → word formation → line formation → text
    }
}
```
### 6. Font Parser (`src/font.rs`)
**Purpose:** Parse PDF fonts and handle character encoding.
**Features:**
- **Standard fonts**: Built-in Type1 fonts (Times, Helvetica, etc.)
- **ToUnicode CMap**: Unicode mapping for custom encodings
- **Encoding tables**: PDFDocEncoding, MacRomanEncoding, WinAnsiEncoding
- **Font metrics**: Character widths, ascent, descent
**Caching:**
- Parsed fonts cached per document
- Reduces re-parsing overhead for multi-page documents
- ~25% speedup on typical documents
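**Example (sketch):** a hypothetical get-or-parse sketch of the per-document cache described above; fonts are parsed once per resource name and reused on later pages. `Font::parse`, `Dictionary`, and the error plumbing stand in for the crate's real items.

```rust
use std::collections::HashMap;

struct FontCache {
    fonts: HashMap<String, Font>,
}

impl FontCache {
    /// Parse a font dictionary at most once per resource name.
    fn get_or_parse(&mut self, name: &str, dict: &Dictionary) -> Result<&Font> {
        if !self.fonts.contains_key(name) {
            // Expensive path: encodings, ToUnicode CMap, width tables.
            let font = Font::parse(dict)?;
            self.fonts.insert(name.to_string(), font);
        }
        Ok(&self.fonts[name])
    }
}
```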
### 7. Image Extraction (`src/image.rs`)
**Purpose:** Extract embedded images from PDFs.
**Supported Formats:**
- JPEG (DCTDecode)
- PNG (FlateDecode + predictor)
- TIFF (CCITTFaxDecode)
- Raw RGB/CMYK
**Features:**
- Color space conversion (CMYK → RGB, etc.)
- Predictor functions (PNG filters)
- Decompression
- Metadata extraction (dimensions, DPI)
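**Example (sketch):** as one concrete instance of the color space conversion step, here is the standard naive CMYK → RGB formula; a production converter might apply ICC profiles instead.

```rust
/// Naive CMYK → RGB conversion. All components are in 0..=255.
fn cmyk_to_rgb(c: u8, m: u8, y: u8, k: u8) -> (u8, u8, u8) {
    let (c, m, y, k) = (
        c as f32 / 255.0,
        m as f32 / 255.0,
        y as f32 / 255.0,
        k as f32 / 255.0,
    );
    let r = 255.0 * (1.0 - c) * (1.0 - k);
    let g = 255.0 * (1.0 - m) * (1.0 - k);
    let b = 255.0 * (1.0 - y) * (1.0 - k);
    (r.round() as u8, g.round() as u8, b.round() as u8)
}
```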
### 8. Exporters (`src/converters/`)
**Purpose:** Convert extracted content to various formats.
#### 8.1 Markdown Exporter
**Quality:** 99.8/100 (tested on 103 PDFs)
**Features:**
- Bold text detection and formatting
- Heading detection (font size heuristics)
- List detection
- Link preservation
- Proper whitespace normalization
#### 8.2 HTML Exporter
**Quality:** 94.0/100
**Features:**
- Semantic HTML5
- URL/email linkification
- Bold/italic tags
- Proper entity encoding
- CSS-friendly structure
#### 8.3 Plain Text Exporter
**Quality:** 100.0/100 (whitespace)
**Features:**
- Whitespace normalization
- Word spacing preservation
- Line break consistency
## Data Flow
### Opening a Document
```
1. User calls PdfDocument::open("file.pdf")
       ↓
2. Memory-map file (or load into buffer)
       ↓
3. Parse header, locate cross-reference table
       ↓
4. Build object index (lazy loading)
       ↓
5. Parse catalog and page tree
       ↓
6. Return PdfDocument handle
```
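**Usage sketch:** a minimal example of this flow. `PdfDocument::open` is the entry point named in step 1; the error conversion through `Box<dyn Error>` is an assumption about the crate's error type.

```rust
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Steps 1–6 above all happen inside `open`: the file is memory-mapped,
    // the cross-reference table is indexed, and objects are loaded lazily.
    let _doc = PdfDocument::open("file.pdf")?;
    Ok(())
}
```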
### Extracting Text from a Page
```
 1. User calls doc.extract_text(page_num)
        ↓
 2. Resolve page object from page tree
        ↓
 3. Get content stream(s)
        ↓
 4. Decode streams (decompress)
        ↓
 5. Parse content stream operators
        ↓
 6. Extract character data with positions
        ↓
 7. Cluster characters into words (DBSCAN)
        ↓
 8. Cluster words into lines (DBSCAN)
        ↓
 9. Detect columns (XY-Cut)
        ↓
10. Sort into reading order
        ↓
11. Format text with proper spacing
        ↓
12. Return extracted text
```
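**Usage sketch:** running the pipeline above for a single page, assuming zero-indexed page numbers and that the crate's error type converts into `Box<dyn Error>`.

```rust
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("report.pdf")?;
    // Runs the full pipeline above (decode → cluster → order → format)
    // for one page; pages are assumed to be zero-indexed here.
    let text = doc.extract_text(0)?;
    println!("{text}");
    Ok(())
}
```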
### Exporting to Markdown
```
1. User calls exporter.export_all(doc)
       ↓
2. For each page:
   ├─ Extract text blocks with metadata
   ├─ Detect bold sections (font analysis)
   ├─ Detect headings (size heuristics)
   └─ Detect lists (indentation + markers)
       ↓
3. Format as Markdown:
   ├─ Headings: # ## ###
   ├─ Bold: **text**
   ├─ Lists: - item
   └─ Links: [text](url)
       ↓
4. Concatenate pages with separators
       ↓
5. Return Markdown string
```
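**Usage sketch:** the export flow end to end. `export_all` is the entry point named in step 1; the `converters::MarkdownExporter` path and `new()` constructor are assumptions about the public API.

```rust
use pdf_oxide::{converters::MarkdownExporter, PdfDocument};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("paper.pdf")?;
    // Constructor name assumed; `export_all` drives steps 2–5 above.
    let exporter = MarkdownExporter::new();
    let markdown = exporter.export_all(&mut doc)?;
    std::fs::write("paper.md", markdown)?;
    Ok(())
}
```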
## Module Organization
```
src/
├── lib.rs               # Public API, re-exports
├── error.rs             # Error types
├── object.rs            # PDF object types
├── lexer.rs             # Tokenization
├── parser.rs            # Object parsing
├── document.rs          # High-level API
├── stream_decoder.rs    # Stream decompression
├── font.rs              # Font parsing
├── text.rs              # Text extraction
├── image.rs             # Image extraction
├── form.rs              # Form field extraction
├── bookmark.rs          # Bookmark extraction
├── annotation.rs        # Annotation extraction
│
├── layout/              # Layout analysis
│   ├── mod.rs
│   ├── dbscan.rs        # DBSCAN clustering
│   ├── xy_cut.rs        # XY-Cut segmentation
│   ├── font_clustering.rs
│   └── spatial_index.rs # R*-tree
│
├── converters/          # Export formats
│   ├── mod.rs
│   ├── markdown.rs      # Markdown exporter
│   ├── html.rs          # HTML exporter
│   └── text.rs          # Plain text exporter
│
└── utils/               # Utilities
    ├── mod.rs
    ├── encoding.rs      # Character encoding
    └── geometry.rs      # Bounding box math

python/                  # Python bindings
├── src/
│   └── lib.rs           # PyO3 bindings
└── pyproject.toml

tests/                   # Integration tests
├── test_parsing.rs
├── test_text_extraction.rs
├── test_exporters.rs
└── fixtures/            # Test PDFs

benches/                 # Performance benchmarks
└── parsing.rs

examples/                # Usage examples
├── basic.rs
├── batch_processing.rs
└── custom_exporter.rs
```
## Algorithm Details
### DBSCAN Clustering
**Pseudo-code:**
```
function dbscan(points, epsilon, minPts):
    clusters = []
    visited = set()

    for point in points:
        if point in visited:
            continue
        visited.add(point)

        neighbors = find_neighbors(point, epsilon)   # O(log n) with R*-tree
        if len(neighbors) < minPts:
            # Noise point
            continue

        # Start new cluster
        cluster = [point]
        queue = neighbors

        while queue:
            neighbor = queue.pop()
            if neighbor not in visited:
                visited.add(neighbor)
                new_neighbors = find_neighbors(neighbor, epsilon)
                if len(new_neighbors) >= minPts:
                    queue.extend(new_neighbors)
            if neighbor not in any cluster:
                cluster.append(neighbor)

        clusters.append(cluster)

    return clusters
```
**Optimization:** R*-tree spatial index allows O(log n) neighbor queries instead of O(n) linear scan.
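For illustration, the ε-neighborhood query might look like the following when built on the `rstar` crate; the crate's own `spatial_index.rs` wrapper may differ. Points are glyph centers as `(x, y)` pairs.

```rust
use rstar::RTree;

/// Build the spatial index once per page from glyph centers.
fn build_index(centers: Vec<[f32; 2]>) -> RTree<[f32; 2]> {
    RTree::bulk_load(centers)
}

/// Return every point within `epsilon` of `point`.
/// `locate_within_distance` takes the *squared* radius and answers each
/// query in roughly O(log n) thanks to the R*-tree.
fn find_neighbors(tree: &RTree<[f32; 2]>, point: [f32; 2], epsilon: f32) -> Vec<[f32; 2]> {
    tree.locate_within_distance(point, epsilon * epsilon)
        .copied()
        .collect()
}
```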
### XY-Cut Segmentation
**Pseudo-code:**
```
function xy_cut(region, depth):
    if depth > MAX_DEPTH:
        return [region]

    # Project onto X-axis
    x_projection = project_horizontal(region)
    x_gaps = find_gaps(x_projection, threshold)

    if x_gaps:
        split_x = largest_gap(x_gaps)
        left, right = split_region(region, split_x, vertical=True)
        return xy_cut(left, depth+1) + xy_cut(right, depth+1)

    # Project onto Y-axis
    y_projection = project_vertical(region)
    y_gaps = find_gaps(y_projection, threshold)

    if y_gaps:
        split_y = largest_gap(y_gaps)
        top, bottom = split_region(region, split_y, vertical=False)
        return xy_cut(top, depth+1) + xy_cut(bottom, depth+1)

    # No split found
    return [region]
```
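A sketch of one possible `find_gaps` implementation for the horizontal pass, using a simple coverage histogram. Gaussian smoothing is omitted for brevity, and the `BBox` type and bin size are illustrative, not the crate's actual ones.

```rust
struct BBox {
    x0: f32,
    x1: f32,
}

/// Return (start, end) of each horizontal gap wider than `min_gap`,
/// in the same units as the boxes (e.g. PDF points).
fn find_gaps(boxes: &[BBox], region_width: f32, min_gap: f32) -> Vec<(f32, f32)> {
    const BIN: f32 = 1.0; // 1-unit histogram bins
    let bins = (region_width / BIN).ceil() as usize;
    let mut covered = vec![false; bins.max(1)];

    // Mark every bin touched by a bounding box.
    for b in boxes {
        let lo = (b.x0 / BIN).floor().max(0.0) as usize;
        let hi = ((b.x1 / BIN).ceil() as usize).min(covered.len());
        for bin in covered.iter_mut().take(hi).skip(lo) {
            *bin = true;
        }
    }

    // Collect runs of uncovered bins at least `min_gap` wide.
    let mut gaps = Vec::new();
    let mut run_start: Option<usize> = None;
    for (i, &c) in covered.iter().enumerate() {
        match (c, run_start) {
            (false, None) => run_start = Some(i),
            (true, Some(s)) => {
                if (i - s) as f32 * BIN >= min_gap {
                    gaps.push((s as f32 * BIN, i as f32 * BIN));
                }
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = run_start {
        if (covered.len() - s) as f32 * BIN >= min_gap {
            gaps.push((s as f32 * BIN, covered.len() as f32 * BIN));
        }
    }
    gaps
}
```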
### Word Spacing Detection
**Algorithm:**
```
For each pair of adjacent characters (c1, c2):
    gap = c2.x - (c1.x + c1.width)
    threshold = 0.25 * c1.width
    if gap > threshold:
        insert_space()
```
**Dynamic threshold:** The gap threshold scales with the width of the preceding character, so spacing detection adapts to font size without a fixed point value.
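The same rule as a small, self-contained function; the `Char` type is illustrative.

```rust
struct Char {
    x: f32,
    width: f32,
    ch: char,
}

/// Join characters into a string, inserting a space whenever the gap to the
/// next character exceeds 0.25× the current character's width.
fn join_with_spaces(chars: &[Char]) -> String {
    let mut out = String::new();
    for pair in chars.windows(2) {
        let (c1, c2) = (&pair[0], &pair[1]);
        out.push(c1.ch);
        let gap = c2.x - (c1.x + c1.width);
        if gap > 0.25 * c1.width {
            out.push(' ');
        }
    }
    if let Some(last) = chars.last() {
        out.push(last.ch);
    }
    out
}
```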
## Performance Design
### Zero-Copy Parsing
**Technique:** Use byte slices (`&[u8]`) instead of copying data.
**Example:**
```rust
// Zero-copy token
pub enum Token<'a> {
    Name(&'a [u8]),
    String(&'a [u8]),
    // ...
}

// No allocation needed
let token = lexer.next_token()?;
```
### Memory-Mapped Files
**Technique:** Use `mmap` for large files.
**Benefits:**
- OS handles memory management
- Lazy loading (only accessed pages loaded)
- Shared memory across processes
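**Example (sketch):** the mapping step, assuming the `memmap2` crate; the actual storage layer may wrap this differently.

```rust
use std::fs::File;

use memmap2::Mmap;

// Mapping is `unsafe` because the file must not be truncated or modified
// by another process while the map is alive.
fn map_pdf(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    Ok(mmap) // derefs to &[u8]; pages are faulted in lazily by the OS
}
```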
### Object Caching
**Strategy:** Cache parsed objects by object ID.
**Benefits:**
- Avoid re-parsing referenced objects
- O(1) lookup for cross-references
- Memory bounded by unique object count
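**Example (sketch):** a hypothetical get-or-parse cache keyed by object ID; `ObjectId`, `Object`, and `parse_object_at` stand in for the crate's real items.

```rust
use std::collections::HashMap;
use std::rc::Rc;

struct ObjectCache {
    objects: HashMap<ObjectId, Rc<Object>>,
}

impl ObjectCache {
    /// Parse each indirect object at most once; later lookups are O(1).
    fn resolve(&mut self, id: ObjectId, offset: u64) -> Result<Rc<Object>> {
        if let Some(obj) = self.objects.get(&id) {
            return Ok(Rc::clone(obj)); // cache hit, no re-parse
        }
        let obj = Rc::new(parse_object_at(offset)?);
        self.objects.insert(id, Rc::clone(&obj));
        Ok(obj)
    }
}
```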
### Streaming
**Technique:** Process pages one at a time.
**Benefits:**
- Constant memory regardless of document size
- Can process 10 GB documents with <100 MB RAM
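**Example (sketch):** page-at-a-time processing; `page_count()` is an assumed accessor, and each page's text is dropped before the next page is read, so peak memory stays roughly constant.

```rust
use pdf_oxide::PdfDocument;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut doc = PdfDocument::open("large.pdf")?;
    for page in 0..doc.page_count() {
        // Only one page's extracted text is alive at a time.
        let text = doc.extract_text(page)?;
        process(&text);
    }
    Ok(())
}

fn process(text: &str) {
    println!("{} chars", text.len());
}
```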
## Extension Points
### Custom Exporters
Implement the `Exporter` trait:
```rust
pub trait Exporter {
    fn export_page(&self, doc: &mut PdfDocument, page_num: u32) -> Result<String>;
    fn export_all(&self, doc: &mut PdfDocument) -> Result<String>;
}
```
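For example, a toy exporter built on this trait might look like the following; `page_count()` is an assumed accessor, and `extract_text` is the method used in the data-flow section above.

```rust
struct PlainUppercaseExporter;

impl Exporter for PlainUppercaseExporter {
    fn export_page(&self, doc: &mut PdfDocument, page_num: u32) -> Result<String> {
        // Reuse the built-in extraction, then post-process the text.
        Ok(doc.extract_text(page_num)?.to_uppercase())
    }

    fn export_all(&self, doc: &mut PdfDocument) -> Result<String> {
        let mut out = String::new();
        for page in 0..doc.page_count() {
            out.push_str(&self.export_page(doc, page)?);
            out.push_str("\n\n");
        }
        Ok(out)
    }
}
```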
### Custom Stream Filters
Implement the `StreamFilter` trait:
```rust
pub trait StreamFilter {
    fn decode(&self, data: &[u8], params: &DecodeParms) -> Result<Vec<u8>>;
}
```
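For example, an ASCIIHexDecode-style filter could be plugged in like this; for brevity, invalid bytes are silently skipped rather than reported.

```rust
struct AsciiHexFilter;

impl StreamFilter for AsciiHexFilter {
    fn decode(&self, data: &[u8], _params: &DecodeParms) -> Result<Vec<u8>> {
        // Collect hex digits up to the '>' end-of-data marker, ignoring whitespace.
        let digits: Vec<u8> = data
            .iter()
            .copied()
            .take_while(|&b| b != b'>')
            .filter(|b| b.is_ascii_hexdigit())
            .map(|b| (b as char).to_digit(16).unwrap() as u8)
            .collect();
        // Pair up nibbles; per the PDF spec a trailing odd nibble is padded with 0.
        let out = digits
            .chunks(2)
            .map(|pair| (pair[0] << 4) | pair.get(1).copied().unwrap_or(0))
            .collect();
        Ok(out)
    }
}
```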
### Custom Layout Analyzers
Replace or extend layout analysis:
```rust
pub trait LayoutAnalyzer {
    fn analyze(&self, page: &Page) -> Result<Layout>;
}
```
## Security Considerations
### Input Validation
**PDF as untrusted input:**
- Limit file size (default: 500 MB)
- Limit object count (default: 1M objects)
- Limit recursion depth (default: 100 levels)
- Validate array/dictionary sizes
### Decompression Bombs
**Protection:**
- Limit decompressed size to 10× compressed size
- Check compression ratio before full decompression
- Stream decompression where possible
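**Example (sketch):** one way to enforce the ratio check during streaming inflation, assuming the `flate2` crate; output is capped at 10× the compressed size instead of trusting the stream's claimed length.

```rust
use std::io::Read;

use flate2::read::ZlibDecoder;

fn inflate_guarded(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
    // 10× budget, with a small floor so tiny streams are not rejected.
    let limit = compressed.len().saturating_mul(10).max(1024);
    let mut decoder = ZlibDecoder::new(compressed);
    let mut out = Vec::new();
    let mut chunk = [0u8; 8192];
    loop {
        let n = decoder.read(&mut chunk)?;
        if n == 0 {
            return Ok(out); // end of stream within the budget
        }
        out.extend_from_slice(&chunk[..n]);
        if out.len() > limit {
            return Err(std::io::Error::new(
                std::io::ErrorKind::InvalidData,
                "decompressed data exceeds 10x compressed size",
            ));
        }
    }
}
```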
### Memory Safety
**Rust guarantees:**
- No buffer overflows
- No use-after-free
- No data races
- Bounds checking on all array accesses
**Minimal `unsafe`:**
- Used sparingly (e.g., performance-critical paths)
- Thoroughly documented with safety invariants
- Reviewed by multiple developers
## Future Architecture
### Planned Improvements
**v1.x:**
- Stream API for ultra-low memory usage
- Table detection (smart heuristics)
- Rotated text handling
**v2.x:**
- ML-based layout analysis (optional)
- GPU acceleration (optional)
- Parallel page processing
**v3.x:**
- OCR integration (Tesseract)
- PDF/A validation
- Advanced form handling
## References
- **PDF Specification**: ISO 32000-1:2008 (PDF 1.7) - `docs/spec/pdf.md`
- **Architecture**: ARCHITECTURE.md - Detailed system design
- **DBSCAN**: Ester et al., "A Density-Based Algorithm for Discovering Clusters" (1996)
- **XY-Cut**: Nagy & Seth, "Hierarchical Representation of Optically Scanned Documents" (1984)
- **R*-tree**: Beckmann et al., "The R*-tree: An Efficient and Robust Access Method" (1990)
## Questions?
- **GitHub Issues**: https://github.com/yfedoseev/pdf_oxide/issues
- **Discussions**: https://github.com/yfedoseev/pdf_oxide/discussions
- **Documentation**: https://docs.rs/pdf_oxide