pdfrs 0.1.0 - Docs.rs

# PDF-CLI Architecture Documentation

## Overview

PDF-CLI is architectured as a modular system with clear separation of concerns. The design prioritizes maintainability, extensibility, and performance while implementing PDF functionality from scratch without external dependencies.

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        PDF-CLI CLI                          │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   PDF I/O   │  │ Markdown I/O│  │   Image I/O         │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  PDF Core   │  │  Markdown   │  │   Image Processing │  │
│  │  Engine     │  │  Parser     │  │   Engine           │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │Compression  │  │   Utility   │  │    Error            │  │
│  │   Module    │  │  Functions  │  │   Handling          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## Module Architecture

### 1. CLI Module (`src/main.rs`)

**Purpose**: Command-line interface and application orchestration

**Responsibilities**:

- Parse command-line arguments using `clap`
- Route commands to appropriate handlers
- Coordinate between modules
- Handle application-level error reporting

**Key Components**:

```rust
struct Cli {
    command: Commands,
}

enum Commands {
    PdfToMd { input: String, output: String },
    MdToPdf { input: String, output: String, font: String, font_size: f32 },
    Extract { input: String },
    Create { output: String, text: String, font: String, font_size: f32 },
    AddImage { pdf_file: String, image_file: String, x: f32, y: f32, width: f32, height: f32 },
}
```

### 2. PDF Core Engine (`src/pdf.rs`)

**Purpose**: PDF parsing and text extraction

**Architecture Pattern**: Document Object Model (DOM) parser

**Key Classes**:

```rust
pub struct PdfDocument {
    pub version: String,
    pub objects: HashMap<u32, PdfObject>,
    pub catalog: u32,
    pub pages: Vec<u32>,
}

pub enum PdfObject {
    Dictionary(HashMap<String, PdfValue>),
    Stream { dictionary: HashMap<String, PdfValue>, data: Vec<u8> },
    Array(Vec<PdfValue>),
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
    Reference(u32, u32),
    Name(String),
}
```

**Parsing Pipeline**:

```
PDF File → Header Parser → XRef Parser → Object Parser → Document Builder → Text Extractor
```

**Design Decisions**:

- Lazy loading of objects for memory efficiency
- Simple object model for maintainability
- Stream-based processing for large files

### 3. PDF Generator (`src/pdf_generator.rs`)

**Purpose**: Create PDF files from scratch

**Architecture Pattern**: Builder Pattern

**Key Classes**:

```rust
pub struct PdfGenerator {
    objects: Vec<PdfObject>,
    next_id: u32,
}

struct PdfObject {
    id: u32,
    generation: u32,
    content: String,
    is_stream: bool,
    stream_data: Option<Vec<u8>>,
}
```

**Generation Pipeline**:

```
Text Input → Page Builder → Content Stream Generator → Object Manager → File Writer
```

**Design Decisions**:

- Object ID management for proper references
- Content stream optimization
- PDF compliance with version 1.4 specification

### 4. Structured Elements (`src/elements.rs`)

**Purpose**: Define and parse structured document elements from Markdown

**Architecture Pattern**: Line Scanner → Element Classifier → Element Tree

**Key Types**:

```rust
pub enum Element {
    Heading { level: u8, text: String },
    Paragraph { text: String },
    UnorderedListItem { text: String, depth: u8 },
    OrderedListItem { number: u32, text: String, depth: u8 },
    TaskListItem { checked: bool, text: String },
    CodeBlock { language: String, code: String },
    InlineCode { code: String },
    TableRow { cells: Vec<String>, is_separator: bool, alignments: Vec<TableAlignment> },
    BlockQuote { text: String, depth: u8 },
    DefinitionItem { term: String, definition: String },
    Footnote { label: String, text: String },
    Link { text: String, url: String },
    Image { alt: String, path: String },
    StyledText { text: String, bold: bool, italic: bool },
    PageBreak,
    HorizontalRule,
    EmptyLine,
}
```

**Key Functions**:

- `parse_markdown()`: Parse markdown text into `Vec<Element>`
- `strip_inline_formatting()`: Remove bold/italic/code/link/strikethrough syntax

**Design Decisions**:

- Elements carry formatting intent (heading level, list depth, checked state)
- Inline formatting stripped at parse time; structure preserved for PDF rendering
- Enables font size variations, indentation, and page numbering in PDF output

### 5. Markdown Converter (`src/markdown.rs`)

**Purpose**: Orchestrate Markdown-to-PDF and Markdown-to-text conversion

**Architecture Pattern**: Pipeline Coordinator

**Parsing Pipeline**:

```
Markdown Text → elements::parse_markdown() → Vec<Element> → pdf_generator::create_pdf_from_elements()
```

**Key Functions**:

- `markdown_to_text()`: Convert Markdown to plain text (legacy, uses elements internally)
- `markdown_to_pdf_with_options()`: Convert with styling via structured elements
- `elements_to_text()`: Render elements back to plain text

### 6. Image Processing (`src/image.rs`)

**Purpose**: Handle image embedding in PDFs

**Architecture Pattern**: Strategy Pattern for different image formats

**Key Components**:

```rust
pub fn create_jpeg_image_object(
    generator: &mut PdfGenerator,
    jpeg_data: Vec<u8>,
    width: u32,
    height: u32
) -> Result<u32>

pub fn create_image_content_stream(
    x: f32,
    y: f32,
    width: f32,
    height: f32,
    image_name: &str
) -> Result<Vec<u8>>
```

**Supported Formats**:

- JPEG (with DCTDecode)
- PNG (planned)
- BMP (planned)

### 7. PDF Operations (`src/pdf_ops.rs`)

**Purpose**: High-level PDF manipulation operations

**Key Functions**:

- `merge_pdfs()`: Combine multiple PDFs into one
- `split_pdf()`: Extract page range into a new PDF
- `rotate_pdf()`: Apply rotation (0/90/180/270°) to all pages
- `create_pdf_with_metadata()`: Generate PDF with Info dictionary (title, author, etc.)
- `create_pdf_with_annotations()`: Generate PDF with text and link annotations
- `create_pdf_with_images()`: Place multiple JPEG images on a single page

**Key Types**:

- `PdfMetadata`: Document properties (title, author, subject, keywords, creator)
- `TextAnnotation`: Positioned text note on a page
- `LinkAnnotation`: Clickable URI region on a page
- `HighlightAnnotation`: Colored highlight rectangle with QuadPoints
- `Color`: RGB color struct for text rendering
- `TextAlign`: Left/Center alignment enum

### 8. PDF Validation (`src/pdf.rs` — validation functions)

**Purpose**: Validate PDF structural integrity

**Key Functions**:

- `validate_pdf()`: Validate a PDF file on disk
- `validate_pdf_bytes()`: Validate PDF bytes in memory (no filesystem needed)
- `parse_xref_stream()`: Parse PDF 1.5+ cross-reference streams
- `parse_object_stream()`: Parse compressed object streams (/Type /ObjStm)

**Key Types**:

```rust
pub struct PdfValidation {
    pub valid: bool,
    pub errors: Vec<String>,
    pub warnings: Vec<String>,
    pub page_count: usize,
    pub object_count: usize,
}
```

**Validation Checks**: PDF header, %%EOF marker, xref table, trailer, /Catalog, /Pages, page count, object/endobj pairing, stream/endstream pairing, /Root reference.

### 9. Compression Module (`src/compression.rs`)

**Purpose**: Handle PDF stream compression

**Architecture Pattern**: Strategy Pattern for compression algorithms

**Key Functions**:

- `decompress_deflate()`: Decompress zlib/deflate streams
- `compress_deflate()`: Compress data streams
- `decode_hex_string()`: Decode hex-encoded strings
- `encode_hex_string()`: Encode data as hex strings

## Data Flow Architecture

### PDF Generation Flow

```
Input Text → Markdown Parser → Content Builder → PDF Generator → File Writer
     ↓              ↓              ↓                ↓             ↓
Raw String → Structured Text → Content Streams → PDF Objects → Binary File
```

### PDF Parsing Flow

```
PDF File → Header Parser → Object Locator → Object Parser → Document Builder → Text Extractor
    ↓           ↓            ↓              ↓              ↓             ↓
Binary File → PDF Version → Object Offsets → PDF Objects → Document Model → Plain Text
```

### Conversion Flow

```
Source File → Appropriate Parser → Text Processing → Target Generator → Output File
     ↓              ↓                  ↓                ↓               ↓
PDF/MD File → PDF/MD Parser → Text Transformation → Target Generator → Target Format
```

## Design Patterns Used

### 1. Builder Pattern

- **Location**: `PdfGenerator`
- **Purpose**: Construct complex PDF objects step by step
- **Benefits**: Fluent interface, flexible configuration

### 2. Strategy Pattern

- **Location**: `Image` and `Compression` modules
- **Purpose**: Handle different formats and algorithms
- **Benefits**: Extensibility, interchangeable algorithms

### 3. Command Pattern

- **Location**: CLI module
- **Purpose**: Encapsulate user commands as objects
- **Benefits**: Decoupling, undo/redo capabilities

### 4. Iterator Pattern

- **Location**: PDF parsing
- **Purpose**: Traverse PDF objects and collections
- **Benefits**: Uniform interface, lazy evaluation

## Performance Considerations

### Memory Management

- **Strategy**: Streaming for large files, lazy loading
- **Implementation**: Buffered readers, object pools
- **Benefits**: Reduced memory footprint, better scalability

### CPU Optimization

- **Strategy**: Efficient string handling, minimal allocations
- **Implementation**: String builders, buffer reuse
- **Benefits**: Faster processing, lower CPU usage

### I/O Optimization

- **Strategy**: Buffered I/O, batch operations
- **Implementation**: Buffered readers/writers, bulk operations
- **Benefits**: Fewer system calls, better throughput

## Error Handling Architecture

### Error Hierarchy

```
Error
├── ParseError (PDF parsing failures)
├── IoError (File system issues)
├── FormatError (Unsupported formats)
└── ValidationError (Invalid input)
```

### Error Propagation

- **Strategy**: Result<T, Error> throughout the codebase
- **Implementation**: Question mark operator (?) for propagation
- **Benefits**: Explicit error handling, clear error paths

### Recovery Mechanisms

- **Partial Processing**: Continue processing other objects on failure
- **Graceful Degradation**: Fallback to simpler processing modes
- **User Feedback**: Clear error messages and suggestions

## Testing Architecture

### Test Organization

```
tests/
├── unit/ (Module-specific tests)
│   ├── pdf_tests.rs
│   ├── markdown_tests.rs
│   └── image_tests.rs
├── integration/ (End-to-end tests)
│   ├── conversion_tests.rs
│   └── cli_tests.rs
└── performance/ (Benchmarks)
    ├── parsing_benchmarks.rs
    └── generation_benchmarks.rs
```

### Test Strategies

- **Unit Tests**: Individual module functionality
- **Integration Tests**: Module interactions
- **Property Tests**: Input validation and edge cases
- **Performance Tests**: Resource usage and speed

## Extensibility Design

### Plugin Architecture (Future)

```
Plugin Interface
├── Parser Plugins (New formats)
├── Generator Plugins (New output formats)
├── Filter Plugins (Text processing)
└── Compression Plugins (New algorithms)
```

### Configuration System (Future)

- File-based configuration
- Runtime parameter adjustment
- Feature toggles
- Performance tuning options

## Security Architecture

### Input Validation

- File type verification
- Size limitations
- Content sanitization
- Path traversal prevention

### Resource Management

- Memory limits
- File handle limits
- Processing timeouts
- Temporary file cleanup

## Future Architecture Enhancements

### Multi-threading Support

- Parallel PDF parsing
- Concurrent image processing
- Background I/O operations
- Worker thread pools

### Caching System

- Object caching for repeated operations
- Result memoization
- Temporary file caching
- Metadata caching

### Plugin System

- Dynamic loading of parsers
- Custom output generators
- Extensible filter system
- Third-party integrations

This architecture provides a solid foundation for the current implementation while allowing for future enhancements and maintainability.