pdfrs 0.1.0

A CLI tool to read/write PDFs and convert to/from markdown
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
# PDF-CLI Architecture Documentation

## Overview

PDF-CLI is architectured as a modular system with clear separation of concerns. The design prioritizes maintainability, extensibility, and performance while implementing PDF functionality from scratch without external dependencies.

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                        PDF-CLI CLI                          │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   PDF I/O   │  │ Markdown I/O│  │   Image I/O         │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │  PDF Core   │  │  Markdown   │  │   Image Processing │  │
│  │  Engine     │  │  Parser     │  │   Engine           │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │Compression  │  │   Utility   │  │    Error            │  │
│  │   Module    │  │  Functions  │  │   Handling          │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

## Module Architecture

### 1. CLI Module (`src/main.rs`)

**Purpose**: Command-line interface and application orchestration

**Responsibilities**:

- Parse command-line arguments using `clap`
- Route commands to appropriate handlers
- Coordinate between modules
- Handle application-level error reporting

**Key Components**:

```rust
struct Cli {
    command: Commands,
}

enum Commands {
    PdfToMd { input: String, output: String },
    MdToPdf { input: String, output: String, font: String, font_size: f32 },
    Extract { input: String },
    Create { output: String, text: String, font: String, font_size: f32 },
    AddImage { pdf_file: String, image_file: String, x: f32, y: f32, width: f32, height: f32 },
}
```

### 2. PDF Core Engine (`src/pdf.rs`)

**Purpose**: PDF parsing and text extraction

**Architecture Pattern**: Document Object Model (DOM) parser

**Key Classes**:

```rust
pub struct PdfDocument {
    pub version: String,
    pub objects: HashMap<u32, PdfObject>,
    pub catalog: u32,
    pub pages: Vec<u32>,
}

pub enum PdfObject {
    Dictionary(HashMap<String, PdfValue>),
    Stream { dictionary: HashMap<String, PdfValue>, data: Vec<u8> },
    Array(Vec<PdfValue>),
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
    Reference(u32, u32),
    Name(String),
}
```

**Parsing Pipeline**:

```
PDF File → Header Parser → XRef Parser → Object Parser → Document Builder → Text Extractor
```

**Design Decisions**:

- Lazy loading of objects for memory efficiency
- Simple object model for maintainability
- Stream-based processing for large files

### 3. PDF Generator (`src/pdf_generator.rs`)

**Purpose**: Create PDF files from scratch

**Architecture Pattern**: Builder Pattern

**Key Classes**:

```rust
pub struct PdfGenerator {
    objects: Vec<PdfObject>,
    next_id: u32,
}

struct PdfObject {
    id: u32,
    generation: u32,
    content: String,
    is_stream: bool,
    stream_data: Option<Vec<u8>>,
}
```

**Generation Pipeline**:

```
Text Input → Page Builder → Content Stream Generator → Object Manager → File Writer
```

**Design Decisions**:

- Object ID management for proper references
- Content stream optimization
- PDF compliance with version 1.4 specification

### 4. Structured Elements (`src/elements.rs`)

**Purpose**: Define and parse structured document elements from Markdown

**Architecture Pattern**: Line Scanner → Element Classifier → Element Tree

**Key Types**:

```rust
pub enum Element {
    Heading { level: u8, text: String },
    Paragraph { text: String },
    UnorderedListItem { text: String, depth: u8 },
    OrderedListItem { number: u32, text: String, depth: u8 },
    TaskListItem { checked: bool, text: String },
    CodeBlock { language: String, code: String },
    InlineCode { code: String },
    TableRow { cells: Vec<String>, is_separator: bool, alignments: Vec<TableAlignment> },
    BlockQuote { text: String, depth: u8 },
    DefinitionItem { term: String, definition: String },
    Footnote { label: String, text: String },
    Link { text: String, url: String },
    Image { alt: String, path: String },
    StyledText { text: String, bold: bool, italic: bool },
    PageBreak,
    HorizontalRule,
    EmptyLine,
}
```

**Key Functions**:

- `parse_markdown()`: Parse markdown text into `Vec<Element>`
- `strip_inline_formatting()`: Remove bold/italic/code/link/strikethrough syntax

**Design Decisions**:

- Elements carry formatting intent (heading level, list depth, checked state)
- Inline formatting stripped at parse time; structure preserved for PDF rendering
- Enables font size variations, indentation, and page numbering in PDF output

### 5. Markdown Converter (`src/markdown.rs`)

**Purpose**: Orchestrate Markdown-to-PDF and Markdown-to-text conversion

**Architecture Pattern**: Pipeline Coordinator

**Parsing Pipeline**:

```
Markdown Text → elements::parse_markdown() → Vec<Element> → pdf_generator::create_pdf_from_elements()
```

**Key Functions**:

- `markdown_to_text()`: Convert Markdown to plain text (legacy, uses elements internally)
- `markdown_to_pdf_with_options()`: Convert with styling via structured elements
- `elements_to_text()`: Render elements back to plain text

### 6. Image Processing (`src/image.rs`)

**Purpose**: Handle image embedding in PDFs

**Architecture Pattern**: Strategy Pattern for different image formats

**Key Components**:

```rust
pub fn create_jpeg_image_object(
    generator: &mut PdfGenerator,
    jpeg_data: Vec<u8>,
    width: u32,
    height: u32
) -> Result<u32>

pub fn create_image_content_stream(
    x: f32,
    y: f32,
    width: f32,
    height: f32,
    image_name: &str
) -> Result<Vec<u8>>
```

**Supported Formats**:

- JPEG (with DCTDecode)
- PNG (planned)
- BMP (planned)

### 7. PDF Operations (`src/pdf_ops.rs`)

**Purpose**: High-level PDF manipulation operations

**Key Functions**:

- `merge_pdfs()`: Combine multiple PDFs into one
- `split_pdf()`: Extract page range into a new PDF
- `rotate_pdf()`: Apply rotation (0/90/180/270°) to all pages
- `create_pdf_with_metadata()`: Generate PDF with Info dictionary (title, author, etc.)
- `create_pdf_with_annotations()`: Generate PDF with text and link annotations
- `create_pdf_with_images()`: Place multiple JPEG images on a single page

**Key Types**:

- `PdfMetadata`: Document properties (title, author, subject, keywords, creator)
- `TextAnnotation`: Positioned text note on a page
- `LinkAnnotation`: Clickable URI region on a page
- `HighlightAnnotation`: Colored highlight rectangle with QuadPoints
- `Color`: RGB color struct for text rendering
- `TextAlign`: Left/Center alignment enum

### 8. PDF Validation (`src/pdf.rs` — validation functions)

**Purpose**: Validate PDF structural integrity

**Key Functions**:

- `validate_pdf()`: Validate a PDF file on disk
- `validate_pdf_bytes()`: Validate PDF bytes in memory (no filesystem needed)
- `parse_xref_stream()`: Parse PDF 1.5+ cross-reference streams
- `parse_object_stream()`: Parse compressed object streams (/Type /ObjStm)

**Key Types**:

```rust
pub struct PdfValidation {
    pub valid: bool,
    pub errors: Vec<String>,
    pub warnings: Vec<String>,
    pub page_count: usize,
    pub object_count: usize,
}
```

**Validation Checks**: PDF header, %%EOF marker, xref table, trailer, /Catalog, /Pages, page count, object/endobj pairing, stream/endstream pairing, /Root reference.

### 9. Compression Module (`src/compression.rs`)

**Purpose**: Handle PDF stream compression

**Architecture Pattern**: Strategy Pattern for compression algorithms

**Key Functions**:

- `decompress_deflate()`: Decompress zlib/deflate streams
- `compress_deflate()`: Compress data streams
- `decode_hex_string()`: Decode hex-encoded strings
- `encode_hex_string()`: Encode data as hex strings

## Data Flow Architecture

### PDF Generation Flow

```
Input Text → Markdown Parser → Content Builder → PDF Generator → File Writer
     ↓              ↓              ↓                ↓             ↓
Raw String → Structured Text → Content Streams → PDF Objects → Binary File
```

### PDF Parsing Flow

```
PDF File → Header Parser → Object Locator → Object Parser → Document Builder → Text Extractor
    ↓           ↓            ↓              ↓              ↓             ↓
Binary File → PDF Version → Object Offsets → PDF Objects → Document Model → Plain Text
```

### Conversion Flow

```
Source File → Appropriate Parser → Text Processing → Target Generator → Output File
     ↓              ↓                  ↓                ↓               ↓
PDF/MD File → PDF/MD Parser → Text Transformation → Target Generator → Target Format
```

## Design Patterns Used

### 1. Builder Pattern

- **Location**: `PdfGenerator`
- **Purpose**: Construct complex PDF objects step by step
- **Benefits**: Fluent interface, flexible configuration

### 2. Strategy Pattern

- **Location**: `Image` and `Compression` modules
- **Purpose**: Handle different formats and algorithms
- **Benefits**: Extensibility, interchangeable algorithms

### 3. Command Pattern

- **Location**: CLI module
- **Purpose**: Encapsulate user commands as objects
- **Benefits**: Decoupling, undo/redo capabilities

### 4. Iterator Pattern

- **Location**: PDF parsing
- **Purpose**: Traverse PDF objects and collections
- **Benefits**: Uniform interface, lazy evaluation

## Performance Considerations

### Memory Management

- **Strategy**: Streaming for large files, lazy loading
- **Implementation**: Buffered readers, object pools
- **Benefits**: Reduced memory footprint, better scalability

### CPU Optimization

- **Strategy**: Efficient string handling, minimal allocations
- **Implementation**: String builders, buffer reuse
- **Benefits**: Faster processing, lower CPU usage

### I/O Optimization

- **Strategy**: Buffered I/O, batch operations
- **Implementation**: Buffered readers/writers, bulk operations
- **Benefits**: Fewer system calls, better throughput

## Error Handling Architecture

### Error Hierarchy

```
Error
├── ParseError (PDF parsing failures)
├── IoError (File system issues)
├── FormatError (Unsupported formats)
└── ValidationError (Invalid input)
```

### Error Propagation

- **Strategy**: Result<T, Error> throughout the codebase
- **Implementation**: Question mark operator (?) for propagation
- **Benefits**: Explicit error handling, clear error paths

### Recovery Mechanisms

- **Partial Processing**: Continue processing other objects on failure
- **Graceful Degradation**: Fallback to simpler processing modes
- **User Feedback**: Clear error messages and suggestions

## Testing Architecture

### Test Organization

```
tests/
├── unit/ (Module-specific tests)
│   ├── pdf_tests.rs
│   ├── markdown_tests.rs
│   └── image_tests.rs
├── integration/ (End-to-end tests)
│   ├── conversion_tests.rs
│   └── cli_tests.rs
└── performance/ (Benchmarks)
    ├── parsing_benchmarks.rs
    └── generation_benchmarks.rs
```

### Test Strategies

- **Unit Tests**: Individual module functionality
- **Integration Tests**: Module interactions
- **Property Tests**: Input validation and edge cases
- **Performance Tests**: Resource usage and speed

## Extensibility Design

### Plugin Architecture (Future)

```
Plugin Interface
├── Parser Plugins (New formats)
├── Generator Plugins (New output formats)
├── Filter Plugins (Text processing)
└── Compression Plugins (New algorithms)
```

### Configuration System (Future)

- File-based configuration
- Runtime parameter adjustment
- Feature toggles
- Performance tuning options

## Security Architecture

### Input Validation

- File type verification
- Size limitations
- Content sanitization
- Path traversal prevention

### Resource Management

- Memory limits
- File handle limits
- Processing timeouts
- Temporary file cleanup

## Future Architecture Enhancements

### Multi-threading Support

- Parallel PDF parsing
- Concurrent image processing
- Background I/O operations
- Worker thread pools

### Caching System

- Object caching for repeated operations
- Result memoization
- Temporary file caching
- Metadata caching

### Plugin System

- Dynamic loading of parsers
- Custom output generators
- Extensible filter system
- Third-party integrations

This architecture provides a solid foundation for the current implementation while allowing for future enhancements and maintainability.