pdfrs 0.1.0

A CLI tool to read and write PDFs and convert them to and from Markdown.
# PDF-CLI Technical Specification

## Overview

PDF-CLI is a command-line tool written in Rust for reading and writing PDF files and for converting them to and from Markdown. The implementation is self-contained: it does not rely on external PDF libraries and implements the core parts of the PDF specification from scratch.

## Requirements

### Functional Requirements

#### FR1: PDF Generation

- **FR1.1**: Create PDF files from raw text input
- **FR1.2**: Support customizable fonts (Helvetica, Times-Roman, Courier)
- **FR1.3**: Support customizable font sizes
- **FR1.4**: Automatically split content into multiple pages when needed
- **FR1.5**: Generate PDFs compliant with PDF 1.4 specification
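
The automatic page split in FR1.4 comes down to simple arithmetic on the page geometry. The following is a minimal sketch, not the crate's actual code: the function names `lines_per_page` and `paginate`, the uniform margin, and the `leading_factor` parameter are all assumptions made for illustration.

```rust
/// Number of text lines that fit between the top and bottom margins.
/// `leading_factor` is the line height divided by the font size (assumed parameter).
pub fn lines_per_page(page_height: f32, margin: f32, font_size: f32, leading_factor: f32) -> usize {
    let usable = page_height - 2.0 * margin;
    let line_height = font_size * leading_factor;
    if usable <= 0.0 || line_height <= 0.0 {
        return 0;
    }
    (usable / line_height).floor() as usize
}

/// Split already-wrapped lines into pages of at most `per_page` lines.
pub fn paginate<'a>(lines: &[&'a str], per_page: usize) -> Vec<Vec<&'a str>> {
    // `max(1)` avoids a panic in `chunks` when nothing fits on a page.
    lines.chunks(per_page.max(1)).map(|page| page.to_vec()).collect()
}
```

For a US Letter page (792 pt high) with 72 pt margins, 12 pt text and a leading factor of 1.5, this yields 36 lines per page.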

#### FR2: PDF Parsing

- **FR2.1**: Parse PDF file structure and extract objects
- **FR2.2**: Extract text content from PDF pages
- **FR2.3**: Handle compressed streams (deflate/zlib)
- **FR2.4**: Process PDF content streams and text operators
- **FR2.5**: Detect and handle different PDF encodings

#### FR3: Markdown Integration

- **FR3.1**: Parse Markdown syntax (headers, lists, emphasis, code blocks, tables)
- **FR3.2**: Convert Markdown to structured elements for rich PDF generation
- **FR3.3**: Convert extracted PDF text to Markdown format
- **FR3.4**: Preserve document structure during conversions
- **FR3.5**: Task list support (`- [x]` / `- [ ]`)
- **FR3.6**: Strikethrough text (`~~text~~`)
- **FR3.7**: Blockquote support with nesting (`>`, `>>`, `>>>`)
- **FR3.8**: Definition lists (`term` / `: definition`)
- **FR3.9**: Table alignment parsing (`:---`, `:---:`, `---:`)
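
As an illustration of FR3.9, the delimiter row of a Markdown table can be mapped to per-column alignments as sketched below. The `ColumnAlign` enum and both function names are hypothetical and are not taken from the crate.

```rust
/// Column alignment derived from a Markdown table delimiter cell.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ColumnAlign {
    Default,
    Left,
    Center,
    Right,
}

/// Parse one delimiter cell such as `:---`, `:---:` or `---:`.
/// Returns `None` when the cell is not a valid delimiter.
pub fn parse_align_cell(cell: &str) -> Option<ColumnAlign> {
    let cell = cell.trim();
    let left = cell.starts_with(':');
    let right = cell.ends_with(':') && cell.len() > 1;
    // What remains after stripping the colons must be dashes only.
    let dashes = cell.trim_matches(':');
    if dashes.is_empty() || !dashes.chars().all(|c| c == '-') {
        return None;
    }
    Some(match (left, right) {
        (true, true) => ColumnAlign::Center,
        (true, false) => ColumnAlign::Left,
        (false, true) => ColumnAlign::Right,
        (false, false) => ColumnAlign::Default,
    })
}

/// Parse a whole delimiter row, e.g. `| :--- | :---: | ---: |`.
pub fn parse_align_row(row: &str) -> Option<Vec<ColumnAlign>> {
    row.trim()
        .trim_matches('|')
        .split('|')
        .map(parse_align_cell)
        .collect()
}
```

Collecting into `Option<Vec<_>>` makes the whole row invalid as soon as one cell fails to parse, which is a convenient way to tell a delimiter row from an ordinary table row.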

#### FR4: Image Support

- **FR4.1**: Detect image formats (JPEG, PNG, BMP) with dimension parsing
- **FR4.2**: Embed JPEG images in PDF files (DCTDecode)
- **FR4.3**: Support image positioning and sizing with aspect-ratio scaling
- **FR4.4**: CLI `add-image` command fully wired
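
The aspect-ratio scaling in FR4.3 could look roughly like the sketch below. The names `fit_within` and `image_draw_ops` are assumptions; the only PDF facts relied on are that an image XObject is painted into a unit square and that the `cm` operator scales and translates that square.

```rust
/// Scale an image to fit inside a `max_w` x `max_h` box while keeping its aspect ratio.
/// Returns `None` for non-positive dimensions.
pub fn fit_within(img_w: f32, img_h: f32, max_w: f32, max_h: f32) -> Option<(f32, f32)> {
    if img_w <= 0.0 || img_h <= 0.0 || max_w <= 0.0 || max_h <= 0.0 {
        return None;
    }
    // The smaller of the two ratios is the one that keeps both sides inside the box.
    let scale = (max_w / img_w).min(max_h / img_h);
    Some((img_w * scale, img_h * scale))
}

/// Content-stream operators that draw the image resource `name`
/// with its lower-left corner at (`x`, `y`) and the given size in points.
pub fn image_draw_ops(name: &str, x: f32, y: f32, w: f32, h: f32) -> String {
    format!("q {:.2} 0 0 {:.2} {:.2} {:.2} cm /{} Do Q", w, h, x, y, name)
}
```

An 800x600 image placed in a 400x400 box is scaled by 0.5 to 400x300.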

#### FR5: CLI Interface

- **FR5.1**: Provide subcommands for different operations
- **FR5.2**: Support command-line arguments for customization
- **FR5.3**: Provide helpful error messages and usage information
- **FR5.4**: Support input/output file specifications
- **FR5.5**: Page orientation (`--landscape` flag)

#### FR6: PDF Generation Enhancements

- **FR6.1**: Header font size hierarchy (H1=2x, H2=1.6x, H3=1.3x, H4=1.1x)
- **FR6.2**: Page numbering in footer
- **FR6.3**: Code block rendering with reduced font size (0.85x)
- **FR6.4**: Horizontal rule rendering
- **FR6.5**: Configurable page layout (portrait/landscape)
- **FR6.6**: Structured element pipeline (Markdown → Elements → PDF)
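
The size multipliers in FR6.1 and FR6.3 can be expressed as a small lookup. This is a sketch only: the function names are hypothetical, and treating H5 and deeper as body-sized text is an assumption the requirements do not state.

```rust
/// Font size for a heading of the given level, relative to the body size `base`.
/// Levels 1 to 4 follow FR6.1; deeper levels fall back to the body size (assumption).
pub fn heading_font_size(level: u8, base: f32) -> f32 {
    let factor = match level {
        1 => 2.0,
        2 => 1.6,
        3 => 1.3,
        4 => 1.1,
        _ => 1.0,
    };
    base * factor
}

/// Font size for code blocks, reduced to 0.85x of the body size (FR6.3).
pub fn code_font_size(base: f32) -> f32 {
    base * 0.85
}
```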

#### FR7: PDF Manipulation

- **FR7.1**: Merge multiple PDFs into a single output (`merge` command)
- **FR7.2**: Split PDF by page range (`split` command)
- **FR7.3**: Rotate all pages by 0/90/180/270° (`rotate` command)
- **FR7.4**: Document metadata embedding (title, author, subject, keywords) (`md-to-pdf-meta`)

#### FR8: Annotations and Multi-Image

- **FR8.1**: Text annotations with positioned notes on pages
- **FR8.2**: Link annotations with clickable URI actions
- **FR8.3**: Multiple JPEG images per page with independent positioning
- **FR8.4**: Highlight annotations with QuadPoints and color

#### FR9: Library API

- **FR9.1**: In-memory PDF generation via `generate_pdf_bytes()` (no filesystem needed)
- **FR9.2**: PDF structural validation via `validate_pdf_bytes()` returning `PdfValidation`
- **FR9.3**: Rich `Element` enum with 17 variants for document modeling
- **FR9.4**: Round-trip validation: generate → validate → parse → verify content
- **FR9.5**: Cross-reference stream parsing for PDF 1.5+ (`parse_xref_stream`)
- **FR9.6**: Object stream handling for compressed objects (`parse_object_stream`)

#### FR10: Extended Markdown Elements

- **FR10.1**: Image elements (`![alt](path)`) parsed and rendered
- **FR10.2**: Standalone link elements (`[text](url)`) parsed and rendered in blue
- **FR10.3**: Page break elements (`<!-- pagebreak -->` or `\pagebreak`)
- **FR10.4**: Inline code elements rendered with gray color
- **FR10.5**: Styled text elements (bold/italic) preserved
- **FR10.6**: Footnotes with label and text (`[^label]: text`)
- **FR10.7**: Definition lists (`term` / `: definition`)

#### FR11: Text Styling

- **FR11.1**: RGB color support via `Color` struct
- **FR11.2**: Text alignment (Left, Center) via `TextAlign` enum
- **FR11.3**: H1 headings centered, code blocks in gray, links in blue
- **FR11.4**: Watermarks with diagonal text, configurable opacity/size

### Non-Functional Requirements

#### NFR1: Performance

- **NFR1.1**: Process small PDF files (<1MB) in under 1 second
- **NFR1.2**: Handle large text files without memory issues
- **NFR1.3**: Efficient memory usage during PDF generation

#### NFR2: Compatibility

- **NFR2.1**: Support PDF files created by common applications
- **NFR2.2**: Generate PDFs readable by standard PDF viewers
- **NFR2.3**: Support common Markdown syntax variants

#### NFR3: Reliability

- **NFR3.1**: Handle malformed PDF files gracefully
- **NFR3.2**: Provide clear error messages for troubleshooting
- **NFR3.3**: Not crash on unexpected input

## System Architecture

### Core Components

#### 1. PDF Parser Module (`src/pdf.rs`)

```
PdfDocument
├── version: String
├── objects: HashMap<u32, PdfObject>
├── catalog: u32
└── pages: Vec<u32>

PdfObject
├── Dictionary(HashMap<String, PdfValue>)
├── Stream { dictionary, data }
├── Array(Vec<PdfValue>)
├── String(String)
├── Number(f64)
├── Boolean(bool)
├── Null
├── Reference(u32, u32)
└── Name(String)
```

**Responsibilities:**

- Parse PDF file structure
- Extract objects from PDF streams
- Handle compressed data
- Process content streams for text extraction
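
The object model above translates fairly directly into Rust types. The sketch below is a simplification, not the contents of `src/pdf.rs`: it folds `PdfValue` into `PdfObject` so the example is self-contained, and the `resolve` method with its hop limit is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Simplified object model; `PdfValue` from the diagram is merged into `PdfObject` here.
#[derive(Debug, Clone, PartialEq)]
pub enum PdfObject {
    Dictionary(HashMap<String, PdfObject>),
    Stream { dictionary: HashMap<String, PdfObject>, data: Vec<u8> },
    Array(Vec<PdfObject>),
    String(String),
    Number(f64),
    Boolean(bool),
    Null,
    /// Object number and generation number of an indirect reference.
    Reference(u32, u32),
    Name(String),
}

pub struct PdfDocument {
    pub version: String,
    pub objects: HashMap<u32, PdfObject>,
    pub catalog: u32,
    pub pages: Vec<u32>,
}

impl PdfDocument {
    /// Follow indirect references until a direct object is reached.
    /// The hop limit guards against reference cycles in malformed files.
    pub fn resolve<'a>(&'a self, mut value: &'a PdfObject) -> Option<&'a PdfObject> {
        for _ in 0..32 {
            match value {
                PdfObject::Reference(id, _generation) => {
                    let id = *id;
                    value = self.objects.get(&id)?;
                }
                other => return Some(other),
            }
        }
        None
    }
}
```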

#### 2. PDF Generator Module (`src/pdf_generator.rs`)

```
PdfGenerator
├── objects: Vec<PdfObject>
└── next_id: u32

PdfObject
├── id: u32
├── generation: u32
├── content: String
├── is_stream: bool
└── stream_data: Option<Vec<u8>>
```

**Responsibilities:**

- Create PDF file structure
- Generate content streams
- Handle font resources
- Create page tree and catalog
- Write valid PDF format

#### 3. Markdown Parser (`src/markdown.rs`)

```
MarkdownParser
├── headers: Vec<Header>
├── paragraphs: Vec<Paragraph>
├── lists: Vec<List>
├── tables: Vec<Table>
└── code_blocks: Vec<CodeBlock>
```

**Responsibilities:**

- Parse Markdown syntax
- Convert to plain text
- Handle formatting preservation
- Process tables and lists

#### 4. Image Handler (`src/image.rs`)

```
ImageHandler
├── format_detector: FormatDetector
├── jpeg_processor: JpegProcessor
├── png_processor: PngProcessor
└── bmp_processor: BmpProcessor
```

**Responsibilities:**

- Detect image formats
- Process image data
- Create PDF image objects
- Generate image content streams
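
Format detection usually starts from the file signature. The sketch below checks the well-known JPEG, PNG and BMP magic bytes and reads the PNG dimensions from the IHDR chunk. The type and function names are hypothetical and do not describe the actual `FormatDetector`.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ImageFormat {
    Jpeg,
    Png,
    Bmp,
}

/// Identify an image by its leading signature bytes.
pub fn detect_format(data: &[u8]) -> Option<ImageFormat> {
    const PNG_SIGNATURE: [u8; 8] = [0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A];
    if data.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else if data.starts_with(&PNG_SIGNATURE) {
        Some(ImageFormat::Png)
    } else if data.starts_with(b"BM") {
        Some(ImageFormat::Bmp)
    } else {
        None
    }
}

/// Read width and height from the IHDR chunk that follows the PNG signature.
/// Layout: 8-byte signature, 4-byte chunk length, "IHDR", then two big-endian u32 values.
pub fn png_dimensions(data: &[u8]) -> Option<(u32, u32)> {
    if detect_format(data) != Some(ImageFormat::Png) || data.len() < 24 {
        return None;
    }
    if &data[12..16] != b"IHDR" {
        return None;
    }
    let width = u32::from_be_bytes([data[16], data[17], data[18], data[19]]);
    let height = u32::from_be_bytes([data[20], data[21], data[22], data[23]]);
    Some((width, height))
}
```

JPEG dimensions require walking the marker segments up to a start-of-frame marker, which is omitted here for brevity.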

#### 5. Compression Module (`src/compression.rs`)

```
CompressionHandler
├── deflate_compressor: DeflateCompressor
├── hex_encoder: HexEncoder
└── stream_processor: StreamProcessor
```

**Responsibilities:**

- Compress and decompress streams
- Handle hex encoding/decoding
- Process compressed PDF objects
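
As an example of the hex encoding responsibility, the sketch below encodes and decodes data in the style of PDF's ASCIIHexDecode filter, where `>` marks end of data, whitespace is ignored, and a trailing odd digit is padded with `0`. The function names are assumptions. Deflate is left out because a from-scratch inflate implementation does not fit in a short example.

```rust
/// Encode bytes as uppercase hex digits followed by the `>` end-of-data marker.
pub fn hex_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity(data.len() * 2 + 1);
    for byte in data {
        out.push_str(&format!("{:02X}", byte));
    }
    out.push('>');
    out
}

/// Decode ASCIIHex data. Returns `None` on a non-hex, non-whitespace character.
pub fn hex_decode(text: &str) -> Option<Vec<u8>> {
    let mut digits: Vec<u8> = Vec::new();
    for c in text.chars() {
        if c == '>' {
            break; // end-of-data marker
        }
        if c.is_whitespace() {
            continue;
        }
        digits.push(c.to_digit(16)? as u8);
    }
    // An odd number of digits behaves as if a final 0 followed.
    if digits.len() % 2 == 1 {
        digits.push(0);
    }
    Some(digits.chunks(2).map(|pair| (pair[0] << 4) | pair[1]).collect())
}
```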

### Data Flow

#### PDF Generation Flow

```
Text Input → Markdown Parser → Text Processor → PDF Generator → PDF File
```

#### PDF Parsing Flow

```
PDF File → PDF Parser → Object Extractor → Text Processor → Text Output
```

#### Markdown to PDF Flow

```
Markdown File → Markdown Parser → Text Processor → PDF Generator → PDF File
```

## Algorithms

### PDF Object Parsing

1. Read PDF header to determine version
2. Locate and parse xref table
3. Extract objects based on xref references
4. Parse object dictionaries and streams
5. Handle compressed streams if present
6. Build object graph for document structure
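
The first two steps can be sketched as follows: read the version from the `%PDF-` header, then find the byte offset of the xref table from the last `startxref` keyword in the file. Function names are hypothetical. The sketch assumes a classic xref table and does not cover cross-reference streams.

```rust
/// Extract the version string from a `%PDF-x.y` header line.
pub fn parse_version(data: &[u8]) -> Option<String> {
    if !data.starts_with(b"%PDF-") {
        return None;
    }
    let rest = &data[5..];
    let end = rest
        .iter()
        .position(|b| *b == b'\n' || *b == b'\r')
        .unwrap_or(rest.len());
    let version = std::str::from_utf8(&rest[..end]).ok()?.trim();
    if version.is_empty() {
        None
    } else {
        Some(version.to_string())
    }
}

/// Find the offset written after the last `startxref` keyword.
pub fn find_startxref(data: &[u8]) -> Option<usize> {
    let key: &[u8] = b"startxref";
    // Search from the end: incrementally updated files contain several trailers.
    let pos = data.windows(key.len()).rposition(|window| window == key)?;
    let tail = std::str::from_utf8(&data[pos + key.len()..]).ok()?;
    tail.split_whitespace().next()?.parse().ok()
}
```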

### Text Extraction Algorithm

1. Iterate through page objects
2. Extract content streams from pages
3. Decompress streams if necessary
4. Parse content stream operators
5. Extract text strings from operators
6. Apply positioning and formatting
7. Combine text from all pages
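
Steps 4 and 5 can be illustrated with a small scanner over a decompressed content stream. This is a deliberately limited sketch with hypothetical names: it handles literal strings with nested parentheses and simple backslash escapes, and emits them when a text-showing operator (`Tj`, `TJ`, `'`, `"`) is reached. Hex strings, octal escapes, comments, font encodings and positioning are not handled.

```rust
/// Read a literal string starting just after its opening parenthesis.
/// Returns the decoded text and the index just past the closing parenthesis.
fn read_literal(bytes: &[u8], mut i: usize) -> (String, usize) {
    let mut depth = 1;
    let mut text = String::new();
    while i < bytes.len() {
        let b = bytes[i];
        i += 1;
        match b {
            b'\\' if i < bytes.len() => {
                let escaped = bytes[i];
                i += 1;
                text.push(match escaped {
                    b'n' => '\n',
                    b'r' => '\r',
                    b't' => '\t',
                    other => other as char,
                });
            }
            b'(' => {
                depth += 1;
                text.push('(');
            }
            b')' => {
                depth -= 1;
                if depth == 0 {
                    break;
                }
                text.push(')');
            }
            // Bytes are mapped one-to-one to chars (Latin-1 style), a simplification.
            other => text.push(other as char),
        }
    }
    (text, i)
}

/// Collect the strings shown by text operators, one output line per operator.
pub fn extract_text(stream: &[u8]) -> String {
    let mut out = String::new();
    let mut pending: Vec<String> = Vec::new();
    let mut i = 0;
    while i < stream.len() {
        let b = stream[i];
        if b == b'(' {
            let (text, next) = read_literal(stream, i + 1);
            pending.push(text);
            i = next;
        } else if b.is_ascii_whitespace() || matches!(b, b'[' | b']' | b')') {
            i += 1;
        } else {
            let start = i;
            while i < stream.len()
                && !stream[i].is_ascii_whitespace()
                && !matches!(stream[i], b'(' | b')' | b'[' | b']')
            {
                i += 1;
            }
            let token = &stream[start..i];
            // Names and numbers are operands; anything else is treated as an operator.
            let is_operand = token[0] == b'/'
                || token[0].is_ascii_digit()
                || matches!(token[0], b'+' | b'-' | b'.');
            if matches!(token, b"Tj" | b"TJ" | b"'" | b"\"") {
                for piece in pending.drain(..) {
                    out.push_str(&piece);
                }
                out.push('\n');
            } else if !is_operand {
                pending.clear();
            }
        }
    }
    out
}
```

Kerning numbers inside a `TJ` array are skipped as operands, so the fragments of one array are joined into a single line.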

### PDF Generation Algorithm

1. Create page objects with content streams
2. Generate font resources
3. Create page tree structure
4. Generate document catalog
5. Calculate object offsets
6. Generate xref table
7. Write trailer and EOF marker
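
Steps 5 to 7 are the part that is easiest to get wrong, so a minimal sketch may help. It assumes each object body is already serialized as a string and that objects are numbered from 1 in order; `write_pdf` and `stream_object` are hypothetical names, not the `PdfGenerator` API. Each xref entry must be exactly 20 bytes long, which is why the entries end with a space before the newline.

```rust
/// Wrap content-stream text in a stream object body with its `/Length`.
pub fn stream_object(content: &str) -> String {
    format!("<< /Length {} >>\nstream\n{}\nendstream", content.len(), content)
}

/// Serialize object bodies into a complete PDF 1.4 file.
/// Object `i` in the slice gets object number `i + 1`; `root_id` is the catalog's number.
pub fn write_pdf(objects: &[String], root_id: usize) -> Vec<u8> {
    let mut out: Vec<u8> = Vec::new();
    out.extend_from_slice(b"%PDF-1.4\n");

    // Record the byte offset of every object while writing it.
    let mut offsets = Vec::with_capacity(objects.len());
    for (index, body) in objects.iter().enumerate() {
        offsets.push(out.len());
        out.extend_from_slice(format!("{} 0 obj\n{}\nendobj\n", index + 1, body).as_bytes());
    }

    // Cross-reference table: entry 0 is the head of the free list.
    let xref_start = out.len();
    out.extend_from_slice(format!("xref\n0 {}\n", objects.len() + 1).as_bytes());
    out.extend_from_slice(b"0000000000 65535 f \n");
    for offset in &offsets {
        out.extend_from_slice(format!("{:010} 00000 n \n", offset).as_bytes());
    }

    // Trailer, pointer to the xref table, and end-of-file marker.
    out.extend_from_slice(
        format!(
            "trailer\n<< /Size {} /Root {} 0 R >>\nstartxref\n{}\n%%EOF\n",
            objects.len() + 1,
            root_id,
            xref_start
        )
        .as_bytes(),
    );
    out
}
```

A one-page document would pass a catalog, a page tree, a page, a font dictionary and a content stream built with `stream_object("BT /F1 12 Tf 72 720 Td (Hello) Tj ET")`.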

### Markdown Parsing Algorithm

1. Tokenize input into lines
2. Identify block elements (headers, lists, code blocks, tables)
3. Parse inline elements (emphasis, links, code)
4. Build document structure
5. Convert to plain text representation
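
Steps 1 and 2 can be sketched as a line-based classifier. The `Block` enum below is a hypothetical, much smaller stand-in for the crate's `Element` type, and the sketch covers only headings, horizontal rules, task items, bullet items, fenced code blocks and paragraphs. Inline parsing (step 3) is not shown, and each non-blank line becomes its own paragraph.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Block {
    Heading { level: u8, text: String },
    TaskItem { done: bool, text: String },
    ListItem(String),
    CodeBlock(String),
    Rule,
    Paragraph(String),
}

/// Classify each line of `input` into a block element.
pub fn parse_blocks(input: &str) -> Vec<Block> {
    let mut blocks = Vec::new();
    // `Some(lines)` while inside a fenced code block.
    let mut code: Option<Vec<&str>> = None;

    for line in input.lines() {
        if line.trim_start().starts_with("```") {
            match code.take() {
                Some(lines) => blocks.push(Block::CodeBlock(lines.join("\n"))),
                None => code = Some(Vec::new()),
            }
            continue;
        }
        if let Some(lines) = code.as_mut() {
            lines.push(line);
            continue;
        }

        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        let hashes = trimmed.chars().take_while(|c| *c == '#').count();
        if (1..=6).contains(&hashes) && trimmed[hashes..].starts_with(' ') {
            let text = trimmed[hashes..].trim().to_string();
            blocks.push(Block::Heading { level: hashes as u8, text });
        } else if trimmed == "---" || trimmed == "***" {
            blocks.push(Block::Rule);
        } else if let Some(rest) = trimmed.strip_prefix("- [x] ") {
            blocks.push(Block::TaskItem { done: true, text: rest.to_string() });
        } else if let Some(rest) = trimmed.strip_prefix("- [ ] ") {
            blocks.push(Block::TaskItem { done: false, text: rest.to_string() });
        } else if let Some(rest) = trimmed.strip_prefix("- ").or_else(|| trimmed.strip_prefix("* ")) {
            blocks.push(Block::ListItem(rest.to_string()));
        } else {
            blocks.push(Block::Paragraph(trimmed.to_string()));
        }
    }

    // An unterminated fence still yields its contents.
    if let Some(lines) = code {
        blocks.push(Block::CodeBlock(lines.join("\n")));
    }
    blocks
}
```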

## Error Handling

### Error Types

1. **Parse Errors**: Malformed PDF structure
2. **IO Errors**: File access issues
3. **Format Errors**: Unsupported content
4. **Encoding Errors**: Invalid character encodings
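
The four categories map naturally onto a Rust error enum. The following is one possible shape rather than the crate's actual error type; the name `PdfError`, the variant fields and the message wording are all assumptions.

```rust
use std::fmt;
use std::io;

#[derive(Debug)]
pub enum PdfError {
    /// Malformed PDF structure, with the byte offset where parsing failed.
    Parse { offset: usize, message: String },
    /// File access issues.
    Io(io::Error),
    /// Unsupported content, such as an unknown filter or image format.
    Format(String),
    /// Invalid character encoding.
    Encoding(String),
}

impl fmt::Display for PdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            PdfError::Parse { offset, message } => {
                write!(f, "parse error at byte {}: {}", offset, message)
            }
            PdfError::Io(err) => write!(f, "I/O error: {}", err),
            PdfError::Format(what) => write!(f, "unsupported content: {}", what),
            PdfError::Encoding(what) => write!(f, "invalid encoding: {}", what),
        }
    }
}

impl std::error::Error for PdfError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            PdfError::Io(err) => Some(err),
            _ => None,
        }
    }
}

/// Lets `?` convert I/O failures automatically.
impl From<io::Error> for PdfError {
    fn from(err: io::Error) -> Self {
        PdfError::Io(err)
    }
}
```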

### Error Recovery

- Skip malformed objects when possible
- Provide partial results when complete parsing fails
- Generate warnings for non-critical issues
- Fail gracefully with helpful error messages
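
A common way to implement the first three points is to keep whatever parsed successfully and record a warning for everything that did not. The generic sketch below shows the pattern; `Partial` and `parse_lenient` are hypothetical names, and the per-object parser is passed in as a closure so the example stays independent of the real PDF parser.

```rust
use std::fmt::Display;

/// Successfully parsed items plus warnings for the ones that were skipped.
pub struct Partial<T> {
    pub items: Vec<T>,
    pub warnings: Vec<String>,
}

/// Parse every input, skipping failures instead of aborting on the first one.
pub fn parse_lenient<T, E, F>(inputs: &[&str], parse_one: F) -> Partial<T>
where
    E: Display,
    F: Fn(&str) -> Result<T, E>,
{
    let mut result = Partial { items: Vec::new(), warnings: Vec::new() };
    for (index, &raw) in inputs.iter().enumerate() {
        match parse_one(raw) {
            Ok(item) => result.items.push(item),
            Err(err) => result
                .warnings
                .push(format!("object {}: skipped ({})", index, err)),
        }
    }
    result
}
```

The caller can then decide whether an empty `items` list with non-empty `warnings` should be reported as a hard failure.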

## Security Considerations

### Input Validation

- Validate PDF file structure
- Bounds-check every offset and length read from the file before using it
- Validate image file formats
- Sanitize text content

### Resource Limits

- Limit maximum file size
- Limit number of objects processed
- Limit recursion depth in parsing
- Monitor memory usage
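
One way to enforce these limits is a small budget object that the parser consults as it goes. Everything in the sketch below is hypothetical, including the type names and the idea of pairing `enter` and `leave` calls around nested arrays and dictionaries. Concrete limit values are left to the caller because the specification does not state any.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Limits {
    pub max_file_bytes: u64,
    pub max_objects: usize,
    pub max_depth: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LimitError {
    FileTooLarge,
    TooManyObjects,
    TooDeep,
}

/// Tracks resource usage against `Limits` during one parse.
pub struct Budget {
    limits: Limits,
    objects_seen: usize,
    depth: usize,
}

impl Budget {
    pub fn new(limits: Limits) -> Self {
        Budget { limits, objects_seen: 0, depth: 0 }
    }

    /// Reject oversized inputs before reading them.
    pub fn check_file_size(&self, bytes: u64) -> Result<(), LimitError> {
        if bytes > self.limits.max_file_bytes {
            Err(LimitError::FileTooLarge)
        } else {
            Ok(())
        }
    }

    /// Call once per parsed object.
    pub fn count_object(&mut self) -> Result<(), LimitError> {
        self.objects_seen += 1;
        if self.objects_seen > self.limits.max_objects {
            Err(LimitError::TooManyObjects)
        } else {
            Ok(())
        }
    }

    /// Call when descending into a nested array or dictionary.
    pub fn enter(&mut self) -> Result<(), LimitError> {
        if self.depth >= self.limits.max_depth {
            return Err(LimitError::TooDeep);
        }
        self.depth += 1;
        Ok(())
    }

    /// Call when leaving a nested array or dictionary.
    pub fn leave(&mut self) {
        self.depth = self.depth.saturating_sub(1);
    }
}
```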

## Performance Considerations

### Optimization Strategies

- Stream-based processing for large files
- Lazy loading of PDF objects
- Efficient string handling
- Minimal memory allocations

### Benchmarks

- Target: <1s for 1MB PDF processing
- Target: <100MB memory usage for typical operations
- Target: 10MB/s text extraction rate

## Testing Strategy

### Unit Tests

- PDF object parsing
- Text extraction algorithms
- Markdown parsing
- Image format detection
- Compression functions

### Integration Tests

- End-to-end PDF generation
- PDF to Markdown conversion
- Markdown to PDF conversion
- CLI command functionality

### Performance Tests

- Large file processing
- Memory usage profiling
- CPU usage monitoring
- Concurrency testing

## Future Enhancements

### Completed Features

- Advanced PDF parsing (xref streams, object streams, font encodings)
- Annotations (text, link, highlight)
- PDF manipulation (merge, split, rotate, reorder, watermark)
- Security (password protection, permissions)
- Library API (in-memory generation, validation)
- 17 element types with round-trip validation
- 251 tests (115 lib + 112 bin + 13 integration + 11 bench)

### Remaining Features

- Embedded/TrueType font support
- Full tagged PDF output for accessibility
- Vector graphics (SVG) support
- Digital signatures
- WebAssembly compilation
- Rustdoc API documentation with examples

### Advanced Features (Surpassing Ghostscript)

#### FR12: Streaming & Incremental Processing

- **FR12.1**: Streaming PDF generation for large documents
- **FR12.2**: Page-by-page rendering without full document load
- **FR12.3**: Incremental PDF writing (stream to disk during generation)
- **FR12.4**: Memory-efficient processing of multi-gigabyte PDFs

#### FR13: Performance & Parallelism

- **FR13.1**: Parallel page processing using Rayon
- **FR13.2**: Concurrent PDF merging (process multiple files in parallel)
- **FR13.3**: SIMD-optimized text rendering operations
- **FR13.4**: Lazy loading of PDF pages (load only needed pages)
- **FR13.5**: Async PDF processing for web servers

#### FR14: Smart Content Analysis

- **FR14.1**: AI-powered structure detection (headers, sections, tables)
- **FR14.2**: Automatic table extraction to CSV/Excel
- **FR14.3**: Smart form field detection and filling
- **FR14.4**: Content-aware compression (compress low-importance images)
- **FR14.5**: Automatic PDF/A validation and conversion

#### FR15: Developer Experience Features

- **FR15.1**: Type-safe PDF builder API with compile-time guarantees
- **FR15.2**: Property-based testing for PDF generation
- **FR15.3**: Diff/patch support for PDF version control
- **FR15.4**: Hot-reload PDF preview during development
- **FR15.5**: Interactive REPL for PDF manipulation

#### FR16: WebAssembly & Browser Support

- **FR16.1**: Compile to WASM for browser-based PDF rendering
- **FR16.2**: JavaScript API for web applications
- **FR16.3**: Canvas-based PDF viewer in browser
- **FR16.4**: Real-time collaborative PDF editing

#### FR17: Advanced Format Support

- **FR17.1**: PDF 2.0 feature support
- **FR17.2**: PDF/A-3 and PDF/UA (universal accessibility)
- **FR17.3**: Embedded attachments with metadata
- **FR17.4**: Portfolio and collection support
- **FR17.5**: 3D annotations and rich media

#### FR18: Intelligent Optimization

- **FR18.1**: Smart image compression based on content importance
- **FR18.2**: Font subsetting to reduce file size
- **FR18.3**: Object deduplication across pages
- **FR18.4**: Automatic optimization profiles (web, print, archive, ebook)
- **FR18.5**: Quality-aware compression (maintain visual quality)

#### FR19: Security & Validation

- **FR19.1**: Malformed PDF detection and sanitization
- **FR19.2**: JavaScript sandbox for PDF actions
- **FR19.3**: Digital signature creation and verification
- **FR19.4**: Certificate management
- **FR19.5**: DRM and permission enforcement

## Implementation Roadmap

### Phase 1: Foundation (Current)
- [x] Basic PDF generation and parsing
- [x] Markdown to PDF conversion
- [x] Table rendering with borders and text wrapping
- [x] Code blocks with syntax highlighting
- [x] Font styles (bold/italic) and text alignment

### Phase 2: Performance (Next 2 weeks)
- [ ] FR12.1-12.4: Streaming processing
- [ ] FR13.1-13.4: Parallel processing
- [ ] Benchmarking suite

### Phase 3: Smart Features (1 month)
- [ ] FR14.1-14.5: Content analysis
- [ ] FR18.1-18.5: Intelligent optimization
- [ ] Machine learning integration prep

### Phase 4: Developer Experience (2 weeks)
- [ ] FR15.1-15.5: DX features
- [ ] FR15.3: Diff/patch support

### Phase 5: Web & Modern (1 month)
- [ ] FR16.1-16.4: WASM support
- [ ] FR17.1-17.5: Advanced formats
- [ ] Web-based PDF viewer

### Phase 6: Security & Advanced (1 month)
- [ ] FR19.1-19.5: Security features
- [ ] FR17.3: Attachments
- [ ] Production hardening