oxidize-pdf 2.5.0

# Batch PDF Processing Example

## Overview

This example demonstrates **parallel processing of multiple PDFs** with robust error handling and real-time progress tracking using Rayon for parallelization.

**Use Case:** "I have 100+ PDFs in a directory and need to extract text from all of them efficiently, without a single corrupted file stopping the entire process."

## Features

- ✅ **Parallel Processing**: Configurable worker threads (default: all CPU cores)
- ✅ **Error Recovery**: Continues processing even when individual PDFs fail
- ✅ **Real-time Progress**: Live progress bar with success/failure counts
- ✅ **Performance Metrics**: Throughput, average time per document
- ✅ **Flexible Output**: Console (human-readable) or JSON (machine-readable)
- ✅ **Detailed Error Reports**: Lists all failed files with specific error messages

## Quick Start

```bash
# Process all PDFs in a directory
cargo run --example batch_processing --features rayon -- --dir ./pdfs

# Use 8 workers
cargo run --example batch_processing --features rayon -- --dir ./pdfs --workers 8

# JSON output (for pipelines)
cargo run --example batch_processing --features rayon -- --dir ./pdfs --json

# Verbose mode (see each file as it processes)
cargo run --example batch_processing --features rayon -- --dir ./pdfs --verbose
```

## Command-Line Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--dir` | `-d` | Directory containing PDF files | *Required* |
| `--workers` | `-w` | Number of parallel workers | CPU count |
| `--json` | `-j` | Output in JSON format | `false` |
| `--verbose` | `-v` | Show detailed per-file output | `false` |

## Output Examples

### Console Mode (Default)

```
📁 Found 100 PDF files in "./documents"
⚙️  Workers: 16

[00:00:12] ========================================> 100/100 (100%) ✅ 95 | ❌ 5

═══════════════════════════════════════
         BATCH SUMMARY REPORT
═══════════════════════════════════════

📊 Statistics:
   Total files:     100
   ✅ Successful:   95 (95.0%)
   ❌ Failed:       5 (5.0%)

⏱️  Performance:
   Total time:      12.34s
   Throughput:      8.1 docs/sec
   Avg per doc:     123ms

❌ Failed files:
   • corrupted.pdf - Text extraction failed: Invalid PDF structure
   • locked.pdf - Failed to open PDF: Permission denied
   • encrypted.pdf - Text extraction failed: Encryption not supported
   • malformed.pdf - Failed to open PDF: Invalid xref table
   • empty.pdf - Text extraction failed: No pages found

═══════════════════════════════════════
```

### JSON Mode

```bash
cargo run --example batch_processing --features rayon -- --dir ./pdfs --json
```

```json
{
  "total": 100,
  "successful": 95,
  "failed": 5,
  "total_duration_ms": 12340,
  "throughput_docs_per_sec": 8.1,
  "results": [
    {
      "filename": "document1.pdf",
      "success": true,
      "pages": 25,
      "text_chars": 15234,
      "duration_ms": 145,
      "error": null
    },
    {
      "filename": "corrupted.pdf",
      "success": false,
      "pages": null,
      "text_chars": null,
      "duration_ms": 23,
      "error": "Text extraction failed: Invalid PDF structure"
    }
  ]
}
```

## Performance Benchmarks

Tested on **M3 MacBook Pro (16 cores)**:

| Scenario | PDFs | Size | Time | Throughput |
|----------|------|------|------|------------|
| Small docs | 1,000 | ~50KB avg | 61s | 16.4 docs/sec |
| Medium docs | 100 | ~500KB avg | 12s | 8.3 docs/sec |
| Large docs | 50 | ~5MB avg | 45s | 1.1 docs/sec |

**Key Insights:**
- Linear scaling up to CPU count
- I/O-bound for small files (network/disk becomes bottleneck)
- CPU-bound for large complex PDFs
- Error recovery adds ~5ms overhead per failed file

## Integration Examples

### Shell Script Pipeline

```bash
#!/bin/bash
# Process PDFs and export to JSONL for further analysis

cargo run --example batch_processing --features rayon -- \
  --dir ./input \
  --json > results.json

# Extract only successful files
jq -r '.results[] | select(.success == true) | .filename' results.json > successful.txt

# Count failures by error type
jq -r '.results[] | select(.success == false) | .error' results.json | \
  sort | uniq -c | sort -rn
```

### Python Integration

```python
import subprocess
import json

result = subprocess.run([
    'cargo', 'run', '--example', 'batch_processing', '--features', 'rayon', '--',
    '--dir', './pdfs',
    '--json'
], capture_output=True, text=True)

data = json.loads(result.stdout)

print(f"Processed {data['total']} PDFs")
print(f"Success rate: {data['successful'] / data['total'] * 100:.1f}%")
print(f"Throughput: {data['throughput_docs_per_sec']:.1f} docs/sec")

# Extract failed files for manual review
failed_files = [r['filename'] for r in data['results'] if not r['success']]
print(f"Failed files: {failed_files}")
```

## Error Handling

The batch processor continues on errors and reports them at the end. Common failures:

| Error Type | Cause | Recovery |
|------------|-------|----------|
| `Failed to open PDF` | Corrupted file, wrong format | Skip file, continue |
| `Permission denied` | Locked/protected file | Skip file, continue |
| `Encryption not supported` | Encrypted PDF | Skip file, continue |
| `Text extraction failed` | Complex PDF structure | Skip file, continue |
| `Invalid xref table` | Malformed PDF | Skip file, continue |

**Design Philosophy:** A single corrupted PDF should never stop processing of hundreds of valid files.

## How It Works

1. **Discovery Phase**: Scans directory for `.pdf` files (case-insensitive)
2. **Parallel Processing**: Distributes files across worker threads using Rayon
3. **Error Isolation**: Each file is processed independently; failures don't affect others
4. **Progress Tracking**: Mutex-protected counter updates progress bar in real-time
5. **Result Aggregation**: Collects all results (success + failures) for final report

## Limitations

- Does not process subdirectories recursively (single directory only)
- Loads entire PDF into memory (not suitable for 1GB+ files)
- Text extraction only (no image/metadata extraction in this example)
- Progress bar may flicker in verbose mode due to interleaved output

## Extensions

Common modifications:

### Process Subdirectories

```rust
use walkdir::WalkDir;

fn find_pdf_files(dir: &Path) -> Result<Vec<PathBuf>> {
    let mut pdfs = Vec::new();
    for entry in WalkDir::new(dir).follow_links(true) {
        let entry = entry?;
        if entry.path().extension().map(|e| e == "pdf").unwrap_or(false) {
            pdfs.push(entry.path().to_path_buf());
        }
    }
    pdfs.sort();
    Ok(pdfs)
}
```

### Extract Images Too

```rust
fn process_pdf(path: &Path) -> ProcessingResult {
    // ... existing text extraction ...

    // Add image extraction
    match document.extract_images() {
        Ok(images) => {
            result.image_count = Some(images.len());
        },
        Err(e) => {
            result.warnings.push(format!("Image extraction failed: {}", e));
        }
    }

    result
}
```

### Memory-Efficient Streaming

For very large PDFs, process page-by-page instead of loading entire document:

```rust
for page_num in 0..page_count {
    match document.extract_text_from_page(page_num) {
        Ok(text) => write_to_output(&text)?,
        Err(e) => log_error(page_num, e),
    }
}
```

## Comparison to Sequential Processing

| Metric | Sequential | Parallel (16 cores) | Speedup |
|--------|-----------|---------------------|---------|
| 100 PDFs (500KB) | 121s | 12s | **10.1x** |
| 1000 PDFs (50KB) | 603s | 61s | **9.9x** |

**Why not 16x?** I/O overhead, mutex contention, and progress bar updates consume ~35% of parallelism gains.

## Troubleshooting

### "No PDF files found"
- Check directory path is correct
- Ensure files have `.pdf` extension (case-insensitive)
- Verify directory exists and has read permissions

### "Failed to open PDF: Permission denied"
- Check file permissions (`chmod +r file.pdf`)
- Ensure files aren't locked by another process
- Run with elevated permissions if needed (not recommended)

### Low throughput (<1 doc/sec)
- Files may be very large or complex
- Check disk I/O (SSD vs HDD makes 10x difference)
- Reduce workers (`--workers 4`) to avoid thread contention

### Progress bar not visible
- Use `--verbose` for per-file output instead
- Or redirect to file: `cargo run ... | tee output.log`

## Related Examples

- `text_extraction.rs` - Single-file text extraction
- `concurrent_pdf_generation.rs` - Parallel PDF creation
- `streaming_support.rs` - Memory-efficient processing

## License

This example is part of the oxidize-pdf project and is released under the same license.

## Credits

Built with:
- [rayon](https://github.com/rayon-rs/rayon) - Data parallelism
- [indicatif](https://github.com/console-rs/indicatif) - Progress bars
- [clap](https://github.com/clap-rs/clap) - CLI parsing
- [serde_json](https://github.com/serde-rs/json) - JSON serialization